muupan / async-rl
Replicating "Asynchronous Methods for Deep Reinforcement Learning" (http://arxiv.org/abs/1602.01783)
License: MIT License
Hey,
so the color transform that you use is incorrect (for example, in Seaquest it sometimes causes the fish to disappear). See lines 67 to 68 in commit 12dac59.
You can get the correct one from the repo DeepMind uses:
https://github.com/torch/image/blob/9f65c30167b2048ecbe8b7befdc6b2d6d12baee9/generic/image.c#L2105
Let me know how much this improves your results if you decide to adopt this. For me the difference was massive.
Hello,
I have stability issues when increasing t_max (I am trying to learn the TORCS racing game, where t_max=5 is possibly too small).
In a3c.py, it seems that total_loss is not normalized by the number of frames. Is this intentional? Is it the reason why you need to call the GradientClipping optimization hook?
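For illustration, here is a minimal sketch of the normalization I have in mind (the variable names follow a3c.py, but self.past_rewards is an assumption on my part):

total_loss = pi_loss + self.v_loss_coef * v_loss
# Dividing by the rollout length keeps gradient magnitudes comparable
# across different values of t_max:
total_loss /= len(self.past_rewards)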
Hi, this is awesome. So far, it is the best implementation I can find for matching the original scores in the paper. Thanks for sharing!
Just one question: I saw a GPU setting in your code. Have you ever tested it on a GPU? I'm curious whether it would be even faster than an AWS c4.8xlarge.
Thanks.
E.g., like what simple_dqn does here:
https://github.com/tambetm/simple_dqn
./play.sh snapshots/breakout_77.pkl
I trained the model for 3000000 iterations and saved it as "3000000.h5". But when I try to evaluate it using demo_a3c_ale.py, I get an error saying "ValueError: inconsistent group is specified" at line 61: serializers.load_hdf5(args.model, model).
When I try to run the saved model as:
python demo_a3c_ale.py ../roms/breakout.bin trained_model/breakout_ff/80000000_finish.h5
I get an error :
ImportError: No module named 'ale_python_interface'
which is because python3.5 (which I am now using by default) does not have access to ALE.
So I installed ALE, but at the ale_python_interface step, where we have to execute
pip install --user .
I instead execute
python3.5 -m pip install --user .
since we want python3.5 to have access to ALE. This results in a syntax error, since ALE's Python code still uses Python 2.7 syntax. How do we fix this issue?
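One workaround I am considering (untested, and the directory name here is an assumption) is converting the interface code with the standard 2to3 tool before installing:

2to3 -w ale_python_interface/   # rewrite the Python 2 syntax in place
python3.5 -m pip install --user .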
Thanks!
Hi,
This looks great. How would you go about adapting this to OpenAI Gym, for example?
Can you please point out the places where changes would have to be made?
How generic is the code, for adapting to an arbitrary environment? (See the sketch below.)
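For reference, a minimal interaction loop under the classic Gym API (environment name illustrative); presumably the adaptation would mean wrapping something like this behind the same interface that ale.py currently exposes:

import gym

env = gym.make('Breakout-v0')  # illustrative environment name
obs = env.reset()
done = False
while not done:
    # random actions, just to show the API shape
    obs, reward, done, info = env.step(env.action_space.sample())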
Hi,
I just noticed:
https://github.com/muupan/async-rl/blob/master/ale.py#L115
each training action is applied to the game environment 4 times?
E.g., the user presses 'down' once, but in your simulated training the environment takes the 'down' action 4 times!
I wonder why, and whether the result will differ from the original paper.
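For context, this appears to be the standard frame-skip (action repeat) technique from the DQN/A3C papers; a minimal sketch (the ale object follows the Arcade Learning Environment API, where act returns the reward):

frame_skip = 4  # each selected action is repeated for 4 emulator frames
reward = 0
for _ in range(frame_skip):
    reward += ale.act(action)  # rewards from the skipped frames accumulate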
In:
python demo_a3c_ale.py [--use-lstm]
I've been looking for quite a while at what to put there, and I've just found that it's related to ALE. I've tried putting in either 'breakout' or the path to the ALE repo on my desktop, but neither works; what should I put?
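For reference, another issue here invokes the demo with a ROM file path followed by a saved model path:

python demo_a3c_ale.py ../roms/breakout.bin trained_model/breakout_ff/80000000_finish.h5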
Thanks!
I should support A3C LSTM.
Hey,
first off: great work. I just re-implemented the paper myself using TensorFlow, and your code provided great "side information" for doing so :).
In the process I also realized that there may be two subtle bugs in your implementation (although I have never used Chainer before, so I might be misunderstanding things):
pi_loss -= log_prob * float(advantage.data)
advantage = R - v

where v comes from self.past_values[i], which in turn is the output of the value network. As I wrote, I am no expert regarding Chainer, but you need to make sure that no gradient flows through v here (the value function should only be updated according to the v_loss in your code). In Theano/TensorFlow this would be handled with a disconnected_grad() or stop_gradient() operation, respectively. I will push my implementation to GitHub sometime this week, as soon as I have tested it more thoroughly, and can then reference it here for you to compare.
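A minimal sketch of the TensorFlow analogue I have in mind (variable names follow the discussion above and are otherwise assumptions):

import tensorflow as tf

# Treat the advantage as a constant for the policy-gradient term, so that
# no gradient flows back into the value network through v:
advantage = tf.stop_gradient(R - v)
pi_loss = -tf.reduce_sum(log_prob * advantage)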
v_loss += (v - R) ** 2 / 2
But the original paper just calculates the derivative of (V - R)^2, right?
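For what it's worth, the factor of 1/2 only rescales the gradient by a constant: d/dv [(v - R)^2 / 2] = v - R, whereas d/dv [(v - R)^2] = 2(v - R), so the two choices differ by a factor of 2 that can be absorbed into the learning rate.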
Hi there -
I forked your code to work on Super Mario Bros :)
I'm using a Nintendo emulator that I modified to allow for programmatic control by the agent (similar to the Arcade Learning Environment).
I've been having problems with the spawned FCEUX processes silently crashing. I'm wondering if there might be a race condition as they update the shared model... Did you run into similar issues with the ALE?
Thanks for providing your code!
Hi,
I am running the A3C-LSTM model on the game Space Invaders, but as can be seen in the scores dump, the model does not seem to learn anything. In comparison, the scores file of the already saved model seems to indicate much faster learning. My question, then, is: what hyper-parameters were used to arrive at the saved model that is in the repository?
I need to turn it off to make the environment more equivalent to theirs.
Can I modify the code for a completely different environment (rewards, states, actions, etc.)? Which file should I start looking at first?
You are computing entropy in policy_output.py like
- probs * log_probs
with a minus sign. This is expected to be positive (non-negative, to be precise).
You are then computing pi_loss in a3c.py with a loop, subtracting terms:
for ...:
    pi_loss -= log_prob * advantage  # sign(rhs) = sign(-advantage)
    pi_loss -= self.beta * entropy   # sign(rhs) = 1
    v_loss += (v - R) ** 2 / 2
And finally you take loss as a (weighted) sum of pi_loss and v_loss.
Are you sure about this? It seems to me that you should accumulate into pi_loss with += for both terms in the loop?
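For reference, my understanding of the objective the paper maximizes, and the loss corresponding to minimizing its negation:

maximize  E[log pi(a|s) * advantage] + beta * entropy
minimize  -E[log pi(a|s) * advantage] - beta * entropy

so if loss is what gets minimized, building pi_loss with -= on both terms would match the paper; I mainly want to confirm that this is the intent.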
Even after saving the final model 80000000_finish.h5, some processes continue running. This issue may be related to #5.
I have some questions about the specific setup of the environment. I'm not sure whether you checked these choices with the authors.
Btw, you're not using the frame_skip parameter anywhere, just a magic number 4. You might want to fix that.
Great work!
Hello,
due to my slow CPU, a function to resume training from an existing model is needed.
Thank you
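A minimal sketch of what I mean, reusing the Chainer serializer that demo_a3c_ale.py already uses (the paths and the optimizer filename are illustrative assumptions):

from chainer import serializers

# Restore previously saved weights before continuing training:
serializers.load_hdf5('trained_model/3000000.h5', model)
# The optimizer state would need to be saved and restored the same way
# for a true resume (filename assumed):
serializers.load_hdf5('trained_model/3000000_opt.h5', optimizer)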
Does it run on a GPU, as the other implementation does?
https://github.com/miyosuda/async_deep_reinforce
If not, what needs to be done to be able to run it on a GPU?
Thanks.
Hi there, cool project! I'm trying to reproduce the A3C results with my own implementation and have two questions regarding the parameters that Dr. Mnih confirmed on the Wiki page: (1) There was no loss clipping. However, the A3C paper does mention gradient clipping, which I believe is very similar. (2) In the original DQN paper they normalized rewards by sign(R(s)) rather than max(0, min(R(s), 1)) as listed in the Wiki. Could you provide some clarification on these two points, please?
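To make the difference concrete (a sketch, where r is the raw reward):

import numpy as np

sign_clip = np.sign(r)         # maps r to -1, 0, or 1; negative rewards become -1
wiki_clip = max(0, min(r, 1))  # clips to [0, 1]; negative rewards become 0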
The loss function v_loss is accumulated like
v_loss += (v - R) ** 2 / 2
but then it is scaled with v_loss *= self.v_loss_coef, where v_loss_coef is 0.5 by default.
Is there a reason why we're scaling it twice, termwise and also on the final sum?
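For concreteness: with the default coefficient, each term contributes 0.5 * ((v - R)^2 / 2) = (v - R)^2 / 4, i.e., an effective coefficient of 1/4 on the squared error.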
As titled. The console just prints out scores too fast; where can we find the score vs. training iteration records?
In scores.txt of the currently uploaded trained model, the evaluation results at 55000000 and 56000000 are missing.
I don't know why, nor whether it affects performance. I need to check.