muupan / async-rl Goto Github PK

View Code? Open in Web Editor NEW

401.0 401.0 83.0 9.55 MB

Replicating "Asynchronous Methods for Deep Reinforcement Learning" (http://arxiv.org/abs/1602.01783)

License: MIT License

Python 100.00%

async-rl's People

Contributors

Stargazers

Watchers

async-rl's Issues

Color transform is incorrect.

Hey,

so the color transform that you use is incorrect (for example in Seaquest it sometimes causes the fishes to disappear).

async-rl/ale.py

Lines 67 to 68 in 12dac59

 img = rgb_img[:, :, 0] * 0.2126 + rgb_img[:, :, 1] * \ 

 0.0722 + rgb_img[:, :, 2] * 0.7152

You can get the correct one from the repo Deepmind uses:
https://github.com/torch/image/blob/9f65c30167b2048ecbe8b7befdc6b2d6d12baee9/generic/image.c#L2105)

Let me know how much this improves your results if you decide to adopt this. For me the difference was massive.

t_max = 1000 , loss normalization

Hello,

I have stability issue when increasing t_max (i am trying to learn with torcs racing game where possibly t_max=5 might be too small )
In a3c.py, it seems that total_loss is not normalized by number of frames. Is this normal? Is it the reason why you need to call the GradientClipping optimization hook?

Running on GPU

Hi, this is awesome. So far, it performs the best I can find to match original score in the paper. Thanks for your sharing!

Just one question, I saw GPU setting in your code. Wondering if you have ever tested it on GPU? I'm curious if it'll be even faster than aws C4.8xLarge.

Thanks.

add Play game with visualization?

e.g. like what simple_dqn does here:

https://github.com/tambetm/simple_dqn

./play.sh snapshots/breakout_77.pkl

cannot evaluate on trained model

I trained the model for 3000000 iterations and saved the model as "3000000.h5". But when I try to evaluate using demo_a3c_ale.py, it turns out there is an error there saying "Value Error (inconsistent group is specified)" at line 61: serializers.load_hdf5(args.model, model).

Installation: ImportError: No module named 'ale_python_interface'

When I try to run the saved model as :

python demo_a3c_ale.py ../roms/breakout.bin trained_model/breakout_ff/80000000_finish.h5

I get an error :

ImportError: No module named 'ale_python_interface'

which is because python3.5 (which I am using now by default) does not have access to ALE.
So I installed ALE but at the ale_python_interface step, that is when we have to execute

pip install --user .

I instead execute

python3.5 -m pip install --user .

since we want python3.5 to get access to ALE. This results in a syntax error since the ALE's python code still has python2.7 syntax. How do we fix this issue?

Thanks!

How to adapt this code to a new environment?

Hi,

This looks great. How would you go about and adapt this to Open AI Gym for example?

Can you please provide a set of places where changes have to be made?

How Generic is the code to adapt to any environment?

the action (x4) semantics different?

Hi,

I just noticed:

https://github.com/muupan/async-rl/blob/master/ale.py#L115

each training action is taken 4x times to the game environment?

e.g. user pressed 'down' once, but in your simulated training the environment to take 'down' action 4 times!

I wonder why? and will the result differ from the original paper.

What to put in <path-to-rom>

In:
python demo_a3c_ale.py [--use-lstm]

I've been looking at what to put in for quite a while but I have just found it's related to ALE. I've tried putting either 'breakout' or the path to the ALE repo in my desktop but neither works; what should I put?

Thanks!

Not sample efficient enough

From Figure 6 in the paper, their A3C only needs 20 epochs (20 million steps) to achieve average scores of around 400 at Breakout. My current implementation needs more.

A3C LSTM

I should support A3C LSTM.

Potential errors in the loss funktion

Hey,
first off: great work. I just re-implemented the paper myself using tensorflow and your code provided great "side-information" for doing so :).
In the process I also realized that there may be two subtle bugs in your implementation (although I have never used chainer before so I might be misunderstanding things):

You use the log_probabilities directly when computing the loss for the actor
#103 a3c.py :
pi_loss -= log_prob * float(advantage.data)
I believe this is incorrect as you should multiply log_prob with a one-hot encoding of the action (since only one of the actions was selected)
To compute the advantage you use the form #97 a3c.py:
advantage = R - v
where v comes from self.past_values[i] which, in turn, is the output of the value network. As I wrote I am no expert regarding chainer but you need to make sure that no gradient flows through v here (as the value function should only be updated according to the v_loss in your code). In theano/tensorflow this would be handled with a disconnected_grad() or stop_gradient() operation respectively.

I will push my implementation to github sometime during this week as soon as I have more thoroughly tested it and can then reference it here for you to compare.

Why the value loss need to devide 2 in line 108 of a3c.py

v_loss += (v - R) ** 2 / 2

But the original paper just calculate the derivative of the (V-R)^2 right?

Crashes of Spawned Proceeses

Hi there -

I forked your code to work on Super Mario Bros :)

I'm using a Nintendo emulator that I modified to allow for programmatic control by the agent (similar to the Arcade Learning Environment).

I've been having problems with the spawned FCEUX processes silently crashing. I'm wondering if there might be a race condition as they update the shared model... Did you run into similar issues with the ALE?

Thanks for providing your code!

Non-performant A3C-LSTM model for Space Invaders

Hi,

I am running the A3C-lSTM model on the game space invaders. But as can be seen in the scores dump the model does not seem to learn anything. In comparison, the scores file in the already saved model
seems to indicate much faster learning. My question then is, what were the hyper-parameters used to arrive at the saved model which is in the repository?

ALE's color_averaging option is on by mistake

I need to turn it off to make the environment more equivalent to theirs.

I only have the machine with 4 cpus with each one has 8 cores. Can I train with 32 processes or only 8 processes?

How adaptable is this to a completely different environment?

Can, I modify the code to a completely different environment, rewards, states, actions etc? Which file should I start looking at first?

Sign of pi_loss?

You are computing entropy in policy_output.py like:

- probs * log_probs

with a minus sign. This is expected to be positive (non-negative to be precise).

You are then computing pi_loss in a3c.py with a loop and subtracting terms:

for ...:
    pi_loss -= log_prob * advantage # sign (rhs) = sign(-advantage).
    pi_loss -= self.beta * entropy # sign (rhs) = 1. 
    v_loss += (v - R) ** 2 / 2

And finally you take loss as a (weighted) sum of pi_loss and v_loss.

Are you sure about this? It seems to me like you should add up pi_loss with += on both the terms in the loop?

a3c_ale.py won't completely quit after finishing

Even after saving the final model 80000000_finish.h5, some processes continue running. This issue is maybe related to #5 .

About the ALE settings

I have some questions in mind about the specific setup of the environment. I'm not sure did you check with the authors on these choice.

repeat_action_probability: The ALE Manual strongly suggests using the default 0.25. Is 0.0 a reasonable choice? Will 0.0 make it easier to learn?
treat_life_lost_as_terminal: This option would definitely make things much easier. Did the original paper use a similar setup?

Btw you're not using the frame_skip parameter anywhere but a magic number 4. You might want to fix that.
Great work!

resume training from existing models

Hello,

due to my poor cpu, a function to resume training from existing models is needed.

Thank you

Continous control

Can this program run on GPU?

As the other implementation do?

https://github.com/miyosuda/async_deep_reinforce

If not, what is needed to be done to be able to run on GPU?

Thanks.

Gradient clipping and reward normalization parameters

Hi there, cool project! I'm trying to reproduce the A3C results with my own implementation and have two questions regarding the Dr. Mnih confirmed parameters on the Wiki page: (1) There was no loss clipping. The A3C paper does mention gradient clipping however which is very similar I believe. (2) In the original DQN paper they normalized rewards by sign(R(s)) rather than max(0, min(R(s), 1) as listed in the Wiki. Could you provide some clarification on these two points, please?

Trivial scaling question

The loss function v_loss is accumulated like

v_loss += (v - R) ** 2 / 2

but then it is scaled with v_loss *= self.v_loss_coef where v_loss_coef is 0.5 by default.

Is there a reason why we're scaling it twice, termwise and also the final sum?

where is the final model and score record saved?

As titled. I found the console just print out scores too fast, where can we find the score VS training iteration records?

Some evaluation results are missing

In scores.txt of the current uploaded trained model, evaluation results at 55000000 and 56000000 are missing.

async-rl/trained_model/breakout/scores.txt

Line 55 in 0ec501c

54000000 41383.44816946983 448.7 408.0 133.6006071177157

I don't know why and whether it can affect performance. I need to check.

	img = rgb_img[:, :, 0] * 0.2126 + rgb_img[:, :, 1] * \
	0.0722 + rgb_img[:, :, 2] * 0.7152

muupan / async-rl Goto Github PK

async-rl's People

Contributors

Stargazers

Watchers

Forkers

async-rl's Issues

Recommend Projects

Recommend Topics

Recommend Org