
async-rl's People

Contributors

muupan


async-rl's Issues

Color transform is incorrect.

Hey,

The color transform you use is incorrect; in Seaquest, for example, it sometimes causes the fish to disappear.

async-rl/ale.py

Lines 67 to 68 in 12dac59

img = rgb_img[:, :, 0] * 0.2126 + rgb_img[:, :, 1] * \
    0.0722 + rgb_img[:, :, 2] * 0.7152

You can get the correct one from the repo DeepMind uses:
https://github.com/torch/image/blob/9f65c30167b2048ecbe8b7befdc6b2d6d12baee9/generic/image.c#L2105

Let me know how much this improves your results if you decide to adopt this. For me the difference was massive.
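For reference, a minimal sketch of the conversion with the standard BT.709 luma weights (0.2126 R + 0.7152 G + 0.0722 B); this is just illustrative and assumes rgb_img is an H x W x 3 array in RGB order, as in the snippet above:

import numpy as np

def rgb_to_luminance(rgb_img):
    # BT.709 luma weights applied to R, G, B in that order; the snippet above
    # swaps the G and B coefficients.
    return (rgb_img[:, :, 0] * 0.2126 +
            rgb_img[:, :, 1] * 0.7152 +
            rgb_img[:, :, 2] * 0.0722)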

t_max = 1000, loss normalization

Hello,

I have stability issues when increasing t_max (I am trying to learn the TORCS racing game, where t_max=5 might be too small).
In a3c.py, it seems that total_loss is not normalized by the number of frames. Is this intentional? Is it the reason you need the GradientClipping optimizer hook? A sketch of the normalization I have in mind is below.
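Purely hypothetical (the helper name and the default coefficient are made up for illustration, not the repo's actual code):

def normalized_total_loss(pi_loss, v_loss, n_steps, v_loss_coef=0.5):
    # Average the accumulated actor and critic losses over the rollout length
    # so the gradient scale does not grow with t_max.
    return (pi_loss + v_loss_coef * v_loss) / float(n_steps)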

Running on GPU

Hi, this is awesome. Of all the implementations I can find, it comes closest to matching the original scores in the paper. Thanks for sharing!

Just one question: I saw a GPU setting in your code. Have you ever tested it on a GPU? I'm curious whether it would be even faster than an AWS c4.8xlarge.

Thanks.

Cannot evaluate on trained model

I trained the model for 3,000,000 iterations and saved it as "3000000.h5". But when I try to evaluate it using demo_a3c_ale.py, I get a ValueError ("inconsistent group is specified") at line 61: serializers.load_hdf5(args.model, model).

Installation: ImportError: No module named 'ale_python_interface'

When I try to run the saved model as:

python demo_a3c_ale.py ../roms/breakout.bin trained_model/breakout_ff/80000000_finish.h5

I get an error:

ImportError: No module named 'ale_python_interface'

which is because Python 3.5 (which I am now using by default) does not have access to ALE.
So I installed ALE, but at the ale_python_interface step, where we have to execute

pip install --user .

I instead execute

python3.5 -m pip install --user .

since we want Python 3.5 to have access to ALE. This results in a syntax error, since ALE's Python code still uses Python 2.7 syntax. How do we fix this issue?
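As a side note, here is a tiny check (just a sketch) of which interpreter is running and whether it can see the ALE bindings:

import sys

# Show which interpreter is running and whether it can import the ALE bindings.
print(sys.executable, sys.version_info[:2])
try:
    import ale_python_interface  # noqa: F401
    print('ale_python_interface is importable for this interpreter')
except ImportError as exc:
    print('ale_python_interface is NOT importable:', exc)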

Thanks!

How to adapt this code to a new environment?

Hi,

This looks great. How would you go about adapting this to OpenAI Gym, for example?

Could you list the places where changes would have to be made?

How generic is the code, i.e., how easily can it be adapted to an arbitrary environment? A rough sketch of what I imagine is below.
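Purely as a strawman, this is the kind of adapter I have in mind. It assumes (I have not checked every call site) that the agent only needs an environment object exposing state, reward and is_terminal attributes plus receive_action() and initialize(), as ale.py does, and it uses the classic Gym API:

import gym


class GymEnv(object):
    """Hypothetical adapter exposing the same interface ale.py provides."""

    def __init__(self, env_name):
        self.env = gym.make(env_name)
        self.initialize()

    def initialize(self):
        # classic Gym API: reset() returns only the observation
        self.state = self.env.reset()
        self.reward = 0.0
        self.is_terminal = False

    def receive_action(self, action):
        # classic Gym API: step() returns (obs, reward, done, info)
        self.state, self.reward, self.is_terminal, _ = self.env.step(action)
        return self.reward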

What to put in <path-to-rom>

In:
python demo_a3c_ale.py <path-to-rom> [--use-lstm]

I've been looking at what to put in <path-to-rom> for quite a while, and I have only just found that it's related to ALE. I've tried putting in either 'breakout' or the path to the ALE repo on my desktop, but neither works; what should I put?

Thanks!

Not sample efficient enough

According to Figure 6 in the paper, their A3C needs only 20 epochs (20 million steps) to reach average scores of around 400 on Breakout. My current implementation needs more.
[screenshot attached, 2016-05-08]

Potential errors in the loss function

Hey,
first off: great work. I just re-implemented the paper myself using TensorFlow, and your code provided great "side information" for doing so :).
In the process I also realized that there may be two subtle bugs in your implementation (although I have never used Chainer before, so I might be misunderstanding things):

  1. You use the log probabilities directly when computing the loss for the actor (a3c.py, line 103):
    pi_loss -= log_prob * float(advantage.data)
    I believe this is incorrect: you should multiply log_prob by a one-hot encoding of the action, since only one of the actions was selected.
  2. To compute the advantage you use the form (a3c.py, line 97):
    advantage = R - v
    where v comes from self.past_values[i], which in turn is the output of the value network. As I wrote, I am no expert regarding Chainer, but you need to make sure that no gradient flows through v here (the value function should only be updated via the v_loss in your code). In Theano/TensorFlow this would be handled with a disconnected_grad() or stop_gradient() operation, respectively; see the sketch after this list.
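A minimal Chainer sketch of what I mean for point 2 (toy values; as far as I understand, wrapping v.data in a fresh Variable behaves like stop_gradient):

import numpy as np
import chainer

# Toy value prediction standing in for self.past_values[i].
v = chainer.Variable(np.array([[0.3]], dtype=np.float32))
R = 1.0

# A fresh Variable built from v.data carries no graph history, so no gradient
# can flow back into the value network through the advantage.
advantage = R - chainer.Variable(v.data)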

I will push my implementation to GitHub sometime this week, once I have tested it more thoroughly, and will then reference it here so you can compare.

Crashes of Spawned Processes

Hi there -

I forked your code to work on Super Mario Bros :)

I'm using a Nintendo emulator that I modified to allow for programmatic control by the agent (similar to the Arcade Learning Environment).

I've been having problems with the spawned FCEUX processes silently crashing. I'm wondering if there might be a race condition as they update the shared model... Did you run into similar issues with the ALE?

Thanks for providing your code!

Sign of pi_loss?

You are computing entropy in policy_output.py like:

- probs * log_probs

with a minus sign. This is expected to be positive (non-negative to be precise).
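As a tiny numeric check of that (illustrative only):

import numpy as np

# -sum(p * log p) is non-negative for any probability vector.
probs = np.array([0.7, 0.2, 0.1])
entropy = -(probs * np.log(probs)).sum()
print(entropy)  # about 0.80, positive as expected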

You are then computing pi_loss in a3c.py with a loop and subtracting terms:

for ...:
    pi_loss -= log_prob * advantage # sign (rhs) = sign(-advantage).
    pi_loss -= self.beta * entropy # sign (rhs) = 1. 
    v_loss += (v - R) ** 2 / 2

And finally you take loss as a (weighted) sum of pi_loss and v_loss.

Are you sure about this? It seems to me that both terms should be added to pi_loss with += in the loop.

About the ALE settings

I have some questions about the specific setup of the environment. I'm not sure whether you checked these choices with the authors.

  • repeat_action_probability: The ALE Manual strongly suggests using the default 0.25. Is 0.0 a reasonable choice? Will 0.0 make it easier to learn?
  • treat_life_lost_as_terminal: This option would definitely make things much easier. Did the original paper use a similar setup?

By the way, you are not using the frame_skip parameter anywhere; a magic number 4 is used instead. You might want to fix that.
Great work!
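For context, a rough sketch of where the repeat_action_probability option lives in the ALE Python bindings (the key name is from the ALE manual; the ROM path is just an example, and byte-string arguments may or may not be required depending on the bindings version):

from ale_python_interface import ALEInterface

ale = ALEInterface()
# The ALE manual's recommended default is 0.25; this repo runs with 0.0.
ale.setFloat(b'repeat_action_probability', 0.0)
ale.loadROM(b'breakout.bin')  # example ROM path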

Gradient clipping and reward normalization parameters

Hi there, cool project! I'm trying to reproduce the A3C results with my own implementation and have two questions about the parameters Dr. Mnih confirmed on the wiki page:

  1. It says there was no loss clipping. The A3C paper does mention gradient clipping, however, which I believe is very similar.
  2. In the original DQN paper they normalized rewards with sign(R(s)) rather than max(0, min(R(s), 1)) as listed in the wiki.

Could you provide some clarification on these two points, please? The two forms I'm comparing are sketched below.
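Illustrative only, the two reward normalizations I am comparing:

import numpy as np

def clip_reward_sign(r):
    # Original DQN-paper style: keep only the sign of the raw reward.
    return float(np.sign(r))

def clip_reward_wiki(r):
    # The form listed on the wiki: clamp the raw reward into [0, 1].
    return max(0.0, min(float(r), 1.0))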

Trivial scaling question

The loss function v_loss is accumulated like

v_loss += (v - R) ** 2 / 2

but then it is scaled with v_loss *= self.v_loss_coef where v_loss_coef is 0.5 by default.

Is there a reason why we scale twice, once per term and again on the final sum? Spelled out, the two factors seem to combine into an effective coefficient of 0.25 (toy check below).
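A toy check of the combined scaling (illustrative values only):

v_loss_coef = 0.5
squared_errors = [(0.3 - 1.0) ** 2, (0.8 - 0.5) ** 2]  # toy (v - R)**2 terms

# Accumulate (v - R)**2 / 2 per term, then scale the sum by v_loss_coef;
# this equals 0.25 times the sum of squared errors.
v_loss = v_loss_coef * sum(e / 2.0 for e in squared_errors)
assert abs(v_loss - 0.25 * sum(squared_errors)) < 1e-12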
