Comments (14)

lake4790k commented on July 20, 2024

Ready for the next method based on A3C...

Kaixhin commented on July 20, 2024

I currently have no plans to implement A3C in this repo because it is quite different, rather than a relatively simple addition to the original DQN. You're welcome to submit a PR if you manage to come up with an "-asynchronous true"-style option.

Edit: Actually, asynchronous one-step Q-learning could be in scope. And I agree that 'threads.sharedserialize' would be one part of the solution. But running multiple threads across (presumably physical) cores and coordinating them with a master thread may not be possible with the threads library?

[deleted] commented on July 20, 2024

@Kaixhin @lake4790k I was actually working off this repo as a starting point not long ago (maybe a month?) to attempt to do this. I evaluated threads and concluded it was not a very good option for one-step or n-step due to the way upvalues work within the threads (in particular, the singleton instance and step counters inside those algorithms cause a number of issues). If you're interested in collaborating, I can invite you to the library I was working on this in. I think that lua---parallel can handle A3C better, and I was using a structure similar to the original DeepMind code.

Edit: To add to @Kaixhin's edit, this is possible. You can pretty easily run multiple instances of the Atari code in parallel; it's just that serialization at a given step is neither straightforward nor painless between threads.

lake4790k commented on July 20, 2024

@Kaixhin I think these async methods can be done in pure torch using threads and sharedserialize. Some implementation remarks (don't read this if you want to give it a try first, it's an interesting exercise...):

T (global shared counter): this would be an atomic variable (e.g. C++ std::atomic), which is not available in Lua/Torch. In the paper it is used to update targetTheta from the threads, which happens relatively infrequently (e.g. every Nth frame). Instead one can update targetTheta from a non-learning thread at fixed time intervals (i.e. sleep for x seconds) in an unsynchronized way with the same effect; doing it on exactly every Nth step is not necessary.
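A minimal sketch of that unsynchronized refresh, assuming a dedicated non-learning thread; the function name and interval are illustrative, and sys.sleep comes from the 'sys' package:

require 'sys'

-- theta and targetTheta share their Storages with the learner threads
local function targetSyncLoop(theta, targetTheta, intervalSecs)
  while true do
    sys.sleep(intervalSecs)   -- fixed wall-clock interval rather than an exact step count
    targetTheta:copy(theta)   -- unsynchronized copy of the current shared weights
  end
end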

A network with the global shared theta is created first, then a thread pool is started and the learner threads in the pool each do a clone('weight','bias'), so dTheta (i.e. the accumulated gradients and all other internal state) is per thread, but they share the Storage of theta. To acquire a flattened dTheta in the threads one can do:

local _, gradParams = self.network:parameters()  -- gradient tensors of the per-thread clone
local dTheta = nn.Module.flatten(gradParams)     -- single flattened tensor sharing storage with all the gradParams

so dTheta is a single flattened storage tensor that can be added to theta
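
Putting the above together, a rough sketch of the setup; createNetwork, nThreads and the learner loop are illustrative placeholders, while the threads calls are from the torch threads package:

local threads = require 'threads'
threads.serialization('threads.sharedserialize')  -- pass Storage pointers to threads instead of copies

local network = createNetwork()                   -- holds the global shared theta
local pool = threads.Threads(nThreads, function() require 'nn' end)

pool:specific(true)
for t = 1, nThreads do
  pool:addjob(t, function()
    -- per-thread clone: shares the 'weight'/'bias' Storages, everything else is thread-local
    local localNet = network:clone('weight', 'bias')
    -- flatten gradParams into a per-thread dTheta as shown above, then run the learner loop
  end)
end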

In the one-step case the learner then does n forwards/backwards, accumulating gradients in dTheta, before doing a Hogwild theta += dTheta, i.e. don't worry about synchronization and trust the CPU caches to be synchronized anyway for most of the updates. Adding the gradient is safe in the sense that the worst that can happen is losing an update occasionally, but theta doesn't get corrupted. Asynchronous = unsynchronized. There's no master thread and no coordination; the pool can run forever, with no need to stop on synchronize().
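
A rough sketch of that per-thread loop; selectAction, computeTarget, env:step and the opt fields are hypothetical placeholders for the actual agent code:

-- theta's Storage is shared across threads; dTheta is the flattened per-thread gradParams
local function learnerLoop(env, network, targetNetwork, theta, dTheta, opt)
  local state = env:start()
  while true do
    dTheta:zero()
    for i = 1, opt.updateFreq do            -- accumulate n gradients locally
      local action = selectAction(network, state, opt.epsilon)
      local reward, nextState, terminal = env:step(action)
      local y = computeTarget(targetNetwork, reward, nextState, terminal, opt.gamma)
      local q = network:forward(state)
      local gradQ = q.new(q:size()):zero()
      gradQ[action] = q[action] - y         -- dLoss/dQ for the action taken
      network:backward(state, gradQ)        -- accumulates into the tensors dTheta views
      state = terminal and env:start() or nextState
    end
    theta:add(-opt.eta, dTheta)             -- Hogwild update: no locks, no synchronize()
  end
end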

The shared RMSProp they describe with the shared g, g2 is trickier, as the asynchronous neg() and sqrt() will corrupt the shared tensors with NaNs; I think a thread-local g, g2 copy is needed there as well for the interim calculations.
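
One way those interim calculations can be kept off the shared tensors, as a hedged sketch; the constants and the tmp buffer are illustrative, not the repo's actual optimiser code:

-- g, g2 are the shared running moments; tmp is a thread-local scratch buffer
local function sharedRmsPropStep(theta, dTheta, g, g2, tmp, lr)
  local alpha, epsilon = 0.95, 0.01

  -- Hogwild-style updates of the shared running averages
  g:mul(alpha):add(1 - alpha, dTheta)
  g2:mul(alpha):addcmul(1 - alpha, dTheta, dTheta)

  -- interim calculations only touch the thread-local buffer, never the shared tensors
  tmp:copy(g):cmul(g):mul(-1):add(g2):add(epsilon):sqrt()   -- sqrt(g2 - g^2 + epsilon)

  theta:addcdiv(-lr, dTheta, tmp)   -- theta += -lr * dTheta / sqrt(g2 - g^2 + epsilon)
  dTheta:zero()                     -- clear the per-thread gradient accumulator
end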

Kaixhin commented on July 20, 2024

@michaelghaben @lake4790k Thanks for the comments - one-step (and possibly n-step) Q-learning can hopefully be integrated with the other features in this repo, e.g. the dueling architecture. If that can be achieved then others like advantage actor-critic can be considered.

I'm not going to be able to take the lead on this in the near future, but I'm happy to lend a hand to either a fork or a new repo if I can. For now I've just added an async branch to make it a little cleaner than working directly on master.

lake4790k commented on July 20, 2024

@Kaixhin I did a reference implementation as per the above in a simple codebase without all the other methods. I hope I will have time to merge it with your codebase; it would be interesting to see the performance compared to all the other methods. Will do a PR once ready.

Kaixhin commented on July 20, 2024

Someone is trying to replicate this and after skimming through miyosuda/async_deep_reinforce#1 it seems like they got hold of hyperparameters not noted in the paper. Worth keeping an eye on.

lake4790k commented on July 20, 2024

@Kaixhin yes, interesting about the hyperparameters. But I'm not sure that implementing this in Python is a good idea: afaik Python does not support truly parallel multithreaded execution of Python script code at all (i.e. all of the RL logic...), only of code that runs outside the GIL (i.e. TensorFlow operations, Cython parts). They also mention slow performance compared to the original; I would not be surprised if that is because of this Python issue (but even single-threaded Python performance could be poor compared to Lua/native for this use case).

I'll add my implementation of async to Atari in the coming days, I'm curious how it will work...

lake4790k commented on July 20, 2024

@Kaixhin btw I started with Catch, comparing CPU and GPU behaviour, and noticed that the CPU did not converge as the GPU did, which should not be the case as all the code is the same for both. Except for the random initialization: by default the code sets a manual seed of 1, but in the GPU case it then does a cutorch.manualSeed(torch.random()) before constructing the net. So I think the CPU always gets a poor initialization from seed 1, while the GPU net gets some other random weights that work better. If I set the CPU seed to random it then converges similarly to the GPU (and also faster, as you note, because of the small net). I'm not sure if this random initialization behaviour is intended; it got me quite confused at first...
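
For reference, a minimal sketch of the seeding behaviour described above (option names are illustrative): the CPU RNG keeps the fixed seed while the GPU RNG is re-seeded from it, so the two back-ends construct their nets from different random weights:

torch.manualSeed(opt.seed)              -- defaults to 1
if opt.gpu > 0 then
  require 'cutorch'
  cutorch.manualSeed(torch.random())    -- GPU RNG re-seeded from the CPU generator
end
-- the network is constructed after this point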

Kaixhin commented on July 20, 2024

@lake4790k Interesting - I went with the same initialisation as in DeepMind's code. I don't think a seed of 1 is worse than any other seed for a random number generator - it seems like you just got unlucky. It might be more obvious with Catch, but as far as I know weight initialisation hasn't been looked into for deep reinforcement learning.

lake4790k commented on July 20, 2024

Some results from running the async 1-step Q. I used Pong to compare the learning speed with the result on page 20 of the async paper.

I ran this experiment on 10 hyperthreads (5 physical cores). I would expect the equivalent DeepMind performance to be somewhat below the midpoint between the 4- and 8-thread curves, as the speed is limited by the 5 physical cores, but having more threads with diverse experiences helps a bit.

The time scale of this figure is a little less than 14 hours. It achieved a score of 0 in about 11 hours, which is exactly where the interpolated DeepMind curve would be. I used a learning rate of 0.0007.

[Figure: scores_10b]

I ran this experiment on 8 hyperthreads (4 physical cores). The equivalent DeepMind curve should be a bit above the 4-thread curve on page 20.

The time scale of this figure is a little less than 24 hours. At 14 hours it achieved a score of -3, which is exactly where the 4-thread DeepMind curve is. I used a learning rate of 0.0016.

[Figure: scores_8]

In these experiments I did not use learning rate decay as in the paper. The paper says they used the same experimental setup as the double Q paper, but it also says they used gradient norm clipping (which I didn't turn on either), which was introduced in the dueling paper.

I also had an experiment with the more aggressive 0.0016 learning rate that got stuck at the beginning, not improving for a long time. My guess is that gradient clipping would have helped it get out of there (as would the learning rate decay, eventually).

Given that the curves in the paper are the average of the 3 best agents out of 50 experiments, that they most likely used an optimized C++ implementation (with TensorFlow) while ours is pure Torch, and that I only ran a few experiments, these results look pretty good.

I still plan to implement n-step Q (in combination with double/PAL/dueling) and A3C, and to unify as much as possible with the experience replay codebase, as we discussed.

Kaixhin commented on July 20, 2024

@lake4790k looks great - thanks for comparing with DeepMind results. Epochs are more meaningful than training time due to differences in hardware, but your estimates sound about right. Keep at it!

Kaixhin commented on July 20, 2024

Closed by #30.

hym1120 commented on July 20, 2024

In A3CAgent:accumulateGradients, what is the reason we have 0.5 in vTarget instead of 2?
I was thinking d(R-V)^2/dTheta is equal to -2 * (R-V) * dV/dTheta.
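
For reference, a short derivation, assuming the value loss is written with the conventional 1/2 factor:

\frac{\partial}{\partial\theta}\,\tfrac{1}{2}\bigl(R - V(s;\theta)\bigr)^{2}
  = -\bigl(R - V(s;\theta)\bigr)\,\frac{\partial V(s;\theta)}{\partial\theta}

With that convention the 2 from the square cancels the 1/2; a remaining 0.5 usually reflects an additional down-weighting of the value loss relative to the policy loss (a common A3C convention), with constant factors otherwise absorbed into the learning rate.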
