ikostrikov / pytorch-a2c-ppo-acktr-gail

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).

License: MIT License

Python 72.14% Jupyter Notebook 27.86%

pytorch reinforcement-learning deep-learning deep-reinforcement-learning actor-critic advantage-actor-critic a2c ppo proximal-policy-optimization acktr

pytorch-a2c-ppo-acktr-gail's Issues

How to use pixel training instead of low-dimensional state?

A2C performance on Seaquest

Hi,

I am having trouble reproducing the performance you got on Seaquest with A2C. It gets stuck at a low score (~300 points) after 10M frames. I am using the default learning rate and settings.
Any tips to improve performance?

Thanks

Invalid Syntax in kfac.py ?

I've tried to run this code, but it encountered SyntaxError: invalid syntax in line 44 of kfac.py, namely return a.t()@(a / batch_size). Both python2.7 and python3.4 have this problem. Are there any solutions?
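For context: the `@` matrix-multiplication operator was added in Python 3.5 (PEP 465), so both 2.7 and 3.4 reject it at parse time. The sketch below shows how the operator maps onto an ordinary method; the `Matrix` class is hypothetical and only stands in for a tensor type. With PyTorch tensors, `a.t().mm(a / batch_size)` should be an equivalent spelling that also parses on older interpreters.

```python
class Matrix:
    """Hypothetical stand-in for a 2-D tensor, stored as a list of rows."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def t(self):
        # Transpose: rows become columns.
        return Matrix(zip(*self.rows))

    def mm(self, other):
        # Plain matrix product, spelled as a method call.
        cols = list(zip(*other.rows))
        return Matrix([[sum(x * y for x, y in zip(row, col)) for col in cols]
                       for row in self.rows])

    # On Python >= 3.5, `a @ b` is translated into a.__matmul__(b).
    __matmul__ = mm


a = Matrix([[1.0, 2.0], [3.0, 4.0]])
via_method = a.t().mm(a)    # parses on any Python version
via_operator = a.t() @ a    # needs Python 3.5 or newer
```

Upgrading the interpreter to 3.5 or later is the other way out, and it avoids patching the source.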

log showing making new env twice

Probably an upstream issue, but it puzzles me that the log makes it look as if make_env is called twice for each process, when it is really called only once.

[carlo@x1 pytorch-a2c-ppo-acktr]$ python main.py --env-name "BreakoutNoFrameskip-v0" --num-processes 2 --num-frames 100
#######
WARNING: All rewards are clipped or normalized so you need to use a monitor (see envs.py) or visdom plot to get true rewards
#######
[2017-12-05 17:23:30,700] Making new env: BreakoutNoFrameskip-v0
[2017-12-05 17:23:30,702] Making new env: BreakoutNoFrameskip-v0
[2017-12-05 17:23:30,947] Making new env: BreakoutNoFrameskip-v0
[2017-12-05 17:23:30,948] Making new env: BreakoutNoFrameskip-v0
Updates 0, num timesteps 10, FPS 41, mean/median reward 0.0/0.0, min/max reward 0.0/0.0, entropy 1.38142, value loss 0.00155, policy loss 0.02737

The reward becomes negative quickly and the model doesn't train well

python3 main.py --env-name "PongNoFrameskip-v4"
#######
WARNING: All rewards are clipped or normalized so you need to use a monitor (see envs.py) or visdom plot to get true rewards
#######
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.uint8'>. Please provide explicit dtype.
[the line above appears 16 times in the original log]
WARN: <class 'envs.WrapPyTorch'> doesn't implement 'observation' method. Maybe it implements deprecated '_observation' method.
[the line above appears 16 times in the original log]
Updates 0, num timesteps 80, FPS 109, mean/median reward 0.0/0.0, min/max reward 0.0/0.0, entropy 1.77930, value loss 0.02714, policy loss -0.21655
Updates 10, num timesteps 880, FPS 821, mean/median reward 0.0/0.0, min/max reward 0.0/0.0, entropy 1.79153, value loss 0.12026, policy loss -0.22026
Updates 20, num timesteps 1680, FPS 1243, mean/median reward 0.0/0.0, min/max reward 0.0/0.0, entropy 1.50924, value loss 0.00107, policy loss 0.04321
Updates 30, num timesteps 2480, FPS 1526, mean/median reward 0.0/0.0, min/max reward 0.0/0.0, entropy 1.56517, value loss 0.00118, policy loss 0.04794
Updates 40, num timesteps 3280, FPS 1715, mean/median reward 0.0/0.0, min/max reward 0.0/0.0, entropy 1.36696, value loss 0.02682, policy loss 0.09570
Updates 50, num timesteps 4080, FPS 1857, mean/median reward 0.0/0.0, min/max reward 0.0/0.0, entropy 1.70921, value loss 0.12343, policy loss -0.20016
Updates 60, num timesteps 4880, FPS 1972, mean/median reward 0.0/0.0, min/max reward 0.0/0.0, entropy 1.33832, value loss 0.11343, policy loss -0.06592
Updates 70, num timesteps 5680, FPS 2060, mean/median reward 0.0/0.0, min/max reward 0.0/0.0, entropy 1.60979, value loss 0.06754, policy loss -0.03790
Updates 80, num timesteps 6480, FPS 2133, mean/median reward 0.0/0.0, min/max reward 0.0/0.0, entropy 1.44397, value loss 0.12002, policy loss -0.07597
Updates 90, num timesteps 7280, FPS 2189, mean/median reward 0.0/0.0, min/max reward 0.0/0.0, entropy 0.56885, value loss 0.01250, policy loss 0.05751
Updates 100, num timesteps 8080, FPS 2228, mean/median reward 0.0/0.0, min/max reward 0.0/0.0, entropy 0.90573, value loss 0.01997, policy loss 0.08723
Updates 110, num timesteps 8880, FPS 2266, mean/median reward 0.0/0.0, min/max reward 0.0/0.0, entropy 0.79921, value loss 0.04416, policy loss 0.07492
Updates 120, num timesteps 9680, FPS 2296, mean/median reward 0.0/0.0, min/max reward 0.0/0.0, entropy 1.61159, value loss 0.00413, policy loss 0.08989
Updates 130, num timesteps 10480, FPS 2333, mean/median reward 0.0/0.0, min/max reward 0.0/0.0, entropy 1.73758, value loss 0.08978, policy loss -0.10888
Updates 140, num timesteps 11280, FPS 2362, mean/median reward 0.0/0.0, min/max reward 0.0/0.0, entropy 1.64044, value loss 0.03661, policy loss 0.09212
Updates 150, num timesteps 12080, FPS 2382, mean/median reward -2.6/0.0, min/max reward -21.0/0.0, entropy 1.76401, value loss 0.00949, policy loss 0.10924
Updates 160, num timesteps 12880, FPS 2383, mean/median reward -9.2/0.0, min/max reward -21.0/0.0, entropy 1.77274, value loss 0.16604, policy loss -0.09154
Updates 170, num timesteps 13680, FPS 2398, mean/median reward -11.8/-21.0, min/max reward -21.0/0.0, entropy 1.52948, value loss 0.05661, policy loss 0.03393
Updates 180, num timesteps 14480, FPS 2410, mean/median reward -14.3/-21.0, min/max reward -21.0/0.0, entropy 1.76128, value loss 0.13569, policy loss -0.21295
Updates 190, num timesteps 15280, FPS 2412, mean/median reward -18.1/-21.0, min/max reward -21.0/0.0, entropy 1.67226, value loss 0.18734, policy loss -0.28125
Updates 200, num timesteps 16080, FPS 2423, mean/median reward -19.3/-21.0, min/max reward -21.0/0.0, entropy 1.65022, value loss 0.11167, policy loss -0.14717
Updates 210, num timesteps 16880, FPS 2437, mean/median reward -19.3/-21.0, min/max reward -21.0/0.0, entropy 1.69721, value loss 0.10054, policy loss -0.11975
Updates 220, num timesteps 17680, FPS 2450, mean/median reward -20.4/-21.0, min/max reward -21.0/-18.0, entropy 1.64935, value loss 0.06355, policy loss 0.12313
Updates 230, num timesteps 18480, FPS 2463, mean/median reward -20.4/-21.0, min/max reward -21.0/-18.0, entropy 1.70666, value loss 0.00183, policy loss 0.06517

GAE implementation for PPO

I am trying to make sense of the GAE implementation in PPO and have some doubts. When use_gae==False, compute_returns sets the returns to the empirical discounted sum of rewards.
https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/c6382c52f86be60185be762dd1538c00d64b48db/storage.py#L46

This is completely fine. But when use_gae==True, it sets returns to (empirical) advantage + V.

But the value loss is defined as MSE between the predicted values returned by the model and the computed returns https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/c6382c52f86be60185be762dd1538c00d64b48db/main.py#L220.

So it seems to me that if use_gae==True, the model learns advantage + V instead of the value function. On the other hand, it is still used as if it were returning the value prediction (e.g. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/c6382c52f86be60185be762dd1538c00d64b48db/main.py#L120).

Am I missing something?
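One way to look at it: under GAE, `advantage + V` is the λ-return, which is itself an estimate of the return from that state, so regressing the value head onto it is still value learning. The sketch below is a minimal illustration for a single rollout with no episode boundaries (masks are omitted); the function names are illustrative and are not the repository's.

```python
def discounted_returns(rewards, next_value, gamma):
    """Discounted reward sum, bootstrapped from the value after the last step."""
    ret = next_value
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        ret = rewards[t] + gamma * ret
        out[t] = ret
    return out


def gae_returns(rewards, values, next_value, gamma, lam):
    """Value targets computed as GAE advantage + V(s_t)."""
    vals = list(values) + [next_value]
    gae = 0.0
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        # One-step TD error at step t.
        delta = rewards[t] + gamma * vals[t + 1] - vals[t]
        # Exponentially weighted sum of TD errors.
        gae = delta + gamma * lam * gae
        out[t] = gae + vals[t]
    return out


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]
mc_targets = discounted_returns(rewards, 0.2, 0.99)
lam1_targets = gae_returns(rewards, values, 0.2, 0.99, lam=1.0)
lam0_targets = gae_returns(rewards, values, 0.2, 0.99, lam=0.0)
```

With `lam=1.0` the TD errors telescope and the targets match the plain discounted returns; with `lam=0.0` they reduce to the one-step target `r_t + gamma * V(s_{t+1})`. Values in between trade bias against variance.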

tanh in the MLP?

Just a question: you're using tanh in the MLPPolicy. Any reason for this? I thought ReLUs were overall better for backpropagating the gradient, so maybe they should be used in the MLPPolicy as well?

invalid syntax: main.py, line 60, pos 49 ??

"invalid syntax: main.py, line 60, pos 49 in file /home/pytorch-a2c-ppo-acktr-new/main.py
obs_shape = (obs_shape[0] * args.num_stack, *obs_shape[1:])" , is it ok?
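For context: unpacking with `*` inside a tuple display was added in Python 3.5 (PEP 448), so older interpreters report a syntax error at that position. A version-independent spelling, sketched with a hypothetical helper name, builds the same tuple by concatenation:

```python
def stacked_shape(obs_shape, num_stack):
    """Multiply the leading dimension by num_stack and keep the rest unchanged.

    Same result as (obs_shape[0] * num_stack, *obs_shape[1:]), but it also
    parses on Python 2.7 and 3.4.
    """
    return (obs_shape[0] * num_stack,) + tuple(obs_shape[1:])


shape = stacked_shape((1, 84, 84), 4)
```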

typo in arg

There is a typo in the description of the argument --max-grad-norm.

    parser.add_argument('--value-loss-coef', type=float, default=0.5,
                        help='value loss coefficient (default: 0.5)')
    parser.add_argument('--max-grad-norm', type=float, default=0.5,
                        help='value loss coefficient (default: 0.5)')

Also, is 0.5 an appropriate default value (it is suspiciously equal to the value-loss-coef one)?

Bye
C
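A possible corrected declaration is sketched below. It assumes, from the flag's name, that `--max-grad-norm` is the gradient-clipping threshold; the exact help wording is a suggestion, not the repository's text.

```python
import argparse

parser = argparse.ArgumentParser(description='RL')
parser.add_argument('--value-loss-coef', type=float, default=0.5,
                    help='value loss coefficient (default: 0.5)')
# Help text now describes the clipping threshold instead of repeating
# the description of --value-loss-coef.
parser.add_argument('--max-grad-norm', type=float, default=0.5,
                    help='max norm of the gradients (default: 0.5)')

args = parser.parse_args([])
```

On the default itself: to my knowledge, the A2C implementation in OpenAI Baselines also uses 0.5 for both settings, so the coincidence may be inherited from there.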

Atari games not learning?

I tried training on different Atari games. With the default parameters, neither Breakout nor Boxing learned anything. I tried both acktr and a2c and waited for 10M steps.
Pong, on the other hand, learned with no problem.
Are you still able to reproduce the results in the graphs in the readme?

broken master (input shape)

It appears that the last commit broke the input shape in the Atari environment:

[lucibello@bidsa-sc pytorch-a2c-ppo-acktr]$ python main.py 
#######
WARNING: All rewards are clipped or normalized so you need to use a monitor (see envs.py) or visdom plot to get true rewards
#######
Traceback (most recent call last):
  File "main.py", line 265, in <module>
    main()
  File "main.py", line 122, in main
    Variable(rollouts.masks[step], volatile=True))
  File "/home/lucibello/Git/pytorch-a2c-ppo-acktr/model.py", line 24, in act
    value, x, states = self(inputs, states, masks)
  File "/home/lucibello/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lucibello/Git/pytorch-a2c-ppo-acktr/model.py", line 87, in forward
    x = self.conv1(inputs / 255.0)
  File "/home/lucibello/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lucibello/.local/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 254, in forward
    self.padding, self.dilation, self.groups)
  File "/home/lucibello/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 52, in conv2d
    return f(input, weight, bias)
RuntimeError: Given input size: (336, 84, 1). Calculated output size: (16, 20, 0). Output size is too small.

Reverting to recent commits fixes the problem.

Cheers
C
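The reported input size (336, 84, 1) is what one would get if frame stacking multiplied the first axis of a height-width-channel observation, since 84 × 4 = 336. This is a guess from the numbers in the traceback, not a confirmed diagnosis. The sketch assumes 84×84 grayscale frames and a stack of 4; both helper names are hypothetical.

```python
def to_channel_first(shape_hwc):
    """(H, W, C) -> (C, H, W), the layout PyTorch conv layers expect."""
    h, w, c = shape_hwc
    return (c, h, w)


def stack_leading_axis(shape, num_stack):
    """Frame stacking that multiplies whatever axis comes first."""
    return (shape[0] * num_stack,) + tuple(shape[1:])


env_shape = (84, 84, 1)  # assumed channel-last observation shape

# Stacking before the transpose inflates the height axis.
wrong = stack_leading_axis(env_shape, 4)
# Transposing first stacks along the channel axis, as intended.
right = stack_leading_axis(to_channel_first(env_shape), 4)
```

A last axis of size 1 is narrower than any convolution kernel, which would explain the "Output size is too small" error.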

Custom Env log problems

Hello guys, I'm having problems logging my custom env. Nothing gets written to the .csv. I tried to work out where the data is written, but I couldn't find it.

The only data written is:
#{"t_start": 1521155105.3171377, "env_id": "my_custom_env"}
r,l,t

Could someone help me with it?
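A header followed by an empty `r,l,t` table often means that no episode has finished yet: as far as I know, Monitor-style wrappers write one row (reward, length, elapsed time) only when `step` returns `done=True`. The class below is a simplified stand-in that shows this behaviour; it is not the Baselines implementation.

```python
import csv
import io
import time


class EpisodeLogger:
    """Simplified stand-in for a monitor wrapper; writes one row per finished episode."""

    def __init__(self, stream):
        self.writer = csv.writer(stream)
        self.writer.writerow(['r', 'l', 't'])
        self.t_start = time.time()
        self.rewards = []

    def record_step(self, reward, done):
        self.rewards.append(reward)
        if done:
            # Rows appear only here, at the end of an episode.
            elapsed = round(time.time() - self.t_start, 6)
            self.writer.writerow([sum(self.rewards), len(self.rewards), elapsed])
            self.rewards = []


stream = io.StringIO()
logger = EpisodeLogger(stream)
logger.record_step(1.0, False)
logger.record_step(0.5, False)
rows_before_done = stream.getvalue().strip().splitlines()
logger.record_step(2.0, True)
rows_after_done = stream.getvalue().strip().splitlines()
```

If the custom environment never sets `done` to true, or if episodes are very long, the file would stay at the header, which matches the output shown above.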

Why are the value_loss and action_loss summed?

Why are the value_loss and action_loss summed together in line 185 of main.py,

(value_loss * args.value_loss_coef + action_loss - dist_entropy * args.entropy_coef).backward()

when they were both derived separately in line 163 and 165 and when both networks are meant to be separate models? I don't quite understand. Thank you in advance!
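Summing is safe even when the actor and critic share no parameters: the gradient of a sum is the sum of the gradients, so each parameter only receives gradient from the terms that depend on it, and one backward pass replaces two. The toy below uses two scalar parameters and finite differences instead of autograd; the loss functions are made up for illustration.

```python
def numeric_grad(f, params, eps=1e-6):
    """Central-difference gradient of f at params."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


def value_loss(p):
    return (2.0 * p[0] - 1.0) ** 2   # depends only on the "critic" parameter p[0]


def action_loss(p):
    return -3.0 * p[1]               # depends only on the "actor" parameter p[1]


def total_loss(p):
    return 0.5 * value_loss(p) + action_loss(p)


params = [1.0, 0.5]
grad_total = numeric_grad(total_loss, params)
grad_value = numeric_grad(value_loss, params)
grad_action = numeric_grad(action_loss, params)
```

Here `grad_total` is `0.5 * grad_value + grad_action` component by component. When the two heads do share layers, the coefficient controls how strongly the value loss pulls on the shared weights.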

Missing Documentation Plotting Script

Hello,
Although I understand the general idea of the plotting script (smoothing the reward curve), I have trouble understanding some of its functions in detail.

First, what is the idea behind fix_point(x, y, interval) ?

Then, in the smoothing function:

def smooth_reward_curve(x, y):
    # Halfwidth of our smoothing convolution
    halfwidth = min(31, int(np.ceil(len(x) / 30)))
    k = halfwidth
    xsmoo = x[k:-k]
    ysmoo = np.convolve(y, np.ones(2 * k + 1), mode='valid') / \
        np.convolve(np.ones_like(y), np.ones(2 * k + 1), mode='valid')
    downsample = max(int(np.floor(len(xsmoo) / 1e3)), 1)
    return xsmoo[::downsample], ysmoo[::downsample]

What do 30 and 31 correspond to in halfwidth = min(31, int(np.ceil(len(x) / 30)))?

PS: the two links at the top of the script are dead now.
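Reading only the code quoted above: the 30 makes the half-width about 1/30 of the series length, and the 31 caps it, so the averaging window (2k + 1 points) never exceeds 63 samples. The intent behind those particular numbers is not documented. The helper names below are hypothetical and simply evaluate the same expressions without NumPy.

```python
import math


def smoothing_window(num_points):
    """Half-width and full width of the moving-average window."""
    halfwidth = min(31, int(math.ceil(num_points / 30)))
    return halfwidth, 2 * halfwidth + 1


def downsample_step(num_smoothed):
    """Stride used to keep roughly 1000 points in the plotted curve."""
    return max(int(math.floor(num_smoothed / 1e3)), 1)


short_series = smoothing_window(300)    # window grows with the data
long_series = smoothing_window(3000)    # cap of 31 applies
stride = downsample_step(2500)
```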

Is there a way to track the episode length?

Hi Ilya,

I'm trying to solve the navigation task with your implementation of A2C. An example is here: https://github.com/zfw1226/icra2017-visual-navigation/blob/master/training_thread.py#L161

One evaluation metric is the episode length, i.e., the number of steps the agent takes from the start location to the target location. I was wondering if there is a way to track the episode length in your code? It is easy to track the episode length with A3C since training threads are independent of each other. Thank you very much!
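With synchronous workers, one approach is a per-environment step counter that is recorded and reset whenever that environment reports `done`. The class below is a hypothetical helper, not part of the repository; it assumes the training loop has access to the list of `done` flags returned at each step.

```python
class EpisodeLengthTracker:
    """Hypothetical helper: counts steps per parallel environment."""

    def __init__(self, num_envs):
        self.current = [0] * num_envs
        self.finished = []

    def update(self, dones):
        for i, done in enumerate(dones):
            self.current[i] += 1
            if done:
                # The episode in environment i just ended; store its length.
                self.finished.append(self.current[i])
                self.current[i] = 0


tracker = EpisodeLengthTracker(num_envs=2)
for dones in [[False, False], [True, False], [False, True], [True, False]]:
    tracker.update(dones)
```

If the environments are wrapped in a monitor that logs `r,l,t` rows, the `l` column should carry the same information.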

Upload pre-trained models for A2C

Hello,

Can you please upload pre-trained models for A2C? (There seem to be only PPO and ACKTR models in your Google Drive.) I have a specific request for A2C Space Invaders, if you have that model available.

Thanks

What's the shape of the input param actions in logprobs_and_entropy in distributions.py?

First, I can't run this program because I had trouble installing mujoco-py in the newest miniconda environment, so it's hard for me to insert print() calls or use other methods to inspect it.
However, there is a question about the input parameter named actions in logprobs_and_entropy. Suppose the shape of the other input parameter named x is (N, M), then shape of actions should be (N,1) according to
action_log_probs = log_probs.gather(1, actions). But in main.py#L154, actions's shape is (N,action_shape) according to rollouts.actions.view(-1, action_shape).
Could you please explain it ? Thanks very much!
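A possible reconciliation, assuming that a Discrete action space stores one integer action per step (so `action_shape` would be 1 and `view(-1, action_shape)` gives shape (N, 1)). The pure-Python function below mimics the semantics of `gather` along dimension 1, where `out[i][j] = input[i][index[i][j]]`.

```python
def gather_dim1(rows, index):
    """Pure-Python equivalent of tensor.gather(1, index) for 2-D inputs."""
    return [[row[j] for j in idx_row] for row, idx_row in zip(rows, index)]


# N = 2 samples, M = 3 discrete actions.
log_probs = [[-0.1, -2.3, -3.0],
             [-1.2, -0.4, -2.5]]
# One chosen action per sample, shape (N, 1).
actions = [[0], [1]]

action_log_probs = gather_dim1(log_probs, actions)
```

For continuous (Box) action spaces the actions are real-valued vectors, so they cannot serve as a gather index; a different distribution class would have to compute their log-probabilities.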

LSTM policy

Great work on the implementation!
Very comprehensible and straightforward implementation.

It seems you're performing two forward steps: 1) to choose an action (main.py, line 113), 2) to evaluate the actions (main.py, line 146).
Why not save the values, log_probs and entropy while selecting actions (as you did in a3c)?
Are there computational benefits to performing these for all processes at once?

LSTM policy hyper parameters

Would it be possible to share LSTM policy hyper parameters? I couldn't make LSTM policy work using A2C. Thanks.

Suggestion: Mujoco - add timestep to the observation

See for example here: https://github.com/pat-coady/trpo
A major issue in Mujoco domains is that the episode terminates due to a timeout after a fixed number of steps.
This makes the state non-Markovian: a state near the end of the time limit and an identical state at the beginning are assigned different values.
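A minimal sketch of the suggestion: a wrapper that appends the elapsed fraction of the time limit to every observation. The wrapper and the toy environment are hypothetical, observations are assumed to be flat sequences of floats, and `max_steps` is assumed to equal the environment's time limit.

```python
class TimeFeatureWrapper:
    """Hypothetical wrapper: adds t / max_steps as an extra observation feature."""

    def __init__(self, env, max_steps):
        self.env = env
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return list(self.env.reset()) + [0.0]

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.t += 1
        return list(obs) + [self.t / self.max_steps], reward, done, info


class ToyEnv:
    """Stand-in environment with a two-dimensional observation."""

    def reset(self):
        return [0.0, 0.0]

    def step(self, action):
        return [1.0, -1.0], 0.0, False, {}


env = TimeFeatureWrapper(ToyEnv(), max_steps=4)
first_obs = env.reset()
next_obs, _, _, _ = env.step(0)
```

The policy's input size grows by one, so the observation space passed to the model would need the same adjustment.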

F is an undefined name

https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/master/kfac.py#L15

How do ppo & acktr compare in terms of wall-clock time?

The ACKTR paper says that ACKTR processes ~25% fewer steps per minute than A2C.
What percentage fewer steps per minute does PPO process compared to A2C?

potential bugs in kfac.py

In compute_cov_a,
a = a.view(-1, a.size(-1)).div_(a.size(1)).div_(a.size(2))
should be
a = a.view(-1, a.size(-1))

In compute_cov_g,
g = g.view(-1, g.size(-1)).mul_(g.size(1)).mul_(g.size(2))
should be
g = g.view(-1, g.size(-1))
and
g_ = g * batch_size
should be deleted. @ikostrikov Why do you multiply g by batch_size?

Python 2 Support

Hello,
I forked your repo and made minor edits to get it to work with Python 2:
https://github.com/araffin/pytorch_agents/tree/python2

There is an additional change required in OpenAI Baselines (https://github.com/openai/baselines/blob/master/baselines/common/dataset.py#L50):
The definition (line 50):

def iterbatches(arrays, *, num_batches=None, batch_size=None, shuffle=True, include_final_partial_batch=True):

needs to be replaced by:

def iterbatches(arrays, num_batches=None, batch_size=None, shuffle=True, include_final_partial_batch=True):

Do you want me to do a pull request ?
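For readers unfamiliar with the syntax: a bare `*` in a parameter list marks every following parameter as keyword-only (PEP 3102), which Python 2 cannot parse. Dropping it restores compatibility but also lets callers pass those arguments positionally. The two functions below are illustrative and are not the Baselines code.

```python
def keyword_only(arrays, *, batch_size=None):
    """Python 3 only: batch_size must be passed by name."""
    return batch_size


def portable(arrays, batch_size=None):
    """Parses on Python 2 and 3; batch_size may also be positional."""
    return batch_size


try:
    keyword_only([1, 2, 3], 2)
    positional_rejected = False
except TypeError:
    positional_rejected = True

by_name = keyword_only([1, 2, 3], batch_size=2)
by_position = portable([1, 2, 3], 2)
```

Existing callers that already pass these arguments by name behave the same with either definition.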

Why do we need to evaluate the actor_critic model twice?

I'm sorry if I'm missing something obvious.

What is the reason we have to evaluate the model twice, i.e. once when we call actor_critic.act(...)
(here) and a second time when we call actor_critic.evaluate(...) (here).

(I so far have only looked at A2C)

I understand that the computation graph isn't saved in the current implementation from when we call act. But couldn't we just save values,action_log_probs and dist_entropy in a list of Variables when we call act?

Thanks for your help (and the implementation)!

Entropy of normal distribution

Thanks for sharing your code!

I think you have an issue here though:
https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/bcb4b5f8cfa2ae5332e9fcac72526d24579afcd5/distributions.py#L82

The entropy should look like this:
https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Entropy

Also, you do not seem to be able to learn the variance here:
https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/bcb4b5f8cfa2ae5332e9fcac72526d24579afcd5/distributions.py#L56

If I'm mistaken please let me know!
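For reference, with a diagonal covariance the multivariate entropy ½ ln det(2πeΣ) reduces to a sum over dimensions of ½ + ½ ln(2π) + ln σ_i. The sketch below evaluates that sum from log standard deviations; it says nothing about what the linked line computes.

```python
import math


def diag_gaussian_entropy(log_stds):
    """Entropy of a Gaussian with diagonal covariance, given log std per dimension."""
    return sum(0.5 + 0.5 * math.log(2 * math.pi) + ls for ls in log_stds)


unit_entropy = diag_gaussian_entropy([0.0])                   # sigma = 1, one dimension
wider_entropy = diag_gaussian_entropy([math.log(2.0)] * 2)    # sigma = 2, two dimensions
```

The entropy depends only on the standard deviations, not on the mean, so an entropy bonus can only have an effect if the standard deviation is a learnable quantity.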

PPO episode ends before num_steps

I just wanted to understand what happens when the episode ends sooner than num_steps.
I was going through the code and the env (for which done=True) doesn't appear to be reset.
Shouldn't that be happening? For example when you have just 1 process and the episode ends.
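As far as I know, the vectorized environment wrappers from Baselines reset a sub-environment automatically as soon as it reports `done`, so the training loop does not call `reset` itself. What the loop does need is a mask that stops returns from leaking across the episode boundary. The sketch below uses an illustrative function name and assumes `masks[t]` is 0.0 when the episode ended at step `t` and 1.0 otherwise.

```python
def masked_returns(rewards, masks, next_value, gamma):
    """Discounted returns that restart at every episode boundary."""
    ret = next_value
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        # A zero mask discards everything that happened after the episode ended.
        ret = rewards[t] + gamma * masks[t] * ret
        out[t] = ret
    return out


# The episode ends at step 1; steps 2 and 3 belong to the next episode.
returns = masked_returns(rewards=[1.0, 1.0, 1.0, 1.0],
                         masks=[1.0, 0.0, 1.0, 1.0],
                         next_value=10.0,
                         gamma=0.5)
```

The same mask would be used to zero the recurrent state when a new episode starts.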

large negative rewards for Pong

When running
python main.py --env-name "PongNoFrameskip-v4"

The system doesn't seem to learn, and I get large negative mean rewards. Here is an example print out:

Updates 10030, num timesteps 802480, FPS 1828, mean/median reward -20.4/-20.0, min/max reward -21.0/-19.0, entropy 1.77331, value loss 0.02498, policy loss 0.03232

Pytorch Version 0.3.1
Baselines most recent commit (b71152eea0470ac2629c33e0fc66a54fe494949f)

I have tried running it multiple times and have gotten the same result.

Continuous action space: range not taken into account

Hello there!

I'm trying to train a CNN to control a robot with a differential drive. My gym environment has this action space:

        self.action_space = spaces.Box(
            low=-1,
            high=1,
            shape=(2,)
        )

That is, I need the CNN to output two motor velocities in the range [-1, 1]. Unfortunately, at the moment, the low and high range of my action space isn't taken into account. I get outputs as high as 540, which makes the robot spin out of control.

This seems like it should be an easy problem to fix, but I'm still very new to PyTorch. Could you make the change, or advise me as to how to fix this?
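Two common workarounds, sketched under the assumption that the policy samples from an unbounded Gaussian: clip the sampled action to the bounds just before calling `env.step`, or squash it with `tanh` and rescale. Both helpers are hypothetical and operate on plain lists.

```python
import math


def clip_action(action, low, high):
    """Clamp each component to [low, high] before sending it to the environment."""
    return [min(max(a, low), high) for a in action]


def squash_action(raw, low, high):
    """Map an unbounded value smoothly into [low, high] with tanh."""
    return [low + (math.tanh(r) + 1.0) * 0.5 * (high - low) for r in raw]


clipped = clip_action([540.0, -3.0], -1.0, 1.0)
squashed = squash_action([540.0, 0.0, -3.0], -1.0, 1.0)
```

Clipping is the simpler option because the stored action and its log-probability stay untouched. Squashing changes the action distribution, so the log-probability would need a correction term for the update to remain exact.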

Two questions regarding recurrent policies

I have two questions regarding the implementation of recurrent policies:

1. Why do you have a loop recomputing states in your recurrent policy? It seems you could use the states you already computed and stored in rollouts. This would get rid of the loop, which is kind of ugly and difficult to follow (it took me a while to figure out what was happening): https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/master/model.py#L104
2. What's missing for ACKTR and your KFAC optimizer to support recurrent policies? This is something I would like to have, because ACKTR seems more resilient than straight A2C.

Could you please share the hyper-parameters for KUKA using PPO?

I have tried both discrete and continuous action spaces in Kuka by setting the Kuka env parameter IsDiscrete to true or false, but I still cannot make it work. Could you please tell me the parameters you used for Kuka?
Thanks very much!!
P.S. I use pybullet 2.87 with pybullet 1.92, which has a reward based on the object reaching a certain height (main) and the distance between gripper and object (minor reward shaping, every step). Thanks again!

if self.linear1 = nn.LSTMCell

In model.py, if self.linear1 were set to nn.LSTMCell, what would the resulting performance be?

default load_dir in enjoy.py is incorrect

Hi there, it's me again!

There is a minor issue in enjoy.py. You have: save_path = os.path.join(args.save_dir, args.algo) in main.py. However, the algorithm name isn't taken into account in enjoy.py. I would suggest adding an --algo argument to enjoy.py, to keep it similar to what main.py expects.

There also seems to be another problem: the .act() function is called with two arguments missing. As far as I can tell, these are not actually used in the PyTorch code, but the .act() function expects them.

Support for MultiDiscrete action spaces

Currently only Box and Discrete action spaces are supported. For my purposes I would need MultiDiscrete spaces. Is this already planned? If not, I would be happy to implement the missing distribution. The implementation should be straightforward and could be based on the existing Categorical distribution. Or am I missing a crucial blocking property of the current implementation that complicates such an endeavour?
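A rough sketch of the idea, assuming the sub-actions are treated as independent given the state: keep one categorical head per dimension and sum the per-dimension log-probabilities. The function names are hypothetical and the code uses plain lists instead of tensors.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def multidiscrete_log_prob(logits_per_dim, action):
    """Joint log-probability as a sum over independent categorical heads."""
    return sum(log_softmax(logits)[a]
               for logits, a in zip(logits_per_dim, action))


# Two sub-actions with 2 and 4 choices, uniform logits.
logits_per_dim = [[0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
joint_log_prob = multidiscrete_log_prob(logits_per_dim, action=[1, 2])
```

Under the same independence assumption, the entropy is also the sum of the per-head entropies.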

K-FAC

Is it possible to use your implementation of K-FAC as an optimizer for other RL algorithms (e.g. DDPG) or is it especially designed for ACKTR?

ImportError: cannot import name 'BatchSampler'

python main.py --env-name "PongNoFrameskip-v4"
Traceback (most recent call last):
  File "main.py", line 12, in <module>
    from torch.utils.data.sampler import BatchSampler, SubsetRandomSampler
ImportError: cannot import name 'BatchSampler'

Missing `args.value_loss_coef`?

https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/17ea8333ecbfe6552470f50fab4f83e1444f43a6/main.py#L226

LazyFrames

Is there any elegant way to incorporate the LazyFrames memory optimization trick from baselines.common.atari_wrappers? WrapPyTorch seems to prevent us from using frame_stack=True in wrap_deepmind, and the naive hack of converting to a numpy array in the _observation method nullifies the memory optimization provided by LazyFrames.

Could you share the hyper-parameters of a2c for different discrete action spaces?

I tried the default settings with a Discrete(6) action space, but they don't work when the action space has size 16, 26, or even 41. Could you provide some tips, or your hyper-parameters, for training games with larger action spaces?
Thanks so much!!!

Setting random seed

Hi --

I'm trying to set this up so that it gets the exact same results every time (eg, for regression tests). Even when I set the seeds, I get (slightly) different results on each run. Any ideas what might be going on there?

Thanks
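A sketch of the usual seeding pattern, in which every worker derives its own seed from a base seed plus its rank. Only the standard library is used here; in the real code the same idea would apply to `torch.manual_seed`, `np.random.seed` and each environment's `seed` method. The function is a toy stand-in for one worker's rollout.

```python
import random


def rollout(seed, rank, steps=5):
    """Toy worker: draws random actions from its own generator."""
    rng = random.Random(seed + rank)   # per-worker seed, independent of global state
    return [rng.randint(0, 5) for _ in range(steps)]


first_run = rollout(seed=1, rank=0)
second_run = rollout(seed=1, rank=0)
```

Even with every generator seeded, small run-to-run differences may remain, for example from non-deterministic GPU kernels or from the timing of the worker processes. That is a guess at the cause, not a diagnosis.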

Catch small bin_size parameters

This part of the visualization https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/blob/163f0199a89339711ba09ce28420b554f6bf024d/visualize.py#L19 breaks if the bin_size is not at least 3. Maybe catch small bin sizes with an assert?

Adding Prioritized Experience Replay

Hi, I was wondering if a Prioritized Experience Replay buffer could be added to PPO?

They do something similar to that here - Leveraging Demonstrations for Deep Reinforcement
Learning on Robotics Problems with Sparse Rewards, with DDPG.

I'm guessing though PPO would be more stable?

Perhaps OpenAI's prioritized replay_buffer, from the baselines repo could be used?

Add Dockerfile

Can you list the exact commands you would run to install the dependencies? Then we can make a Dockerfile so people can run it as a docker container on any computer/server without installing anything.

use beta distribution instead of gaussian when you add continuous actions

Beta distribution seems to always work better than Gaussian on high dimensional continuous control tasks:

"Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution"
http://proceedings.mlr.press/v70/chou17a/chou17a.pdf
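To illustrate the appeal: a Beta sample lies in [0, 1], so after an affine rescale it always falls inside the action bounds and never needs clipping. The helper below is hypothetical and uses the standard library sampler; in a policy network, `alpha` and `beta` would be outputs of the model.

```python
import random


def sample_beta_action(alpha, beta, low, high, rng):
    """Draw one bounded action component from a rescaled Beta distribution."""
    u = rng.betavariate(alpha, beta)   # always within [0, 1]
    return low + u * (high - low)


rng = random.Random(0)
samples = [sample_beta_action(2.0, 2.0, -1.0, 1.0, rng) for _ in range(1000)]
```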

Q: reduced #process on GPU?

Hi,

Previously (a few days ago) in training I saw (args.num_processes + 1) processes on GPU using nvidia-smi, and GPU utilization high 80~90%.

With the latest code, I saw only one process on GPU, and sometimes GPU utilization only a few percent.

I'm just wondering what changed, and whether it's intended.

Thanks.

Hyperparameter for mujoco environment

Hi, I was able to get the desired plot for Reacher-v1 using PPO but was unable to do so for the HalfCheetah-v1 env. Did you use the same set of hyperparameter values for all the Mujoco environments with PPO?

ValueError: could not convert string to float

python main.py --env-name "BreakoutNoFrameskip-v4" --algo acktr --num-processes 32 --num-steps 20

Updates 10590, num timesteps 6778240, FPS 186, mean/median reward 16.6/8.0, min/max reward 0.0/76.0, entropy -0.77548, value loss 0.36399, policy loss 0.00355
Updates 10600, num timesteps 6784640, FPS 186, mean/median reward 17.7/8.0, min/max reward 0.0/94.0, entropy -0.79836, value loss 0.10146, policy loss 0.02539
Traceback (most recent call last):
  File "main.py", line 246, in <module>
    main()
  File "main.py", line 242, in main
    win = visdom_plot(viz, win, args.log_dir, args.env_name, args.algo)
  File "/home/x/project/pytorch-a2c-ppo-acktr/visualize.py", line 104, in visdom_plot
    tx, ty = load_data(folder, smooth, bin_size)
  File "/home/x/project/pytorch-a2c-ppo-acktr/visualize.py", line 64, in load_data
    tmp = [t_time, int(tmp[1]), float(tmp[0])]
ValueError: could not convert string to float: '\x00\x00\x00\x00\x00\x00\x00\x00 [long run of \x00 null bytes; the original output is cut off here]
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00296.0'

DataParallel problem with PPO

Hi,

It looks like adding act breaks PPO:

Traceback (most recent call last):
  File "main.py", line 224, in <module>
    main()
  File "main.py", line 102, in main
    value, action, action_log_probs = actor_critic.act(Variable(rollouts.states[step], volatile=True))
  File "/home/ajay/anaconda3/envs/py35_pytorch/lib/python3.5/site-packages/torch/nn/modules/module.py", line 262, in __getattr__
    type(self).__name__, name))
AttributeError: 'DataParallel' object has no attribute 'act'

I think the problem is here:


    actor_critic = ActorCritic(envs.observation_space.shape[0] * args.num_stack, envs.action_space)
    if args.algo == 'ppo':
        actor_critic = nn.DataParallel(actor_critic)

I tried the obvious fix, but got another error:

    value, action, action_log_probs = nn.DataParallel(actor_critic.act(Variable(rollouts.states[step], volatile=True)))
  File "/home/ajay/anaconda3/envs/py35_pytorch/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 53, in __init__
    self.module.cuda(device_ids[0])
AttributeError: 'tuple' object has no attribute 'cuda'

I then tried without nn.DataParallel and got yet another error:

Traceback (most recent call last):
  File "main.py", line 225, in <module>
    main()
  File "main.py", line 182, in main
    states_batch = rollouts.states[:-1].view(-1, *rollouts.states.size()[-3:])[indices]
TypeError: indexing a tensor with an object of type list. The only supported types are integers, slices, numpy scalars and torch.LongTensor or torch.ByteTensor as the only argument.
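
Both errors can be reproduced and worked around outside the repository. The sketch below is not the repository's code: TinyActorCritic is a hypothetical stand-in for ActorCritic, and the layer sizes and shapes are made up for illustration. It relies on two things that hold in PyTorch itself: nn.DataParallel only dispatches forward() and keeps the wrapped model on its .module attribute, and a tensor can always be indexed with a LongTensor.

```python
import torch
import torch.nn as nn


class TinyActorCritic(nn.Module):
    """Hypothetical stand-in for the repository's ActorCritic model."""

    def __init__(self, num_inputs, num_actions):
        super().__init__()
        self.base = nn.Linear(num_inputs, 16)
        self.critic = nn.Linear(16, 1)
        self.actor = nn.Linear(16, num_actions)

    def forward(self, inputs):
        hidden = torch.tanh(self.base(inputs))
        return self.critic(hidden), self.actor(hidden)

    def act(self, inputs):
        # Sample an action and return its value estimate and log-probability.
        value, logits = self(inputs)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        return value, action, dist.log_prob(action)


actor_critic = nn.DataParallel(TinyActorCritic(4, 2))

# The wrapper has no act(); custom methods are reached through .module.
device = next(actor_critic.module.parameters()).device
states = torch.randn(8, 4, device=device)

# torch.no_grad() plays the role of Variable(..., volatile=True) on PyTorch >= 0.4.
with torch.no_grad():
    value, action, action_log_probs = actor_critic.module.act(states)

# Convert a Python list of indices to a LongTensor before indexing.
indices = [0, 3, 5]
states_batch = states[torch.tensor(indices, dtype=torch.long, device=device)]
```

Calling actor_critic.module.act(...) bypasses the multi-GPU scatter, so action selection runs on a single device; only calls that go through the wrapper itself, such as actor_critic(states), are split across GPUs.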

Linear schedule for learning rate

The ACKTR paper states: "Both the baseline (A2C) and our method used a linear schedule for the learning rate over the course of training."
Is this implemented? I cannot find it in the code.
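
For reference, a linear schedule of this kind could look roughly like the sketch below. It is an assumption about what the paper means (decay from the initial learning rate to zero over all updates), not code from this repository. The names linear_lr and set_lr are hypothetical, and set_lr assumes only that the optimizer exposes param_groups the way torch.optim optimizers do.

```python
def linear_lr(initial_lr, update, total_updates):
    """Learning rate decayed linearly from initial_lr (update 0) to 0 (last update)."""
    fraction_left = 1.0 - update / float(total_updates)
    # Clamp so that extra updates past the end never yield a negative rate.
    return initial_lr * max(fraction_left, 0.0)


def set_lr(optimizer, lr):
    """Write lr into every parameter group of a torch.optim-style optimizer."""
    for group in optimizer.param_groups:
        group["lr"] = lr
```

In a training loop this would be called once per update, for example set_lr(optimizer, linear_lr(initial_lr, update, total_updates)) before the optimizer step.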

Compare with Uncertainty Bellman Equation

The new "Uncertainty Bellman Equation" seems to give SOTA results. It would be interesting to add it to your great list of algorithms 👍

Uncertainty Bellman Equation and Exploration
