dreamer's Introduction

Hi there 👋

🤖  AI Algorithms

dreamerv3 Mastering Diverse Domains through World Models
daydreamer DayDreamer: World Models for Physical Robot Learning
director Deep Hierarchical Planning from Pixels
dreamerv2* Mastering Atari with Discrete World Models
dreamer* Dream to Control: Learning Behaviors by Latent Imagination
planet* Learning Latent Dynamics for Planning from Pixels
batch-ppo* Efficient Batched Reinforcement Learning in TensorFlow

📈  Benchmarks

crafter Benchmarking the Spectrum of Agent Capabilities
diamond_env Standardized Minecraft Diamond task for reinforcement learning

🛠️  Tools

zerofun Remote function calls for array data using ZMQ
elements Building blocks for productive research
ninjax General Modules for JAX
handout Turn Python scripts into handouts with Markdown and figures

* Archived

dreamer's People

Contributors

danijar

dreamer's Issues

Invalid one-hot action with Google Research football environment

Hi! I'm trying to train Dreamer on the Google Research football environment. When training, I get the following error:

Traceback (most recent call last):
  File "dreamer.py", line 471, in <module>
    main(parser.parse_args())
  File "dreamer.py", line 451, in main
    functools.partial(agent, training=False), test_envs, episodes=1)
  File "/home/gridsan/jgonik/my-dreamer/tools.py", line 124, in simulate
    obs, _, done = zip(*[p()[:3] for p in promises])
  File "/home/gridsan/jgonik/my-dreamer/tools.py", line 124, in <listcomp>
    obs, _, done = zip(*[p()[:3] for p in promises])
  File "/home/gridsan/jgonik/my-dreamer/wrappers.py", line 379, in step
    obs, reward, done, info = self._env.step(action)
  File "/home/gridsan/jgonik/my-dreamer/wrappers.py", line 191, in step
    obs, reward, done, info = self._env.step(action)
  File "/home/gridsan/jgonik/my-dreamer/wrappers.py", line 240, in step
    obs, reward, done, info = self._env.step(action)
  File "/home/gridsan/jgonik/my-dreamer/wrappers.py", line 349, in step
    raise ValueError(f'Invalid one-hot action:\n{action}')
ValueError: Invalid one-hot action:
[ 1.      0.9995 -0.9995 -0.9775 -0.99   -1.      1.     -0.999   1.
  0.9995  0.999  -0.999   0.9995 -0.9995 -1.      0.9985 -1.     -1.
  1.    ]

I saw that people have previously encountered this issue with the Atari environment (#29), and the solution there was to use the epsilon_greedy exploration strategy. However, I get this error no matter which exploration strategy I use. Any help would be greatly appreciated!
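For reference, a rough sketch of the kind of check that raises this error (the actual wrapper code may differ): a one-hot action wrapper only accepts vectors with a single entry equal to 1 and all others 0, so a near-continuous vector like the one in the traceback is rejected.

import numpy as np

def check_one_hot(action, atol=1e-6):
  # A valid one-hot action has exactly one entry close to 1 and the rest close to 0.
  action = np.asarray(action, dtype=np.float32)
  reference = np.zeros_like(action)
  reference[np.argmax(action)] = 1.0
  if not np.allclose(action, reference, atol=atol):
    raise ValueError(f'Invalid one-hot action:\n{action}')
  return action

# The vector from the traceback fails because nearly every entry is close to +/-1,
# which suggests the agent is emitting tanh-squashed continuous actions instead of
# samples from a categorical (one-hot) action distribution.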

Why does dreamer not require stored hidden states?

Hi, @danijar

I noticed that Dreamer does not store the hidden states of the RSSM in the replay buffer and always uses zero initialization during learning. This seems to work reasonably well for Dreamer, but not for other off-policy algorithms such as R2D2. I've tried adding hidden states and burn-in periods to the replay buffer but did not spot any performance difference. Do you have any idea why this is the case?

By the way, could you please publish the hyperparameters you use in Atari games?
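For context, a minimal sketch of what zero initialization means here (the names and sizes are illustrative, not the repository's API): each training sequence sampled from the replay buffer unrolls the RSSM from an all-zero recurrent state rather than from a stored one.

import tensorflow as tf

def initial_rssm_state(batch_size, deter_size=200, stoch_size=30):
  # Illustrative zero-initialized RSSM state; the real state layout may differ.
  return dict(
      deter=tf.zeros([batch_size, deter_size]),
      stoch=tf.zeros([batch_size, stoch_size]))

# Every sampled subsequence would be unrolled from this state, so no per-episode
# hidden states need to be written into the replay buffer.
state = initial_rssm_state(batch_size=50)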

Performance on sparse reward

Hi Danijar,
Great work on Dreamer. I have a question: do you think Dreamer will perform well on extremely sparse-reward problems, where the only reward arrives at the very last step?

Thanks

Dreamer for Atari

In short, here's the bug when I ran atari_breakout:

  File "dreamer.py", line 463, in <module>
    main(parser.parse_args())
  File "dreamer.py", line 443, in main
    functools.partial(agent, training=False), test_envs, episodes=1)
  File "/home/mluo/dreamer/tools.py", line 124, in simulate
    obs, _, done = zip(*[p()[:3] for p in promises])
  File "/home/mluo/dreamer/tools.py", line 124, in <listcomp>
    obs, _, done = zip(*[p()[:3] for p in promises])
  File "/home/mluo/dreamer/wrappers.py", line 350, in step
    obs, reward, done, info = self._env.step(action)
  File "/home/mluo/dreamer/wrappers.py", line 162, in step
    obs, reward, done, info = self._env.step(action)
  File "/home/mluo/dreamer/wrappers.py", line 211, in step
    obs, reward, done, info = self._env.step(action)
  File "/home/mluo/dreamer/wrappers.py", line 320, in step
    raise ValueError(f'Invalid one-hot action:\n{action}')
ValueError: Invalid one-hot action:
[ 0.999  -0.9995  0.9995  0.9995] 

I was wondering what changes are needed to get Atari to work in your much cleaner Dreamer codebase, and what hyperparameter changes would be needed to match the results reported in the paper.

Dreamer for DMLab

I'm not sure if you're still working on this, but if you are, I was wondering why the implementation here doesn't support DMLab environments. The older implementation does, but I'd rather not have to use the older dependencies it relies on. Are you planning to add this, or do I have to go with the older implementation on google-research?

Differences in free nats clipping between Dreamer and the early and final PlaNet implementations

In PlaNet you changed the calculation of the free nats between the latest two commits from (5bb34f9)

loss = tf.maximum(tf.cast(free_nats, tf.float32), loss)

to (cbe77fc)

loss = tf.maximum(0.0, loss - float(free_nats))

Now in Dreamer you seem to have reverted to the first version again:

div = tf.maximum(div, self._c.free_nats)

As I see it, the two versions should be equivalent to

  1. loss = tf.maximum(loss, free_nats)
  2. loss = tf.maximum(loss, free_nats) - free_nats

if I'm not mistaken. So one will always clip the KL-loss to at least the value of the free nats, while the other will always reduce the loss by the value of the free nats (without making the loss negative)?

Which one is the preferred approach to free nats clipping?
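To make the difference concrete, a small numeric sketch of the two variants using the same names as above:

import tensorflow as tf

free_nats = 3.0
div = tf.constant([1.0, 2.5, 4.0, 10.0])  # example KL values

clipped = tf.maximum(div, free_nats)        # first version: floor the KL at free_nats
shifted = tf.maximum(0.0, div - free_nats)  # second version: subtract, then floor at zero

print(clipped.numpy())  # [ 3.  3.  4. 10.]
print(shifted.numpy())  # [0. 0. 1. 7.]

Below the threshold both variants have zero gradient, and above it both have the same gradient with respect to the KL, so they differ only by a constant offset of free_nats in the reported loss value.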

Are policy and value gradients propagated back through the world model?

Hi,

May I ask whether the gradients of the policy and value are backpropagated into the world model? If I understand your code correctly, they are not, as this line suggests. However, your paper mentions in several places that value and policy gradients are propagated back through the dynamics. Therefore, I am afraid I am misreading the code. Please let me know if I made any mistakes.

Best,

Sherwin
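For illustration only, a hedged sketch of the distinction the question touches on (the modules below are toy stand-ins, not the repository's code): stopping gradients at the imagination start states is not the same as blocking gradients through the imagined dynamics, and the choice of which variables are passed to each optimizer determines what actually gets updated.

import tensorflow as tf

# Toy stand-ins for a world model and actor; names and shapes are illustrative.
dynamics = tf.keras.layers.Dense(8, activation='tanh')
reward_head = tf.keras.layers.Dense(1)
actor = tf.keras.layers.Dense(2, activation='tanh')

start_states = tf.random.normal([16, 8])  # e.g. posterior states from the model

with tf.GradientTape() as tape:
  # Detaching the start states cuts the link to representation learning, but
  # gradients still flow through the imagined rollout below.
  states = tf.stop_gradient(start_states)
  ret = 0.0
  for _ in range(5):  # short imagination horizon
    action = actor(states)
    states = dynamics(tf.concat([states, action], -1))
    ret += reward_head(states)
  loss = -tf.reduce_mean(ret)

# The gradient path runs through the dynamics and reward networks, yet if only the
# actor's variables are passed to the optimizer, the world model is left unchanged.
grads = tape.gradient(loss, actor.trainable_variables)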

Provided scores don't match the results

Hello,

Thanks for the code! Just want to confirm, are the scores in dreamer.json from the old implementation or this TF2 implementation? This was asked before but not yet answered (and I also encounter the same reproducibility issue for Pendulum Swingup).

Why does the RSSM receive different inputs for learning and imagination?

From @xlnwel (asked in #4):

Also, I tried to reframe the RSSM so that I can wrap it with tf.keras.layers.RNN. Here's the cell structure I came up with:

(diagram of the proposed RSSM cell omitted)

where $e_t$ is the embedding and $s'_t$ is the output of the posterior. Here, I move the deterministic and stochastic state models into a single RSSM cell and compute the posterior after the RNN is unrolled. I humbly think this structure may be faster, as we can process all posteriors at once after the RNN is unrolled. However, I'm not sure if I'm missing something. Do you think this structure is consistent with your implementation?

After implementing my version of the RSSM, I found that I had misunderstood it before: the RSSM is unrolled differently during dynamics learning and behavior learning. I think the correct diagram is the following one:

(corrected RSSM diagram omitted)

During dynamics learning, the hidden state is computed from the output of the posterior $q$ (the left part), while during the imagination phase it is computed from the output of the prior $p$ (the right part). I can see that the left side matches the VAE structure. However, I can't explain why it should be preferred over the one I previously considered. Could you help me clarify this?
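A hedged sketch of the asymmetry being asked about (shapes and names are illustrative, not the repository's code): during dynamics learning the recurrent state is advanced from the posterior, while during imagination it is advanced from the prior, because no observation embedding is available there.

import tensorflow as tf

gru = tf.keras.layers.GRUCell(32)
prior_net = tf.keras.layers.Dense(8)      # predicts the stochastic state from deter only
posterior_net = tf.keras.layers.Dense(8)  # additionally sees the observation embedding

def obs_step(deter, stoch, action, embed):
  # Dynamics learning: the posterior (conditioned on the embedding) feeds the next step.
  deter, _ = gru(tf.concat([stoch, action], -1), [deter])
  prior = prior_net(deter)
  post = posterior_net(tf.concat([deter, embed], -1))
  return deter, post, prior

def img_step(deter, stoch, action):
  # Imagination: no observation is available, so the prior feeds the next step.
  deter, _ = gru(tf.concat([stoch, action], -1), [deter])
  return deter, prior_net(deter)

deter, stoch = tf.zeros([16, 32]), tf.zeros([16, 8])
action, embed = tf.zeros([16, 4]), tf.zeros([16, 64])
deter, post, prior = obs_step(deter, stoch, action, embed)
stoch = post    # learning: continue from the posterior
deter, prior = img_step(deter, stoch, action)
stoch = prior   # imagination: continue from the prior
# (Sampling from the predicted distributions is omitted for brevity.)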

How to train on low-dimensional Gym envs?

Hi,

I have a custom env derived from Gym's classic_control CartPole (but more difficult to solve). I currently train it with ACKTR from OpenAI Stable Baselines. I'd like to switch to Dreamer, but I read that it is focused on pixel/Atari observations, which CartPole does not provide. Can Dreamer be used for this, perhaps with small modifications?
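Not an official answer, but one low-effort option is to render the environment to pixels, since Dreamer's encoder and decoder expect image observations; the alternative is to swap the convolutional encoder/decoder for MLPs that handle vector observations. A rough sketch of the rendering route (the 64x64 size and the old gym rendering API are assumptions):

import gym
import numpy as np
from PIL import Image

class PixelCartPole:
  """Wraps a classic-control env so that observations are 64x64 RGB images."""

  def __init__(self, name='CartPole-v1', size=(64, 64)):
    self._env = gym.make(name)
    self._size = size

  def reset(self):
    self._env.reset()
    return self._render()

  def step(self, action):
    _, reward, done, info = self._env.step(action)
    return self._render(), reward, done, info

  def _render(self):
    frame = self._env.render(mode='rgb_array')
    return np.array(Image.fromarray(frame).resize(self._size))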

Have you encountered value function explosion during development?

Hi Danijar. Yet another question about the actor-critic part.

I am trying to use the actor-critic part in my own project with a different world model. Sometimes the whole model works fine, but sometimes the value function suddenly explodes exponentially. I use the same gradient clipping as you, but it does not stop the explosion. I wonder if you have encountered the same phenomenon during development. Anything would be helpful. Thanks in advance.

How to run Atari environments?

I'm trying to run some Atari examples with Dreamer, but I cannot find the correct parameters to do so without errors. For example, when running the Boxing environment with the command:
python dreamer.py --logdir ./logdir/atari/dreamer/1 --task atari_boxing

I get an error:

...
Start evaluation.
Training for 100 steps.
[5000] expl_amount 0.0 / model_grad_norm inf / value_grad_norm 2.4 / actor_grad_norm 0.8 / prior_ent 35.6 / post_ent 32.7 / image_loss 11355.0 / reward_loss 0.9 / div 3.0 / model_loss 11355.0 / value_loss 1.1 / actor_loss -0.8 / action_ent -83.1
Traceback (most recent call last):
  File "dreamer.py", line 463, in <module>
    main(parser.parse_args())
  File "dreamer.py", line 442, in main
    tools.simulate(
  File "/home/jannkar/repositories/dreamer/tools.py", line 124, in simulate
    obs, _, done = zip(*[p()[:3] for p in promises])
  File "/home/jannkar/repositories/dreamer/tools.py", line 124, in <listcomp>
    obs, _, done = zip(*[p()[:3] for p in promises])
  File "/home/jannkar/repositories/dreamer/wrappers.py", line 350, in step
    obs, reward, done, info = self._env.step(action)
  File "/home/jannkar/repositories/dreamer/wrappers.py", line 162, in step
    obs, reward, done, info = self._env.step(action)
  File "/home/jannkar/repositories/dreamer/wrappers.py", line 211, in step
    obs, reward, done, info = self._env.step(action)
  File "/home/jannkar/repositories/dreamer/wrappers.py", line 320, in step
    raise ValueError(f'Invalid one-hot action:\n{action}')
ValueError: Invalid one-hot action:
[-0.9155  1.      1.      1.      0.1128 -1.     -1.      1.      0.9995
  0.9136 -1.      1.     -1.      0.9985 -1.      1.      0.9897 -1.    ]

I noticed that the code runs further when --action_dist onehot is set, and I read that --pcont True should be set as well. But the code still throws the same error, now just on line 447. So which parameters need to be set to run the Atari environments?

Tensorflow-probability version

Hi Danijar,

The setup instructions do not mention which tensorflow-probability version to use. New versions probably don't introduce bugs, but just in case, do you know which version you used?

Thanks!

Is there a JavaScript or typescript version out there?

I'm building a web-based TensorFlow AI and have been looking for a good TF reinforcement learning library, but I haven't found one. So far I only have a very primitive Q-learning algorithm implemented, and I'd like to try a better algorithm designed for delayed gratification.

Runtime performance

Hi Danijar,
I have a question regarding the runtime performance. Is the implementation here the same as in the paper, at ~3 hours per 1 million steps? And is it also run in eager mode?

My PyTorch implementation takes ~9 hours on an RTX 3090 (which should be comparable to a V100); however, the implementation here is about as fast as my own when tested on a GTX 1080 Ti.

Slow on Atari tasks

I'm running Atari tasks on an NVIDIA 2080 and an A100, respectively, but I only get around 20 FPS, which is too slow.
That means it would take tens of days to run 10 million steps. When you run Atari tasks, how much time does each one take?
Is there a way to speed things up?

{"step": 1446778, "expl_amount": 0.09999962151050568, "model_grad_norm": 12.273571968078613, "value_grad_norm": 2023.7537841796875, "actor_grad_norm": 367.6117248535156, "prior_ent": 73.4467544555664, "post_ent": 70.40204620361328, "image_loss": 11264.0, "reward_loss": 1.7543749809265137, "pcont_loss": 0.5491357445716858, "div": 3.099745988845825, "model_loss": 11264.7998046875, "value_loss": Infinity, "actor_loss": -190.18875122070312, "action_ent": 0.013957500457763672, "fps": 20.45536337258022}

pcont when running Atari Games

Hi, may I ask whether you set pcont = True when obtaining the Atari results in the paper? Thanks for your great work!

How to tune the hyperparameters of new RL algorithms?

Hi,

I've just finished my implementation of Dreamer. May I ask you several questions that have been bothering me for a long time? When you design an agent from scratch and things unfortunately go south, how can you tell whether it is because of wrong hyperparameters or networks, or because of the idea itself? How do you find the right parameters when you design an agent like Dreamer with tens of hyperparameters? As the simplest example, how do you decide when it's time to tune the network architecture and the loss weights? Currently I only know random/grid search, but these methods quickly become frustrating for a large project like Dreamer that requires a lot of resources to train and has many hyperparameters.

I know that these questions are irrelevant to your project, but I really hope you could share some of your experience with me. Thanks in advance :-)

Free nats handling inconsistent with the TF1 repo

Hi! I've read this issue (#44) but the original tf1 dreamer repo seems to cap the per-timestep KL before averaging:
https://github.com/google-research/dreamer/blob/d517542f9777c53864ff6f2683cea88be9ea98ac/dreamer/training/utility.py#L160-L163
https://github.com/google-research/dreamer/blob/d517542f9777c53864ff6f2683cea88be9ea98ac/dreamer/training/utility.py#L240

Should that be capping after averaging instead (as done in this repo)? Or maybe I am missing something?

Thanks in advance!
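To see why the order matters, a small numeric sketch (values made up):

import numpy as np

free_nats = 3.0
kl = np.array([1.0, 2.0, 9.0])  # per-timestep KL values

cap_then_mean = np.maximum(kl, free_nats).mean()  # 5.0
mean_then_cap = max(kl.mean(), free_nats)         # 4.0

Capping per timestep also zeroes the gradient for every timestep whose KL is below the threshold, whereas capping after averaging keeps gradients for all timesteps as long as the batch average exceeds the threshold.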

Could you comment and format the code?

Hi,

First of all, amazing work. Would it be possible to properly comment and format the code to improve readability and understandability, so that, for example, newcomers like me can better follow what is going on? :)

Thank you

Cheers,

Francesco

Why is the action mean bounded before applying tanh?

I'm very interested in Dreamer and have been trying to reproduce it. However, I'm stuck on the call method of ActionDecoder. Specifically, I have a hard time understanding the following code, originally from here:

mean = self._mean_scale * tf.tanh(mean / self._mean_scale)
std = tf.nn.softplus(std + raw_init_std) + self._min_std
dist = tfd.Normal(mean, std)
dist = tfd.TransformedDistribution(dist, tools.TanhBijector())
dist = tfd.Independent(dist, 1)
dist = tools.SampleDist(dist)

I cannot see the point of the line mean = self._mean_scale * tf.tanh(mean / self._mean_scale), which seems to limit the mean to a range defined by self._mean_scale. Why do we do that when TanhBijector already limits the output of the distribution to the range [-1, 1]? Moreover, what is the point of Independent and SampleDist? If I understand them right, Independent treats the action dimensions as a single joint distribution, and SampleDist draws samples from the distribution and recomputes the corresponding statistics from those samples. But why do that when we can compute these statistics directly from the mean and std?
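For what it's worth, a hedged sketch of the two effects being asked about (built from standard tfp pieces, not the repository's exact code): the pre-tanh rescaling bounds the Normal's mean before squashing, and sampling is one way to obtain statistics that have no closed form after the squashing.

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# The rescaling keeps the Normal's mean inside (-mean_scale, mean_scale), so large
# raw network outputs cannot push the mean arbitrarily far into the flat tails of
# the final tanh squashing, where gradients vanish.
mean_scale = 5.0
raw_mean = tf.constant([0.0, 2.0, 20.0])
bounded_mean = mean_scale * tf.tanh(raw_mean / mean_scale)
print(bounded_mean.numpy())  # approximately [0., 1.9, 5.0]

# After squashing a Normal through tanh, quantities like entropy and mode have no
# closed form, which is presumably why a sampling-based wrapper (SampleDist in the
# repository) estimates them; Independent sums the per-dimension log-probs so the
# action vector is treated as a single joint distribution.
base = tfd.Normal(tf.zeros(3), 1.0)
squashed = tfd.TransformedDistribution(base, tfp.bijectors.Tanh())
joint = tfd.Independent(squashed, 1)
samples = joint.sample(1000)
entropy_estimate = -tf.reduce_mean(joint.log_prob(samples))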

Why are the random sampling ops patched?

From @xlnwel (asked in #4):

I noticed that you change the default behavior of tfd.MultivariateNormalDiag.sample, which took me a whole night to track down because my implementation yielded different results from yours. Could you please share the intention behind this code?

The code runs without producing any results or output

Hi,
When I run dreamer.py, it only prints the following output:

Prefill dataset with 5000 steps.
Train episode of length 1000 with return 38.9.
Train episode of length 1000 with return 50.6.
Train episode of length 1000 with return 46.3.
Train episode of length 1000 with return 43.8.
Train episode of length 1000 with return 44.6.
Simulating agent for 4995000 steps.
Found 5296820 model parameters.
Found 578412 actor parameters.
Found 413601 value parameters.
Start evaluation.
Training for 100 steps.

After that, the code seems to keep running without producing any further results or output. Do you know how to solve this? Thanks.

Why not use bit-depth preprocessing?

Hello.
I have a question about observation preprocessing in Dreamer.
In the PlaNet implementation, the preprocessing function reduces images to 5-bit depth and adds noise, but the preprocessing function in this repository doesn't seem to do this. Did you find that this preprocessing is unnecessary for Dreamer?

Add flag for slow target value function

In this issue, you mentioned that using a slow target network for the value function could prevent the loss from diverging. Would it be possible to add a flag to enable this option? Or could you please describe how you implemented this feature? Thanks!
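In case it helps others, a minimal sketch of the kind of slow target value network being requested (module names are made up, not the repository's API): keep a second copy of the value network, update it slowly, and use it to compute the value targets.

import tensorflow as tf

def make_value_net():
  return tf.keras.Sequential([
      tf.keras.layers.Dense(64, activation='elu'),
      tf.keras.layers.Dense(1)])

value = make_value_net()
slow_value = make_value_net()
feat = tf.zeros([1, 32])
value(feat), slow_value(feat)  # build both so their variables exist

def update_slow_target(mix=0.01):
  # Polyak averaging; a hard copy every N gradient steps is the other common variant.
  for s, v in zip(slow_value.variables, value.variables):
    s.assign(mix * v + (1 - mix) * s)

# The bootstrap values inside the lambda-returns would then come from slow_value,
# and update_slow_target() is called after every value-function gradient step.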

Why is the actor-critic loss weighted by the accumulated discount?

Hi Danijar. I have a question about the actor-critic loss.

In lines 192 and 198, the loss is given by lambda returns weighted by the accumulated discount. The paper says that 'terms are weighted down based on how likely the imagined trajectory would have ended'. I am confused by this intuition. I know that in a terminal state the value of the next state should be weighted by 0, as in DQN. But this behavior is already part of the lambda-return computation, since the Bellman backups multiply by pcont at every step. The weighted loss is also used in the non-episodic DMC tasks, which have no terminal states. So why do we need this additional weight?

Could it be viewed as a variance reduction technique? The imagined trajectory accumulates error over time, so it may be better to put a smaller weight on states at later time steps, since they are less reliable.
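For illustration, a small sketch of how such weights are typically formed from the predicted continuation probabilities (the exact indexing and stop-gradient placement in the repository may differ):

import tensorflow as tf

pcont = tf.constant([[0.99, 0.99, 0.5, 0.1, 0.1]])  # imagined continuation probs, [batch, horizon]

# The weight for step t is the product of pcont over all previous imagined steps,
# i.e. the probability that the imagined trajectory is still "alive" at step t
# (implementations often wrap this in tf.stop_gradient).
weights = tf.math.cumprod(
    tf.concat([tf.ones_like(pcont[:, :1]), pcont[:, :-1]], 1), axis=1)
print(weights.numpy())  # approximately [[1. 0.99 0.9801 0.49 0.049]]

If pcont is a constant equal to the discount factor, as in the non-episodic DMC setup, these weights reduce to plain gamma^t discounting of later imagined steps, which would explain why they also appear there.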

Sudden performance drop when value_grad_norm returns Inf

Hi, when I run Dreamer on the finger spin task, I encounter a pretty strange problem. The algorithm was working well until value_grad_norm returned Inf and the performance suddenly dropped. I know that you are using LossScaleOptimizer in the codebase, so it should not apply the gradients when the gradient norm is infinite.
Do you know why this happened? And do you have any suggestions for fixing this problem? Thank you!
(screenshot of the training curves omitted)

Running Atari games seems to crash

I'm not sure how to get Dreamer to work for Atari. Running the following command

python dreamer.py --task atari_BreakOut

results in a crash with the trace below. The environment seems to output an unexpected array named orientations. Any ideas what the fix might be? And what is the recommended way to install Atari?

Simulating agent for 4994812 steps.
Traceback (most recent call last):
  File "dreamer.py", line 468, in <module>
    main(parser.parse_args())
  File "dreamer.py", line 440, in main
    agent = Dreamer(config, datadir, actspace, writer)
  File "dreamer.py", line 108, in __init__
    self._build_model()
  File "dreamer.py", line 244, in _build_model
    self.train(next(self._dataset))
  File "/home/miniconda3_march2020/envs/dreamer/lib/python3.7/site-packages/tensorflow/python/distribute/input_lib.py", line 296, in __next__
    return self.get_next()
  File "/home//miniconda3_march2020/envs/dreamer/lib/python3.7/site-packages/tensorflow/python/distribute/input_lib.py", line 328, in get_next
    global_has_value, replicas = _get_next_as_optional(self, self._strategy)
  File "/home/miniconda3_march2020/envs/dreamer/lib/python3.7/site-packages/tensorflow/python/distribute/input_lib.py", line 192, in _get_next_as_optional
Skipped short episode of length 0.
    iterator._iterators[i].get_next_as_list(new_name))  # pylint: disable=protected-access
  File "/home/miniconda3_march2020/envs/dreamer/lib/python3.7/site-packages/tensorflow/python/distribute/input_lib.py", line 1132, in get_next_as_list
    data_list = self._iterator.get_next_as_optional()
  File "/home/miniconda3_march2020/envs/dreamer/lib/python3.7/site-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py", line 601, in get_next_as_optional
    iterator_ops.get_next_as_optional(self._device_iterators[i]))
  File "/home/miniconda3_march2020/envs/dreamer/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 833, in get_next_as_optional
Skipped short episode of length 0.
    iterator.element_spec)), iterator.element_spec)
  File "/home/miniconda3_march2020/envs/dreamer/lib/python3.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2444, in iterator_get_next_as_optional
    _ops.raise_from_not_ok_status(e, name)
  File "/home/miniconda3_march2020/envs/dreamer/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6653, in raise_from_not_ok_status
Skipped short episode of length 0.
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
Skipped short episode of length 0.
tensorflow.python.framework.errors_impl.InvalidArgumentError: TypeError: `generator` yielded an element that did not match the expected structure. The expected structure was {'image': dtype('uint8'), 'action': dtype('float16'), 'reward': dtype('float16'), 'discount': dtype('float16')}, but the yielded element was {'orientations': array([[-1.8042e-01, -9.8340e-01,  1.3806e-01, -9.9023e-01,  1.2341e-01,
        -9.9219e-01,  7.9688e-01, -6.0449e-01, -6.5283e-01, -7.5781e-01,
         9.2041e-01, -3.9136e-01,  9.9902e-01,  4.3976e-02],
       [-1.4453e-01, -9.8975e-01, -3.6102e-02, -9.9951e-01,  2.0496e-01,
        -9.7900e-01,  8.4912e-01, -5.2832e-01, -

....

.0437 , 0.04324, 0.042  , 0.04102, 0.04037, 0.04968, 0.118  ,
       0.0898 , 0.07574, 0.087  , 0.0953 , 0.10486, 0.138  , 0.1825 ,
       0.1633 , 0.1134 , 0.04486, 0.04956, 0.0498 , 0.04617, 0.03845,
       0.03317], dtype=float16), 'discount': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
      dtype=float16)}.
Traceback (most recent call last):

  File "/home/miniconda3_march2020/envs/dreamer/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 791, in generator_py_func
    flattened_values = nest.flatten_up_to(output_types, values)

  File "/home/miniconda3_march2020/envs/dreamer/lib/python3.7/site-packages/tensorflow/python/data/util/nest.py", line 396, in flatten_up_to
    assert_shallow_structure(shallow_tree, input_tree)

  File "/home/miniconda3_march2020/envs/dreamer/lib/python3.7/site-packages/tensorflow/python/data/util/nest.py", line 311, in assert_shallow_structure
    % (len(input_tree), len(shallow_tree)))

ValueError: The two structures don't have the same sequence length. Input structure has length 7, while shallow structure has length 4.


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

  File "/home/miniconda3_march2020/envs/dreamer/lib/python3.7/site-packages/tensorflow/python/ops/script_ops.py", line 243, in __call__
    ret = func(*args)

  File "/home/miniconda3_march2020/envs/dreamer/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py", line 309, in wrapper
    return func(*args, **kwargs)

  File "/home/miniconda3_march2020/envs/dreamer/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 796, in generator_py_func
    "element was %s." % (output_types, values)), sys.exc_info()[2])

  File "/home/.local/lib/python3.7/site-packages/six.py", line 702, in reraise
    raise value.with_traceback(tb)

  File "/home/miniconda3_march2020/envs/dreamer/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 791, in generator_py_func
    flattened_values = nest.flatten_up_to(output_types, values)

  File "/home/miniconda3_march2020/envs/dreamer/lib/python3.7/site-packages/tensorflow/python/data/util/nest.py", line 396, in flatten_up_to
    assert_shallow_structure(shallow_tree, input_tree)

  File "/home/miniconda3_march2020/envs/dreamer/lib/python3.7/site-packages/tensorflow/python/data/util/nest.py", line 311, in assert_shallow_structure
    % (len(input_tree), len(shallow_tree)))

TypeError: `generator` yielded an element that did not match the expected structure. The expected structure was {'image': dtype('uint8'), 'action': dtype('float16'), 'reward': dtype('float16'), 'discount': dtype('float16')}, but the yielded element was {'orientations': array([[-1.8042e-01, -9.8340e-01,  1.3806e-01, -9.9023e-01,  1.2341e-01,
        -9.9219e-01,  7.9688e-01, -6.0449e-01, -6.5283e-01, -7.5781e-01,
         9.2041e-01, -3.9136e-01,  9.9902e-01,  4.3976e-02],
       [-1.4453e-01, -9.8975e-01, -3.6102e-02, -9.9951e-01,  2.0496e-01,
        -9.7900e-01,  8.4912e-01, -5.2832e-01, -7.6025e-01, -6.4990e-01,
         8.6963e-01, -4.9341e-01,  9.9805e-01,  6.0699e-02],
       [ 1.7383e-01, -9.8486e-01, -4.8145e-01, -8.7646e-01,  3.6279e-01

...

33 , 0.1134 , 0.04486, 0.04956, 0.0498 , 0.04617, 0.03845,
       0.03317], dtype=float16), 'discount': array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
      dtype=float16)}.


   [[{{node PyFunc}}]]
   [[MultiDeviceIteratorGetNextFromShard]]
   [[RemoteCall]] [Op:IteratorGetNextAsOptional]Skipped short episode of length 0.

Skipped short episode of length 0.

Spikes in Loss?

Hi everyone,
I reimplemented Dreamer in PyTorch. When running it, I see occasional loss spikes. I was wondering whether that is something you also experience or whether it is a problem with my implementation:

50k batches / 500k env steps:

(KL divergence plot omitted)

Is there a shift in the action sequence?

Hi @danijar ,

Thanks for sharing your great work. I have a question about the action sequence. In the paper, the posterior distribution of s_t is defined as p(s_t | s_{t-1}, a_{t-1}, o_t). But in the observe function of the RSSM, action and embed are passed to obs_step with the same index, so we may actually be computing p(s_t | s_{t-1}, a_t, o_t). I am confused by this shift and am afraid I am misreading the code. Please let me know if I made any mistakes.

Best regards,

Is RSSM trained by forward prediction or reconstruction?

Hi Danijar,

Thanks for releasing this tf2 implementation!

I was curious about how to reconcile the lines here:

dreamer/dreamer.py, lines 159 to 163 at commit 1e38b1d:

embed = self._encode(data)
post, prior = self._dynamics.observe(embed, data['action'])
feat = self._dynamics.get_feat(post)
image_pred = self._decode(feat)
reward_pred = self._reward(feat)

with this figure from the paper:

(figure omitted)

In particular, the figure shows that the reconstructed observation (o_hat_1) is produced from the RSSM state obtained by encoding the true image (o_1). However, the code above seems to suggest that the reconstruction is created by doing a single step of the RSSM?

Can you clarify how to reconcile these two things?

Thanks,

A question about reward and observation pairing in wrapper

Thank you for providing this great work!

I have a question about the reward and observation pairing in the wrapper. As I understand it, the transition in the step function of the Collect class at line 164 of wrappers.py corresponds to obs_t, reward_{t-1}, action_{t-1}, since line 162, obs, reward, done, info = self._env.step(action), shows that the action and reward share the same index t while obs is at t+1. When this transition is used to update the model in the _train function of the Dreamer class, reward_{t-1} becomes the target of reward_pred, which is computed from state_t, rooted in obs_t, state_{t-1}, and action_{t-1}.

But according to the definition of the reward model q(r_t | s_t), the target of reward_pred should be reward_t, so the transition should store obs_t, reward_t, and action_{t-1}. Am I wrong in reading the stored reward as reward_{t-1}?
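To make the pairing question concrete, a toy sketch (not the repository's Collect wrapper) where the reward is constructed to equal the step index, so the stored alignment can be inspected directly:

class CountingEnv:
  # Tiny stand-in env: the observation counts steps and the reward equals the step index.
  def reset(self):
    self.t = 0
    return self.t
  def step(self, action):
    self.t += 1
    return self.t, float(self.t), self.t >= 3, {}

env = CountingEnv()
episode = {'obs': [env.reset()], 'action': [0], 'reward': [0.0]}  # padded first entry
done = False
while not done:
  action = 1  # dummy policy
  obs, reward, done, _ = env.step(action)
  episode['obs'].append(obs)        # o_t
  episode['action'].append(action)  # the action that produced o_t, i.e. a_{t-1}
  episode['reward'].append(reward)  # the reward received upon reaching o_t
print(episode)
# {'obs': [0, 1, 2, 3], 'action': [0, 1, 1, 1], 'reward': [0.0, 1.0, 2.0, 3.0]}

With this layout, reward[t] is the reward received when entering obs[t], which is what a reward model q(r_t | s_t) with s_t derived from o_t would be trained against; whether a given wrapper follows this convention or shifts the reward by one index is exactly the question above.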

Relaxed categorical for gradients through discrete actions

Hi, @danijar

For discrete actions, have you considered Gumbel-Softmax, a reparameterization trick for categorical distributions? Here's my code in case you haven't and would like to give it a try. On the other hand, I would be very happy to hear the results if you have already run similar experiments --- I simply don't have enough resources to train on Atari games.

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
EPSILON = 1e-8  # small constant to avoid log(0); exact value assumed


class Distribution:
    """Minimal stand-in for the author's own base class (not shown in the snippet)."""


class Categorical(Distribution):
    def __init__(self, logits, tau=1):
        self.logits = logits
        self.tau = tau  # temperature of the Gumbel-Softmax

    def log_prob(self, x):
        return -self.neg_log_prob(x)

    def neg_log_prob(self, x):
        if x.shape.ndims == len(self.logits.shape) and x.shape[-1] == self.logits.shape[-1]:
            # when x is one-hot encoded
            return tf.nn.softmax_cross_entropy_with_logits(labels=x, logits=self.logits)
        else:
            return tf.nn.sparse_softmax_cross_entropy_with_logits(labels=x, logits=self.logits)

    def sample(self, reparameterize=True, hard=True, one_hot=True):
        """
         A differentiable sampling method for categorical distribution
         reference paper: Categorical Reparameterization with Gumbel-Softmax
         original code: https://github.com/ericjang/gumbel-softmax/blob/master/Categorical%20VAE.ipynb
        """
        if reparameterize:
            g = tfd.Gumbel(0., 1.).sample(tf.shape(self.logits))
            # Draw a sample from the Gumbel-Softmax distribution
            y = tf.nn.softmax((self.logits + g) / self.tau)
            # draw one-hot encoded sample from the softmax
            if not one_hot:
                y = tf.cast(tf.argmax(y, -1), tf.int32)
            elif hard:
                y_hard = tf.one_hot(tf.argmax(y, -1), self.logits.shape[-1])
                y = tf.stop_gradient(y_hard - y) + y
            assert y.shape.ndims == len(self.logits.shape)
        else:
            y = tfd.Categorical(self.logits).sample()
            assert y.shape.ndims == len(self.logits.shape) - 1
            if one_hot:
                y = tf.one_hot(y, self.logits.shape[-1])
                assert y.shape.ndims == len(self.logits.shape)
                assert y.shape[-1] == self.logits.shape[-1]

        return y

    def entropy(self):
        probs = tf.nn.softmax(self.logits)
        log_probs = tf.math.log(probs + EPSILON)
        entropy = tf.reduce_sum(-probs * log_probs, axis=-1)

        return entropy

    def kl(self, other):
        probs = tf.nn.softmax(self.logits)
        log_probs = tf.math.log(probs)
        other_log_probs = tf.nn.log_softmax(other.logits)
        kl = tf.reduce_sum(probs * (log_probs - other_log_probs), axis=-1)

        return kl

    def mode(self):
        return tf.argmax(self.logits, -1)
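A short usage sketch of the class above (assuming the imports added at the top of the snippet), showing that gradients reach the logits despite the discrete, one-hot sample:

import tensorflow as tf

logits = tf.constant([[2.0, 0.5, -1.0]])
dist = Categorical(logits, tau=1.0)

with tf.GradientTape() as tape:
  tape.watch(logits)
  action = dist.sample(reparameterize=True, hard=True)  # straight-through one-hot
  loss = -tf.reduce_sum(action * logits)  # dummy objective

print(tape.gradient(loss, logits))  # non-zero gradient w.r.t. the logits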

Why are the losses sometimes reported to be infinity?

Hi,

When running dreamer.py with the dm_control reacher environments, we are finding that the image_loss and model_loss are consistently logged as infs. Are there changes that need to be made for this particular environment to avoid this issue?

(screenshot of the outputs when running with debug parameters omitted)

Thanks!

Isolating environments into threads or processes

Hi, I am trying to run your code with:

python dreamer.py --logdir ./logdir/dmc_walker_walk/dreamer/ --task dmc_walker_walk --parallel process --envs 10

However, I get this error:
Traceback (most recent call last):
  File "dreamer.py", line 468, in <module>
    main(parser.parse_args())
  File "dreamer.py", line 434, in main
    tools.simulate(random_agent, train_envs, prefill / config.action_repeat)
  File "/home/zmh/mbmfec/dreamer/tools.py", line 125, in simulate
    obs, _, done = zip(*[p()[:3] for p in promises])
  File "/home/zmh/mbmfec/dreamer/tools.py", line 125, in <listcomp>
    obs, _, done = zip(*[p()[:3] for p in promises])
  File "/home/zmh/mbmfec/dreamer/wrappers.py", line 434, in _receive
    message, payload = self._conn.recv()
  File "/home/zmh/anaconda3/envs/dreamer/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/zmh/anaconda3/envs/dreamer/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/zmh/anaconda3/envs/dreamer/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

Even when I set --envs 1, I get the same problem.
I am using Python 3.7 and TensorFlow 2.1. Thanks for taking the time to look at this!

Process parallel tf scalar summary hangs

When I use parallel='process', the tf.summary.scalar call in the summarize_episode function doesn't return. If I use 'thread' or 'none' the tf.summary.scalar function finishes normally. I have no idea why. Has anyone encountered this issue already and found a fix?

Memory leak when calling policy function?

Hi,

I run into a memory leak when I call this function over and over again.

action, state = agent.policy(obs, state, training)

Is there something that needs to be cleared? I tried the snippet that resets the state (I'm assuming that's what it does; I'm not sure), namely this one:

if state is not None and reset.any():
      mask = tf.cast(1 - reset, self._float)[:, None]
      state = tf.nest.map_structure(lambda x: x * mask, state)

My guess is that it comes from this function:

def preprocess(obs, config):
  dtype = prec.global_policy().compute_dtype
  obs = obs.copy()
  with tf.device('cpu:0'):
    obs['image'] = tf.cast(obs['image'], dtype) / 255.0 - 0.5
    clip_rewards = dict(none=lambda x: x, tanh=tf.tanh)[config.clip_rewards]
    obs['reward'] = clip_rewards(obs['reward'])
  return obs

But I'm not familiar with tensorflow 2.X and I couldn't fix it.

A quick way to reproduce the issue is to modify the dreamer code as shown below and to run HTOP to monitor the RAM.

def main(config):
  if config.gpu_growth:
    for gpu in tf.config.experimental.list_physical_devices('GPU'):
      tf.config.experimental.set_memory_growth(gpu, True)
  assert config.precision in (16, 32), config.precision
  if config.precision == 16:
    prec.set_policy(prec.Policy('mixed_float16'))
  config.steps = int(config.steps)
  config.logdir.mkdir(parents=True, exist_ok=True)
  print('Logdir', config.logdir)

  # Create environments.
  datadir = config.logdir / 'episodes'
  writer = tf.summary.create_file_writer(
      str(config.logdir), max_queue=1000, flush_millis=20000)
  writer.set_as_default()
  train_envs = [wrappers.Async(lambda: make_env(
      config, writer, 'train', datadir, store=True), config.parallel)
      for _ in range(config.envs)]
  test_envs = [wrappers.Async(lambda: make_env(
      config, writer, 'test', datadir, store=False), config.parallel)
      for _ in range(config.envs)]
  actspace = train_envs[0].action_space

  # Prefill dataset with random episodes.
  step = count_steps(datadir, config)
  prefill = max(0, config.prefill - step)
  print(f'Prefill dataset with {prefill} steps.')
  random_agent = lambda o, d, _: ([actspace.sample() for _ in d], None)
  tools.simulate(random_agent, train_envs, prefill / config.action_repeat)
  writer.flush()

  # Train and regularly evaluate the agent.
  step = count_steps(datadir, config)
  print(f'Simulating agent for {config.steps-step} steps.')
  agent = Dreamer(config, datadir, actspace, writer)
  if (config.logdir / 'variables.pkl').exists():
    print('Load checkpoint.')
    agent.load(config.logdir / 'variables.pkl')
  
  import os
  state = None
  training = True
  files = os.listdir(str(datadir))
  keys = ['image', 'reward']
  for i in range(len(files)):
    print(i)
    episode = np.load(str(datadir) + '/' + files[i])
    episode = {k: episode[k] for k in episode.keys()}
    state = None
    for t in range(500):
      obs = {k: [episode[k][t]] for k in keys}
      action, state = agent.policy(obs, state, training)
  for env in train_envs + test_envs:
    env.close()

Please note that this behavior also happens if you run the call function instead of the policy one.

This happened with both Python 2.7 and Python 3.8 on an Ubuntu 18.04 system using TensorFlow 2.1.0.

Thanks in advance,

Regards,

Antoine
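Not a fix, but one way to narrow the leak down is to track Python-side allocations around the repeated agent.policy calls with tracemalloc; a common culprit in TF2 code is a tf.function being retraced on every call because the Python inputs keep changing shape or type. A sketch that replaces the inner loop of the script above:

import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

state = None
for t in range(500):
  obs = {k: [episode[k][t]] for k in keys}
  action, state = agent.policy(obs, state, training)

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, 'lineno')[:10]:
  print(stat)  # the top entries point at where the growing allocations originate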

Can't reproduce results in some environments

I'm having trouble reproducing the results you achieved (in the paper and in scores/dreamer.json) for some environments, using the code in this repository. cartpole_swingup_sparse is one of the environments that most clearly doesn't reproduce, so I'll show examples of it here. I'm just running 1M steps as this environment approximately plateaus in your scores in this time frame.

In dreamer.json, the scores for this env look like:

(score curve from dreamer.json omitted)

And in the paper, the scores look like this (interestingly, noticeably different from dreamer.json):

(score curve from the paper omitted)

Both reach 700-800 by 1M steps.

I'm running this command: rm -rf logdir/test && CUDA_VISIBLE_DEVICES=0 python3 dreamer.py --logdir ./logdir/test --task dmc_cartpole_swingup_sparse --steps 1e6.

First, I tried it with modern versions of all the packages (easier to set up than old packages). Tensorflow 2.5.0, CUDA 11.3, dm-control 0.0.425341097, A100 GPU, etc. I changed one line of code from experimental_run_v2 to run, but otherwise the code ran fine.

Then, I figured that maybe the reproduction issues were due to differences in package versions, so I tried to create an environment as close as possible to what you might have used (with a clean checkout of this repository):

  • python 3.8.10
  • tensorflow 2.2.0
  • tf probability 0.10.0
    • cloudpickle 1.3 required to get this working
  • dm_control commit 5701a00df44b84197692199794ad15cd1d2d0d55 from Jan 27, 2020
  • mujoco 2.0.0
  • gym 0.17.1 from March 2020
  • PIL 7.1.0 from April 2020
  • numpy==1.19.5
  • cuda 10.1
  • cudnn 7.6.5
  • one V100 GPU

However, I'm still not able to get results that are close to yours. Here are some TensorBoard graphs. Dark blue is from the "modern" environment, and red and light blue are from the env that is hopefully close to yours. I've run other replicates and they look similar to these; none look anything like the ones in dreamer.json.

(TensorBoard score curves omitted)

Let me know what I might be doing wrong here! I'm happy to provide the full log files. If you have a list of exactly the packages you used, I can try running again with those. I've been trying to release a clean PyTorch port of this repository (none of the existing ones really work too well), and my implementation seems to match the results of this code but is also unable to match the results in dreamer.json.
