zacwellmer / worldmodels

World Models with TensorFlow 2

License: MIT License

Python 17.89% Shell 0.25% Jupyter Notebook 81.86%

worldmodels's Issues

Dockerfile.wm build not successful

After running:
docker container run -p 8888:8888 --gpus '"device=0"' --detach -it --name wm wm:1.0
It did not build successfully and I got an error starting with:
Building wheel for grpcio (setup.py): finished with status 'error'
And
Could not find <Python.h>
It built successfully only after I moved the package-install lines (lines 92 to 112 of Dockerfile.wm) up to line 42:
RUN apt install -y python3-dev git wget libopenmpi-dev xvfb python-opengl fontconfig cmake gcc unzip zlib1g-dev libjpeg-dev libsdl2-dev libboost-all-dev gdb
This is for anyone who has the same issue.
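The "Could not find <Python.h>" message means the CPython development headers were not yet installed when pip tried to compile grpcio, which is why installing python3-dev earlier helps. A minimal sketch for checking this from inside the image; the helper names are hypothetical and not part of the repo:

```python
import os
import sysconfig


def python_header_path():
    """Return the path where this interpreter expects Python.h to live."""
    include_dir = sysconfig.get_paths()["include"]
    return os.path.join(include_dir, "Python.h")


def has_python_headers():
    """True if Python.h exists, so C extensions such as grpcio can compile."""
    return os.path.isfile(python_header_path())


if __name__ == "__main__":
    path = python_header_path()
    if has_python_headers():
        print("found", path)
    else:
        # On Debian/Ubuntu images the headers come from the python3-dev package.
        print("missing", path, "- install python3-dev before running pip")
```

If the script reports the header as missing at the point where pip runs, the python3-dev install has to come before that step in the Dockerfile.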

Log files do not change after training

I am part of a group working on a project based on your World Models repo (which the World Models authors point to).
My group has been having trouble collecting and visualizing results. We are simply trying to train the Car Racing model and reproduce results similar to the paper. However, after three epochs of training, the log files in the results folder still appear to contain your results rather than the results of our training run.

Could you please provide any advice?
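One way to narrow this down is to check whether training writes anything into the results folder at all, or writes somewhere else. A minimal sketch, assuming the logs live in a folder named "results" as described above; the function name is hypothetical:

```python
import os


def files_modified_since(folder, start_time):
    """Return relative paths of files under `folder` modified at or after `start_time`."""
    changed = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            # Compare each file's last-modified time against the reference time.
            if os.path.getmtime(path) >= start_time:
                changed.append(os.path.relpath(path, folder))
    return sorted(changed)
```

Record time.time() just before launching training and pass it as start_time afterwards. An empty list would suggest the run is logging to a different directory, or inside a container whose files are not visible on the host.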

set_random_params

Both the VAE and RNN have a method named set_random_params, but they remain unused. Is that leftover code or did it slip through the cracks?

Issue while building the docker image

Hello,

Thanks for the repo. I am facing an error while building the Docker image:

Step 23/46 : RUN python3 -m pip install --no-cache-dir ${TF_PACKAGE}${TF_PACKAGE_VERSION:+==${TF_PACKAGE_VERSION}}
 ---> Running in ebae39a5f72c
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.5/dist-packages/pip/__main__.py", line 21, in <module>
    from pip._internal.cli.main import main as _main
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/cli/main.py", line 60
    sys.stderr.write(f"ERROR: {exc}")
                                   ^
SyntaxError: invalid syntax

Could you please suggest a fix for this?

Thanks
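The traceback shows pip itself failing to import under Python 3.5 because its source contains an f-string, which is Python 3.6+ syntax. In other words, the installed pip is newer than the interpreter can parse. The sketch below encodes that rule; the helper names are hypothetical, and the "pip<21" bound rests on the assumption that pip 21.0 was the release that dropped Python 3.5:

```python
import sys


def fstrings_supported(version_info=sys.version_info):
    """f-string literals are part of the grammar from Python 3.6 onward."""
    return tuple(version_info[:2]) >= (3, 6)


def pip_requirement_for(version_info=sys.version_info):
    """Return a pip requirement string this interpreter should be able to import.

    Assumption: pip 21.0 dropped Python 3.5, so older interpreters need pip<21.
    """
    if fstrings_supported(version_info):
        return "pip"
    return "pip<21"
```

For the Python 3.5 interpreter in the traceback, pip_requirement_for((3, 5)) returns "pip<21". A plausible fix is therefore to pin pip to that range wherever the Dockerfile upgrades pip, or to move the base image to a newer Python.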

Lack of documentation for parameter "rnn_r_pred"

Hi Zac,

I am guessing that one reason the original paper did not train for the car racing challenge in the dream is because the reward cannot be easily calculated outside the real environment. In the Doom environment, the reward can just be increased for each frame the bot is alive in the dream.

I see a parameter "rnn_r_pred" that seems to indicate that the MDNRNN can be trained to predict the reward. Can this reward prediction be used to train the controller in the dream in the car racing environment?

Cheers!

Incorrect array indexes/sizes in the controller's trainer

Hi Zac,

There is a bug, possibly a list of cascading bugs, in the controller training script. Specifically, if controller_num_test_episode is greater than controller_num_episode, the following error occurs (in CarRacing):

Track generation: 1180..1479 -> 299-tiles track
Track generation: 1184..1484 -> 300-tiles track
Track generation: 1016..1274 -> 258-tiles track
Traceback (most recent call last):
  File "train.py", line 451, in <module>
    main(args)
  File "train.py", line 422, in main
    slave()
  File "train.py", line 193, in slave
    result_packet = encode_result_packet(results)
  File "train.py", line 137, in encode_result_packet
    r = np.concatenate([r, np.zeros(RESULT_PACKET_SIZE - eval_packet_size)-1.0], axis=0)
ValueError: negative dimensions are not allowed

The error can be reproduced by checking out the latest main branch of the repo, altering the CarRacing config file as shown below, and then running only the trainer (no need to re-train the VAE or RNN):

export CONFIG_PATH=configs/carracing.config
CUDA_VISIBLE_DEVICES=-1 xvfb-run -a -s "-screen 0 1400x900x24 +extension RANDR" -- nice python train.py -c $CONFIG_PATH

Controller part of the config file:

controller_optimizer=cma
controller_num_episode=2
controller_num_test_episode=3
controller_eval_steps=4
controller_num_worker=10
controller_num_worker_trial=1
controller_antithetic=0
controller_cap_time=0
controller_retrain=0
controller_seed_start=0
controller_sigma_init=0.1
controller_sigma_decay=0.999
controller_batch_mode=mean

The evaluation results read from the workers could also be affected by this (see train.py at lines 219-220) because the orchestrator process is (over)reading num_episode items from the results, whereas there could only be num_test_episode items to read:

      reward_list_total[idx, :num_episode] = result[2]
      reward_list_total[idx, num_episode:] = result[3]

This could skew the reward mean for a particular batch and affect training performance and model accuracy. It should affect the Doom experiment as well, although I haven't tested it. A quick workaround is to set both controller_num_test_episode and controller_num_episode to the same value, but it is not ideal. I wonder if fixing this bug would get you closer to the results of the original paper.
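The traceback comes from np.zeros receiving a negative length when the result vector is already longer than the packet. A minimal sketch of a guarded padding step is shown below. It is not the repo's code: pad_result_packet and its parameters are hypothetical names, and how RESULT_PACKET_SIZE is derived in train.py is not assumed here.

```python
import numpy as np


def pad_result_packet(values, packet_size, fill=-1.0):
    """Pad a 1-D result vector to `packet_size` entries with `fill`.

    Raises a descriptive error instead of letting NumPy receive a
    negative length when the vector is already longer than the packet.
    """
    values = np.asarray(values, dtype=float)
    missing = packet_size - values.size
    if missing < 0:
        raise ValueError(
            "result has %d entries but the packet holds only %d"
            % (values.size, packet_size)
        )
    # np.full(0, fill) is an empty array, so an exact fit is returned unchanged.
    return np.concatenate([values, np.full(missing, fill)], axis=0)
```

With a check like this the failure names the mismatched sizes directly, which points at the packet size needing to account for the larger of the two episode counts.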

Issue for training VAE

Hello,
Thank you for the repo. I am having trouble training the VAE (bash launch_scripts/carracing.bash) for CarRacing-v0.

The loss did not decrease.
print(batch_z[0]) returns tf.Tensor([nan,nan,...,nan],shape=(32,),dtype=float32) in visualization.ipynb.

I am using your Docker environment on my local Ubuntu PC.
Could you please tell me how to train correctly?

I look forward to hearing from you.
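An all-NaN latent vector usually means the weights diverged at some point during training, so a useful first step is to find where the loss first stopped being finite. A minimal sketch, assuming the per-step losses have been collected into a plain list; the function name is hypothetical:

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(float(loss)):
            return step
    return None
```

If the very first step is already non-finite, the input data or its normalization is the likelier culprit. A later step points more towards the learning rate or the loss computation.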

xvfb-run: command not found

When I ran the bash file for car racing you mention, I got this error for every iteration. What is xvfb-run?

worker X
launch_scripts/carracing.bash: line 5: xvfb-run: command not found
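xvfb-run is a wrapper script that starts Xvfb, a virtual framebuffer X server, and runs a command inside it so that the environment can render without a physical display. On Debian and Ubuntu it ships in the xvfb package, which appears in the apt install line quoted in the first issue above. The error therefore suggests the script was run on a host, or in an image, where that package is not installed. A minimal sketch for checking this before launching; the function name is hypothetical:

```python
import shutil


def missing_tools(names):
    """Return the subset of command names that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(["xvfb-run"])
    if missing:
        print("not on PATH:", ", ".join(missing))
```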

Dropout and LSTM

Hi Zac,

It looks like the dropout features in the LSTM layer are not used at all. According to the comments in the code, that might not be intended:
rnn_out, h, c = rnn.inference_base(input_x, initial_state=states, training=training) # set training True to use Dropout
For this to work, the dropout and/or recurrent_dropout parameters need to be specified when the LSTM layer is created in rnn/rnn.py on line 28:
self.inference_base = tf.keras.layers.LSTM(units=args.rnn_size, return_sequences=True, return_state=True, time_major=False, dropout=args.rnn_dropout, recurrent_dropout=args.rnn_rec_dropout)
with both args.rnn_dropout and args.rnn_rec_dropout around 0.4 as a starting point.

If dropout is not required, then the LSTM layer could be replaced with keras.layers.CuDNNLSTM, which is much faster to train on a GPU. The full requirements for replacing the pure TensorFlow LSTM implementation with the cuDNN implementation are listed towards the top of this page.

Dropouts or speed, which will you choose?
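In TensorFlow 2 the choice may be less stark than it looks. As far as I understand the documentation, tf.keras.layers.LSTM selects its cuDNN kernel automatically when certain layer arguments keep their defaults, and a standalone CuDNNLSTM class exists only under tf.compat.v1.keras.layers. The sketch below mirrors the documented layer-argument conditions in plain Python. It is an illustration with a hypothetical function name, not TensorFlow's own check, and it leaves out the input-masking and eager-execution conditions.

```python
def uses_cudnn_kernel(activation="tanh", recurrent_activation="sigmoid",
                      recurrent_dropout=0.0, unroll=False, use_bias=True):
    """Approximate the layer-argument conditions for the cuDNN LSTM kernel.

    Not covered: input masking/padding and eager-execution requirements.
    """
    return (
        activation == "tanh"
        and recurrent_activation == "sigmoid"
        and recurrent_dropout == 0
        and not unroll
        and use_bias
    )
```

Input dropout is not among these conditions, so under this reading setting dropout alone would keep the fast path, while any non-zero recurrent_dropout would fall back to the generic implementation.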

Docker Image Build Not Successful

Using macOS Monterey (12.3.1) M1 Pro

Running the following command inside the cloned repo directory on my local machine:

docker image build -t wm:1.0 -f docker/Dockerfile.wm .

results in the following error message:

I fixed this by adding the flag < --allow-unauthenticated >
on line 21 of Dockerfile.wm, after the < --no-install-recommends > flag (following the -y flag).

In addition, I ran into problems with the command:

< RUN python3 -m pip install --no-cache-dir ${TF_PACKAGE}${TF_PACKAGE_VERSION:+==${TF_PACKAGE_VERSION}} >

The error prompt suggests I might need to run:

< apt-get install -y python3-dev >

on Ubuntu systems.

This line is actually located towards the bottom (line ~94), but moving it up before the TensorFlow download line seems to do the trick.

After moving the python3-dev install above the TF download, the command line hangs for a long time, but it should eventually succeed.

Also, the < apt -y update > command on line 93 failed for me because the package manager was missing a public key for the NVIDIA download (super technical, I know, but I've had a long day, forgive me), giving the following error:

< the following signatures couldn't be verified because the public key is not available >

I fixed this by prepending the "install packages" section with a line to update the package manager with the missing public key given by the error message, as per this resource here:

https://chrisjean.com/fix-apt-get-update-the-following-signatures-couldnt-be-verified-because-the-public-key-is-not-available/

This also took a long while. But in the end, the building of this docker image completed successfully!

Now to run the programs themselves...
