worldmodels's People

Contributors

zacwellmer

worldmodels's Issues

Dockerfile.wm build not successful

After running:
docker container run -p 8888:8888 --gpus '"device=0"' --detach -it --name wm wm:1.0
It did not build successfully and I got an error starting with:
Building wheel for grpcio (setup.py): finished with status 'error'
And
Could not find <Python.h>
It built successfully only after moving the install lines between 92 and 112 to line 42:
RUN apt install -y python3-dev git wget libopenmpi-dev xvfb python-opengl fontconfig cmake gcc unzip zlib1g-dev libjpeg-dev libsdl2-dev libboost-all-dev gdb
This is for anyone who has the same issue.
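
The "Could not find <Python.h>" message usually means the CPython development headers (python3-dev on Ubuntu) are not installed when pip tries to compile a C extension such as grpcio. A minimal sketch for checking this from inside the image, using only the standard library; the function names are made up for illustration:

```python
import os
import sysconfig


def python_header_path():
    """Return the path where this interpreter expects Python.h to live."""
    include_dir = sysconfig.get_paths()["include"]
    return os.path.join(include_dir, "Python.h")


def has_python_headers():
    """True if the C header needed to build extension modules is present."""
    return os.path.isfile(python_header_path())


if __name__ == "__main__":
    status = "found" if has_python_headers() else "missing (install python3-dev)"
    print(python_header_path(), "->", status)
```

If this reports the header as missing at the point where the wheel is built, the apt install line needs to run earlier in the Dockerfile, which matches the fix described above.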

Log files do not change after training

I am part of a group working on a project based on your World Models repo (as pointed to by the World Models authors).
My group has been having trouble collecting and visualizing results. We are simply trying to train the Car Racing model and reproduce results similar to the paper. However, after three epochs of training, the log files in the results folder still appear to be your results, not the results from our training process.

Could you please provide any advice?

set_random_params

Both the VAE and RNN have a method named set_random_params, but they remain unused. Is that leftover code or did it slip through the cracks?

Issue while building the docker image

Hello,

Thanks for the repo. I am facing an error while building the Docker image:

Step 23/46 : RUN python3 -m pip install --no-cache-dir ${TF_PACKAGE}${TF_PACKAGE_VERSION:+==${TF_PACKAGE_VERSION}}
 ---> Running in ebae39a5f72c
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.5/dist-packages/pip/__main__.py", line 21, in <module>
    from pip._internal.cli.main import main as _main
  File "/usr/local/lib/python3.5/dist-packages/pip/_internal/cli/main.py", line 60
    sys.stderr.write(f"ERROR: {exc}")
                                   ^
SyntaxError: invalid syntax

Could you please suggest a fix for this?

Thanks
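
The traceback shows pip itself failing to parse an f-string, a syntax that Python only accepts from 3.6 onward, while the interpreter in the image is Python 3.5. As far as I know, pip 21.0 is the release that dropped Python 3.5 support, so an older pip is needed on that interpreter. A minimal sketch of that rule; pip_requirement_for is a hypothetical helper, not something in the repo:

```python
import sys


def pip_requirement_for(version=None):
    """Return a pip requirement string that the given interpreter can run.

    Assumption: pip releases from 21.0 onward contain f-strings,
    which Python versions before 3.6 cannot parse.
    """
    major, minor = (version or sys.version_info)[:2]
    if (major, minor) < (3, 6):
        return "pip<21"
    return "pip"


if __name__ == "__main__":
    print(pip_requirement_for())
```

On the Python 3.5 image this returns "pip<21", which suggests pinning pip to that range in the Dockerfile before the TensorFlow install step, or moving to a base image with a newer Python.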

Lack of documentation for parameter "rnn_r_pred"

Hi Zac,

I am guessing that one reason the original paper did not train the car racing agent in the dream is that the reward cannot be easily calculated outside the real environment. In the Doom environment, the reward can simply be increased for each frame the bot stays alive in the dream.

I see a parameter "rnn_r_pred" that seems to indicate that the MDNRNN can be trained to predict the reward. Can this reward prediction be used to train the controller in the dream in the car racing environment?

Cheers!

Incorrect array indexes/sizes in the controller's trainer

Hi Zac,

There is a bug, possibly a list of cascading bugs, in the controller training script. Specifically, if controller_num_test_episode is greater than controller_num_episode, the following error occurs (in CarRacing):

Track generation: 1180..1479 -> 299-tiles track
Track generation: 1184..1484 -> 300-tiles track
Track generation: 1016..1274 -> 258-tiles track
Traceback (most recent call last):
  File "train.py", line 451, in <module>
    main(args)
  File "train.py", line 422, in main
    slave()
  File "train.py", line 193, in slave
    result_packet = encode_result_packet(results)
  File "train.py", line 137, in encode_result_packet
    r = np.concatenate([r, np.zeros(RESULT_PACKET_SIZE - eval_packet_size)-1.0], axis=0)
ValueError: negative dimensions are not allowed

The error can be reproduced by downloading the latest main branch of the repo, altering the CarRacing config file as shown below, then running the trainer only (no need to re-train the VAE or RNN):

export CONFIG_PATH=configs/carracing.config
CUDA_VISIBLE_DEVICES=-1 xvfb-run -a -s "-screen 0 1400x900x24 +extension RANDR" -- nice python train.py -c $CONFIG_PATH

Controller part of the config file:

controller_optimizer=cma
controller_num_episode=2
controller_num_test_episode=3
controller_eval_steps=4
controller_num_worker=10
controller_num_worker_trial=1
controller_antithetic=0
controller_cap_time=0
controller_retrain=0
controller_seed_start=0
controller_sigma_init=0.1
controller_sigma_decay=0.999
controller_batch_mode=mean

The evaluation results read from the workers could also be affected by this (see train.py at lines 219-220) because the orchestrator process is (over)reading num_episode items from the results, whereas there could only be num_test_episode items to read:

      reward_list_total[idx, :num_episode] = result[2]
      reward_list_total[idx, num_episode:] = result[3]

This could skew the reward mean for a particular batch and affect training performance and model accuracy. It should affect the Doom experiment as well, although I haven't tested it. A quick workaround is to set both controller_num_test_episode and controller_num_episode to the same value, but it is not ideal. I wonder if fixing this bug would get you closer to the results of the original paper.
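
The ValueError in the traceback comes from asking NumPy for an array of negative length once the result is larger than the packet. A minimal sketch of a guarded padding step, assuming the packet is a flat 1-D vector padded with -1.0 as in the failing line; pad_result_packet and the max(...) sizing rule are hypothetical and not the repo's code:

```python
import numpy as np


def pad_result_packet(values, packet_size, fill=-1.0):
    """Pad a 1-D result vector with `fill` up to packet_size entries.

    Raises a descriptive error instead of passing a negative length to NumPy.
    """
    values = np.asarray(values, dtype=float)
    missing = packet_size - values.size
    if missing < 0:
        raise ValueError(
            "result has %d entries but the packet holds only %d; "
            "size the packet for max(num_episode, num_test_episode)"
            % (values.size, packet_size)
        )
    return np.concatenate([values, np.full(missing, fill)], axis=0)


# Hypothetical sizing rule: leave room for whichever episode count is larger.
num_episode, num_test_episode = 2, 3
packet_size = max(num_episode, num_test_episode)
padded = pad_result_packet([10.0, 20.0], packet_size)
```

With the config values from this report, the two training rewards are padded to three entries, and a result that is too large fails with a message naming both sizes rather than "negative dimensions are not allowed".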

Issue for training VAE

Hello,
Thank you for the repo. I am having trouble training the VAE (bash launch_scripts/carracing.bash) for CarRacing-v0.

  1. Loss did not decrease
  2. print(batch_z[0]) returns tf.Tensor([nan,nan,...,nan],shape=(32,),dtype=float32) in visualization.ipynb

I am using your Docker environments in my local Ubuntu PC.
Could you please tell me how to train correctly?

I look forward to hearing from you.

[Screenshot from 2021-02-24 20-59-34]

xvfb-run: command not found

When I ran the car racing bash script you mention, I got this error for every iteration. What is xvfb-run?

worker X
launch_scripts/carracing.bash: line 5: xvfb-run: command not found
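
xvfb-run is a wrapper script that starts a virtual X server (Xvfb) and runs a command inside it, so the environment can render without a physical display. On Debian and Ubuntu it ships in the xvfb package. A minimal sketch for checking that it is on PATH before launching the script; missing_commands is a hypothetical helper:

```python
import shutil


def missing_commands(commands):
    """Return the entries of `commands` that cannot be found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    missing = missing_commands(["xvfb-run"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required commands are available.")
```

If xvfb-run is reported as not installed, the script is probably being run outside the Docker image, or the xvfb package was never installed on the host.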

Dropout and LSTM

Hi Zac,

It looks like the dropout features in the LSTM layer are not used at all. According to the comments in the code, that might not be intended:
rnn_out, h, c = rnn.inference_base(input_x, initial_state=states, training=training) # set training True to use Dropout
For this to work, the dropout and/or recurrent_dropout parameters need to be specified when the LSTM layer is created in rnn/rnn.py on line 28:
self.inference_base = tf.keras.layers.LSTM(units=args.rnn_size, return_sequences=True, return_state=True, time_major=False, dropout=args.rnn_dropout, recurrent_dropout=args.rnn_rec_dropout)
with both args.rnn_dropout and args.rnn_rec_dropout around 0.4 as a starting point.

If dropout is not required, then the LSTM layer could be replaced with keras.layers.CuDNNLSTM, which is much faster to train on a GPU. The full requirements for replacing the pure TensorFlow LSTM implementation with the CuDNN implementation are listed towards the top of this page.

Dropouts or speed, which will you choose?
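
The point about training=True can be illustrated without TensorFlow. The following is a NumPy sketch of inverted dropout, not the Keras implementation: when the rate is 0.0, which is the Keras LSTM default for both dropout and recurrent_dropout, the training flag changes nothing.

```python
import numpy as np


def dropout(x, rate, training, rng):
    """Inverted dropout: zero each entry with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return x  # nothing is dropped, whatever the training flag says
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)


rng = np.random.default_rng(0)
x = np.ones((4, 8))
same = dropout(x, rate=0.0, training=True, rng=rng)   # identical to x
noisy = dropout(x, rate=0.4, training=True, rng=rng)  # entries are 0 or 1/(1-0.4)
```

Passing training=True only has an effect once a non-zero rate is configured on the layer, which is why the rates have to be set when the LSTM is constructed.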

Docker Image Build Not Successful

Using macOS Monterey (12.3.1) M1 Pro

Running the following command inside the cloned repo directory on my local machine:

docker image build -t wm:1.0 -f docker/Dockerfile.wm .

results in the following error message:

[Screenshot: Screen Shot 2023-06-22 at 2 16 59 PM]

I fixed this by adding the option < --allow-authenticated >
on line 21 of Dockerfile.wm, after the < --no-install-recommends > option (following the -y flag).

In addition, I ran into problems with the command:

< RUN python3 -m pip install --no-cache-dir ${TF_PACKAGE}${TF_PACKAGE_VERSION:+==${TF_PACKAGE_VERSION}} >

The error prompt suggests I might need to run:

< apt-get install -y python3-dev >

on Ubuntu systems.

[Screenshot: Screen Shot 2023-06-22 at 3 54 43 PM]

This line is actually located towards the bottom (line ~94), but moving it up before the TensorFlow download line seems to do the trick.

With the python3-dev install placed above the TF download, the command line hangs for a long time, but it should eventually succeed.

Also, the < apt -y update > command on line 93 failed for me because the package manager was missing a public key for the NVIDIA download (super technical, I know, but I've had a long day, forgive me), giving the following error:

< the following signatures couldn't be verified because the public key is not available >

I fixed this by prepending the "install packages" section with a line to update the package manager with the missing public key given by the error message, as per this resource here:

https://chrisjean.com/fix-apt-get-update-the-following-signatures-couldnt-be-verified-because-the-public-key-is-not-available/

This also took a long while. But in the end, the building of this docker image completed successfully!

Now to run the programs themselves...
