
EfficientZero (NeurIPS 2021)

Open-source codebase for EfficientZero, from "Mastering Atari Games with Limited Data" at NeurIPS 2021.

Environments

EfficientZero requires python3 (>=3.6) and pytorch (>=1.8.0) with the development headers.

We recommend using torch amp (--amp_type torch_amp) to accelerate training.
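
For readers unfamiliar with torch amp, the sketch below shows the general pattern mixed-precision training follows (autocast for the forward pass, a GradScaler for the backward pass). It is a generic illustration with a placeholder model and optimizer, not the exact integration behind the --amp_type flag:

import torch

# Generic mixed-precision training step (CUDA assumed).
model = torch.nn.Linear(512, 256).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.2, momentum=0.9)
scaler = torch.cuda.amp.GradScaler()

def train_step(batch, target):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # run the forward pass in mixed precision
        loss = torch.nn.functional.mse_loss(model(batch), target)
    scaler.scale(loss).backward()     # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)            # unscale gradients, then take the optimizer step
    scaler.update()
    return loss.item()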

Prerequisites

Before starting training, you need to build the C++/Cython external packages (GCC version 7.5+ is required):

cd core/ctree
bash make.sh

The distributed framework of this codebase is built on ray.

Installation

For the other packages required by this codebase, run pip install -r requirements.txt.

Usage

Quick start

  • Train: python main.py --env BreakoutNoFrameskip-v4 --case atari --opr train --amp_type torch_amp --num_gpus 1 --num_cpus 10 --cpu_actor 1 --gpu_actor 1 --force
  • Test: python main.py --env BreakoutNoFrameskip-v4 --case atari --opr test --amp_type torch_amp --num_gpus 1 --load_model --model_path model.p

Bash file

We provide train.sh and test.sh for training and evaluation.

  • Train:
    • With 4 GPUs (3090): bash train.sh
  • Test: bash test.sh
Required Arguments                    Description
--env                                 Name of the environment
--case {atari}                        Switch between different domains (default: atari)
--opr {train,test}                    Select the operation to be performed
--amp_type {torch_amp,none}           Use torch amp for acceleration

Other Arguments                       Description
--force                               Overwrite the existing result directory
--num_gpus 4                          Number of available GPUs
--num_cpus 96                         Number of available CPUs
--cpu_actor 14                        Number of CPU workers
--gpu_actor 20                        Number of GPU workers
--seed 0                              Random seed
--use_priority                        Use prioritized sampling in the replay buffer
--use_max_priority                    Use the max priority for newly collected data
--amp_type 'torch_amp'                Use torch amp for acceleration
--info 'EZ-V0'                        Tags for your experiments
--p_mcts_num 8                        Number of parallel envs in self-play
--revisit_policy_search_rate 0.99     Rate of reanalyzed policies
--use_root_value                      Use root values in value targets (requires more GPU actors)
--render                              Render during evaluation
--save_video                          Save videos during evaluation

Architecture Designs

The architecture of the training pipeline is shown as follows:

Some suggestions

  • To use a smaller model, you can choose a smaller dimension for the projection layers (e.g., 256/64) and the LSTM hidden layer (e.g., 64) in the config.
  • For GPUs with 10 GB of memory instead of 20 GB, you can allocate 0.25 GPU to each GPU worker (@ray.remote(num_gpus=0.25)) in core/reanalyze_worker.py, as sketched below.
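
A minimal illustration of that change; the class body here is a placeholder (the real GPU batch worker in core/reanalyze_worker.py does much more), only the decorator argument matters:

import ray

# Placeholder actor showing a fractional GPU allocation per worker.
@ray.remote(num_gpus=0.25)
class ReanalyzeGPUWorker:
    def __init__(self, config):
        self.config = config

    def run(self):
        # prepare reanalyzed training batches on a quarter of a GPU
        pass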

New environment registration

If you want to apply EfficientZero to a new environment such as MuJoCo, here are the steps for registration (a skeleton follows the list):

  1. Follow the directory config/atari and create dir for the env at config/mujoco.
  2. Implement your MujocoConfig(BaseConfig) class and implement the models as well as your environment wrapper.
  3. Register the case at main.py.
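
A minimal skeleton for steps 1-3; BaseConfig and new_game come from the existing codebase, while the remaining names are illustrative placeholders:

# config/mujoco/__init__.py (sketch)
from core.config import BaseConfig


class MujocoConfig(BaseConfig):
    """Hypothetical config mirroring config/atari for a MuJoCo domain."""

    def get_uniform_network(self):
        # build the EfficientZero networks sized for the MuJoCo observation/action space
        raise NotImplementedError

    def new_game(self, seed=None, **kwargs):
        # return your environment wrapper, the MuJoCo analogue of AtariWrapper(Game)
        raise NotImplementedError

# Step 3: in main.py, dispatch on --case, e.g. add a branch such as
#   elif args.case == 'mujoco': from config.mujoco import MujocoConfig
# (the exact dispatch code in main.py may differ).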

Results

Evaluation with 32 evaluation seeds for 3 different training runs (each with a different seed).

Citation

If you find this repo useful, please cite our paper:

@inproceedings{ye2021mastering,
  title={Mastering Atari Games with Limited Data},
  author={Weirui Ye and Shaohuai Liu and Thanard Kurutach and Pieter Abbeel and Yang Gao},
  booktitle={NeurIPS},
  year={2021}
}

Contact

If you have any questions or want to use the code, please contact [email protected].

Acknowledgement

We are grateful to the following GitHub repos for their valuable codebase implementations:

https://github.com/koulanurag/muzero-pytorch

https://github.com/werner-duvaud/muzero-general

https://github.com/pytorch/ELF

efficientzero's People

Contributors

jl1990, yewr


efficientzero's Issues

Removing Baselines dependency

Hello,
it would be nice to remove the baselines dependency (it requires TensorFlow, whereas the rest of the codebase is written in PyTorch).
Since it is apparently used only for the Atari wrappers, there are a few options (see the sketch after this list):

  • use the ones that are now in Gym (but I'm not sure they are exactly the same)
  • use the ones from Stable-Baselines3 (it also depends on PyTorch, so fewer total dependencies)
  • copy them into the repo
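
A hedged sketch of the second option, using Stable-Baselines3's bundled Atari wrapper (whether it matches the original baselines wrappers exactly would still need to be checked):

import gym
from stable_baselines3.common.atari_wrappers import AtariWrapper

# SB3's AtariWrapper bundles the usual preprocessing (noop reset, frame skip,
# episodic life, fire reset, 84x84 grayscale warp, reward clipping).
env = AtariWrapper(gym.make("BreakoutNoFrameskip-v4"))
obs = env.reset()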

Cannot reproduce Breakout results

Thank you authors for the awesome paper!

I have an issue reproducing the results on Breakout. Instead of the 414 claimed in the paper, I get 362.43 (mean performance averaged over 3 seeds). This is likely the same issue as #21.

reproduce the result of CrazyClimber

The paper shows three runs of CrazyClimber (training-curve screenshots omitted here); they look stable and high-scoring.

However, when I reran the code three times using the given command, I got the following results. The runs were cut off before training finished, but they already show that the reproduced scores are lower than what the paper reported by an order of magnitude.

Could you give some possible explanations why there is a huge difference?

# running 1
[2022-04-06 02:47:47,242][train_test][INFO][log.py>_log] ==> #0          Test Mean Score of CrazyClimberNoFrameskip-v4: 643.75     (max: 1000.0    , min:300.0     , std: 178.42628029525247)
[2022-04-06 03:54:07,033][train_test][INFO][log.py>_log] ==> #10083      Test Mean Score of CrazyClimberNoFrameskip-v4: 3071.875   (max: 3400.0    , min:2100.0    , std: 383.4338070319309)
[2022-04-06 04:55:49,719][train_test][INFO][log.py>_log] ==> #22478      Test Mean Score of CrazyClimberNoFrameskip-v4: 634.375    (max: 1300.0    , min:300.0     , std: 249.51124097923926)
[2022-04-06 05:57:18,634][train_test][INFO][log.py>_log] ==> #34576      Test Mean Score of CrazyClimberNoFrameskip-v4: 3378.125   (max: 5700.0    , min:2700.0    , std: 736.4332857598168)
[2022-04-06 06:53:54,973][train_test][INFO][log.py>_log] ==> #46427      Test Mean Score of CrazyClimberNoFrameskip-v4: 3065.625   (max: 6800.0    , min:2300.0    , std: 761.0064778797878)
[2022-04-06 07:55:10,556][train_test][INFO][log.py>_log] ==> #57917      Test Mean Score of CrazyClimberNoFrameskip-v4: 8534.375   (max: 11700.0   , min:4000.0    , std: 2101.5781830269843)
[2022-04-06 08:51:19,116][train_test][INFO][log.py>_log] ==> #69158      Test Mean Score of CrazyClimberNoFrameskip-v4: 6453.125   (max: 12100.0   , min:3100.0    , std: 2302.849372923683)
[2022-04-06 09:47:16,916][train_test][INFO][log.py>_log] ==> #80299      Test Mean Score of CrazyClimberNoFrameskip-v4: 7246.875   (max: 10300.0   , min:4900.0    , std: 1379.5344266726365)
# running 2
[2022-04-05 11:39:55,952][train_test][INFO][log.py>_log] ==> #0          Test Mean Score of CrazyClimberNoFrameskip-v4: 600.0      (max: 900.0     , min:300.0     , std: 152.0690632574555)
[2022-04-05 12:46:50,652][train_test][INFO][log.py>_log] ==> #10015      Test Mean Score of CrazyClimberNoFrameskip-v4: 393.75     (max: 800.0     , min:100.0     , std: 132.13984069916233)
[2022-04-05 13:41:18,799][train_test][INFO][log.py>_log] ==> #20080      Test Mean Score of CrazyClimberNoFrameskip-v4: 3634.375   (max: 5100.0    , min:1900.0    , std: 705.6067313844164)
[2022-04-05 14:31:57,618][train_test][INFO][log.py>_log] ==> #30013      Test Mean Score of CrazyClimberNoFrameskip-v4: 9634.375   (max: 13100.0   , min:3500.0    , std: 2230.1358387719347)
[2022-04-05 15:35:49,472][train_test][INFO][log.py>_log] ==> #40039      Test Mean Score of CrazyClimberNoFrameskip-v4: 6434.375   (max: 10000.0   , min:3500.0    , std: 1384.875755934445)
# running 3
[2022-04-05 04:43:53,277][train_test][INFO][log.py>_log] ==> #0          Test Mean Score of CrazyClimberNoFrameskip-v4: 615.625    (max: 1000.0    , min:400.0     , std: 146.00807982779583)
[2022-04-05 05:52:29,495][train_test][INFO][log.py>_log] ==> #10033      Test Mean Score of CrazyClimberNoFrameskip-v4: 950.0      (max: 1500.0    , min:700.0     , std: 264.5751311064591)
[2022-04-05 06:55:41,171][train_test][INFO][log.py>_log] ==> #20077      Test Mean Score of CrazyClimberNoFrameskip-v4: 2771.875   (max: 3800.0    , min:1000.0    , std: 850.086162912325)
[2022-04-05 07:58:09,925][train_test][INFO][log.py>_log] ==> #30064      Test Mean Score of CrazyClimberNoFrameskip-v4: 4550.0     (max: 7300.0    , min:2300.0    , std: 1228.3118496538248)
[2022-04-05 08:59:21,098][train_test][INFO][log.py>_log] ==> #40001      Test Mean Score of CrazyClimberNoFrameskip-v4: 9137.5     (max: 12500.0   , min:5000.0    , std: 1880.2842205368847)
[2022-04-05 10:00:18,121][train_test][INFO][log.py>_log] ==> #50019      Test Mean Score of CrazyClimberNoFrameskip-v4: 10393.75   (max: 24000.0   , min:5700.0    , std: 3386.1794012574114)

Envs seem not to work in parallel

Only one data worker is active since config.num_actors=1. Inside a data worker there are several envs, but they are stepped serially rather than in parallel, roughly as follows:
for i in range(env_nums): obs, ori_reward, done, info = env.step(action)
Is it possible to speed up data collection by setting config.num_actors > 1?
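
For illustration only (this is not how the repo currently schedules envs), stepping each env through its own Ray actor would look roughly like this:

import gym
import ray

# One Ray actor per environment; step() calls can then run concurrently.
@ray.remote
class EnvActor:
    def __init__(self, env_name):
        self.env = gym.make(env_name)
        self.env.reset()

    def step(self, action):
        return self.env.step(action)

# actors = [EnvActor.remote("BreakoutNoFrameskip-v4") for _ in range(env_nums)]
# results = ray.get([a.step.remote(act) for a, act in zip(actors, actions)])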

Code for continuous action space

Hi, thanks for the repository. Could you consider releasing the code that supports continuous action spaces for the DMControl 100k benchmark, i.e., the version that discretizes each action dimension? This would be very helpful.
Best

Question about the effect of discount factor and done mask when calculating the target value?

Thanks for your open-sourced code very much.

This is a common definition of a target value in classical RL (the equation screenshot is reconstructed below):
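
A hedged reconstruction of the usual n-step target (my notation: u for observed rewards, v for the bootstrap value, \gamma for the discount, d for the done flag):

z_t = \sum_{i=0}^{k-1} \gamma^{\,i}\, u_{t+i} + \gamma^{\,k}\,(1 - d_{t+k})\, v_{t+k}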

I'm a little confused about the way of calculating target value here in reanalyze_worker.py:

Why don't we multiply the bootstrap value (value_lst here) by discount_factor^td_steps, and why don't we mask the bootstrap value when the target observation is a done state?

Looking forward to your reply!

Question about the transform between true reward and value prefix

Hi,
I was a little confused about how the true reward is obtained from the value prefix in core/ctree/cnode.cpp.

For the function update_tree_q() in Line 256, the true reward is calculated by
float true_reward = node->value_prefix - parent_value_prefix

Suppose we have a root node_1 with its two children (node_2 and node_3).

Before the while loop, we push node_1 onto the node_stack;
On the first iteration of the while loop, we pop node_1, push node_2 and node_3 onto the node_stack, and finally set parent_value_prefix = node_1.value_prefix;

On the second iteration, we pop node_3 (suppose no child of node_3 has been expanded) and set parent_value_prefix = node_3.value_prefix (Line 281);

On the third iteration, we pop node_2. When we calculate the true reward of node_2 in Line 266,
true_reward = node_2.value_prefix - parent_value_prefix = node_2.value_prefix - node_3.value_prefix,

However, the parent of node_2 is node_1, so the true_reward should be node_2.value_prefix - node_1.value_prefix.
So I wonder whether there is a problem with how the variable "parent_value_prefix" is updated, or whether I have misunderstood the code.

Although update_tree_q only updates the min_max statistics, so this may not affect convergence, I wonder whether it would converge faster if the operation were fixed.

procgen

Can this be tried on the Procgen environments?

PyTorch Lightning Support

Thanks for sharing this codebase. I was wondering whether you are planning to support PyTorch Lightning in the future?

what does reward_hidden_c mean in mcts.py?

Hi, in mcts.py lines 35-36, what do reward_hidden_c and reward_hidden_h mean (what are c and h short for)? Why are reward_hidden_c_pool = [reward_hidden_roots[0]] and reward_hidden_h_pool = [reward_hidden_roots[1]]? I find the code difficult to understand; could you add some comments? Many thanks!

WSL2 NVIDIA 3090 or M1 MBP correct environment

Hi, I had trouble identifying the right mix of python and packages to get this to run.

Could you please review/confirm the python version and requirements.txt for either one of these?

  1. Dual RTX 3090 on WSL2
  2. M1 MBP

Is there a docker container for EfficientZero?

Many thanks in advance!

Reward clipping and value transformation

Hello,

Thanks for this great work! I noticed that you choose to clip the reward to [-1, 1] for Atari. I'm wondering what the purpose of applying the value transformation (i.e. scalar_transform) is if the reward is already clipped?

EfficientZero high memory consumption / keeps increasing after replay buffer is full

I am currently experimenting on scaling EfficientZero to learning setups with high-data regimes.

As a first step, I am running experiments on Atari, with a replay buffer of 1M environment steps.
While doing this I observed that RAM consumption keeps increasing long after the replay buffer reached its maximum size.

Here are tensorboard plots on Breakout, for a 600k training steps run (20M environment steps / 80M environment frames):

breakout_high_mem

I perform experiments on cluster computers featuring 4 tesla V100 gpus / 40 cpus and 187GB of RAM.

As you can see, although the maximum replay buffer size ("total_node_num") is reached after 30k training steps, RAM (in %) keeps increasing until around 250k steps, from 80% to 85%.

Ideally, I would also like to increase the batch size. But it seems like the problem gets worse in that setting:

breakout_mem

The orange curves are from the same Breakout experiments, but with a batch size of 512 (instead of 256), and a smaller replay buffer size (0.1M). Here the maximum replay buffer size is obtained at 4k training steps but memory keeps increasing until 100K+ steps.
I understand that a bigger batch means more RAM because more data is processed when updating/doing MCTS, but that does not explain why memory keeps increasing after the replay buffer fills up.

Any ideas on what causes this high RAM consumption, and how we could mitigate it?

Run details

Here are the parameters used for the first experiment I described (pink curves):

Param: {'action_space_size': 4, 'num_actors': 2, 'do_consistency': True, 'use_value_prefix': True, 'off_correction': True,
'gray_scale': False, 'auto_td_steps_ratio': 0.3, 'episode_life': True, 'change_temperature': True, 'init_zero': True,
'state_norm': False, 'clip_reward': True, 'random_start': True, 'cvt_string': True, 'image_based': True,
'max_moves': 27000, 'test_max_moves': 3000, 'history_length': 400, 'num_simulations': 50, 'discount': 0.988053892081,
'max_grad_norm': 5, 'test_interval': 10000, 'test_episodes': 32, 'value_delta_max': 0.01, 'root_dirichlet_alpha': 0.3,
'root_exploration_fraction': 0.25, 'pb_c_base': 19652, 'pb_c_init': 1.25, 'training_steps': 900000, 'last_steps': 20000,
'checkpoint_interval': 100, 'target_model_interval': 200, 'save_ckpt_interval': 100000, 'log_interval': 1000, 'vis_interval': 1000,
'start_transitions': 2000, 'total_transitions': 30000000, 'transition_num': 1.0, 'batch_size': 256, 'num_unroll_steps': 5,
'td_steps': 5, 'frame_skip': 4, 'stacked_observations': 4, 'lstm_hidden_size': 512, 'lstm_horizon_len': 5,
'reward_loss_coeff': 1, 'value_loss_coeff': 0.25, 'policy_loss_coeff': 1, 'consistency_coeff': 2, 'device': 'cuda',
'debug': False, 'seed': 0, 'value_support': <core.config.DiscreteSupport object at 0x152644d101d0>,
'reward_support': <core.config.DiscreteSupport object at 0x152644d10210>, 'use_adam': False, 'weight_decay': 0.0001,
'momentum': 0.9, 'lr_warm_up': 0.01, 'lr_warm_step': 1000, 'lr_init': 0.2, 'lr_decay_rate': 0.1, 'lr_decay_steps': 900000,
'mini_infer_size': 64, 'priority_prob_alpha': 0.6, 'priority_prob_beta': 0.4, 'prioritized_replay_eps': 1e-06,
'image_channel': 3, 'proj_hid': 1024, 'proj_out': 1024, 'pred_hid': 512, 'pred_out': 1024, 'bn_mt': 0.1,
'blocks': 1, 'channels': 64, 'reduced_channels_reward': 16, 'reduced_channels_value': 16, 'reduced_channels_policy': 16,
'resnet_fc_reward_layers': [32], 'resnet_fc_value_layers': [32], 'resnet_fc_policy_layers': [32], 'downsample': True,
'env_name': 'BreakoutNoFrameskip-v4', 'obs_shape': (12, 96, 96), 'case': 'atari', 'amp_type': 'torch_amp',
'use_priority': True, 'use_max_priority': True, 'cpu_actor': 14, 'gpu_actor': 20, 'p_mcts_num': 128,
'use_root_value': False, 'auto_td_steps': 270000.0, 'use_augmentation': True, 'augmentation': ['shift', 'intensity'],
'revisit_policy_search_rate': 0.99}

Zero score on Freeway

I tried to run the code for Atari Freeway using the following command with the default settings in the code:

python main.py --env FreewayNoFrameskip-v4 \
--case atari \
--opr train \
--amp_type torch_amp \
--num_gpus 1 \
--num_cpus 10 \
--cpu_actor 2 \
--gpu_actor 2 \
--force \
--object_store_memory 21474836480 \
--seed=0

I tried two seeds, 0 and 1. Based on the TensorBoard curves, the algorithm seems to receive no reward at all during training: both workers.ori_reward and Train_statistics.target_value_prefix_mean are constantly zero from beginning to end.

From the train/test log, seed 0 got a positive reward (~7.5) at step 0 but no reward at all after that. Seed 1 also got ~7.5 at step 0; of the remaining evaluations, half scored 0 and the other half scored 21.34.

I wonder whether I did something wrong.

Thanks

Wei

EfficientZero V2

I've read your paper EfficientZero V2; really good job extending to continuous control. Would you or your team consider open-sourcing it? I would be very grateful!

Question about getting zero test score when I try to run EfficientZero on BabyAI grid environment

Hello, first of all thanks for your amazing job on EfficientZero.

I tried to adapt EfficientZero to a BabyAI environment ("PutNextLocal"), but it keeps giving me a test score of 0 throughout the 100k-step training process.

I made several modifications in order to adapt to BabyAI "PutNextLocal" env:

  1. I created a dir for the env at config/babyai and implemented BabyAIConfig(BaseConfig). I left every parameter at its Atari default and only changed line 101 from (image_channel, 96, 96) to (image_channel, 7, 7) in config/babyai/__init__.py.
  2. Changed the class name from AtariWrapper(Game) to BabyAIWrapper(Game) and left everything else at the default settings.
  3. Commented out lines 103 to 111, since the grid game does not have ale.
  4. Also commented out lines 235 to 237 (https://github.com/YeWR/EfficientZero/blob/main/core/utils.py#L235) in core/utils.py.
  5. Also modified my bash file accordingly (screenshot omitted).

After running the program with the default Atari-like parameter settings, the TensorBoard curves (screenshots omitted) show the test score staying at 0.
Do you have any suggestions on how to make the right modifications so the program produces reasonable results on BabyAI 'PutNextLocal'?

Thank you so much, and I look forward to hearing from you.

Training is really slow

First of all, congratulations on the great work!

I've been trying to train an agent to play breakout and the training is really slow. This is really confusing to me since, according to the paper, it should take 7 hours to do a full training of 100k steps. My experience has been different:

Running time

  • 4k steps every 8 hours

Hardware:

  • 4 GPU (QUADRORTX6000)
  • 80 CPUs (4 GB ram per CPU)

Running command

python main.py --env atari 
                           --case BreakoutNoFrameskipv4 
                           --opr train 
                           --amp_type torch_amp 
                           --num_gpus 4 
                           --num_cpus 80 
                           --cpu_actor 5 
                           --gpu_actor 13 
                           --seed 2917 
                           --force 
                           --use_priority 
                           --use_max_priority 
                           --debug 
                           --p_mcts_num 1

Do you have any idea or advice so that we can optimize the runtime?

@YeWR

"bash make.sh" failed

Hi Weirui,
I tried to build the dependency but failed. Is there a requirement on the GCC version? The log is below. I also tried changing ">>" to "> >", but the "nullptr" problem was still there. Do you have any suggestions? Thank you!

Best,
Tao

running build_ext
building 'cytree' extension
gcc -pthread -B /home/v-ty/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I. -I/home/v-ty/anaconda3/lib/python3.8/site-packages/numpy/core/include -I/home/v-ty/anaconda3/include/python3.8 -c cytree.cpp -o build/temp.linux-x86_64-3.8/cytree.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from cytree.cpp:653:0:
cnode.cpp:31:9: warning: identifier ‘nullptr’ is a keyword in C++11 [-Wc++0x-compat]
this->ptr_node_pool = nullptr;
^
In file included from /home/v-ty/anaconda3/lib/python3.8/site-packages/numpy/core/include/numpy/ndarraytypes.h:1822:0,
from /home/v-ty/anaconda3/lib/python3.8/site-packages/numpy/core/include/numpy/ndarrayobject.h:12,
from /home/v-ty/anaconda3/lib/python3.8/site-packages/numpy/core/include/numpy/arrayobject.h:4,
from cytree.cpp:659:
/home/v-ty/anaconda3/lib/python3.8/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#warning "Using deprecated NumPy API, disable it with "
^
In file included from cnode.cpp:2:0,
from cytree.cpp:653:
cnode.h:47:42: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> node_pools;
^
cnode.h:53:94: error: ‘>>’ should be ‘> >’ within a nested template argument list
void prepare(float root_exploration_fraction, const std::vector<std::vector> &noises, const std::vector &value_prefixs, const std::vector<std::vector> &policies);
^
cnode.h:53:182: error: ‘>>’ should be ‘> >’ within a nested template argument list
void prepare(float root_exploration_fraction, const std::vector<std::vector> &noises, const std::vector &value_prefixs, const std::vector<std::vector> &policies);
^
cnode.h:54:111: error: ‘>>’ should be ‘> >’ within a nested template argument list
void prepare_no_noise(const std::vector &value_prefixs, const std::vector<std::vector> &policies);
^
cnode.h:56:40: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> get_trajectories();
^
cnode.h:57:40: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> get_distributions();
^
cnode.h:67:43: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector<CNode*>> search_paths;
^
cnode.h:79:184: error: ‘>>’ should be ‘> >’ within a nested template argument list
void cbatch_back_propagate(int hidden_state_index_x, float discount, const std::vector &value_prefixs, const std::vector &values, const std::vector<std::vector> &policies, tools::CMinMaxStatsList min_max_s
^
In file included from cytree.cpp:653:0:
cnode.cpp: In constructor ‘tree::CNode::CNode()’:
cnode.cpp:31:31: error: ‘nullptr’ was not declared in this scope
this->ptr_node_pool = nullptr;
^
In file included from cytree.cpp:653:0:
cnode.cpp: At global scope:
cnode.cpp:204:94: error: ‘>>’ should be ‘> >’ within a nested template argument list
void CRoots::prepare(float root_exploration_fraction, const std::vector<std::vector> &noises, const std::vector &value_prefixs, const std::vector<std::vector> &policies){
^
cnode.cpp:204:182: error: ‘>>’ should be ‘> >’ within a nested template argument list
void CRoots::prepare(float root_exploration_fraction, const std::vector<std::vector> &noises, const std::vector &value_prefixs, const std::vector<std::vector> &policies){
^
cnode.cpp:213:111: error: ‘>>’ should be ‘> >’ within a nested template argument list
void CRoots::prepare_no_noise(const std::vector &value_prefixs, const std::vector<std::vector> &policies){
^
cnode.cpp:226:32: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> CRoots::get_trajectories(){
^
cnode.cpp: In member function ‘std::vector<std::vector > tree::CRoots::get_trajectories()’:
cnode.cpp:227:36: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> trajs;
^
cnode.cpp: At global scope:
cnode.cpp:236:32: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> CRoots::get_distributions(){
^
cnode.cpp: In member function ‘std::vector<std::vector > tree::CRoots::get_distributions()’:
cnode.cpp:237:36: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> distributions;
^
cnode.cpp: At global scope:
cnode.cpp:317:184: error: ‘>>’ should be ‘> >’ within a nested template argument list
void cbatch_back_propagate(int hidden_state_index_x, float discount, const std::vector &value_prefixs, const std::vector &values, const std::vector<std::vector> &policies, tools::CMinMaxStatsList min_max_s
^
cytree.cpp:3382:12: warning: ‘int pyx_pw_6cytree_4Node_1__cinit(PyObject
, PyObject
, PyObject*)’ defined but not used [-Wunused-function]
static int pyx_pw_6cytree_4Node_1__cinit(PyObject *__pyx_v_self, PyObject *__pyx_args, PyObject *__pyx_kwds) {
^
error: command 'gcc' failed with exit status 1

Using custom gym environment

I wonder whether there is a tutorial on how to add a custom gym environment in order to use the EfficientZero algorithm?
Where is the model saved after training?
How do I use the saved model?
Is it possible to use a custom environment with a 3-dimensional np array as the observation (state)?
Peter

License?

If I may ask, would it be possible to add a license?

Thanks!

Question: Why not reanalyze 100% policy targets?

Hi there,

First of all, great work and thank you for opensourcing your code!

I have a question regarding reanalyze: you chose to reanalyze 99% of policy targets and 100% of value targets. I am just curious about the reason behind this choice. Did you try reanalyzing 100% of the policy targets? Did it hurt the performance?

Thank you!

Slight discrepancy with implementation of value scaling

Hey, firstly just wanted to say thank you because this is an amazing repo for understanding how MuZero/EfficientZero work in detail!

I've been trying to dig into exactly how the value prediction is done as it seems like a pretty significant detail that is hidden away in an appendix and I think there seems to be a slight discrepancy (that probably doesn't make much difference but is maybe still worth highlighting).

In the original paper (https://arxiv.org/pdf/1805.11593.pdf) they define the scaling function with the εx term outside the sign bracket, and the inverse function is given by proposition A.2 (iii); in the MuZero appendix, however, the final εx term sits inside the bracket.
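
Reconstructing the two formulas from the description above (the only difference is whether the \epsilon x term sits inside or outside the sign bracket):

% Pohlen et al. (2018), with the inverse given by Prop. A.2 (iii):
h(x) = \operatorname{sign}(x)\left(\sqrt{|x| + 1} - 1\right) + \epsilon x

% MuZero appendix (final term inside the bracket):
h(x) = \operatorname{sign}(x)\left(\sqrt{|x| + 1} - 1 + \epsilon x\right)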

Unless I'm mistaken, in the code you've used the MuZero version of h(x), but for the inverse formula you've used the formula given in proposition A.2 (iii) of the first paper - which won't quite be correct anymore, right?

Just to show the discrepancy - if I look at the following code:

import torch

def scalar_transform(x, epsilon=0.001):
    sign = torch.ones(x.shape).float().to(x.device)
    sign[x < 0] = -1.0
    output = sign * (torch.sqrt(torch.abs(x) + 1) - 1 + epsilon * x)
    return output

def inverse_scalar_transform(value, epsilon=0.001):
    sign = torch.ones(value.shape).float().to(value.device)
    sign[value < 0] = -1.0
    output = (((torch.sqrt(1 + 4 * epsilon * (torch.abs(value) + 1 + epsilon)) - 1) / (2 * epsilon)) ** 2 - 1)
    output = sign * output
    return output


a = torch.randn(1000)
b = scalar_transform(a)
c = inverse_scalar_transform(b)

print(torch.sum(torch.abs(a-c)))

which is how the functions are implemented in this codebase, I get a value of ~2.4 printed, whereas if I change the scalar transform to match the one in the first paper I get a value of ~0.04.

All memory seems on the first GPU

Hi, I found something weird when training EfficientZero. I trained the agent on a P40 server with four 24G GPUs and 28 CPUs, but all of the allocated memory was on the first GPU even though I had set CUDA_VISIBLE_DEVICES=0,1,2,3. I tried changing @ray.remote(num_gpus), but the problem was still there. Do you have any suggestions? Thank you!

How to use with SLURM

Any guidance for using with SLURM? Certain actors are failing

When I run

srun -p compsci-gpu --gres=gpu:4 --cpus-per-gpu=5 --mem=24G --pty bash

Followed by:

python main.py --env BreakoutNoFrameskip-v4 --case atari --opr train --amp_type torch_amp --num_gpus 1 --num_cpus 10 --cpu_actor 1 --gpu_actor 1 --force

I get the following warning:

WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 135095644160 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.

Followed by the task failing:

2022-12-22 10:38:02,577 WARNING worker.py:1072 -- The node with node id 67f743d808b7bd16d45063d18dadf1b5cbb39e7d has been marked dead because the detector has missed too many heartbeats from it.

E1222 10:38:02.612172 8087 8433 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=core.reanalyze_worker, class_name=BatchWorker_CPU, function_name=run, function_hash=}, task_id=d251967856448ceb88866c7d01000000, task_name=BatchWorker_CPU.run(), job_id=01000000, num_args=0, num_returns=2, actor_task_spec={actor_id=88866c7d01000000, actor_caller_id=ffffffffffffffffffffffff01000000, actor_counter=0}

I am not sure how to parse the error; any advice? What #SBATCH headings do you recommend using with the provided train.sh? Thank you!

How to evaluate the model

Thanks for your great work!

When I run your code, I find that scores from the test script are always a little higher than the scores from the evaluation stage during training (in training, the model is tested every 10k steps).

Here are some results I got from the scripts; the left number is from the train script and the right from the test script.

CrazyClimber 7246 9603
BankHeist 419 454

I have glanced through the two bash scripts and the code. In my understanding, both scripts evaluate agents in exactly the same way: the agent is evaluated with 32 seeds and the mean of the 32 scores is reported.

So I have two questions,

  1. Why are scores from the test script always a little higher than scores from the evaluation stage during training?

  2. Which script did you use to get the results in the paper?

Looking forward to your reply.

Clarification on the atari environment?

Hi all,

Thanks for releasing the code.
Could you provide some additional information on the exact setting you are using with respect to the atari environment? From the code, it seems that you are using the NoFrameSkip-v4 version of the gym env, which, as far as I can tell, implies:

  • You are taking an action every frame, whereas standard evaluation protocol uses a frameskip of 4, meaning taking an action only every fourth frame
  • Your environment is fully deterministic, in particular there is no sticky action (repeat_action_probability=0). As far as I can tell, some of the methods that you are comparing to, such as SGI, do use sticky actions.

Could you please clarify?

Thanks in advance.

Question about whether need to train multiple agents for different games

Thank you very much for your open-sourced code.
Recently, I wanted to apply the model trained on Breakout to other games, but different games have different action spaces, which leads to errors during testing: the parameter dimensions for Breakout are inconsistent with those of other games. I would like to ask whether each game needs a separately trained agent. I really hope to get your answer, thank you.

reproduce results for other environment

Hi, Nice work! Many thanks for the open-source code.

If I want to reproduce results for environments other than BreakoutNoFrameskip-v4, what env name (especially which version suffix, like -v4) should I pass in?

Thanks!

The first selfplay worker uses the same seed for all parallel environments

I might have found an unexpected behavior in how parallel training environments are being seeded.

I am referring to this line:

envs = [self.config.new_game(self.config.seed + self.rank * i) for i in range(env_nums)]

Because the rank of the first selfplay worker is 0, parallel environments are being initialized with the same seed, which might reduce training data diversity.

We could go for a simple fix like replacing self.rank by (self.rank + 1), however this is still problematic if considering multiple workers, as there will be seed overlap between them anyway.

A good option might be to sample a seed for each parallel environment using numpy (which is seeded before launching data workers). For instance:

envs = [self.config.new_game(np.random.randint(10**9)) for i in range(env_nums)]

ray warning

(scheduler +2m13s) Warning: The following resource request cannot be scheduled right now: {'GPU': 0.125, 'CPU': 0.5}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
I don't know whether you have encountered this problem or how to solve it; after running for a while, training stops automatically. Thank you so much.

Question about the effect of torch_amp

Hi,

First of all, thank you for opensourcing your nice code!

I have a question regarding the effect of torch_amp: I tested the training of EfficientZero with and without torch_amp on PongNoFrameskip-v4 on a k8s machine, keeping all other settings the same for a fair comparison. I found that using torch.amp is a little slower than not using it, which is counterintuitive.
In the plots (omitted here), the blue line is the result without torch_amp and the orange line is the result with torch_amp.

Could you share some of your experimental results and insights on whether to use torch_amp or not?

Thanks a lot!

Question about the dynamics network

Hi,

I was just wondering if you could explain/give some motivation for why the dynamics network works as it does.

I'm looking at a simple ATARI example and when I'm inside:
def dynamics(self, encoded_state, reward_hidden, action):

the encoded state is [2, 64, 6, 6] (batch size of 2, just as a test), and the actions are [2, 1] (integers between 1 and 4).

You then define "actions_one_hot" as torch.ones(2, 1, 6, 6) and say:
actions_one_hot = actions[:, :, None, None] * actions_one_hot / self.action_space_size
which gives actions_one_hot as [2, 1, 6, 6], with each action value copied along the final two dimensions (so each action value is repeated 36 times here). Then you concatenate with the encoded state along dim=1 to give a final state of shape [2, 65, 6, 6].
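
A self-contained sketch of the shapes described above (values and names follow this description, not the repo verbatim):

import torch

batch, channels, h, w = 2, 64, 6, 6
action_space_size = 4

encoded_state = torch.randn(batch, channels, h, w)           # [2, 64, 6, 6]
actions = torch.randint(0, action_space_size, (batch, 1))    # [2, 1]

# Broadcast each scalar action over a 6x6 plane and rescale by the action space size.
actions_one_hot = torch.ones(batch, 1, h, w)
actions_one_hot = actions[:, :, None, None] * actions_one_hot / action_space_size

# Concatenate the action plane with the encoded state along the channel dimension.
state_action = torch.cat((encoded_state, actions_one_hot), dim=1)
print(state_action.shape)  # torch.Size([2, 65, 6, 6])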

Is this a standard thing to do/something that's been done elsewhere? It just feels a bit weird to me. Firstly, the actions are not "one hot encoded" here, so maybe the variable names aren't perfect (but that doesn't really matter I guess). I suppose it makes sense in that you probably want to be able to apply convolutions to the joint state/action within the dynamics network. And I guess with n_actions=4 this is fine, but it feels like this approach would probably break with a larger discrete action space, right?

Anyway if you have the time I'd be interested to hear your motivation/reasoning behind this, thanks!

Question about the effect of BatchNorm?

I found that some BatchNorm (BN) ops are used in config/atari/model.py, but BN is generally not used in RL, so I ran some ablations with and without BN. The results (plot omitted) show that performance is unstable and sometimes even bad without BN.
Could you give some reasons or insights about these results?

Question about the index of pad_child_visits_lst in selfplay_worker.py

Thank you very much for your open-sourced code.

I am very confused about this code segment in the put_last_trajectory method in selfplay_worker.py:

In Line 69, why is it
pad_child_visits_lst = game_histories[i].child_visits[beg_index:end_index] rather than
pad_child_visits_lst = game_histories[i].child_visits[:self.config.num_unroll_steps]?

In my understanding, game_histories[i].child_visits[0] is the child_visits of the stacked obs game_histories[i].obs_history[beg_index].

Is this a bug?

Looking forward to your reply!

EfficientZero doesn't seem to be training

Hi, first of all congratulations on the great work!

I haven't managed to train an agent yet using the EfficientZero framework. The command I'm using to train is the following:

python3 main.py  --env BreakoutNoFrameskip-v4 
                 --case atari 
                 --opr train 
                 --amp_type torch_amp 
                 --num_gpus 4 
                 --num_cpus 32 
                 --cpu_actor 12 
                 --gpu_actor 28 
                 --force 
                 --use_priority 
                 --use_max_priority 
                 --debug

In a cluster with the following architecture:

  • 32 CPUs, each with 8 GB ram.
  • 4 16GB teslaV100 gpus:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   32C    P0    52W / 300W |  12836MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   31C    P0    51W / 300W |  11373MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   31C    P0    54W / 300W |  10004MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P0    55W / 300W |   8529MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The problem I'm facing is that even after a while of training there's only the following log:

(pid=52926) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=52926) [Powered by Stella]
(pid=52926) Start evaluation at step 0.
(pid=52926) Step 0, test scores: 
(pid=52926) [5. 0. 5. 2. 0. 2. 0. 9. 0. 0. 0. 2. 2. 4. 0. 2. 0. 0. 0. 0. 2. 2. 0. 5.
(pid=52926)  0. 0. 0. 5. 2. 0. 2. 5.]

Also, the results folder of the experiment is mostly empty; I only have a train.log with the initial parameters.

I'm not sure whether this is just a matter of waiting for a long time or whether something in the inner workings is stuck (it looks like the batch_storage from the main train loop is always empty, since we haven't entered the train phase yet).

Something I find really weird is that time passes but the GPU Memory-Usage stays exactly the same, which makes me think something is off.

Would appreciate any advice in order to make this work.
Thanks in advance!

100% Offline RL use-case

Hey,

is there a way to use your implementation with a fixed MDP dataset instead of an environment for 100% offline RL?
