
EfficientZero (NeurIPS 2021)

Open-source codebase for EfficientZero, from "Mastering Atari Games with Limited Data" at NeurIPS 2021.

Environments

EfficientZero requires python3 (>=3.6) and pytorch (>=1.8.0) with the development headers.

We recommend using torch amp (--amp_type torch_amp) to accelerate training.
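
For readers unfamiliar with torch amp, the sketch below shows the general pattern mixed-precision training follows (autocast for the forward pass, a GradScaler for the backward pass). It is a generic illustration with a placeholder model and optimizer, not the exact integration behind the --amp_type flag:

import torch

# Generic mixed-precision training step (CUDA assumed).
model = torch.nn.Linear(512, 256).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.2, momentum=0.9)
scaler = torch.cuda.amp.GradScaler()

def train_step(batch, target):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # run the forward pass in mixed precision
        loss = torch.nn.functional.mse_loss(model(batch), target)
    scaler.scale(loss).backward()     # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)            # unscale gradients, then take the optimizer step
    scaler.update()
    return loss.item()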

Prerequisites

Before starting training, you need to build the C++/Cython external packages (GCC version 7.5+ is required):

cd core/ctree
bash make.sh

The distributed framework of this codebase is built on ray.

Installation

For the other packages required by this codebase, run pip install -r requirements.txt.

Usage

Quick start

  • Train: python main.py --env BreakoutNoFrameskip-v4 --case atari --opr train --amp_type torch_amp --num_gpus 1 --num_cpus 10 --cpu_actor 1 --gpu_actor 1 --force
  • Test: python main.py --env BreakoutNoFrameskip-v4 --case atari --opr test --amp_type torch_amp --num_gpus 1 --load_model --model_path model.p

Bash file

We provide train.sh and test.sh for training and evaluation.

  • Train:
    • With 4 GPUs (3090): bash train.sh
  • Test: bash test.sh
Required Arguments                    Description
--env                                 Name of the environment
--case {atari}                        Switch between different domains (default: atari)
--opr {train,test}                    Select the operation to be performed
--amp_type {torch_amp,none}           Use torch amp for acceleration

Other Arguments                       Description
--force                               Overwrite the existing result directory
--num_gpus 4                          Number of available GPUs
--num_cpus 96                         Number of available CPUs
--cpu_actor 14                        Number of CPU workers
--gpu_actor 20                        Number of GPU workers
--seed 0                              Random seed
--use_priority                        Use prioritized sampling in the replay buffer
--use_max_priority                    Use the max priority for newly collected data
--amp_type 'torch_amp'                Use torch amp for acceleration
--info 'EZ-V0'                        Tags for your experiments
--p_mcts_num 8                        Number of parallel envs in self-play
--revisit_policy_search_rate 0.99     Rate of reanalyzed policies
--use_root_value                      Use root values in value targets (requires more GPU actors)
--render                              Render during evaluation
--save_video                          Save videos during evaluation

Architecture Designs

The architecture of the training pipeline is shown as follows:

Some suggestions

  • To use a smaller model, you can choose a smaller dimension for the projection layers (e.g., 256/64) and the LSTM hidden layer (e.g., 64) in the config.
  • For GPUs with 10 GB of memory instead of 20 GB, you can allocate 0.25 GPU to each GPU worker (@ray.remote(num_gpus=0.25)) in core/reanalyze_worker.py, as sketched below.
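
A minimal illustration of that change; the class body here is a placeholder (the real GPU batch worker in core/reanalyze_worker.py does much more), only the decorator argument matters:

import ray

# Placeholder actor showing a fractional GPU allocation per worker.
@ray.remote(num_gpus=0.25)
class ReanalyzeGPUWorker:
    def __init__(self, config):
        self.config = config

    def run(self):
        # prepare reanalyzed training batches on a quarter of a GPU
        pass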

New environment registration

If you want to apply EfficientZero to a new environment such as MuJoCo, here are the steps for registration (a skeleton follows the list):

  1. Follow the directory config/atari and create dir for the env at config/mujoco.
  2. Implement your MujocoConfig(BaseConfig) class and implement the models as well as your environment wrapper.
  3. Register the case at main.py.
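
A minimal skeleton for steps 1-3; BaseConfig and new_game come from the existing codebase, while the remaining names are illustrative placeholders:

# config/mujoco/__init__.py (sketch)
from core.config import BaseConfig


class MujocoConfig(BaseConfig):
    """Hypothetical config mirroring config/atari for a MuJoCo domain."""

    def get_uniform_network(self):
        # build the EfficientZero networks sized for the MuJoCo observation/action space
        raise NotImplementedError

    def new_game(self, seed=None, **kwargs):
        # return your environment wrapper, the MuJoCo analogue of AtariWrapper(Game)
        raise NotImplementedError

# Step 3: in main.py, dispatch on --case, e.g. add a branch such as
#   elif args.case == 'mujoco': from config.mujoco import MujocoConfig
# (the exact dispatch code in main.py may differ).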

Results

Evaluation with 32 evaluation seeds for 3 different training runs (each with a different seed).

Citation

If you find this repo useful, please cite our paper:

@inproceedings{ye2021mastering,
  title={Mastering Atari Games with Limited Data},
  author={Weirui Ye and Shaohuai Liu and Thanard Kurutach and Pieter Abbeel and Yang Gao},
  booktitle={NeurIPS},
  year={2021}
}

Contact

If you have any questions or want to use the code, please contact [email protected].

Acknowledgement

We are grateful to the following GitHub repos for their valuable codebase implementations:

https://github.com/koulanurag/muzero-pytorch

https://github.com/werner-duvaud/muzero-general

https://github.com/pytorch/ELF

efficientzero's People

Contributors

jl1990, yewr


efficientzero's Issues

Removing Baselines dependency

Hello,
it would be nice to remove the baselines dependency (it requires TensorFlow, whereas the rest of the codebase is written in PyTorch).
Since it is apparently used only for the Atari wrappers, there are a few options (see the sketch after this list):

  • use the ones that are now in Gym (but I'm not sure they are exactly the same)
  • use the ones from Stable-Baselines3 (it also depends on PyTorch, so fewer total dependencies)
  • copy them into the repo
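
A hedged sketch of the second option, using Stable-Baselines3's bundled Atari wrapper (whether it matches the original baselines wrappers exactly would still need to be checked):

import gym
from stable_baselines3.common.atari_wrappers import AtariWrapper

# SB3's AtariWrapper bundles the usual preprocessing (noop reset, frame skip,
# episodic life, fire reset, 84x84 grayscale warp, reward clipping).
env = AtariWrapper(gym.make("BreakoutNoFrameskip-v4"))
obs = env.reset()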

Cannot reproduce Breakout results

Thank you authors for the awesome paper!

I have an issue reproducing the results on Breakout. Instead of the 414 claimed in the paper, I get 362.43 (mean performance averaged over 3 seeds). This is likely the same issue as #21.

reproduce the result of CrazyClimber

The paper shows three runs of CrazyClimber (training-curve screenshots omitted here); they look stable and high-scoring.

However, when I reran the code three times using the given command, I got the following results. The runs were cut off before training finished, but they already show that the reproduced scores are lower than what the paper reported by an order of magnitude.

Could you give some possible explanations why there is a huge difference?

# running 1
[2022-04-06 02:47:47,242][train_test][INFO][log.py>_log] ==> #0          Test Mean Score of CrazyClimberNoFrameskip-v4: 643.75     (max: 1000.0    , min:300.0     , std: 178.42628029525247)
[2022-04-06 03:54:07,033][train_test][INFO][log.py>_log] ==> #10083      Test Mean Score of CrazyClimberNoFrameskip-v4: 3071.875   (max: 3400.0    , min:2100.0    , std: 383.4338070319309)
[2022-04-06 04:55:49,719][train_test][INFO][log.py>_log] ==> #22478      Test Mean Score of CrazyClimberNoFrameskip-v4: 634.375    (max: 1300.0    , min:300.0     , std: 249.51124097923926)
[2022-04-06 05:57:18,634][train_test][INFO][log.py>_log] ==> #34576      Test Mean Score of CrazyClimberNoFrameskip-v4: 3378.125   (max: 5700.0    , min:2700.0    , std: 736.4332857598168)
[2022-04-06 06:53:54,973][train_test][INFO][log.py>_log] ==> #46427      Test Mean Score of CrazyClimberNoFrameskip-v4: 3065.625   (max: 6800.0    , min:2300.0    , std: 761.0064778797878)
[2022-04-06 07:55:10,556][train_test][INFO][log.py>_log] ==> #57917      Test Mean Score of CrazyClimberNoFrameskip-v4: 8534.375   (max: 11700.0   , min:4000.0    , std: 2101.5781830269843)
[2022-04-06 08:51:19,116][train_test][INFO][log.py>_log] ==> #69158      Test Mean Score of CrazyClimberNoFrameskip-v4: 6453.125   (max: 12100.0   , min:3100.0    , std: 2302.849372923683)
[2022-04-06 09:47:16,916][train_test][INFO][log.py>_log] ==> #80299      Test Mean Score of CrazyClimberNoFrameskip-v4: 7246.875   (max: 10300.0   , min:4900.0    , std: 1379.5344266726365)
# running 2
[2022-04-05 11:39:55,952][train_test][INFO][log.py>_log] ==> #0          Test Mean Score of CrazyClimberNoFrameskip-v4: 600.0      (max: 900.0     , min:300.0     , std: 152.0690632574555)
[2022-04-05 12:46:50,652][train_test][INFO][log.py>_log] ==> #10015      Test Mean Score of CrazyClimberNoFrameskip-v4: 393.75     (max: 800.0     , min:100.0     , std: 132.13984069916233)
[2022-04-05 13:41:18,799][train_test][INFO][log.py>_log] ==> #20080      Test Mean Score of CrazyClimberNoFrameskip-v4: 3634.375   (max: 5100.0    , min:1900.0    , std: 705.6067313844164)
[2022-04-05 14:31:57,618][train_test][INFO][log.py>_log] ==> #30013      Test Mean Score of CrazyClimberNoFrameskip-v4: 9634.375   (max: 13100.0   , min:3500.0    , std: 2230.1358387719347)
[2022-04-05 15:35:49,472][train_test][INFO][log.py>_log] ==> #40039      Test Mean Score of CrazyClimberNoFrameskip-v4: 6434.375   (max: 10000.0   , min:3500.0    , std: 1384.875755934445)
# running 3
[2022-04-05 04:43:53,277][train_test][INFO][log.py>_log] ==> #0          Test Mean Score of CrazyClimberNoFrameskip-v4: 615.625    (max: 1000.0    , min:400.0     , std: 146.00807982779583)
[2022-04-05 05:52:29,495][train_test][INFO][log.py>_log] ==> #10033      Test Mean Score of CrazyClimberNoFrameskip-v4: 950.0      (max: 1500.0    , min:700.0     , std: 264.5751311064591)
[2022-04-05 06:55:41,171][train_test][INFO][log.py>_log] ==> #20077      Test Mean Score of CrazyClimberNoFrameskip-v4: 2771.875   (max: 3800.0    , min:1000.0    , std: 850.086162912325)
[2022-04-05 07:58:09,925][train_test][INFO][log.py>_log] ==> #30064      Test Mean Score of CrazyClimberNoFrameskip-v4: 4550.0     (max: 7300.0    , min:2300.0    , std: 1228.3118496538248)
[2022-04-05 08:59:21,098][train_test][INFO][log.py>_log] ==> #40001      Test Mean Score of CrazyClimberNoFrameskip-v4: 9137.5     (max: 12500.0   , min:5000.0    , std: 1880.2842205368847)
[2022-04-05 10:00:18,121][train_test][INFO][log.py>_log] ==> #50019      Test Mean Score of CrazyClimberNoFrameskip-v4: 10393.75   (max: 24000.0   , min:5700.0    , std: 3386.1794012574114)

Envs seem not to work in parallel

Only one data worker is active since config.num_actors=1. Inside a data worker there are several envs, but they are stepped serially rather than in parallel, roughly as follows:
for i in range(env_nums): obs, ori_reward, done, info = env.step(action)
Is it possible to speed up data collection by setting config.num_actors > 1?
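
For illustration only (this is not how the repo currently schedules envs), stepping each env through its own Ray actor would look roughly like this:

import gym
import ray

# One Ray actor per environment; step() calls can then run concurrently.
@ray.remote
class EnvActor:
    def __init__(self, env_name):
        self.env = gym.make(env_name)
        self.env.reset()

    def step(self, action):
        return self.env.step(action)

# actors = [EnvActor.remote("BreakoutNoFrameskip-v4") for _ in range(env_nums)]
# results = ray.get([a.step.remote(act) for a, act in zip(actors, actions)])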

Code for continuous action space

Hi, thanks for the repository. Could you consider releasing the code that supports continuous action spaces for the DMControl 100k benchmark, i.e., the version that discretizes each action dimension? This would be very helpful.
Best

Question about the effect of discount factor and done mask when calculating the target value?

Thanks for your open-sourced code very much.

This is a common definition of a target value in classical RL (the equation screenshot is reconstructed below):
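
A hedged reconstruction of the usual n-step target (my notation: u for observed rewards, v for the bootstrap value, \gamma for the discount, d for the done flag):

z_t = \sum_{i=0}^{k-1} \gamma^{\,i}\, u_{t+i} + \gamma^{\,k}\,(1 - d_{t+k})\, v_{t+k}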

I'm a little confused about the way of calculating target value here in reanalyze_worker.py:

Why don't we multiply the bootstrap value (value_lst here) by discount_factor^td_steps, and why don't we mask the bootstrap value when the target observation is a done state?

Looking forward to your reply!

Question about the transform between true reward and value prefix

Hi,
I was a little confused about how the true reward is obtained from the value prefix in core/ctree/cnode.cpp.

For the function update_tree_q() in Line 256, the true reward is calculated by
float true_reward = node->value_prefix - parent_value_prefix

Suppose we have a root node_1 with its two children (node_2 and node_3).

Before the while loop, we push node_1 onto the node_stack;
On the first iteration of the while loop, we pop node_1, push node_2 and node_3 onto the node_stack, and finally set parent_value_prefix = node_1.value_prefix;

On the second iteration, we pop node_3 (suppose no child of node_3 has been expanded) and set parent_value_prefix = node_3.value_prefix (Line 281);

On the third iteration, we pop node_2. When we calculate the true reward of node_2 in Line 266,
true_reward = node_2.value_prefix - parent_value_prefix = node_2.value_prefix - node_3.value_prefix,

However, the parent of node_2 is node_1, so the true_reward should be node_2.value_prefix - node_1.value_prefix.
So I wonder whether there is a problem with how the variable "parent_value_prefix" is updated, or whether I have misunderstood the code.

Although update_tree_q only updates the min_max statistics, so this may not affect convergence, I wonder whether it would converge faster if the operation were fixed.

procgen

Can this be tried on the Procgen environments?

PyTorch Lightning Support

Thanks for sharing this codebase. I was wondering whether you are planning to support PyTorch Lightning in the future?

what does reward_hidden_c mean in mcts.py?

Hi, in mcts.py lines 35-36, what do reward_hidden_c and reward_hidden_h mean (what are c and h short for)? Why are reward_hidden_c_pool = [reward_hidden_roots[0]] and reward_hidden_h_pool = [reward_hidden_roots[1]]? I find the code difficult to understand; could you add some comments? Many thanks!

WSL2 NVIDIA 3090 or M1 MBP correct environment

Hi, I had trouble identifying the right mix of python and packages to get this to run.

Could you please review/confirm the python version and requirements.txt for either one of these?

  1. Dual RTX 3090 on WSL2
  2. M1 MBP

Is there a docker container for EfficientZero?

Many thanks in advance!

Reward clipping and value transformation

Hello,

Thanks for this great work! I noticed that you choose to clip the reward to [-1, 1] for Atari. I'm wondering what the purpose of applying the value transformation (i.e. scalar_transform) is if the reward is already clipped?

EfficientZero high memory consumption / keeps increasing after replay buffer is full

I am currently experimenting on scaling EfficientZero to learning setups with high-data regimes.

As a first step, I am running experiments on Atari, with a replay buffer of 1M environment steps.
While doing this I observed that RAM consumption keeps increasing long after the replay buffer reached its maximum size.

Here are tensorboard plots on Breakout, for a 600k training steps run (20M environment steps / 80M environment frames):

breakout_high_mem

I perform experiments on cluster computers featuring 4 tesla V100 gpus / 40 cpus and 187GB of RAM.

As you can see, although the maximum replay buffer size ("total_node_num") is reached after 30k training steps, RAM (in %) keeps increasing until around 250k steps, from 80% to 85%.

Ideally, I would also like to increase the batch size. But it seems like the problem gets worse in that setting:

breakout_mem

The orange curves are from the same Breakout experiments, but with a batch size of 512 (instead of 256), and a smaller replay buffer size (0.1M). Here the maximum replay buffer size is obtained at 4k training steps but memory keeps increasing until 100K+ steps.
I understand that a bigger batch means more RAM because more data is processed when updating/doing MCTS, but that does not explain why memory keeps increasing after the replay buffer fills up.

Any ideas on what causes this high RAM consumption, and how we could mitigate it?

Run details

Here are the parameters used for the first experiment I described (pink curves):

Param: {'action_space_size': 4, 'num_actors': 2, 'do_consistency': True, 'use_value_prefix': True, 'off_correction': True,
'gray_scale': False, 'auto_td_steps_ratio': 0.3, 'episode_life': True, 'change_temperature': True, 'init_zero': True,
'state_norm': False, 'clip_reward': True, 'random_start': True, 'cvt_string': True, 'image_based': True,
'max_moves': 27000, 'test_max_moves': 3000, 'history_length': 400, 'num_simulations': 50, 'discount': 0.988053892081,
'max_grad_norm': 5, 'test_interval': 10000, 'test_episodes': 32, 'value_delta_max': 0.01, 'root_dirichlet_alpha': 0.3,
'root_exploration_fraction': 0.25, 'pb_c_base': 19652, 'pb_c_init': 1.25, 'training_steps': 900000, 'last_steps': 20000,
'checkpoint_interval': 100, 'target_model_interval': 200, 'save_ckpt_interval': 100000, 'log_interval': 1000, 'vis_interval': 1000,
'start_transitions': 2000, 'total_transitions': 30000000, 'transition_num': 1.0, 'batch_size': 256, 'num_unroll_steps': 5,
'td_steps': 5, 'frame_skip': 4, 'stacked_observations': 4, 'lstm_hidden_size': 512, 'lstm_horizon_len': 5,
'reward_loss_coeff': 1, 'value_loss_coeff': 0.25, 'policy_loss_coeff': 1, 'consistency_coeff': 2, 'device': 'cuda',
'debug': False, 'seed': 0, 'value_support': <core.config.DiscreteSupport object at 0x152644d101d0>,
'reward_support': <core.config.DiscreteSupport object at 0x152644d10210>, 'use_adam': False, 'weight_decay': 0.0001,
'momentum': 0.9, 'lr_warm_up': 0.01, 'lr_warm_step': 1000, 'lr_init': 0.2, 'lr_decay_rate': 0.1, 'lr_decay_steps': 900000,
'mini_infer_size': 64, 'priority_prob_alpha': 0.6, 'priority_prob_beta': 0.4, 'prioritized_replay_eps': 1e-06,
'image_channel': 3, 'proj_hid': 1024, 'proj_out': 1024, 'pred_hid': 512, 'pred_out': 1024, 'bn_mt': 0.1,
'blocks': 1, 'channels': 64, 'reduced_channels_reward': 16, 'reduced_channels_value': 16, 'reduced_channels_policy': 16,
'resnet_fc_reward_layers': [32], 'resnet_fc_value_layers': [32], 'resnet_fc_policy_layers': [32], 'downsample': True,
'env_name': 'BreakoutNoFrameskip-v4', 'obs_shape': (12, 96, 96), 'case': 'atari', 'amp_type': 'torch_amp',
'use_priority': True, 'use_max_priority': True, 'cpu_actor': 14, 'gpu_actor': 20, 'p_mcts_num': 128,
'use_root_value': False, 'auto_td_steps': 270000.0, 'use_augmentation': True, 'augmentation': ['shift', 'intensity'],
'revisit_policy_search_rate': 0.99}

Zero score on Freeway

I tried to run the code for Atari Freeway using the following command with the default settings in the code:

python main.py --env FreewayNoFrameskip-v4 \
--case atari \
--opr train \
--amp_type torch_amp \
--num_gpus 1 \
--num_cpus 10 \
--cpu_actor 2 \
--gpu_actor 2 \
--force \
--object_store_memory 21474836480 \
--seed=0

I tried two seeds, 0 and 1. Based on the TensorBoard curves, the algorithm seems to receive no reward at all during training: both workers.ori_reward and Train_statistics.target_value_prefix_mean are constantly zero from beginning to end.

From the train/test log, seed 0 got a positive reward (~7.5) at step 0 but no reward at all after that. Seed 1 also got ~7.5 at step 0; of the remaining evaluations, half scored 0 and the other half scored 21.34.

I wonder whether I did something wrong.

Thanks

Wei

EfficientZero V2

I've read your paper EfficientZero V2; really good job extending to continuous control. Would you or your team consider open-sourcing it? I would be very grateful!

Question about getting zero test score when I try to run EfficientZero on BabyAI grid environment

Hello, first of all thanks for your amazing job on EfficientZero.

I tried to adapt EfficientZero to a BabyAI environment ("PutNextLocal"), but it keeps giving me a test score of 0 throughout the 100k-step training process.

I made several modifications in order to adapt to BabyAI "PutNextLocal" env:

  1. I created a dir for the env at config/babyai and implemented BabyAIConfig(BaseConfig). I left every parameter at its Atari default and only changed line 101 from (image_channel, 96, 96) to (image_channel, 7, 7) in config/babyai/__init__.py.
  2. Changed the class name from AtariWrapper(Game) to BabyAIWrapper(Game) and left everything else at the default settings.
  3. Commented out lines 103 to 111, since the grid game does not have ale.
  4. Also commented out lines 235 to 237 (https://github.com/YeWR/EfficientZero/blob/main/core/utils.py#L235) in core/utils.py.
  5. Also modified my bash file accordingly (screenshot omitted).

After running the program with the default Atari-like parameter settings, the TensorBoard curves (screenshots omitted) show the test score staying at 0.
Do you have any suggestions on how to make the right modifications so the program produces reasonable results on BabyAI 'PutNextLocal'?

Thank you so much, and I look forward to hearing from you.

Training is really slow

First of all, congratulations on the great work!

I've been trying to train an agent to play breakout and the training is really slow. This is really confusing to me since, according to the paper, it should take 7 hours to do a full training of 100k steps. My experience has been different:

Running time

  • 4k steps every 8 hours

Hardware:

  • 4 GPU (QUADRORTX6000)
  • 80 CPUs (4 GB ram per CPU)

Running command

python main.py --env atari 
                           --case BreakoutNoFrameskipv4 
                           --opr train 
                           --amp_type torch_amp 
                           --num_gpus 4 
                           --num_cpus 80 
                           --cpu_actor 5 
                           --gpu_actor 13 
                           --seed 2917 
                           --force 
                           --use_priority 
                           --use_max_priority 
                           --debug 
                           --p_mcts_num 1

Do you have any idea or advice so that we can optimize the runtime?

@YeWR

"bash make.sh" failed

Hi Weirui,
I tried to build the dependency but failed. Is there a requirement on the GCC version? The log is below. I also tried changing ">>" to "> >", but the "nullptr" problem was still there. Do you have any suggestions? Thank you!

Best,
Tao

running build_ext
building 'cytree' extension
gcc -pthread -B /home/v-ty/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I. -I/home/v-ty/anaconda3/lib/python3.8/site-packages/numpy/core/include -I/home/v-ty/anaconda3/include/python3.8 -c cytree.cpp -o build/temp.linux-x86_64-3.8/cytree.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from cytree.cpp:653:0:
cnode.cpp:31:9: warning: identifier ‘nullptr’ is a keyword in C++11 [-Wc++0x-compat]
this->ptr_node_pool = nullptr;
^
In file included from /home/v-ty/anaconda3/lib/python3.8/site-packages/numpy/core/include/numpy/ndarraytypes.h:1822:0,
from /home/v-ty/anaconda3/lib/python3.8/site-packages/numpy/core/include/numpy/ndarrayobject.h:12,
from /home/v-ty/anaconda3/lib/python3.8/site-packages/numpy/core/include/numpy/arrayobject.h:4,
from cytree.cpp:659:
/home/v-ty/anaconda3/lib/python3.8/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#warning "Using deprecated NumPy API, disable it with "
^
In file included from cnode.cpp:2:0,
from cytree.cpp:653:
cnode.h:47:42: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> node_pools;
^
cnode.h:53:94: error: ‘>>’ should be ‘> >’ within a nested template argument list
void prepare(float root_exploration_fraction, const std::vector<std::vector> &noises, const std::vector &value_prefixs, const std::vector<std::vector> &policies);
^
cnode.h:53:182: error: ‘>>’ should be ‘> >’ within a nested template argument list
void prepare(float root_exploration_fraction, const std::vector<std::vector> &noises, const std::vector &value_prefixs, const std::vector<std::vector> &policies);
^
cnode.h:54:111: error: ‘>>’ should be ‘> >’ within a nested template argument list
void prepare_no_noise(const std::vector &value_prefixs, const std::vector<std::vector> &policies);
^
cnode.h:56:40: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> get_trajectories();
^
cnode.h:57:40: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> get_distributions();
^
cnode.h:67:43: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector<CNode*>> search_paths;
^
cnode.h:79:184: error: ‘>>’ should be ‘> >’ within a nested template argument list
void cbatch_back_propagate(int hidden_state_index_x, float discount, const std::vector &value_prefixs, const std::vector &values, const std::vector<std::vector> &policies, tools::CMinMaxStatsList min_max_s
^
In file included from cytree.cpp:653:0:
cnode.cpp: In constructor ‘tree::CNode::CNode()’:
cnode.cpp:31:31: error: ‘nullptr’ was not declared in this scope
this->ptr_node_pool = nullptr;
^
In file included from cytree.cpp:653:0:
cnode.cpp: At global scope:
cnode.cpp:204:94: error: ‘>>’ should be ‘> >’ within a nested template argument list
void CRoots::prepare(float root_exploration_fraction, const std::vector<std::vector> &noises, const std::vector &value_prefixs, const std::vector<std::vector> &policies){
^
cnode.cpp:204:182: error: ‘>>’ should be ‘> >’ within a nested template argument list
void CRoots::prepare(float root_exploration_fraction, const std::vector<std::vector> &noises, const std::vector &value_prefixs, const std::vector<std::vector> &policies){
^
cnode.cpp:213:111: error: ‘>>’ should be ‘> >’ within a nested template argument list
void CRoots::prepare_no_noise(const std::vector &value_prefixs, const std::vector<std::vector> &policies){
^
cnode.cpp:226:32: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> CRoots::get_trajectories(){
^
cnode.cpp: In member function ‘std::vector<std::vector > tree::CRoots::get_trajectories()’:
cnode.cpp:227:36: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> trajs;
^
cnode.cpp: At global scope:
cnode.cpp:236:32: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> CRoots::get_distributions(){
^
cnode.cpp: In member function ‘std::vector<std::vector > tree::CRoots::get_distributions()’:
cnode.cpp:237:36: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> distributions;
^
cnode.cpp: At global scope:
cnode.cpp:317:184: error: ‘>>’ should be ‘> >’ within a nested template argument list
void cbatch_back_propagate(int hidden_state_index_x, float discount, const std::vector &value_prefixs, const std::vector &values, const std::vector<std::vector> &policies, tools::CMinMaxStatsList min_max_s
^
cytree.cpp:3382:12: warning: ‘int pyx_pw_6cytree_4Node_1__cinit(PyObject
, PyObject
, PyObject*)’ defined but not used [-Wunused-function]
static int pyx_pw_6cytree_4Node_1__cinit(PyObject *__pyx_v_self, PyObject *__pyx_args, PyObject *__pyx_kwds) {
^
error: command 'gcc' failed with exit status 1

Using custom gym environment

I wonder whether there is a tutorial on how to add a custom gym environment in order to use the EfficientZero algorithm?
Where is the model saved after training?
How do I use the saved model?
Is it possible to use a custom environment with a 3-dimensional np array as the observation (state)?
Peter

License?

If I may ask, would it be possible to add a license?

Thanks!

Question: Why not reanalyze 100% policy targets?

Hi there,

First of all, great work and thank you for opensourcing your code!

I have a question regarding reanalyze: you chose to reanalyze 99% of policy targets and 100% of value targets. I am just curious about the reason behind this choice. Did you try reanalyzing 100% of the policy targets? Did it hurt the performance?

Thank you!

Slight discrepancy with implementation of value scaling

Hey, firstly just wanted to say thank you because this is an amazing repo for understanding how MuZero/EfficientZero work in detail!

I've been trying to dig into exactly how the value prediction is done as it seems like a pretty significant detail that is hidden away in an appendix and I think there seems to be a slight discrepancy (that probably doesn't make much difference but is maybe still worth highlighting).

In the original paper (https://arxiv.org/pdf/1805.11593.pdf) they define the scaling function with the εx term outside the sign bracket, and the inverse function is given by proposition A.2 (iii); in the MuZero appendix, however, the final εx term sits inside the bracket.
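
Reconstructing the two formulas from the description above (the only difference is whether the \epsilon x term sits inside or outside the sign bracket):

% Pohlen et al. (2018), with the inverse given by Prop. A.2 (iii):
h(x) = \operatorname{sign}(x)\left(\sqrt{|x| + 1} - 1\right) + \epsilon x

% MuZero appendix (final term inside the bracket):
h(x) = \operatorname{sign}(x)\left(\sqrt{|x| + 1} - 1 + \epsilon x\right)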

Unless I'm mistaken, in the code you've used the MuZero version of h(x), but for the inverse formula you've used the formula given in proposition A.2 (iii) of the first paper - which won't quite be correct anymore, right?

Just to show the discrepancy - if I look at the following code:

import torch

def scalar_transform(x, epsilon=0.001):
    sign = torch.ones(x.shape).float().to(x.device)
    sign[x < 0] = -1.0
    output = sign * (torch.sqrt(torch.abs(x) + 1) - 1 + epsilon * x)
    return output

def inverse_scalar_transform(value, epsilon=0.001):
    sign = torch.ones(value.shape).float().to(value.device)
    sign[value < 0] = -1.0
    output = (((torch.sqrt(1 + 4 * epsilon * (torch.abs(value) + 1 + epsilon)) - 1) / (2 * epsilon)) ** 2 - 1)
    output = sign * output
    return output


a = torch.randn(1000)
b = scalar_transform(a)
c = inverse_scalar_transform(b)

print(torch.sum(torch.abs(a-c)))

which is how the functions are implemented in this codebase, I get a value of ~2.4 printed, whereas if I change the scalar transform to match the one in the first paper I get a value of ~0.04.

All memory seems on the first GPU

Hi, I found something weird when training EfficientZero. I trained the agent on a P40 server with four 24G GPUs and 28 CPUs, but all of the allocated memory was on the first GPU even though I had set CUDA_VISIBLE_DEVICES=0,1,2,3. I tried changing @ray.remote(num_gpus), but the problem was still there. Do you have any suggestions? Thank you!

How to use with SLURM

Any guidance for using with SLURM? Certain actors are failing

When I run

srun -p compsci-gpu --gres=gpu:4 --cpus-per-gpu=5 --mem=24G --pty bash

Followed by:

python main.py --env BreakoutNoFrameskip-v4 --case atari --opr train --amp_type torch_amp --num_gpus 1 --num_cpus 10 --cpu_actor 1 --gpu_actor 1 --force

I get the following warning:

WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 135095644160 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.

Followed by the task failing:

2022-12-22 10:38:02,577 WARNING worker.py:1072 -- The node with node id 67f743d808b7bd16d45063d18dadf1b5cbb39e7d has been marked dead because the detector has missed too many heartbeats from it.

E1222 10:38:02.612172 8087 8433 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=core.reanalyze_worker, class_name=BatchWorker_CPU, function_name=run, function_hash=}, task_id=d251967856448ceb88866c7d01000000, task_name=BatchWorker_CPU.run(), job_id=01000000, num_args=0, num_returns=2, actor_task_spec={actor_id=88866c7d01000000, actor_caller_id=ffffffffffffffffffffffff01000000, actor_counter=0}

I am not sure how to parse the error; any advice? What #SBATCH headings do you recommend using with the provided train.sh? Thank you!

How to evaluate the model

Thanks for your great work!

When I run your code, I find that scores from the test script are always a little higher than the scores from the evaluation stage during training (in training, the model is tested every 10k steps).

Here are some results I got from the scripts; the left number is from the train script and the right from the test script.

CrazyClimber 7246 9603
BankHeist 419 454

I have glanced through the two bash scripts and the code. In my understanding, both scripts evaluate agents in exactly the same way: the agent is evaluated with 32 seeds and the mean of the 32 scores is reported.

So I have two questions,

  1. Why are scores from the test script always a little higher than scores from the evaluation stage during training?

  2. Which script did you use to get the results in the paper?

Looking forward to your reply.

Clarification on the atari environment?

Hi all,

Thanks for releasing the code.
Could you provide some additional information on the exact setting you are using with respect to the atari environment? From the code, it seems that you are using the NoFrameSkip-v4 version of the gym env, which, as far as I can tell, implies:

  • You are taking an action every frame, whereas standard evaluation protocol uses a frameskip of 4, meaning taking an action only every fourth frame
  • Your environment is fully deterministic, in particular there is no sticky action (repeat_action_probability=0). As far as I can tell, some of the methods that you are comparing to, such as SGI, do use sticky actions.

Could you please clarify?

Thanks in advance.

Question about whether need to train multiple agents for different games

Thank you very much for your open-sourced code.
Recently, I wanted to apply the model trained on Breakout to other games, but different games have different action spaces, which leads to errors during testing: the parameter dimensions for Breakout are inconsistent with those of other games. I would like to ask whether each game needs a separately trained agent. I really hope to get your answer, thank you.

reproduce results for other environment

Hi, Nice work! Many thanks for the open-source code.

If I want to reproduce results for environments other than BreakoutNoFrameskip-v4, what env name (especially which version suffix, like -v4) should I pass in?

Thanks!

The first selfplay worker uses the same seed for all parallel environments

I might have found an unexpected behavior in how parallel training environments are being seeded.

I am referring to this line:

envs = [self.config.new_game(self.config.seed + self.rank * i) for i in range(env_nums)]

Because the rank of the first selfplay worker is 0, parallel environments are being initialized with the same seed, which might reduce training data diversity.

We could go for a simple fix like replacing self.rank by (self.rank + 1), however this is still problematic if considering multiple workers, as there will be seed overlap between them anyway.

A good option might be to sample a seed for each parallel environment using numpy (which is seeded before launching data workers). For instance:

envs = [self.config.new_game(np.random.randint(10**9)) for i in range(env_nums)]

ray warning

(scheduler +2m13s) Warning: The following resource request cannot be scheduled right now: {'GPU': 0.125, 'CPU': 0.5}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
I don't know whether you have encountered this problem or how to solve it; after running for a while, training stops automatically. Thank you so much.

Question about the effect of torch_amp

Hi,

First of all, thank you for opensourcing your nice code!

I have a question regarding the effect of torch_amp: I tested the training of EfficientZero with and without torch_amp on PongNoFrameskip-v4 on a k8s machine, keeping all other settings the same for a fair comparison. I found that using torch.amp is a little slower than not using it, which is counterintuitive.
In the plots (omitted here), the blue line is the result without torch_amp and the orange line is the result with torch_amp.

Could you share some of your experimental results and insights on whether to use torch_amp or not?

Thanks a lot!

Question about the dynamics network

Hi,

I was just wondering if you could explain/give some motivation for why the dynamics network works as it does.

I'm looking at a simple ATARI example and when I'm inside:
def dynamics(self, encoded_state, reward_hidden, action):

the encoded state is [2, 64, 6, 6] (batch size of 2, just as a test), and the actions are [2, 1] (integers between 1 and 4).

You then define "actions_one_hot" as torch.ones(2, 1, 6, 6) and say:
actions_one_hot = actions[:, :, None, None] * actions_one_hot / self.action_space_size
which gives actions_one_hot as [2, 1, 6, 6], with each action value copied along the final two dimensions (so each action value is repeated 36 times here). Then you concatenate with the encoded state along dim=1 to give a final state of shape [2, 65, 6, 6].
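
A self-contained sketch of the shapes described above (values and names follow this description, not the repo verbatim):

import torch

batch, channels, h, w = 2, 64, 6, 6
action_space_size = 4

encoded_state = torch.randn(batch, channels, h, w)           # [2, 64, 6, 6]
actions = torch.randint(0, action_space_size, (batch, 1))    # [2, 1]

# Broadcast each scalar action over a 6x6 plane and rescale by the action space size.
actions_one_hot = torch.ones(batch, 1, h, w)
actions_one_hot = actions[:, :, None, None] * actions_one_hot / action_space_size

# Concatenate the action plane with the encoded state along the channel dimension.
state_action = torch.cat((encoded_state, actions_one_hot), dim=1)
print(state_action.shape)  # torch.Size([2, 65, 6, 6])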

Is this a standard thing to do/something that's been done elsewhere? It just feels a bit weird to me. Firstly, the actions are not "one hot encoded" here, so maybe the variable names aren't perfect (but that doesn't really matter I guess). I suppose it makes sense in that you probably want to be able to apply convolutions to the joint state/action within the dynamics network. And I guess with n_actions=4 this is fine, but it feels like this approach would probably break with a larger discrete action space, right?

Anyway if you have the time I'd be interested to hear your motivation/reasoning behind this, thanks!

Question about the effect of BatchNorm?

I found that some BatchNorm (BN) ops are used in config/atari/model.py, but BN is generally not used in RL, so I ran some ablations with and without BN. The results (plot omitted) show that performance is unstable and sometimes even bad without BN.
Could you give some reasons or insights about these results?

Question about the index of pad_child_visits_lst in selfplay_worker.py

Thank you very much for your open-sourced code.

I am very confused about this code segment in the put_last_trajectory method in selfplay_worker.py:

In Line 69, why is it
pad_child_visits_lst = game_histories[i].child_visits[beg_index:end_index] rather than
pad_child_visits_lst = game_histories[i].child_visits[:self.config.num_unroll_steps]?

In my understanding, game_histories[i].child_visits[0] is the child_visits of the stacked obs game_histories[i].obs_history[beg_index].

Is this a bug?

Looking forward to your reply!

EfficientZero doesn't seem to be training

Hi, first of all congratulations on the great work!

I haven't managed to train an agent yet using the EfficientZero framework. The command I'm using to train is the following:

python3 main.py  --env BreakoutNoFrameskip-v4 
                 --case atari 
                 --opr train 
                 --amp_type torch_amp 
                 --num_gpus 4 
                 --num_cpus 32 
                 --cpu_actor 12 
                 --gpu_actor 28 
                 --force 
                 --use_priority 
                 --use_max_priority 
                 --debug

In a cluster with the following architecture:

  • 32 CPUs, each with 8 GB ram.
  • 4 16GB teslaV100 gpus:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   32C    P0    52W / 300W |  12836MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   31C    P0    51W / 300W |  11373MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   31C    P0    54W / 300W |  10004MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P0    55W / 300W |   8529MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The problem I'm facing is that even after a while of training there's only the following log:

(pid=52926) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=52926) [Powered by Stella]
(pid=52926) Start evaluation at step 0.
(pid=52926) Step 0, test scores: 
(pid=52926) [5. 0. 5. 2. 0. 2. 0. 9. 0. 0. 0. 2. 2. 4. 0. 2. 0. 0. 0. 0. 2. 2. 0. 5.
(pid=52926)  0. 0. 0. 5. 2. 0. 2. 5.]

Also, the results folder of the experiment is mostly empty; I only have a train.log with the initial parameters.

I'm not sure whether this is just a matter of waiting for a long time or whether something in the inner workings is stuck (it looks like the batch_storage from the main train loop is always empty, since we haven't entered the train phase yet).

Something I find really weird is that time passes but the GPU Memory-Usage stays exactly the same, which makes me think something is off.

Would appreciate any advice in order to make this work.
Thanks in advance!

100% Offline RL use-case

Hey,

is there a way to use your implementation with a fixed MDP dataset instead of an environment for 100% offline RL?
