yewr / efficientzero
Open-source codebase for EfficientZero, from "Mastering Atari Games with Limited Data" at NeurIPS 2021.
License: GNU General Public License v3.0
First of all, congratulations on the great work!
I've been trying to train an agent to play Breakout and the training is really slow. This is confusing to me because, according to the paper, a full training run of 100k steps should take about 7 hours. My experience has been different:
Running time
Hardware:
Running command
python main.py --env BreakoutNoFrameskip-v4 \
    --case atari \
    --opr train \
    --amp_type torch_amp \
    --num_gpus 4 \
    --num_cpus 80 \
    --cpu_actor 5 \
    --gpu_actor 13 \
    --seed 2917 \
    --force \
    --use_priority \
    --use_max_priority \
    --debug \
    --p_mcts_num 1
Do you have any idea or advice so that we can optimize the runtime?
Thank you very much for your open-sourced code.
I'm a little confused about the reason for the identity connection of state encoding in DynamicsNetwork in model.py:
Why do we add this identity connection on the state encoding rather than on the action encoding, and what is its empirical impact on Atari results?
Looking forward to your reply!
Hi, I found something weird when training EfficientZero. I trained the agent on a P40 server with 4 24GB GPUs and 28 CPUs, but all of the allocated memory ended up on the first GPU even though I had set CUDA_VISIBLE_DEVICES=0,1,2,3. I tried changing @ray.remote(num_gpus), but the problem persisted. Do you have any suggestions? Thank you!
Any guidance for using this with SLURM? Certain actors are failing.
When I run
srun -p compsci-gpu --gres=gpu:4 --cpus-per-gpu=5 --mem=24G --pty bash
Followed by:
python main.py --env BreakoutNoFrameskip-v4 --case atari --opr train --amp_type torch_amp --num_gpus 1 --num_cpus 10 --cpu_actor 1 --gpu_actor 1 --force
I get the following warning:
WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 135095644160 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
Followed by the task failing:
2022-12-22 10:38:02,577 WARNING worker.py:1072 -- The node with node id 67f743d808b7bd16d45063d18dadf1b5cbb39e7d has been marked dead because the detector has missed too many heartbeats from it.
E1222 10:38:02.612172 8087 8433 task_manager.cc:323] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=core.reanalyze_worker, class_name=BatchWorker_CPU, function_name=run, function_hash=}, task_id=d251967856448ceb88866c7d01000000, task_name=BatchWorker_CPU.run(), job_id=01000000, num_args=0, num_returns=2, actor_task_spec={actor_id=88866c7d01000000, actor_caller_id=ffffffffffffffffffffffff01000000, actor_counter=0}
I am not sure how to parse the error. Any advice? What #SBATCH headings do you recommend using in the provided train.sh? Thank you!
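For reference, here is the sort of #SBATCH header I have been experimenting with; it just mirrors the resources in the srun command above, so please take it as a guess rather than a recommendation:

#!/bin/bash
#SBATCH --partition=compsci-gpu
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-gpu=5
#SBATCH --mem=24G
#SBATCH --job-name=efficientzero

python main.py --env BreakoutNoFrameskip-v4 --case atari --opr train --amp_type torch_amp \
    --num_gpus 1 --num_cpus 10 --cpu_actor 1 --gpu_actor 1 --force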
Thank you very much for your open-sourced code.
Recently I wanted to apply the model trained for Breakout to other games, but I found that different games have different action spaces, which leads to errors during testing: the parameter dimensions for Breakout are inconsistent with those of other games. I would like to ask whether each game needs a separately trained agent. I really hope to get your answer, thank you.
I can never run the test phase to completion: most of the time training stops when the test phase reaches about 3%-7%, and no error is reported. Could you please tell me why this happens, and what I should do?
I found that some BatchNorm (BN) ops are used in config/atari/model.py, but BN is generally not used in RL, so I ran some ablations with and without BN. The following results show that performance is unstable and sometimes even bad without BN.
Could you give some reasons or insights about these results?
I might have found an unexpected behavior in how parallel training environments are being seeded.
I am referring to this line:
EfficientZero/core/selfplay_worker.py
Line 112 in c533ebf
Because the rank of the first selfplay worker is 0, parallel environments are being initialized with the same seed, which might reduce training data diversity.
We could go for a simple fix like replacing self.rank with (self.rank + 1); however, this is still problematic when considering multiple workers, as there will be seed overlap between them anyway.
A good option might be to sample a seed for each parallel environment using numpy (which is seeded before launching data workers). For instance:
envs = [self.config.new_game(np.random.randint(10**9)) for i in range(env_nums)]
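Or, to also avoid overlap between workers, the seed could combine the base seed, the worker rank and the environment index; a rough sketch (assuming, as above, that new_game takes a seed):

# rough sketch: one RNG per worker derived from the base seed and the rank,
# then a distinct seed per parallel environment
rng = np.random.default_rng(self.config.seed + self.rank)
envs = [self.config.new_game(int(rng.integers(10**9))) for _ in range(env_nums)]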
I've read your paper EfficientZero V2; really good job extending to continuous control. Would you or your team consider open-sourcing it? I would be very grateful!
Hello,
It would be nice to remove the baselines dependency (it requires TensorFlow, whereas the rest of the codebase is written with PyTorch).
Since it is apparently used only for the Atari wrappers, there are a few options:
Hi, Nice work! Many thanks for the open-source code.
If I want to reproduce results for environments other than BreakoutNoFrameskip-v4, what env name (especially the version suffix, like -v4) should I pass in?
Thanks!
Only one data worker is active since config.num_actors=1. Inside that data worker there are several envs, but they step in serial rather than in parallel, as follows:
for i in range(env_nums):
    obs, ori_reward, done, info = env.step(action)
Is it possible to speed up data collection by setting config.num_actors > 1?
(scheduler +2m13s) Warning: The following resource request cannot be scheduled right now: {'GPU': 0.125, 'CPU': 0.5}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
I don't know whether you have encountered this problem or how to solve it; the run automatically stops after a period of time. Thank you so much.
Thanks for sharing this codebase. I was wondering if you are planning to support PyTorch Lightning in the future?
Hello,
Thanks for this great work! I noticed that you clip the reward to [-1, 1] for Atari. I'm wondering what the purpose of applying the value transformation (i.e. scalar_transform) is if the reward is already clipped?
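For reference, the transform I am referring to (as implemented in the code, with epsilon = 0.001) is:

h(x) = \mathrm{sign}(x)\left(\sqrt{|x| + 1} - 1 + \epsilon x\right)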
Hi,
First of all, thank you for opensourcing your nice code!
I have a question regarding the effect of torch_amp: I tested the training process of EfficientZero with and without torch_amp on PongNoFrameskip-v4 on a k8s machine, keeping all other settings the same for a fair comparison. I found that using torch_amp is a little slower than not using it, which is counterintuitive.
where the blue line is the result not using torch_amp, and the orange line is the result using torch_amp.
Could you provide some of your experimental results and insights about whether to use torch_amp or not?
Thanks a lot!
Hi there,
First of all, great work and thank you for opensourcing your code!
I have a question regarding reanalyze: you chose to reanalyze 99% of policy targets and 100% of value targets. I am just curious about the reason behind this choice. Did you try reanalyzing 100% of the policy targets? Did it hurt the performance?
Thank you!
If I may ask, would it be possible to add a license?
Thanks!
Hi, in mcts.py, lines 35-36, what do reward_hidden_c and reward_hidden_h mean? (What are c and h short for?) Why is reward_hidden_c_pool = [reward_hidden_roots[0]] and reward_hidden_h_pool = [reward_hidden_roots[1]]? I find the code difficult to understand; could you add some comments? Many thanks!
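My current guess is that c and h stand for the cell and hidden states of an LSTM, in the same sense as PyTorch's LSTM API, but I'd appreciate confirmation. For example:

import torch

lstm = torch.nn.LSTM(input_size=512, hidden_size=512)
x = torch.randn(1, 4, 512)        # (seq_len, batch, features)
output, (h_n, c_n) = lstm(x)      # h_n: hidden state, c_n: cell state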
Hi, I had trouble identifying the right mix of python and packages to get this to run.
Could you please review/confirm the python version and requirements.txt for either one of these?
Is there a docker container for EfficientZero?
Many thanks in advance!
I tried to run the code for Atari Freeway using the following command with the default settings in the code:
python main.py --env FreewayNoFrameskip-v4 \
--case atari \
--opr train \
--amp_type torch_amp \
--num_gpus 1 \
--num_cpus 10 \
--cpu_actor 2 \
--gpu_actor 2 \
--force \
--object_store_memory 21474836480 \
--seed=0
I tried two seeds, 0 and 1. Based on the tensorboard curves, the algorithm seems to receive no reward at all during training: both workers.ori_reward and Train_statistics.target_value_prefix_mean are constant zero from beginning to end.
From train_test_log, seed 0 got a positive reward (~7.5) at step 0 but no reward at all after that. Seed 1 also got ~7.5 at step 0; half of the remaining evaluations got 0 and the other half got 21.34.
I wonder whether I did something wrong.
Thanks
Wei
Thank you very much for your open-sourced code.
This is a common definition of a target value in classical RL:
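(reconstructing the formula I mean, with discount gamma, rewards u, and a bootstrap value v after k = td_steps steps)

z_t = \sum_{i=0}^{k-1} \gamma^{i} u_{t+i} + \gamma^{k} v_{t+k}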
I'm a little confused about the way of calculating target value here in reanalyze_worker.py:
Why don't we multiply the bootstrap value (here value_lst) by discount_factor^td_steps, and why don't we mask the bootstrap value when the target obs is a terminal (done) state?
Looking forward to your reply!
Thanks for your great work!
When I run your code, I find that scores from the test script are always a little higher than scores from the evaluation stage during training (during training, the model is tested every 10k steps).
Here are some results I got from the scripts; the left number is from the train script and the right number is from the test script.
CrazyClimber: 7246 (train) / 9603 (test)
BankHeist: 419 (train) / 454 (test)
I have looked over both scripts and the code. In my understanding, the two scripts evaluate agents in exactly the same way: agents are evaluated with 32 seeds and the mean of the 32 scores is taken.
So I have two questions,
Why are the scores from the test script always a little higher than those from the evaluation stage during training?
Which script did you use to get the results in the paper?
Looking forward to your reply.
Hi, thanks for the repository. Could you consider releasing the code that supports continuous action spaces for the DMControl 100k benchmark, please? That is, the code that uses the discretization of each dimension. This would be very helpful.
Best
I wonder if there is any tutorial on how to add my own custom gym environment to use the EfficientZero algorithm?
Where is the model saved after training?
How do I use the saved model?
Is it possible to use my own custom environment with a 3-dimensional np array as the observation (state)? (A sketch of what I mean is below.)
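For concreteness, the kind of environment I have in mind is a minimal gym.Env like this (the observation shape and action count are just placeholders):

import gym
import numpy as np
from gym import spaces

class MyCustomEnv(gym.Env):
    """Minimal sketch of a custom env with a 3-dimensional uint8 observation (H, W, C)."""

    def __init__(self):
        self.observation_space = spaces.Box(low=0, high=255, shape=(96, 96, 3), dtype=np.uint8)
        self.action_space = spaces.Discrete(4)

    def reset(self):
        # return the initial observation
        return np.zeros(self.observation_space.shape, dtype=np.uint8)

    def step(self, action):
        obs = np.zeros(self.observation_space.shape, dtype=np.uint8)
        reward, done, info = 0.0, False, {}
        return obs, reward, done, info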
Peter
Hey, firstly just wanted to say thank you because this is an amazing repo for understanding how MuZero/EfficientZero work in detail!
I've been trying to dig into exactly how the value prediction is done, as it seems like a pretty significant detail that is hidden away in an appendix, and I think there is a slight discrepancy (which probably doesn't make much difference but is maybe still worth highlighting).
In the original paper (https://arxiv.org/pdf/1805.11593.pdf) they define the scaling function as:
with the inverse function given by proposition A.2 (iii).
but in the MuZero appendix they have:
(with the final term inside the bracket).
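Writing out the two versions as I understand them (epsilon = 0.001 in both):

Original paper (prop. A.2): h(x) = \mathrm{sign}(x)\left(\sqrt{|x| + 1} - 1\right) + \epsilon x
MuZero appendix: h(x) = \mathrm{sign}(x)\left(\sqrt{|x| + 1} - 1 + \epsilon x\right)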
Unless I'm mistaken, in the code you've used the MuZero version of h(x), but for the inverse formula you've used the formula given in proposition A.2 (iii) of the first paper - which won't quite be correct anymore, right?
Just to show the discrepancy - if I look at the following code:
import torch

def scalar_transform(x, epsilon=0.001):
    sign = torch.ones(x.shape).float().to(x.device)
    sign[x < 0] = -1.0
    output = sign * (torch.sqrt(torch.abs(x) + 1) - 1 + epsilon * x)
    return output

def inverse_scalar_transform(value, epsilon=0.001):
    sign = torch.ones(value.shape).float().to(value.device)
    sign[value < 0] = -1.0
    output = (((torch.sqrt(1 + 4 * epsilon * (torch.abs(value) + 1 + epsilon)) - 1) / (2 * epsilon)) ** 2 - 1)
    output = sign * output
    return output

a = torch.randn(1000)
b = scalar_transform(a)
c = inverse_scalar_transform(b)
print(torch.sum(torch.abs(a - c)))
which is how the functions are implemented in this codebase; I get a value of ~2.4 printed, whilst if I change the scalar transform to match the first paper I get a value of ~0.04.
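Concretely, the change I tested (moving the epsilon term outside the bracket, to match the first paper) was roughly:

def scalar_transform_paper(x, epsilon=0.001):
    # epsilon * x added outside the bracket, as in the original paper
    sign = torch.ones(x.shape).float().to(x.device)
    sign[x < 0] = -1.0
    output = sign * (torch.sqrt(torch.abs(x) + 1) - 1) + epsilon * x
    return output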
Why prepare policy targets and value targets separately? Both functions run an initial inference and an MCTS search to get either the root distributions or the root values. Why not get both from a single initial inference and MCTS search to save computation?
Thank you very much for your open-sourced code.
I am very confused about this code segment in the put_last_trajectory method in selfplay_worker.py.
In line 69, why is it
pad_child_visits_lst = game_histories[i].child_visits[beg_index:end_index]
rather than
pad_child_visits_lst = game_histories[i].child_visits[:self.config.num_unroll_steps]?
In my understanding, game_histories[i].child_visits[0] is the child_visits of the stacked obs game_histories[i].obs_history[beg_index].
Is this a bug?
Looking forward to your reply!
Hi all,
Thanks for releasing the code.
Could you provide some additional information on the exact setting you are using with respect to the Atari environment? From the code, it seems that you are using the NoFrameskip-v4 version of the gym env, which, as far as I can tell, implies:
Could you please clarify?
Thanks in advance.
Thank you authors for the awesome paper!
I have an issue reproducing the results on Breakout. Instead of the 414 claimed in the paper, I get 362.43 (mean performance averaged over 3 seeds). This is likely the same issue as #21.
Hello, first of all thanks for your amazing job on EfficientZero.
I tried to adapt EfficientZero to a BabyAI environment, "PutNextLocal", but it just keeps giving me a test score of 0 during the 100k-step training process.
I made several modifications in order to adapt to the BabyAI "PutNextLocal" env:
- config/babyai: implemented BabyAIConfig(BaseConfig). I left every parameter at its default, just like Atari, and only changed line 101 of config/babyai/__init__.py from (image_channel, 96, 96) to (image_channel, 7, 7).
- core/utils.py
After running the program with the default Atari-like parameter settings, the tensorboard looks like:
Do you have any suggestions about how to make a correct modification and make the program produce reasonable result on babyai 'PutNextLocal'?
Thank you so much, and looking forward to hearing from you.
Hi,
I was just wondering if you could explain/give some motivation for why the dynamics network works as it does.
I'm looking at a simple ATARI example and when I'm inside:
def dynamics(self, encoded_state, reward_hidden, action):
the encoded state is [2, 64, 6, 6] (batch size of 2, just as a test), and the actions tensor is [2, 1] (integers between 1 and 4).
You then define "actions_one_hot" as torch.ones(2, 1, 6, 6)
and say:
actions_one_hot = actions[:, :, None, None] * actions_one_hot / self.action_space_size
which gives actions_one_hot as [2, 1, 6, 6], with the values copied along the final two dimensions (so each action value is copied 36 times here). Then you concatenate with the encoded state along dim=1 to give a final state which is [2, 65, 6, 6].
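To make sure I am reading it right, here is a minimal standalone sketch of that encoding (shapes taken from my test above):

import torch

batch, channels, h, w = 2, 64, 6, 6
action_space_size = 4

encoded_state = torch.randn(batch, channels, h, w)            # [2, 64, 6, 6]
actions = torch.randint(0, action_space_size, (batch, 1))     # [2, 1]

# broadcast each scalar action over a full (h, w) plane and rescale
action_plane = torch.ones(batch, 1, h, w)
action_plane = actions[:, :, None, None].float() * action_plane / action_space_size

# concatenate with the state along the channel dimension -> [2, 65, 6, 6]
x = torch.cat((encoded_state, action_plane), dim=1)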
Is this a standard thing to do/something that's been done elsewhere? It just feels a bit weird to me. Firstly, the actions are not "one hot encoded" here, so maybe the variable names aren't perfect (but that doesn't really matter I guess). I suppose it makes sense in that you probably want to be able to apply convolutions to the joint state/action within the dynamics network. And I guess with n_actions=4 this is fine, but it feels like this approach would probably break with a larger discrete action space, right?
Anyway if you have the time I'd be interested to hear your motivation/reasoning behind this, thanks!
Hey,
is there a way to use your implementation with a fixed MDP dataset instead of an environment for 100% offline RL?
I am currently experimenting on scaling EfficientZero to learning setups with high-data regimes.
As a first step, I am running experiments on Atari, with a replay buffer of 1M environment steps.
While doing this I observed that RAM consumption keeps increasing long after the replay buffer reached its maximum size.
Here are tensorboard plots on Breakout, for a 600k training steps run (20M environment steps / 80M environment frames):
I perform experiments on cluster computers featuring 4 tesla V100 gpus / 40 cpus and 187GB of RAM.
As you can see, although the maximum replay buffer size ("total_node_num") is reached after 30k training steps, RAM (in %) keeps increasing until around 250k steps, from 80% to 85%.
Ideally, I would also like to increase the batch size. But it seems like the problem gets worse in that setting:
The orange curves are from the same Breakout experiment, but with a batch size of 512 (instead of 256) and a smaller replay buffer (0.1M). Here the maximum replay buffer size is reached at 4k training steps, but memory keeps increasing until 100k+ steps.
I understand that a bigger batch means more RAM because more data is processed when updating/doing MCTS, but that does not explain why memory keeps increasing after the replay buffer fills up.
Any ideas on what causes this high RAM consumption, and how we could mitigate it?
Here are the parameters used for the first experiment I described (pink curves):
Param: {'action_space_size': 4, 'num_actors': 2, 'do_consistency': True, 'use_value_prefix': True, 'off_correction': True, 'gray_scale': False,
'auto_td_steps_ratio': 0.3, 'episode_life': True, 'change_temperature': True, 'init_zero': True, 'state_norm': False, 'clip_reward': True, 'random_start': True, 'cvt_string': True, 'image_based': True,
'max_moves': 27000, 'test_max_moves': 3000, 'history_length': 400, 'num_simulations': 50, 'discount': 0.988053892081, 'max_grad_norm': 5, 'test_interval': 10000, 'test_episodes': 32,
'value_delta_max': 0.01, 'root_dirichlet_alpha': 0.3, 'root_exploration_fraction': 0.25, 'pb_c_base': 19652, 'pb_c_init': 1.25,
'training_steps': 900000, 'last_steps': 20000, 'checkpoint_interval': 100, 'target_model_interval': 200, 'save_ckpt_interval': 100000, 'log_interval': 1000, 'vis_interval': 1000,
'start_transitions': 2000, 'total_transitions': 30000000, 'transition_num': 1.0, 'batch_size': 256, 'num_unroll_steps': 5, 'td_steps': 5, 'frame_skip': 4, 'stacked_observations': 4,
'lstm_hidden_size': 512, 'lstm_horizon_len': 5, 'reward_loss_coeff': 1, 'value_loss_coeff': 0.25, 'policy_loss_coeff': 1, 'consistency_coeff': 2, 'device': 'cuda', 'debug': False, 'seed': 0,
'value_support': <core.config.DiscreteSupport object at 0x152644d101d0>, 'reward_support': <core.config.DiscreteSupport object at 0x152644d10210>,
'use_adam': False, 'weight_decay': 0.0001, 'momentum': 0.9, 'lr_warm_up': 0.01, 'lr_warm_step': 1000, 'lr_init': 0.2, 'lr_decay_rate': 0.1, 'lr_decay_steps': 900000, 'mini_infer_size': 64,
'priority_prob_alpha': 0.6, 'priority_prob_beta': 0.4, 'prioritized_replay_eps': 1e-06, 'image_channel': 3, 'proj_hid': 1024, 'proj_out': 1024, 'pred_hid': 512, 'pred_out': 1024, 'bn_mt': 0.1,
'blocks': 1, 'channels': 64, 'reduced_channels_reward': 16, 'reduced_channels_value': 16, 'reduced_channels_policy': 16, 'resnet_fc_reward_layers': [32], 'resnet_fc_value_layers': [32], 'resnet_fc_policy_layers': [32], 'downsample': True,
'env_name': 'BreakoutNoFrameskip-v4', 'obs_shape': (12, 96, 96), 'case': 'atari', 'amp_type': 'torch_amp', 'use_priority': True, 'use_max_priority': True, 'cpu_actor': 14, 'gpu_actor': 20, 'p_mcts_num': 128,
'use_root_value': False, 'auto_td_steps': 270000.0, 'use_augmentation': True, 'augmentation': ['shift', 'intensity'], 'revisit_policy_search_rate': 0.99}
Hi Weirui,
I tried to build the dependency but failed. Is there a requirement on the GCC version? The log is below. I also tried modifying ">>" to "> >", but the "nullptr" problem persisted. Do you have any suggestions? Thank you!
Best,
Tao
running build_ext
building 'cytree' extension
gcc -pthread -B /home/v-ty/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I. -I/home/v-ty/anaconda3/lib/python3.8/site-packages/numpy/core/include -I/home/v-ty/anaconda3/include/python3.8 -c cytree.cpp -o build/temp.linux-x86_64-3.8/cytree.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from cytree.cpp:653:0:
cnode.cpp:31:9: warning: identifier ‘nullptr’ is a keyword in C++11 [-Wc++0x-compat]
this->ptr_node_pool = nullptr;
^
In file included from /home/v-ty/anaconda3/lib/python3.8/site-packages/numpy/core/include/numpy/ndarraytypes.h:1822:0,
from /home/v-ty/anaconda3/lib/python3.8/site-packages/numpy/core/include/numpy/ndarrayobject.h:12,
from /home/v-ty/anaconda3/lib/python3.8/site-packages/numpy/core/include/numpy/arrayobject.h:4,
from cytree.cpp:659:
/home/v-ty/anaconda3/lib/python3.8/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: #warning "Using deprecated NumPy API, disable it with " "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#warning "Using deprecated NumPy API, disable it with "
^
In file included from cnode.cpp:2:0,
from cytree.cpp:653:
cnode.h:47:42: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> node_pools;
^
cnode.h:53:94: error: ‘>>’ should be ‘> >’ within a nested template argument list
void prepare(float root_exploration_fraction, const std::vector<std::vector> &noises, const std::vector &value_prefixs, const std::vector<std::vector> &policies);
^
cnode.h:53:182: error: ‘>>’ should be ‘> >’ within a nested template argument list
void prepare(float root_exploration_fraction, const std::vector<std::vector> &noises, const std::vector &value_prefixs, const std::vector<std::vector> &policies);
^
cnode.h:54:111: error: ‘>>’ should be ‘> >’ within a nested template argument list
void prepare_no_noise(const std::vector &value_prefixs, const std::vector<std::vector> &policies);
^
cnode.h:56:40: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> get_trajectories();
^
cnode.h:57:40: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> get_distributions();
^
cnode.h:67:43: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector<CNode*>> search_paths;
^
cnode.h:79:184: error: ‘>>’ should be ‘> >’ within a nested template argument list
void cbatch_back_propagate(int hidden_state_index_x, float discount, const std::vector &value_prefixs, const std::vector &values, const std::vector<std::vector> &policies, tools::CMinMaxStatsList min_max_s
^
In file included from cytree.cpp:653:0:
cnode.cpp: In constructor ‘tree::CNode::CNode()’:
cnode.cpp:31:31: error: ‘nullptr’ was not declared in this scope
this->ptr_node_pool = nullptr;
^
In file included from cytree.cpp:653:0:
cnode.cpp: At global scope:
cnode.cpp:204:94: error: ‘>>’ should be ‘> >’ within a nested template argument list
void CRoots::prepare(float root_exploration_fraction, const std::vector<std::vector> &noises, const std::vector &value_prefixs, const std::vector<std::vector> &policies){
^
cnode.cpp:204:182: error: ‘>>’ should be ‘> >’ within a nested template argument list
void CRoots::prepare(float root_exploration_fraction, const std::vector<std::vector> &noises, const std::vector &value_prefixs, const std::vector<std::vector> &policies){
^
cnode.cpp:213:111: error: ‘>>’ should be ‘> >’ within a nested template argument list
void CRoots::prepare_no_noise(const std::vector &value_prefixs, const std::vector<std::vector> &policies){
^
cnode.cpp:226:32: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> CRoots::get_trajectories(){
^
cnode.cpp: In member function ‘std::vector<std::vector > tree::CRoots::get_trajectories()’:
cnode.cpp:227:36: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> trajs;
^
cnode.cpp: At global scope:
cnode.cpp:236:32: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> CRoots::get_distributions(){
^
cnode.cpp: In member function ‘std::vector<std::vector > tree::CRoots::get_distributions()’:
cnode.cpp:237:36: error: ‘>>’ should be ‘> >’ within a nested template argument list
std::vector<std::vector> distributions;
^
cnode.cpp: At global scope:
cnode.cpp:317:184: error: ‘>>’ should be ‘> >’ within a nested template argument list
void cbatch_back_propagate(int hidden_state_index_x, float discount, const std::vector &value_prefixs, const std::vector &values, const std::vector<std::vector> &policies, tools::CMinMaxStatsList min_max_s
^
cytree.cpp:3382:12: warning: ‘int pyx_pw_6cytree_4Node_1__cinit(PyObject, PyObject, PyObject*)’ defined but not used [-Wunused-function]
static int pyx_pw_6cytree_4Node_1__cinit(PyObject *__pyx_v_self, PyObject *__pyx_args, PyObject *__pyx_kwds) {
^
error: command 'gcc' failed with exit status 1
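For what it's worth, the '>>' and 'nullptr' errors look to me like the compiler is defaulting to a pre-C++11 standard. I tried forcing -std=c++11 in the extension build, roughly like below (the source file name here is an assumption on my part), but I'm not sure this is the intended fix:

from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

# sketch: build the cytree extension with an explicit C++11 flag on older GCC
ext = Extension(
    "cytree",
    sources=["cytree.pyx"],
    language="c++",
    include_dirs=[np.get_include()],
    extra_compile_args=["-std=c++11"],
)
setup(ext_modules=cythonize(ext))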
Can this be tried out on procgen environment?
The paper shows three runs of CrazyClimber, as below. The results look stable and high-performing.
However, when I reran the code three times using the given command, I got the following results. The runs broke off before training finished, but they still show that the reproduced results are an order of magnitude lower than what the paper reported.
Could you give some possible explanations why there is a huge difference?
# running 1
[2022-04-06 02:47:47,242][train_test][INFO][log.py>_log] ==> #0 Test Mean Score of CrazyClimberNoFrameskip-v4: 643.75 (max: 1000.0 , min:300.0 , std: 178.42628029525247)
[2022-04-06 03:54:07,033][train_test][INFO][log.py>_log] ==> #10083 Test Mean Score of CrazyClimberNoFrameskip-v4: 3071.875 (max: 3400.0 , min:2100.0 , std: 383.4338070319309)
[2022-04-06 04:55:49,719][train_test][INFO][log.py>_log] ==> #22478 Test Mean Score of CrazyClimberNoFrameskip-v4: 634.375 (max: 1300.0 , min:300.0 , std: 249.51124097923926)
[2022-04-06 05:57:18,634][train_test][INFO][log.py>_log] ==> #34576 Test Mean Score of CrazyClimberNoFrameskip-v4: 3378.125 (max: 5700.0 , min:2700.0 , std: 736.4332857598168)
[2022-04-06 06:53:54,973][train_test][INFO][log.py>_log] ==> #46427 Test Mean Score of CrazyClimberNoFrameskip-v4: 3065.625 (max: 6800.0 , min:2300.0 , std: 761.0064778797878)
[2022-04-06 07:55:10,556][train_test][INFO][log.py>_log] ==> #57917 Test Mean Score of CrazyClimberNoFrameskip-v4: 8534.375 (max: 11700.0 , min:4000.0 , std: 2101.5781830269843)
[2022-04-06 08:51:19,116][train_test][INFO][log.py>_log] ==> #69158 Test Mean Score of CrazyClimberNoFrameskip-v4: 6453.125 (max: 12100.0 , min:3100.0 , std: 2302.849372923683)
[2022-04-06 09:47:16,916][train_test][INFO][log.py>_log] ==> #80299 Test Mean Score of CrazyClimberNoFrameskip-v4: 7246.875 (max: 10300.0 , min:4900.0 , std: 1379.5344266726365)
# running 2
[2022-04-05 11:39:55,952][train_test][INFO][log.py>_log] ==> #0 Test Mean Score of CrazyClimberNoFrameskip-v4: 600.0 (max: 900.0 , min:300.0 , std: 152.0690632574555)
[2022-04-05 12:46:50,652][train_test][INFO][log.py>_log] ==> #10015 Test Mean Score of CrazyClimberNoFrameskip-v4: 393.75 (max: 800.0 , min:100.0 , std: 132.13984069916233)
[2022-04-05 13:41:18,799][train_test][INFO][log.py>_log] ==> #20080 Test Mean Score of CrazyClimberNoFrameskip-v4: 3634.375 (max: 5100.0 , min:1900.0 , std: 705.6067313844164)
[2022-04-05 14:31:57,618][train_test][INFO][log.py>_log] ==> #30013 Test Mean Score of CrazyClimberNoFrameskip-v4: 9634.375 (max: 13100.0 , min:3500.0 , std: 2230.1358387719347)
[2022-04-05 15:35:49,472][train_test][INFO][log.py>_log] ==> #40039 Test Mean Score of CrazyClimberNoFrameskip-v4: 6434.375 (max: 10000.0 , min:3500.0 , std: 1384.875755934445)
# running 3
[2022-04-05 04:43:53,277][train_test][INFO][log.py>_log] ==> #0 Test Mean Score of CrazyClimberNoFrameskip-v4: 615.625 (max: 1000.0 , min:400.0 , std: 146.00807982779583)
[2022-04-05 05:52:29,495][train_test][INFO][log.py>_log] ==> #10033 Test Mean Score of CrazyClimberNoFrameskip-v4: 950.0 (max: 1500.0 , min:700.0 , std: 264.5751311064591)
[2022-04-05 06:55:41,171][train_test][INFO][log.py>_log] ==> #20077 Test Mean Score of CrazyClimberNoFrameskip-v4: 2771.875 (max: 3800.0 , min:1000.0 , std: 850.086162912325)
[2022-04-05 07:58:09,925][train_test][INFO][log.py>_log] ==> #30064 Test Mean Score of CrazyClimberNoFrameskip-v4: 4550.0 (max: 7300.0 , min:2300.0 , std: 1228.3118496538248)
[2022-04-05 08:59:21,098][train_test][INFO][log.py>_log] ==> #40001 Test Mean Score of CrazyClimberNoFrameskip-v4: 9137.5 (max: 12500.0 , min:5000.0 , std: 1880.2842205368847)
[2022-04-05 10:00:18,121][train_test][INFO][log.py>_log] ==> #50019 Test Mean Score of CrazyClimberNoFrameskip-v4: 10393.75 (max: 24000.0 , min:5700.0 , std: 3386.1794012574114)
Got an error with the example train.sh: 'gym.error.UnregisteredEnv: No registered env with id: BreakoutNoFrameskip-v4'.
Hello! Thank you for this open source implementation and your great research!
state_norm on this line:
https://github.com/YeWR/EfficientZero/blob/main/core/config.py#L57
Do you have any insights into whether normalizing the hidden state helps training or not, or is there no noticeable difference?
Hi,
I was a little confused about how the true reward is obtained from the value prefix in core/ctree/cnode.cpp.
In the function update_tree_q() at line 256, the true reward is calculated by
float true_reward = node->value_prefix - parent_value_prefix
Suppose we have a root node_1 with two children (node_2 and node_3).
Before the while loop, we push node_1 onto node_stack.
In the first iteration of the while loop, we pop node_1 and push node_2 and node_3 onto node_stack; finally we set parent_value_prefix = node_1.value_prefix.
In the second iteration, we pop node_3 (suppose no child of node_3 is expanded), and we set parent_value_prefix = node_3.value_prefix (line 281).
In the third iteration, we pop node_2. When we calculate the true reward of node_2 in line 266,
true_reward = node_2.value_prefix - parent_value_prefix = node_2.value_prefix - node_3.value_prefix.
However, the parent of node_2 is node_1, so the true reward should be node_2.value_prefix - node_1.value_prefix.
So I wonder if there is a problem with how the variable "parent_value_prefix" is handled, or whether I misunderstood the code.
Admittedly, update_tree_q only updates the min-max values, so this may not affect convergence; I just wonder if it would converge faster if this were fixed. (A simplified sketch of how I read the loop follows.)
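Here is a simplified, self-contained Python re-rendering of how I read that loop (not the actual C++), which shows where the sibling overwrite seems to happen:

# simplified sketch of my reading of update_tree_q
class Node:
    def __init__(self, value_prefix, children=()):
        self.value_prefix = value_prefix
        self.children = list(children)

node_2 = Node(value_prefix=1.0)
node_3 = Node(value_prefix=2.0)
node_1 = Node(value_prefix=0.5, children=[node_2, node_3])   # root

node_stack = [node_1]
parent_value_prefix = 0.0
while node_stack:
    node = node_stack.pop()
    if node is not node_1:
        # uses whatever parent_value_prefix holds at this point;
        # when node_2 is popped, that is node_3.value_prefix, not node_1's
        true_reward = node.value_prefix - parent_value_prefix
        print(node.value_prefix, "-", parent_value_prefix, "=", true_reward)
    node_stack.extend(node.children)
    parent_value_prefix = node.value_prefix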
Hi, first of all congratulations on the great work!
I haven't managed to train an agent yet using the EfficientZero framework. The command I'm using to train is the following:
python3 main.py --env BreakoutNoFrameskip-v4
--case atari
--opr train
--amp_type torch_amp
--num_gpus 4
--num_cpus 32
--cpu_actor 12
--gpu_actor 28
--force
--use_priority
--use_max_priority
--debug
In a cluster with the following architecture:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1B.0 Off | 0 |
| N/A 32C P0 52W / 300W | 12836MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:1C.0 Off | 0 |
| N/A 31C P0 51W / 300W | 11373MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:00:1D.0 Off | 0 |
| N/A 31C P0 54W / 300W | 10004MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 33C P0 55W / 300W | 8529MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
The problem I'm facing is that even after a while of training there's only the following log:
(pid=52926) A.L.E: Arcade Learning Environment (version +978d2ce)
(pid=52926) [Powered by Stella]
(pid=52926) Start evaluation at step 0.
(pid=52926) Step 0, test scores:
(pid=52926) [5. 0. 5. 2. 0. 2. 0. 9. 0. 0. 0. 2. 2. 4. 0. 2. 0. 0. 0. 0. 2. 2. 0. 5.
(pid=52926) 0. 0. 0. 5. 2. 0. 2. 5.]
Also the results folder of the experiment is mostly empty, I only have a train.log with the initial parameters.
I'm not sure if this is just a matter of waiting for a long time or if something in the inner workings is stuck (it looks like the batch_storage from the main train loop is always empty, since we haven't entered the train phase yet).
Something I think is really weird is that time passes but the GPU memory usage stays exactly the same, which makes me think something is off.
Would appreciate any advice in order to make this work.
Thanks in advance!