I tried to run the code for Atari Freeway using the following command with the default


Zero score on Freeway about efficientzero HOT 6 OPEN

yewr commented on August 22, 2024

Zero score on Freeway

from efficientzero.

Comments (6)

rPortelas commented on August 22, 2024

@emailweixu It is true that Freeway is challenging in terms of exploration. However, both the EfficientZero paper and the original MuZero paper (check Table S1 in the appendix) report non-zero performance on it, so we should be able to reproduce it.


emailweixu commented on August 22, 2024

@rPortelas I know both EfficientZero and MuZero reported reasonable performance on Freeway. The original MuZero is not open-sourced, so I cannot re-run the experiments and cannot know for sure. But since it was trained on many more frames (20B frames), it is more likely to obtain reward through random exploration. Furthermore, the original MuZero paper didn't describe how the model weights are initialized. It is possible that non-zero initialization of the last prediction layer can get some reward, because non-zero initialization makes the initial policy not uniformly random. In fact, I did try non-zero initialization with EfficientZero (changing init_zero from True to False). It did get some reward during training, but the final performance was still much lower than the reported number. Note, however, that zero initialization is explicitly described by EfficientZero in A.1.
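To illustrate the initialization point above, here is a minimal sketch (not EfficientZero code) of why all-zero policy logits give a uniform policy while non-zero logits do not. The three-action set (NOOP, UP, DOWN), the zero bias, and the scale of the random logits are assumptions made for illustration only.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


NUM_ACTIONS = 3  # assumed Freeway action set: NOOP, UP, DOWN

# A zero-initialized last layer (weights and bias both zero, an assumption)
# outputs logit 0 for every action, so the policy prior is exactly uniform.
zero_init_policy = softmax([0.0] * NUM_ACTIONS)

# Hypothetical non-zero initialization: small random logits.
# The Gaussian scale of 1.0 is illustrative, not taken from any paper.
rng = random.Random(0)
nonzero_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ACTIONS)]
nonzero_init_policy = softmax(nonzero_logits)

print("zero init:    ", zero_init_policy)
print("non-zero init:", nonzero_init_policy)
```

Whether the non-uniform prior happens to favor UP depends on the random draw, which may be one reason results vary across seeds.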


rPortelas commented on August 22, 2024

Strengthening the relevance of @emailweixu reproducibility issue

Here are my performance results on Freeway, 4 seeds:

All 4 seeds obtained a score of 0 by the end of training; however, 1 seed did manage to reach a reward of 21.5 at some points during training.

I used the provided train.sh script (so 4 GPUs), with the following modifications to fit my setup: I used "--object_store_memory 100000000000" and "--num_cpus 80", which should not impact performance.

This issue is related to issue #21, which points out another reproducibility issue. See issue #21 for potential reasons.

Best,
Rémy


emailweixu commented on August 22, 2024

@rPortelas Actually, I have reasons to believe that a zero score on Freeway is expected. If you play Freeway yourself, you can see that it needs consistent exploration in one direction (UP) for many steps in order to get any reward. However, in the current implementation of EfficientZero, the behavior policy is a stochastic policy based on the MCTS result. At the beginning of training, the policy from MCTS is close to uniform given how EfficientZero is initialized (i.e. zero initialization for the last layer of the prediction nets), which makes it very hard to consistently go UP. Other algorithms such as CURL or SPR use a greedy policy (coupled with noisy nets) and are more likely to show consistent exploration behavior.
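The exploration argument above can be sketched with a toy Monte Carlo experiment. This is not Freeway and not EfficientZero: it is a 1-D walker with no cars, and the `target`, `horizon`, and `p_up` values are hypothetical numbers chosen only to contrast a uniform policy with one that consistently prefers UP.

```python
import random


def reach_rate(p_up, target=20, horizon=200, episodes=2000, seed=0):
    """Fraction of episodes in which a 1-D walker reaches `target`.

    Each step the walker moves UP with probability `p_up`; the remaining
    probability is split equally between DOWN and NOOP. The position is
    floored at 0, loosely mimicking the bottom of the screen.
    """
    rng = random.Random(seed)
    p_down = (1.0 - p_up) / 2.0
    successes = 0
    for _ in range(episodes):
        pos = 0
        for _ in range(horizon):
            u = rng.random()
            if u < p_up:
                pos += 1                # UP
            elif u < p_up + p_down:
                pos = max(0, pos - 1)   # DOWN
            # otherwise NOOP: position unchanged
            if pos >= target:
                successes += 1
                break
    return successes / episodes


# Uniform policy over three actions, as with a zero-initialized prior.
uniform_rate = reach_rate(p_up=1.0 / 3.0)
# A policy that mostly repeats UP, standing in for consistent exploration.
biased_rate = reach_rate(p_up=0.9)

print(f"uniform policy reach rate:   {uniform_rate:.3f}")
print(f"UP-biased policy reach rate: {biased_rate:.3f}")
```

In the real game, collisions push the chicken back, so a uniform policy would likely fare worse than in this toy setting.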


szrlee commented on August 22, 2024

Thanks for the discussion!
Have there been any follow-ups so far?


emailweixu commented on August 22, 2024

@rPortelas Did you try the "raw" version you mentioned in #21 on Freeway?

