
alignment's People

Contributors

ranjaykrishna, zixianma


alignment's Issues

Continuing from the previous question

    Hi, thanks for the questions! It's expected that the end step for all episodes is 25: the max number of steps is set to 25 by default, and it can stay at 25 even if you enable early stopping when the goal is achieved. As for the difference between `test_reward` and `test_bench/step_reward`, it comes from two sources. First, the reward and benchmark loggers log things slightly differently: as far as I remember from my notes, the reward logger resets at the end of each episode, whereas the benchmark logger resets only once, at the collector's init(), so the trends can differ. Second, `test_bench/step_reward` additionally divides the episode reward by the number of steps in that episode (i.e., it is the average reward per step). Please check the code for the reward and benchmark loggers as well as `offpolicy_trainer` for your own understanding, and feel free to write your own logger for your purposes! Let me know if you have any other questions, thanks!

Originally posted by @zixianma in #3 (comment)
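To make the second difference concrete, here is a minimal sketch of how the two metrics can diverge once episode lengths vary; the variable names (`episode_rewards`, `episode_lengths`) are illustrative and are not the repository's actual logger fields:

```python
# Illustrative only: two test episodes with different lengths.
episode_rewards = [10.0, 30.0]   # total reward collected in each episode
episode_lengths = [25, 10]       # number of steps taken in each episode

# test_reward-style metric: mean of per-episode total rewards
test_reward = sum(episode_rewards) / len(episode_rewards)    # 20.0

# test_bench/step_reward-style metric: per-episode reward averaged per step
per_step = [r / n for r, n in zip(episode_rewards, episode_lengths)]
step_reward = sum(per_step) / len(per_step)                  # (0.4 + 3.0) / 2 = 1.7

print(test_reward, step_reward)
```

Because the two quantities weight episodes differently, their trends over training need not track each other.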

Thanks for your reply to the previous question. Following your suggestion, I checked the code for SimpleSpreadBenchmarkLogger and found a line that may be the key to the difference between the two metrics (i.e., test_reward and test_bench/step_reward). Here is the code:

bench_data = elem['n'][0]

Here you only add the info of the first agent (i.e., `elem['n'][0]`). However, in the default setting there are 5 agents, so the length of `elem['n']` is 5, and each element of `elem['n']` holds different info (and therefore different rewards) for a different agent. The computation of test_reward does not drop the other agents this way, so the two trends differ. Could you check whether my understanding is correct? Thanks!
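If that reading is right, one possible fix is to average the benchmark info over all agents instead of keeping only the first entry. A minimal sketch, assuming each element of `elem['n']` is a dict with a numeric 'reward' field (the real structure in the repository may differ):

```python
# Hypothetical sketch: aggregate benchmark info over all agents instead of
# keeping only the first entry. Assumes each element of elem['n'] is a dict
# with a numeric 'reward' field; the real structure may differ.
elem = {'n': [{'reward': 1.0}, {'reward': 0.5}, {'reward': 0.0},
              {'reward': 2.0}, {'reward': 1.5}]}

agent_infos = elem['n']
mean_reward = sum(info['reward'] for info in agent_infos) / len(agent_infos)
print(mean_reward)  # 1.0, averaged over all 5 agents instead of just the first
```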

Reproducing the results in the paper

I recently read the paper preprint on arXiv, and it's nice to have the official implementation here, thank you! I ran into some problems getting it up and running. For example, when reproducing the ELIGN-adv result on Predator-Prey (2v2), the code raised an error in sacd_multi_wm.py, line 307, where nonzero_obs_count contains zeros and the division yields NaN. The exact command I used is python train_multi_sacd.py --task simple_tag_in --num-good-agents 2 --num-adversaries 2 --obs-radius 0.5 --intr-rew elign_adv --epoch 100 --save-models --benchmark --logdir log/simple_tag_2_2_elign_adv_32procs --wandb-enabled --training-num 32 --test-num 32.
I'm also a little confused about the training time in the arXiv preprint: 100+ epochs × 800K episodes per epoch × 25 timesteps per episode comes to more than 2×10^9 timesteps, which seems rather large; could the authors confirm this? It would be great to have scripts that reproduce the main results, in case I made a mistake in the command-line arguments. Thank you!
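As a stopgap while waiting for an official fix, a common pattern for this failure mode is to guard the division so that agents with zero observed neighbors contribute zero instead of NaN. A minimal PyTorch sketch; the variable names mirror the issue report, not the repository's exact code at line 307:

```python
import torch

# Hypothetical guard mirroring the reported failure: `nonzero_obs_count`
# may contain zeros, and dividing by it produces NaN.
reward_sum = torch.tensor([3.0, 0.0, 5.0])
nonzero_obs_count = torch.tensor([3.0, 0.0, 2.0])

# Clamp the divisor to at least 1 so empty counts yield 0 instead of NaN.
avg_reward = reward_sum / nonzero_obs_count.clamp(min=1.0)
print(avg_reward)  # tensor([1.0000, 0.0000, 2.5000])
```

Whether zeroing out is the semantically correct behavior here depends on how the intrinsic reward is meant to treat agents with nothing in their observation radius, so this is only a workaround.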

Questions about the rewards in logger

Hi, I ran your code with three different configs and obtained three variants. Looking at the TensorBoard logs, I find that the trend of 'test_reward' differs from that of 'test_bench/step_reward'. Could you explain why? Also, the end step for every episode is 25; is that normal? Here is my TensorBoard.

[TensorBoard screenshot]

Confused about the number of episodes and steps per episode while training

Hello, I am very interested in your work on intrinsic rewards. However, I am confused that the number of training episodes and steps per episode in the paper disagrees with the code. The paper says training uses "800K episodes of 25 timesteps" and evaluation uses "1K test episodes", but the code sets "--epoch 10" and "--step-per-epoch 1000" by default. What puzzles me further is that the default code setting actually trains better. I am not sure whether something went wrong; I just want to know the correct number of episodes and steps per episode.

Thanks.
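For concreteness, here is the arithmetic behind the mismatch. Note that the mapping from --step-per-epoch to environment steps is an assumption: recent tianshou versions count collected transitions, while older ones count gradient steps, so the exact gap may differ:

```python
# Illustrative arithmetic comparing the two training budgets.
# Paper: 800K episodes of 25 timesteps each.
paper_timesteps = 800_000 * 25            # 20,000,000 environment steps

# Code defaults: --epoch 10, --step-per-epoch 1000.
# ASSUMPTION: step-per-epoch counts environment steps (tianshou-style);
# older tianshou versions count gradient steps instead, so this may be off.
code_timesteps = 10 * 1000                # 10,000 steps under that assumption

print(paper_timesteps // code_timesteps)  # 2000x gap
```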
