openai / mlsh Goto Github PK

View Code? Open in Web Editor NEW

602.0 46.0 163.0 413 KB

Code for the paper "Meta-Learning Shared Hierarchies"

Home Page: https://arxiv.org/abs/1710.09767

Python 99.59% Makefile 0.06% Shell 0.10% Dockerfile 0.25%

paper

mlsh's Introduction

Status: Archive (code is provided as-is, no updates expected)

Meta-Learning Shared Hierarchies

Code for Meta-Learning Shared Hierarchies.

Installation

Add to your .bash_profile (replace ... with path to directory):
export PYTHONPATH=$PYTHONPATH:/.../mlsh/gym;
export PYTHONPATH=$PYTHONPATH:/.../mlsh/rl-algs;

Install MovementBandits environments:
cd test_envs
pip install -e .

Running Experiments

python main.py --task AntBandits-v1 --num_subs 2 --macro_duration 1000 --num_rollouts 2000 --warmup_time 20 --train_time 30 --replay False AntAgent

Once you've trained your agent, view it by running:

python main.py [...] --replay True --continue_iter [your iteration] AntAgent

The MLSH script works on any Gym environment that implements the randomizeCorrect() function. See the envs/ folder for examples of such environments.

To run on multiple cores:

mpirun -np 12 python main.py ...

mlsh's People

Contributors

Stargazers

Watchers

Forkers

derrickjnet miej javierluraschi jithsjoy kelvinson kastnerkyle fourks cnheider cheetah132 jiefloyd pustar psfournier ajaytalati universityai jc-johnson georgezoto codeaudit benjamesbabala ifhubs dmitrilosev strategist922 hadisalman ryfan-rs frankatmech cnn-gan justinlogan vv111y hongyunnchen zhf459 windyjune zhenv5 davidkuhta ruikunl quykiemsau andow7 tony32769 kaizoku99 alapini ayush15k1994 zacharias030 favarete 20chase flyers skiev colamas nikhilclasher vashishtmadhavan ml-lab qistang abhishekp906 jkim992 pouncealot gsygsy96 ourobouros jaicie parsonszeng ei-grad johbln freddyb1 huoliangyu xinyanliang robintyh1 wwxfromtju ysaibhargav ykankaya piandpower igni-c albertwujj peterzcc iceclear natsuki14 pengbaolin janebert zxsted geometricbci thuang cclauss airobotgui xuzhuo77 daominglyu brahmac mystery-college-of-the-adapts yur1xpp lucasosouza wayne980 lturing neelshah18 shitou208 taylor-liu atgambardella ll550 christopherhesse yinlang832 xdr1xnl0p3z hityangzhen 355380o726602 muharremokutan chaonan99 sx1616039 daiviet01

mlsh's Issues

readme for Windows

Hi! It would be great if you can give a more detailed readme for a Windows system. Thanks very much.

how to render Fourrooms-v0

For Fourrooms-v0 game, env.render() function can be applied. how to render the game??

Hi, how did you solve this error?

Hi, have similar error with rl_algos:

mlsh/mlsh_code$ python main.py --task AntBandits-v1 --num_subs 2 --macro_duration 1000 --num_rollouts 2000 --warmup_time 20 --train_time 30 --replay False AntAgent
Traceback (most recent call last):
  File "main.py", line 20, in <module>
    from rl_algs.common import set_global_seeds, tf_util as U
ModuleNotFoundError: No module named 'rl_algs'

Originally posted by @ViktorM in #1 (comment)

Wrong environment reset?

See:
https://github.com/openai/mlsh/blob/master/mlsh_code/rollouts.py#L83

Are you shure that "and" condition there is correct?
It would mean that the environment gets not reset although you already reached the goal.
(It will do another x steps till macrolen gets reached although its clear it won't get any reward)
And also means that the environment will never reset at all if you are at a stage where the goal cannot be reached anymore.
An "or" would make more sense in my opinion.

How to determine the number of sub-policies?

Hi, I read the paper and in the experiment section, apart from the first simple examples where it is trivial to determine the number of sub-policies, from section 6.4 (ant robot, etc.) on, I didn't see any detail about how this number is set.

In my opinion, this number should be quite important for this algorithm to perform well, for example you won't get good results by setting num-policy=3 in FourRooms.

Could you please explain how this number should be chosen? Thank you.

Details on running the code with multiple cores

Just want to check if we run the code in a correct way. If I want to run the code in 120 parallel instance as the paper suggests. Do I just use mpiexec command like the following command? or there is another way doing this?

mpiexec -n 120 python3 main.py --task AntBandits-v1 --num_subs 2 --macro_duration 1000 --num_rollouts 2000 --warmup_time 20 --train_time 30 --replay False AntAgent

Bug in the FourRooms env implementation

It seems the code for the FourRooms env was carried over from the option-critic implementation, and thus it also has a bug that the option-critic implementation had. See: jeanharb/option_critic#11

Also, it seems that the goal is same every episode? https://github.com/openai/mlsh/blob/master/test_envs/test_envs/envs/fourrooms.py#L60

@kvfrans It would be cool if you can clarify this?

Fourroom environment not working

When running fourroom environment, there is a bug in the policy_network.py that makes the code crash. The observation returned for the four room environment is an integer, however, for the movement bandit, the return is a vector of current location and two goal locations. In line 49 in the policy_network.py, ob[None] will crash if ob is an integer.

Termination due to `Bus error (signal 7)`

Hi,

I'm running the code using mpirun inside a docker container. It worked at first but recently I started getting the error message

=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 135
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Bus error (signal 7)

As far as I'm aware of I didn't change anything. Does anyone know where this might be coming from?
Thanks a lot!
Best, Max

Reproducing agent performance in MovementBandits

Hi, great work on the paper and code! I am working on a project that builds on top of MLSH. We implemented our own GPU optimized version of the algorithm based on your MPI based code. We observe both in our setting as well as with your code for MovementBandits that both the sub policies end up learning the same strategy of moving to just one of the bandits (the same one consistently) in a single run.

Here are the parameters from my run
mpirun -np 120 python3 main.py --task MovementBandits-v0 --num_subs 2 --macro_duration 10 --num_rollouts 2000 --warmup_time 9 --train_time 1 --replay False MovementBandits

Additionally, I tried both optimizer step sizes of 3e-4 (as mentioned in the paper) and 3e-5 (embedded as a default argument in code) and changed the seed value of 1401 also embedded in your main file.

I modified master.py to log some additional information such as the current iteration number and the real goal chosen by randomizeCorrect (our fork). Here is a snippet from one of the runs

Mini ep 10, goal 1, iteration 30: global: 18.60333333333333, local: 42.5
Mini ep 3, goal 0, iteration 37: global: 18.60333333333333, local: 2.375
Mini ep 1, goal 0, iteration 30: global: 18.60333333333333, local: 2.725
Mini ep 5, goal 1, iteration 35: global: 18.60333333333333, local: 43.075
Mini ep 3, goal 0, iteration 37: global: 18.60333333333333, local: 1.05
Mini ep 2, goal 0, iteration 38: global: 18.60333333333333, local: 3.125
Mini ep 4, goal 1, iteration 36: global: 18.60333333333333, local: 44.375
Mini ep 1, goal 0, iteration 30: global: 18.60333333333333, local: 3.275
Mini ep 4, goal 1, iteration 36: global: 18.60333333333333, local: 43.65
Mini ep 7, goal 0, iteration 33: global: 18.60333333333333, local: 5.575
Mini ep 1, goal 0, iteration 30: global: 18.60333333333333, local: 3.975
Mini ep 8, goal 0, iteration 32: global: 18.60333333333333, local: 4.0
Mini ep 2, goal 0, iteration 38: global: 18.60333333333333, local: 0.475
Mini ep 5, goal 1, iteration 35: global: 18.60333333333333, local: 41.4
Mini ep 3, goal 0, iteration 37: global: 18.60333333333333, local: 0.975

We observe similar behavior (confirmed upon visualization of the subpolicies with render, both of them cause the agent to move to the same disc throughout the entire training run) with our own implementation.

We are attempting to reproduce the results from the paper (figure 4, page 6) where the agent learns to get rewards around 40 after a few gradient updates. Please let us know if we are running the right hyperparameter configuration/what seeds to use with the original codebase to observe such behavior; this will greatly help with our research! Thanks.

rl_algs

could you please point out the library for rl_algs ?

--continue_iter is buggy

I used the following statement:
python3 main.py --task AntBandits-v1 --num_subs 2 --macro_duration 1000 --num_rollouts 2000 --warmup_time 20 --train_time 50 --continue_iter 00615 --replay True AntAgent

The thing i notice is that you need to write "0" infront of the iteration-number.
Another thing i noticed is that i needed to copy the files from "savedir" to the folder "AntAgent" to make it work.
I guess the checkpoint-algorithm uses the wrong directory for storing checkpoints.

The first output also shows "It is Iteration 0 so i'm changing [...]".
But i wanted to continue the learning process and didn't want to start from beginning.

Terminal states logic for sub-policies

mlsh/mlsh_code/rollouts.py

Line 123 in 58f527a

nonterminal = 1-new[t+1]

Hi, shouldn't the logic for determining terminal states for sub-policies consider the case where the master action changes? If the action changes, shouldn't we designate the current state as terminal? It seems that the current implementation can bootstrap from a different sub-policy network when such a case arises.

Observation of MovementBandits env

Hello, I have tried to train the MLSH policies under the MovementBandits environment,
but outputs of the master policy seems to be random even after training.

The command I tried is here:
mpirun -np 120 python3 main.py --task MovementBandits-v0 --num_subs 2 --macro_duration 10 --num_rollouts 2000 --warmup_time 9 --train_time 1 --replay False MovementBandits

I guess the master policy has to have observation about the correct goal to select sub policies, but the current implementation provides nothing about the correct goal.
Do you have any updates about MovementBandits?