
mlsh's Introduction

Status: Archive (code is provided as-is, no updates expected)

Meta-Learning Shared Hierarchies

Code for Meta-Learning Shared Hierarchies.

Installation
Add to your .bash_profile (replace ... with path to directory):
export PYTHONPATH=$PYTHONPATH:/.../mlsh/gym;
export PYTHONPATH=$PYTHONPATH:/.../mlsh/rl-algs;

Install MovementBandits environments:
cd test_envs
pip install -e .
Running Experiments
python main.py --task AntBandits-v1 --num_subs 2 --macro_duration 1000 --num_rollouts 2000 --warmup_time 20 --train_time 30 --replay False AntAgent

Once you've trained your agent, view it by running:

python main.py [...] --replay True --continue_iter [your iteration] AntAgent

The MLSH script works on any Gym environment that implements the randomizeCorrect() function. See the envs/ folder for examples of such environments.
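For reference, here is a minimal sketch of what such an environment could look like, using the old-style Gym reset/step API this codebase targets. It is an illustrative placeholder, not one of the environments in envs/; the class name, observation, and reward logic are made up.

import gym
import numpy as np
from gym import spaces

class TwoGoalEnv(gym.Env):
    # Illustrative sketch only, not an environment shipped in envs/.
    def __init__(self):
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(2,))
        self.action_space = spaces.Discrete(2)
        self.realgoal = 0
        self.pos = np.zeros(2, dtype=np.float32)

    def randomizeCorrect(self):
        # MLSH resamples the task by calling this method; here it just
        # re-picks which of the two goals is rewarded.
        self.realgoal = np.random.randint(2)

    def reset(self):
        self.pos = np.zeros(2, dtype=np.float32)
        return self.pos

    def step(self, action):
        reward = 1.0 if action == self.realgoal else 0.0
        return self.pos, reward, False, {}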

To run on multiple cores:

mpirun -np 12 python main.py ...

mlsh's People

Contributors

cberner, christopherhesse, kvfrans


mlsh's Issues

readme for Windows

Hi! It would be great if you could provide a more detailed README for Windows systems. Thanks very much.

Hi, how did you solve this error?

Hi, I have a similar error with rl_algs:

mlsh/mlsh_code$ python main.py --task AntBandits-v1 --num_subs 2 --macro_duration 1000 --num_rollouts 2000 --warmup_time 20 --train_time 30 --replay False AntAgent
Traceback (most recent call last):
  File "main.py", line 20, in <module>
    from rl_algs.common import set_global_seeds, tf_util as U
ModuleNotFoundError: No module named 'rl_algs'

Originally posted by @ViktorM in #1 (comment)
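This error usually just means the PYTHONPATH exports from the Installation section are not in effect. As a hedged workaround sketch (the ... placeholder is the same one the README uses for your local checkout path), the paths can also be appended at the top of main.py before the import:

import sys

# Workaround sketch: make the repo's rl-algs and gym copies importable.
# Replace ... with the path to your mlsh checkout, as in the README.
sys.path.append("/.../mlsh/rl-algs")
sys.path.append("/.../mlsh/gym")

from rl_algs.common import set_global_seeds, tf_util as U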

Wrong environment reset?

See:
https://github.com/openai/mlsh/blob/master/mlsh_code/rollouts.py#L83

Are you sure the "and" condition there is correct?
It would mean the environment does not get reset even though the goal has already been reached.
(It will run another x steps until macrolen is reached, even though it clearly won't get any more reward.)
It also means the environment will never reset at all if you reach a state where the goal can no longer be reached.
An "or" would make more sense in my opinion.

How to determine the number of sub-policies?

Hi, I read the paper, and in the experiment section, apart from the first simple examples where it is trivial to determine the number of sub-policies, I didn't see any detail from Section 6.4 (ant robot, etc.) onward about how this number is set.

In my opinion, this number should be quite important for the algorithm to perform well; for example, you won't get good results by setting num-policy=3 in FourRooms.

Could you please explain how this number should be chosen? Thank you.

Details on running the code with multiple cores

Just want to check whether we are running the code the right way. If I want to run the code in 120 parallel instances as the paper suggests, do I just use the mpiexec command as in the following, or is there another way to do this?

mpiexec -n 120 python3 main.py --task AntBandits-v1 --num_subs 2 --macro_duration 1000 --num_rollouts 2000 --warmup_time 20 --train_time 30 --replay False AntAgent

Fourroom environment not working

When running the FourRooms environment, there is a bug in policy_network.py that makes the code crash. The observation returned by the four-room environment is an integer, whereas for the movement bandit it is a vector of the current location and the two goal locations. On line 49 of policy_network.py, ob[None] will crash if ob is an integer.
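A hedged workaround sketch (not code from the repository): coerce scalar observations into a 1-D array before they reach the policy, for example with a small helper applied to each observation.

import numpy as np

def to_vector_ob(ob):
    # Hypothetical helper: policy_network.py indexes observations as ob[None],
    # which assumes a numpy array. Discrete environments such as FourRooms
    # return a plain integer, so wrap scalars into a 1-D float array first.
    if np.isscalar(ob):
        return np.array([ob], dtype=np.float32)
    return np.asarray(ob, dtype=np.float32)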

Termination due to `Bus error (signal 7)`

Hi,

I'm running the code using mpirun inside a Docker container. It worked at first, but recently I started getting the following error message:

=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 135
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Bus error (signal 7)

As far as I'm aware, I didn't change anything. Does anyone know where this might be coming from?
Thanks a lot!
Best, Max

Reproducing agent performance in MovementBandits

Hi, great work on the paper and code! I am working on a project that builds on top of MLSH. We implemented our own GPU-optimized version of the algorithm based on your MPI-based code. We observe, both in our setting and with your code for MovementBandits, that the two sub-policies end up learning the same strategy of moving to just one of the bandits (consistently the same one) within a single run.

Here are the parameters from my run
mpirun -np 120 python3 main.py --task MovementBandits-v0 --num_subs 2 --macro_duration 10 --num_rollouts 2000 --warmup_time 9 --train_time 1 --replay False MovementBandits

Additionally, I tried both optimizer step sizes, 3e-4 (as mentioned in the paper) and 3e-5 (the default argument in the code), and changed the seed value of 1401 that is also hard-coded in your main file.

I modified master.py to log some additional information such as the current iteration number and the real goal chosen by randomizeCorrect (our fork). Here is a snippet from one of the runs

Mini ep 10, goal 1, iteration 30: global: 18.60333333333333, local: 42.5
Mini ep 3, goal 0, iteration 37: global: 18.60333333333333, local: 2.375
Mini ep 1, goal 0, iteration 30: global: 18.60333333333333, local: 2.725
Mini ep 5, goal 1, iteration 35: global: 18.60333333333333, local: 43.075
Mini ep 3, goal 0, iteration 37: global: 18.60333333333333, local: 1.05
Mini ep 2, goal 0, iteration 38: global: 18.60333333333333, local: 3.125
Mini ep 4, goal 1, iteration 36: global: 18.60333333333333, local: 44.375
Mini ep 1, goal 0, iteration 30: global: 18.60333333333333, local: 3.275
Mini ep 4, goal 1, iteration 36: global: 18.60333333333333, local: 43.65
Mini ep 7, goal 0, iteration 33: global: 18.60333333333333, local: 5.575
Mini ep 1, goal 0, iteration 30: global: 18.60333333333333, local: 3.975
Mini ep 8, goal 0, iteration 32: global: 18.60333333333333, local: 4.0
Mini ep 2, goal 0, iteration 38: global: 18.60333333333333, local: 0.475
Mini ep 5, goal 1, iteration 35: global: 18.60333333333333, local: 41.4
Mini ep 3, goal 0, iteration 37: global: 18.60333333333333, local: 0.975

We observe similar behavior with our own implementation (confirmed by visualizing the sub-policies with render: both of them cause the agent to move to the same disc throughout the entire training run).

We are attempting to reproduce the results from the paper (Figure 4, page 6), where the agent learns to get rewards around 40 after a few gradient updates. Please let us know whether we are running the right hyperparameter configuration, and what seeds to use with the original codebase, to observe such behavior; this will greatly help with our research! Thanks.

rl_algs

Could you please point out which library provides rl_algs?

--continue_iter is buggy

I used the following command:
python3 main.py --task AntBandits-v1 --num_subs 2 --macro_duration 1000 --num_rollouts 2000 --warmup_time 20 --train_time 50 --continue_iter 00615 --replay True AntAgent

The thing I notice is that you need to write a "0" in front of the iteration number.
Another thing I noticed is that I needed to copy the files from "savedir" to the "AntAgent" folder to make it work.
I guess the checkpointing code uses the wrong directory for storing checkpoints.

The first output also shows "It is Iteration 0 so i'm changing [...]".
But I wanted to continue the learning process, not start over from the beginning.

Terminal states logic for sub-policies

nonterminal = 1-new[t+1]

Hi, shouldn't the logic for determining terminal states for sub-policies consider the case where the master action changes? If the action changes, shouldn't we designate the current state as terminal? It seems that the current implementation can bootstrap from a different sub-policy network when such a case arises.
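A hedged sketch of the adjustment being asked about (placeholder names, not the repository's variables): treat a step as terminal for the sub-policy whenever the episode ends or the master switches sub-policies at the next step, so the value bootstrap never crosses sub-policy networks.

import numpy as np

def nonterminal_mask(new, macro):
    # Hypothetical sketch, not the repository's code.
    # new[t]   : 1 if the environment episode restarted at step t
    # macro[t] : index of the sub-policy selected by the master at step t (placeholder)
    new = np.asarray(new)
    macro = np.asarray(macro)
    episode_end = new[1:]
    switched = (macro[1:] != macro[:-1]).astype(new.dtype)
    # Plays the role of 1 - new[t+1] in the quoted line above, additionally
    # zeroed out whenever the master action changes at the next step.
    return 1 - np.maximum(episode_end, switched)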

Observation of MovementBandits env

Hello, I have tried to train the MLSH policies in the MovementBandits environment,
but the outputs of the master policy seem to be random even after training.

The command I tried is here:
mpirun -np 120 python3 main.py --task MovementBandits-v0 --num_subs 2 --macro_duration 10 --num_rollouts 2000 --warmup_time 9 --train_time 1 --replay False MovementBandits

I guess the master policy has to observe something about the correct goal in order to select sub-policies, but the current implementation provides nothing about the correct goal.
Do you have any updates about MovementBandits?
