
gail-tf's Introduction

Check out the simpler version at openai/baselines/gail!

gail-tf

TensorFlow implementation of Generative Adversarial Imitation Learning (and behavior cloning).

Disclaimer: some code is borrowed from @openai/baselines.

What's GAIL?

  • Model-free imitation learning -> low sample efficiency during training
    • model-based GAIL: End-to-End Differentiable Adversarial Imitation Learning
  • Directly extracts a policy from demonstrations
  • Removes the RL optimization from the inner loop of inverse RL
  • Some work based on GAIL:
    • Inferring The Latent Structure of Human Decision-Making from Raw Visual Inputs
    • Multi-Modal Imitation Learning from Unstructured Demonstrations using Generative Adversarial Nets
    • Robust Imitation of Diverse Behaviors

Requirements

  • python==3.5.2
  • mujoco-py==0.5.7
  • tensorflow==1.1.0
  • gym==0.9.3

Run the code

I separate the code into two parts: (1) sampling expert data, and (2) imitation learning with GAIL/BC.

Step 1: Generate expert data

Train the expert policy using PPO/TRPO, from openai/baselines

Ensure that $GAILTF is set to the path to your gail-tf repository, and $ENV_ID is any valid OpenAI gym environment (e.g. Hopper-v1, HalfCheetah-v1, etc.)

Configuration
export GAILTF=/path/to/your/gail-tf
export ENV_ID="Hopper-v1"
export BASELINES_PATH=$GAILTF/gailtf/baselines/ppo1 # use gailtf/baselines/trpo_mpi for TRPO
export SAMPLE_STOCHASTIC="False"            # use True for stochastic sampling
export STOCHASTIC_POLICY="False"            # use True for a stochastic policy
export PYTHONPATH=$GAILTF:$PYTHONPATH       # as mentioned below
cd $GAILTF
Train the expert
python3 $BASELINES_PATH/run_mujoco.py --env_id $ENV_ID

The trained model will be saved in ./checkpoint; its exact name depends on your optimization method and environment ID. Choose the last checkpoint in the series.

export PATH_TO_CKPT=./checkpoint/trpo.Hopper.0.00/trpo.Hopper.00-900
Sample from the generated expert policy
python3 $BASELINES_PATH/run_mujoco.py --env_id $ENV_ID --task sample_trajectory --sample_stochastic $SAMPLE_STOCHASTIC --load_model_path $PATH_TO_CKPT

This will generate a pickle file that stores the expert trajectories in ./XXX.pkl (e.g. deterministic.ppo.Hopper.0.00.pkl).

export PICKLE_PATH=./stochastic.trpo.Hopper.0.00.pkl
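
To sanity-check the expert data before training, you can load the pickle and inspect it. A minimal Python sketch; the exact layout of the saved trajectories is an assumption here (field names such as 'ob' and 'ac' may differ), so print the loaded object first:

import pickle

with open("./stochastic.trpo.Hopper.0.00.pkl", "rb") as f:
    trajectories = pickle.load(f)

# Inspect the overall structure first, since the exact layout may differ.
print(type(trajectories), len(trajectories))
# If each entry is a dict of arrays (an assumption), look at one trajectory:
# print(trajectories[0].keys())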

Step 2: Imitation learning

Imitation learning via GAIL

python3 main.py --env_id $ENV_ID --expert_path $PICKLE_PATH

Usage (a combined example follows this list):

--env_id:          The environment id
--num_cpu:         Number of CPUs available during sampling
--expert_path:     The path to the pickle file generated in the [previous section]()
--traj_limitation: Limit on the number of expert trajectories used
--g_step:          Number of policy optimization steps in each iteration
--d_step:          Number of discriminator optimization steps in each iteration
--num_timesteps:   Number of timesteps to train (limits the number of timesteps of interaction with the environment)
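
For example, a run that limits the expert data and adjusts the optimization schedule could look like the following (the flag values are illustrative, not recommended settings):

python3 main.py --env_id $ENV_ID --expert_path $PICKLE_PATH --traj_limitation 10 --g_step 3 --d_step 1 --num_timesteps 1000000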

To view the summary plots in TensorBoard, issue

tensorboard --logdir $GAILTF/log
Evaluate your GAIL agent
python3 main.py --env_id $ENV_ID --task evaluate --stochastic_policy $STOCHASTIC_POLICY --load_model_path $PATH_TO_CKPT --expert_path $PICKLE_PATH

Imitation learning via Behavioral Cloning

python3 main.py --env_id $ENV_ID --algo bc --expert_path $PICKLE_PATH
Evaluate your BC agent
python3 main.py --env_id $ENV_ID --algo bc --task evaluate --stochastic_policy $STOCHASTIC_POLICY --load_model_path $PATH_TO_CKPT --expert_path $PICKLE_PATH

Results

Note: the following hyper-parameter settings are the best I've tested (via a simple grid search on a setup with 1500 trajectories) and are provided for reference only.

The different curves below correspond to different numbers of expert trajectories (1000, 100, 10, 5).

  • Hopper-v1 (Average total return of expert policy: 3589)
python3 main.py --env_id Hopper-v1 --expert_path baselines/ppo1/deterministic.ppo.Hopper.0.00.pkl --g_step 3 --adversary_entcoeff 0

  • Walker2d-v1 (Average total return of expert policy: 4392)
python3 main.py --env_id Walker2d-v1 --expert_path baselines/ppo1/deterministic.ppo.Walker2d.0.00.pkl --g_step 3 --adversary_entcoeff 1e-3

  • HalfCheetah-v1 (Average total return of expert policy: 2110)

For HalfCheetah-v1 and Ant-v1, pretraining with behavior cloning is needed:

python3 main.py --env_id HalfCheetah-v1 --expert_path baselines/ppo1/deterministic.ppo.HalfCheetah.0.00.pkl --pretrained True --BC_max_iter 10000 --g_step 3 --adversary_entcoeff 1e-3

You can find more details here, GAIL policy here, and BC policy here

Hacking

We don't have a pip package yet, so you'll need to add this repo to your PYTHONPATH manually.

export PYTHONPATH=/path/to/your/repo/with/gailtf:$PYTHONPATH
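
To confirm the path is picked up, a quick check (assuming gailtf is importable as a package) is:

python3 -c "import gailtf; print(gailtf.__file__)"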

TODO

  • Create pip package/setup.py
  • Make style PEP8 compliant
  • Create requirements.txt
  • Depend on openai/baselines directly and modularize modifications
  • openai/roboschool support

Troubleshooting

  • If you encounter the error Cannot compile MPI programs. Check your configuration, or the system complains about a missing mpi.h, install the OpenMPI development headers:
sudo apt install libopenmpi-dev

Reference

  • Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning, [arxiv]
  • @openai/imitation
  • @openai/baselines

gail-tf's People

Contributors

andrewliao11, eric-heiden, rubenvillegas, ryanjulian, wsjeon


gail-tf's Issues

Error about main.py

system: Ubuntu 16.04
GPU: Nvidia 1050Ti
CPU: I7
Mujoco version: pro150
ERROR:
When I start imitation learning with the following command:
python3 main.py --env_id $ENV_ID --expert_path $PICKLE_PATH

The MuJoCo simulator appeared, the model appeared, and everything was OK, but after about 2 seconds MuJoCo entered an unresponsive state (stopped running). The command terminal, however, kept running and computing. See the screenshots and terminal output below for details:

Screenshot
https://drive.google.com/open?id=1RKHxacZqvodmIOkBaNGxJcfmTdLOi6dO
https://drive.google.com/open?id=12_pbRtbtbwguY-j77LkVf6D4vsTHcBKX

Terminal:

********** Iteration 802 ************
Optimizing Policy...
sampling
done in 2.158 seconds
computegrad
done in 0.003 seconds
cg
      iter residual norm  soln norm
         0       2.15          0
         1       3.63      0.364
         2       3.73      0.821
         3       2.89       1.38
         4       3.28       2.06
         5       2.47       2.77
         6       2.56        3.4
         7       2.48       3.89
         8       1.57       4.48
         9       1.43       4.89
        10      0.906       5.23
done in 0.031 seconds
Expected: 0.311 Actual: 0.352
violated KL constraint. shrinking step.
Expected: 0.311 Actual: 0.163
Stepsize OK!
vf
done in 0.109 seconds
sampling
done in 2.047 seconds
computegrad
done in 0.003 seconds
cg
      iter residual norm  soln norm
         0       1.96          0
         1       2.97      0.438
         2       3.54      0.921
         3       3.14       1.53
         4       2.65       2.26
         5       2.46       2.92
         6       2.07       3.57
         7       1.96       4.11
         8       1.45       4.56
         9       1.23       4.97
        10      0.834       5.31
done in 0.031 seconds
Expected: 0.309 Actual: 0.331
violated KL constraint. shrinking step.
Expected: 0.309 Actual: 0.158
Stepsize OK!
vf
done in 0.110 seconds
sampling
done in 2.614 seconds
computegrad
done in 0.003 seconds
cg
      iter residual norm  soln norm
         0       2.34          0
         1       4.35      0.363
         2       3.25      0.842
         3       3.98       1.34
         4        3.3       1.87
         5       3.32       2.53
         6       3.49        3.1
         7       2.46        3.6
         8       1.99       4.21
         9       1.81       4.67
        10       1.21       5.13
done in 0.032 seconds
Expected: 0.314 Actual: 0.439
violated KL constraint. shrinking step.
Expected: 0.314 Actual: 0.180
Stepsize OK!
vf
done in 0.102 seconds
Optimizing Discriminator...
generator_loss |   expert_loss |       entropy |  entropy_loss | generator_acc |    expert_acc
   0.55519825 |     0.4537142 |    0.58335435 | -0.0005833544 |    0.70410156 |    0.82421875
----------------------------------
| EpLenMean       | 135          |
| EpRewMean       | 67.860306    |
| EpThisIter      | 8            |
| EpTrueRewMean   | 701          |
| EpisodesSoFar   | 7982         |
| TimeElapsed     | 5.97e+03     |
| TimestepsSoFar  | 821432       |
| entloss         | 0.0          |
| entropy         | 24.121952    |
| ev_tdlam_before | 0.846        |
| meankl          | 0.0051494665 |
| optimgain       | 0.17973988   |
| surrgain        | 0.17973988   |
----------------------------------

The difference in the 'stochastic' setting between 'traj_episode' and 'traj_segment'

Hi, andrewliao11!
It's me again...
In trpo_mpi.learn(), the code sets 'stochastic' to True for 'traj_segment';
but in trpo_mpi.evaluate(), the code uses the default 'stochastic' of False for 'traj_episode'.
Should evaluate really differ here? It seems to give very different trajectory performance when learning with a stochastic policy but evaluating with a deterministic one.
Thanks!
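
For context, the baselines-style policy interface takes a stochastic flag at every action, so the same policy can score very differently under the two settings. A rough sketch of an evaluation rollout, assuming the pi.act(stochastic, ob) signature used by openai/baselines MLP policies:

def rollout(pi, env, stochastic, horizon=1000):
    # stochastic=True samples from the action distribution (as in training);
    # stochastic=False uses the distribution mode (as in the default evaluate).
    ob = env.reset()
    total_reward = 0.0
    for _ in range(horizon):
        ac, _ = pi.act(stochastic, ob)
        ob, rew, done, _ = env.step(ac)
        total_reward += rew
        if done:
            break
    return total_reward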

Error in evaluate GAIL agent

$python main.py --task evaluate --stochastic_policy True --expert_path gailtf/baselines/ppo1/stochastic.ppo.Hopper.0.00.pkl --load_model_path gailtf/baselines/ppo1/checkpoint/ppo.Hopper.0.00/ppo.Hopper.0.00-400

2018-08-13 08:34:00.525937: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key adversary/fully_connected/biases not found in checkpoint
Traceback (most recent call last):
  File "/home/huang/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/home/huang/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/huang/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
    status, run_metadata)
  File "/home/huang/venv/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Key adversary/fully_connected/biases not found in checkpoint
	 [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 125, in <module>
    main(args)
  File "main.py", line 117, in main
    number_trajs=10, stochastic_policy=args.stochastic_policy)
  File "/home/huang/gail-tf/gailtf/algo/trpo_mpi.py", line 400, in evaluate
    U.load_state(load_model_path)
  File "/home/huang/gail-tf/gailtf/baselines/common/tf_util.py", line 280, in load_state
    saver.restore(get_session(), fname)
  File "/home/huang/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1775, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/huang/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/home/huang/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1140, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/huang/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    run_metadata)
  File "/home/huang/venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key adversary/fully_connected/biases not found in checkpoint
	 [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Caused by op 'save/RestoreV2', defined at:
  File "main.py", line 125, in <module>
    main(args)
  File "main.py", line 117, in main
    number_trajs=10, stochastic_policy=args.stochastic_policy)
  File "/home/huang/gail-tf/gailtf/algo/trpo_mpi.py", line 400, in evaluate
    U.load_state(load_model_path)
  File "/home/huang/gail-tf/gailtf/baselines/common/tf_util.py", line 279, in load_state
    else: saver = tf.train.Saver()
  File "/home/huang/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1311, in __init__
    self.build()
  File "/home/huang/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1320, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/huang/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1357, in _build
    build_save=build_save, build_restore=build_restore)
  File "/home/huang/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 809, in _build_internal
    restore_sequentially, reshape)
  File "/home/huang/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 448, in _AddRestoreOps
    restore_sequentially)
  File "/home/huang/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 860, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/home/huang/venv/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1458, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/home/huang/venv/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/huang/venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
    op_def=op_def)
  File "/home/huang/venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

NotFoundError (see above for traceback): Key adversary/fully_connected/biases not found in checkpoint
	 [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]


If you suspect this is an IPython bug, please report it at:
    https://github.com/ipython/ipython/issues
or send an email to the mailing list at [email protected]

You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.

Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
    %config Application.verbose_crash=True

What does "Key adversary" mean?

rendering in mujoco

Is it possible to render the expert policy generation (TRPO/PPO) and the GAIL training period in MuJoCo?

fail to load model

NotFoundError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./checkpoint/trpo.Hopper.0.00/trpo.Hopper.00-900
[[Node: save/RestoreV2_31 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_31/tensor_names, save/RestoreV2_31/shape_and_slices)]]

Regarding extending the current implementation to more environment

Hi Sanket, thanks for your implementation of GAIL. I plan to run it in AI2THOR for my experiment, so I guess some changes are needed in the code. Could you give some tips on how to do this? My AI2THOR already has a wrapper to interface with OpenAI Gym, so I guess the changes are not that extensive? (https://github.com/TheMTank/cups-rl)

One more thing I would like to clarify: is the original GAIL applicable if my expert trajectories (sample data) are for the same task but collected in a different environment? My gut feeling is yes; otherwise one could simply adopt behavioral cloning. But I would like to double-check with you.

Thank you for replying.

why use sigmoid cross entropy as the loss function in Discriminator?

It's interesting to see the loss function trick in training the discriminator:
z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))
May I ask why you use the sigmoid form here?
Also, in this code you use (s, a) pairs as samples to train the discriminator. There is likely a large difference between the generated states and the expert states, which I guess may affect the confidence of the discriminator. I wonder whether it is better to train the discriminator on actions (a|s) taken from the same states?
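
For reference, the expression above is just the standard sigmoid cross-entropy on logits x with labels z (1 for expert, 0 for generator), which TensorFlow provides in a numerically stable form. A minimal sketch, not the repository's exact code:

import tensorflow as tf

logits = tf.placeholder(tf.float32, [None])   # discriminator output x, pre-sigmoid
labels = tf.placeholder(tf.float32, [None])   # z: 1 for expert pairs, 0 for generator pairs

# Equivalent to z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x)).
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))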

Cannot execute run_mujoco.py

Hi, thank you for publishing your code of GAIL.
I tried to run run_mujoco.py in the directory ~/gail-tf/baselines/trpo_mpi.
However I got the error message below.

[2017-10-21 17:02:39,965] Making new env: Hopper-v1
Traceback (most recent call last):
File "run_mujoco.py", line 63, in
main()
File "run_mujoco.py", line 59, in main
train(args)
File "run_mujoco.py", line 42, in train
ckpt_dir=args.checkpoint_dir, load_model_path=args.load_model_path, task=args.task)
TypeError: learn() got an unexpected keyword argument 'sample_stochastic'

How can I solve this problem?

Is this line correct?

obs = (obs_ph - self.obs_rms.mean / self.obs_rms.std)

Should the line above be:

obs = (obs_ph - self.obs_rms.mean) / self.obs_rms.std

instead of

obs = (obs_ph - self.obs_rms.mean / self.obs_rms.std)

If I am confusing something, please let me know.
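
For comparison, a self-contained sketch of the intended normalization, with the subtraction applied before the division (the clipping below mirrors the pattern in openai/baselines policies and is an addition, not taken from this repository):

import numpy as np

def normalize_obs(obs, rms_mean, rms_std, clip=5.0):
    # Subtract the running mean first, then divide by the running std;
    # without the closing parenthesis only the mean gets divided.
    return np.clip((obs - rms_mean) / rms_std, -clip, clip)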

Problem with U.save_state

Hi. Thanks for your nice code.

In tf_util.py, save_state is defined and used at trpo_mpi or pposgd_simple.

However, since save_state is called inside the training loop, each call makes the checkpoint meta file grow in size.


I think you should modify this part by constructing the tf.train.Saver once, outside the training loop.

Thanks.
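
A minimal sketch of the suggested fix (variable names and values are illustrative): build one tf.train.Saver before the loop, reuse it, and write the meta graph only once so the .meta file is not rewritten on every save:

import tensorflow as tf

num_iterations = 1000            # illustrative values, not taken from the repo
save_every = 100
ckpt_dir = "./checkpoint"

w = tf.Variable(0.0, name="dummy")         # stand-in for the model's variables
saver = tf.train.Saver(max_to_keep=5)      # construct once, outside the loop

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for it in range(num_iterations):
        # ... one training iteration ...
        if it % save_every == 0:
            # Skipping the meta graph after the first save keeps the .meta
            # file from being regenerated every iteration.
            saver.save(sess, ckpt_dir + "/model", global_step=it,
                       write_meta_graph=(it == 0))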

gail performance for other MuJoCo tasks

Hi @andrewliao11 :
When applying the default settings to other MuJoCo tasks (besides Hopper), it is hard to recover GAIL performance as good as the results you presented. When I train HalfCheetah using HalfCheetah expert trajectories (whose average return evaluates to 5212.5), the GAIL results look like the screenshot below:
[screenshot of HalfCheetah GAIL training curves]

I have also tried adjusting the default 'g_step' and 'd_step' settings, but it still does not work.
Can you give me some advice on training other MuJoCo tasks?
Thank you!

issue about running expert data

[2017-10-18 19:40:13,749]
I run the code in Python 3.6.2 and have installed MuJoCo and configured the mjkey successfully:

$Making new env: Ant-v1
Traceback (most recent call last):
File "run_mujoco.py", line 56, in
main()
File "run_mujoco.py", line 52, in main
train(args)
File "run_mujoco.py", line 34, in train
task_name=task_name
TypeError: learn() got an unexpected keyword argument 'ckpt_dir'

Can you help me with this issue? By the way, I appreciate you sharing the GAIL code in TensorFlow!

how to play this result in mujoco

Hi, I want to know how to play back the result in MuJoCo. I have finished training and I want to visualize the learned policy in MuJoCo. Can you share a method for doing this?
