
gail_ppo_tf's Introduction

Generative Adversarial Imitation Learning

Implementation of Generative Adversarial Imitation Learning (GAIL) using TensorFlow

Dependencies

python>=3.5
tensorflow>=1.4
gym>=0.9.3

Gym environment

Env==CartPole-v0
State==Continuous
Action==Discrete

Usage

Train experts

python3 run_ppo.py     

Sample trajectory using expert

python3 sample_trajectory.py

Run GAIL

python3 run_gail.py  

Run supervised learning

python3 run_behavior_clone.py 

Test trained policy

python3 test_policy.py  

The default policy tested is the one trained with GAIL.
Use --alg=bc or --alg=ppo to test a different policy.

If you want to test the BC policy, specify the number of the model.ckpt-<number> checkpoint in the trained_models/bc directory.
Example

python3 test_policy.py --alg=bc --model=1000

Tensorboard

tensorboard --logdir=log

Results

Fig. 1: Training results.

LICENSE

MIT License

gail_ppo_tf's People

Contributors

uidilr


gail_ppo_tf's Issues

About the old and new policy?

What is happening in this line?

After this, when we run the training graph by feeding obs into the placeholders, what happens to the old policy? Does it use the assigned parameters or the previously initialized parameters? I think it uses the previously initialized params, which happens in this line. Anyway, why do you need to run the assign op before the training op?
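(As background, the standard PPO pattern is to snapshot the current policy parameters into the old-policy variables right before each round of updates, so that the probability ratio in the loss is computed against the pre-update policy. A minimal sketch, assuming variable scopes named "policy" and "old_policy", which are illustrative and not necessarily the names used in this repository:)

import tensorflow as tf

# Collect the trainable variables of both networks by scope name
# (the scope names are assumptions for illustration).
pi_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="policy")
old_pi_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="old_policy")

# Assign ops that copy the current parameters into the old-policy network.
assign_ops = [old.assign(cur) for old, cur in zip(old_pi_vars, pi_vars)]

# Typical update loop: snapshot first, then run several gradient steps.
# sess.run(assign_ops)              # old_policy now matches the pre-update policy
# for _ in range(num_epochs):
#     sess.run(train_op, feed_dict={...})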

Rollout of the policy and collecting data, especially reward

Hi, thanks for sharing this. Quick question: here, where you are collecting rewards,

rewards.append(reward)

I guess this is not right: your actions and rewards are not related; you are relating the old reward to a new action. I think you need to move the reward append to after the step

next_obs, reward, done, info = env.step(act)

and then append the reward there. For run_gail.py it is okay to do this, because the environment reward is not used for the update (the discriminator reward is used instead), but for run_ppo.py I guess it is not correct, especially when initially training the expert.
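For reference, a minimal rollout sketch in which the reward is appended only after env.step, so each stored reward corresponds to the action that produced it (the Policy.act return signature and variable names here are assumptions for illustration):

obs = env.reset()
observations, actions, rewards = [], [], []
done = False
while not done:
    act, v_pred = policy.act(obs, stochastic=True)   # return signature assumed
    next_obs, reward, done, info = env.step(act)
    observations.append(obs)
    actions.append(act)
    rewards.append(reward)                           # reward produced by `act`
    obs = next_obs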

Some issue about the setting of 'stochastic'

Hi Yusuke-san,
I really admire your coding skills.
I have reviewed another GAIL implementation that is based on TRPO. After reading your GAIL code, I noticed a common setting for the parameter 'stochastic'. In run_ppo and run_gail, you use stochastic=True in Policy.act(), but in test_policy you use stochastic=False in Policy.act().
So why do you use a STOCHASTIC policy when training but a DETERMINISTIC policy when testing?
Thanks!
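(As background, the usual reason is that sampling from the action distribution during training provides exploration and matches the on-policy assumptions of PPO, while taking the most probable action at test time evaluates the learned policy without sampling noise. A minimal sketch of the two modes for a discrete action space; the tensor names are illustrative:)

import tensorflow as tf

# logits: policy network output over discrete actions (illustrative name)
act_probs = tf.nn.softmax(logits)

# Training (stochastic=True): sample an action from the categorical distribution.
stochastic_action = tf.squeeze(tf.multinomial(logits, num_samples=1), axis=1)

# Testing (stochastic=False): take the most probable action.
deterministic_action = tf.argmax(act_probs, axis=1)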

Is there any way we can encourage the exploration strategy of the agent?

I've heard that by adding entropy to the loss we can encourage the RL agent to explore more. I tried to train this PPO-GAIL method with high-dimensional inputs. At the beginning of training, almost all actions have the same probability. After some training, some actions get a higher probability (in the direction of getting more reward), and the entropy is reduced over time.

If I want this behavior, how can I add entropy?
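One common way (not specific to this repository) is to subtract an entropy bonus from the policy loss, so the optimizer is also rewarded for keeping the action distribution spread out. A sketch for a discrete action space; logits, clip_loss, and the coefficient value are placeholders:

import tensorflow as tf

act_probs = tf.nn.softmax(logits)            # logits: policy output (illustrative)
entropy = -tf.reduce_sum(
    act_probs * tf.log(tf.clip_by_value(act_probs, 1e-10, 1.0)), axis=1)
entropy_bonus = tf.reduce_mean(entropy)

ent_coef = 0.01                              # assumed coefficient, tune per task
total_loss = clip_loss - ent_coef * entropy_bonus   # clip_loss: clipped surrogate loss (placeholder)
train_op = tf.train.AdamOptimizer(1e-4).minimize(total_loss)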

PPO implementation and KL divergence of the GAIL paper

Hi, in ppo.py, would you mind explaining the implementation of the loss function? Which loss equation of the PPO paper has been implemented? Also, with regard to the GAIL paper, where has Equation 18 of the GAIL paper been implemented? Thanks for your response.
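For reference, the clipped surrogate objective from the PPO paper (Eq. 7 in Schulman et al., 2017) is commonly implemented roughly as follows; the tensor names are assumptions, not necessarily the ones used in ppo.py:

import tensorflow as tf

# log_prob / old_log_prob: log pi(a|s) under the current and snapshot policies (illustrative names)
ratio = tf.exp(log_prob - old_log_prob)
clipped_ratio = tf.clip_by_value(ratio, 1.0 - clip_eps, 1.0 + clip_eps)   # clip_eps, e.g. 0.2

# advantages: estimated advantages (e.g. from GAE); the objective is maximized,
# so the loss is its negative.
surrogate = tf.minimum(ratio * advantages, clipped_ratio * advantages)
policy_loss = -tf.reduce_mean(surrogate)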

Discriminator optimisation with each [state,action]

Hi,
When you optimize the discriminator to output probabilities (later used as rewards) for each [state, action] tuple, you consider the whole batch.

By doing this, don't we lose the sequential behavior of the actions? We also lose the start-to-end connection of a trajectory, because the neural network only takes one pair into account at a single time step.

What about using an LSTM or RNN?
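(As background, the GAIL discriminator in the original paper does score each state-action pair independently, and the policy's surrogate reward is derived from that per-pair score; using an LSTM/RNN over trajectories would be a modification of the method rather than what the paper specifies. A minimal per-pair discriminator sketch; the architecture and names are illustrative, not the ones used in this repository:)

import tensorflow as tf

def discriminator(state, action, reuse=False):
    # Scores a single (state, action) pair; no temporal context is used.
    with tf.variable_scope("discriminator", reuse=reuse):
        x = tf.concat([state, action], axis=1)
        h = tf.layers.dense(x, 100, activation=tf.nn.tanh)
        h = tf.layers.dense(h, 100, activation=tf.nn.tanh)
        return tf.layers.dense(h, 1)    # logit: higher means more expert-like

# One common surrogate reward for the policy: -log(1 - D(s, a))
# prob = tf.nn.sigmoid(discriminator(state, action, reuse=True))
# reward = -tf.log(tf.clip_by_value(1.0 - prob, 1e-10, 1.0))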

Visualize the gradients in the discriminator on TensorBoard

Hi,
I think it is useful to visualize the gradients in the discriminator. Instead of using the Adam optimizer in the given way, I used:

optimizer = tf.train.AdamOptimizer()
grads = optimizer.compute_gradients(cross_entropy)
train_step = optimizer.apply_gradients(grads)
for index, grad in enumerate(grads):
    tf.summary.histogram("{}-grad".format(grads[index][1].name), grads[index][0])
