
gail_ppo_tf's Introduction

Generative Adversarial Imitation Learning

Implementation of Generative Adversarial Imitation Learning (GAIL) using TensorFlow

Dependencies

python>=3.5
tensorflow>=1.4
gym>=0.9.3

Gym environment

Env==CartPole-v0
State==Continuous
Action==Discrete

Usage

Train experts

python3 run_ppo.py     

Sample trajectory using expert

python3 sample_trajectory.py

Run GAIL

python3 run_gail.py  

Run supervised learning

python3 run_behavior_clone.py 

Test trained policy

python3 test_policy.py  

The default policy tested is the one trained with GAIL.
Use --alg=bc or --alg=ppo to test a different policy.

If you want to test the BC policy, specify the number of the model.ckpt-<number> checkpoint in the trained_models/bc directory.
Example

python3 test_policy.py --alg=bc --model=1000

Tensorboard

tensorboard --logdir=log

Results

Fig. 1: Training results.

LICENSE

MIT License

gail_ppo_tf's People

Contributors

uidilr


gail_ppo_tf's Issues

About the old and new policy?

What is happening in this line?

After this, when we run the training graph by feeding obs into the placeholders, what happens to the old policy? Does it use the assigned parameters or the previously initialized parameters? I think it uses the previously initialized params, which happens in this line. Anyway, why do you need to run the assign op before the training op?
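(As background, the standard PPO pattern is to snapshot the current policy parameters into the old-policy variables right before each round of updates, so that the probability ratio in the loss is computed against the pre-update policy. A minimal sketch, assuming variable scopes named "policy" and "old_policy", which are illustrative and not necessarily the names used in this repository:)

import tensorflow as tf

# Collect the trainable variables of both networks by scope name
# (the scope names are assumptions for illustration).
pi_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="policy")
old_pi_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="old_policy")

# Assign ops that copy the current parameters into the old-policy network.
assign_ops = [old.assign(cur) for old, cur in zip(old_pi_vars, pi_vars)]

# Typical update loop: snapshot first, then run several gradient steps.
# sess.run(assign_ops)              # old_policy now matches the pre-update policy
# for _ in range(num_epochs):
#     sess.run(train_op, feed_dict={...})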

Rollout of the policy and collecting data, especially reward

Hi, thanks for sharing this. Quick question: here, where you are collecting rewards,

rewards.append(reward)

I guess this is not right: your actions and rewards are not related; you are relating the old reward to a new action. I think you need to move the reward append to after the step

next_obs, reward, done, info = env.step(act)

and then append the reward there. For run_gail.py it is okay to do this, because the environment reward is not used for the update (the discriminator reward is used instead), but for run_ppo.py I guess it is not correct, especially when initially training the expert.
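For reference, a minimal rollout sketch in which the reward is appended only after env.step, so each stored reward corresponds to the action that produced it (the Policy.act return signature and variable names here are assumptions for illustration):

obs = env.reset()
observations, actions, rewards = [], [], []
done = False
while not done:
    act, v_pred = policy.act(obs, stochastic=True)   # return signature assumed
    next_obs, reward, done, info = env.step(act)
    observations.append(obs)
    actions.append(act)
    rewards.append(reward)                           # reward produced by `act`
    obs = next_obs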

Some issue about the setting of 'stochastic'

Hi Yusuke-san,
I really admire your coding skills.
I have reviewed another GAIL implementation that is based on TRPO. After reading your GAIL code, I noticed a common setting for the parameter 'stochastic'. In run_ppo and run_gail, you use stochastic=True in Policy.act(), but in test_policy you use stochastic=False in Policy.act().
So why do you use a STOCHASTIC policy when training but a DETERMINISTIC policy when testing?
Thanks!
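(As background, the usual reason is that sampling from the action distribution during training provides exploration and matches the on-policy assumptions of PPO, while taking the most probable action at test time evaluates the learned policy without sampling noise. A minimal sketch of the two modes for a discrete action space; the tensor names are illustrative:)

import tensorflow as tf

# logits: policy network output over discrete actions (illustrative name)
act_probs = tf.nn.softmax(logits)

# Training (stochastic=True): sample an action from the categorical distribution.
stochastic_action = tf.squeeze(tf.multinomial(logits, num_samples=1), axis=1)

# Testing (stochastic=False): take the most probable action.
deterministic_action = tf.argmax(act_probs, axis=1)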

Is there any way we can encourage the exploration strategy of the agent?

I've heard that by adding entropy to the loss we can encourage the RL agent to explore more. I tried to train this PPO-GAIL method with high-dimensional inputs. At the beginning of training, almost all actions have the same probability. After some training, some actions get a higher probability (in the direction of getting more reward), and the entropy is reduced over time.

If I want this behavior, how can I add entropy?
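One common way (not specific to this repository) is to subtract an entropy bonus from the policy loss, so the optimizer is also rewarded for keeping the action distribution spread out. A sketch for a discrete action space; logits, clip_loss, and the coefficient value are placeholders:

import tensorflow as tf

act_probs = tf.nn.softmax(logits)            # logits: policy output (illustrative)
entropy = -tf.reduce_sum(
    act_probs * tf.log(tf.clip_by_value(act_probs, 1e-10, 1.0)), axis=1)
entropy_bonus = tf.reduce_mean(entropy)

ent_coef = 0.01                              # assumed coefficient, tune per task
total_loss = clip_loss - ent_coef * entropy_bonus   # clip_loss: clipped surrogate loss (placeholder)
train_op = tf.train.AdamOptimizer(1e-4).minimize(total_loss)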

PPO implementation and KL divergence of the GAIL paper

Hi, in ppo.py, would you mind explaining the implementation of the loss function? Which loss equation of the PPO paper has been implemented? Also, with regard to the GAIL paper, where has Equation 18 of the GAIL paper been implemented? Thanks for your response.
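For reference, the clipped surrogate objective from the PPO paper (Eq. 7 in Schulman et al., 2017) is commonly implemented roughly as follows; the tensor names are assumptions, not necessarily the ones used in ppo.py:

import tensorflow as tf

# log_prob / old_log_prob: log pi(a|s) under the current and snapshot policies (illustrative names)
ratio = tf.exp(log_prob - old_log_prob)
clipped_ratio = tf.clip_by_value(ratio, 1.0 - clip_eps, 1.0 + clip_eps)   # clip_eps, e.g. 0.2

# advantages: estimated advantages (e.g. from GAE); the objective is maximized,
# so the loss is its negative.
surrogate = tf.minimum(ratio * advantages, clipped_ratio * advantages)
policy_loss = -tf.reduce_mean(surrogate)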

Discriminator optimisation with each [state,action]

Hi,
When you optimize the discriminator to output probabilities (later used as rewards) for each [state, action] tuple, you consider the whole batch.

By doing this, don't we lose the sequential behavior of the actions? We also lose the start-to-end connection of a trajectory, because the neural network only takes one pair into account at a single time step.

What about using an LSTM or RNN?
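(As background, the GAIL discriminator in the original paper does score each state-action pair independently, and the policy's surrogate reward is derived from that per-pair score; using an LSTM/RNN over trajectories would be a modification of the method rather than what the paper specifies. A minimal per-pair discriminator sketch; the architecture and names are illustrative, not the ones used in this repository:)

import tensorflow as tf

def discriminator(state, action, reuse=False):
    # Scores a single (state, action) pair; no temporal context is used.
    with tf.variable_scope("discriminator", reuse=reuse):
        x = tf.concat([state, action], axis=1)
        h = tf.layers.dense(x, 100, activation=tf.nn.tanh)
        h = tf.layers.dense(h, 100, activation=tf.nn.tanh)
        return tf.layers.dense(h, 1)    # logit: higher means more expert-like

# One common surrogate reward for the policy: -log(1 - D(s, a))
# prob = tf.nn.sigmoid(discriminator(state, action, reuse=True))
# reward = -tf.log(tf.clip_by_value(1.0 - prob, 1e-10, 1.0))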

Visualize the gradients in the discriminator on TensorBoard

Hi,
I think it is useful to visualize the gradients in the discriminator. Instead of using the Adam optimizer in the given way, I used:

optimizer = tf.train.AdamOptimizer()
grads = optimizer.compute_gradients(cross_entropy)
train_step = optimizer.apply_gradients(grads)
for index, grad in enumerate(grads):
    tf.summary.histogram("{}-grad".format(grads[index][1].name), grads[index][0])
