code-for-paper's Issues
Which version of MuJoCo is used?
Can you update the versions of the libraries in requirements.txt? Thank you!
License
Hi,
I would like to use the code for my research. Would it be possible to add a license, e.g. MIT, to the repository?
Thanks.
Code for value function clipping
In the paper (Sec. 2), the description of value function clipping gives a PPO-like objective that takes the minimum of the clipped and unclipped terms. The code, however, applies the maximum of them. Does this have a different effect from what the paper describes?
code-for-paper/src/policy_gradients/steps.py, lines 89 to 96 at commit 094994f
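For reference, a minimal sketch of the two variants under discussion (hypothetical function and argument names, not the repo's exact code):

import torch

def clipped_value_loss(values, old_values, returns, eps=0.2, use_max=True):
    # Keep the new value prediction within eps of the old prediction.
    values_clipped = old_values + torch.clamp(values - old_values, -eps, eps)
    loss_unclipped = (values - returns).pow(2)
    loss_clipped = (values_clipped - returns).pow(2)
    if use_max:
        # Pessimistic variant used by many PPO implementations:
        # clipping can only increase the loss, never decrease it.
        return torch.max(loss_unclipped, loss_clipped).mean()
    # Literal "min" reading, mirroring the clipped policy objective.
    return torch.min(loss_unclipped, loss_clipped).mean()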
Square after max for clipped value_gae_loss, any comment?
The original implementation first squares both the clipped and unclipped losses and then takes the maximum. Any comment on why this is handled differently here?
code-for-paper/src/policy_gradients/steps.py, line 100 at commit 094994f
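The ordering matters whenever the signed errors can have opposite signs, since squaring discards the sign; a tiny illustrative check:

import torch

err_unclipped = torch.tensor([-3.0])  # value prediction below the target
err_clipped = torch.tensor([1.0])     # clipped prediction above the target

# Square first, then take the maximum: keeps the larger magnitude.
print(torch.max(err_unclipped.pow(2), err_clipped.pow(2)).item())  # 9.0
# Maximum of the signed errors first, then square.
print(torch.max(err_unclipped, err_clipped).pow(2).item())         # 1.0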
Issues with the RewardFilter
Hi, thanks for the great work. I have three questions if you don't mind.
- In the RewardFilter code, the comments suggest it uses "Incorrect reward normalization". I was wondering if you could elaborate. Does that mean we should avoid using RewardFilter because of the incorrect normalization and try to use ZFilter instead for the reward normalization?
Another concern I have is with the reset() call of the RewardFilter. It seems that in your customized envs:
def reset(self):
    # Reset the state, and the running total reward
    start_state = self.env.reset()
    self.total_true_reward = 0.0
    self.counter = 0.0
    self.state_filter.reset()
    return self.state_filter(start_state, reset=True)
- It seems the reward_filter will never reset. However, the reward_filter always multiplies the existing return by gamma. Could this be a bug? (A sketch of this filtering pattern follows the list below.)
- The reward_filter is already using gamma as part of its inputs, but do you still calculate the advantage using gamma again, or is this somehow omitted?
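For context, here is a minimal sketch of the discounted-return reward-scaling pattern these two questions refer to (hypothetical class and helper names, assuming a Welford-style running statistic; not the repo's exact code):

import numpy as np

class RunningStat:
    # Tracks the mean/variance of a scalar stream (Welford's algorithm).
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return np.sqrt(self.m2 / self.n) if self.n > 1 else 1.0

class RewardFilter:
    # Rescales each reward by the std of a running discounted-return estimate.
    def __init__(self, gamma):
        self.gamma = gamma
        self.ret = 0.0            # running discounted return; persists unless reset
        self.stat = RunningStat()

    def __call__(self, reward):
        self.ret = self.ret * self.gamma + reward
        self.stat.push(self.ret)
        return reward / (self.stat.std + 1e-8)

    def reset(self):
        self.ret = 0.0            # the call the questions above say never happens

Note that in this pattern gamma only shapes the normalization statistic; the filter still returns a rescaled per-step reward, so the gamma used in the advantage estimate is a separate discounting step.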
Thanks.
The result of PPO-M
When I run PPO-M with the default params in MuJoCo.json, the average mean_reward of the final training step over 10 agents is much smaller than the result in the paper. In Walker2d-v2 my PPO-M result is about 600-900, and in Hopper-v2 it is about 300. In Humanoid-v2 it always fails to run with the default ppo_lr_adam=1e-4.