code-for-paper's Issues
The result of PPO-M
When I run PPO-M with the default params in MuJoCo.json, the average mean_reward of the last training step over 10 agents is much smaller than the result in the paper. In Walker2d-v2, my PPO-M result is about 600-900, and in Hopper-v2 it is about 300. In Humanoid-v2, training always fails with the default ppo_lr_adam=1e-4.
Square after max for clipped value_gae_loss, any comment?
The original implementation first squares both the clipped and unclipped losses and then takes the maximum. Any comment as to why this is handled differently here?
code-for-paper/src/policy_gradients/steps.py
Line 100 in 094994f
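For concreteness, here is a minimal sketch of the two orderings under discussion; the tensor names (`v`, `v_old`, `ret`, `eps`) are illustrative assumptions, not the repo's actual variables:

```python
import torch

# Illustrative inputs: new value predictions, old predictions, return targets
v, v_old, ret = torch.randn(8), torch.randn(8), torch.randn(8)
eps = 0.2
v_clipped = v_old + torch.clamp(v - v_old, -eps, eps)

# Baselines-style ordering: square both errors, then take the elementwise maximum
loss_sq_then_max = torch.max((v - ret) ** 2, (v_clipped - ret) ** 2).mean()

# Ordering questioned here: take the maximum of the raw errors, then square
loss_max_then_sq = (torch.max(v - ret, v_clipped - ret) ** 2).mean()

# The two differ whenever the errors have opposite signs, since
# max(a, b) ** 2 != max(a ** 2, b ** 2) in that case.
```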
Which version of MuJoCo is used?
Can you update the versions of the libraries in requirements.txt? Thank you!
License
Hi,
I would like to use the code for my research. Would it be possible to add a license, e.g. MIT, to the repository?
Thanks.
Issues with the RewardFilter
Hi, thanks for the great work. I have three questions if you don't mind.
- In the linked code, the comments suggest it uses "Incorrect reward normalization". I was wondering if you could elaborate. Does that mean we should avoid using `RewardFilter` because of the incorrect normalization and try to use `Zfilter` instead for the reward normalization?
- Another concern I have is with the `reset()` call of the `RewardFilter`. It seems that in your customized envs,
```python
def reset(self):
    # Reset the state, and the running total reward
    start_state = self.env.reset()
    self.total_true_reward = 0.0
    self.counter = 0.0
    self.state_filter.reset()
    return self.state_filter(start_state, reset=True)
```
  the `reward_filter` is never reset. However, the `reward_filter` always multiplies the existing returns by `gamma`. Could this be a bug? (See the sketch after this list.)
- The `reward_filter` is already using `gamma` as part of its inputs, but do you still calculate the advantage using `gamma` again, or is this somehow omitted?
Thanks.
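To ground the last two questions, here is a minimal sketch of this style of reward scaling, assuming a filter that divides each reward by the standard deviation of a running discounted return; the class and attribute names are illustrative, not the repo's exact implementation:

```python
import numpy as np

class RewardFilterSketch:
    """Minimal sketch: scale rewards by the std of a running discounted return."""

    def __init__(self, gamma, eps=1e-8):
        self.gamma = gamma
        self.eps = eps
        self.ret = 0.0      # running discounted return
        self.returns = []   # history of returns, used to estimate the std

    def __call__(self, reward):
        # gamma enters here, independently of its later use in advantage estimation
        self.ret = self.ret * self.gamma + reward
        self.returns.append(self.ret)
        std = float(np.std(self.returns))
        # Scale (not center) the raw reward by the return's std
        return reward / (std + self.eps)

    def reset(self):
        # Zeroing the running return at episode boundaries is exactly the
        # reset that the env's reset() above never performs for reward_filter
        self.ret = 0.0
```

If `reset()` is never called, `ret` keeps compounding across episode boundaries, which appears to be the behavior the second question flags; and since `gamma` here only rescales the reward, it is distinct from the `gamma` used later in advantage estimation.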
Code for value function clipping
In the paper (Sec. 2), the description of value function clipping says the PPO-like objective takes the minimum of the clipped and unclipped terms. However, the code applies the maximum of them; does this have a different effect from what the paper says?
code-for-paper/src/policy_gradients/steps.py
Lines 89 to 96 in 094994f
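One way to reconcile the phrasing (my own reading, not a statement from the authors): the paper speaks of an objective to maximize, while the code computes a loss to minimize, and the pessimistic maximum of two losses equals the negated minimum of the corresponding objectives. A small check, with illustrative tensor names:

```python
import torch

v, v_old, ret = torch.randn(8), torch.randn(8), torch.randn(8)
eps = 0.2
v_clipped = v_old + torch.clamp(v - v_old, -eps, eps)

unclipped = (v - ret) ** 2
clipped = (v_clipped - ret) ** 2

# Loss form: pessimistic maximum of the squared errors
loss = torch.max(unclipped, clipped).mean()

# Objective form: minimum of the negated errors, as phrased for objectives
obj = torch.min(-unclipped, -clipped).mean()

assert torch.allclose(loss, -obj)  # max(a, b) == -min(-a, -b)
```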