
code-for-paper's Issues

The result of PPO-M

When I run PPO-M with the default parameters in MuJoCo.json, the average mean_reward of the last training step over 10 agents is much smaller than the result in the paper. In Walker2d-v2, my PPO-M result is about 600-900, and in Hopper-v2 it is about 300. In Humanoid-v2, training always fails with the default ppo_lr_adam=1e-4.

License

Hi,

I would like to use the code for my research. Would it be possible to add a license, e.g. MIT, to the repository?
Thanks.

Issues with the RewardFilter

Hi, thanks for the great work. I have three questions if you don't mind.

  1. In the code, the comments describe the RewardFilter as using "Incorrect reward normalization". I was wondering if you could elaborate. Does that mean we should avoid the RewardFilter because of the incorrect normalization and use the ZFilter for reward normalization instead?

Another concern I have is with the reset() call of the RewardFilter. In your customized envs, reset() is implemented as:

    def reset(self):
        # Reset the state, and the running total reward
        start_state = self.env.reset()
        self.total_true_reward = 0.0
        self.counter = 0.0
        self.state_filter.reset()
        return self.state_filter(start_state, reset=True)
  2. It seems the reward_filter is never reset, yet it keeps multiplying the existing return by gamma on every step. Could this be a bug? (A minimal sketch of this filter pattern follows the list below.)

  3. The reward_filter already uses gamma as part of its input, but do you still apply gamma again when computing the advantage, or is that somehow omitted?
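
For concreteness, here is a minimal, illustrative sketch of the reward-filter pattern in question (an assumed reimplementation, not the repository's exact code): the filter accumulates a discounted running return, scales each reward by the running standard deviation of that return, and exposes a reset() that clears the accumulator at episode boundaries. If that reset() is never called from the env's own reset(), the discounted return carries over across episodes, which is the behaviour questioned in point 2.

    import numpy as np

    class RunningStat:
        # Running mean/variance via Welford's algorithm (illustrative helper).
        def __init__(self):
            self.n, self.mean, self.m2 = 0, 0.0, 0.0

        def push(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        @property
        def std(self):
            return float(np.sqrt(self.m2 / self.n)) if self.n > 1 else 1.0

    class RewardFilterSketch:
        # Scales each reward by the std of a discounted running return.
        def __init__(self, gamma, clip=10.0):
            self.gamma, self.clip = gamma, clip
            self.ret = 0.0              # discounted return accumulator
            self.rs = RunningStat()

        def __call__(self, reward):
            # ret_t = gamma * ret_{t-1} + r_t
            self.ret = self.gamma * self.ret + reward
            self.rs.push(self.ret)
            scaled = reward / (self.rs.std + 1e-8)
            return float(np.clip(scaled, -self.clip, self.clip))

        def reset(self):
            # Clear the accumulator at an episode boundary; without this call
            # the discounted return leaks across episodes.
            self.ret = 0.0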

Thanks.

Code for value function clipping

In the paper (Sec. 2), the description of value function clipping says the PPO-like objective takes the minimum of the clipped and unclipped terms. However, the code takes the maximum of them. Does this have a different effect from what the paper describes?

    # In OpenAI's PPO implementation, we clip the value function around the previous value estimate
    # and use the worse of the clipped and unclipped versions to train the value function
    # Presumably the inspiration for this is similar to PPO
    if params.VALUE_CLIPPING:
        val_loss_mat = ch.max(val_loss_mat_unclipped, val_loss_mat_clipped)
    else:
        val_loss_mat = val_loss_mat_unclipped
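
For what it's worth, minimizing the elementwise maximum of two losses is the same pessimistic choice as maximizing the elementwise minimum of two objectives (a loss and an objective differ by a sign), so the code and the paper's description appear consistent. Below is a minimal sketch of how the two loss terms are typically formed; the function and its variable names are illustrative, not the repository's exact code, and it assumes ch refers to torch as the elementwise max/clamp calls suggest.

    import torch as ch

    def clipped_value_loss(vals, old_vals, returns, eps):
        # Unclipped squared error between new value predictions and return targets.
        val_loss_mat_unclipped = (vals - returns) ** 2
        # Clip the new predictions to stay within eps of the previous estimates.
        vals_clipped = old_vals + ch.clamp(vals - old_vals, -eps, eps)
        val_loss_mat_clipped = (vals_clipped - returns) ** 2
        # The elementwise max picks the larger (worse) loss, i.e. the pessimistic
        # bound -- analogous to PPO taking the min of its (maximized) surrogate.
        return ch.max(val_loss_mat_unclipped, val_loss_mat_clipped).mean()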
