iCEM

improved Cross Entropy Method for trajectory optimization

[presentation and experiments at https://martius-lab.github.io/iCEM/]

Abstract: Trajectory optimizers for model-based reinforcement learning, such as the Cross-Entropy Method (CEM), can yield compelling results even in high-dimensional control tasks and sparse-reward environments. However, their sampling inefficiency prevents them from being used for real-time planning and control. We propose an improved version of the CEM algorithm for fast planning, with novel additions including temporally-correlated actions and memory, requiring 2.7-22x fewer samples and yielding a performance increase of 1.2-10x in high-dimensional control problems.

Requirements

Install the dependencies via the provided Pipfile with pipenv install, then run pipenv shell to activate the virtualenv.

Running Experiments

  • Inside the icem folder, run python icem.py settings/[env]/[json]
  • To render all environments, set "render": true in iCEM/icem/settings/defaults/gt_default_env.json

iCEM improvements

The iCEM controller file is located here; it contains the following additions, which you can also extract and add to your own codebase:

  • colored-noise, line68:
    It uses the package colorednoise, which generates num_sim_traj temporally correlated action sequences along the planning-horizon dimension h.
    The parameter you have to tune per task is noise_beta, and it has an intuitive meaning: a higher β for low-frequency control (FETCH PICK&PLACE, RELOCATE, etc.) and a lower β for high-frequency control (HALFCHEETAH RUNNING). The values used in our experiments are listed below.
  |                          | iCEM/CEM with ground truth | iCEM with PlaNet      |
  |--------------------------|----------------------------|-----------------------|
  | horizon h                | 30                         | 12                    |
  | colored-noise exponent β | 0.25 HALFCHEETAH RUNNING   | 0.25 CHEETAH RUN      |
  |                          | 2.0 HUMANOID STANDUP       | 0.25 CARTPOLE SWINGUP |
  |                          | 2.5 DOOR                   | 2.5 WALKER WALK       |
  |                          | 2.5 DOOR (sparse reward)   | 2.5 CUP CATCH         |
  |                          | 3.0 FETCH PICK&PLACE       | 2.5 REACHER EASY      |
  |                          | 3.5 RELOCATE               | 2.5 FINGER SPIN       |
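
    A minimal sketch of this sampling step (not the repository code; the shapes and the values of num_sim_traj and noise_beta here are illustrative):

    ```python
    import colorednoise
    import numpy as np

    num_sim_traj, h, action_dim = 100, 30, 6  # population, horizon, action dimension
    noise_beta = 2.0  # higher beta -> smoother, low-frequency action sequences

    # powerlaw_psd_gaussian correlates samples along the LAST axis, so we
    # sample with shape (num_sim_traj, action_dim, h) and then move the
    # horizon axis to the middle: (num_sim_traj, h, action_dim).
    noise = colorednoise.powerlaw_psd_gaussian(
        noise_beta, size=(num_sim_traj, action_dim, h))
    action_sequences = noise.transpose(0, 2, 1)
    ```
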
  • clipping actions at boundaries, line79:
    Instead of sampling from a truncated normal distribution, we sample from the unmodified normal distribution (or colored-noise distribution) and clip the results to lie inside the permitted action interval. This lets us sample maximal actions more frequently.
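
    A sketch of the clipping step, assuming illustrative bounds action_low/action_high and plain Gaussian samples for brevity:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    action_low, action_high = -1.0, 1.0

    # Sample from the unmodified normal distribution (iCEM uses colored
    # noise) and clip into the action interval; this puts finite
    # probability mass on the boundary actions themselves.
    samples = rng.standard_normal((100, 30, 6))
    actions = np.clip(samples, action_low, action_high)
    ```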

  • decay of population size, line126:
    Since the standard deviation of the CEM-distribution shrinks at every CEM-iteration, we introduce an exponential decrease in population size by a fixed factor γ: num_sim_traj becomes max(self.elites_size * 2, int(num_sim_traj / self.factor_decrease_num))
    The max operation ensures that the population size is at least double the elites' size.
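
    As a sketch, with illustrative values for elites_size and factor_decrease_num (the γ above), the population shrinks like this across inner iterations:

    ```python
    elites_size, factor_decrease_num = 10, 1.25
    num_sim_traj = 100
    for iteration in range(6):
        print(iteration, num_sim_traj)  # prints 100, 80, 64, 51, 40, 32
        num_sim_traj = max(elites_size * 2,
                           int(num_sim_traj / factor_decrease_num))
    ```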

  • keep previous elites, line143:
    We store the elite-set generated at each inner CEM-iteration and add a small fraction of it (fraction_elites_reused) to the pool of the next iteration, instead of discarding the elite-set after each CEM-iteration.
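
    A minimal sketch of this reuse (function and variable names here are illustrative, not the repository's):

    ```python
    import numpy as np

    def build_pool(fresh_samples, previous_elites, fraction_elites_reused=0.3):
        # Carry a small fraction of the previous iteration's elite-set
        # over into the next candidate pool instead of discarding it.
        n_reused = int(fraction_elites_reused * len(previous_elites))
        return np.concatenate(
            [fresh_samples, previous_elites[:n_reused]], axis=0)
    ```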

  • shift previous elites, line131:
    We store a small fraction of the elite-set of the last CEM-iteration and append a random action to each stored sequence so it can be used in the next environment step.
    This is done with the function elites_2_action_sequences.
    The reason for not shifting the entire elite-set, in both cases, is that doing so would drastically shrink the variance of CEM in the first CEM-iteration: the last elites would quite likely dominate the new samples and have small variance. We use fraction_elites_reused = 0.3 in all experiments.
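
    A sketch of this shifting, roughly what elites_2_action_sequences does (the helper names and shapes here are assumptions):

    ```python
    import numpy as np

    def shift_elites(elites, sample_random_action, fraction_elites_reused=0.3):
        # elites: array of shape (n_elites, h, action_dim).
        kept = elites[: int(fraction_elites_reused * len(elites))]
        # Drop the action that was just executed and append a fresh random
        # action so each sequence again covers the full horizon h.
        new_last = np.stack([sample_random_action() for _ in kept])[:, None, :]
        return np.concatenate([kept[:, 1:], new_last], axis=1)

    rng = np.random.default_rng(0)
    shifted = shift_elites(rng.standard_normal((10, 30, 6)),
                           lambda: rng.uniform(-1.0, 1.0, size=6))
    ```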

  • execute best action, line163:
    The purpose of the original CEM algorithm is to estimate an unknown probability distribution. Using CEM as a trajectory optimizer detaches it from this original purpose: in the MPC context, we are interested in the best possible action to execute.
    For this reason, we choose the first action of the best seen action sequence, rather than executing the first mean action, which was never actually evaluated.
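
    A sketch of this selection (illustrative names, not the repository code):

    ```python
    import numpy as np

    def first_action_of_best(action_sequences, returns):
        # action_sequences: (n, h, action_dim); returns: (n,) evaluated returns.
        # Execute the first action of the best *evaluated* sequence instead
        # of the first action of the mean, which was never rolled out.
        return action_sequences[np.argmax(returns), 0]
    ```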

  • add mean to samples (at last iCEM-iteration), line87:
    We decided to add the mean of the iCEM distribution as a sample for two reasons:

    • because as the dimensionality of the action space increases, it becomes more and more difficult to sample an action sequence close to the mean of the distribution.
    • because executing the mean might be beneficial for many tasks that require "clean" action sequences, such as manipulation, object-reaching, or any linear trajectory in state-space.

    In practice, we add the mean only at the last iteration, for reasons explained in section E.2 of the paper, and we simply substitute it for one of the samples: sampled_from_distribution[0] = self.mean

Importance of the improvements

In the figure below we present ablations and additions of the improvements mentioned above, for all environments and a selection of budgets. Since we use the same hyperparameters for all experiments, a few of the ablated versions perform slightly better in some environments, but overall our final version has the best performance.
As the results show, not all components are equally helpful across environments, since each environment poses different challenges. For instance, in HUMANOID STANDUP the optimizer can easily get stuck in a local optimum corresponding to a sitting posture, and keeping balance in a standing position is not trivial either, since small errors can lead to unrecoverable states. In the FETCH PICK&PLACE environment, on the other hand, the initial exploration is critical, since the agent receives a meaningful reward only once it is moving the box; there, colored noise, keeping elites, and shifting elites matter most.

[Figure: ablation_results]

