
vwxyzjn / cleanrl

4.5K stars · 34 watchers · 525 forks · 136.21 MB

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)

Home Page: http://docs.cleanrl.dev

License: Other

Python 93.31% Shell 5.53% Dockerfile 0.14% HCL 1.02%
wandb reinforcement-learning pytorch python gym machine-learning deep-reinforcement-learning deep-learning atari ale

cleanrl's People

Contributors

51616, adamcakg, alph2h, bentrevett, bragajj, chutaklee, cool-rr, cosmo3769, dosssman, felipemartins96, helges, helpingstar, jkterry1, joaogui1, jseppanen, kinalmehta, looseterrifyingspacemonkey, masud99r, melanol, pseudo-rnd-thoughts, qgallouedec, sdpkjc, sobhanmp, sudo-michael, timoklein, vcharraut, vkurenkov, vwxyzjn, willdudley, yooceii


cleanrl's Issues

GAE bug with PPO2

The following results using GAE are clearly incorrect: the last value of the advantages array is very off (see the screenshot in the original issue).

The bug is probably related to episode_lengths = [-1], which makes the GAE-based calculation set the last value of the advantages array incorrectly. This needs a better implementation and a fix.
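For reference, a minimal GAE sketch (an illustration, not the repo's code). It assumes rewards, values, and dones each have length T, dones[t] is the done flag returned by env.step at step t (i.e., whether the episode ended after step t), and next_value is the bootstrapped value of the state after the last step:

import numpy as np

def compute_gae(rewards, values, dones, next_value, gamma=0.99, gae_lambda=0.95):
    T = len(rewards)
    advantages = np.zeros(T)
    lastgaelam = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # the last step bootstraps from next_value instead of reading past the arrays
        next_v = next_value if t == T - 1 else values[t + 1]
        delta = rewards[t] + gamma * next_v * nonterminal - values[t]
        advantages[t] = lastgaelam = delta + gamma * gae_lambda * nonterminal * lastgaelam
    return advantages

The key point is that the final advantage must bootstrap from next_value (which is zero if the episode terminated) rather than from a stale array entry.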

Broken links on README

Problem Description

Broken links in the Algorithms Implemented section of the README.

Current Behavior

Links are pointing to unavailable pages

Expected Behavior

Links should point to the corresponding code.

Possible Solution

Remove links to removed code, or update them to point to the new locations.

Steps to Reproduce

The links to the following files in the README's implemented-algorithms section point to old locations, resulting in 404 pages:

experiments/ppo_self_play.py
experiments/ppo_microrts.py
experiments/ppo_simple.py
experiments/ppo_simple_continuous_action.py

I am not sure if there are more broken links; I found these because they are in the experiments folder, which does not exist anymore.

Work with AWS Preemptible Instance

Problem Description

For the AWS integrations, we usually run experiments using AWS spot instances to save cost. However, sometimes we need to run experiments for a long time. Real use cases include running Montezuma's Revenge by @yooceii and certain microrts tasks by myself, so we should look more into this issue.

After consulting this resource, I am considering storing the models periodically on the wandb run associated with a given run_id; should the AWS instance terminate, we would pull the saved models from that run and continue training.
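A hedged sketch of the idea (variable names like run_id and agent come from the training script, and the checkpoint filename is a placeholder): periodically save the model into the wandb run, and on restart resume the same run_id and pull the latest checkpoint before continuing training.

import os
import torch
import wandb

run = wandb.init(project="cleanrl", id=run_id, resume="allow")

# if this is a resumed run (e.g., after a spot-instance termination), restore the checkpoint
if run.resumed:
    checkpoint = wandb.restore("agent.pt")
    agent.load_state_dict(torch.load(checkpoint.name))

# inside the training loop, every N updates:
torch.save(agent.state_dict(), os.path.join(wandb.run.dir, "agent.pt"))
wandb.save(os.path.join(wandb.run.dir, "agent.pt"), policy="now")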

HuggingFace's model hub integration

Problem Description

HuggingFace's model hub has become a standard place to host trained models. It is now expanding its coverage to the RL space (see huggingface.co/sb3 as an example and DLR-RM/rl-baselines3-zoo#198), and it would be nice for us to integrate, too. I spoke with @ThomasSimonini today, and he expressed interest in working on this.

10/6/22 update: I would like to rethink how we can support saved models in CleanRL, as they have become increasingly relevant (e.g., recent research on using models to bootstrap RL; see reincarnating RL).

Challenges

Hugging Face and SB3 are a great fit because SB3 already provides a uniform API for training and evaluation. With CleanRL, this is tricky: CleanRL is more of a repository for educational and prototyping purposes, so we don't have uniform APIs the way SB3 does.

Desired Features:

  • save model
  • evaluate model
  • upload model to HF
  • load model from HF

Possible Solution

I think we can start the integration in a few selected files:

  • ppo.py
  • ppo_atari.py
  • ppo_atari_envpool.py
  • dqn_atari.py
  • c51_atari.py

We can add an optional flag like the following

    parser.add_argument("--save-model-hf", type=lambda x: bool(strtobool(x)), default=False, nargs="?", const=True,
        help="whether to save model to hugging face")

To add utilities like https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/rl_zoo3/push_to_hub.py, we could add an upload_to_hf function in the cleanrl_utils folder and import it. Importantly, we should only import it when the flag is turned on, so the single-file implementation does not depend on cleanrl_utils.upload_to_hf.
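For illustration, a hedged sketch of what such a helper could look like, built on huggingface_hub (the function name, arguments, and folder layout are placeholders, not an existing CleanRL API):

from huggingface_hub import HfApi

def upload_to_hf(folder_path: str, repo_id: str, commit_message: str = "upload model"):
    api = HfApi()
    api.create_repo(repo_id=repo_id, exist_ok=True)  # no-op if the repo already exists
    api.upload_folder(
        folder_path=folder_path,  # local folder containing e.g. model weights, args, videos
        repo_id=repo_id,
        commit_message=commit_message,
    )

As noted above, the import would live behind the flag (i.e., only import cleanrl_utils when --save-model-hf is toggled on).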

@ThomasSimonini has a demo here https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit8/unit8.ipynb

Numerical instability in C51

C51 computes a cross-entropy loss, which can be numerically unstable depending on the implementation; see the link for an overview. Calculating the cross-entropy loss directly from the logits is usually more numerically stable, but I am not sure how to do it exactly here.

DeepMind's dqn_zoo has an implementation that seems to use the logits directly:

https://github.com/deepmind/dqn_zoo/blob/f011d683529d8d23b017a95194ebbb41a4962fe8/dqn_zoo/c51/agent.py#L35
https://github.com/deepmind/rlax/blob/42bbcf97a69ef9b21cb88322b83169ade7930363/rlax/_src/value_learning.py#L703
https://github.com/deepmind/rlax/blob/42bbcf97a69ef9b21cb88322b83169ade7930363/rlax/_src/value_learning.py#L543
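A hedged sketch of a logits-based loss (not the current implementation): assuming the network exposes the pre-softmax outputs q_logits for the chosen actions (shape: batch x n_atoms), log_softmax keeps the log-probabilities from underflowing to -inf.

import torch.nn.functional as F

log_pmfs = F.log_softmax(q_logits, dim=-1)
loss = (-(target_pmfs * log_pmfs).sum(-1)).mean()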

I am recording this issue for completeness; in practice it's often enough to do

loss = (-(target_pmfs * old_pmfs.clamp(min=1e-5, max=1-1e-5).log()).sum(-1)).mean()
# instead of 
# unstable_loss = (-(target_pmfs * old_pmfs.log()).sum(-1)).mean()

The stable loss and unstable_loss produce noticeably different learning curves (comparison screenshots omitted).

See more at #102

If anyone is interested in digging into this, that would be fantastic.

LOMPO and COMBO implementation for visual offline RL

Hi,

I really like this repo and use it in my research. I found the visual CQL implementation very interesting and wonder if there is a plan to add COMBO and LOMPO algorithms for visual and offline RL?

Thank you

Cloud Integration Support

It is desirable to be able to run experiments at scale by leveraging cloud providers such as AWS, GCP, and Azure, or even on-premise servers. This issue will track all of the commits related to cloud integrations.

Problems with PPO value loss

v_loss = 0.5 *((new_values - b_returns[minibatch_ind]) ** 2)

I believe this line is missing a .mean()
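If so, the corrected line would presumably be:

v_loss = 0.5 * ((new_values - b_returns[minibatch_ind]) ** 2).mean()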

Also, are you meant to be multiplying the value loss by 0.5 in lines 377 and 379? Isn't that the purpose of args.vf_coef?

I notice there are a number of PPO implementations, and it looks like many of them have the same issue.

GitPod link not working

Problem Description

The link to the GitPod dev environment on the instructions page is not working: the dev environment does not open in GitPod, and an error message says that there is no GitPod file in the repository.

Current Behavior

See above.

Expected Behavior

The dev environment opens in GitPod.

Possible Solution

Fix GitPod link

Steps to Reproduce

Click the GitPod banner link on the instructions page.

Print out episode reward for debugging without tensorboard

The current implementation doesn't print anything once the scripts start running. It would be more beginner-friendly if we printed something like the following, just to let the user know that the script is actually running:

global_step=3442, episode_reward=15.527101137763129
global_step=3456, episode_reward=23.907788285943155
global_step=3472, episode_reward=19.012161566288178
global_step=3483, episode_reward=15.243719686337442
global_step=3497, episode_reward=16.92203202540712
global_step=3529, episode_reward=30.636879754445644
global_step=3553, episode_reward=28.04640999748334

Episode length would be another useful metric to print.
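A sketch of what the logging inside the training loop could look like, assuming the env is wrapped with gym.wrappers.RecordEpisodeStatistics so that info["episode"] is populated at episode ends:

for item in info:
    if "episode" in item:
        print(f"global_step={global_step}, episode_reward={item['episode']['r']}, episode_length={item['episode']['l']}")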

Add `rnd_ppo.py` documentation and refactor

rnd_ppo.py is a bit dated, and I recommend refactoring it to match the style of the other PPO scripts, which would include:

  • change the name from rnd_ppo.py to ppo_rnd.py
  • use from gym.wrappers.normalize import RunningMeanStd instead of implementing it ourselves (note the implementations might differ slightly).
  • create a make_env function like
    def make_env(env_id, seed, idx, capture_video, run_name):
        def thunk():
            env = gym.make(env_id)
            env = gym.wrappers.RecordEpisodeStatistics(env)
            if capture_video:
                if idx == 0:
                    env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")
            env = NoopResetEnv(env, noop_max=30)
            env = MaxAndSkipEnv(env, skip=4)
            env = EpisodicLifeEnv(env)
            if "FIRE" in env.unwrapped.get_action_meanings():
                env = FireResetEnv(env)
            env = ClipRewardEnv(env)
            env = gym.wrappers.ResizeObservation(env, (84, 84))
            env = gym.wrappers.GrayScaleObservation(env)
            env = gym.wrappers.FrameStack(env, 4)
            return env
        return thunk
  • remove the visualization (i.e., ProbsVisualizationWrapper)
  • use def get_value and def get_action_and_value for the Agent class
  • remove the Flatten class (cleanrl/rnd_ppo.py, lines 706 to 708 in 0b3f8ea):
    class Flatten(nn.Module):
        def forward(self, input):
            return input.view(input.size(0), -1)
  • maybe log the average curiosity_reward instead?
    f"global_step={global_step}, episodic_return={info['episode']['r']}, curiosity_reward={curiosity_rewards[step][idx]}"
  • rename total_reward_per_env to curiosity_return
    total_reward_per_env = np.array(
  • Add SPS (steps per second) metric.

Overall, I suggest selecting ppo_atari.py and rnd_ppo.py and using Compare Selected in VS Code to see the differences between the files and minimize them (screenshot omitted).

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the documentation accordingly.
  • I have updated the tests accordingly (if applicable).

If you are adding new algorithms or your change could result in a performance difference, you may need to (re-)run tracked experiments.

  • I have contacted @vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
  • I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
  • I have updated the documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers (if applicable).
    • I have added links to the PR related to the algorithm.
    • I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
    • I have added the learning curves (in PNG format with width=500 and height=300).
    • I have added links to the tracked experiments.
  • I have updated the tests accordingly (if applicable).

Friendlier `CONTRIBUTING.md`

We should make the contribution guidelines clearer. We received the feedback that "Although the current project seems welcoming to new developers, there is little concrete information on how to get involved. For example, what code style does the project follow? What are the criteria for including a new algorithm: what documentation does it need to have, test cases, etc? There is little in the way of API docs either, this admittedly is not critical as the files are themselves quite readable, but I'd suggest adding at least one-line docstrings to classes, methods, etc (and consider splitting up the scripts into more separate methods as well)"

This is great feedback. To address it, let's re-think the checklist for including a new algorithm. Such a checklist should include:

  1. pre-commit utilities: sort dependencies, remove unused variables and imports, format code using black, and check word spelling #107.
  2. Empirical analysis and benchmark: we should adopt a similar guide from sb3-contrib with a bit of our spin. The implemented algorithm should come with tracked experiments that
    • match the reported performance in the paper (if applicable)
    • match the reported performance in a high-quality reference implementation (SB3, Tianshou, and others) (if applicable).
    • We should also add documentation on how exactly we want the tracked experiments to be done (i.e., what W&B project? should they capture video recording?)
  3. Documentation: the proposed algorithm should also come with documentation at https://docs.cleanrl.dev/rl-algorithms/ to
    • explain crucial implementation details
    • add links to the original paper and related papers (if applicable)
    • add links to the PR related to the algorithm
    • add links to the tracked experiments and benchmark results.
  4. Tests: the proposed algorithm should come with an end-to-end test that makes sure the algorithm does not crash (a minimal sketch follows this list).
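A minimal "does not crash" e2e test sketch; the script path and flag values are hypothetical placeholders for a newly proposed algorithm:

import subprocess

def test_new_algo():
    subprocess.run(
        "python cleanrl/new_algo.py --total-timesteps 256",
        shell=True,
        check=True,  # raise if the script exits with a non-zero status
    )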

I will try to make some examples next week.

Support Procgen Environments

Problem Description

Procgen environments (https://github.com/openai/procgen) are new environments for testing the generalization ability of agents. It would be nice to include some of the games in the Open RL Benchmark (http://benchmark.cleanrl.dev/).

This is a good first issue for contributors. I think contributors can simply modify the network model slightly (the self.network = nn.Sequential(...) block) to handle the Procgen environments.
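For illustration, a hedged sketch of the kind of change meant here; the layer sizes are assumptions based on Procgen's 64x64x3 RGB observations, following the Nature-CNN layout used elsewhere in the repo:

import torch.nn as nn

network = nn.Sequential(
    nn.Conv2d(3, 32, 8, stride=4),   # 64x64 -> 15x15
    nn.ReLU(),
    nn.Conv2d(32, 64, 4, stride=2),  # 15x15 -> 6x6
    nn.ReLU(),
    nn.Conv2d(64, 64, 3, stride=1),  # 6x6 -> 4x4
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 512),
    nn.ReLU(),
)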

0.3 Release

Here is a list of TODOs for the 0.3 release:

[ ] Include a benchmark PNG file of all the algorithms using seaborn
[ ] Consider using a wrapper to replace the functions in common.py
[ ] Consider wrapping the env and only use torch arrays
[ ] Benchmark Atari games
[ ] Better documentation for cloud support
[ ] Optimize the DQN memory usage

Some items are potentially not for this release but more likely for future releases:
[ ] Evaluate using the VecEnv

`dqn.py` does not respect seed

Problem Description

python dqn.py --seed 1 can yield different results across runs. In the asciicast demo attached to the original issue, the first run yields global_step=24830, episodic_return=245.0 and the second run yields global_step=24975, episodic_return=178.0.
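For reference, the kind of seeding a script like dqn.py is expected to perform looks roughly like the following sketch (flag names follow the script's usual conventions; residual non-determinism can also come from CUDA or from the env if any of these calls is missing):

import random
import numpy as np
import torch

random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
torch.backends.cudnn.deterministic = args.torch_deterministic

env.seed(args.seed)
env.action_space.seed(args.seed)
env.observation_space.seed(args.seed)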

Various minor PPO refactors

Problem Description

Many of the formatting changes below were suggested by @Howuhh.

1. Refactor on next_done

The current code to handle done looks like this

            next_obs, reward, done, info = envs.step(action.cpu().numpy())
            rewards[step] = torch.tensor(reward).to(device).view(-1)
            next_obs, next_done = torch.Tensor(next_obs).to(device), torch.Tensor(done).to(device)

which is fine, but when I tried to adapt it to isaacgym it became an issue. Specifically, I thought the to(device) code was no longer needed, so I just did

            next_obs, reward, done, info = envs.step(action)

but this is wrong because I should have done next_done = done. The current next_done = torch.Tensor(done).to(device) just does not make a lot of sense.

We should refactor it to

            next_obs, reward, next_done, info = envs.step(action.cpu().numpy())
            rewards[step] = torch.tensor(reward).to(device).view(-1)
            next_obs, next_done = torch.Tensor(next_obs).to(device), torch.Tensor(next_done).to(device)

2. make_env refactor

if capture_video:
    if idx == 0:
        env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")

to

if capture_video and idx == 0:
    env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")

3. flatten batch

        b_obs = obs.reshape((-1,) + envs.single_observation_space.shape)
        b_logprobs = logprobs.reshape(-1)
        b_actions = actions.reshape((-1,) + envs.single_action_space.shape)
        b_advantages = advantages.reshape(-1)
        b_returns = returns.reshape(-1)
        b_values = values.reshape(-1)

to

        b_obs = obs.flatten(0, 1)
        b_actions = actions.flatten(0, 1)
        b_logprobs = logprobs.reshape(-1)
        b_returns = returns.reshape(-1)
        b_advantages = advantages.reshape(-1)
        b_values = values.reshape(-1)

4.


            if args.target_kl is not None:
                if approx_kl > args.target_kl:
                    break

to

            if args.target_kl is not None and approx_kl > args.target_kl:
                break

5.

global_step += 1 * args.num_envs

to

global_step += args.num_envs

6.

move

num_updates = args.total_timesteps // args.batch_size

to the argparse.

Refactoring on Class Arguments

Problem Description

Since CleanRL uses single-file implementations with no main() function, a lot of global variables are created, and sometimes code inside a class accesses global variables that should instead be passed to the class.

Current Behavior

As an example, in https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari_visual.py,

class QNetwork(nn.Module):
    def __init__(self, frames=4):
        super(QNetwork, self).__init__()
        self.network = nn.Sequential(
            Scale(1/255),
            nn.Conv2d(frames, 32, 8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(3136, 512),
            nn.ReLU(),
            Linear0(512, env.action_space.n)
        )

    def forward(self, x):
        x = torch.Tensor(x).to(device)
        return self.network(x)

The forward function uses the global variable device. This is slightly undesirable.

Expected Behavior

The forward function should be

    def forward(self, x, device):
        x = torch.Tensor(x).to(device)
        return self.network(x)

It would be great if someone were willing to take some time and refactor the files. This problem is present in almost all of them.
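One possible pattern, as a sketch only (Scale and Linear0 are replaced with plain ops for brevity, and this is not the repo's current code): pass the env into the constructor and let the caller handle device placement, so neither env nor device is read from globals.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, env, frames=4):
        super().__init__()
        self.network = nn.Sequential(
            nn.Conv2d(frames, 32, 8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(3136, 512),
            nn.ReLU(),
            nn.Linear(512, env.action_space.n),
        )

    def forward(self, x):
        return self.network(x / 255.0)  # scaling folded into forward

# usage: the caller moves both the network and the inputs to the device
q_network = QNetwork(env).to(device)
q_values = q_network(torch.Tensor(obs).to(device))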

Refactor PPO Buffer

The idea is to create a buffer that you can add an entire episode to, which then calculates the corresponding advantages. Up to a certain limit, keep adding episodes to the buffer, reusing the same returns, obs, and actions arrays inside the buffer.
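A rough sketch of what such a buffer could look like (an assumption about the refactor, not an existing CleanRL class): episodes are appended one at a time, GAE is computed per episode, and everything lands in flat arrays ready for minibatching.

import numpy as np

class EpisodeBuffer:
    def __init__(self, limit, obs_dim, gamma=0.99, gae_lambda=0.95):
        self.obs = np.zeros((limit, obs_dim), dtype=np.float32)
        self.actions = np.zeros(limit, dtype=np.int64)
        self.advantages = np.zeros(limit, dtype=np.float32)
        self.returns = np.zeros(limit, dtype=np.float32)
        self.ptr, self.limit = 0, limit
        self.gamma, self.gae_lambda = gamma, gae_lambda

    def add_episode(self, obs, actions, rewards, values):
        # for simplicity this sketch assumes the episode fits in the remaining capacity
        T = len(rewards)
        adv = np.zeros(T, dtype=np.float32)
        lastgaelam = 0.0
        for t in reversed(range(T)):
            next_value = values[t + 1] if t < T - 1 else 0.0  # episode is complete
            delta = rewards[t] + self.gamma * next_value - values[t]
            adv[t] = lastgaelam = delta + self.gamma * self.gae_lambda * lastgaelam
        sl = slice(self.ptr, self.ptr + T)
        self.obs[sl], self.actions[sl] = obs, actions
        self.advantages[sl], self.returns[sl] = adv, adv + values[:T]
        self.ptr += T

    def full(self):
        return self.ptr >= self.limit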

AWS example?

Do you guys have an example of how to run this on AWS? I've never done it but it sounds intriguing.
Also, what kind of costs do you pay for training a single game, for example?

Support StarCraft II Mini-game Environments (pysc2)

Problem Description

StarCraft II environments (https://github.com/deepmind/pysc2) include some challenging mini-games. It would be nice to add some of them to the Open RL Benchmark (http://benchmark.cleanrl.dev/).

For an experienced researcher, this might be a good issue. I have already created a gym wrapper for SC2 (https://github.com/vwxyzjn/gym-pysc2), and here is an example run (https://wandb.ai/cleanrl/cleanrl.benchmark/runs/2qy45w8y?workspace=).

SAC Consistency

Problem Description

Currently, the SAC script has debugging output that is inconsistent with the other scripts in the repository: it outputs
print(f"Episode: {global_episode} Step: {global_step}, Ep. Reward: {episode_reward}")
instead of
print(f"global_step={global_step}, episode_reward={episode_reward}")

Additionally, the parameter args.learning_start is also used differently. In sac_continuous_action.py, it has

if len(rb.buffer) > args.batch_size: # starts update as soon as there is enough data.

whereas in other scripts we had

if global_step > args.learning_start:

@dosssman would you mind looking into this? Thanks!

Both dqn_atari and dqn_atari_visual use different ReplayBuffers compared to other implementations

Thanks for creating this repo - I always have difficulty understanding RL algorithms when they're spread across huge libraries.

One question: both dqn_atari.py and dqn_atari_visual.py use the following ReplayBuffer class:

# modified from https://github.com/openai/baselines/blob/master/baselines/deepq/replay_buffer.py
class ReplayBuffer(object):
    def __init__(self, size):
        self._storage = []
        self._maxsize = size
        self._next_idx = 0

    def put(self, data):
        if self._next_idx >= len(self._storage):
            self._storage.append(data)
        else:
            self._storage[self._next_idx] = data
        self._next_idx = (self._next_idx + 1) % self._maxsize

    def sample(self, batch_size):
        idxes = np.random.choice(len(self._storage), batch_size, replace=True)
        obses_t, actions, rewards, obses_tp1, dones = [], [], [], [], []
        for i in idxes:
            data = self._storage[i]
            obs_t, action, reward, obs_tp1, done = data
            obses_t.append(np.array(obs_t, copy=False))
            actions.append(np.array(action, copy=False))
            rewards.append(reward)
            obses_tp1.append(np.array(obs_tp1, copy=False))
            dones.append(done)
        return np.array(obses_t), np.array(actions), np.array(rewards), np.array(obses_tp1), np.array(dones)

However, dqn.py, c51.py, c51_atari.py, c51_atari_visual.py, sac_continuous_action.py, and td3_continuous_action.py all use the following ReplayBuffer:

# modified from https://github.com/seungeunrho/minimalRL/blob/master/dqn.py#
class ReplayBuffer():
    def __init__(self, buffer_limit):
        self.buffer = collections.deque(maxlen=buffer_limit)
    
    def put(self, transition):
        self.buffer.append(transition)
    
    def sample(self, n):
        mini_batch = random.sample(self.buffer, n)
        s_lst, a_lst, r_lst, s_prime_lst, done_mask_lst = [], [], [], [], []
        
        for transition in mini_batch:
            s, a, r, s_prime, done_mask = transition
            s_lst.append(s)
            a_lst.append(a)
            r_lst.append(r)
            s_prime_lst.append(s_prime)
            done_mask_lst.append(done_mask)

        return np.array(s_lst), np.array(a_lst), \
               np.array(r_lst), np.array(s_prime_lst), \
               np.array(done_mask_lst)

As far as I am aware, they look identical - but I wanted to know if there was some implementation reason why they are different?

DDPG Actor missing 1 argument: 'env'

Identified while testing changes for PR #67

actor = Actor().to(device)

When running:

python ddpg_continuous_action.py

returns the error:

$ python ddpg_continuous_action.py 
/home/d055/anaconda3/envs/cleanrl-py3.7.1/lib/python3.7/site-packages/ale_py/roms/utils.py:90: DeprecationWarning: SelectableGroups dict interface is deprecated. Use select.
  for external in metadata.entry_points().get(self.group, []):
pybullet build time: Oct 11 2021 20:59:00
/home/d055/anaconda3/envs/cleanrl-py3.7.1/lib/python3.7/site-packages/gym/spaces/box.py:74: UserWarning: WARN: Box bound precision lowered by casting to float32
  "Box bound precision lowered by casting to {}".format(self.dtype)
Traceback (most recent call last):
  File "ddpg_continuous_action.py", line 167, in <module>
    actor = Actor().to(device)
TypeError: __init__() missing 1 required positional argument: 'env'
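Based on the error message, the likely fix is to pass the environment into the constructor, e.g. (a sketch; the exact variable name depends on the script):

# the Actor presumably needs the env to size its layers
actor = Actor(env).to(device)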

[WinError 193] %1 is not a valid Win32 application

I am trying to install this package on Windows. After installing Poetry, when I ran poetry run ppo.py, I got the following error:

[WinError 193] %1 is not a valid Win32 application

at c:\users\username\onedrive\anaconda3\lib\subprocess.py:1311 in _execute_child
1307│ sys.audit("subprocess.Popen", executable, args, cwd, env)
1308│
1309│ # Start the process
1310│ try:
→ 1311│ hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
1312│ # no special security
1313│ None, None,
1314│ int(not close_fds),
1315│ creationflags,

Normalized Env Bug

There is an issue with the NormalizedEnv in ppo2_continuous_action.py. The underlying RunningMeanStd it relies on is incorrectly implemented for this use case, as illustrated below:

import numpy as np
# taken from https://github.com/openai/baselines/blob/master/baselines/common/vec_env/vec_normalize.py
class RunningMeanStd(object):
    def __init__(self, epsilon=1e-4, shape=()):
        self.mean = np.zeros(shape, 'float64')
        self.var = np.ones(shape, 'float64')
        self.count = epsilon

    def update(self, x):
        batch_mean = np.mean(x, axis=0)
        batch_var = np.var(x, axis=0)
        batch_count = 1
        self.update_from_moments(batch_mean, batch_var, batch_count)

    def update_from_moments(self, batch_mean, batch_var, batch_count):
        self.mean, self.var, self.count = update_mean_var_count_from_moments(
            self.mean, self.var, self.count, batch_mean, batch_var, batch_count)

def update_mean_var_count_from_moments(mean, var, count, batch_mean, batch_var, batch_count):
    delta = batch_mean - mean
    tot_count = count + batch_count

    new_mean = mean + delta * batch_count / tot_count
    m_a = var * count
    m_b = batch_var * batch_count
    M2 = m_a + m_b + np.square(delta) * count * batch_count / tot_count
    new_var = M2 / tot_count
    new_count = tot_count

    return new_mean, new_var, new_count

print("incorrect RunningMeanStd uses the same mean across data array")
ob_rms = RunningMeanStd(shape=(2,))
state = np.array([0.52191359, 0.24749929])
print(ob_rms.mean)
print(ob_rms.var)
ob_rms.update(state)
print(ob_rms.mean)
print(ob_rms.var)

class RunningMeanStd(object):
    def __init__(self, epsilon=1e-4, shape=()):
        self.mean = np.zeros(shape, 'float64')
        self.var = np.ones(shape, 'float64')
        self.count = epsilon

    def update(self, x):
        batch_mean = np.mean([x], axis=0)
        batch_var = np.var([x], axis=0)
        batch_count = 1
        self.update_from_moments(batch_mean, batch_var, batch_count)

    def update_from_moments(self, batch_mean, batch_var, batch_count):
        self.mean, self.var, self.count = update_mean_var_count_from_moments(
            self.mean, self.var, self.count, batch_mean, batch_var, batch_count)

print("correct RunningMeanStd uses different means across dimensions")
ob_rms = RunningMeanStd(shape=(2,))
state = np.array([0.52191359, 0.24749929])
print(ob_rms.mean)
print(ob_rms.var)
ob_rms.update(state)
print(ob_rms.mean)
print(ob_rms.var)

And the output is

incorrect RunningMeanStd uses the same mean across data array
[0. 0.]
[1. 1.]
[0.38466797 0.38466797]
[0.01893871 0.01893871]
correct RunningMeanStd uses different means across dimensions
[0. 0.]
[1. 1.]
[0.5218614  0.24747454]
[0.00012722 0.00010611]

Documentation Site

Problem Description

Although CleanRL's implementations are generally simple, a documentation site would be desirable in some situations. For example, I'm not sure where to put instructions on how to start and resume with CleanRL's scripts. See #33, #14.

Deprecating `apex_dqn_atari.py`

Problem Description

The current apex_dqn_atari.py cannot meet the level of performance of published results. Given about 4 hours of training time, the published ApeX-DQN results significantly outperform ours.

Environment                    apex_dqn_atari.py result    ApeX-DQN published result
BreakoutNoFrameskip-v4         356.95 ± 46.40              ~450
PongNoFrameskip-v4             19.61 ± 0.54                ~20
BeamRiderNoFrameskip-v4        2852.69 ± 706.75            ~35000
SpaceInvadersNoFrameskip-v4    927.29 ± 146.49             ~12000
QbertNoFrameskip-v4            2613.47 ± 796.08            ~16000

(ApeX-DQN published results figure omitted.)

Cause

Admittedly, the hardware used is drastically different: ApeX-DQN uses 360 actors while our apex_dqn_atari.py uses 4. There are many other implementation differences, too. apex_dqn_atari.py started out as my toy script, but since we hold a high bar for CleanRL's implementations, we may have to deprecate it, or at least remove it from the current repository and re-submit it later.

Refactor documentation

What is the problem

The current documentation requires more work. First, some of the implemented algorithms, such as Apex-DQN, TD3, and SAC, are not documented at https://docs.cleanrl.dev. Second, even documented algorithms such as PPO do not have complete documentation: for example, ppo_atari_envpool.py is not really documented. Third, there doesn't seem to be a single source of truth for documentation.

Going forward, I'd like to impose a specific documentation style and improve the overall workflow, which will also help #117.

Proposed solution

I was thinking maybe we can put a documentation link at the beginning of each file. For example, we could add these two lines to ppo.py:

https://github.com/vwxyzjn/cleanrl/blob/c8faef93fc8dbc9528183840ab75b8962df7b9c4/cleanrl/ppo.py#L1-L7
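For illustration, the two lines could look something like this (a sketch; the exact wording is up for discussion):

# docs and experiment results for this file can be found at
# https://cleanrl-553u0zazz-vwxyzjn.vercel.app/rl-algorithms/ppo/#ppopy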

And the link https://cleanrl-553u0zazz-vwxyzjn.vercel.app/rl-algorithms/ppo/#ppopy will point to the corresponding documentation, which has:

  • Brief overview of the algorithm
  • Original paper and relevant resources
  • Short description of what ppo.py specifically does
  • Explanation of important implementation details
  • Experiment results (and how they compare to the original paper or/and other reference implementations)
  • Learning curves
  • Tracked experiments

Roughly, it looks like the demo video attached to the original issue (tracked experiments not yet added).

List of files needed to add documentation

PPO: Shouldn't advantages be recomputed after every minibatch update?

I was trying to reproduce the PPO paper results myself, and I noticed that in the openai/baselines repo they compute GAE for each trajectory segment (https://github.com/openai/baselines/blob/master/baselines/ppo2/runner.py, lines 65-66), but they don't use these advantages directly. They use a trick based on the fact that Adv = Return - Value, and thus Return = Adv + Value.
Then, at https://github.com/openai/baselines/blob/master/baselines/ppo2/ppo2.py, lines 165-166, they only pass the returns to the train method.
Finally, they compute the advantages again from Adv = Return - Value at line 136 of https://github.com/openai/baselines/blob/master/baselines/ppo2/model.py.

In your implementations you compute the advantages only once, if I'm not mistaken. But to be honest, I'm not sure whether it is actually crucial.
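For reference, a paraphrased sketch of the baselines trick described above, written with CleanRL-style variable names (an illustration, not existing CleanRL code):

# rollout phase: only returns are stored, reconstructed from GAE advantages and old values
b_returns = b_advantages + b_values

# update phase, per minibatch: advantages are re-derived from the stored returns and the
# stored (old) values, then normalized per minibatch; recomputing against freshly
# predicted values after each minibatch update would be the variant the question asks about
mb_advantages = b_returns[mb_inds] - b_values[mb_inds]
mb_advantages = (mb_advantages - mb_advantages.mean()) / (mb_advantages.std() + 1e-8)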

Improving offline RL scripts

Problem Description // Current behavior

A reminder for the cleanrl/offline related issues mentioned in #130

  1. SPS logging is missing (attempted to match #126, #130)

  2. Monitor cannot be imported anymore due to the gym==0.23.0 update

  3. Tests for those two scripts are missing

  4. Unlike dqn_atari, the wrappers are not imported from SB3

  5. The offline-env-id that is required to load the dataset does not seem to work anymore. Is there a missing dependency, such as d4rl or d4rl_atari, for example?

(cleanrl) d055@kara:~/random/rl/cleanrl/cleanrl/offline$ python offline_dqn_atari_visual.py 

A.L.E: Arcade Learning Environment (version 0.7.4+069f8bd)
[Powered by Stella]
Traceback (most recent call last):
  File "offline_dqn_atari_visual.py", line 559, in <module>
    data_loader = iter(torch.utils.data.DataLoader(ExperienceReplayDataset(), batch_size=args.batch_size, num_workers=2))
  File "offline_dqn_atari_visual.py", line 544, in __init__
    self.dataset_env = gym.make(args.offline_env_id)
  File "/home/d055/anaconda3/envs/cleanrl/lib/python3.8/site-packages/gym/envs/registration.py", line 676, in make
    return registry.make(id, **kwargs)
  File "/home/d055/anaconda3/envs/cleanrl/lib/python3.8/site-packages/gym/envs/registration.py", line 490, in make
    versions = self.env_specs.versions(namespace, name)
  File "/home/d055/anaconda3/envs/cleanrl/lib/python3.8/site-packages/gym/envs/registration.py", line 220, in versions
    self._assert_name_exists(namespace, name)
  File "/home/d055/anaconda3/envs/cleanrl/lib/python3.8/site-packages/gym/envs/registration.py", line 297, in _assert_name_exists
    raise error.NameNotFound(message)
gym.error.NameNotFound: Environment `breakout-expert` doesn't exist. Did you mean: `Breakout-ram`?

Possible Solution

  1. Straightforward to add, but I did not want to overload #130.

  2. and 4.: Make do without Monitor; use SB3's wrappers instead.

  3. Straightforward to add.

  5. Investigate the missing dependencies, as well as the generation of args.offline_env_id.

Dict observation space

Hi, is the PPO implementation in this repo able to handle the Dict observation space? Many thanks!

Investigate DQN's regression in `MountainCar-v0`

Problem Description

In the previous version of Open RL Benchmark, we clearly observed that our dqn.py was able to solve MountainCar-v0 (see link). However, I could no longer reproduce this result with the latest dqn.py using the exact same hyperparameters. See here for the regression report.


Looking into the root cause

After looking into this further, it turns out the "culprit" is SB3's replay buffer. Our upstream, SB3's replay buffer, started to properly handle truncation vs. termination (see DLR-RM/stable-baselines3#243), and by disabling the proper handling of truncation via handle_timeout_termination=False I was able to reproduce the past performance... ironically (see https://wandb.ai/costa-huang/cleanRL/reports/MountainCar-v0-Regression-Investigation--VmlldzoxODEyMzgw).
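A sketch of how the old behavior can be reproduced, assuming SB3's ReplayBuffer as used in dqn.py (variable names follow the script's conventions):

from stable_baselines3.common.buffers import ReplayBuffer

rb = ReplayBuffer(
    args.buffer_size,
    envs.single_observation_space,
    envs.single_action_space,
    device,
    handle_timeout_termination=False,  # default is True; False reproduces the old behavior
)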


Where to go from here

I don't think finding proper hyperparameters for dqn.py should block #121, but this is something we can look into in the future.

Proper entropy regularized PPO

Problem Description

It seems like the current implementation of PPO uses only a one-step entropy bonus (it does not include the entropy bonus in the overall return). I see this as an ease-of-implementation choice passed along from other popular repos. Would you consider implementing proper entropy regularization in this repo? The performance gain might be crucial in some cases, as shown in this paper (section 6.1). The main difference is shown in eq. 79-80 of section 6.1.

Current Behavior

Using one-step entropy bonus

Expected Behavior

Using proper entropy bonus

Possible Solution

The entropy bonus should be added to the rewards before computing the advantage. This should be simple to implement, as it changes r to r + entropy; the rest of the process should be the same, if I am not mistaken.
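A hedged sketch of what that change could look like inside the rollout loop (variable names follow the PPO scripts; reusing args.ent_coef as the weight here is an assumption):

action, logprob, entropy, value = agent.get_action_and_value(next_obs)
next_obs, reward, done, info = envs.step(action.cpu().numpy())
# fold the per-step policy entropy into the stored reward before GAE is computed
rewards[step] = torch.tensor(reward).to(device).view(-1) + args.ent_coef * entropy.detach()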

Re-think Open RL Benchmark.

I am thinking of re-doing the Open RL Benchmark so that it also includes benchmarks from other popular RL libraries, with CleanRL being just one of them. So we would have wandb and GitHub projects like:

  • openrlbenchmark/cleanrl
  • openrlbenchmark/baselines
  • openrlbenchmark/sb3
  • openrlbenchmark/tianshou
  • openrlbenchmark/rllib

And the good thing is that anyone can use the recorded metrics in the Open RL Benchmark, as explained here: wandb/wandb#3231; the contribution is that no one has to re-run the baseline experiments if they just want to compare results.

Investigate `nn.utils.clip_grad_norm_` for DQN, DDPG, and TD3

Problem Description

Compared to the original implementations, our DQN, DDPG, and TD3 implementations additionally perform global gradient clipping, a code-level optimization used in PPO. It is unclear whether global gradient clipping offers real performance benefits, so we should look into it and remove it if necessary.

  • dqn_atari.py
  • ddpg_continuous_action.py
  • td3_continuous_action.py
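For reference, the line in question looks something like the following sketch (shown for dqn_atari.py; DDPG and TD3 clip their actor/critic parameters analogously). The investigation is essentially about whether removing it changes the benchmark results.

# applied between loss.backward() and optimizer.step()
nn.utils.clip_grad_norm_(list(q_network.parameters()), args.max_grad_norm)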
