chainerrl's Introduction

ChainerRL and PFRL

ChainerRL (this repository) is a deep reinforcement learning library that implements various state-of-the-art deep reinforcement learning algorithms in Python using Chainer, a flexible deep learning framework. PFRL is the PyTorch analog of ChainerRL.

[Demo animations: Breakout, Humanoid, Grasping, Atlas]

Installation

ChainerRL is tested with Python 3.6. For other requirements, see requirements.txt.

ChainerRL can be installed via PyPI:

pip install chainerrl

It can also be installed from the source code:

python setup.py install

Refer to Installation for more information.

Getting started

You can try the ChainerRL Quickstart Guide first, or check the examples prepared for Atari 2600 and OpenAI Gym.

For more information, you can refer to ChainerRL's documentation.
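
To give an idea of the shape of the API, here is a minimal sketch that trains a DQN agent on CartPole-v0 with Gym. It closely follows the Quickstart Guide, but exact argument names and module paths may differ slightly across ChainerRL versions, so treat it as illustrative rather than definitive.

import chainer
import chainerrl
import gym
import numpy as np

env = gym.make('CartPole-v0')
obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

# A small fully connected Q-function for discrete actions.
q_func = chainerrl.q_functions.FCStateQFunctionWithDiscreteAction(
    obs_size, n_actions, n_hidden_layers=2, n_hidden_channels=50)

optimizer = chainer.optimizers.Adam(eps=1e-2)
optimizer.setup(q_func)

# Epsilon-greedy exploration and a standard replay buffer.
explorer = chainerrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.3, random_action_func=env.action_space.sample)
replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=10 ** 6)

agent = chainerrl.agents.DQN(
    q_func, optimizer, replay_buffer, gamma=0.95, explorer=explorer,
    replay_start_size=500, update_interval=1, target_update_interval=100,
    phi=lambda x: x.astype(np.float32, copy=False))

# A bare-bones training loop; chainerrl.experiments offers richer ones.
for episode in range(200):
    obs = env.reset()
    reward = 0.0
    done = False
    while not done:
        action = agent.act_and_train(obs, reward)
        obs, reward, done, _ = env.step(action)
    agent.stop_episode_and_train(obs, reward, done)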

Algorithms

| Algorithm | Discrete Action | Continuous Action | Recurrent Model | Batch Training | CPU Async Training |
|:---|:---:|:---:|:---:|:---:|:---:|
| DQN (including DoubleDQN etc.) | ✓ | ✓ (NAF) | ✓ | ✓ | x |
| Categorical DQN | ✓ | x | ✓ | ✓ | x |
| Rainbow | ✓ | x | ✓ | ✓ | x |
| IQN | ✓ | x | ✓ | ✓ | x |
| DDPG | x | ✓ | ✓ | ✓ | x |
| A3C | ✓ | ✓ | ✓ | ✓ (A2C) | ✓ |
| ACER | ✓ | ✓ | ✓ | x | ✓ |
| NSQ (N-step Q-learning) | ✓ | ✓ (NAF) | ✓ | x | ✓ |
| PCL (Path Consistency Learning) | ✓ | ✓ | ✓ | x | ✓ |
| PPO | ✓ | ✓ | ✓ | ✓ | x |
| TRPO | ✓ | ✓ | ✓ | ✓ | x |
| TD3 | x | ✓ | x | ✓ | x |
| SAC | x | ✓ | x | ✓ | x |

The algorithms in the table above have been implemented in ChainerRL, along with a number of useful techniques for training them; see the documentation for the full lists.

Visualization

ChainerRL comes with a set of visualization tools to help developers understand and debug their RL agents. With these tools, the behavior of ChainerRL agents can be easily inspected from a browser UI.

Environments

Environments that support a subset of the OpenAI Gym interface (the reset and step methods) can be used.
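
As an illustration, here is a minimal sketch of that interface. The environment itself (CountingEnv, its horizon, and its reward rule) is entirely hypothetical; only the reset/step signatures matter.

import numpy as np

class CountingEnv(object):
    """A toy environment exposing only the subset of the Gym interface
    (reset and step) that ChainerRL relies on."""

    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        # Return the initial observation.
        self.t = 0
        return np.zeros(1, dtype=np.float32)

    def step(self, action):
        # Return (observation, reward, done, info), as in Gym.
        self.t += 1
        obs = np.array([self.t], dtype=np.float32)
        reward = 1.0 if action == self.t % 2 else 0.0
        done = self.t >= self.horizon
        return obs, reward, done, {}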

Contributing

Any kind of contribution to ChainerRL would be highly appreciated! If you are interested in contributing to ChainerRL, please read CONTRIBUTING.md.

License

MIT License.

Citations

To cite ChainerRL in publications, please cite our JMLR paper:

@article{JMLR:v22:20-376,
  author  = {Yasuhiro Fujita and Prabhat Nagarajan and Toshiki Kataoka and Takahiro Ishikawa},
  title   = {ChainerRL: A Deep Reinforcement Learning Library},
  journal = {Journal of Machine Learning Research},
  year    = {2021},
  volume  = {22},
  number  = {77},
  pages   = {1-14},
  url     = {http://jmlr.org/papers/v22/20-376.html}
}

chainerrl's Issues

env.monitor has been deprecated as of 12/23/2016

gym.error.Error: env.monitor has been deprecated as of 12/23/2016. Remove your call to env.monitor.start(directory) and instead wrap your env with env = gym.wrappers.Monitor(env, directory) to record data.
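
A minimal sketch of the change the error message asks for (the directory name here is arbitrary):

import gym

env = gym.make('CartPole-v0')
# Old, deprecated:
#   env.monitor.start('results')
# New, as suggested by the error message:
env = gym.wrappers.Monitor(env, 'results')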

average_loss always 0 when using episodic_replay=True (DQN)

I am trying these two different q_functions:

(non recurrent)

import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl
from chainerrl.links import MLP
from chainerrl.q_function import StateQFunction  # import paths assumed; they may differ across versions


class QFunction(chainer.Chain, StateQFunction):

    def __init__(self, n_input_channels=3, n_actions=4, bias=0.1):
        self.n_actions = n_actions
        self.n_input_channels = n_input_channels
        conv_layers = chainer.ChainList(
            L.Convolution2D(n_input_channels, 32, 8, stride=4, bias=bias),
            L.Convolution2D(32, 64, 4, stride=2, bias=bias),
            L.Convolution2D(64, 64, 3, stride=1, bias=bias),
            L.Convolution2D(64, 128, 7, stride=1, bias=bias),
        )
        lin_layer = L.Linear(128, 128)
        # Dueling-style advantage and value streams.
        a_stream = MLP(128, n_actions, [2])
        v_stream = MLP(128, 1, [2])
        super().__init__(conv_layers=conv_layers, lin_layer=lin_layer,
                         a_stream=a_stream, v_stream=v_stream)

    def __call__(self, x, test=False):
        """
        Args:
            x (ndarray or chainer.Variable): An observation
            test (bool): a flag indicating whether it is in test mode
        """
        h = x
        for l in self.conv_layers:
            h = F.relu(l(h))
        h = self.lin_layer(h)

        batch_size = x.shape[0]
        # Center the advantage stream so that its mean over actions is zero.
        ya = self.a_stream(h, test=test)
        mean = F.reshape(F.sum(ya, axis=1) / self.n_actions, (batch_size, 1))
        ya, mean = F.broadcast(ya, mean)
        ya -= mean

        ys = self.v_stream(h, test=test)
        ya, ys = F.broadcast(ya, ys)
        q = ya + ys
        return chainerrl.action_value.DiscreteActionValue(q)


(recurrent)

class QFunctionRecurrent(chainer.Chain, StateQFunction):
    # NB: as described in the recurrent-model documentation issue below,
    # a chain whose recurrence comes from L.LSTM must also inherit
    # chainerrl.recurrent.RecurrentChainMixin to be treated as a recurrent model.

    def __init__(self, n_input_channels=3, n_actions=4, bias=0.1):
        self.n_actions = n_actions
        self.n_input_channels = n_input_channels
        conv_layers = chainer.ChainList(
            L.Convolution2D(n_input_channels, 32, 8, stride=4, bias=bias),
            L.Convolution2D(32, 64, 4, stride=2, bias=bias),
            L.Convolution2D(64, 64, 3, stride=1, bias=bias),
            L.Convolution2D(64, 128, 7, stride=1, bias=bias),
        )
        lstm_layer = L.LSTM(128, 128)
        a_stream = MLP(128, n_actions, [2])
        v_stream = MLP(128, 1, [2])
        super().__init__(conv_layers=conv_layers, lstm_layer=lstm_layer,
                         a_stream=a_stream, v_stream=v_stream)

    def __call__(self, x, test=False):
        """
        Args:
            x (ndarray or chainer.Variable): An observation
            test (bool): a flag indicating whether it is in test mode
        """
        h = x
        for l in self.conv_layers:
            h = F.relu(l(h))
        h = self.lstm_layer(h)

        batch_size = x.shape[0]
        ya = self.a_stream(h, test=test)
        mean = F.reshape(F.sum(ya, axis=1) / self.n_actions, (batch_size, 1))
        ya, mean = F.broadcast(ya, mean)
        ya -= mean

        ys = self.v_stream(h, test=test)
        ya, ys = F.broadcast(ya, ys)
        q = ya + ys
        return chainerrl.action_value.DiscreteActionValue(q)

I found that for the non-recurrent version the loss is not zero and the agent will eventually master the gym environment provided.

However, after changing nothing other than adding an LSTM layer and setting episodic_replay to True, the average_loss becomes 0 all the time and the agent is not able to learn to interact better with its environment.

At first I thought this was due to some kind of rounding issue, so I set minibatch_size=1 and episodic_update_len=1 (assuming that one episodic replay would then contain only one time step), but still nothing changed.

I wonder if this is some kind of bug or (which I think is more likely) an error on my side.

Any help is very much appreciated!

Extend gym.Wrapper instead of env_modifiers

Since gym has introduced its own interface to modify envs using gym.Wrapper, I think it is better to use it in ChainerRL instead of directly modifying methods as in env_modifiers.
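
For reference, here is a minimal sketch of what a wrapper-based modification could look like. The reward scaling is just a hypothetical example of the kind of change env_modifiers applies today, and depending on the gym version the methods to override may be _step/_reset rather than step/reset.

import gym


class ScaleReward(gym.Wrapper):
    """Hypothetical example of expressing an env modification as a
    gym.Wrapper instead of patching the env's methods directly."""

    def __init__(self, env, scale=1e-2):
        super(ScaleReward, self).__init__(env)
        self.scale = scale

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return obs, reward * self.scale, done, info


env = ScaleReward(gym.make('Pendulum-v0'), scale=1e-2)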

PyTorch as an additional backend

I'm curious about whether ChainerRL can support PyTorch as an additional NN backend. Its interface is similar to Chainer's, but I'm not sure how easy it would be to support both. Any suggestions and opinions are welcome.

MuJoCo-ACER Examples

Are there any examples of ACER in continuous action spaces, using the MuJoCo environments?

Add suppression option for print messages during training loop?

In chainerrl.experiments.train_agent, statistical information is printed once per episode during the training loop. However, this is sometimes too verbose and I want to suppress these messages, but currently there is no good way to do so. Adding an option that enables/disables these prints would be beneficial.
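
Until such an option exists, one possible stop-gap, sketched here under the assumption that the messages go to standard output and that Python 3 is used, is to redirect stdout around the training call:

import contextlib
import io


def run_quietly(train_fn, *args, **kwargs):
    """Swallow anything the training loop prints to stdout."""
    with contextlib.redirect_stdout(io.StringIO()):
        return train_fn(*args, **kwargs)

# e.g. run_quietly(chainerrl.experiments.train_agent, ...)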

env.spec.timestep_limit has been deprecated

gym now complains:

DEPRECATION WARNING: env.spec.timestep_limit has been deprecated. Replace your call to env.spec.timestep_limit with env.spec.tags.get('wrapper_config.TimeLimit.max_episode_steps'). This change was made 12/28/2016 and is included in version 0.7.0
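
The replacement the warning asks for, as a one-line sketch:

# gym >= 0.7.0
timestep_limit = env.spec.tags.get('wrapper_config.TimeLimit.max_episode_steps')
# instead of the deprecated
# timestep_limit = env.spec.timestep_limit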

Documentation on usage of recurrent models

In ChainerRL, to use user-defined recurrent models, you need to make sure they implement the chainerrl.recurrent.Recurrent interface; otherwise they won't be treated as recurrent models.

When your model's recurrent-ness comes from chainer.links.LSTM, all you have to do is inherit chainerrl.recurrent.RecurrentChainMixin.

This kind of information is missing from the documentation.
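
Until the documentation covers this, here is a minimal sketch of a recurrent Q-function whose recurrence comes from chainer.links.LSTM and which therefore only needs to inherit RecurrentChainMixin. The class name and layer sizes are arbitrary.

import chainer
import chainer.functions as F
import chainer.links as L
from chainerrl.action_value import DiscreteActionValue
from chainerrl.recurrent import RecurrentChainMixin


class MyRecurrentQFunction(chainer.Chain, RecurrentChainMixin):
    """RecurrentChainMixin lets ChainerRL find and manage the LSTM's state."""

    def __init__(self, obs_size, n_actions):
        super(MyRecurrentQFunction, self).__init__(
            l1=L.Linear(obs_size, 64),
            lstm=L.LSTM(64, 64),
            out=L.Linear(64, n_actions),
        )

    def __call__(self, x):
        h = F.relu(self.l1(x))
        h = self.lstm(h)
        return DiscreteActionValue(self.out(h))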

The tutorial code causes TypeError on python 3.4

On Python 3.4, random.sample doesn't accept a collections.deque, so I got the following error.

Traceback (most recent call last):
  File "quickstart.py", line 111, in <module>
    action = agent.act_and_train(obs, reward)
  File "/opt/rl/lib/python3.4/site-packages/chainerrl/agents/dqn.py", line 340, in act_and_train
    self.replay_updator.update_if_necessary(self.t)
  File "/opt/rl/lib/python3.4/site-packages/chainerrl/replay_buffer.py", line 194, in update_if_necessary
    transitions = self.replay_buffer.sample(self.batchsize)
  File "/opt/rl/lib/python3.4/site-packages/chainerrl/replay_buffer.py", line 42, in sample
    return random.sample(self.memory, n)
  File "/opt/rl/lib/python3.4/random.py", line 311, in sample
    raise TypeError("Population must be a sequence or set.  For dicts, use list(d).")
TypeError: Population must be a sequence or set.  For dicts, use list(d).

Python 2.7 works fine, and maybe 3.5+ as well.
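
A minimal workaround sketch, if you are stuck on Python 3.4, is to convert the deque to a list before sampling (at the cost of a copy):

import collections
import random

memory = collections.deque(maxlen=10 ** 6)
memory.extend(range(1000))  # stand-in for stored transitions

# random.sample() rejects a deque on Python 3.4, so copy it to a list first.
batch = random.sample(list(memory), 32)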

ValueError: On entry to SGEMV parameter number 8 had an illegal value

Travis CI failed on examples/gym/train_ddpg_gym.py:

Traceback (most recent call last):
  File "examples/gym/train_ddpg_gym.py", line 173, in <module>
    main()
  File "examples/gym/train_ddpg_gym.py", line 170, in main
    max_episode_len=timestep_limit)
  File "/home/travis/build/pfnet/chainerrl/chainerrl/experiments/train_agent.py", line 144, in train_agent_with_evaluation
    logger=logger)
  File "/home/travis/build/pfnet/chainerrl/chainerrl/experiments/train_agent.py", line 52, in train_agent
    action = agent.act_and_train(obs, r)
  File "/home/travis/build/pfnet/chainerrl/chainerrl/agents/ddpg.py", line 314, in act_and_train
    self.replay_updater.update_if_necessary(self.t)
  File "/home/travis/build/pfnet/chainerrl/chainerrl/replay_buffer.py", line 327, in update_if_necessary
    self.update_func(transitions)
  File "/home/travis/build/pfnet/chainerrl/chainerrl/agents/ddpg.py", line 246, in update
    self.actor_optimizer.update(lambda: self.compute_actor_loss(batch))
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/chainer/optimizer.py", line 416, in update
    loss.backward()
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/chainer/variable.py", line 398, in backward
    gxs = func.backward(in_data, out_grad)
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/chainer/functions/connection/linear.py", line 59, in backward
    gW = gy.T.dot(x).astype(W.dtype, copy=False)
ValueError: On entry to SGEMV parameter number 8 had an illegal value

This may be the same issue as chainer/chainer#2744

Date and time format for experiments

The human-readability of the name of an experiment's output subdirectory could be improved. The current implementation (time_str = datetime.datetime.now().strftime('%Y%m%d%H%M%S%f') in chainerrl/experiments/prepare_output_dir.py) produces e.g. 21120903182945898662. How about

  • strftime('%Y%m%d-%H%M%S-%f') (e.g. 21120903-182945-898662), or
  • the basic format in ISO 8601 (e.g. 21120903T182945.898662+0900)?
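
For comparison, a quick sketch of the three formats (a naive datetime is used here, so the +0900 offset in the ISO 8601 example would additionally require a timezone-aware datetime):

import datetime

now = datetime.datetime.now()
current  = now.strftime('%Y%m%d%H%M%S%f')    # e.g. 21120903182945898662
readable = now.strftime('%Y%m%d-%H%M%S-%f')  # e.g. 21120903-182945-898662
iso_like = now.strftime('%Y%m%dT%H%M%S.%f')  # e.g. 21120903T182945.898662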

Question on gym action space

Hi, I've defined my own OpenAI Gym environment and specified my actions as follows:

self.actions = ["NOOP", "LEFT", "RIGHT", "FIRE", "CLOAK"]
self.action_space = spaces.Discrete(len(self.actions))

When I try my environment with the 'train_dqn_gym.py' example, I can see from my debug output that training correctly tries a variety of different actions.

However, with both 'train_a3c_gym.py' and 'train_acer_gym.py', the action passed to my step method is always 0 (NOOP); it never tries any other action.

Have I coded something wrong in my environment? I would appreciate any tips on how to investigate my issue further.

Specify successful configurations for examples

The current examples don't specify in what configurations they work well, except for the newer ones (train_pcl_gym.py and train_reinforce_gym.py). Such instructions are important so that users can easily confirm that the implementations actually work.

  • ale/train_a3c_ale.py
  • ale/train_acer_ale.py
  • ale/train_dqn_ale.py
  • ale/train_nsq_ale.py
  • gym/train_a3c_gym.py
  • gym/train_acer_gym.py
  • gym/train_ddpg_gym.py
  • gym/train_dqn_gym.py
  • gym/train_pcl_gym.py
  • gym/train_reinforce_gym.py

REINFORCE

A simple REINFORCE implementation that doesn't require a value function would be helpful.

Type of observation and action space

As far as I can tell from the gym examples, the agent and q_functions expect the observation space to be a Box and the action space to be a Discrete. Is that correct?

If so, how should I handle observations and actions of other types, especially a Tuple? Do I need to modify the environment so that it returns a Box, or is there another option (for example, a wrapper that flattens the Tuple, as sketched below)?

Thanks,
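
One possible direction, sketched below under the assumption that every element of the Tuple is a Box, is an observation wrapper that flattens the Tuple into a single Box. The wrapper name is hypothetical, and older gym versions override _observation instead of observation.

import gym
import numpy as np


class FlattenTupleObservation(gym.ObservationWrapper):
    """Hypothetical wrapper: concatenate a Tuple of Box observations into
    one flat Box so that the standard q_functions can consume it."""

    def __init__(self, env):
        super(FlattenTupleObservation, self).__init__(env)
        spaces = env.observation_space.spaces
        low = np.concatenate([s.low.flatten() for s in spaces])
        high = np.concatenate([s.high.flatten() for s in spaces])
        self.observation_space = gym.spaces.Box(low=low, high=high)

    def observation(self, observation):
        return np.concatenate(
            [np.asarray(o, dtype=np.float32).flatten() for o in observation])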

Unknown CUDA error when running ChainerRL under Bash on Windows

I recently got a Windows 10 computer and successfully installed Bash on Ubuntu on Windows, CUDA, cuDNN, Chainer, and ChainerRL. But when I run the example, I get the following error. Any suggestions?

(py2env) neil@DESKTOP-C22605O:~/chainerrl$ xvfb-run -s "-screen 0 1400x900x24" python examples/gym/train_dqn_gym.py
Output files are saved in dqn_out/20170324141722891586
INFO:gym.envs.registration:Making new env: Pendulum-v0
Traceback (most recent call last):
  File "examples/gym/train_dqn_gym.py", line 179, in <module>
    main()
  File "examples/gym/train_dqn_gym.py", line 154, in main
    episodic_update=args.episodic_replay, episodic_update_len=16)
  File "/home/neil/py2env/local/lib/python2.7/site-packages/chainerrl/agents/dqn.py", line 115, in __init__
    cuda.get_device(gpu).use()
  File "cupy/cuda/device.pyx", line 75, in cupy.cuda.device.Device.use (cupy/cuda/device.cpp:2083)
  File "cupy/cuda/device.pyx", line 81, in cupy.cuda.device.Device.use (cupy/cuda/device.cpp:2035)
  File "cupy/cuda/runtime.pyx", line 178, in cupy.cuda.runtime.setDevice (cupy/cuda/runtime.cpp:2915)
  File "cupy/cuda/runtime.pyx", line 130, in cupy.cuda.runtime.check_status (cupy/cuda/runtime.cpp:2241)
cupy.cuda.runtime.CUDARuntimeError: cudaErrorUnknown: unknown error
