A recurrent, multi-process and readable PyTorch implementation of the deep reinforcement learning algorithms A2C and PPO, inspired by 3 repositories.
Features:
- Tensor or dict of tensors observation space
- Discrete & continuous action space
- Entropy regularization
- Reward shaping
- Recurrent policy by specifying the recurrence
- Fast:
  - Multiprocessing for collecting trajectories in multiple environments simultaneously
  - GPU (CUDA) for tensor operations
- Tensorboard
- PyTorch 0.4.0
You have to clone the repository and then install the module:
pip3 install -e torch_rl
To get updates to the code, you just need to run git pull. There is no need to install the module again.
The module consists of:
- 2 classes, torch_rl.A2CAlgo and torch_rl.PPOAlgo, for the A2C and PPO algorithms respectively
- 2 abstract classes, torch_rl.ACModel and torch_rl.RecurrentACModel, for non-recurrent and recurrent actor-critic models respectively
- 1 class, torch_rl.DictList, for making dictionaries of lists batch-friendly
I will detail here the points that can't be understood immediately by looking at the definition files of the classes, or by looking at the arguments of scripts/train.py (listed by the scripts/train.py --help command).
torch_rl.A2CAlgo and torch_rl.PPOAlgo have 2 methods:
- __init__, which may take, among other parameters:
  - an acmodel actor-critic model, an instance of a class that inherits from one of the two abstract classes torch_rl.ACModel or torch_rl.RecurrentACModel.
  - a preprocess_obss function that transforms a list of observations given by the environment into an object X. This object X must support retrieving a sublist of preprocessed observations with X[indexes], given a list of indexes indexes. By default, the observations given by the environment are transformed into a PyTorch tensor.
  - a reshape_reward function that takes, in order, an observation obs, the action action of the model, the reward reward and the terminal status done, and returns a new reward.
  - a recurrence number that specifies over how many timesteps the gradient is backpropagated. This number is only considered if a recurrent model is used, and it must divide the num_frames_per_agent parameter and, for PPO, the batch_size parameter.
- update_parameters, which returns some logs.
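To illustrate the reshape_reward and recurrence parameters described above, here is a minimal sketch. The scaling factor, the helper name check_recurrence and the example numbers are assumptions made for this example and are not taken from the package.

```python
def reshape_reward(obs, action, reward, done):
    """Return a new reward; here the raw reward is simply scaled (assumed factor)."""
    return 20 * reward


def check_recurrence(recurrence, num_frames_per_agent, batch_size=None):
    """Hypothetical helper: True if `recurrence` divides the given sizes."""
    sizes = [num_frames_per_agent]
    if batch_size is not None:  # batch_size only matters for PPO
        sizes.append(batch_size)
    return all(size % recurrence == 0 for size in sizes)


print(reshape_reward(obs=None, action=0, reward=0.5, done=False))
print(check_recurrence(4, num_frames_per_agent=128, batch_size=256))
print(check_recurrence(3, num_frames_per_agent=128))
```

A function with this signature is what would be passed as reshape_reward, and a check like the second one can catch an invalid recurrence before training starts.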
torch_rl.ACModel has 2 abstract methods:
- __init__, which takes the observation_space and the action_space given by the environment.
- forward, which takes N preprocessed observations obs and returns a PyTorch distribution dist and a tensor of values value. The tensor of values must be of size N, not N x 1.
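The forward contract above can be sketched as follows. To stay self-contained, this example subclasses torch.nn.Module rather than torch_rl.ACModel, and the class name and layer sizes are assumptions. A real model would inherit from the abstract class and derive its sizes from observation_space and action_space.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class TinyACModel(nn.Module):
    """Illustrative actor-critic model for flat observations and discrete actions."""

    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.actor = nn.Linear(obs_dim, n_actions)
        self.critic = nn.Linear(obs_dim, 1)

    def forward(self, obs):
        # obs has shape N x obs_dim.
        dist = Categorical(logits=self.actor(obs))
        # The critic outputs N x 1; squeeze it to size N as the contract requires.
        value = self.critic(obs).squeeze(1)
        return dist, value


model = TinyACModel(obs_dim=4, n_actions=3)
dist, value = model(torch.zeros(5, 4))
print(value.shape)
```

The squeeze(1) call is the part that turns the N x 1 critic output into a tensor of size N.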
torch_rl.RecurrentACModel has 3 abstract methods:
- __init__, which takes the same parameters as torch_rl.ACModel.
- forward, which takes the same parameters as torch_rl.ACModel along with a tensor of N memories memory of size N x M, where M is the size of a memory. It returns the same things as torch_rl.ACModel plus a tensor of N memories memory.
- memory_size, which returns the size M of a memory.
For speed purposes, the observations are only preprocessed once. Hence, because of the use of batches in PPO, the preprocessed observations X must support retrieving a sublist of preprocessed observations with X[indexes], given a list of indexes indexes. If your preprocessed observations are a PyTorch tensor, nothing more is needed. If you want your preprocessed observations to be a dictionary of lists or of tensors, nothing more is needed either, provided you use the torch_rl.DictList class as follows:
>>> d = DictList({"a": [[1, 2], [3, 4]], "b": [[5], [6]]})
>>> d.a
[[1, 2], [3, 4]]
>>> d[0]
DictList({"a": [1, 2], "b": [5]})
Note: if you use an RNN, you will need to set batch_first to True.
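To make the X[indexes] requirement concrete, here is a minimal stand-in that mimics the DictList behaviour shown above for plain Python lists. It is an illustrative sketch with a hypothetical class name, not the torch_rl implementation.

```python
class MiniDictList(dict):
    """Illustrative stand-in for a DictList-like container (not the torch_rl code)."""

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails: expose keys as attributes.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

    def __getitem__(self, index):
        if isinstance(index, str):
            # A string is an ordinary dictionary key.
            return dict.__getitem__(self, index)
        if isinstance(index, int):
            # An integer selects the same position in every entry.
            return MiniDictList({k: v[index] for k, v in self.items()})
        # A list of indexes selects a sub-batch from every entry.
        return MiniDictList({k: [v[i] for i in index] for k, v in self.items()})


d = MiniDictList({"a": [[1, 2], [3, 4]], "b": [[5], [6]]})
print(d.a)
print(d[0])
print(d[[1, 0]])
```

Indexing with a list of indexes is the operation that batching relies on: every entry of the dictionary is sliced with the same indexes, so the entries stay aligned.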
An example of use of the torch_rl.A2CAlgo and torch_rl.PPOAlgo classes is given in scripts/train.py.
An example of implementation of the torch_rl.RecurrentACModel abstract class is given in model.py.
An example of use of torch_rl.DictList and an example of a preprocess_obss function are given in the ObssPreprocessor.__call__ function of utils/format.py.
OMP_NUM_THREADS affects the number of threads used by MKL. The default value may severely damage your performance. This can be avoided by setting it to 1:
export OMP_NUM_THREADS=1
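The same setting can also be applied from inside a Python script. This is only a sketch, and it assumes the assignment runs before the numerical libraries (for example torch) are imported, since thread pools typically read the variable when they are initialised.

```python
import os

# Assumption: this runs before importing torch or NumPy, so that MKL/OpenMP
# see the value when they start up.
os.environ["OMP_NUM_THREADS"] = "1"

print(os.environ["OMP_NUM_THREADS"])
```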
Along with the torch_rl package, I provide 3 general reinforcement learning scripts:
- train.py for training an actor-critic model with A2C or PPO.
- enjoy.py for visualizing your trained model acting.
- evaluate.py for evaluating the performance of your trained model over X episodes.
For your own purposes, you will probably need to change:
- the model in model.py,
- the ObssPreprocessor.__call__ method in utils/format.py.
They were designed especially for the MiniGrid environments. These environments give the agent an observation containing an image and a textual instruction, and a reward of 1 if it successfully executes the instruction, 0 otherwise. They are used in what follows for illustration purposes.
These scripts assume that you have already installed the gym package (with pip3 install gym for example). By default, models and logs are stored in the storage folder. You can define a different folder in the environment variable TORCH_RL_STORAGE.
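The override described above can be sketched as follows. The helper name get_storage_dir and the example path are hypothetical, and the scripts may resolve the folder differently.

```python
import os


def get_storage_dir(default="storage"):
    """Hypothetical helper: storage folder, overridable via TORCH_RL_STORAGE."""
    return os.environ.get("TORCH_RL_STORAGE", default)


os.environ.pop("TORCH_RL_STORAGE", None)
print(get_storage_dir())  # falls back to the default folder

os.environ["TORCH_RL_STORAGE"] = "/tmp/my_runs"  # made-up path
print(get_storage_dir())  # now uses the override
```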
scripts/train.py enables you to load a model, train it with the specified actor-critic algorithm and save it in the storage folder.
2 arguments are required:
- --algo ALGO: name of the actor-critic algorithm.
- --env ENV: name of the environment to train on.
and a bunch of optional arguments are available, among which:
- --model MODEL: name of the model, used for loading and saving it. If not specified, it is the _-concatenation of the environment name and algorithm name.
- --frames-per-proc FRAMES_PER_PROC: number of frames per process before updating parameters.
- ... (see more using --help)
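The arguments listed above, including the default model name, can be sketched with argparse as follows. Only the four flags named in the list are reproduced; the function name parse_args and the None defaults are assumptions, and the actual script may define its parser differently.

```python
import argparse


def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--algo", required=True,
                        help="name of the actor-critic algorithm")
    parser.add_argument("--env", required=True,
                        help="name of the environment to train on")
    parser.add_argument("--model", default=None,
                        help="name of the model, used for loading and saving it")
    parser.add_argument("--frames-per-proc", type=int, default=None,
                        help="number of frames per process before updating parameters")
    args = parser.parse_args(argv)
    if args.model is None:
        # Default name: environment name and algorithm name joined by "_".
        args.model = f"{args.env}_{args.algo}"
    return args


args = parse_args(["--algo", "ppo", "--env", "MiniGrid-DoorKey-5x5-v0"])
print(args.model)
```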
Here is an example of command:
python3 -m scripts.train --algo ppo --env MiniGrid-DoorKey-5x5-v0 --model DoorKey --save-interval 10 --frames 1000000
This will print some logs in your terminal, where:
- "U" is for "Update".
- "F" is for the total number of "Frames".
- "FPS" is for "Frames Per Second".
- "D" is for "Duration".
- "rR" is for "reshaped Return" per episode. The 4 following numbers are, in order, the mean x̄, the standard deviation σ, the minimum m and the maximum M of the reshaped return per episode during the update.
- "F" is for the number of "Frames" per episode. The 4 following numbers are again, in order, the mean, the standard deviation, the minimum and the maximum of the number of frames per episode during the update.
- "H" is for "Entropy".
- "V" is for "Value".
- "pL" is for "policy Loss".
- "vL" is for "value Loss".
- "∇" is for the gradient norm.
These logs are also saved in a log file in storage.
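The four statistics reported for "rR" and for the frames per episode (mean, standard deviation, minimum, maximum) can be computed as in the sketch below. The data is made up, the function name summarize is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator the scripts use.

```python
import statistics


def summarize(values):
    """Return (mean, std, min, max); population std is an assumption here."""
    return (
        statistics.mean(values),
        statistics.pstdev(values),
        min(values),
        max(values),
    )


returns_per_episode = [0.0, 0.5, 1.0, 0.5]  # made-up example data
mean, std, lowest, highest = summarize(returns_per_episode)
print(f"{mean:.2f} {std:.2f} {lowest:.2f} {highest:.2f}")
```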
If you add --tb to the command, logs are also plotted in Tensorboard using the tensorboardX package, which you can install with pip3 install tensorboardX. Then, you just have to execute:
tensorboard --logdir storage
scripts/enjoy.py enables you to visualize your trained model acting.
2 arguments are required:
- --env ENV: name of the environment to act on.
- --model MODEL: name of the trained model.
and several optional arguments are available (see more using --help).
Here is an example of command:
python3 -m scripts.enjoy --env MiniGrid-DoorKey-5x5-v0 --model DoorKey
In the MiniGrid-DoorKey-6x6-v0 environment, the agent has to reach the green goal. In particular, it has to learn how to open a locked door.
In the MiniGrid-GoToDoor-5x5-v0 environment, the agent has to open a door specified by its color. In particular, it has to understand textual instructions.
In the MiniGrid-RedBlueDoors-6x6-v0 environment, the agent has to open the red door and then the blue door. Because the agent initially faces the blue door, it has to remember whether the red door has been opened.
scripts/evaluate.py enables you to evaluate the performance of your trained model on X episodes.
2 arguments are required:
- --env ENV: name of the environment to act on.
- --model MODEL: name of the trained model.
and several optional arguments are available (see more using --help).
By default, the model is tested on 100 episodes with the random seed set to 2, instead of the seed 1 used during training.
Here is an example of command:
python3 -m scripts.evaluate --env MiniGrid-DoorKey-5x5-v0 --model DoorKey
This will print the evaluation in your terminal, where "R" is for "Return" per episode.