pfrl's Issues

Why does DuelingDQN not have an observation size?

I thought DuelingDQN was similar to DQN and that they take the same inputs, but the implementation of DuelingDQN in q_functions does not have an obs_size argument, which confuses me. How does it know the dimension of its inputs?
Besides, I get an error when I replace DQN with DuelingDQN:
RuntimeError: Expected 4-dimensional input for 4-dimensional weight [32, 2, 8, 8], but got 2-dimensional input of size [1, 2] instead
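For reference, pfrl's built-in DuelingDQN in q_functions is a convolutional architecture meant for image observations (it takes the number of actions and input channels rather than obs_size), which would explain the 4-dimensional-input error when a flat [1, 2] observation is passed in. Below is a minimal sketch of a fully connected dueling Q-function for flat observations; it is not pfrl's built-in class, and obs_size, n_actions, and the hidden width are placeholders.

import torch
import torch.nn as nn
from pfrl.action_value import DiscreteActionValue


class FCDuelingQFunction(nn.Module):
    """Fully connected dueling Q-function for flat observations (sketch)."""

    def __init__(self, obs_size, n_actions, hidden_size=64):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(obs_size, hidden_size), nn.ReLU())
        self.advantage = nn.Linear(hidden_size, n_actions)
        self.value = nn.Linear(hidden_size, 1)

    def forward(self, x):
        h = self.feature(x)
        advantage = self.advantage(h)
        value = self.value(h)
        # Combine the two streams as in the dueling-network paper.
        q = value + advantage - advantage.mean(dim=1, keepdim=True)
        return DiscreteActionValue(q)


q_func = FCDuelingQFunction(obs_size=2, n_actions=4)
print(q_func(torch.zeros(1, 2)).greedy_actions)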

ACER - Examples on continuous action space

Hello,

I am working on an RL project where I want to use the ACER algorithm on continuous action space problems (PyBullet environments), but I am having difficulty implementing it with your framework. Would it be possible for you to add an example of how to use this algorithm on this class of problems?

PPO: stack.pop() from empty list?

Hey, thanks for this library, it is very helpful!
I am implementing a simple recurrent PPO agent, and I keep getting the following error message:

INFO:pfrl.experiments.train_agent_batch:outdir:rnn_run step:2040 episode:22 last_R: 0.0 average_R:0.25239999999999996
INFO:pfrl.experiments.train_agent_batch:statistics: [('average_value', -0.06298958), ('average_entropy', 1.9389327), ('average_value_loss', nan), ('average_policy_loss', nan), ('n_updates', 0), ('explained_variance', nan)]
INFO:pfrl.experiments.train_agent_batch:Saved the agent to rnn_run/2040_except
Traceback (most recent call last):
File "rnn_minigrid_ppo.py", line 99, in
pfrl.experiments.train_agent_batch(
File "/home/sharan/Reccurent-GAIL/pfrl/pfrl/experiments/train_agent_batch.py", line 82, in train_agent_batch
agent.batch_observe(obss, rs, dones, resets)
File "/home/sharan/Reccurent-GAIL/pfrl/pfrl/agents/ppo.py", line 681, in batch_observe
self._batch_observe_train(batch_obs, batch_reward, batch_done, batch_reset)
File "/home/sharan/Reccurent-GAIL/pfrl/pfrl/agents/ppo.py", line 807, in _batch_observe_train
self._update_if_dataset_is_ready()
File "/home/sharan/Reccurent-GAIL/pfrl/pfrl/agents/ppo.py", line 429, in _update_if_dataset_is_ready
self._update_recurrent(dataset)
File "/home/sharan/Reccurent-GAIL/pfrl/pfrl/agents/ppo.py", line 628, in _update_recurrent
for minibatch in _yield_subset_of_sequences_with_fixed_number_of_items(
File "/home/sharan/Reccurent-GAIL/pfrl/pfrl/agents/ppo.py", line 165, in _yield_subset_of_sequences_with_fixed_number_of_items
sequence = stack.pop()
IndexError: pop from empty list

I have changed the batch_size, max_recurrent_sequence_len, and the number of parallel envs, but I consistently get this error at step 2040.
Can you help me out, please?

CUDA error when sampling an action from the SoftmaxCategoricalHead distribution

I have a custom environment implemented with the Gym API. It has 3-channel image observations and 4 actions. I'm training PPO with a CNN-based policy network. I get a CUDA error when sampling from the SoftmaxCategoricalHead. The error happens at a different step each run, even though I'm using pfrl.utils.set_random_seed(args.seed).

The error is below with CUDA_LAUNCH_BLOCKING=1:

/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:190: sampleMultinomialOnce: block: [0,0,0], thread: [3,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:190: sampleMultinomialOnce: block: [3,0,0], thread: [0,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:190: sampleMultinomialOnce: block: [3,0,0], thread: [1,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:190: sampleMultinomialOnce: block: [3,0,0], thread: [2,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:190: sampleMultinomialOnce: block: [3,0,0], thread: [3,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:190: sampleMultinomialOnce: block: [4,0,0], thread: [0,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:190: sampleMultinomialOnce: block: [4,0,0], thread: [1,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:190: sampleMultinomialOnce: block: [4,0,0], thread: [2,0,0] Assertion `val >= zero` failed.
/pytorch/aten/src/ATen/native/cuda/MultinomialKernel.cu:190: sampleMultinomialOnce: block: [4,0,0], thread: [3,0,0] Assertion `val >= zero` failed.
THCudaCheck FAIL file=/pytorch/torch/csrc/generic/serialization.cpp line=31 error=710 : device-side assert triggered
Traceback (most recent call last):
  File "/home/tarik/projects/pfrl/pfrl/experiments/train_agent_batch.py", line 71, in train_agent_batch
    actions = agent.batch_act(obss)
  File "/home/tarik/projects/pfrl/pfrl/agents/ppo.py", line 654, in batch_act
    return self._batch_act_train(batch_obs)
  File "/home/tarik/projects/pfrl/pfrl/agents/ppo.py", line 712, in _batch_act_train
    batch_action = action_distrib.sample().cpu().numpy()
  File "/home/tarik/venvs/research/lib/python3.8/site-packages/torch/distributions/categorical.py", line 107, in sample
    samples_2d = torch.multinomial(probs_2d, sample_shape.numel(), True).T
RuntimeError: CUDA error: device-side assert triggered

My network:

model = nn.Sequential(
  IMPALACNN(),
  pfrl.nn.Branched(
      nn.Sequential(
          lecun_init(nn.Linear(512, n_actions), 1e-2),
          SoftmaxCategoricalHead(),
      ),
      lecun_init(nn.Linear(512, 1))
  )
)

where IMPALACNN can be seen here.

As far as I understand, the issue comes from sampling with infinite logits, but I don't know why the last linear layer would produce infinite values.

Edit: I printed out the probs and the logits of the Categorical dist, and it seems the policy network produces NaN values for a single batch of observation data. I double-checked my environment, and it doesn't return any NaN or inf values in the observations. The strange thing is that the policy network returns NaN for every environment. I'm training with 16 environments on a GPU. Here is the output of the probs and logits before the error:

probs tensor([[nan, nan, nan, nan],                                                                                                                                                                          
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan]], device='cuda:0')                                                                                                                                                              
logits tensor([[nan, nan, nan, nan],
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan],                                                                                                                                                                                
        [nan, nan, nan, nan]], device='cuda:0') 
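Not part of the original report, but one way to localize where the NaNs first appear is to register forward hooks that flag the first module emitting non-finite values. This is a generic PyTorch debugging aid, not a pfrl API:

import torch


def add_finite_checks(model):
    """Raise as soon as any submodule outputs a non-finite tensor."""

    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise RuntimeError("Non-finite output from module: {}".format(name))

        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))


# Example: call add_finite_checks(model) once before training starts.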

Logged average_q is NaN for CategoricalDQN.get_statistics()

For CategoricalDQN, the logged average_q values in the statistics are always NaN.

The average_q stat is the mean value of self.q_record.
https://github.com/pfnet/pfrl/blob/d420891573/pfrl/agents/dqn.py#L695

However, CategoricalDQN overrides the _compute_loss function,
https://github.com/pfnet/pfrl/blob/d420891573/pfrl/agents/categorical_dqn.py#L172-L198
which in the base DQN is responsible for appending a Q value to self.q_record, so nothing is ever appended and the mean is always NaN.

Obtaining scalar Q values in categorical algorithms is a bit unclear, since they compute a distribution of Q values instead of a scalar, but I think taking the expectation of the distribution is natural (as greedy_actions does).
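A hedged illustration of that expectation, using generic tensor shapes rather than pfrl's internal attribute names: given per-action probability masses over the atom values z, the scalar Q values are the expected values, which could then be appended to self.q_record.

import torch


def expected_q_values(probs, z):
    """probs: (batch, n_actions, n_atoms); z: (n_atoms,) atom values."""
    return (probs * z).sum(dim=-1)  # (batch, n_actions)


probs = torch.softmax(torch.randn(2, 3, 51), dim=-1)
z = torch.linspace(-10.0, 10.0, 51)
print(expected_q_values(probs, z).max(dim=1).values)  # greedy scalar Q per sample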

ACER raises an error when GaussianHeadWithFixedCovariance is used

Reported in #143

ACER assumes that all the parameters of a distribution (defined by get_params_of_distribution) require grad so that the algorithm can compute the gradient with respect to those parameters.

pfrl/pfrl/agents/acer.py

Lines 172 to 180 in 44bf2e4

def get_params_of_distribution(distrib):
    if isinstance(distrib, torch.distributions.Independent):
        return get_params_of_distribution(distrib.base_dist)
    elif isinstance(distrib, torch.distributions.Categorical):
        return (distrib._param,)
    elif isinstance(distrib, torch.distributions.Normal):
        return distrib.loc, distrib.scale
    else:
        raise NotImplementedError("{} is not supported by ACER".format(type(distrib)))

pfrl/pfrl/agents/acer.py

Lines 218 to 221 in 44bf2e4

distrib_params = get_params_of_distribution(distrib)
for param in distrib_params:
    assert param.shape[0] == 1
    assert param.requires_grad

However, when GaussianHeadWithFixedCovariance (class GaussianHeadWithFixedCovariance(nn.Module)) is used, the scale parameter of the torch.distributions.Normal distribution does not require grad, resulting in an assertion error.
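A minimal reproduction sketch of the failing assumption: the fixed covariance is a constant tensor, so it does not require grad.

import torch

loc = torch.zeros(1, 2, requires_grad=True)  # produced by the network
scale = torch.full((1, 2), 0.3)  # fixed covariance: a constant without grad
distrib = torch.distributions.Independent(torch.distributions.Normal(loc, scale), 1)

for name, param in [("loc", distrib.base_dist.loc), ("scale", distrib.base_dist.scale)]:
    print(name, param.requires_grad)  # scale -> False, which trips ACER's assert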

Stratified sampling is not used in PrioritizedReplayBuffer

According to Appendix B.2.1 of the PER paper (http://arxiv.org/abs/1511.05952), the original PER implementation uses stratified sampling:

To sample a minibatch of size k, the range [0, ptotal] is divided equally into k ranges. Next, a value is uniformly sampled from each range. Finally the transitions that correspond to each of these sampled values are retrieved from the tree.

This is different from what PFRL's PrioritizedReplayBuffer does right now, i.e., sampling proportionally without replacement k times:

root = self.root
ixl, ixr = self.bounds
for _ in range(n):
    ix = _find(ixl, ixr, root, np.random.uniform(0.0, root[2]))
    val = self._write(ix, 0.0)
    ixs.append(ix)
    vals.append(val)

It is not clear whether stratified sampling leads to better performance. In a sense, PFRL's way could be better, since it strictly prevents the same minibatch from containing duplicate transitions. However, the difference should be noted, and it would be good to support and evaluate stratified sampling as well (see the sketch below).
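A hedged sketch of the stratified variant described in the paper (not pfrl's implementation): split [0, p_total] into k equal strata, draw one value uniformly from each stratum, and look each value up in the sum tree.

import numpy as np


def stratified_sample_values(p_total, k):
    """Return k values, one drawn uniformly from each of k equal strata of [0, p_total]."""
    edges = np.linspace(0.0, p_total, k + 1)
    return np.random.uniform(edges[:-1], edges[1:])


print(stratified_sample_values(p_total=10.0, k=4))  # each value would then index the sum tree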

Cannot pass multiple inputs to a recurrent policy with PPO

Here is the definition of my policy, which recurrently maps two inputs to an action and a value estimate. The policy takes two PackedSequences put in a tuple. The model works (more or less) as I expected.

class Foo(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3),
            nn.ReLU(),
            nn.Conv2d(32, 64, 3),
            nn.ReLU(),
            nn.Flatten(),
        )

    def forward(self, x):
        cnn_out = self.cnn(x[0])
        out = torch.cat((cnn_out, x[1]), 1)
        return out


foo = pfrl.nn.RecurrentSequential(
    Foo(),
    nn.GRU(num_layers=1, input_size=64 * 4 * 4 + 12, hidden_size=128),
    pfrl.nn.Branched(
        nn.Sequential(
            nn.Linear(128, 4),
            SoftmaxCategoricalHead(),
        ),
        nn.Linear(128, 1),
    ),
)

print(foo((torch.nn.utils.rnn.pack_sequence(torch.rand(1, 32, 3, 8, 8)),
           torch.nn.utils.rnn.pack_sequence(torch.rand(1, 32, 12))), None))

I am trying to use this with PPO. This time I put two tensors in a tuple, hoping that they are converted to two PackedSequences inside the agent. However, the preprocessing of the tensors throws the following error:

opt = torch.optim.Adam(foo.parameters(), lr=2.5e-4, eps=1e-5)

def phi(x):
    return x
 
agent = PPO(
        foo,
        opt,
        gpu=-1,
        phi=phi,
        update_interval=8,
        minibatch_size=32*8,
        epochs=4,
        clip_eps=0.1,
        clip_eps_vf=None,
        standardize_advantages=True,
        entropy_coef=1e-2,
        recurrent=True,
        max_grad_norm=0.5,
    )

agent.batch_act(
    (
        (torch.rand([1, 32, 3, 8, 8]), torch.rand([1, 32, 12]),)
     ,),
)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-70b107928cd6> in <module>()
      1 agent.batch_act(
      2     (
----> 3         (torch.rand([1, 32, 3, 8, 8]), torch.rand([1, 32, 12]),)
      4      ,),
      5 )

3 frames
/usr/local/lib/python3.6/dist-packages/pfrl/agents/ppo.py in batch_act(self, batch_obs)
    652     def batch_act(self, batch_obs):
    653         if self.training:
--> 654             return self._batch_act_train(batch_obs)
    655         else:
    656             return self._batch_act_eval(batch_obs)

/usr/local/lib/python3.6/dist-packages/pfrl/agents/ppo.py in _batch_act_train(self, batch_obs)
    706                     self.train_recurrent_states,
    707                 ) = one_step_forward(
--> 708                     self.model, b_state, self.train_prev_recurrent_states
    709                 )
    710             else:

/usr/local/lib/python3.6/dist-packages/pfrl/utils/recurrent.py in one_step_forward(rnn, batch_input, recurrent_state)
    139         object: New batched recurrent state.
    140     """
--> 141     pack = pack_one_step_batch_as_sequences(batch_input)
    142     y, recurrent_state = rnn(pack, recurrent_state)
    143     return unpack_sequences_as_one_step_batch(y), recurrent_state

/usr/local/lib/python3.6/dist-packages/pfrl/utils/recurrent.py in pack_one_step_batch_as_sequences(xs)
    115         return tuple(pack_one_step_batch_as_sequences(x) for x in xs)
    116     else:
--> 117         return nn.utils.rnn.pack_sequence(xs[:, None])
    118 
    119 

TypeError: list indices must be integers or slices, not tuple

The input tuple is converted to a list by pfrl.utils.batch_states(), which is called inside pfrl.agents.PPO._batch_act_train(). The list is then passed to pfrl.utils.recurrent.pack_one_step_batch_as_sequences(), but that function expects a tuple. Maybe we could just collect multiple inputs in a tuple instead of a list in pfrl.utils.batch_states()?
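A minimal illustration of the type mismatch (generic PyTorch, unrelated to the environment): xs[:, None] works on a Tensor but not on a list.

import torch

xs_tensor = torch.rand(4, 3)
print(xs_tensor[:, None].shape)  # fine: torch.Size([4, 1, 3])

xs_list = [torch.rand(3) for _ in range(4)]
try:
    xs_list[:, None]
except TypeError as e:
    print(e)  # list indices must be integers or slices, not tuple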

I am still figuring out pfrl, and perhaps I am not correctly passing multiple inputs to a recurrent policy. Suggestions are welcome.

The snippet is found here: https://colab.research.google.com/drive/1wqEtZTvwu0IN7oZnbrp34W7lBxVyGhp6?usp=sharing

MultiDiscrete action spaces

I have a custom environment with a MultiDiscrete action space. A MultiDiscrete action space allows controlling an agent with an n-dimensional discrete action space.

In my environment, I have 4 dimensions, where each dimension has 11 actions. I'm trying to use A2C with a softmax policy. Below is the implementation of the policy and value networks. The output of the policy gives me an [N, 4, 11] tensor, where N is the batch size. The softmax is applied to the last dimension of this tensor, so I basically have 4 action distributions. I thought this would work, but I'm getting the following error:

Do I need to make changes to A2C, or am I doing something wrong?

  File "train_rl.py", line 90, in <module>
    train()
  File "train_rl.py", line 80, in train
    experiments.train_agent_batch(
  File "/home/tarik/venvs/tacto/lib/python3.8/site-packages/pfrl/experiments/train_agent_batch.py", line 86, in train_agent_batch
    agent.batch_observe(obss, rs, dones, resets)
  File "/home/tarik/venvs/tacto/lib/python3.8/site-packages/pfrl/agents/a2c.py", line 224, in batch_observe
    self._batch_observe_train(batch_obs, batch_reward, batch_done, batch_reset)
  File "/home/tarik/venvs/tacto/lib/python3.8/site-packages/pfrl/agents/a2c.py", line 288, in _batch_observe_train
    self.update()
  File "/home/tarik/venvs/tacto/lib/python3.8/site-packages/pfrl/agents/a2c.py", line 183, in update
    action_log_probs = action_log_probs.reshape(
RuntimeError: shape '[5, 2]' is invalid for input of size 40

policy = torch.nn.Sequential(
    torch.nn.Linear(44, 128),
    torch.nn.Tanh(),
    torch.nn.Linear(128, 128),
    torch.nn.Tanh(),
    torch.nn.Linear(128, 44),
    torch.nn.Unflatten(1, (4, 11)),
    SoftmaxCategoricalHead()
)

value = torch.nn.Sequential(
    torch.nn.Linear(44, 128),
    torch.nn.Tanh(),
    torch.nn.Linear(128, 128),
    torch.nn.Tanh(),
    torch.nn.Linear(128, 1),
)

model = pfrl.nn.Branched(policy, value)
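A hedged workaround sketch (not a confirmed fix): wrapping the per-dimension Categorical in torch.distributions.Independent makes log_prob return one scalar per sample (summed over the 4 dimensions) and sample return an (N, 4) integer action, which appears to be the shape A2C's reshape expects.

import torch
from torch import nn


class MultiSoftmaxHead(nn.Module):
    """Head for a MultiDiscrete space: expects logits of shape (N, dims, choices)."""

    def forward(self, logits):
        return torch.distributions.Independent(
            torch.distributions.Categorical(logits=logits), 1
        )


head = MultiSoftmaxHead()
distrib = head(torch.randn(5, 4, 11))
actions = distrib.sample()              # shape (5, 4)
print(distrib.log_prob(actions).shape)  # shape (5,): one log-prob per sample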

loading optimizer parameters

I'm working on task-specific curriculum learning for RL. Basically, I train a PPO agent on a simple task and, using the saved weights, train it on a harder task. Before I start training on the second task, I use the agent.load() function to load the network parameters. I realized that agent.load() also loads the parameters of the optimizer.

The problem is that I use learning rate decay for the first task, so if I load the last saved parameters, the optimizer gets a learning rate close to zero. Did I understand this correctly? If so, saved_attributes should be a parameter when creating the PPO agent.
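One hedged workaround in the meantime: load only the network weights and skip the optimizer state. This assumes the checkpoint directory contains a model.pt file, which is how pfrl's attribute-saving agents store the model state dict; preload_dir and model below refer to the names in the training script that follows.

import os

import torch

# preload_dir is the checkpoint directory; `model` is the network passed to PPO.
model.load_state_dict(
    torch.load(os.path.join(preload_dir, "model.pt"), map_location="cpu")
)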

Edit: below is a snippet from my training script:

def train(args):
    pfrl.utils.set_random_seed(args.seed)
    env = create_multi_env(args)
    args.num_actions = env.action_space.n
    args.obs_channels = env.observation_space.shape[0]
    print('Environment observation space: {}'.format(env.observation_space.shape))
    print('Environment action space     : {}'.format(env.action_space.n))

    model, opt = create_model(args)

    agent = create_agent(model, opt)

    if args.preload:
        preload_dir = os.path.join(args.outdir, args.preload, '20000000_finish')
        print('Loading pre-trained weights from {}...'.format(preload_dir))
        agent.load(preload_dir)

    args.outdir = os.path.join(args.outdir, args.model_name)

    def lr_setter(env, agent, value):
        for param_group in agent.optimizer.param_groups:
            param_group["lr"] = value

    step_hooks = [experiments.LinearInterpolationHook(args.steps, args.lr, 0, lr_setter)]

    print('Starting training...')
    experiments.train_agent_batch_with_evaluation(
        agent=agent,
        env=env,
        outdir=args.outdir,
        steps=args.steps,
        eval_n_steps=None,
        eval_n_episodes=args.eval_num_runs,
        eval_interval=args.eval_interval,
        checkpoint_freq=args.checkpoint_freq,
        log_interval=args.log_interval,
        save_best_so_far_agent=True,
        step_hooks=step_hooks,
    )

DDPG agent: Tensor object does not have attribute sample

pfrl/pfrl/agents/ddpg.py

Lines 258 to 262 in 70f3da9

def _batch_select_greedy_actions(self, batch_obs):
    with torch.no_grad(), evaluating(self.policy):
        batch_xs = self.batch_states(batch_obs, self.device, self.phi)
        batch_action = self.policy(batch_xs).sample()
        return batch_action.cpu().numpy()

Here, self.policy is of type nn.Module, so calling self.policy(batch_xs) returns a PyTorch Tensor, which does not have the attribute sample, and I get the error. What is wrong here?
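For comparison, the DDPG examples in pfrl end the policy with a head module that wraps the network output in a distribution, so .sample() is available. A sketch in that style follows; the sizes are placeholders, and pfrl.policies.DeterministicHead is assumed from the example scripts.

import torch.nn as nn

import pfrl

obs_size, action_size = 11, 3  # placeholders

policy = nn.Sequential(
    nn.Linear(obs_size, 400),
    nn.ReLU(),
    nn.Linear(400, 300),
    nn.ReLU(),
    nn.Linear(300, action_size),
    nn.Tanh(),
    pfrl.policies.DeterministicHead(),  # returns a distribution, so .sample() works
)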

Possible Hierarchical RL PR

Hello, I am an RL researcher, and my team and I have recently implemented HIRO (Data-Efficient Hierarchical Reinforcement Learning with Off-Policy Correction) with PFRL. I'm wondering whether a PR of an HRL algorithm (which required some large changes) would be welcome on this platform.

Thanks!

Recurrent DDPG implemented?

Judging from the code for DDPG, recurrent networks do not seem to be implemented yet, even though they appear to be supported:

https://github.com/pfnet/pfrl/blob/master/pfrl/agents/ddpg.py

Can you provide some information about which algorithm/paper you are referring to?
Is it Memory-based control with recurrent neural networks?
https://arxiv.org/abs/1512.04455

When could this be expected to be implemented? Thanks!

The score of `train_dqn_gym.py` with `--actor-learner` is lower than the baseline score.

It seems that the score of train_dqn_gym.py with --actor-learner is lower than the baseline score.

the result of train_dqn_gym.py with --actor-learner

$python train_dqn_gym.py --env CartPole-v0 --gpu -1 --actor-learner

$cat scores.txt
steps   episodes        elapsed mean    median  stdev   max     min     average_q       average_loss    cumulative_steps        n_updates       rlen
10102   295     27.7667019367218        142.13  143.5   31.933850505080155      200.0   66.0    1.1826544       0.002722540542599745    10102   3161    10102
20125   377     50.456093072891235      153.35  153.5   28.010595037296767      200.0   89.0    0.9678241       0.0008152683400840032   20125   6378    20125
30044   433     73.47023296356201       183.16  200.0   29.932152234240313      200.0   89.0    0.81571275      0.001155181206850102    30044   10039   30044
40125   504     95.93452501296997       166.15  187.5   42.48193317711244       200.0   21.0    0.81241447      0.0008331169206940104   40125   13427   40125
50072   567     118.65022873878479      179.05  200.0   31.635989925808527      200.0   105.0   0.76570725      0.0011051068906817818   50072   16768   50072
60069   630     141.68070459365845      140.9   119.0   38.606143856054096      200.0   99.0    0.7382145       0.0004422377867740579   60069   19830   60069
70140   688     165.85664129257202      195.39  200.0   13.644768117109832      200.0   135.0   0.6611406       0.00039179762254207165  70140   23619   70140
80124   754     190.2280843257904       187.3   200.0   20.73424922463051       200.0   125.0   0.56931555      0.00041157132740408996  80124   27049   80124
90022   807     216.45293831825256      188.56  200.0   26.396479716320915      200.0   99.0    0.47856167      0.0001967151611461304   90022   30542   90022
100000  866     244.79555416107178      188.37  200.0   21.623806091874528      200.0   110.0   0.3955511       0.00030046377703911274  100000  34203   100000

the result of the baseline (without --actor-learner)

$python train_dqn_gym.py --env CartPole-v0 --gpu -1

$cat scores.txt
steps   episodes        elapsed mean    median  stdev   max     min     average_q       average_loss    cumulative_steps        n_updates       rlen
10036   229     55.154892921447754      123.64  121.0   9.809828095982482       161.0   111.0   2.9230406       0.012885378097416833    10036   9037    10036
20026   320     107.90914940834045      98.46   99.0    3.4035631952767416      108.0   91.0    4.9529943       0.024560987005243076    20026   19027   20026
30177   400     164.05463528633118      197.4   199.0   2.9059326290271157      200.0   190.0   4.6993313       0.0120574060222134      30177   29178   30177
40074   458     215.77593541145325      200.0   200.0   0.0     200.0   200.0   4.9031615       0.021290321972919628    40074   39075   40074
50089   520     270.9923541545868       180.73  200.0   29.913631567152684      200.0   123.0   4.1760216       0.009463735535391607    50089   49090   50089
60020   596     328.6313564777374       200.0   200.0   0.0     200.0   200.0   3.2222836       0.00954763395129703     60020   59021   60020
70101   649     388.06564927101135      200.0   200.0   0.0     200.0   200.0   2.484032        0.006800949496391695    70101   69102   70101
80088   707     444.6856653690338       200.0   200.0   0.0     200.0   200.0   1.7810422       0.00335951144239516     80088   79089   80088
90091   768     501.1645920276642       200.0   200.0   0.0     200.0   200.0   1.1870649       0.0015449915380304447   90091   89092   90091
100000  830     557.4196665287018       125.91  126.0   3.8193949491742507      133.0   117.0   0.78206486      0.0015336718078833656   100000  99001   100000

About training time

May I know the training speed for a single trial?

In your reproduction section, e.g. DQN:

Training time (in days) across all runs (# domains x # seeds)
Mean: 3.613

Does this mean that the training speed for a single trial is:

200,000,000 frames / (3.613 * 24 * 3600 sec) = 640 frames/sec

The above calculation is in line with DeepMind's. With the default frame skip of 4, the actual speed is:

50,000,000 frames (collected) / (3.613 * 24 * 3600 sec) = 160 frames/sec

Is my understanding correct?

Also, may I know your hardware?
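For reference, a quick check of the arithmetic above:

seconds = 3.613 * 24 * 3600
print(200_000_000 / seconds)  # ~641 emulator frames per second
print(50_000_000 / seconds)   # ~160 agent steps per second with frame skip 4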

addcmul_ used in RMSpropEpsInsideSqrt has been deprecated

UserWarning: This overload of addcmul_ is deprecated:
        addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
        addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:882.)
  square_avg.mul_(alpha).addcmul_(1 - alpha, grad, grad)
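For reference, a sketch of the equivalent call with the non-deprecated signature (the variable names mirror the warning and are placeholders):

import torch

square_avg = torch.zeros(3)
grad = torch.randn(3)
alpha = 0.99

# Deprecated: square_avg.mul_(alpha).addcmul_(1 - alpha, grad, grad)
square_avg.mul_(alpha).addcmul_(grad, grad, value=1 - alpha)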

Memory leak (?) when run without a GPU

Disclaimer: I am not completely sure if this is a bug of PFRL.

When I ran SAC and TD3 on my university's cluster without a GPU, I observed that memory usage gradually increased and finally reached 24 GB, which is the amount of RAM assigned to jobs. I confirmed that this also occurred on a local workstation. My collaborator confirmed that it occurred in his environment too. He told me that it did not occur when he ran experiments with a GPU. Would you check whether this memory leak (?) also occurs on your workstation or cluster? If it occurs in other environments as well, it might be a bug in PFRL.

PyTorch version is 1.6.0+cpu, and PFRL is the latest one obtained by git clone .... The command I used is python3 examples/mujoco/reproduction/soft_actor_critic/train_soft_actor_critic.py --env Humanoid-v2 --gpu -1 --num-envs 3. (num-envs and env seem to be unrelated, though.)

I use Singularity and my collaborator uses Docker, so there is some possibility that this occurs only when PFRL is run in a container. However, I think that is unlikely.
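Not part of the original report, but a hedged way to track the growth is a step hook that logs the process RSS, assuming psutil is installed and that the training function in use accepts step_hooks:

import os

import psutil


def memory_hook(env, agent, step):
    # Print the resident set size every 10k steps.
    if step % 10_000 == 0:
        rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
        print("step {}: RSS {:.2f} GB".format(step, rss_gb))

# e.g. experiments.train_agent_batch_with_evaluation(..., step_hooks=[memory_hook])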

Add Pretrained Models

Add pretrained models for all scripts in the atari/reproduction and mujoco/reproduction examples

Recurrent model with two types of inputs [question]

I've been trying to implement a recurrent model for the PPO agent that can take two different types of observations. My environment has a Tuple observation space consisting of a 2-channel image and time-series data. I want to feed the image into a series of 2D conv layers and the time series into 1D conv layers. The outputs of these layers would be flattened and fed into a fully connected layer, and the output of that layer would go into an LSTM.

My plan is to implement a torch.nn.Module subclass whose forward function takes the tuple of observations, feeds them into the corresponding layers, and returns the output of the fully connected layer. Finally, I would put this module inside pfrl.nn.RecurrentSequential, as sketched below.

This was the only way I could think of implementing this; however, I'm not sure it would work, since the tuple observations might break something in the data collection. If this wouldn't work, how can I deal with using two types of observations with a recurrent model?
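A sketch of the module described above; the shapes and layer sizes are assumptions, and note the earlier issue "Cannot pass multiple inputs to a recurrent policy with PPO", which may be exactly the data-collection problem anticipated here.

import torch
import torch.nn as nn

import pfrl


class TwoStreamEncoder(nn.Module):
    """Encodes (image, time_series) tuple observations into one feature vector."""

    def __init__(self):
        super().__init__()
        self.image_net = nn.Sequential(nn.Conv2d(2, 16, 3), nn.ReLU(), nn.Flatten())
        self.series_net = nn.Sequential(nn.Conv1d(1, 8, 3), nn.ReLU(), nn.Flatten())
        self.fc = nn.LazyLinear(128)  # avoids hard-coding the flattened sizes (PyTorch 1.8+)

    def forward(self, obs):
        image, series = obs  # Tuple observation: (N, 2, H, W) and (N, 1, T)
        h = torch.cat([self.image_net(image), self.series_net(series)], dim=1)
        return torch.relu(self.fc(h))


model = pfrl.nn.RecurrentSequential(
    TwoStreamEncoder(),
    nn.LSTM(input_size=128, hidden_size=128),
    # policy/value heads would follow here
)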

How to handle the output of GaussianHeadWithStateIndependentCovariance

Thank you for releasing such a powerful product.

I would like to use PPO from pfrl in the case of a discrete action space.
I got a vector like [ 1.1933773 -0.24673517 0.6604848 -1.5786057 0.8695493 ] from the agent.act method.
As it doesn't seem to be a probability, how can I decide on an action with this vector?

I use the PPO sample model below, but I don't understand GaussianHeadWithStateIndependentCovariance.

policy = torch.nn.Sequential(
    nn.Linear(obs_size, 64),
    nn.Tanh(),
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.Linear(64, action_size),
    pfrl.policies.GaussianHeadWithStateIndependentCovariance(
        action_size=action_size,
        var_type="diagonal",
        var_func=lambda x: torch.exp(2 * x),  # Parameterize log std
        var_param_init=0,  # log std = 0 => std = 1
    ),
)

Thank you.
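A hedged note for the discrete-action case: GaussianHeadWithStateIndependentCovariance is meant for continuous actions, so the vector above is a sampled action, not probabilities. For a Discrete action space, the PPO examples use a categorical head instead, so agent.act returns an integer action; a sketch (obs_size and n_actions are placeholders):

import torch.nn as nn

from pfrl.policies import SoftmaxCategoricalHead

obs_size, n_actions = 4, 2  # placeholders

policy = nn.Sequential(
    nn.Linear(obs_size, 64),
    nn.Tanh(),
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.Linear(64, n_actions),
    SoftmaxCategoricalHead(),  # outputs a Categorical distribution over actions
)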

Error when using no_grad () and lazy_property

When I used a custom ActionValue whose greedy_actions uses a torch.nn.Module,
I got the following error:

File "/pfrl/agents/dqn.py", line 447, in batch_act
    batch_argmax = batch_av.greedy_actions.cpu().numpy()
RuntimeError: Can't call numpy() on Variable that requires grad. Use var.detach().numpy() instead.

The reason for this error is that torch's lazy_property computes its value under torch.enable_grad(), which overrides no_grad().
Detail: pytorch/pytorch#7708

Therefore, it is necessary to create an alternative to lazy_property.
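A hedged sketch of what such an alternative could look like: a plain caching property that simply does not re-enable grad, unlike torch.distributions.utils.lazy_property.

import functools


class cached_property_keep_grad_mode:
    """Caches the value on first access without touching the grad mode."""

    def __init__(self, fget):
        self.fget = fget
        functools.update_wrapper(self, fget)

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        value = self.fget(instance)
        # Cache on the instance so later accesses skip the computation.
        setattr(instance, self.fget.__name__, value)
        return value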

Support for Multiple GPUs

Hi,

First, thanks for the work on this repo - it's great.

Second, what is the priority level of getting models to work on multiple GPUs? I was curious since my university's supercomputer cluster allows for using multiple GPUs, which would definitely speed up training.

Thanks!

Add Atari benchmark scores

For all reproducibility scripts under examples/atari/reproduction, add benchmark scores.

  • A3C
  • DQN
  • IQN
  • Rainbow

Actor processes hang in `train_agent_async` when `use_tensorboard=True`

When I turned on TensorBoard with the actor-learner mode in train_dqn_gym.py, the program froze after the first evaluation. I summarize the repro steps and an analysis of this problem below.

Reproduction

I've faced the following problem with this commit, the latest master as of Nov. 3rd, 2020.

Steps to reproduce

  1. In examples/gym/train_dqn_gym.py, add use_tensorboard=True as an argument of train_agent_async() (here)
  2. Run python examples/gym/train_dqn_gym.py --actor-learner

Result

The actor process hangs during the first set of evaluations, after showing the following log.

...
INFO:pfrl.experiments.train_agent_async:evaluation episode 96 length:200 R:-1494.8058766440454
INFO:pfrl.experiments.train_agent_async:evaluation episode 97 length:200 R:-1592.9273165459317
INFO:pfrl.experiments.train_agent_async:evaluation episode 98 length:200 R:-1533.3344787068036
INFO:pfrl.experiments.train_agent_async:evaluation episode 99 length:200 R:-1570.1153000497297

Expected behavior

The actor process continuously runs without the hang.

Analysis

The actor process stops here, during summary_writer.add_scalar, where Tensorboard's SummaryWriter seems to suffer from a deadlock.

I suspect that this problem happens because the _AsyncWriterThread, which is used internally by SummaryWriter, does not work in actor processes. Actor processes are forked from the root process with a copy of the SummaryWriter, but the associated threads, including the _AsyncWriterThread, are not copied on a POSIX-based system. Consequently, the writer's queue is not consumed and jams once it reaches full capacity. This prevents each actor from adding a new scalar to TensorBoard, and the actor gets stuck there.
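A generic illustration of the mechanism described above (plain Python, not pfrl code): a bounded queue whose consumer thread exists only in the parent fills up and blocks in the forked child.

import multiprocessing as mp
import queue
import threading

q = queue.Queue(maxsize=3)


def consume():
    while True:
        q.get()


threading.Thread(target=consume, daemon=True).start()  # exists only in the parent


def child():
    for i in range(10):
        q.put(i)  # blocks once the queue is full: the consumer thread was not forked
    print("never reached")


if __name__ == "__main__":
    p = mp.get_context("fork").Process(target=child, daemon=True)  # POSIX only
    p.start()
    p.join(timeout=3)
    print("child still alive (stuck):", p.is_alive())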

[Feature Request/Proposal] use MLFlow option or best practice of experiments management

Thank you for the excellent RL library. PFRL makes my life so much easier.

As the management of experiments becomes complicated, I have tried PFRL with MLFlow, and I'm satisfied with the initial implementation (see the code below). MLFlow helps to compare the performance of algorithms, to manage trained models, and to monitor training results remotely.

On the other hand, if PFRL natively supported MLFlow, it would be even easier to use, and I could expect the wisdom of various experiment-management efforts from other users. TensorBoard support was added just a few months ago, and I'm sure everyone wants to use different tools, so I've listed this as an issue to discuss.

The motivations for native support:

  • to reduce overlapped logging functions
  • to give more detailed evaluation score access to MLFlow

An alternative instead of MLFlow native support:

  • to update record_tb_stats(self.tb_writer, agent_stats, eval_stats, t) to a more general implementation (like eval_hooks)

More general question about the management..:

  • to log the history of reward shaping and observation feature engineering on custom environments (git diff is a bit hard to read through)

How to use PFRL and MLFLOW together

existing_exp = mlflow.get_experiment_by_name(args.env)
if not existing_exp:
    mlflow.create_experiment(args.env)
mlflow.set_experiment(args.env)

def log_mlflow(env, agent, evaluator, step, eval_score):
    mlflow.log_metric("R_mean", eval_score, step=step)

try:
    with mlflow.start_run():
        mlflow.log_param("Algo", "SAC")
        mlflow.log_artifacts(args.outdir)
        mlflow.log_param("OutDir", args.outdir)

        experiments.train_agent_with_evaluation(
            agent=agent,
            env=make_env(0, False),
            eval_env=make_env(0, True),
            outdir=args.outdir,
            steps=args.steps,
            eval_n_steps=None,
            eval_n_episodes=args.eval_n_runs,
            eval_interval=args.eval_interval,
            save_best_so_far_agent=True,
            evaluation_hooks=(log_mlflow,),
        )
finally:
    mlflow.log_artifacts(args.outdir)
    mlflow.end_run()

Usage of Recurrent PPO for MuJoCo Reproduction

Hi,

I'm curious whether a recurrent policy for PPO is supported for environments other than Atari.

I've tried adapting the train_ppo_ale.py code from the atari reproduction to the mujoco reproduction code, but I'm facing multiple errors. train_ppo_ale.py for atari works, but when I switch the policy in train_ppo.py for mujoco to the recurrent sequential class, I get the following error. Does using recurrent policies require modifications for the mujoco environments?

(screenshot of the error omitted)
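A hedged sketch of what a recurrent Gaussian policy/value model for MuJoCo-style observations could look like; the sizes are placeholders, and this only mirrors the API pattern of the recurrent atari example rather than a verified reproduction script.

import torch
import torch.nn as nn

import pfrl
from pfrl.policies import GaussianHeadWithStateIndependentCovariance

obs_size, action_size = 11, 3  # placeholders

model = pfrl.nn.RecurrentSequential(
    nn.Linear(obs_size, 64),
    nn.Tanh(),
    nn.GRU(input_size=64, hidden_size=64),
    pfrl.nn.Branched(
        nn.Sequential(
            nn.Linear(64, action_size),
            GaussianHeadWithStateIndependentCovariance(
                action_size=action_size,
                var_type="diagonal",
                var_func=lambda x: torch.exp(2 * x),
                var_param_init=0,
            ),
        ),
        nn.Linear(64, 1),  # value head
    ),
)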

PFRL for discrete actions with continuous parameters

Hi all,

I am working on an RL problem that has discrete actions, but each action has a continuous parameter. I would appreciate any advice you could give me on how to get going within the PFRL framework.

Thanks,

~Space Ghost

Integration of PFRL with Deep Graph Library

I am currently trying to use PFRL with deep graph neural networks, but the data loading produces errors because the type is not accepted. This looks like a PyTorch problem, but it would be nice to get some input. Do you think this problem is fixable?

The error is:

/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py in <listcomp>(.0)
82 raise RuntimeError('each element in list of batch should be of equal size')
83 transposed = zip(*batch)
---> 84 return [default_collate(samples) for samples in transposed]
85
86 raise TypeError(default_collate_err_msg_format.format(elem_type))

/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
84 return [default_collate(samples) for samples in transposed]
85
---> 86 raise TypeError(default_collate_err_msg_format.format(elem_type))

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'dgl.heterograph.DGLHeteroGraph'>
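One hedged direction, assuming the observations are DGL graphs: most pfrl agents accept a batch_states callable, so a custom one could batch graphs with dgl.batch instead of relying on torch's default collate.

import dgl


def batch_graph_states(states, device, phi):
    """Replacement for pfrl.utils.batch_states when observations are DGLGraph objects."""
    graphs = [phi(s) for s in states]
    return dgl.batch(graphs).to(device)

# e.g. agent = pfrl.agents.DoubleDQN(..., batch_states=batch_graph_states)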

Saving training statistics

Hello all,

I'm training on a custom environment that follows the gym API. I've been using PPO, and it seems to be able to learn successfully. However, I couldn't figure out how to save statistics such as average_R. If I wrap the environment with the Monitor class, I can get the statistics of the individual environments, but I would also like to save the training statistics. Note that I don't use evaluations during training. My training script is below:

def train(args):
    pfrl.utils.set_random_seed(args.seed)
    env = create_multi_env(seed=args.seed, num_envs=args.num_envs)

    model = create_model()
    opt = torch.optim.Adam(model.parameters(), lr=2.5e-4, eps=1e-5)
    agent = create_agent(model, opt)

    experiments.train_agent_batch(
        agent=agent,
        env=env,
        outdir=args.outdir,
        steps=args.steps,
        checkpoint_freq=int(1000),
        log_interval=int(1000)
    )

    agent.save('agents')

PS. Thanks for the great library!
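Not part of the original question, but a hedged option: a step hook that periodically writes agent.get_statistics() (and anything else, such as recent returns) to a file, since train_agent_batch accepts step_hooks.

import os


def make_stats_hook(outdir, interval=1000):
    path = os.path.join(outdir, "train_stats.txt")

    def hook(env, agent, step):
        if step % interval == 0:
            with open(path, "a") as f:
                f.write("{}\t{}\n".format(step, agent.get_statistics()))

    return hook

# experiments.train_agent_batch(..., step_hooks=[make_stats_hook(args.outdir)])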

ActionValue classes and backward hooks cause an error when used together

Abstract:

ActionValue classes and backward hooks cause an error when used together.

Details

Several pre-defined models in PFRL, such as FCQuadraticStateQFunction, return an ActionValue from their forward function.
However, when a backward hook is used, Torch expects the return value of a forward function to be a Tensor or a dict (whose values() contain at least one tensor). If an ActionValue is returned, the loop repeatedly applies var = var[0] to the value and finally causes an error like this:

"/.../site-packages/torch/nn/modules/module.py", line 739, in _call_impl
    var = var[0]
  File "/.../pfrl/pfrl/action_value.py", line 316, in __getitem__
    max_action=self.max_action,
  File "/.../pfrl/pfrl/action_value.py", line 267, in __init__
    self.batch_size = self.mu.shape[0]
IndexError: tuple index out of range
