
fqf-iqn-qrdqn.pytorch's Introduction

FQF, IQN and QR-DQN in PyTorch

This is a PyTorch implementation of Fully Parameterized Quantile Function (FQF) [1], Implicit Quantile Networks (IQN) [2] and Quantile Regression DQN (QR-DQN) [3]. I tried to make it easy for readers to understand the algorithms. Please let me know if you have any questions. Also, pull requests are welcome.

UPDATE

  • 2020.6.9
    • Bump torch up to 1.5.0.
  • 2020.5.10
    • Refactor the code.
    • Fix Prioritized Experience Replay and Noisy Networks.
    • Test IQN with Rainbow's components.

Setup

If you are using Anaconda, first create the virtual environment.

conda create -n fqf python=3.8 -y
conda activate fqf

You can install the Python libraries using pip.

pip install --upgrade pip
pip install -r requirements.txt

If you're using a CUDA version other than 10.2, you may need to install a matching build of PyTorch. See the official instructions for more details.
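For example, installing a CUDA 10.1 build would look roughly like this (the exact wheel name depends on your CUDA version and PyTorch release, so check the official instructions):

pip install torch==1.5.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html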

Examples

You can train the FQF agent using the hyperparameters here.

python train_fqf.py --cuda --env_id PongNoFrameskip-v4 --seed 0 --config config/fqf.yaml

You can also train the IQN or QR-DQN agent in the same way. Note that results are logged against the number of frames, which equals the number of agent steps multiplied by 4 (e.g. 100M frames means 25M agent steps).

Results

Results of the examples (without n-step rewards, double Q-learning, dueling networks or noisy networks) are shown below and are comparable with (if not better than) the papers. Scores are evaluated after every 1M frames (250k agent steps). Results are averaged over 2 seeds and visualized with min/max.

Note that I report the "mean" score, not the "best" score as in the papers. Also, I only trained for a limited number of frames due to limited resources (e.g. 100M frames instead of 200M).

BreakoutNoFrameskip-v4

I tested FQF, IQN and QR-DQN on BreakoutNoFrameskip-v4 for 30M frames to confirm that the algorithms work.

BerzerkNoFrameskip-v4

I also tested FQF and IQN on BerzerkNoFrameskip-v4 for 100M frames to see the difference between FQF's performance and IQN's, which is quite obvious on this task.

IQN-Rainbow

I also tested IQN with Rainbow's components on PongNoFrameskip-v4 (just 1 seed). Note that I decreased num_steps to 7,500,000 (30M frames), but kept start_steps the same.

TODO

  • Implement risk-averse policies for IQN.
  • Test FQF-Rainbow agent.

References

[1] Yang, Derek, et al. "Fully Parameterized Quantile Function for Distributional Reinforcement Learning." Advances in Neural Information Processing Systems. 2019.

[2] Dabney, Will, et al. "Implicit Quantile Networks for Distributional Reinforcement Learning." arXiv preprint. 2018.

[3] Dabney, Will, et al. "Distributional Reinforcement Learning with Quantile Regression." Thirty-Second AAAI Conference on Artificial Intelligence. 2018.

fqf-iqn-qrdqn.pytorch's People

Contributors

toshikwa


fqf-iqn-qrdqn.pytorch's Issues

Question on QR-DQN calculate_quantile_huber_loss

When calculating the quantile huber loss in QR-DQN (here), the whole term torch.abs(taus[..., None] - (td_errors.detach() < 0).float()) * element_wise_huber_loss is divided by self.kappa.
I cannot find this equation in the paper. Is there any reason for this implementation?
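For context, here is a minimal sketch of the quantile Huber loss as it is commonly written, ρ^κ_τ(u) = |τ − 1{u < 0}| · L_κ(u) / κ, which is where the division by kappa comes from (it keeps the loss on the same scale as the plain quantile loss as κ → 0). The names and shapes below are illustrative, not this repo's exact code:

import torch

def quantile_huber_loss(td_errors, taus, kappa=1.0):
    # td_errors: (batch, N, N'), taus: (batch, N, 1) -- illustrative shapes.
    abs_u = td_errors.abs()
    # Huber loss L_kappa(u): quadratic for |u| <= kappa, linear beyond.
    huber = torch.where(
        abs_u <= kappa, 0.5 * td_errors ** 2, kappa * (abs_u - 0.5 * kappa))
    # Asymmetric quantile weight |tau - 1{u < 0}|; detaching td_errors means
    # gradients flow only through the Huber term, not through the indicator weight.
    weight = torch.abs(taus - (td_errors.detach() < 0).float())
    # Dividing by kappa gives rho^kappa_tau(u) = |tau - 1{u < 0}| * L_kappa(u) / kappa.
    return (weight * huber / kappa).sum(dim=1).mean()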

one of the variables needed for gradient computation has been modified by an inplace operation

I am using PyTorch 1.5.0 and I am getting the error "one of the variables needed for gradient computation has been modified by an inplace operation".

When I enabled torch anomaly detection to find which tensor was being modified in place, I got the error at this line:
https://github.com/ku2482/fqf-iqn-qrdqn.pytorch/blob/542a6e57cdbc8c467495215c5348800942037bfa/fqf_iqn_qrdqn/network.py#L71

Note: it works when I downgrade to PyTorch 1.4.0.
I am unable to find where the issue is in order to make it work on torch 1.5.0.

Question about running FQF agent on Breakout

Hi, I ran the FQF agent on Breakout, but ended up with a curve that collapses in the middle (like in the attachment).
[attached picture of the learning curve]
The command I was using is
python3 -u train_fqf.py --cuda --env_id BreakoutNoFrameskip-v4 --seed 0 --config config/fqf.yaml .
And the hyperparameters are the default ones.
Should I adjust some hyperparameters to get a curve like the one you achieved?

A function about continuing training

Hi, guys!
I often have a problem where training gets terminated unexpectedly on my machine. Could a feature be added so that an agent that terminated unexpectedly can resume and complete the remaining training steps?
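One common way to support this is to save a checkpoint periodically and reload it on restart. Below is only a rough sketch with made-up attribute names (agent.online_net, agent.target_net, agent.optim, agent.steps), not the repo's actual API:

import torch

def save_checkpoint(agent, path):
    # Persist everything needed to resume: network weights, optimizer state and step counter.
    torch.save({
        'online_net': agent.online_net.state_dict(),
        'target_net': agent.target_net.state_dict(),
        'optim': agent.optim.state_dict(),
        'steps': agent.steps,
    }, path)

def load_checkpoint(agent, path):
    ckpt = torch.load(path)
    agent.online_net.load_state_dict(ckpt['online_net'])
    agent.target_net.load_state_dict(ckpt['target_net'])
    agent.optim.load_state_dict(ckpt['optim'])
    agent.steps = ckpt['steps']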

And another question, how long does it take to train for 200M frames per game?

thanks

Questions of several implementation details

Hello @ku2482

May I ask you several implementation details and why you made these decisions?

  1. In this line, you compute signs by comparing sa_quantiles[i] with sa_quantiles[i-1] (except the first one). Why don't you use values_1 > 0 as the signs?
  2. In this line, you initialize the weights of the FractionProposalNetwork using Xavier initialization with gain=0.01. What made you choose this initialization?
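For reference, Xavier initialization with a small gain such as 0.01 typically looks like the snippet below (a generic sketch, not necessarily the repo's exact helper). The small gain keeps the initial weights, and hence the fraction proposal network's initial logits, close to zero, so the proposed taus start out roughly evenly spaced:

import torch.nn as nn

def initialize_weights_xavier(m, gain=0.01):
    # Xavier-uniform weights scaled down by `gain`, with zero biases.
    if isinstance(m, (nn.Linear, nn.Conv2d)):
        nn.init.xavier_uniform_(m.weight, gain=gain)
        if m.bias is not None:
            nn.init.constant_(m.bias, 0.0)
    return m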

No performance in all three algorithms

I use the following command to run the three algorithms on Pong (replacing <algo> with fqf and so on), but the returns are always around -20.

python train_<algo>.py --cuda --env_id PongNoFrameskip-v4 --seed 0 --config config/<algo>.yaml

Is there anything wrong now at master branch (b4928f9)?

Incorrect Q-Value calculation in "qrdqn" agent

Hi,

First, thank you for providing such a clear and easy-to-follow implementation for some important Distributional RL algorithms.
I found the following line not to be correct according to the QRDQN paper:

https://github.com/ku2482/fqf-iqn-qrdqn.pytorch/blob/11d70bb428e449fe5384654c05e4ab2c3bbdd4cd/fqf_iqn_qrdqn/model/qrdqn.py#L74

I believe it should have been:

q = 1 / 200 * torch.sum(quantiles, dim=1)

which corresponds to the following equation in the paper:

[screenshot of the Q-value equation from the QR-DQN paper, Q(s, a) = Σ_j q_j θ_j(s, a)]

And here, q_j is fixed to 1 / 200.
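Since the quantile weights are uniform (q_j = 1/N with N = 200 here), the suggested fix is equivalent to averaging over the quantile dimension; a minimal sketch, assuming quantiles has shape (batch, N, num_actions):

N = 200
q = quantiles.sum(dim=1) / N   # Q(s, a) = sum_j q_j * theta_j(s, a) with q_j = 1/N
q = quantiles.mean(dim=1)      # equivalent, since the weights are uniform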

Question on `update_interval` argument

Hello @ku2482,

First off, great work on the repo, the code is very well written.

I do have a question regarding the default value of the update_interval argument defined here. As the environment setup mentioned here implies we are already skipping to every 4th frame, doesn't update_interval=4 mean the learning occurs every 16th frame?
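For reference, the arithmetic behind the question: with a frame skip of 4, one agent step consumes 4 environment frames, so if update_interval is counted in agent steps (an assumption on my part), update_interval=4 corresponds to 4 × 4 = 16 raw frames between gradient updates.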

calculation of loss

Hi,
This is actually more of a question than an issue.

In utils.py, at line #39 where the "quantile huber loss" is calculated, there is a .detach() call on td_errors.
Could you please explain the reason?
Thanks

fraction proposal network of FQF

Hi, guys!
I have some questions about the fraction proposal network of FQF:

  1. Why set fraction_lr=2.5e-9, which is very small? I also found that the tau_hats distribution almost didn't change during training.
  2. Why apply initialize_weights_xavier(x, gain=0.01)? When I trained without this initialization, gradient explosion would sometimes happen.
  3. Why use RMSprop with alpha=0.95 and eps=0.00001, when the default values are 0.99 and 1e-8 respectively?
  4. I also found that the tau_hats distribution almost didn't change during training on Qbert. Is this the key to the algorithm?

thanks!
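For reference, the optimizer configuration being asked about would look roughly like this in PyTorch (fraction_net is an assumed name for the FractionProposalNetwork instance, not necessarily the repo's exact code):

import torch.optim as optim

# RMSprop with the non-default smoothing constant and epsilon mentioned above.
fraction_optim = optim.RMSprop(
    fraction_net.parameters(), lr=2.5e-9, alpha=0.95, eps=1e-5)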

A question for Fraction Proposal Network in FQF

I have been learning FQF in recent days. Thanks for the repo, which lets me learn the algorithm more efficiently~ I found that the fraction proposal network's input in FQF is (s, a), as mentioned in the paper (Algorithm 1). But your implementation makes all actions share quantiles/taus for the same state. I'm looking forward to your reply about this discrepancy. Thank you very much!
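To illustrate the difference being described, here is a rough, self-contained sketch of computing one set of fractions per state from the state embedding, shared by all actions (sizes and names are illustrative, not the repo's exact code):

import torch
import torch.nn as nn
import torch.nn.functional as F

embedding_dim, N, batch = 512, 32, 8                       # illustrative sizes
fraction_net = nn.Linear(embedding_dim, N)                 # stand-in for the fraction proposal network
state_embeddings = torch.randn(batch, embedding_dim)       # stand-in for the DQN base output

probs = F.softmax(fraction_net(state_embeddings), dim=1)   # (batch, N), rows sum to 1
taus = torch.cumsum(probs, dim=1)                          # tau_1 .. tau_N, one set per state
# These taus are shared by every action for the same state, rather than a separate
# set per (state, action) pair as written in Algorithm 1 of the paper.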

Question with running the code

Hi, I ran the code for around 4M steps, and it suddenly stopped training and output the model. Do you have any idea what's wrong here?

Could you please help me with the proof for proposition 1?

Hi @ku2482

Thanks for the code. May I ask you a question?

The author gives proposition 1 and its proof as follows:

[screenshot of Proposition 1 and its proof from the FQF paper]

I'm quite confused about how they compute the third step, which involves the integral over a quantile function. Could you please help me with that?

Element-wise product

I think you're forgetting the element-wise product from the IQN paper (at the end of the first paragraph of Section 3.1), am I wrong?
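For context, the step at the end of Section 3.1 combines the state embedding ψ(x) and the quantile embedding φ(τ) with an element-wise (Hadamard) product, roughly Z_τ(x, a) ≈ f(ψ(x) ⊙ φ(τ))_a. A minimal sketch of that multiplicative interaction (sizes are illustrative):

import torch

batch, N, embedding_dim = 8, 64, 512                      # illustrative sizes
state_embeddings = torch.randn(batch, 1, embedding_dim)   # psi(x), broadcast over taus
tau_embeddings = torch.randn(batch, N, embedding_dim)     # phi(tau) for N sampled taus
# The element-wise product modulates the state embedding by each tau embedding.
embeddings = state_embeddings * tau_embeddings            # (batch, N, embedding_dim)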

Could you please tell me how to run 'BreakoutDeterministic-v4'?

Hello,
Could you please tell me how to run 'BreakoutDeterministic-v4'?

BTW, I run it by: python train_iqn.py --cuda --env_id BreakoutNoFrameskip-v4 --seed 0 --config config/iqn.yaml
and
python train_iqn.py --cuda --env_id BreakoutDeterministic-v4 --seed 0 --config config/iqn.yaml
I want to compare the two, but after running this code, its return seems strange and not the same as mine. I commented out (fqf_iqn_qrdqn/env.py line 275)
assert 'NoFrameskip' in env.spec.id
It seems to work, but its return is strange too. At the beginning of my code's training process, its score is larger than yours. It puzzles me.

Convolution issue with Dimensions in states/embedded_states

Hi Toshiki

I'm having problems running your code. It seems to me there is a problem with the CNN architecture of the base DQN.

For all three algorithms (FQF, IQN, QR-DQN), the experiments fail at different episode numbers.
For instance, running the following:

python train_qrdqn.py --cuda --env_id PongNoFrameskip-v4 --seed 0 --config config/qrdqn.yaml

returns
RuntimeError: Expected 4-dimensional input for 4-dimensional weight [32, 15, 8, 8], but got 3-dimensional input of size [32, 15, 7] instead

Other errors I've observed are.

  • RuntimeError: Calculated padded input size per channel: (n x n). Kernel size: (8 x 8). Kernel size can't be greater than actual input size.
  • I also saw an error pertaining to a 5-dimensional input of size [x,x,x,x,x] as opposed to a 3-dimensional input.

For instance at episode: 4788

Traceback (most recent call last):
  File "train_qrdqn.py", line 46, in <module>
    run(args)
  File "train_qrdqn.py", line 35, in run
    agent.run()
  File "/home/brian/ML/Distribute-Reinforcement-Learning/fqf_iqn_qrdqn/agent/base_agent.py", line 89, in run
    self.train_episode()
  File "/home/brian/ML/Distribute-Reinforcement-Learning/fqf_iqn_qrdqn/agent/base_agent.py", line 176, in train_episode
    self.train_step_interval()
  File "/home/brian/ML/Distribute-Reinforcement-Learning/fqf_iqn_qrdqn/agent/base_agent.py", line 197, in train_step_interval
    self.learn()
  File "/home/brian/ML/Distribute-Reinforcement-Learning/fqf_iqn_qrdqn/agent/qrdqn_agent.py", line 71, in learn
    quantile_loss, mean_q, errors = self.calculate_loss(
  File "/home/brian/ML/Distribute-Reinforcement-Learning/fqf_iqn_qrdqn/agent/qrdqn_agent.py", line 94, in calculate_loss
    self.online_net(states=states),
  File "/home/brian/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/brian/ML/Distribute-Reinforcement-Learning/fqf_iqn_qrdqn/model/qrdqn.py", line 48, in forward
    state_embeddings = self.dqn_net(states)
  File "/home/brian/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/brian/ML/Distribute-Reinforcement-Learning/fqf_iqn_qrdqn/network.py", line 48, in forward
    state_embedding = self.net(states)
  File "/home/brian/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/brian/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/brian/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/brian/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "/home/brian/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 419, in _conv_forward
    return F.conv2d(input, weight, self.bias, self.stride,
RuntimeError: Expected 4-dimensional input for 4-dimensional weight [32, 15, 8, 8], but got 3-dimensional input of size [32, 15, 7] instead

I've tried using the unsqueeze and squeeze methods in PyTorch to change the dimensions and get around this, but I think perhaps the CNN network is causing it. Have you come across this before?
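One quick thing to check when debugging this kind of error is the shape of the states tensor right before it enters the network; under the standard Atari frame-stacking preprocessing it would usually be 4-D, e.g. (batch, 4, 84, 84). A tiny sanity check, purely as a debugging suggestion:

# states: the tensor handed to online_net; this is only a debugging aid.
assert states.dim() == 4, f"expected a 4-D (B, C, H, W) input, got {tuple(states.shape)}"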

PS: torch version is 1.7.1

Thanks
Brian
