jcwleo / random-network-distillation-pytorch
Random Network Distillation (PyTorch)
License: MIT License
Hi,
Is the reason the following code modifies the actions for the Breakout game that it eliminates the NOOP action from the set of actions the agent can take?
envs.py:

    if 'Breakout' in self.env_id:
        action += 1

train.py:

    if 'Breakout' in env_id:
        output_size -= 1
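For context, my understanding of the remapping, assuming the standard Breakout action set (0=NOOP, 1=FIRE, 2=RIGHT, 3=LEFT):

    # the policy head outputs output_size - 1 = 3 logits, so sampled actions are in {0, 1, 2}
    policy_action = 2               # e.g. the policy samples index 2
    env_action = policy_action + 1  # -> 3 (LEFT); index 0 (NOOP) can never be chosen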
Hello,
I also built an RND model, but I am stuck at a score of 2500... After how many total steps should the agent improve further? I am not sure whether this is caused by a bug in my code, so I want to check with you. Thank you.
timestamp, pci.bus_id, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2019/01/11 15:41:41.215, 00000000:01:00.0, 60, 0 %, 0 %, 11177 MiB, 3366 MiB, 7811 MiB
2019/01/11 15:41:41.216, 00000000:02:00.0, 33, 0 %, 0 %, 11178 MiB, 11168 MiB, 10 MiB
2019/01/11 15:41:50.226, 00000000:01:00.0, 65, 97 %, 51 %, 11177 MiB, 3366 MiB, 7811 MiB
2019/01/11 15:41:51.227, 00000000:01:00.0, 65, 96 %, 53 %, 11177 MiB, 3366 MiB, 7811 MiB
2019/01/11 15:41:53.229, 00000000:01:00.0, 61, 3 %, 0 %, 11177 MiB, 3366 MiB, 7811 MiB
(The log repeats once per second over 15:41:41 to 15:41:53: the first GPU holds 7811 MiB of memory but mostly sits at 0% utilization with brief bursts up to ~97%; the second GPU stays idle at 10 MiB used throughout.)
Thanks for the code, this is much more understandable than the original one.
But the agent I trained can only reach a maximum score of 4600 and stays at that level no matter how long I train.
Note that OpenAI can reach a score of 10,000.
Am I missing something?
What is the meaning of this line in envs.py?

    r = int(info.get('flag_get', False))
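If I understand the Mario wrapper correctly: in gym-super-mario-bros, info['flag_get'] is True on the step where Mario reaches the end-of-level flag, so this line turns that event into a 0/1 reward (and defaults to 0 for environments whose info dict has no such key).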
Hello, I have a problem with make_train_data in utils.py:
    if use_gae:
        gae = np.zeros_like([num_worker, ])
        for t in range(num_step - 1, -1, -1):
            delta = reward[:, t] + gamma * value[:, t + 1] * (1 - done[:, t]) - value[:, t]
            gae = delta + gamma * lam * (1 - done[:, t]) * gae
            discounted_return[:, t] = gae + value[:, t]

        # For Actor
        adv = discounted_return - value[:, :-1]
I am confused. As I understand it, GAE is the advantage estimate

    \hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}

so I have no idea why we need to add V(t):

    discounted_return[:, t] = gae + value[:, t]

Can you explain what I am missing? Thanks.
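My current guess at the answer (not confirmed by the author): adding V(s_t) converts the GAE advantage into a lambda-return target for the critic, and the actor's advantage is then recovered by subtracting the value again, since A_t = R_t - V(s_t). A minimal runnable sketch of that reading, with hypothetical shapes ([num_worker, num_step] rewards and dones, [num_worker, num_step + 1] values including the bootstrap value):

    import numpy as np

    def make_targets(reward, value, done, gamma=0.999, lam=0.95):
        num_worker, num_step = reward.shape
        returns = np.zeros((num_worker, num_step), dtype=np.float32)
        gae = np.zeros(num_worker, dtype=np.float32)
        for t in range(num_step - 1, -1, -1):
            delta = reward[:, t] + gamma * value[:, t + 1] * (1 - done[:, t]) - value[:, t]
            gae = delta + gamma * lam * (1 - done[:, t]) * gae
            returns[:, t] = gae + value[:, t]  # critic target: R_t = A_t + V(s_t)
        adv = returns - value[:, :-1]          # actor advantage: A_t = R_t - V(s_t)
        return returns, adv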
Thanks for the great work!
I'd like to know: if I want to apply this work to a new continuous environment that I created myself, what should I do? Do you have any suggestions?
Hi,
In your code (envs.py), I saw that you first wrap the environment with MaxAndSkipEnv() and then apply the sticky action.
However, in the RND authors' code, I found that they first wrap the env with StickyActionEnv() and then wrap it with MaxAndSkipEnv(). So it seems your agent will have more "sticky" actions, and I think this makes things a little bit different.
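A toy sketch of why the ordering matters (my reading of the two codebases; p = 0.25 and skip = 4 are the usual defaults):

    import numpy as np

    rng = np.random.default_rng(0)
    p, skip, n_steps = 0.25, 4, 10

    # Official ordering (StickyActionEnv inside MaxAndSkipEnv): the
    # "repeat last action" coin is flipped on every raw frame.
    per_frame_stick = rng.random(n_steps * skip) < p

    # This repo's ordering: the coin is flipped once per skipped step, and
    # the outcome then holds for the whole 4-frame block.
    per_block_stick = np.repeat(rng.random(n_steps) < p, skip)

    print(per_frame_stick.astype(int))  # isolated sticky frames
    print(per_block_stick.astype(int))  # sticky frames arrive in runs of 4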
I'm working with a different env (other than atari or mario) and I want to change the input shape to the CNN. It seems like self.input_size is ignored? Do you have any explanation for the math going on when setting up the network (the parameters for each layer)? When I change the size from (84, 84) to anything else, I get size mismatch errors at the linear layer.
Thanks!
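For reference, a sketch of the conv-size arithmetic, assuming the usual DQN-style stack this repo uses (8x8 stride 4, then 4x4 stride 2, then 3x3 stride 1, no padding); the flattened size feeding the first Linear layer is hard-coded from an 84x84 input, which is why any other input size crashes there:

    def conv_out(size, kernel, stride, padding=0):
        # standard conv arithmetic: floor((size + 2*pad - kernel) / stride) + 1
        return (size + 2 * padding - kernel) // stride + 1

    h = w = 84
    for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
        h, w = conv_out(h, kernel, stride), conv_out(w, kernel, stride)
    print(h, w, 64 * h * w)  # 7 7 3136 -> the expected Linear input size

To use a different input shape, recompute this product and size the first Linear layer accordingly.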
Hi @jcwleo,
Your implementation is amazing. I was searching for a simple but powerful implementation; I modified yours a bit and trained it on SuperMarioBros. However, when I evaluate my agent (using eval.py), the framerate is very slow.
Do you know why?
Thanks,
Have a nice day,
In the RND paper, on page 15, it mentions that extrinsic rewards are clipped to [-1, 1].
But the official RND code in atari_wrappers.py clips extrinsic rewards using the ClipRewardEnv function, which does:

    """Bin reward to {+1, 0, -1} by its sign."""
    return float(np.sign(reward))

I believe the implementation and the explanation in the paper are a little different.
In your implementation (jcwleo) you are clipping by doing:

    total_reward = total_reward.reshape([num_step, num_env_workers]).transpose().clip(-1, 1)

I believe this is different from the official implementation. Does anyone have an explanation of this discrepancy and which one to use?
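For concreteness, the two transforms only disagree on rewards whose magnitude is strictly between 0 and 1:

    import numpy as np

    r = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
    print(np.sign(r))         # [-1. -1.  0.  1.  1.]   sign-binning (official wrapper)
    print(np.clip(r, -1, 1))  # [-1. -0.5  0.  0.5  1.] clipping (this repo)

Since most Atari games hand out integer rewards, the two usually coincide there, but they are not equivalent in general.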
Problem: the input shape is not correct at the Linear-8 layer in the CnnActorCriticNetwork feature model. Or maybe there is a typo in the Conv2d-5 layer's kernel_size: in the predictor and target models kernel_size == 3, but in the feature model kernel_size == 4.
https://github.com/jcwleo/random-network-distillation-pytorch/blob/master/model.py#L97-L106
Feature model summary: input_shape (4,84,84)
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [4, 32, 20, 20] 8,224
ReLU-2 [4, 32, 20, 20] 0
Conv2d-3 [4, 64, 9, 9] 32,832
ReLU-4 [4, 64, 9, 9] 0
Conv2d-5 [4, 64, 7, 7] 36,928
ReLU-6 [4, 64, 7, 7] 0
Flatten-7 [4, 3136] 0
Linear-8 [4, 256] 803,072
ReLU-9 [4, 256] 0
Linear-10 [4, 448] 115,136
ReLU-11 [4, 448] 0
================================================================
Total params: 996,192
Trainable params: 996,192
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.43
Forward/backward pass size (MB): 1.43
Params size (MB): 3.80
Estimated Total Size (MB): 5.66
----------------------------------------------------------------
Traceback:
Traceback (most recent call last):
File "/Applications/PyCharm CE 2018.3 EAP.app/Contents/helpers/pydev/pydevd.py", line 1689, in <module>
main()
File "/Applications/PyCharm CE 2018.3 EAP.app/Contents/helpers/pydev/pydevd.py", line 1683, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/Applications/PyCharm CE 2018.3 EAP.app/Contents/helpers/pydev/pydevd.py", line 1083, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Applications/PyCharm CE 2018.3 EAP.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "train.py", line 274, in <module>
main()
File "train.py", line 152, in main
actions, value_ext, value_int, policy = agent.get_action(np.float32(states) / 255.)
File "/Users/kslazarev/PycharmProjects/random-network-distillation-pytorch/agents.py", line 59, in get_action
policy, value_ext, value_int = self.model(state)
File "/Users/kslazarev/.pyenv/versions/3.6.7/envs/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/Users/kslazarev/PycharmProjects/random-network-distillation-pytorch/model.py", line 158, in forward
x = self.feature(state)
File "/Users/kslazarev/.pyenv/versions/3.6.7/envs/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/Users/kslazarev/.pyenv/versions/3.6.7/envs/env/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/Users/kslazarev/.pyenv/versions/3.6.7/envs/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/Users/kslazarev/.pyenv/versions/3.6.7/envs/env/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 67, in forward
return F.linear(input, self.weight, self.bias)
File "/Users/kslazarev/.pyenv/versions/3.6.7/envs/env/lib/python3.6/site-packages/torch/nn/functional.py", line 1352, in linear
ret = torch.addmm(torch.jit._unwrap_optional(bias), input, weight.t())
RuntimeError: size mismatch, m1: [16 x 2304], m2: [3136 x 256] at /Users/administrator/nightlies/pytorch-1.0.0/wheel_build_dirs/wheel_3.6/pytorch/aten/src/TH/generic/THTensorMath.cpp:940
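The numbers in the traceback are consistent with that kernel-size typo: a stride-1 conv over the 9x9 feature map gives 7x7 with kernel_size=3 (64 * 7 * 7 = 3136, what the Linear layer expects) but 6x6 with kernel_size=4 (64 * 6 * 6 = 2304, the m1 size in the error).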
The code should use int_gamma instead of gamma.
When I try to train the system from scratch by removing the pretrained models, it complains that it cannot find the models.
python train.py
{'OPTIONS': {'envtype': '[atari, mario]', 'trainmethod': 'RND', 'envid': 'MontezumaRevengeNoFrameskip-v4', 'maxstepperepisode': '4500', 'extcoef': '2.', 'learningrate': '1e-4', 'numenv': '2', 'numstep': '128', 'gamma': '0.999', 'intgamma': '0.99', 'lambda': '0.95', 'stableeps': '1e-8', 'statestacksize': '4', 'preprocheight': '84', 'proprocwidth': '84', 'usegae': 'True', 'usegpu': 'True', 'usenorm': 'False', 'usenoisynet': 'False', 'clipgradnorm': '0.5', 'entropy': '0.001', 'epoch': '4', 'minibatch': '4', 'ppoeps': '0.1', 'intcoef': '1.', 'stickyaction': 'True', 'actionprob': '0.25', 'updateproportion': '0.25', 'lifedone': 'False', 'obsnormstep': '50'}}
load model...
Traceback (most recent call last):
File "train.py", line 281, in <module>
main()
File "train.py", line 100, in main
agent.model.load_state_dict(torch.load(model_path))
File "/home/rjn/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 356, in load
f = open(f, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'models/MontezumaRevengeNoFrameskip-v4.model'
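One possible workaround, assuming loading is gated somewhere in train.py (a sketch around the failing line, not the author's intended fix): only restore a checkpoint when the file actually exists.

    import os
    import torch

    model_path = 'models/MontezumaRevengeNoFrameskip-v4.model'  # path from the error
    if os.path.exists(model_path):
        print('load model...')
        agent.model.load_state_dict(torch.load(model_path))  # agent as in train.py
    else:
        print('no checkpoint found, training from scratch')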
Hi!
I have a question about how the intrinsic rewards are calculated.
Why do you use sum(1) instead of mean(1)? That computes the sum over the 512 output neurons, which is different from computing the mean over those outputs.
The original TensorFlow release uses reduce_mean, and I'm a little bit confused.
https://github.com/openai/random-network-distillation/blob/f75c0f1efa473d5109d487062fd8ed49ddce6634/policies/cnn_gru_policy_dynamics.py#L241
I hope you can clear this up for me.
Thank you in advance
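One observation that may resolve this: sum and mean over a fixed feature width differ only by a constant factor, which the intrinsic-reward normalization should largely absorb. A sketch (the [batch, 512] feature shape is an assumption):

    import torch

    # per-feature squared error between the frozen target and the predictor
    err = (torch.randn(8, 512) - torch.randn(8, 512)).pow(2)

    r_sum = err.sum(1)    # this repo
    r_mean = err.mean(1)  # official TF release (reduce_mean)

    # identical up to the constant 512, which the running-std normalization
    # of intrinsic rewards rescales away (modulo coefficient tuning)
    print(torch.allclose(r_sum, 512 * r_mean))  # True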