jcwleo / random-network-distillation-pytorch
Random Network Distillation (PyTorch)
License: MIT License
Hi,
Is the reason the following code modifies the actions for the Breakout game that it eliminates the NOOP action from the set of actions the agent can take?
envs.py:

    if 'Breakout' in self.env_id:
        action += 1

train.py:

    if 'Breakout' in env_id:
        output_size -= 1
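For context, my understanding of the remapping, assuming the standard Breakout action set (0=NOOP, 1=FIRE, 2=RIGHT, 3=LEFT):

    # the policy head outputs output_size - 1 = 3 logits, so sampled actions are in {0, 1, 2}
    policy_action = 2               # e.g. the policy samples index 2
    env_action = policy_action + 1  # -> 3 (LEFT); index 0 (NOOP) can never be chosen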
Hello,
I also built an RND model, but I am stuck at a score of 2500... After how many total steps should the agent improve further? I am not sure whether this is caused by a bug in my code, so I want to check with you. Thank you.
timestamp, pci.bus_id, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
2019/01/11 15:41:41.215, 00000000:01:00.0, 60, 0 %, 0 %, 11177 MiB, 3366 MiB, 7811 MiB
2019/01/11 15:41:41.216, 00000000:02:00.0, 33, 0 %, 0 %, 11178 MiB, 11168 MiB, 10 MiB
2019/01/11 15:41:50.226, 00000000:01:00.0, 65, 97 %, 51 %, 11177 MiB, 3366 MiB, 7811 MiB
2019/01/11 15:41:51.227, 00000000:01:00.0, 65, 96 %, 53 %, 11177 MiB, 3366 MiB, 7811 MiB
2019/01/11 15:41:53.229, 00000000:01:00.0, 61, 3 %, 0 %, 11177 MiB, 3366 MiB, 7811 MiB
(The log repeats once per second over 15:41:41 to 15:41:53: the first GPU holds 7811 MiB of memory but mostly sits at 0% utilization with brief bursts up to ~97%; the second GPU stays idle at 10 MiB used throughout.)
Thanks for the code, this is much more understandable than the original one.
But the agent I trained can only reach a maximum score of 4600 and stays at that level no matter how long I train.
Note that OpenAI can reach a score of 10,000.
Am I missing something?
What is the meaning of this line in envs.py?

    r = int(info.get('flag_get', False))
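If I understand the Mario wrapper correctly: in gym-super-mario-bros, info['flag_get'] is True on the step where Mario reaches the end-of-level flag, so this line turns that event into a 0/1 reward (and defaults to 0 for environments whose info dict has no such key).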
Hello, I have a problem with make_train_data in utils.py:
    if use_gae:
        gae = np.zeros_like([num_worker, ])
        for t in range(num_step - 1, -1, -1):
            delta = reward[:, t] + gamma * value[:, t + 1] * (1 - done[:, t]) - value[:, t]
            gae = delta + gamma * lam * (1 - done[:, t]) * gae
            discounted_return[:, t] = gae + value[:, t]

        # For Actor
        adv = discounted_return - value[:, :-1]
I am confused. As I understand it, GAE is the advantage estimate

    \hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}

so I have no idea why we need to add V(t):

    discounted_return[:, t] = gae + value[:, t]

Can you explain what I am missing? Thanks.
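My current guess at the answer (not confirmed by the author): adding V(s_t) converts the GAE advantage into a lambda-return target for the critic, and the actor's advantage is then recovered by subtracting the value again, since A_t = R_t - V(s_t). A minimal runnable sketch of that reading, with hypothetical shapes ([num_worker, num_step] rewards and dones, [num_worker, num_step + 1] values including the bootstrap value):

    import numpy as np

    def make_targets(reward, value, done, gamma=0.999, lam=0.95):
        num_worker, num_step = reward.shape
        returns = np.zeros((num_worker, num_step), dtype=np.float32)
        gae = np.zeros(num_worker, dtype=np.float32)
        for t in range(num_step - 1, -1, -1):
            delta = reward[:, t] + gamma * value[:, t + 1] * (1 - done[:, t]) - value[:, t]
            gae = delta + gamma * lam * (1 - done[:, t]) * gae
            returns[:, t] = gae + value[:, t]  # critic target: R_t = A_t + V(s_t)
        adv = returns - value[:, :-1]          # actor advantage: A_t = R_t - V(s_t)
        return returns, adv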
Thanks for the great work!
I'd like to know: if I want to apply this work to a new continuous environment that I created myself, what should I do? Do you have any suggestions?
Hi,
In your code (envs.py), I saw that you first wrap the environment with MaxAndSkipEnv() and then apply the sticky action.
However, in the RND authors' code, I found that they first wrap the env with StickyActionEnv() and then wrap it with MaxAndSkipEnv(). So it seems your agent will have more "sticky" actions, and I think this makes things a little bit different.
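A toy sketch of why the ordering matters (my reading of the two codebases; p = 0.25 and skip = 4 are the usual defaults):

    import numpy as np

    rng = np.random.default_rng(0)
    p, skip, n_steps = 0.25, 4, 10

    # Official ordering (StickyActionEnv inside MaxAndSkipEnv): the
    # "repeat last action" coin is flipped on every raw frame.
    per_frame_stick = rng.random(n_steps * skip) < p

    # This repo's ordering: the coin is flipped once per skipped step, and
    # the outcome then holds for the whole 4-frame block.
    per_block_stick = np.repeat(rng.random(n_steps) < p, skip)

    print(per_frame_stick.astype(int))  # isolated sticky frames
    print(per_block_stick.astype(int))  # sticky frames arrive in runs of 4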
I'm working with a different env (other than atari or mario) and I want to change the input shape to the CNN. It seems like self.input_size is ignored? Do you have any explanation for the math going on when setting up the network (the parameters for each layer)? When I change the size from (84, 84) to anything else, I get size mismatch errors at the linear layer.
Thanks!
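For reference, a sketch of the conv-size arithmetic, assuming the usual DQN-style stack this repo uses (8x8 stride 4, then 4x4 stride 2, then 3x3 stride 1, no padding); the flattened size feeding the first Linear layer is hard-coded from an 84x84 input, which is why any other input size crashes there:

    def conv_out(size, kernel, stride, padding=0):
        # standard conv arithmetic: floor((size + 2*pad - kernel) / stride) + 1
        return (size + 2 * padding - kernel) // stride + 1

    h = w = 84
    for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
        h, w = conv_out(h, kernel, stride), conv_out(w, kernel, stride)
    print(h, w, 64 * h * w)  # 7 7 3136 -> the expected Linear input size

To use a different input shape, recompute this product and size the first Linear layer accordingly.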
Hi @jcwleo,
Your implementation is amazing. I was searching for a simple but powerful implementation; I modified yours a bit and trained it on SuperMarioBros. However, when I evaluate my agent (using eval.py), the framerate is very slow.
Do you know why?
Thanks,
Have a nice day,
In the RND paper, on page 15, it mentions that extrinsic rewards are clipped to [-1, 1].
But the official RND code in atari_wrappers.py clips extrinsic rewards using the ClipRewardEnv function, which does:

    """Bin reward to {+1, 0, -1} by its sign."""
    return float(np.sign(reward))

I believe the implementation and the explanation in the paper are a little different.
In your implementation (jcwleo) you are clipping by doing:

    total_reward = total_reward.reshape([num_step, num_env_workers]).transpose().clip(-1, 1)

I believe this is different from the official implementation. Does anyone have an explanation of this discrepancy and which one to use?
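For concreteness, the two transforms only disagree on rewards whose magnitude is strictly between 0 and 1:

    import numpy as np

    r = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
    print(np.sign(r))         # [-1. -1.  0.  1.  1.]   sign-binning (official wrapper)
    print(np.clip(r, -1, 1))  # [-1. -0.5  0.  0.5  1.] clipping (this repo)

Since most Atari games hand out integer rewards, the two usually coincide there, but they are not equivalent in general.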
Problem: the input shape is not correct at the Linear-8 layer in the CnnActorCriticNetwork feature model. Or maybe there is a typo in the Conv2d-5 layer's kernel_size: in the predictor and target models kernel_size == 3, but in the feature model kernel_size == 4.
https://github.com/jcwleo/random-network-distillation-pytorch/blob/master/model.py#L97-L106
Feature model summary: input_shape (4,84,84)
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [4, 32, 20, 20] 8,224
ReLU-2 [4, 32, 20, 20] 0
Conv2d-3 [4, 64, 9, 9] 32,832
ReLU-4 [4, 64, 9, 9] 0
Conv2d-5 [4, 64, 7, 7] 36,928
ReLU-6 [4, 64, 7, 7] 0
Flatten-7 [4, 3136] 0
Linear-8 [4, 256] 803,072
ReLU-9 [4, 256] 0
Linear-10 [4, 448] 115,136
ReLU-11 [4, 448] 0
================================================================
Total params: 996,192
Trainable params: 996,192
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.43
Forward/backward pass size (MB): 1.43
Params size (MB): 3.80
Estimated Total Size (MB): 5.66
----------------------------------------------------------------
Traceback:
Traceback (most recent call last):
File "/Applications/PyCharm CE 2018.3 EAP.app/Contents/helpers/pydev/pydevd.py", line 1689, in <module>
main()
File "/Applications/PyCharm CE 2018.3 EAP.app/Contents/helpers/pydev/pydevd.py", line 1683, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/Applications/PyCharm CE 2018.3 EAP.app/Contents/helpers/pydev/pydevd.py", line 1083, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Applications/PyCharm CE 2018.3 EAP.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "train.py", line 274, in <module>
main()
File "train.py", line 152, in main
actions, value_ext, value_int, policy = agent.get_action(np.float32(states) / 255.)
File "/Users/kslazarev/PycharmProjects/random-network-distillation-pytorch/agents.py", line 59, in get_action
policy, value_ext, value_int = self.model(state)
File "/Users/kslazarev/.pyenv/versions/3.6.7/envs/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/Users/kslazarev/PycharmProjects/random-network-distillation-pytorch/model.py", line 158, in forward
x = self.feature(state)
File "/Users/kslazarev/.pyenv/versions/3.6.7/envs/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/Users/kslazarev/.pyenv/versions/3.6.7/envs/env/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/Users/kslazarev/.pyenv/versions/3.6.7/envs/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/Users/kslazarev/.pyenv/versions/3.6.7/envs/env/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 67, in forward
return F.linear(input, self.weight, self.bias)
File "/Users/kslazarev/.pyenv/versions/3.6.7/envs/env/lib/python3.6/site-packages/torch/nn/functional.py", line 1352, in linear
ret = torch.addmm(torch.jit._unwrap_optional(bias), input, weight.t())
RuntimeError: size mismatch, m1: [16 x 2304], m2: [3136 x 256] at /Users/administrator/nightlies/pytorch-1.0.0/wheel_build_dirs/wheel_3.6/pytorch/aten/src/TH/generic/THTensorMath.cpp:940
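The numbers in the traceback are consistent with that kernel-size typo: a stride-1 conv over the 9x9 feature map gives 7x7 with kernel_size=3 (64 * 7 * 7 = 3136, what the Linear layer expects) but 6x6 with kernel_size=4 (64 * 6 * 6 = 2304, the m1 size in the error).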
The code should use int_gamma instead of gamma.
When I try to train the system from scratch by removing the pretrained models, it complains that it cannot find the models.
python train.py
{'OPTIONS': {'envtype': '[atari, mario]', 'trainmethod': 'RND', 'envid': 'MontezumaRevengeNoFrameskip-v4', 'maxstepperepisode': '4500', 'extcoef': '2.', 'learningrate': '1e-4', 'numenv': '2', 'numstep': '128', 'gamma': '0.999', 'intgamma': '0.99', 'lambda': '0.95', 'stableeps': '1e-8', 'statestacksize': '4', 'preprocheight': '84', 'proprocwidth': '84', 'usegae': 'True', 'usegpu': 'True', 'usenorm': 'False', 'usenoisynet': 'False', 'clipgradnorm': '0.5', 'entropy': '0.001', 'epoch': '4', 'minibatch': '4', 'ppoeps': '0.1', 'intcoef': '1.', 'stickyaction': 'True', 'actionprob': '0.25', 'updateproportion': '0.25', 'lifedone': 'False', 'obsnormstep': '50'}}
load model...
Traceback (most recent call last):
File "train.py", line 281, in <module>
main()
File "train.py", line 100, in main
agent.model.load_state_dict(torch.load(model_path))
File "/home/rjn/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 356, in load
f = open(f, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'models/MontezumaRevengeNoFrameskip-v4.model'
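One possible workaround, assuming loading is gated somewhere in train.py (a sketch around the failing line, not the author's intended fix): only restore a checkpoint when the file actually exists.

    import os
    import torch

    model_path = 'models/MontezumaRevengeNoFrameskip-v4.model'  # path from the error
    if os.path.exists(model_path):
        print('load model...')
        agent.model.load_state_dict(torch.load(model_path))  # agent as in train.py
    else:
        print('no checkpoint found, training from scratch')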
Hi!
I have a question about how the intrinsic rewards are calculated.
Why do you use sum(1) instead of mean(1)? That computes the sum over the 512 output neurons, which is different from computing the mean over those outputs.
The original TensorFlow release uses reduce_mean, and I'm a little bit confused.
https://github.com/openai/random-network-distillation/blob/f75c0f1efa473d5109d487062fd8ed49ddce6634/policies/cnn_gru_policy_dynamics.py#L241
I hope you can clear this up for me.
Thank you in advance
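One observation that may resolve this: sum and mean over a fixed feature width differ only by a constant factor, which the intrinsic-reward normalization should largely absorb. A sketch (the [batch, 512] feature shape is an assumption):

    import torch

    # per-feature squared error between the frozen target and the predictor
    err = (torch.randn(8, 512) - torch.randn(8, 512)).pow(2)

    r_sum = err.sum(1)    # this repo
    r_mean = err.mean(1)  # official TF release (reduce_mean)

    # identical up to the constant 512, which the running-std normalization
    # of intrinsic rewards rescales away (modulo coefficient tuning)
    print(torch.allclose(r_sum, 512 * r_mean))  # True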