
deep-reinforcement-learning-hands-on's Introduction

Deep Reinforcement Learning Hands-On

Code samples for Deep Reinforcement Learning Hands-On book

Versions and compatibility

This repository is maintained by the book's author, Max Lapan. I'm trying to keep all the examples working under the latest versions of PyTorch and gym, which is not always simple, as software evolves. For example, OpenAI Universe, used extensively in Chapter 13, was discontinued by OpenAI. The list of current requirements is in the requirements.txt file.

Examples require Python 3.6.

And, of course, bugs in the examples are inevitable, so the exact code might differ from the code printed in the book.

To keep track of major code changes, I'm using tags and branches, for example:

  • tag 01_release marks code state right after book publication in June 2018
  • branch master has the latest version of code updated for the latest stable PyTorch 0.4.1
  • branch torch_1.0 keeps the activity of porting examples to PyTorch 1.0 (not yet released)

Chapters' examples

Deep Reinforcement Learning Hands-On

This is the code repository for Deep Reinforcement Learning Hands-On, published by Packt. It contains all the supporting project files necessary to work through the book from start to finish.

About the Book

Recent developments in reinforcement learning (RL), combined with deep learning (DL), have seen unprecedented progress made towards training agents to solve complex problems in a human-like way. Google’s use of algorithms to play and defeat the well-known Atari arcade games has propelled the field to prominence, and researchers are generating new ideas at a rapid pace.

Deep Reinforcement Learning Hands-On is a comprehensive guide to the very latest DL tools and their limitations. You will evaluate methods including Cross-entropy and policy gradients, before applying them to real-world environments. Take on both the Atari set of virtual games and family favorites such as Connect4. The book provides an introduction to the basics of RL, giving you the know-how to code intelligent learning agents to take on a formidable array of practical tasks. Discover how to implement Q-learning on ‘grid world’ environments, teach your agent to buy and trade stocks, and find out how natural language models are driving the boom in chatbots.

Download a free PDF

If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.
Simply click on the link to claim your free PDF.

https://packt.link/free-ebook/9781838826994

deep-reinforcement-learning-hands-on's People

Contributors

davidcotton, dependabot[bot], glasslion, gowtham1997, jegauth, kishorrit, nidhishas, omrigan, packt-itservice, packtutkarshr, shmuma


deep-reinforcement-learning-hands-on's Issues

Chapter 8: How to load data for continuation of training

Hello, I am trying to continue training on previous data. My computer went into a forced restart. I started with, in the command prompt, "python -r YNDX_150101_151231.csv".

The last checkpoint was "checkpoint- 8.data"
The last mean_val was "mean_val-0.828.data"

How would I continue training from where it left off?
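
For anyone in the same situation, here is a minimal sketch of resuming from a saved checkpoint, assuming the file holds a state_dict saved with torch.save(); the placeholder model below is not the book's DQNConv1D, and the optimizer state is not restored:

    import torch
    import torch.nn as nn

    # Placeholder network; in practice rebuild the exact architecture used for training.
    net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 3))
    torch.save(net.state_dict(), "checkpoint-8.data")        # stands in for the earlier save
    net.load_state_dict(torch.load("checkpoint-8.data", map_location="cpu"))
    # ...then re-enter the training loop with this net and a fresh optimizer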

02_pong_a2c.py not working with argument --cuda

Traceback (most recent call last):
File "d:/Python/WS/PyTorch/Deep-Reinforcement-Learning-Hands-On/Chapter10/02_pong_a2c.py", line 159, in
tb_tracker.track("advantage", adv_v, step_idx)
File "D:\Python\WinPython3670\python-3.6.7.amd64\lib\site-packages\ptan\common\utils.py", line 329, in track
self.writer.add_scalar(param_name, np.mean(data), iter_index)
File "D:\Python\WinPython3670\python-3.6.7.amd64\lib\site-packages\numpy\core\fromnumeric.py", line 2957, in mean
out=out, **kwargs)
File "D:\Python\WinPython3670\python-3.6.7.amd64\lib\site-packages\numpy\core_methods.py", line 80, in _mean
ret = ret.dtype.type(ret / rcount)
AttributeError: 'torch.dtype' object has no attribute 'type'
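
For readers hitting the same traceback, a small self-contained illustration of one possible workaround (my assumption, not the maintainer's official fix): convert the tensor to a plain numpy value before it reaches np.mean() inside ptan's tracker.

    import numpy as np
    import torch

    adv_v = torch.randn(16)
    # np.mean(adv_v) can raise AttributeError: 'torch.dtype' object has no attribute 'type'
    # on some PyTorch/numpy version combinations; converting first avoids the issue.
    value = float(np.mean(adv_v.detach().cpu().numpy()))
    print(value)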

Effect of larger sample queue?

Looking at chapter 7 some more,
you have two examples for Breakout (breakout-small and breakout), which have different sample queue sizes. What should we expect from a larger queue? Slower/faster convergence, a stable/smooth growth in reward value, or is it just a matter of trial and error to see what works?

One more question: are these Atari games (Pong, Space Invaders, and Breakout) all deterministic? So all this training would be useless on a game with a random seed, such as the ball in another starting position, the paddle in a different starting position, or the invaders moving in a different direction?

Thanks,
Chris

tensorboard first example problem

On macOS High Sierra, after installing the virtual environment and requirements, I obtained this error when trying the first tensorboard demo:

(rlenv) ➜ Chapter03 git:(master) ✗ tensorboard --logdir runs --host localhost
Traceback (most recent call last):
File "/anaconda3/bin/tensorboard", line 6, in
from tensorboard.main import run_main
File "/anaconda3/lib/python3.6/site-packages/tensorboard/init.py", line 4, in
from .writer import FileWriter, SummaryWriter
File "/anaconda3/lib/python3.6/site-packages/tensorboard/writer.py", line 28, in
from .summary import scalar, histogram, image, audio, text
File "/anaconda3/lib/python3.6/site-packages/tensorboard/summary/init.py", line 25, in
from tensorboard.summary import v1
File "/anaconda3/lib/python3.6/site-packages/tensorboard/summary/v1.py", line 24, in
from tensorboard.plugins.audio import summary as _audio_summary
File "/anaconda3/lib/python3.6/site-packages/tensorboard/plugins/audio/summary.py", line 36, in
from tensorboard.plugins.audio import metadata
File "/anaconda3/lib/python3.6/site-packages/tensorboard/plugins/audio/metadata.py", line 21, in
from tensorboard.compat.proto import summary_pb2
File "/anaconda3/lib/python3.6/site-packages/tensorboard/compat/proto/summary_pb2.py", line 15, in
from tensorboard.compat.proto import tensor_pb2 as tensorboard_dot_compat_dot_proto_dot_tensor__pb2
File "/anaconda3/lib/python3.6/site-packages/tensorboard/compat/proto/tensor_pb2.py", line 15, in
from tensorboard.compat.proto import resource_handle_pb2 as tensorboard_dot_compat_dot_proto_dot_resource__handle__pb2
File "/anaconda3/lib/python3.6/site-packages/tensorboard/compat/proto/resource_handle_pb2.py", line 22, in
serialized_pb=_b('\n.tensorboard/compat/proto/resource_handle.proto\x12\x0btensorboard"r\n\x13ResourceHandleProto\x12\x0e\n\x06\x64\x65vice\x18\x01 \x01(\t\x12\x11\n\tcontainer\x18\x02 \x01(\t\x12\x0c\n\x04name\x18\x03 \x01(\t\x12\x11\n\thash_code\x18\x04 \x01(\x04\x12\x17\n\x0fmaybe_type_name\x18\x05 \x01(\tBn\n\x18org.tensorflow.frameworkB\x0eResourceHandleP\x01Z=github.com/tensorflow/tensorflow/tensorflow/go/core/framework\xf8\x01\x01\x62\x06proto3')
File "/anaconda3/lib/python3.6/site-packages/google/protobuf/descriptor.py", line 878, in new
return _message.default_pool.AddSerializedFile(serialized_pb)
TypeError: Couldn't build proto file into descriptor pool!
Invalid proto descriptor for file "tensorboard/compat/proto/resource_handle.proto":
tensorboard.ResourceHandleProto.device: "tensorboard.ResourceHandleProto.device" is already defined in file "tensorboard/src/resource_handle.proto".
tensorboard.ResourceHandleProto.container: "tensorboard.ResourceHandleProto.container" is already defined in file "tensorboard/src/resource_handle.proto".
tensorboard.ResourceHandleProto.name: "tensorboard.ResourceHandleProto.name" is already defined in file "tensorboard/src/resource_handle.proto".
tensorboard.ResourceHandleProto.hash_code: "tensorboard.ResourceHandleProto.hash_code" is already defined in file "tensorboard/src/resource_handle.proto".
tensorboard.ResourceHandleProto.maybe_type_name: "tensorboard.ResourceHandleProto.maybe_type_name" is already defined in file "tensorboard/src/resource_handle.proto".
tensorboard.ResourceHandleProto: "tensorboard.ResourceHandleProto" is already defined in file "tensorboard/src/resource_handle.proto".

Chapter 7: 03_dqn_double.py

Hi,

I've been playing with your double DQN implementation and have found that, unfortunately, the model does not seem to converge.

To fix this I added a '.detach()' to line 30, because I was worried that backpropagating through the net used for action selection could be the cause of the issue. This seems to work, but I have to admit that I'm confused as to why. I would have thought that the detach in line 36 would already block the gradients.

Do you have any idea what might be going on?

Cheers,
Jamie
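
For context, a hedged sketch of a Double DQN target computation (toy stand-in networks, not the repository's exact calc_loss): wrapping both forward passes in torch.no_grad() makes it explicit that neither the action selection nor the evaluation contributes gradients.

    import torch
    import torch.nn as nn

    GAMMA = 0.99
    net = nn.Linear(4, 2)        # stand-in online network
    tgt_net = nn.Linear(4, 2)    # stand-in target network

    next_states_v = torch.randn(8, 4)
    rewards_v = torch.randn(8)
    done_mask = torch.zeros(8, dtype=torch.bool)

    with torch.no_grad():                                  # blocks gradients from both passes
        next_actions = net(next_states_v).max(1)[1]        # online net picks the actions
        next_vals = tgt_net(next_states_v).gather(1, next_actions.unsqueeze(-1)).squeeze(-1)
        next_vals[done_mask] = 0.0
        target = rewards_v + GAMMA * next_vals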

Dueling DQN implementation possibly wrong?

Hi Max,

This is the Dueling DQN implementation from DeepMind: https://arxiv.org/pdf/1511.06581.pdf

(Equation 9 of the paper: Q(s, a) = V(s) + A(s, a) - 1/|A| * sum over a' of A(s, a'))

Formula 9 here shows that the advantage uses the mean over the actions of a given state. That is also in line with the normal definition of the advantage operator, I believe.

In your implementation, however, you seem to subtract the mean of the advantages over all states:

return val + adv - adv.mean() 

whereas I believe this should be more correct:

return val + adv - adv.mean(dim=1).unsqueeze(1)

What do you think?
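
A minimal shape check of the per-state mean suggested above (toy tensors, my own illustration); adv.mean(dim=1, keepdim=True) is equivalent to adv.mean(dim=1).unsqueeze(1):

    import torch

    val = torch.randn(4, 1)       # V(s), shape (batch, 1)
    adv = torch.randn(4, 6)       # A(s, a), shape (batch, n_actions)
    q = val + adv - adv.mean(dim=1, keepdim=True)   # mean over actions of each state only
    print(q.shape)                # torch.Size([4, 6])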

Chapter 7: What is the use of eq_mask?

Hello,
Thanks for the book. I'm having fun reading and implementing different DQN models.

The below snippet is slightly confusing to me.

    eq_mask = u == l
    eq_dones = dones.copy()
    eq_dones[dones] = eq_mask
    if eq_dones.any():
        proj_distr[eq_dones, l[eq_mask]] = 1.0

1) Am I correct in assuming we are trying to get the indices of values which fall directly on an atom (l == u) and have their dones set to True?
2) Can you also please explain line 176 (the Boolean tensor taking a mask of itself is slightly confusing)?

Thanks
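
For other readers puzzled by the same lines, a small numpy illustration of the masking pattern (toy values of my own): dones selects the terminal samples, and eq_mask, defined only over that subset, marks which of them projected exactly onto an atom (u == l).

    import numpy as np

    dones = np.array([True, False, True, True])   # 3 terminal samples in the batch
    eq_mask = np.array([True, False, True])        # computed over those 3 samples only
    eq_dones = dones.copy()
    eq_dones[dones] = eq_mask                      # keeps True only where both conditions hold
    print(eq_dones)                                # [ True False False  True]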

Chapter07/04_dqn_noisy_net.py stability

Hi Max,

I've found the NoisyNets implementation to have unstable training dynamics. In my experiments only 1-2 out of 5 runs converge when using the shortened Pong hyperparams (using both Independent Gaussians and Factored Gaussians). I've found that reducing the learning rate from 1e-4 to 5e-5 seems to increase the stability to 4-5 runs out of 5 with minimal increase to the convergence speed. I hope this helps anybody else out there who might be having trouble with it.

Cheers,
Dave

Ch. 17 lib/i2a.py using tensor with wrong type as an index

While running 03_i2a.py, I get a runtime error on line 61 of lib/i2a.py:
act_planes_v[range(batch_size), actions] = 1.0

actions is being used as an index, and the tensor type needs to be int or byte.
To fix this, I made the following change on line 166:
actions_t = torch.tensor(actions,dtype=torch.int64).to(batch.device)

I am now able to run without issues.
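
A self-contained illustration of the dtype requirement behind that fix (assumed toy values, not the book's data): advanced indexing needs an integer (int64/long) index tensor.

    import torch

    batch_size, n_actions = 4, 3
    act_planes_v = torch.zeros(batch_size, n_actions)
    actions = [0, 2, 1, 2]
    actions_t = torch.tensor(actions, dtype=torch.int64)   # explicit integer index
    act_planes_v[range(batch_size), actions_t] = 1.0
    print(act_planes_v)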

Chapter 6: dqn_pong Confusing statement

This is regarding line 77 (new_state = new_state) in dqn_pong.

    new_state, reward, is_done, _ = self.env.step(action)

    self.total_reward += reward

    new_state = new_state

Was this supposed to be self.state = new_state (which is done in line 81)?

a3c not working on windows?

I can't run the 01_a3c_data.py and 02_a3c_grad.py files in Chapter11 on Windows.
I get this error:
THCudaCheck FAIL file=c:\users\administrator\downloads\new-builder\win-wheel\pytorch\torch\csrc\generic\StorageSharing.cpp line=253 error=63 : OS call failed or operation not supported on this OS

What can I do to continue the tutorial?

Chapter08 - run_model.py return size mismatch

Hi, after running train_model_conv.py, I tried to test the saved models using run_model.py with the command below, but got the size mismatch error shown below. Any idea how I could fix it? Many thanks in advance.

Chapter08$ python3 run_model.py -d data/YNDX_160101_161231.csv -m saves/runs/mean_val-0.824.data -b 10 -n test --conv

Reading data/YNDX_160101_161231.csv
Read done, got 131542 rows, 99752 filtered, 0 open prices adjusted
Traceback (most recent call last):
File "run_model.py", line 35, in
net.load_state_dict(torch.load(args.model, map_location=lambda storage, loc: storage))
File "/home/lamhk/Deep-RL-Hands-On/venv/Deep-RL/lib/python3.5/site-packages/torch/nn/modules/module.py", line 719, in load_state_dict
self.class.name, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DQNConv1D:
size mismatch for fc_val.0.weight: copying a param of torch.Size([512, 256]) from checkpoint, where the shape is torch.Size([512, 5376]) in current model.
size mismatch for fc_adv.0.weight: copying a param of torch.Size([512, 256]) from checkpoint, where the shape is torch.Size([512, 5376]) in current model.

Chapter 8: run_model

I'm confused about this code for two reasons: 1) Is position_steps actually used for something? It's always None and seems to do nothing.
2) This ties into my first question, but when I run the code and add some print statements to see what it is choosing as actions (0, 1, 2), it seems that it can buy as many shares as it wants at a time? How can I change the code so that it only buys 1 share at a time, as with the training code?

[I MUST add that I love this book and your code examples. I've spent so much enjoyable time working through the Atari implementations especially. This book is becoming a bible to me.]

Thank you!

Converting to support GPU

Hi,
I've got a question regarding the code in Chapter 3 of Deep Reinforcement Learning Hands-On. Can you explain how to make this run on the GPU? I've tried to implement it myself, but the code crashes.

Crashed with the error:

(python36) c:\Anaconda\Deep-Reinforcement-Learning-Hands-On-master\Chapter04>python 03_frozenlake_tweaked.py --cuda
Traceback (most recent call last):
File "03_frozenlake_tweaked.py", line 109, in
for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
File "03_frozenlake_tweaked.py", line 58, in iterate_batches
act_probs_v = sm(net(obs_v))
File "C:\Anaconda\envs\python36\lib\site-packages\torch\nn\modules\module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "03_frozenlake_tweaked.py", line 43, in forward
return self.net(x)
File "C:\Anaconda\envs\python36\lib\site-packages\torch\nn\modules\module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "C:\Anaconda\envs\python36\lib\site-packages\torch\nn\modules\container.py", line 92, in forward
input = module(input)
File "C:\Anaconda\envs\python36\lib\site-packages\torch\nn\modules\module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "C:\Anaconda\envs\python36\lib\site-packages\torch\nn\modules\linear.py", line 67, in forward
return F.linear(input, self.weight, self.bias)
File "C:\Anaconda\envs\python36\lib\site-packages\torch\nn\functional.py", line 1352, in linear
ret = torch.addmm(torch.jit._unwrap_optional(bias), input, weight.t())
RuntimeError: Expected object of backend CUDA but got backend CPU for argument #4 'mat1'

Added/Made the following changes to: 03_frozenlake_tweaked.py

#!/usr/bin/env python3
import random
import gym
import gym.spaces
import argparse
from collections import namedtuple
import numpy as np
from tensorboardX import SummaryWriter

import torch
import torch.nn as nn
import torch.optim as optim

HIDDEN_SIZE = 128
BATCH_SIZE = 100
PERCENTILE = 30
GAMMA = 0.9

class DiscreteOneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(DiscreteOneHotWrapper, self).__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete)
        self.observation_space = gym.spaces.Box(0.0, 1.0, (env.observation_space.n, ), dtype=np.float32)

    def observation(self, observation):
        res = np.copy(self.observation_space.low)
        res[observation] = 1.0
        return res

class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)

Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

def iterate_batches(env, net, batch_size):
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset()
    sm = nn.Softmax(dim=1)
    while True:
        obs_v = torch.FloatTensor([obs])
        act_probs_v = sm(net(obs_v))
        act_probs = act_probs_v.data.numpy()[0]
        action = np.random.choice(len(act_probs), p=act_probs)
        next_obs, reward, is_done, _ = env.step(action)
        episode_reward += reward
        episode_steps.append(EpisodeStep(observation=obs, action=action))
        if is_done:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()
            if len(batch) == batch_size:
                yield batch
                batch = []
        obs = next_obs

def filter_batch(batch, percentile):
    disc_rewards = list(map(lambda s: s.reward * (GAMMA ** len(s.steps)), batch))
    reward_bound = np.percentile(disc_rewards, percentile)

train_obs = []
train_act = []
elite_batch = []
for example, discounted_reward in zip(batch, disc_rewards):
    if discounted_reward > reward_bound:
        train_obs.extend(map(lambda step: step.observation, example.steps))
        train_act.extend(map(lambda step: step.action, example.steps))
        elite_batch.append(example)

return elite_batch, train_obs, train_act, reward_bound

if name == "main":
parser = argparse.ArgumentParser()
parser.add_argument("--cuda", default=False, action='store_true', help="Enable cuda computation")
args = parser.parse_args()
device = torch.device("cuda" if args.cuda else "cpu")

random.seed(12345)
env = DiscreteOneHotWrapper(gym.make("FrozenLake-v0"))
# env = gym.wrappers.Monitor(env, directory="mon", force=True)
obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

net = Net(obs_size, HIDDEN_SIZE, n_actions).to(device)
objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=net.parameters(), lr=0.001)
writer = SummaryWriter(comment="-frozenlake-tweaked")

full_batch = []
for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
    reward_mean = float(np.mean(list(map(lambda s: s.reward, batch))))
    full_batch, obs, acts, reward_bound = filter_batch(full_batch + batch, PERCENTILE)
    if not full_batch:
        continue
    #obs_v = torch.FloatTensor(obs)#, device=device)
    #acts_v = torch.LongTensor(acts)#, device=device)
    obs_v = torch.tensor(obs).to(device)
    acts_v = torch.tensor(acts).to(device)
	
    full_batch = full_batch[-500:]

    optimizer.zero_grad()
    action_scores_v = net(obs_v)
    loss_v = objective(action_scores_v, acts_v)
    loss_v.backward()
    optimizer.step()
    print("%d: loss=%.3f, reward_mean=%.3f, reward_bound=%.3f, batch=%d" % (
        iter_no, loss_v.item(), reward_mean, reward_bound, len(full_batch)))
    writer.add_scalar("loss", loss_v.item(), iter_no)
    writer.add_scalar("reward_mean", reward_mean, iter_no)
    writer.add_scalar("reward_bound", reward_bound, iter_no)
    if reward_mean > 0.8:
        print("Solved!")
        break
writer.close()

Chapter03/03_atari_gan.py crashes

Running 03_atari_gan.py with all dependencies properly installed dumps this to the screen.
Traceback (most recent call last):
File "03_atari_gan.py", line 9, in
from tensorboardX import SummaryWriter
File "/home/jason/anaconda3/envs/nn/lib/python3.7/site-packages/tensorboardX/init.py", line 4, in
from .writer import FileWriter, SummaryWriter
File "/home/jason/anaconda3/envs/nn/lib/python3.7/site-packages/tensorboardX/writer.py", line 24, in
from .src import event_pb2
File "/home/jason/anaconda3/envs/nn/lib/python3.7/site-packages/tensorboardX/src/event_pb2.py", line 16, in
from tensorboard.src import summary_pb2 as tensorboard_dot_src_dot_summary__pb2
File "/home/jason/anaconda3/envs/nn/lib/python3.7/site-packages/tensorboard/init.py", line 4, in
from .writer import FileWriter, SummaryWriter
File "/home/jason/anaconda3/envs/nn/lib/python3.7/site-packages/tensorboard/writer.py", line 28, in
from .summary import scalar, histogram, image, audio, text
File "/home/jason/anaconda3/envs/nn/lib/python3.7/site-packages/tensorboard/summary/init.py", line 25, in
from tensorboard.summary import v1
File "/home/jason/anaconda3/envs/nn/lib/python3.7/site-packages/tensorboard/summary/v1.py", line 24, in
from tensorboard.plugins.audio import summary as _audio_summary
File "/home/jason/anaconda3/envs/nn/lib/python3.7/site-packages/tensorboard/plugins/audio/summary.py", line 36, in
from tensorboard.plugins.audio import metadata
File "/home/jason/anaconda3/envs/nn/lib/python3.7/site-packages/tensorboard/plugins/audio/metadata.py", line 21, in
from tensorboard.compat.proto import summary_pb2
File "/home/jason/anaconda3/envs/nn/lib/python3.7/site-packages/tensorboard/compat/proto/summary_pb2.py", line 15, in
from tensorboard.compat.proto import tensor_pb2 as tensorboard_dot_compat_dot_proto_dot_tensor__pb2
File "/home/jason/anaconda3/envs/nn/lib/python3.7/site-packages/tensorboard/compat/proto/tensor_pb2.py", line 15, in
from tensorboard.compat.proto import resource_handle_pb2 as tensorboard_dot_compat_dot_proto_dot_resource__handle__pb2
File "/home/jason/anaconda3/envs/nn/lib/python3.7/site-packages/tensorboard/compat/proto/resource_handle_pb2.py", line 22, in
serialized_pb=_b('\n.tensorboard/compat/proto/resource_handle.proto\x12\x0btensorboard"r\n\x13ResourceHandleProto\x12\x0e\n\x06\x64\x65vice\x18\x01 \x01(\t\x12\x11\n\tcontainer\x18\x02 \x01(\t\x12\x0c\n\x04name\x18\x03 \x01(\t\x12\x11\n\thash_code\x18\x04 \x01(\x04\x12\x17\n\x0fmaybe_type_name\x18\x05 \x01(\tBn\n\x18org.tensorflow.frameworkB\x0eResourceHandleP\x01Z=github.com/tensorflow/tensorflow/tensorflow/go/core/framework\xf8\x01\x01\x62\x06proto3')
File "/home/jason/anaconda3/envs/nn/lib/python3.7/site-packages/google/protobuf/descriptor.py", line 878, in new
return _message.default_pool.AddSerializedFile(serialized_pb)
TypeError: Couldn't build proto file into descriptor pool!
Invalid proto descriptor for file "tensorboard/compat/proto/resource_handle.proto":
tensorboard.ResourceHandleProto.device: "tensorboard.ResourceHandleProto.device" is already defined in file "tensorboard/src/resource_handle.proto".
tensorboard.ResourceHandleProto.container: "tensorboard.ResourceHandleProto.container" is already defined in file "tensorboard/src/resource_handle.proto".
tensorboard.ResourceHandleProto.name: "tensorboard.ResourceHandleProto.name" is already defined in file "tensorboard/src/resource_handle.proto".
tensorboard.ResourceHandleProto.hash_code: "tensorboard.ResourceHandleProto.hash_code" is already defined in file "tensorboard/src/resource_handle.proto".
tensorboard.ResourceHandleProto.maybe_type_name: "tensorboard.ResourceHandleProto.maybe_type_name" is already defined in file "tensorboard/src/resource_handle.proto".
tensorboard.ResourceHandleProto: "tensorboard.ResourceHandleProto" is already defined in file "tensorboard/src/resource_handle.proto".

dqn_play.py for Chapter07

Hi Maxim,

First of all, fantastic book, thank you so much for that.

I saw others posted about this before, but I couldn't resolve my problem by reading those issues. I am sure it's an easy task, but I seem to be too dull here on my side.

I am struggling to adapt 03_dqn_play.py from Chapter 06 to the examples of Chapter 07. With some minor tweaks I am able to save the best nets during training, but I fail when trying to "play" these nets. My problems start with the different wrappers we use in Chapter 07, which result in env.reset() returning a LazyFrames object instead of an observation.

If somebody out there could manage to write a small script to run the trained nets of Chapter 07, I would highly appreciate it if you could share it. Of course, any pointer on how I can get this done myself would also be highly appreciated.

Thanks

martin

A3C grad_parallel gradients addition

This is regarding line 140 of 02_a3c_grad.py, where we are adding the gradients of different processes:

    if grad_buffer is None:
        grad_buffer = train_entry
    else:
        for tgt_grad, grad in zip(grad_buffer, train_entry):
            tgt_grad += grad

Here, even though tgt_grad += grad is done, this change is strangely not getting reflected in the grad_buffer list.

I made a small snippet to test this:

               
               # creating a copy of the grad_buffering before its update
                old_grad_buffer = grad_buffer.copy()
                # a list to check whether elements of old_grad_buffer and
                # updated_grad_buffer are equal
                f = []
                # a new list to store the new added gradients in place
                new_grad_buffer = []
                for tgt_grad, grad in zip(grad_buffer, train_entry):
                    tgt_grad = tgt_grad + grad
                    # add the added gradients to new_grad_buffer
                    new_grad_buffer.append(tgt_grad)

                # comparing the updated grad_buffer and old_grad_buffer
                for tgt_grad, grad in zip(grad_buffer, old_grad_buffer):
                    f.append(np.array_equal(tgt_grad, grad))
                print(any(f))
                f.clear()
                # comparing the new_grad_buffer and old_grad_buffer
                for tgt_grad, grad in zip(old_grad_buffer, new_grad_buffer):
                    f.append(np.array_equal(tgt_grad, grad))
                print(any(f))
Outputs:
>> True
>> False

The above snippet produces True (the updated grad_buffer and old_grad_buffer are equal) and False (for whether old_grad_buffer and new_grad_buffer are equal).

Something strange happens when I try to use += (a += b) instead of adding normally (a = a + b).

                # creating a copy of the grad_buffering before its update
                old_grad_buffer = grad_buffer.copy()
                # a list to check whether elements of old_grad_buffer and
                # updated_grad_buffer are equal
                f = []
                # a new list to store the new added gradients in place
                new_grad_buffer = []
                for tgt_grad, grad in zip(grad_buffer, train_entry):
                    # the snippet above used tgt_grad = tgt_grad + grad
                    tgt_grad += grad
                    # add the added gradients to new_grad_buffer
                    new_grad_buffer.append(tgt_grad)

                # comparing the updated grad_buffer and old_grad_buffer
                for tgt_grad, grad in zip(grad_buffer, old_grad_buffer):
                    f.append(np.array_equal(tgt_grad, grad))
                print(any(f))
                f.clear()
                # comparing the new_grad_buffer and old_grad_buffer
                for tgt_grad, grad in zip(old_grad_buffer, new_grad_buffer):
                    f.append(np.array_equal(tgt_grad, grad))
                print(any(f))
                f.clear()
Outputs:
True
True

So I'm not sure why the first snippet works (I even verified the gradient outputs and they are summed) while the second one doesn't.

@Shmuma Can you please take a look?
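
One observation that may be relevant (a toy demonstration, not a verdict on the A3C code): `+=` on a numpy array mutates it in place, so the element already stored in the buffer list changes too, while `a = a + b` creates a new array and only rebinds the loop variable. Note also that list.copy() is shallow, so both lists reference the same underlying arrays.

    import numpy as np

    buf = [np.ones(3)]
    for g in buf:
        g = g + 1.0            # new array; buf is unchanged
    print(buf[0])              # [1. 1. 1.]

    for g in buf:
        g += 1.0               # in-place; buf sees the change
    print(buf[0])              # [2. 2. 2.]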

requirements.txt requires <Python==3.6.*

The current set of packages in requirements.txt won't work with later versions of Python (like 3.7) because tensorflow 1.12.0 is only compatible with Python 3.6.* or lower. Adding this to the README would be helpful.

chapter 12: 'torch.dtype' object has no attribute 'type'

Hi, when I start to train the model using train_scst.py, this error appeared: 'torch.dtype' object has no attribute 'type'.
I am running PyTorch v1.
This is the complete log of the error, thanks:

python train_scst.py --cuda --data comedy -l saves/crossent-comedy/epoch_090_0.725_0.102.dat -n sc-comedy-test
2019-01-13 09:08:35,726 INFO Loaded 159 movies with genre comedy
2019-01-13 09:08:35,726 INFO Read and tokenise phrases...
2019-01-13 09:08:42,243 INFO Loaded 93039 phrases
2019-01-13 09:08:42,651 INFO Loaded 24716 dialogues with 93039 phrases, generating training pairs
2019-01-13 09:08:42,766 INFO Counting freq of words...
2019-01-13 09:08:43,320 INFO Data has 31774 uniq words, 4913 of them occur more than 10
2019-01-13 09:08:43,573 INFO Obtained 47644 phrase pairs with 4905 uniq words
2019-01-13 09:08:43,859 INFO Training data converted, got 25166 samples
2019-01-13 09:08:43,892 INFO Train set has 21672 phrases, test 1253
2019-01-13 09:08:46,872 INFO Model: PhraseModel(
(emb): Embedding(4905, 50)
(encoder): LSTM(50, 512, batch_first=True)
(decoder): LSTM(50, 512, batch_first=True)
(output): Sequential(
(0): Linear(in_features=512, out_features=4905, bias=True)
)
)
2019-01-13 09:08:46,883 INFO Generating grammar tables from /usr/lib/python3.6/lib2to3/Grammar.txt
2019-01-13 09:08:46,901 INFO Generating grammar tables from /usr/lib/python3.6/lib2to3/PatternGrammar.txt
2019-01-13 09:08:46,935 INFO Model loaded from saves/crossent-comedy/epoch_090_0.725_0.102.dat, continue training in RL mode...
2019-01-13 09:08:46,964 INFO Input: #BEG seymour! i promise you that wasn't a joke-- you have to call her back! #END
2019-01-13 09:08:46,964 INFO Refer: how can you be so sure? #END
2019-01-13 09:08:46,964 INFO Argmax: i don't know, but i always uh-- #END, bleu=0.0302
2019-01-13 09:08:46,975 INFO Sample: i have i hadn't for, but i all help. i together. #END, bleu=0.0218
2019-01-13 09:08:46,982 INFO Sample: i mean i, for only-- #END, bleu=0.0373
2019-01-13 09:08:46,991 INFO Sample: i mean, sir. i always think here. but i #END, bleu=0.0253
2019-01-13 09:08:46,997 INFO Sample: i i like i as sorry sorry. #END, bleu=0.0373
Traceback (most recent call last):
File "train_scst.py", line 160, in
tb_tracker.track("advantage", adv_v, batch_idx)
File "/home/farshid/tensorflow/lib/python3.6/site-packages/ptan/common/utils.py", line 329, in track
self.writer.add_scalar(param_name, np.mean(data), iter_index)
File "/home/farshid/tensorflow/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 2957, in mean
out=out, **kwargs)
File "/home/farshid/tensorflow/lib/python3.6/site-packages/numpy/core/_methods.py", line 80, in _mean
ret = ret.dtype.type(ret / rcount)
AttributeError: 'torch.dtype' object has no attribute 'type'

Chapter07/lib/common.py - definitions for a newbie

Hello all, I am still very much at the beginning of my journey into ML/DL/RL, and I am having some issues with the terminology, specifically in Chapter07/lib/common.py. Everything up to this point is very clear, but I do not understand the meaning of b_j, tz_j, u (I assume upper), and l (I assume lower). What are the definitions of b_j, tz_j, u, and l? Can someone explain what these mean? I know these are just variable names, but I am curious.

    for atom in range(n_atoms):
        tz_j = np.minimum(Vmax, np.maximum(Vmin, rewards + (Vmin + atom * delta_z) * gamma))
        b_j = (tz_j - Vmin) / delta_z
        l = np.floor(b_j).astype(np.int64)
        u = np.ceil(b_j).astype(np.int64)
        eq_mask = u == l
        proj_distr[eq_mask, l[eq_mask]] += next_distr[eq_mask, atom]
        ne_mask = u != l
        proj_distr[ne_mask, l[ne_mask]] += next_distr[ne_mask, atom] * (u - b_j)[ne_mask]
        proj_distr[ne_mask, u[ne_mask]] += next_distr[ne_mask, atom] * (b_j - l)[ne_mask]
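
For reference, here is my reading of these variables, following the C51 projection in Bellemare et al. (2017); the concrete numbers at the end are my own illustration:

    # tz_j - the Bellman-updated atom value T z_j = r + gamma * z_j, clipped to [Vmin, Vmax]
    # b_j  - the same value expressed in atom-index units: (tz_j - Vmin) / delta_z
    # l, u - the lower (floor) and upper (ceil) neighbouring atom indices of b_j
    # The source atom's probability mass is then split between atoms l and u in proportion
    # to how close b_j lies to each of them. For example, with Vmin=-10, Vmax=10 and
    # delta_z=0.4, a target value tz_j=3.1 gives b_j=32.75, l=32, u=33, so 25% of the
    # mass goes to atom 32 and 75% to atom 33.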

A3C Bug?

Hello,

This is regarding line 104-107 in 01_a3c_data.py

loss_value_v = F.mse_loss(value_v.squeeze(-1), vals_ref_v)
log_prob_v = F.log_softmax(logits_v, dim=1)
adv_v = vals_ref_v - value_v.detach()

On line 104, we are calculating loss between value_v and vals_ref_v after squeezing value_v as its shape is (batch_size, 1) while vals_ref_v has the shape (batch_size). This is clear to me.

But on line 107, we aren't squeezing value_v before subtracting from values_ref_v and the resulting adv_v vector has the shape (batch_size, batch_size) and this also influences the shape of log_prob_actions_v at line 108.

And this adv_v calculation is used in a2c.py(chapter 10) as well.

Is this a bug? I haven't compared the code with and without squeezing value_v at line 107 but I am confused after inspecting the shapes.
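
A quick shape check of the broadcasting concern raised above (toy tensors): subtracting a (batch, 1) tensor from a (batch,) tensor broadcasts to (batch, batch), while squeezing the last dimension first keeps the result at (batch,).

    import torch

    batch_size = 5
    vals_ref_v = torch.randn(batch_size)          # shape (5,)
    value_v = torch.randn(batch_size, 1)          # shape (5, 1)
    print((vals_ref_v - value_v.detach()).shape)               # torch.Size([5, 5])
    print((vals_ref_v - value_v.squeeze(-1).detach()).shape)   # torch.Size([5])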

Chapter 6 Problem/Bug? On 02_dqn_pong.py

(python36) c:\Anaconda\Deep-Reinforcement-Learning-Hands-On-master\Chapter06>python 02_dqn_pong.py
DQN(
(conv): Sequential(
(0): Conv2d(4, 32, kernel_size=(8, 8), stride=(4, 4))
(1): ReLU()
(2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
(3): ReLU()
(4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
(5): ReLU()
)
(fc): Sequential(
(0): Linear(in_features=3136, out_features=512, bias=True)
(1): ReLU()
(2): Linear(in_features=512, out_features=6, bias=True)
)
)
762: done 1 games, mean reward -21.000, eps 0.99, speed 1040.47 f/s
1630: done 2 games, mean reward -20.500, eps 0.98, speed 993.68 f/s
Best mean reward updated -21.000 -> -20.500, model saved
2622: done 3 games, mean reward -20.000, eps 0.97, speed 949.88 f/s
Best mean reward updated -20.500 -> -20.000, model saved
3458: done 4 games, mean reward -20.000, eps 0.97, speed 928.02 f/s
4257: done 5 games, mean reward -20.200, eps 0.96, speed 904.58 f/s
5019: done 6 games, mean reward -20.333, eps 0.95, speed 908.89 f/s
5938: done 7 games, mean reward -20.286, eps 0.94, speed 914.17 f/s
6700: done 8 games, mean reward -20.375, eps 0.93, speed 937.75 f/s
7612: done 9 games, mean reward -20.444, eps 0.92, speed 884.04 f/s
8374: done 10 games, mean reward -20.500, eps 0.92, speed 866.52 f/s
9624: done 11 games, mean reward -20.273, eps 0.90, speed 865.36 f/s
Traceback (most recent call last):
File "02_dqn_pong.py", line 170, in
loss_t = calc_loss(batch, net, tgt_net, device=device)
File "02_dqn_pong.py", line 97, in calc_loss
state_action_values = net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
RuntimeError: Expected object of scalar type Long but got scalar type Int for argument #3 'index'
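
A workaround other readers on Windows have used (an assumption on my part, not the maintainer's fix): numpy's default integer type is 32-bit on Windows, so the actions array arrives as int32 while gather() expects a LongTensor index; casting explicitly avoids the error.

    import numpy as np
    import torch

    actions = np.array([0, 3, 2, 5], dtype=np.int32)       # what the buffer yields on Windows
    actions_v = torch.tensor(actions, dtype=torch.int64)    # explicit cast to Long
    q_vals = torch.randn(4, 6)
    chosen = q_vals.gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
    print(chosen.shape)                                      # torch.Size([4])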

Multi-agent D4PG

Thank you for these useful examples. I am trying to implement D4PG in multiple agents that interact with each other, share the same reward, but each agent takes its own actions. I wonder if you had any tips on how I could modify the code to achieve this. Thank you in advance.

Experiments

Hello Maxim,

your book is awesome. I gave it 5 stars on O'Reilly Safari. I modified something in Chapter04/02_frozenlake_naive and after adding it, it seems to converge:

class FrozenLakeRewardWrapper(gym.RewardWrapper):
    def __init__(self, env):
        super(FrozenLakeRewardWrapper, self).__init__(env)

    def reward(self, reward):
        if reward == 0:
            return 1
        else:
            return 2

I don't know actually what happens :-)

How can I then visualize the images / videos which are created? I uncommented the line:
env= gym.wrappers.Monitor(env, directory="mon", force=True)
and I got some files in the mon folder (e.g. openaigym.episode_batch.0.8090.stats.json), but I have no idea how to play/see them...

Could you give me a tip? Thank you so much!
Regards
Fabio

List comprehension

Is there any specific reason you're not using list comprehensions? It seems much more "pythonic" than dealing with lists, maps and lambdas.
For example, in crossentropy_cartpole you're getting rewards from a batch like this:

def filter_batch(batch, percentile):
    rewards = list(map(lambda s: s.reward, batch))

while the same could be written as

def filter_batch(batch, percentile):
    rewards = [s.reward for s in batch]

which (to me, anyway) looks cleaner and easier to understand.

RuntimeError in wob_click_play.py

Hi @Shmuma,
I am getting this error:

Traceback (most recent call last):
File "./wob_click_play.py", line 64, in
logits_v = net(obs_v)[0]
File "/home/hemanth_savasere/.conda/envs/rl_book_ch13/lib/python3.6/site-package
s/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/hemanth_savasere/Deep-Reinforcement-Learning-Hands-On/Chapter13/lib/
model_vnc.py", line 45, in forward
conv_out = self.conv(fx).view(fx.size()[0], -1)
File "/home/hemanth_savasere/.conda/envs/rl_book_ch13/lib/python3.6/site-package
s/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/hemanth_savasere/.conda/envs/rl_book_ch13/lib/python3.6/site-package
s/torch/nn/modules/container.py", line 91, in forward
input = module(input)
File "/home/hemanth_savasere/.conda/envs/rl_book_ch13/lib/python3.6/site-package
s/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/hemanth_savasere/.conda/envs/rl_book_ch13/lib/python3.6/site-package
s/torch/nn/modules/conv.py", line 301, in forward
self.padding, self.dilation, self.groups)
RuntimeError: expected stride to be a single integer value or a list of 1 values t
o match the convolution dimensions, but got stride=[5, 5]

CUDA performance of training code, Chapter 8, much lower than expected

Hello

Firstly thanks for the great book, I have learned a great deal from it. This is my first foray into the world of machine learning and I am having a ball.

I am attempting to get the stock trading sample running from Chapter 8. Initially it seems to run but I am finding that the performance with CUDA enabled is much worse than I had expected.

My system specs are as follows:
X6 Phenom II 1055t
GTX 980 4GB watercooled
24GB RAM
Windows 10 64

Software:
Anaconda 3
Python 3.6.7
PyCharm 2018.3

An initial pass through train_model.py without CUDA enabled yields the following performance:

(screenshot: pytorch-4-1-cpu)

Note the GPU is idle as expected, at about 1% utilization.
(screenshot: pytorch-4-1-cpu-temp)

Task Manager shows CPU at 100% as expected, GPU at 0%
(screenshot: pytorch-4-1-cpu-taskman)

After enabling CUDA with --cuda switch, performance is only marginally better. Note that the GPU is at about 2% load according to Task Manager, or 9% load according to GPU temp, and temperature has risen a whole 1-2 degrees C on average (if I run FurMark the GPU is at 99% and the temperature quickly rises from 33 to maybe 48 Deg C, on water cooling). It's fast up until the buffer is populated and training starts, then afterwards it takes about 15 seconds to spit out one line, or 100 epochs.

(screenshot: pytorch-4-1-gpu)

GPU is barely doing anything,
(screenshot: pytorch-4-1-gpu-temp)

Task Manager says GPU is 2% utilized. CPU has a few peaks but averages about 30%

(screenshot: pytorch-4-1-gpu-taskman)

It appears that the mode is changing from CPU to GPU, since CPU usage goes down and GPU usage goes up. But it seems that the GPU only improved the outcome by maybe 60%, which seems almost negligible for a GPU like the GTX 980, which has 2048 shaders, 128 texture units, and 64 ROPs. And clearly, at 2% utilization, the GPU is not doing a great deal of acceleration.

I assume I must have done something wrong. I have been trying different variations of packages and settings for a few days, including:

  • Trying different Python versions, 3.6, 3.7 (some minor versions would not play with PyTorch at all, eg. 3.6.8)
  • Installing the CUDA 9 Windows installer from NVidia as a system wide installation
  • Trying different video drivers, the latest two, and also one that was installed with the CUDA installer which was dated 2017
  • Trying a bunch of different package versions of various things, cudatoolkit, PyTorch etc.
  • Also I did some tests to ensure that CUDA was working
import torch
torch.cuda.current_device()
Out[3]: 0
torch.cuda.device(0)
Out[4]: <torch.cuda.device at 0x153b39c5780>
torch.cuda.device_count()
Out[5]: 1
torch.cuda.get_device_name(0)
Out[6]: 'GeForce GTX 980'

My current package configuration is as follows:
Name Version Build Channel

anaconda-client 1.7.2 py36_0
anaconda-navigator 1.9.6 py36_0
anaconda-project 0.8.2 py36_0
cuda90 1.0 0 pytorch
cudatoolkit 9.0 1
cudnn 7.3.1 cuda9.0_0
gym 0.11.0 pypi_0 pypi
matplotlib 3.0.2 py36hc8f65d3_0
numpy 1.15.4 py36h19fb1c0_0
opencv-python 4.0.0.21 pypi_0 pypi
pip 19.0.1 py36_0
ptan 0.3 pypi_0 pypi
python 3.6.7 h9f7ef89_2
pytorch 0.4.1 py36_cuda90_cudnn7he774522_1 pytorch
scipy 1.2.0 py36h29ff71c_0
tensorboard 1.12.2 py36h33f27b4_0
tensorboardx 1.6 pypi_0 pypi
tensorflow 1.12.0 gpu_py36ha5f9131_0
tensorflow-base 1.12.0 gpu_py36h6e53903_0
tensorflow-gpu 1.12.0 pypi_0 pypi
torchvision 0.2.1 py_2 pytorch

Hmm, that borked my formatting; here's an image:

(screenshot: package-versions)

Any thoughts about what I might have done wrong here would be much appreciated. I'm still just getting a handle on Python and Deep RL.

Thanks for your time!

Chris

Chapter 18: difference from original paper

When I read the original paper by D. Silver et al., I came across 2 differences between the paper and your code.

  1. The original paper uses L2 regularization, but you don't (train.py, line 64).
  2. Your code retains all sub-trees during an episode, but the original paper says that sub-trees are retained only if they are children of the selected action.

If you have any reasons for these, please let us know.

monitor not working in Chapter04/01_cartpole.py

I uncommented line 79 in Chapter04/01_cartpole.py to record video:
env = gym.wrappers.Monitor(env, directory="mon", force=True)

However, when the program terminates, it throws the following exception:

Exception ignored in: <bound method Viewer.__del__ of <gym.envs.classic_control.rendering.Viewer object at 0x000000000507DA58>>
Traceback (most recent call last):
File "E:\dlsoft\lib\site-packages\gym\envs\classic_control\rendering.py", line 143, in __del__
File "E:\dlsoft\lib\site-packages\gym\envs\classic_control\rendering.py", line 62, in close
File "E:\dlsoft\lib\site-packages\pyglet\window\win32\__init__.py", line 305, in close
File "E:\dlsoft\lib\site-packages\pyglet\window\__init__.py", line 770, in close
ImportError: sys.meta_path is None, Python is likely shutting down

I'm working on Windows in Anaconda. Any ideas? Thanks!

tensorboard error

AttributeError: module 'tensorflow.python.training.checkpointable' has no attribute 'CheckpointableBase'

I get this error at runtime. Please help.

Calculation of Q in Chapter 6 Pong

def calc_loss(batch, net, tgt_net, device="cpu"):
    states, actions, rewards, dones, next_states = batch
    states_v = torch.tensor(states).to(device)
    next_states_v = torch.tensor(next_states).to(device)
    actions_v = torch.tensor(actions).to(device)
    rewards_v = torch.tensor(rewards).to(device)
    done_mask = torch.ByteTensor(dones).to(device)
    state_action_values = net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
    next_state_values = tgt_net(next_states_v).max(1)[0]
    next_state_values[done_mask] = 0.0
    next_state_values = next_state_values.detach()
    expected_state_action_values = next_state_values * GAMMA + rewards_v
    return nn.MSELoss()(state_action_values, expected_state_action_values)

In Agent.play_step:
an Experience is created that records the current state (observation), the action taken, the transition reward, and the next state (observation).

In calc_loss:
the Experience batch is unpacked and converted to tensors. Then the next state (observation) is passed to the target neural net (tgt_net) to get the (predicted) scores for all actions in the next state, and the (predicted) maximum score for the next state is determined (next_state_values).

Now, here is where I'm confused: instead of calculating Q for the current state as
reward at the current state + gamma (discount) * max Q at the new state,
it looks (to me) like line 102 instead calculates
max Q at the new state + gamma (discount) * reward at the current state.
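
For what it's worth, operator precedence makes that line equivalent to the usual Bellman target: multiplication binds tighter than addition, so next_state_values * GAMMA + rewards_v is the same as rewards_v + GAMMA * next_state_values. A quick check with toy tensors:

    import torch

    GAMMA = 0.99
    next_state_values = torch.tensor([1.0, 2.0])
    rewards_v = torch.tensor([0.5, -0.5])
    a = next_state_values * GAMMA + rewards_v
    b = rewards_v + GAMMA * next_state_values
    print(torch.allclose(a, b))   # True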

Chapter 11: 02_a3c_grad.py

Hi,

When we collect gradients in the gradient buffer between lines 136 and 140, what is the reason for the new tgt_grad variable?

For example, why can we not simply replace this with:

        if grad_buffer is None:
            grad_buffer = train_entry
        else:
            grad_buffer += train_entry

Incidentally, with the original code I could not get convergence, but with the above everything worked fine. (I only tried once, so this could just be a lucky seed.)

Cheers,
Jamie
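
A note of caution on that simplification (toy lists standing in for the gradient lists): if grad_buffer and train_entry are Python lists of arrays, `+=` concatenates the lists rather than summing gradients element-wise, so the two variants are not equivalent.

    import numpy as np

    grad_buffer = [np.ones(2), np.ones(2)]
    train_entry = [np.full(2, 3.0), np.full(2, 3.0)]
    grad_buffer += train_entry
    print(len(grad_buffer))   # 4 -- the list grew; no element-wise sum happened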

Saving/Running Model from 08_dqn_rainbow.py

Hi again,
I was hoping you could help me modify the code of 08_dqn_rainbow.py to save the model, so that it can be replayed through dqn_play.py.

Here's what I have so far, it fails on:
state, reward, done, _ = env.step(action)

MODIFIED CODE OF (dqn_play.py from chapter 6)

import torch.nn as nn

Vmax = 10
Vmin = -10
N_ATOMS = 51
DELTA_Z = (Vmax - Vmin) / (N_ATOMS - 1)

class RainbowDQN(nn.Module):

if name == "main":
net = RainbowDQN(env.observation_space.shape, env.action_space.n)

MODIFIED CODE OF (common.py from chapter 7)

class RewardTracker:
    def __init__(self, writer, stop_reward, env_name, net):
        self.writer = writer
        self.stop_reward = stop_reward
        self.best_mean_reward = None
        self.env_name = env_name
        self.net = net

    def reward(self, reward, frame, epsilon=None):
        self.total_rewards.append(reward)
        speed = (frame - self.ts_frame) / (time.time() - self.ts)
        self.ts_frame = frame
        self.ts = time.time()
        mean_reward = np.mean(self.total_rewards[-100:])
        epsilon_str = "" if epsilon is None else ", eps %.2f" % epsilon
        print("%d: done %d games, mean reward %.3f, speed %.2f f/s%s" % (
            frame, len(self.total_rewards), mean_reward, speed, epsilon_str
        ))
        sys.stdout.flush()
        if epsilon is not None:
            self.writer.add_scalar("epsilon", epsilon, frame)
        self.writer.add_scalar("speed", speed, frame)
        self.writer.add_scalar("reward_100", mean_reward, frame)
        self.writer.add_scalar("reward", reward, frame)
        if self.best_mean_reward is None or self.best_mean_reward < mean_reward:
            torch.save(self.net.state_dict(), self.env_name + "-best.dat")
            if self.best_mean_reward is not None:
                print("Best mean reward updated %.3f -> %.3f, model saved" % (self.best_mean_reward, mean_reward))
            self.best_mean_reward = mean_reward
        if mean_reward > self.stop_reward:
            print("Solved in %d frames!" % frame)
            return True
        return False

with common.RewardTracker(writer, params['stop_reward'], params['env_name'], net) as reward_tracker:

bug on indexing?

    if eq_dones.any():
        proj_distr[eq_dones, l] = 1.0
    ne_mask = u != l
    ne_dones = dones_mask.copy()
    ne_dones[dones_mask] = ne_mask
    if ne_dones.any():
        proj_distr[ne_dones, l] = (u - b_j)[ne_mask]
        proj_distr[ne_dones, u] = (b_j - l)[ne_mask]

Hello Maxim,

Your book is really helpful for us to immediately implement D4PG, but I am wondering if the indices for l and u should be modified as follows:

proj_distr[eq_dones, l[eq_mask]] = 1.0

proj_distr[ne_dones, l[ne_mask]] = (u - b_j)[ne_mask]
proj_distr[ne_dones, u[ne_mask]] = (b_j - l)[ne_mask]

Thanks for any clarification.

Chapter08: exploration in the validation procedure, is it an issue?

Hello Max ...
Great work that allows us to dig into the RL world.

In the Chapter08 code:

In the validation.py procedure I have noticed that epsilon is kept non-zero (it defaults to 0.2), which means that the policy is not greedy but rather epsilon-greedy;
this means that 2 out of 10 actions are random!

RL theory says that it should only be greedy (epsilon=0).
Is this an error, or was it done deliberately?

How to reload the models in Chapter07?

Hi Maxim,

I would find something like Chapter06's 03_dqn_play.py really useful in Chapter07.
I modified the RewardTracker in order to save the best model:

    if self.best_mean_reward is None or self.best_mean_reward < mean_reward:
        torch.save(self.net.state_dict(), "best.dat")
        if self.best_mean_reward is not None:
            print("Best mean reward updated %.3f -> %.3f, model saved" % (self.best_mean_reward, mean_reward))
        self.best_mean_reward = mean_reward

and it seems to work. But I have problems reloading the net with the saved weights and working with it... I tried something like the code attached...

09_dqn_play.py.txt

I doubt it is right :-( Do you perhaps already have a piece of code at hand, or could you give me a tip?

Thank you very much in advance!
Regards
Fabio
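
A hedged sketch of reloading saved weights for playing (a placeholder model, not the repository's dqn_play.py): rebuild the same architecture, load the state_dict, switch to eval mode, and act greedily with argmax over the Q-values.

    import torch
    import torch.nn as nn

    net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # placeholder net
    torch.save(net.state_dict(), "best.dat")                             # stands in for the prior save
    net.load_state_dict(torch.load("best.dat", map_location="cpu"))
    net.eval()
    obs_v = torch.randn(1, 4)            # one observation with a batch dimension
    with torch.no_grad():
        action = net(obs_v).max(1)[1].item()
    print(action)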

chapter09 PG_baseline_cartpole

Hi,

Thank you for the book that brought me into the world of reinforcement learning. Though there is quite a lot of material available on the internet, it is still quite hard for a beginner to catch the whole idea of reinforcement learning. Your book provides a systematic path for beginners like me.

My question is on your Chapter 09 policy gradient baseline code (CartPole). I would like to have the honour of getting your further advice.

In the PG code, after the agent interacts with the environment, I noticed that you just record the S, A, R, S' and discard the output value of the PGN net for every time step. In that case, at the training stage, the code below is needed for the loss function:

states_v = torch.FloatTensor(batch_states)
logits_v = net(states_v)

My question is whether I could store net(states_v) the first time the agent interacts with the environment. Then the stored logits_v could be used for computing the loss function instead of computing logits_v again. In that case, we could save one round of forward computation.

The reason I raised this question is that PyTorch spends a lot of time converting CPU tensors to GPU. Therefore, I wondered if I could avoid that step to accelerate the whole computation.

However, I am just a beginner with only 2 months of learning from your book, so I am not confident about this.

Your advice is highly appreciated!

Best Regards,

Charles

Chapter06/02_dqn_pong.py can't learn properly after modifying the ScaledFloatFrame

Hi, I am a reader of your book and recently met a very weird issue. I tried to modify your ScaledFloatFrame preprocessing; the only two modifications I made are:

  • Firstly, in Chapter06/lib/wrappers.py, I commented out the ScaledFloatFrame:
def make_env(env_name):
    env = gym.make(env_name)
    env = MaxAndSkipEnv(env)
    env = FireResetEnv(env)
    env = ProcessFrame84(env)
    env = ImageToPyTorch(env)
    env = BufferWrapper(env, 4)
    return env
  • Secondly, in Chapter06/lib/dqn_model.py, I performed the scaling in the network model:
    def forward(self, x):
        fx = x.float() / 255.0
        conv_out = self.conv(fx).view(fx.size()[0], -1)
        return self.fc(conv_out)

However, I found that DQN can't learn properly on Pong. Results are shown in the attached reward plot (dqn_tensor_problem).

Any idea why this happens? I just really don't know where I made the mistake...

Possible bug in Chapter08/lib/data.py

Hi.

In line 33 it checks that prev_vals is not None, but then prev_vals is not used:

if fix_open_price and prev_vals is not None:
                ppo, pph, ppl, ppc, ppv = vals

I think line 34 should be:
ppo, pph, ppl, ppc, ppv = prev_vals

Thanks.

Chapter 3 03_atari_gan.py

Hi,
Can you show how to get the sample images produced by the generator network, like in Figure 7 of Chapter 3?
Many thanks,
Quang

Chapter 14, PDF equation has action and mu reversed in 02_train_a2c.py?

The code here: train a2c chapter 14

Has this snippet:

def calc_logprob(mu_v, var_v, actions_v):
    p1 = - ((mu_v - actions_v) ** 2) / (2*var_v.clamp(min=1e-3))
    p2 = - torch.log(torch.sqrt(2 * math.pi * var_v))
    return p1 + p2

but the textbook shows the equation for p1 to be:

(book equation: p1 = -(x - μ)² / (2σ²))

where it is (x - u), not (u - x), assuming x is the action and u is mu, the mean.

Is this an error in the implementation?
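
Since the term is squared, the two orderings are mathematically identical; a short check (my own note, not from the book):

    \[
      (x - \mu)^2 = \bigl(-(\mu - x)\bigr)^2 = (\mu - x)^2
      \qquad\Longrightarrow\qquad
      -\frac{(\mu - x)^2}{2\sigma^2} = -\frac{(x - \mu)^2}{2\sigma^2}
    \]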

Chapter 6 02_dqn_pong.py RuntimeError

I'm running this on Windows 10 (installed the atari lib with pip install -U git+https://github.com/Kojoley/atari-py.git)
9765: done 10 games, mean reward -20.100, eps 0.90, speed 1036.60 f/s
Traceback (most recent call last):
File "02_dqn_pong.py", line 170, in
loss_t = calc_loss(batch, net, tgt_net, device=device)
File "02_dqn_pong.py", line 97, in calc_loss
state_action_values = net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
RuntimeError: Expected object of type torch.cuda.LongTensor but found type torch.cuda.IntTensor for argument #3 'index'

importing env Chapter 13

Looks like no one else had this issue with Chapter 13. Using the supplied yml environment, I'm seeing a packages-not-found error:

ResolvePackageNotFound:

  • pycparser==2.18=py36hf9f622e_1
  • libgfortran-ng==7.2.0=hdf63c60_3
  • cudnn==7.0.5=cuda8.0_0
  • numpy==1.14.2=py36hdbf6ddf_1
  • jpeg==9b=h024ee3a_2
  • readline==7.0=ha6073c6_4
  • mkl_fft==1.0.1=py36h3010b51_0
  • tk==8.6.7=hc745277_3
  • cffi==1.11.5=py36h9745a5d_0
  • mkl_random==1.0.1=py36h629b387_0
  • pillow==5.0.0=py36h3deb7b8_0
  • libffi==3.2.1=hd88cf55_4
  • ncurses==6.0=h9df7e31_2
  • torchvision==0.2.1=py36_1
  • python==3.6.5=hc3d631a_0
  • xz==5.2.3=h55aa19d_2
  • sqlite==3.22.0=h1bed415_0
  • libpng==1.6.34=hb9fc6fc_0
  • libedit==3.1=heed3624_0
  • pytorch==0.4.0=py36_cuda8.0.61_cudnn7.1.2_1
  • libtiff==4.0.9=h28f6b97_0
  • zlib==1.2.11=ha838bed_2
  • libgcc-ng==7.2.0=hdf63c60_3
  • libstdcxx-ng==7.2.0=hdf63c60_3
  • freetype==2.8=hab7d2ae_1
  • openssl==1.0.2o=h20670df_0
  • six==1.11.0=py36h372c433_1
