Simple Reinforcement learning tutorials, 莫烦Python 中文AI教学

Home Page: https://mofanpy.com/tutorials/machine-learning/reinforcement-learning/

License: MIT License

Python 100.00%

a3c actor-critic asynchronous-advantage-actor-critic ddpg deep-deterministic-policy-gradient deep-q-network double-dqn dqn dueling-dqn machine-learning policy-gradient ppo prioritized-replay proximal-policy-optimization q-learning reinforcement-learning sarsa sarsa-lambda tensorflow-tutorials tutorial

reinforcement-learning-with-tensorflow's People

Contributors

Stargazers

Watchers

Forkers

hanahimi hubfire shaomai00 felixflyfly28 gdf1 lgselite andyxzq k0e1n sylvia1664 albererre saadmahboob antsui lu-yi-hsun maphysart cuiweigeng littlecomma tonyan charlestoty sabrinashurong arthur814 yummylee23 gucasdongzi xxxyyyqqq12345 dahaihe weinima12 zhyhhust zhijs kmusialik jcajax coocoky ferankliu biranchi2018 celinecheng1029 ucaslyc pcchenxi jenniferdeng gitxuy bkhk06 dannylee1991 alicedudu pythonai zhouweiti ldh127 437072341 tanghsihsi johndpope hannibalqsp huzaifafaruqui peterouzh woaiwodib107 superashan wujianhui0504 lxw4939 youclee brianhu2006 jamesraynor67 helloworldwinning binderwang nuclearwang jkrj yr11134former ziqi-zhang hellokitty8 wanggangalanhe yiqinggit lxwithgod zhoushaojun iguazi yuchaoyin jangoai hchchhc quantumiracle clarkzhao huachunwang peakulorain jeff60907 jalused ml-lab hhhhhui jcwen kongmo thuxugang cxiang26 declanzane gagang123 huangsg1 kyoyo zmdfwh jangoraspberrypi isaactl wolray xuexixuexihaha nalanyu2000 vineet-mehta zlte2011 sxdkxgwan mohbattharani wkfwkf danielzhang2018 dominjune

reinforcement-learning-with-tensorflow's Issues

Question

Thanks for the sample code - I had a question on this
q_table.ix[S, A] += ALPHA * (q_target - q_predict) # update

Why subtract q_predict from q_target. The q_target should be good enough. I am confused about the use of q_predict in the above formula.
Should something like this not suffice?
q_table.ix[S, A] += ALPHA * (q_target)

argmax() is deprecated, use idxmax() instead

It's a minor problem.
In "contents/1_command_line_reinforcement_learning/treasure_on_right.py", the first use of argmax() starts to produce warning that argmax() is deprecated. The program can still continue though. You might want to replace it with idxmax(), and instances in other files.

Problem in a3c discrete implements about encourage exploration

log_prob = tf.reduce_sum(tf.log(self.a_prob) * tf.one_hot(self.a_his, N_A, dtype=tf.float32), axis=1, keep_dims=True)

exp_v = log_prob * tf.stop_gradient(td)
entropy = -tf.reduce_sum(self.a_prob * tf.log(self.a_prob + 1e-5),
                                             axis=1, keep_dims=True)  # encourage exploration
self.exp_v = ENTROPY_BETA * entropy + exp_v
self.a_loss = tf.reduce_mean(-self.exp_v)

We want to encourage exploration so actually we try to increase the entropy of action distribution, so I think "self.exp_v" should be equal to "-ENTROPY_BETA * entropy + exp_v"?

AC_CartPole.py may need to consider the distance between cart to the center

Hi, I think AC_CartPole.py may need to consider the distance between cart to the center.

Simply adding r -= abs(s_[0]) * .1 to the reward.

AC Cartpole: I think the better loss function is this one.

cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=self.acts_logits, logits=self.a, name='ce')
loss = tf.reduce_sum(tf.multiply(cross_entropy, self.td_error))

stochastic policy for continuous control

Hello, I found that in continuous control case (A3C/PPO), the code use neural network to generate Gaussian distribution, then sample it from the distribution, like the following code:

log_prob = self.normal_dist.log_prob(self.a)  # loss without advantage
self.exp_v = log_prob * self.td_error  # advantage (TD_error) guided loss
# Add cross entropy cost to encourage exploration
self.exp_v += 0.01*self.normal_dist.entropy()

In the tensorflow documents, the normal_dist.log_prob represents the log probability density function (not probability), and therefore the normal_dist.prob could be greater than 1.0, and log_prob could be positive. In that case, exp_v might be positive (let's assume entropy loss is 0). In that case, the loss become negative value. I know this is most common way for continuous action, but apparently, from above analytics, it seems problematic. I was wondering if you have any insights to explain it? Thank you!

please add # coding=utf-8 in the beginning of every python file

or i get
SyntaxError: Non-ASCII character

Why there is stop_gradient for td?

Reinforcement-learning-with-tensorflow/contents/10_A3C/A3C_discrete_action.py

Line 63 in b484df7

exp_v = log_prob * tf.stop_gradient(td)

In this line, I can't understand why there's stop_gradient before td

为啥要对td做反向截断呢？
我看了些别人的A3C代码，都没有这么做。而且

Reinforcement-learning-with-tensorflow/experiments/Robot_arm/A3C.py

Line 82 in b484df7

exp_v = log_prob * td

robot arm中 stop_gradient 消失了。真的很不理解。望赐教。谢谢！

Car example - does not converge

I've been experimenting with your car example. The success of the training seems to depend a lot on when the training stops.

See the log below as an example. It reaches a good state of 600 steps by EP 210 and the car is going around the block. If the training stops now then the model works well when LOAD = True. But at EP 232 the car is crashing again. So if we happen to stop training on EP 232 we will not get a successful model saved.

If you leave the training running you will continually see the steps reach 600 and then drop back down and build up again. Is this behaviour expected? Is there a regularisation issue?

Ep: 0 | Steps: 34 | Explore: 2.00
Ep: 1 | Steps: 50 | Explore: 2.00
Ep: 2 | Steps: 57 | Explore: 2.00
Ep: 3 | Steps: 17 | Explore: 2.00
Ep: 4 | Steps: 56 | Explore: 2.00
Ep: 5 | Steps: 59 | Explore: 2.00
Ep: 6 | Steps: 59 | Explore: 2.00
Ep: 7 | Steps: 21 | Explore: 2.00
Ep: 8 | Steps: 61 | Explore: 2.00
Ep: 9 | Steps: 58 | Explore: 2.00
Ep: 10 | Steps: 41 | Explore: 2.00
Ep: 11 | Steps: 26 | Explore: 2.00
Ep: 12 | Steps: 58 | Explore: 2.00
Ep: 13 | Steps: 58 | Explore: 2.00
Ep: 14 | Steps: 29 | Explore: 2.00
Ep: 15 | Steps: 60 | Explore: 2.00
Ep: 16 | Steps: 59 | Explore: 2.00
Ep: 17 | Steps: 59 | Explore: 2.00
Ep: 18 | Steps: 54 | Explore: 2.00
Ep: 19 | Steps: 57 | Explore: 2.00
Ep: 20 | Steps: 62 | Explore: 2.00
Ep: 21 | Steps: 57 | Explore: 2.00
Ep: 22 | Steps: 58 | Explore: 2.00
Ep: 23 | Steps: 31 | Explore: 2.00
Ep: 24 | Steps: 59 | Explore: 2.00
Ep: 25 | Steps: 58 | Explore: 2.00
Ep: 26 | Steps: 32 | Explore: 2.00
Ep: 27 | Steps: 58 | Explore: 2.00
Ep: 28 | Steps: 41 | Explore: 2.00
Ep: 29 | Steps: 30 | Explore: 2.00
Ep: 30 | Steps: 40 | Explore: 2.00
Ep: 31 | Steps: 39 | Explore: 2.00
Ep: 32 | Steps: 57 | Explore: 2.00
Ep: 33 | Steps: 26 | Explore: 2.00
Ep: 34 | Steps: 29 | Explore: 2.00
Ep: 35 | Steps: 59 | Explore: 2.00
Ep: 36 | Steps: 59 | Explore: 2.00
Ep: 37 | Steps: 37 | Explore: 2.00
Ep: 38 | Steps: 59 | Explore: 2.00
Ep: 39 | Steps: 20 | Explore: 2.00
Ep: 40 | Steps: 36 | Explore: 2.00
Ep: 41 | Steps: 144 | Explore: 2.00
Ep: 42 | Steps: 37 | Explore: 2.00
Ep: 43 | Steps: 57 | Explore: 2.00
Ep: 44 | Steps: 61 | Explore: 2.00
Ep: 45 | Steps: 58 | Explore: 2.00
Ep: 46 | Steps: 41 | Explore: 2.00
Ep: 47 | Steps: 61 | Explore: 2.00
Ep: 48 | Steps: 53 | Explore: 2.00
Ep: 49 | Steps: 60 | Explore: 2.00
Ep: 50 | Steps: 24 | Explore: 2.00
Ep: 51 | Steps: 58 | Explore: 2.00
Ep: 52 | Steps: 58 | Explore: 2.00
Ep: 53 | Steps: 24 | Explore: 2.00
Ep: 54 | Steps: 39 | Explore: 2.00
Ep: 55 | Steps: 58 | Explore: 2.00
Ep: 56 | Steps: 23 | Explore: 2.00
Ep: 57 | Steps: 31 | Explore: 2.00
Ep: 58 | Steps: 65 | Explore: 2.00
Ep: 59 | Steps: 37 | Explore: 2.00
Ep: 60 | Steps: 24 | Explore: 2.00
Ep: 61 | Steps: 33 | Explore: 2.00
Ep: 62 | Steps: 14 | Explore: 2.00
Ep: 63 | Steps: 57 | Explore: 2.00
Ep: 64 | Steps: 60 | Explore: 2.00
Ep: 65 | Steps: 58 | Explore: 2.00
Ep: 66 | Steps: 21 | Explore: 2.00
Ep: 67 | Steps: 59 | Explore: 2.00
Ep: 68 | Steps: 20 | Explore: 2.00
Ep: 69 | Steps: 20 | Explore: 2.00
Ep: 70 | Steps: 64 | Explore: 2.00
Ep: 71 | Steps: 135 | Explore: 2.00
Ep: 72 | Steps: 21 | Explore: 2.00
Ep: 73 | Steps: 34 | Explore: 2.00
Ep: 74 | Steps: 33 | Explore: 2.00
Ep: 75 | Steps: 65 | Explore: 2.00
Ep: 76 | Steps: 58 | Explore: 2.00
Ep: 77 | Steps: 57 | Explore: 2.00
Ep: 78 | Steps: 22 | Explore: 2.00
Ep: 79 | Steps: 24 | Explore: 2.00
Ep: 80 | Steps: 59 | Explore: 2.00
Ep: 81 | Steps: 31 | Explore: 2.00
Ep: 82 | Steps: 31 | Explore: 2.00
Ep: 83 | Steps: 21 | Explore: 2.00
Ep: 84 | Steps: 57 | Explore: 2.00
Ep: 85 | Steps: 28 | Explore: 2.00
Ep: 86 | Steps: 21 | Explore: 2.00
Ep: 87 | Steps: 43 | Explore: 2.00
Ep: 88 | Steps: 19 | Explore: 2.00
Ep: 89 | Steps: 60 | Explore: 2.00
Ep: 90 | Steps: 59 | Explore: 2.00
Ep: 91 | Steps: 41 | Explore: 2.00
Ep: 92 | Steps: 53 | Explore: 2.00
Ep: 93 | Steps: 75 | Explore: 2.00
Ep: 94 | Steps: 55 | Explore: 2.00
Ep: 95 | Steps: 58 | Explore: 2.00
Ep: 96 | Steps: 50 | Explore: 2.00
Ep: 97 | Steps: 58 | Explore: 2.00
Ep: 98 | Steps: 44 | Explore: 2.00
Ep: 99 | Steps: 21 | Explore: 2.00
Ep: 100 | Steps: 63 | Explore: 2.00
Ep: 101 | Steps: 33 | Explore: 2.00
Ep: 102 | Steps: 27 | Explore: 2.00
Ep: 103 | Steps: 24 | Explore: 2.00
Ep: 104 | Steps: 58 | Explore: 2.00
Ep: 105 | Steps: 59 | Explore: 2.00
Ep: 106 | Steps: 32 | Explore: 2.00
Ep: 107 | Steps: 29 | Explore: 2.00
Ep: 108 | Steps: 82 | Explore: 1.92
Ep: 109 | Steps: 60 | Explore: 1.87
Ep: 110 | Steps: 25 | Explore: 1.84
Ep: 111 | Steps: 59 | Explore: 1.79
Ep: 112 | Steps: 41 | Explore: 1.75
Ep: 113 | Steps: 22 | Explore: 1.73
Ep: 114 | Steps: 36 | Explore: 1.70
Ep: 115 | Steps: 83 | Explore: 1.63
Ep: 116 | Steps: 19 | Explore: 1.62
Ep: 117 | Steps: 23 | Explore: 1.60
Ep: 118 | Steps: 30 | Explore: 1.58
Ep: 119 | Steps: 21 | Explore: 1.56
Ep: 120 | Steps: 29 | Explore: 1.54
Ep: 121 | Steps: 26 | Explore: 1.52
Ep: 122 | Steps: 31 | Explore: 1.49
Ep: 123 | Steps: 32 | Explore: 1.47
Ep: 124 | Steps: 92 | Explore: 1.40
Ep: 125 | Steps: 20 | Explore: 1.39
Ep: 126 | Steps: 21 | Explore: 1.38
Ep: 127 | Steps: 28 | Explore: 1.36
Ep: 128 | Steps: 23 | Explore: 1.34
Ep: 129 | Steps: 21 | Explore: 1.33
Ep: 130 | Steps: 43 | Explore: 1.30
Ep: 131 | Steps: 24 | Explore: 1.28
Ep: 132 | Steps: 19 | Explore: 1.27
Ep: 133 | Steps: 19 | Explore: 1.26
Ep: 134 | Steps: 36 | Explore: 1.24
Ep: 135 | Steps: 29 | Explore: 1.22
Ep: 136 | Steps: 26 | Explore: 1.20
Ep: 137 | Steps: 22 | Explore: 1.19
Ep: 138 | Steps: 30 | Explore: 1.17
Ep: 139 | Steps: 27 | Explore: 1.16
Ep: 140 | Steps: 52 | Explore: 1.13
Ep: 141 | Steps: 23 | Explore: 1.11
Ep: 142 | Steps: 29 | Explore: 1.10
Ep: 143 | Steps: 19 | Explore: 1.09
Ep: 144 | Steps: 22 | Explore: 1.08
Ep: 145 | Steps: 19 | Explore: 1.07
Ep: 146 | Steps: 20 | Explore: 1.05
Ep: 147 | Steps: 23 | Explore: 1.04
Ep: 148 | Steps: 72 | Explore: 1.01
Ep: 149 | Steps: 30 | Explore: 0.99
Ep: 150 | Steps: 21 | Explore: 0.98
Ep: 151 | Steps: 18 | Explore: 0.97
Ep: 152 | Steps: 35 | Explore: 0.95
Ep: 153 | Steps: 21 | Explore: 0.94
Ep: 154 | Steps: 28 | Explore: 0.93
Ep: 155 | Steps: 40 | Explore: 0.91
Ep: 156 | Steps: 47 | Explore: 0.89
Ep: 157 | Steps: 22 | Explore: 0.88
Ep: 158 | Steps: 21 | Explore: 0.87
Ep: 159 | Steps: 64 | Explore: 0.85
Ep: 160 | Steps: 39 | Explore: 0.83
Ep: 161 | Steps: 66 | Explore: 0.80
Ep: 162 | Steps: 34 | Explore: 0.79
Ep: 163 | Steps: 33 | Explore: 0.78
Ep: 164 | Steps: 59 | Explore: 0.75
Ep: 165 | Steps: 57 | Explore: 0.73
Ep: 166 | Steps: 58 | Explore: 0.71
Ep: 167 | Steps: 36 | Explore: 0.70
Ep: 168 | Steps: 28 | Explore: 0.69
Ep: 169 | Steps: 46 | Explore: 0.67
Ep: 170 | Steps: 36 | Explore: 0.66
Ep: 171 | Steps: 58 | Explore: 0.64
Ep: 172 | Steps: 31 | Explore: 0.63
Ep: 173 | Steps: 22 | Explore: 0.63
Ep: 174 | Steps: 22 | Explore: 0.62
Ep: 175 | Steps: 24 | Explore: 0.61
Ep: 176 | Steps: 24 | Explore: 0.60
Ep: 177 | Steps: 26 | Explore: 0.60
Ep: 178 | Steps: 20 | Explore: 0.59
Ep: 179 | Steps: 21 | Explore: 0.58
Ep: 180 | Steps: 24 | Explore: 0.58
Ep: 181 | Steps: 24 | Explore: 0.57
Ep: 182 | Steps: 21 | Explore: 0.56
Ep: 183 | Steps: 19 | Explore: 0.56
Ep: 184 | Steps: 24 | Explore: 0.55
Ep: 185 | Steps: 20 | Explore: 0.55
Ep: 186 | Steps: 24 | Explore: 0.54
Ep: 187 | Steps: 15 | Explore: 0.54
Ep: 188 | Steps: 15 | Explore: 0.53
Ep: 189 | Steps: 17 | Explore: 0.53
Ep: 190 | Steps: 19 | Explore: 0.52
Ep: 191 | Steps: 21 | Explore: 0.52
Ep: 192 | Steps: 17 | Explore: 0.51
Ep: 193 | Steps: 19 | Explore: 0.51
Ep: 194 | Steps: 18 | Explore: 0.50
Ep: 195 | Steps: 20 | Explore: 0.50
Ep: 196 | Steps: 23 | Explore: 0.49
Ep: 197 | Steps: 20 | Explore: 0.49
Ep: 198 | Steps: 25 | Explore: 0.48
Ep: 199 | Steps: 20 | Explore: 0.48
Ep: 200 | Steps: 22 | Explore: 0.47
Ep: 201 | Steps: 18 | Explore: 0.47
Ep: 202 | Steps: 20 | Explore: 0.46
Ep: 203 | Steps: 27 | Explore: 0.46
Ep: 204 | Steps: 19 | Explore: 0.45
Ep: 205 | Steps: 21 | Explore: 0.45
Ep: 206 | Steps: 24 | Explore: 0.44
Ep: 207 | Steps: 62 | Explore: 0.43
Ep: 208 | Steps: 61 | Explore: 0.42
Ep: 209 | Steps: 62 | Explore: 0.40
Ep: 210 | Steps: 600 | Explore: 0.30
Ep: 211 | Steps: 600 | Explore: 0.22
Ep: 212 | Steps: 600 | Explore: 0.16
Ep: 213 | Steps: 600 | Explore: 0.12
Ep: 214 | Steps: 600 | Explore: 0.10
Ep: 215 | Steps: 600 | Explore: 0.10
Ep: 216 | Steps: 600 | Explore: 0.10
Ep: 217 | Steps: 600 | Explore: 0.10
Ep: 218 | Steps: 600 | Explore: 0.10
Ep: 219 | Steps: 600 | Explore: 0.10
Ep: 220 | Steps: 600 | Explore: 0.10
Ep: 221 | Steps: 600 | Explore: 0.10
Ep: 222 | Steps: 600 | Explore: 0.10
Ep: 223 | Steps: 600 | Explore: 0.10
Ep: 224 | Steps: 600 | Explore: 0.10
Ep: 225 | Steps: 600 | Explore: 0.10
Ep: 226 | Steps: 600 | Explore: 0.10
Ep: 227 | Steps: 600 | Explore: 0.10
Ep: 228 | Steps: 600 | Explore: 0.10
Ep: 229 | Steps: 600 | Explore: 0.10
Ep: 230 | Steps: 600 | Explore: 0.10
Ep: 231 | Steps: 453 | Explore: 0.10
Ep: 232 | Steps: 30 | Explore: 0.10
Ep: 233 | Steps: 31 | Explore: 0.10
Ep: 234 | Steps: 95 | Explore: 0.10
Ep: 235 | Steps: 519 | Explore: 0.10
Ep: 236 | Steps: 33 | Explore: 0.10
Ep: 237 | Steps: 600 | Explore: 0.10
Ep: 238 | Steps: 163 | Explore: 0.10

How is the state dimension 7 , For 3 arms would it be 9 ?

car_env.py SyntaxError

File "/usr/projects/Reinforcement-learning-with-tensorflow/experiments/2D_car/car_env.py", line 67
self.car_info[:3] = np.array([*self.start_point, -np.pi / 2])
^
SyntaxError: invalid syntax

q_learning中 maze_env.py报错

运行Q_learning的maze_env.py文件中错误，
请问莫凡大神这是哪里的问题？？tk的版本问题吗

/usr/lib/python3.6/tkinter/init.py in init(self, screenName, baseName, className, useTk, sync, use)
2018 baseName = baseName + ext
2019 interactive = 0
-> 2020 self.tk = _tkinter.create(screenName, baseName, className, interactive, wantobjects, useTk, sync, use)
2021 if useTk:
2022 self._loadtk()

TclError: no display name and no $DISPLAY environment variable

why there is some ‘nans’ stored in the self.tree.tree arrays?

A3C example fail after updating TF==1.6

Hi @MorvanZhou , the BipedalWalker A3C example fail to converge after updating the TensorFlow.
It would be great if we can fix it.

https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/experiments/Solve_BipedalWalker/A3C.py

W_3 Ep: 7983 | ------- | Pos: 4 | RR: -16.6 | EpR: -16.4 | var: [ 4.245301  15.187426   5.2913938 13.67638  ]
W_1 Ep: 7984 | ------- | Pos: 3 | RR: -16.9 | EpR: -22.1 | var: [1.5639918 9.140908  1.399676  7.786484 ]
W_2 Ep: 7985 | ------- | Pos: 5 | RR: -16.8 | EpR: -14.6 | var: [ 4.6013346 16.964872   5.8146315 15.3746605]
W_4 Ep: 7986 | ------- | Pos: 3 | RR: -16.8 | EpR: -16.9 | var: [2.4117482 9.90723   1.954719  8.533363 ]
W_7 Ep: 7987 | ------- | Pos: 3 | RR: -16.9 | EpR: -19.8 | var: [ 2.0128653 13.1242     2.5133812 11.365165 ]
W_3 Ep: 7988 | ------- | Pos: 4 | RR: -16.8 | EpR: -14.5 | var: [ 1.7905483 10.57028    1.7253214  9.119608 ]
W_6 Ep: 7989 | ------- | Pos: 4 | RR: -16.8 | EpR: -16.8 | var: [ 4.03255  14.209857  4.585601 12.950537]
W_4 Ep: 7990 | ------- | Pos: 4 | RR: -16.7 | EpR: -14.7 | var: [ 3.7650225 14.439074   5.4745865 13.146451 ]
W_1 Ep: 7991 | ------- | Pos: 4 | RR: -16.7 | EpR: -15.6 | var: [ 4.213461  14.894608   4.1767473 13.537503 ]
W_7 Ep: 7992 | ------- | Pos: 3 | RR: -16.6 | EpR: -15.6 | var: [0.88352734 8.910546   0.9521044  7.6020703 ]
W_2 Ep: 7993 | ------- | Pos: 5 | RR: -16.4 | EpR: -13.4 | var: [ 2.968406  13.267687   3.6187844 12.018632 ]
W_3 Ep: 7994 | ------- | Pos: 3 | RR: -16.4 | EpR: -15.7 | var: [ 2.07616   11.160275   1.1988583  9.583065 ]
W_5 Ep: 7995 | ------- | Pos: 2 | RR: -17.1 | EpR: -30.1 | var: [ 0.9681119 10.007403   1.2556459  8.55513  ]
W_4 Ep: 7996 | ------- | Pos: 6 | RR: -16.8 | EpR: -11.4 | var: [ 3.3065507 13.672268   3.9533446 12.384474 ]
W_6 Ep: 7997 | ------- | Pos: 5 | RR: -16.6 | EpR: -12.3 | var: [ 3.542752  12.5449505  3.5135493 11.413937 ]
W_7 Ep: 7998 | ------- | Pos: 4 | RR: -16.4 | EpR: -12.7 | var: [ 3.7254398 14.587961   4.214101  13.035912 ]
W_2 Ep: 7999 | ------- | Pos: 4 | RR: -16.6 | EpR: -20.8 | var: [ 4.3646445 16.331139   5.4474187 14.882863 ]
W_1 Ep: 8000 | ------- | Pos: 2 | RR: -17.1 | EpR: -26.0 | var: [ 2.3089342 10.188143   2.140171   8.682452 ]
W_5 Ep: 8001 | ------- | Pos: 3 | RR: -17.1 | EpR: -18.5 | var: [1.4845458 9.2544775 1.4096836 7.9558926]
W_7 Ep: 8002 | ------- | Pos: 5 | RR: -16.9 | EpR: -12.9 | var: [ 4.1792936 12.500039   3.9695206 11.367695 ]
W_4 Ep: 8003 | ------- | Pos: 4 | RR: -16.9 | EpR: -16.5 | var: [ 2.9577925 13.383015   2.9338877 11.993065 ]
W_3 Ep: 8004 | ------- | Pos: 2 | RR: -17.3 | EpR: -25.1 | var: [1.4824905 9.347416  1.77881   7.8598304]
W_6 Ep: 8005 | ------- | Pos: 6 | RR: -17.3 | EpR: -16.6 | var: [ 4.678798  13.103448   4.7200103 11.855856 ]

how to save A3C model

after i train model , how can i save the model?

Problem with more than one action - A3C

Hello,

Thank you very much for the A3C implementation. I was trying implement your A3C implementation on biped walker to a custom muscle model and got it working for a single action output. But when I tried to implement it with more than one action, it get stuck in the smallest reward local minima. I tried starting the training with high learning rate and also tried different entropy_beta value to encourage more exploration. But alas nothing helped. could u help me with some advise on this.

Irrespective of whatever I tried the training get stuck in the local minima with a smallest reward possible.

regards.
akhil

Substitute pandas `ix` with `iloc`.

The ix function is seems to be deprecated. http://pandas.pydata.org/pandas-docs/version/0.20.3/whatsnew.html#deprecate-ix
Could be substitute with iloc in many cases.

Using DDPG_update2.py with pendulum, reward not converging

Hi, I am using https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/9_Deep_Deterministic_Policy_Gradient_DDPG/DDPG_update2.py and running it just as given, except with MAX_EPISODES=1000 and MAX_EP_STEPS=1000.

In the beginning, it starts off with reward values ~-7000, but at the end it seems like it did not converge since there are still values of -3203, -42, -481, etc. Is this normal, or am I missing something?

lambda parameter

Reinforcement-learning-with-tensorflow/contents/8_Actor_Critic_Advantage/AC_CartPole.py

Line 165 in 639902b

running_reward = running_reward * 0.95 + ep_rs_sum * 0.05

do you think we should specify lambda hyperparameter, rather than hardcoding it to 0.95?

Why both AC and A3C examples use Value function not Q-function?

Hi, thank you for your great work.

I have a quick question, I wonder why the critics of your AC and A3C examples use Value function but not Q-function?

Will Q-function can stablize the training of your CartPole AC example?

Many thanks.

如果reward 十分稀疏，A3C UPDATE_GLOBAL_ITER 该如何选取

首先十分感谢您耐心解答了我前一个问题。
最近在套用您的A3C框架做一个玩小游戏的AI，有一个问题这段时间比较困扰我。

Reinforcement-learning-with-tensorflow/experiments/Solve_LunarLander/A3C.py

Line 29 in 9c27c85

UPDATE_GLOBAL_ITER = 5

我的游戏有点像吃豆子小人游戏，在一个地图中有很多糖果，控制的agent上下左右移动去获取糖果。
reward 的定义就是这一步吃到糖果reward 就是1，否则agent移动reward 都是0。这样会导致reward list 的正reward 十分稀疏。
您例子中reward都比较连续，这样感觉UPDATE_GLOBAL_ITER 比较小的话训练比较有效率。但是如果 reward十分稀疏的话 UPDATE_GLOBAL_ITER 该如何适应呢？或者有什么别的思路可以解决这一问题呢？
再次感谢！

生成的图片意味着什么呢？

莫烦你好，我运行了run_this.py，结果生成的cost图片如下：

可是从图中看来，到了训练末尾，好像loss还没有收敛呢，那么这个图说明了什么呢？
谢谢！

如何实现A3C代码中仅save全局网络的参数，而非所有参数

MorvanZhou：
您好！
想问下，A3C代码中，在代码中仅有一个graph和一个session的情况下，如何实现仅save globalNet的参数以供后续使用？是不是只能够通过操作save函数要保存的参数列表？换成多进程来实现仅save globalNet的参数是不是更方便些？

谢谢
zxk

How to use the prioritized code to run atari games?

Hi, I want to run atari games based on the prioritized experience replay. However, the current code does not work. Any ideas?

為什麼合併Q的時候要將A減去他的平均值?

莫大大你好，你在影片中有提到為了避免V=0讓A直接等於Q，所以要將A減去他的均值。這邊我有個問題:

為什麼不減去均值V就會直接等於0?
為什麼剪了均值後A就不會直接等於Q?
謝謝!

the code report the pyglet issues with OpenGL

Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\threading.py", line 916, in _bootstrap_inne
r
self.run()
File "C:\ProgramData\Anaconda3\lib\threading.py", line 864, in run
self._target(*self.args, **self.kwargs)
File "arm_a3c_RL.py", line 201, in
job = lambda: worker.work()
File "arm_a3c_RL.py", line 136, in work
self.env.render()
File "C:\Code\python_code\RL\arm_env.py", line 72, in render
self.viewer = Viewer(*self.viewer_xy, self.arm_info, self.point_info, self.p
oint_l, self.mouse_in)
File "C:\Code\python_code\RL\arm_env.py", line 115, in init
super(Viewer, self).init(width, height, resizable=False, caption='Arm',
vsync=False) # vsync=False to not use the monitor FPS
File "C:\ProgramData\Anaconda3\lib\site-packages\pyglet\window\win32_init.
py", line 131, in init
super(Win32Window, self).init(*args, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\pyglet\window_init.py", l
ine 504, in init
config = screen.get_best_config(template_config)
File "C:\ProgramData\Anaconda3\lib\site-packages\pyglet\canvas\base.py", line
161, in get_best_config
configs = self.get_matching_configs(template)
File "C:\ProgramData\Anaconda3\lib\site-packages\pyglet\canvas\win32.py", line
33, in get_matching_configs
configs = template.match(canvas)
File "C:\ProgramData\Anaconda3\lib\site-packages\pyglet\gl\win32.py", line 25,
in match
return self._get_arb_pixel_format_matching_configs(canvas)
File "C:\ProgramData\Anaconda3\lib\site-packages\pyglet\gl\win32.py", line 98,
in _get_arb_pixel_format_matching_configs
nformats, pformats, nformats)
File "C:\ProgramData\Anaconda3\lib\site-packages\pyglet\gl\lib_wgl.py", line 9
5, in call
result = self.func(*args, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\pyglet\gl\lib.py", line 62, i
n MissingFunction
raise MissingFunctionException(name, requires, suggestions)
pyglet.gl.lib.MissingFunctionException: wglChoosePixelFormatARB is not exported
by the available OpenGL driver. ARB_pixel_format is required for this functiona
lity.

A3C玩 flappy bird

大神你好,我用你的A3C代码去改写了一些,去玩flappy bird.可是一直就运行不太正常(小鸟过不了管道).参照几个A3c 玩flappy 的代码改了很多很久,始终没有成功.很难受.
大神,你有没有兴趣帮我修正一下?

DDPG Critic implementation

Hi MorvanZhou,
I have a question about implementation of the critic in the ddpg.

In the baseline implementation is a difference:
https://github.com/openai/baselines/blob/master/baselines/ddpg/models.py

class Critic(Model):
...
x = obs
x = tf.layers.dense(x, 64)
...
x = tf.nn.relu(x)

        x = tf.concat([x, action], axis=-1)
        x = tf.layers.dense(x, 64)

...
x = tf.nn.relu(x)
...
The action is inserted via concat in the second layer.

This differs from the implementation in this repo. There the action is expanded in the range of n_l1 and is finally summed up in the first layer.
w1_s = tf.get_variable('w1_s', [self.s_dim, n_l1], trainable=trainable)
w1_a = tf.get_variable('w1_a', [self.a_dim, n_l1], trainable=trainable)
b1 = tf.get_variable('b1', [1, n_l1], trainable=trainable)
net = tf.nn.relu(tf.matmul(s, w1_s) + tf.matmul(a, w1_a) + b1)

What do you think is the correct implementation?

Thanks,
Roman

Better Exploration with Parameter Noise

Hey MorvanZhou,

first thanks for your really amazing expert work! Your repo is my favorite GitHub repo to learn about Reinforcement Learning!

Matthias Plappert has presented an impressive alternative to the action noise. It is called Parameter Noise:
https://blog.openai.com/better-exploration-with-parameter-noise/
https://github.com/openai/baselines/blob/master/baselines/ddpg/noise.py

Do you think this could be interesting for your repository? To compare and teach it?

Do you think it's easy to embed?

Best,
Roman

Some questions about PPO

Hello Zhou,

What a nice code. But I have some questions about your PPO part.

Do you test other continue space env like Mujoco? I test the Simple PPO code in Mujoco, which look like the code doesn't work. On the other hand, Pendulum-v0 result is good but not very good, which mean the best 100-episode average reward was below -200.
In Simple PPO code, you use the reward normalize. But I don't understand why the reward normalize can improve the result so much? You know, the reward normalize in the code is characteristic engineering, we had better not do it by ourselves.

I am looking forward your reply. : )

Avoid high frequent changes in DPPO

Hey MorvanZhou,

thank you for your really great work! I like your implementations a lot.

I have a question about the DPPO Algorithm.

In my own environments, it tends to do high frequent changes between the min and max action. Is there a way to avoid it?

Best,
Roman

network cost don't convergence

RL_brain.py and DQN_modified.py

network cost don't convergence

environments:
python3
tensorflow 1.3.0
macbookpro

Save model in "experiments/Solve_LunarLander/A3C.py"

Hi....
How can I save the model in "experiments/Solve_LunarLander/A3C.py"

Thanks.....

How to save network in DDPG_update2.py?

Hey MorvanZhou,

thanks for this great repository. Is fun to play and learn with your code!

I have a question about the DDPG_update2.py script and saving the network.

I tried to add:
saver = tf.train.Saver

(and after the training)

save_path = saver.save(ddpg.sess, 'checkpoint', write_meta_graph=False)

Python replies:
TypeError: save() missing 1 required positional argument: 'save_path'

When I type in saver.save(ddpg.sess) the reply is:
TypeError: save() missing 2 required positional arguments: 'sess' and 'save_path'

Do you have an idea what is wrong?

Thanks in advance.

Best,
Roman

I solved it.
saver = tf.train.Saver()

and

save_path = saver.save(ddpg.sess, "./trained_variables.ckpt", write_meta_graph=False)

DeepMind涉嫌抄袭你? :)

Distributed Proximal Policy Optimization (DPPO) (Tensorflow)中提到的不让worker计算和更新梯度，而只是传数据（obversaion），让PPO飞起来。你的这个想法也许超前于DeepMind的IMPALA的并行智能体结构(http://i.dataguru.cn/mportal.php?aid=13103&mod=view)。

从计算机科学角度，将观察，计算，合并，不同计算量的层次划分为实体，是典型的设计模式之一，但rl太难，能懂会用就不错了，所以不可能像JAVA框架这样快速接近哲学和数学高度。
DeepMind涉嫌抄袭你?还是说你就职于DeepMind?
anyway 感谢你的rl入门，痛快的看了我一个下午，还发着高烧，非常感谢你的教程，过瘾。

Question for Deep Q network

Hi Morvan.

I have a question about this line: https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/5_Deep_Q_Network/DQN_modified.py#L86. Shouldn't it be self.s_ instead of self.s? Thank you!

AttributeError: 'NoneType' object has no attribute 'decode'

File "DDPG.py", line 22, in
from car_env import CarEnv
File "/home1/ /Self-driving-car-RL/Reinforcement-learning-with-tensorflow/experiments/2D_car/car_env.py", line 148, in
class Viewer(pyglet.window.Window):
File "/home//workspace/tensorflow1.0-py3/lib/python3.5/site-packages/pyglet/init.py", line 357, in getattr
import(import_name)
File "/home//workspace/tensorflow1.0-py3/lib/python3.5/site-packages/pyglet/window/init.py", line 1816, in
gl._create_shadow_window()
File "/home//workspace/tensorflow1.0-py3/lib/python3.5/site-packages/pyglet/gl/init.py", line 205, in _create_shadow_window
_shadow_window = Window(width=1, height=1, visible=False)
File "/home//workspace/tensorflow1.0-py3/lib/python3.5/site-packages/pyglet/window/xlib/init.py", line 163, in init
super(XlibWindow, self).init(*args, **kwargs)
File "/home//workspace/tensorflow1.0-py3/lib/python3.5/site-packages/pyglet/window/init.py", line 504, in init
config = screen.get_best_config(template_config)
File "/home//workspace/tensorflow1.0-py3/lib/python3.5/site-packages/pyglet/canvas/base.py", line 161, in get_best_config
configs = self.get_matching_configs(template)
File "/home//workspace/tensorflow1.0-py3/lib/python3.5/site-packages/pyglet/canvas/xlib.py", line 179, in get_matching_configs
configs = template.match(canvas)
File "/home//workspace/tensorflow1.0-py3/lib/python3.5/site-packages/pyglet/gl/xlib.py", line 29, in match
have_13 = info.have_version(1, 3)
File "/home//workspace/tensorflow1.0-py3/lib/python3.5/site-packages/pyglet/gl/glx_info.py", line 86, in have_version
client_version = self.get_client_version().split()[0]
File "/home//workspace/tensorflow1.0-py3/lib/python3.5/site-packages/pyglet/gl/glx_info.py", line 118, in get_client_version
return asstr(glXGetClientString(self.display, GLX_VERSION))
File "/home//workspace/tensorflow1.0-py3/lib/python3.5/site-packages/pyglet/compat.py", line 88, in asstr
return s.decode("utf-8")
AttributeError: 'NoneType' object has no attribute 'decode'

Same error with other experiments of this repository.
Did you ever meet this error? I have tried many different versions of pyglet.
My env : tf1.0 + py3.5 + Virtualenv

请问apply_gradients 这个函数要加锁吗？

因为A3C是多线程的运行，都是对全局网络更新，不加锁会有问题吗？

'terminal' in 2_Q_Learning_maze

In the project: 2_Q_Learning_maze, the constant 'terminal' has never been assigned to any variable, So，the statement "if s_ != 'terminal':" has never been matched, is this a small typo?

我应该如何输出我的应对策略表

在2_Q_Learning_maze中您初始化了一个网格，加入我现在把他看成16种状态（4*4），我应该如何输出他在每一种状态时选择的action？也就是说，我想通过RL得到一个完整的应对策略表我应该如何修改代码？以及我能否观察收敛速度，因为我想对比不同的算法看看效果。
在这里感谢一下您的工作，真的很伟大，而且很通俗易懂。

The hidden layers of A3C have too many neurons

The hidden layers of A3C have too many neurons. The input of Cartpole only has 4 parameters. Reduce the number 200 and 100 to 24, and the result should be a lot better.
A3C网络里隐藏层神经元设太多了，Cartpole的输入只有4个参数，设成200和100很难收敛，改成24收敛快很多，效果也好很多。

Main function of run_this.py?

Hi Morvan, many thanks for all your sharing.

I typed in all your 3 py files posted here, but when I run "python run_this.py" on windows, it shows up:

Traceback (most recent call last):
File "run_this.py", line 47, in
RL = QLearningTable(actions=list(range(env.n_actions)))
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\tki
nter_init_.py", line 2095, in getattr
return getattr(self.tk, attr)
AttributeError: '_tkinter.tkapp' object has no attribute 'n_actions'

It seems the n_action variable is not initialized. Not sure where the problem is. Could you help?

mu, sigma = mu * A_BOUND[1], sigma + 1e-4

请问 A3C的连续动作算法中， mu * A_BOUND[1]的含义是? sigma加上一个很小的值是防止 sigma变得很小，失去了探索？

Try to apply RL_brain to new gym enviroment

Q-learning vs. Sarsa_lambda

Hi. Thanks for good tutorials!

You mentioned that Sarsa_lambda(4) was more efficient than the regular Q-learning(2), and it looks like it learn to get to the yellow target quicker.

However, when I modify the environment to be a 8x8 maze with 5 obstacles. The Q-learning seems to "get it faster", than Sarsa_lambda. Sarsa_Lamda, starts to explore, but after some falilures it starts more and more to camp in the corner (Afraid to explore? :-) )

Is this the way it should be, or I am missing something?

Tried to attach my custom environment file:

maze_env.txt

qlearning等算法讲的不透彻

网上其实资料不少，但需要整理和讲透撤，就需要自身实力。
境界1. 自己看懂
境界2. 让别人明白

这第二点就特别难。因为兴趣基于明白其真理的欲望，所以真理和原理对需要研究的人最重要。对入门跑跑helloworld可能不需要。当下，最难的可能是深度和强化入门容易，学习曲线和数学基础不需要太高深。但问题是，入门之后要用好，要研究心算法，必须哲学层面透彻理解这些算法。

楼主入门资料非常好，不多说，qlearning和sarsa算法的原理层面讲解不透彻。
推荐一篇，https://www.zhihu.com/question/26408259
第一个讲解的不错，第二个博士在读的讲错了可以忽略。

期望能将您的形式丰富多彩的入门课程，逐步发展为中级，甚至高级的研究资料，这样您的能力至少接近了斯坦福的上课教授，比如李飞飞，如果走学院派，估计你以后也会带徒弟，如果是业界工作，到不需要。

最后，感谢您认真细致的资料，对入门帮助很大，看了即便。

Questions regarding DQN_modified

Hi,

In your updated DQN_modified in the DQN folder, how did you do the indexing for q_eval_wrt_a?

with tf.variable_scope('q_eval'):

a_indices = tf.stack([tf.range(tf.shape(self.a)[0], dtype=tf.int32), self.a], axis=1) self.q_eval_wrt_a = tf.gather_nd(params=self.q_eval, indices=a_indices) # shape=(None, )

Wouldn't tf.shape(self.a)[0] be referring to the num_of_actions(4 in the maze env)? but the quantity is 32 instead of 4 which is passed from the environment. I tried to print a using eval() and i got:
[[ 0 3]
[ 1 3]
[ 2 3]
[ 3 0]
[ 4 3]
[ 5 3]
[ 6 3]
[ 7 3]
[ 8 3]
[ 9 3]
[10 3]
[11 3]
[12 3]
[13 0]
[14 3]
[15 3]
[16 1]
[17 3]
[18 3]
[19 2]
[20 3]
[21 3]
[22 3]
[23 3]
[24 3]
[25 3]
[26 3]
[27 3]
[28 0]
[29 3]
[30 3]
[31 0]]
Then you apply index-slicing on q_eval_wrt_a, so it will be q_eval_wrt_a[#Batch][action_index]?
What is the intuition of this?
A little bit explanation on this will be greatly helpful.
Thank u!

关于Critic网络训练的问题

Morvan Zhou:
您好：
请问在Critic网络训练中，函数learn（s, a, r, s_）中为何没有a_?
谢谢。
Sincerely, Luo

关于Prioritized Expereience DQN的问题

为什么在计算loss的时候，abs_error是用reduce_sum但是TD-error是用reduce_mean

A3C_continuous_action - reward normalisation

Hi - thank you very much for a great set of tutorials.

I noticed this line where the reward is rescaled:
https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/10_A3C/A3C_continuous_action.py#L134

Removing this line seems to break the Pendulum example, so therefore it is clearly sensitive to this parameter. I understand why this was done, but are you aware of any other way of dealing with this issue to make the algorithm more general?

I came across this which sounds interesting https://arxiv.org/pdf/1602.07714.pdf

morvanzhou / reinforcement-learning-with-tensorflow Goto Github PK

reinforcement-learning-with-tensorflow's People

Contributors

Stargazers

Watchers

Forkers

reinforcement-learning-with-tensorflow's Issues

运行Q_learning的maze_env.py文件中错误， 请问莫凡大神这是哪里的问题？？tk的版本问题吗

Recommend Projects

Recommend Topics

Recommend Org

运行Q_learning的maze_env.py文件中错误，
请问莫凡大神这是哪里的问题？？tk的版本问题吗