
Comments (2)

arnomoonens commented on May 27, 2024

Hello,

Thank you for the feedback about my code! I'm glad that I can help other people using my repository.

I have experienced the same issue with the MountainCar-v0 environment. The problem is that we are applying on-policy methods (A2C and A3C) to an environment that rarely gives useful rewards (i.e., only at the end).

I have only used Sarsa with function approximation (not DPG), and I believe this algorithm works quite well on the MountainCar-v0 environment because it favors actions that have not been tried yet in the current state. This happens because the thetas are initialized uniformly at random. Whenever a reward (for this environment, -1) is received, only the thetas for the previous state and action are changed (see the sketch below).
I haven't studied or implemented DPG yet. I am interested in how that algorithm is able to "solve" this environment.
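For reference, here is a minimal sketch of linear Sarsa as I describe it above; the feature size, learning rate, and helper names are placeholders, not the code from this repository:

```python
import numpy as np

n_features, n_actions = 4, 3
alpha, gamma = 0.1, 0.99

# Thetas initialized uniformly at random: actions that haven't been tried yet
# in a state keep their random (possibly optimistic) estimates.
theta = np.random.uniform(size=(n_actions, n_features))

def q(phi, a):
    return theta[a].dot(phi)

def sarsa_update(phi, a, reward, phi_next, a_next):
    # Only theta[a] (the previous state-action pair) is updated, so the
    # estimates for the other actions in the same state are left untouched.
    td_error = reward + gamma * q(phi_next, a_next) - q(phi, a)
    theta[a] += alpha * td_error * phi
```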

In contrast to Sarsa with function approximation, an A3C update can influence all of the parameters (in this case the neural network weights), and thus the output for every state (the input to the actor's neural network) can change. I ran an experiment, and the network always seems to output the same probabilities, as the feedback to the network is also always the same.
Thus, you can only reach the goal by luck. Once the agent has "discovered" the goal, performance should improve. In fact, some people report having learned this environment successfully using A3C.
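A toy illustration of why the probabilities stay put (illustrative numbers only, not code from this repository): every episode that never reaches the goal collects -1 per step for 200 steps, so every return is identical, the advantage (return minus baseline) is roughly zero, and the policy gradient pushes no action over another.

```python
import numpy as np

# Every MountainCar-v0 episode that never reaches the goal returns -200.
returns = np.full(10, -200.0)

baseline = returns.mean()        # the value estimate also converges to -200
advantages = returns - baseline  # all zeros

# Policy-gradient updates are proportional to the advantage, so with
# identical feedback the action probabilities barely move.
print(advantages)  # [0. 0. 0. ...]
```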

I hope my explanation is clear. Feel free to ask more questions otherwise.
I also don't fully understand it yet. Unfortunately, I don't have enough time right now to investigate the problem more thoroughly.


By the way, the weights of the networks in my A2C and A3C algorithms weren't initialized properly. The standard deviation was 1, which is too large and can lead to large differences in the initial action probabilities. Sometimes an action had a probability of only 0.5%, for example. As I explained, the probabilities never change much, so such an action is almost never selected. I have now changed this (commit 6a0d879) to use tf.truncated_normal_initializer(mean=0.0, stddev=0.02) as the weight initializer.
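A rough numpy illustration of the effect (not the repository's code; the feature vector and hidden size of 64 are made up): drawing the output-layer weights with stddev 1 can already skew the initial softmax heavily, while stddev 0.02 starts close to uniform.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
features = rng.normal(size=64)  # hypothetical last hidden-layer activations

for stddev in (1.0, 0.02):
    W = rng.normal(0.0, stddev, size=(64, 3))  # output-layer weights, 3 actions
    print(f"stddev={stddev}: {np.round(softmax(features @ W), 3)}")
# With stddev 1 the logits can differ by several units, so one action may start
# with a tiny probability; with stddev 0.02 the policy starts nearly uniform.
```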


zencoding commented on May 27, 2024

Thanks for your explanation, that helps. It seems that on-policy methods have worse exploration than off-policy ones, so in situations where the reward does not change as the state changes, it is better to use off-policy methods.

BTW, I tried various things on A2C to make it work, such as adding a reward for movement:

```python
for _ in range(self.config["repeat_n_actions"]):
    state, rew, done, _ = self.step_env(action)
    stateDelta = np.mean(np.square(state - old_state))
    # Good reward if the agent moved the car
    if stateDelta > 0.0001:
        rew = 0
    if done:  # Don't continue if episode has already ended
        break
```
and experience replay and epsilon-greedy exploration:

```python
if np.random.rand() <= self.config["epsilon"]:
    action = np.random.randint(0, 3, size=1)[0]
else:
    action = self.choose_action(state)
```

but the network still won't converge to fewer than 200 steps. I don't know why, but I will investigate.
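For completeness, a minimal replay-buffer sketch along the lines of what I mean by experience replay (a generic sketch, not the exact code I tried):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation of consecutive steps.
        return random.sample(self.buffer, batch_size)
```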

Thanks again for your help in understanding this.


