
Comments (2)

arnomoonens commented on May 27, 2024

Hello,

Thank you for the feedback about my code! I'm glad that I can help other people using my repository.

I have experienced the same issue with the MountainCar-v0 environment. The problem is that we are applying on-policy methods (A2C and A3C) to an environment that rarely gives useful rewards (i.e., only at the end).

I have only used Sarsa with function approximation (not DPG), and I believe this algorithm works quite well on the MountainCar-v0 environment because it favors actions that have not been tried yet in the current state. This happens because the thetas are initialized uniformly at random. Whenever a reward (for this environment, -1) is received, only the thetas for the previous state and action are changed (see the sketch below).
I haven't studied or implemented DPG yet. I am interested in how that algorithm is able to "solve" this environment.
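For reference, here is a minimal sketch of linear Sarsa as I describe it above; the feature size, learning rate, and helper names are placeholders, not the code from this repository:

```python
import numpy as np

n_features, n_actions = 4, 3
alpha, gamma = 0.1, 0.99

# Thetas initialized uniformly at random: actions that haven't been tried yet
# in a state keep their random (possibly optimistic) estimates.
theta = np.random.uniform(size=(n_actions, n_features))

def q(phi, a):
    return theta[a].dot(phi)

def sarsa_update(phi, a, reward, phi_next, a_next):
    # Only theta[a] (the previous state-action pair) is updated, so the
    # estimates for the other actions in the same state are left untouched.
    td_error = reward + gamma * q(phi_next, a_next) - q(phi, a)
    theta[a] += alpha * td_error * phi
```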

In contrast to Sarsa with function approximation, an A3C update can influence all of the parameters (in this case the neural network weights), and thus the output for every state (the input to the actor's neural network) can change. I ran an experiment, and the network always seems to output the same probabilities, as the feedback to the network is also always the same.
Thus, you can only reach the goal by luck. Once the agent has "discovered" the goal, performance should improve. In fact, some people report having learned this environment successfully using A3C.
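A toy illustration of why the probabilities stay put (illustrative numbers only, not code from this repository): every episode that never reaches the goal collects -1 per step for 200 steps, so every return is identical, the advantage (return minus baseline) is roughly zero, and the policy gradient pushes no action over another.

```python
import numpy as np

# Every MountainCar-v0 episode that never reaches the goal returns -200.
returns = np.full(10, -200.0)

baseline = returns.mean()        # the value estimate also converges to -200
advantages = returns - baseline  # all zeros

# Policy-gradient updates are proportional to the advantage, so with
# identical feedback the action probabilities barely move.
print(advantages)  # [0. 0. 0. ...]
```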

I hope my explanation is clear. Feel free to ask more questions otherwise.
I also don't fully understand it yet. Unfortunately, I don't have enough time right now to investigate the problem more thoroughly.


By the way, the weights of the networks in my A2C and A3C algorithms weren't initialized properly. The standard deviation was 1, which is too large and can lead to large differences in the initial action probabilities. Sometimes an action had a probability of only 0.5%, for example. As I explained, the probabilities never change much, so such an action is almost never selected. I have now changed this (commit 6a0d879) to use tf.truncated_normal_initializer(mean=0.0, stddev=0.02) as the weight initializer.
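A rough numpy illustration of the effect (not the repository's code; the feature vector and hidden size of 64 are made up): drawing the output-layer weights with stddev 1 can already skew the initial softmax heavily, while stddev 0.02 starts close to uniform.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
features = rng.normal(size=64)  # hypothetical last hidden-layer activations

for stddev in (1.0, 0.02):
    W = rng.normal(0.0, stddev, size=(64, 3))  # output-layer weights, 3 actions
    print(f"stddev={stddev}: {np.round(softmax(features @ W), 3)}")
# With stddev 1 the logits can differ by several units, so one action may start
# with a tiny probability; with stddev 0.02 the policy starts nearly uniform.
```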


zencoding commented on May 27, 2024

Thanks for your explanation, that helps. It seems that on-policy methods have worse exploration than off-policy ones, so in situations where the reward does not change as the state changes, it is better to use off-policy methods.

BTW, I tried various things on A2C to make it work, such as adding a reward for movement:

```python
for _ in range(self.config["repeat_n_actions"]):
    state, rew, done, _ = self.step_env(action)
    stateDelta = np.mean(np.square(state - old_state))
    # Good reward if the agent moved the car
    if stateDelta > 0.0001:
        rew = 0
    if done:  # Don't continue if episode has already ended
        break
```
and experience replay and epsilon-greedy exploration:

```python
if np.random.rand() <= self.config["epsilon"]:
    action = np.random.randint(0, 3, size=1)[0]
else:
    action = self.choose_action(state)
```

but the network still won't converge to fewer than 200 steps. I don't know why, but I will investigate.
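For completeness, a minimal replay-buffer sketch along the lines of what I mean by experience replay (a generic sketch, not the exact code I tried):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation of consecutive steps.
        return random.sample(self.buffer, batch_size)
```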

Thanks again for your help in understanding this.


