Comments (14)

Kallinteris-Andreas commented on June 5, 2024

Great, I will add it to the v5 change list.

Kallinteris-Andreas commented on June 5, 2024

[Figure: Pendulum-v4-vs-v5]
Here is the v4 vs v5 comparison for InvertedPendulum (the only difference in v5 is the fixed reward).
As expected, the v5 version has a faster learning transient.

pseudo-rnd-thoughts commented on June 5, 2024

Yes, I think that is a reasonable thing to consider adding for v5. @rodrigodelazcano thoughts?

Kallinteris-Andreas commented on June 5, 2024


The same appears to be the case for InvertedPendulumEnv.

rodrigodelazcano commented on June 5, 2024

That is a good catch. I agree with @pseudo-rnd-thoughts. This should be added in v5, since v4 only updated to the new mujoco bindings and this reward error is present in older versions as well.

Kallinteris-Andreas commented on June 5, 2024

Here is some code verifying the bugs:

>>> import gymnasium
>>> env = gymnasium.make('InvertedPendulum-v4')
>>> env.reset()
(array([-0.00114481,  0.00315834, -0.00689603, -0.00764207]), {})
>>> env.step([1])
(array([ 0.0052199 , -0.01239018,  0.32425438, -0.76226102]), 1.0, False, False, {})
>>> env.step([1])
(array([ 0.02474693, -0.05746427,  0.65169342, -1.48966764]), 1.0, False, False, {})
>>> env.step([1])
(array([ 0.05732965, -0.13159401,  0.97709572, -2.21890001]), 1.0, False, False, {})
>>> env.step([1])
(array([ 0.10286945, -0.23521879,  1.29895519, -2.96571882]), 1.0, True, False, {})
>>> env.step([1])
(array([ 0.16112042, -0.36907483,  1.6112052 , -3.72861976]), 1.0, True, False, {})
>>> env.step([1])
(array([ 0.23148975, -0.53346372,  1.902614  , -4.48774083]), 1.0, True, False, {})
>>> env = gymnasium.make('InvertedDoublePendulum-v4')
>>> env.reset()
(array([-0.05209413, -0.03106399, -0.05757982,  0.9995174 ,  0.99834091,
       -0.00319314, -0.10766195,  0.09683618,  0.        ,  0.        ,
        0.        ]), {})
>>> env.step([1])
(array([ 7.67962813e-04, -1.44606909e-01,  8.22320453e-02,  9.89489182e-01,
        9.96613210e-01,  2.11193186e+00, -4.45134196e+00,  5.49346477e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00]), 9.17710405832815, False, False, {})
>>> env.step([1])
(array([ 0.15545075, -0.44371616,  0.43168352,  0.89616738,  0.90202513,
        3.9987776 , -7.74866516,  7.9830774 ,  0.        ,  0.        ,
        0.        ]), 8.877556821859912, False, False, {})
>>> env.step([1])
(array([ 0.39199627, -0.77051144,  0.69186829,  0.63742617,  0.72202373,
        5.38530052, -8.71215195,  3.8089483 ,  0.        ,  0.        ,
        0.        ]), 8.807853136081622, True, False, {})

Kallinteris-Andreas commented on June 5, 2024


v4 is the current v4 version
v4-fixed is the current v4 version with reward_alive fixed (see the sketch below)
v5 is the current v4 version with reward_alive fixed plus the observation fix (#228)
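
For reference, the reward_alive change amounts to roughly the following (a sketch of the idea, not the exact Gymnasium source; the helper name is illustrative, and the termination check shown is the v4-style one for InvertedPendulum, whose alive bonus is 1):

import numpy as np

def inverted_pendulum_reward(observation, fixed=True):
    # InvertedPendulum-v4-style termination: non-finite state or pole angle
    # more than 0.2 rad from upright.
    terminated = bool(not np.isfinite(observation).all() or np.abs(observation[1]) > 0.2)
    if fixed:
        # v4-fixed / v5 behaviour: no alive bonus once the episode has terminated.
        reward = 1.0 if not terminated else 0.0
    else:
        # v4 behaviour (the bug): the alive bonus is granted unconditionally,
        # even on the step where the pendulum falls over.
        reward = 1.0
    return reward, terminated

print(inverted_pendulum_reward(np.array([0.0, 0.5, 0.0, 0.0]), fixed=False))  # (1.0, True) under v4
print(inverted_pendulum_reward(np.array([0.0, 0.5, 0.0, 0.0]), fixed=True))   # (0.0, True) after the fix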

pseudo-rnd-thoughts commented on June 5, 2024

This is a massive reward difference.
To me, this change shouldn't explain the performance difference: the only difference is that alive_bonus = 0 when terminated=True, so I expected the episode reward might be about 10 points lower.
@Kallinteris-Andreas Am I misunderstanding something?

Kallinteris-Andreas commented on June 5, 2024

The 10-point difference in episodic reward only occurs if the episode terminates (which, after some training, does not happen regardless of the reward function).

The best policy in all cases resulted in the same return (~9360); it is just that with the fixed reward function it is possible to get there more consistently.

Note: I have double-checked the source code; nothing is wrong there.

pseudo-rnd-thoughts commented on June 5, 2024

That doesn't explain the ~4000 point increase shown by the plot above.

To me, the only change to the reward function is that reward_alive = 0 when terminated=True. Have I misunderstood something?

Kallinteris-Andreas commented on June 5, 2024

No, your understanding of the change in the reward function is correct.

pseudo-rnd-thoughts commented on June 5, 2024

Then why the ~4000 point difference? To me, if the agents were already collecting the optimal return, then the difference should be about 10 points on average.

Kallinteris-Andreas commented on June 5, 2024

Because on some runs with the old reward function, the agent is not able to learn how to "escape" an unbalanced state.

The optimal results are identical under both reward functions (since the "optimal" policy would never be unbalanced).

pseudo-rnd-thoughts commented on June 5, 2024

Wow, that is amazing if purely changing that variable causes such a massive change in performance.
