Comments (14)

Kallinteris-Andreas commented on June 5, 2024

Great, I will add it to the v5 change list.

Kallinteris-Andreas commented on June 5, 2024

[Figure: Pendulum-v4-vs-v5]
Here is the v4 vs v5 comparison for InvertedPendulum (the only difference in v5 is the fixed reward).
As expected, the v5 version has a faster learning transient.

pseudo-rnd-thoughts commented on June 5, 2024

Yes, I think that is a reasonable thing to consider adding for v5. @rodrigodelazcano thoughts?

Kallinteris-Andreas commented on June 5, 2024


The same appears to be the case for InvertedPendulumEnv.

rodrigodelazcano commented on June 5, 2024

That is a good catch. I agree with @pseudo-rnd-thoughts. This should be added in v5, since v4 only updated to the new mujoco bindings and this reward error is present in older versions as well.

Kallinteris-Andreas commented on June 5, 2024

Here is some code verifying the bugs:

>>> import gymnasium
>>> env = gymnasium.make('InvertedPendulum-v4')
>>> env.reset()
(array([-0.00114481,  0.00315834, -0.00689603, -0.00764207]), {})
>>> env.step([1])
(array([ 0.0052199 , -0.01239018,  0.32425438, -0.76226102]), 1.0, False, False, {})
>>> env.step([1])
(array([ 0.02474693, -0.05746427,  0.65169342, -1.48966764]), 1.0, False, False, {})
>>> env.step([1])
(array([ 0.05732965, -0.13159401,  0.97709572, -2.21890001]), 1.0, False, False, {})
>>> env.step([1])
(array([ 0.10286945, -0.23521879,  1.29895519, -2.96571882]), 1.0, True, False, {})
>>> env.step([1])
(array([ 0.16112042, -0.36907483,  1.6112052 , -3.72861976]), 1.0, True, False, {})
>>> env.step([1])
(array([ 0.23148975, -0.53346372,  1.902614  , -4.48774083]), 1.0, True, False, {})
>>> env = gymnasium.make('InvertedDoublePendulum-v4')
>>> env.reset()
(array([-0.05209413, -0.03106399, -0.05757982,  0.9995174 ,  0.99834091,
       -0.00319314, -0.10766195,  0.09683618,  0.        ,  0.        ,
        0.        ]), {})
>>> env.step([1])
(array([ 7.67962813e-04, -1.44606909e-01,  8.22320453e-02,  9.89489182e-01,
        9.96613210e-01,  2.11193186e+00, -4.45134196e+00,  5.49346477e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00]), 9.17710405832815, False, False, {})
>>> env.step([1])
(array([ 0.15545075, -0.44371616,  0.43168352,  0.89616738,  0.90202513,
        3.9987776 , -7.74866516,  7.9830774 ,  0.        ,  0.        ,
        0.        ]), 8.877556821859912, False, False, {})
>>> env.step([1])
(array([ 0.39199627, -0.77051144,  0.69186829,  0.63742617,  0.72202373,
        5.38530052, -8.71215195,  3.8089483 ,  0.        ,  0.        ,
        0.        ]), 8.807853136081622, True, False, {})

Kallinteris-Andreas commented on June 5, 2024


v4 is the current v4 version
v4-fixed is the current v4 version with reward_alive fixed (see the sketch below)
v5 is the current v4 version with reward_alive fixed plus the observation fix (#228)
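
For reference, the reward_alive change amounts to roughly the following (a sketch of the idea, not the exact Gymnasium source; the helper name is illustrative, and the termination check shown is the v4-style one for InvertedPendulum, whose alive bonus is 1):

import numpy as np

def inverted_pendulum_reward(observation, fixed=True):
    # InvertedPendulum-v4-style termination: non-finite state or pole angle
    # more than 0.2 rad from upright.
    terminated = bool(not np.isfinite(observation).all() or np.abs(observation[1]) > 0.2)
    if fixed:
        # v4-fixed / v5 behaviour: no alive bonus once the episode has terminated.
        reward = 1.0 if not terminated else 0.0
    else:
        # v4 behaviour (the bug): the alive bonus is granted unconditionally,
        # even on the step where the pendulum falls over.
        reward = 1.0
    return reward, terminated

print(inverted_pendulum_reward(np.array([0.0, 0.5, 0.0, 0.0]), fixed=False))  # (1.0, True) under v4
print(inverted_pendulum_reward(np.array([0.0, 0.5, 0.0, 0.0]), fixed=True))   # (0.0, True) after the fix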

pseudo-rnd-thoughts commented on June 5, 2024

This is a massive reward difference.
To me, this change shouldn't explain the performance difference: the only difference is that alive_bonus = 0 when terminated=True, so I expected the episode reward might be about 10 points lower.
@Kallinteris-Andreas Am I misunderstanding something?

Kallinteris-Andreas commented on June 5, 2024

The 10-point difference in episodic reward only occurs if the episode terminates (which, after some training, does not happen regardless of the reward function).

The best policy in all cases resulted in the same return (~9360); it is just that with the fixed reward function it is possible to get there more consistently.

Note: I have double-checked the source code; nothing is wrong there.

pseudo-rnd-thoughts commented on June 5, 2024

That doesn't explain the ~4000 point increase shown by the plot above.

To me, the only change to the reward function is that reward_alive = 0 when terminated=True. Have I misunderstood something?

Kallinteris-Andreas commented on June 5, 2024

No, your understanding of the change in the reward function is correct.

pseudo-rnd-thoughts commented on June 5, 2024

Then why the ~4000 point difference? To me, if the agents were already collecting the optimal return, then the difference should be about 10 points on average.

Kallinteris-Andreas commented on June 5, 2024

Because on some runs with the old reward function, the agent is not able to learn how to "escape" an unbalanced state.

The optimal results are identical under both reward functions (since the "optimal" policy would never be unbalanced).

pseudo-rnd-thoughts commented on June 5, 2024

Wow, that is amazing if purely changing that variable causes such a massive change in performance.
