
Comments (8)

dementrock commented on July 25, 2024

Hi @lchenat, all the parameters should have been documented in the appendix of the paper. However, it is not guaranteed that you will get exactly the same result, due to differences in random seeds. I'd be happy to assist if you observe significant discrepancies.

lchenat commented on July 25, 2024

I did not find the parameters for DDPG in the appendix of the paper. I ran the following code, and the maximal average return across iterations was no more than 2400:

import re
import numpy
import sys
from subprocess import call
from rllab.algos.vpg import VPG
from rllab.algos.tnpg import TNPG
from rllab.algos.erwr import ERWR
from rllab.algos.reps import REPS
from rllab.algos.trpo import TRPO
from rllab.algos.cem import CEM
from rllab.algos.cma_es import CMAES
from rllab.algos.ddpg import DDPG
from rllab.baselines.linear_feature_baseline import LinearFeatureBaseline
from rllab.envs.box2d.cartpole_env import CartpoleEnv
from rllab.envs.normalized_env import normalize
from rllab.policies.gaussian_mlp_policy import GaussianMLPPolicy
from rllab.misc.instrument import stub, run_experiment_lite
from rllab.exploration_strategies.ou_strategy import OUStrategy
from rllab.policies.deterministic_mlp_policy import DeterministicMLPPolicy
from rllab.q_functions.continuous_mlp_q_function import ContinuousMLPQFunction

path = "/home/data/lchenat/rllab-master/data/local/experiment/"
exp_name = "test_cartpole_again_ddpg"

stub(globals())

env = normalize(CartpoleEnv())

policy = DeterministicMLPPolicy(
    env_spec=env.spec,
    hidden_sizes=(400, 300),
)
es = OUStrategy(env_spec=env.spec)
qf = ContinuousMLPQFunction(env_spec=env.spec)
algo = DDPG(
    env=env,
    policy=policy,
    es=es,
    qf=qf,
    n_epochs=600,
)

# delete the previous data
call(["rm", "-rf", path + exp_name])

run_experiment_lite(
    algo.train(),
    n_parallel=1,
    snapshot_mode="last",
    # seed=1,
    exp_name=exp_name,
    # plot=True,
)

lchenat commented on July 25, 2024

By the way, is there a function that can calculate the metric defined in the paper (the average over all iterations and all trajectories)? The debug log only provides the average return of each iteration, and the number of trajectories per iteration is not provided for some algorithms.

dementrock commented on July 25, 2024

Also, as mentioned in the paper (it probably should have been clearer), we scaled all the rewards by 0.1 when running DDPG. Refer to https://github.com/openai/rllab/blob/master/rllab/algos/ddpg.py#L112

In general, we found this parameter to be very important, and due to time constraints we weren't able to tune it extensively. You may try other values on other tasks, which may give you even better results.
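
For example (a minimal sketch building on your launcher script above; I'm assuming the constructor argument is named scale_reward, matching the attribute used at the linked line):

algo = DDPG(
    env=env,
    policy=policy,
    es=es,
    qf=qf,
    n_epochs=600,
    scale_reward=0.1,  # assumed argument name: rewards are multiplied by this factor inside DDPG
)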

Re the second question: I think we did a very crude approximation and simply averaged the results over all iterations (treating it as if all iterations had the same number of trajectories). Feel free to submit a pull request that adds additional logging.
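
A rough sketch of that approximation (assuming the per-iteration average return ends up in progress.csv under a column named AverageReturn; the exact column name may differ between algorithms):

import csv

def approx_metric(progress_csv, column="AverageReturn"):
    # Unweighted mean of the per-iteration average returns, i.e. treating
    # every iteration as if it contained the same number of trajectories.
    with open(progress_csv) as f:
        values = [float(row[column]) for row in csv.DictReader(f) if row[column]]
    return sum(values) / len(values)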

lchenat commented on July 25, 2024

I have scaled the reward by 0.1, but I still get returns around 2500. Are there any other parameters I need to tune?

dementrock commented on July 25, 2024

Oh, you should change the max path length in DDPG to 500. Otherwise, the optimal score is 2500!
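
Something along these lines (a sketch only; I'm assuming DDPG exposes a max_path_length constructor argument, combined with the reward scaling mentioned above):

algo = DDPG(
    env=env,
    policy=policy,
    es=es,
    qf=qf,
    n_epochs=600,
    max_path_length=500,  # assumed argument name: episode horizon, otherwise the return is capped at 2500
    scale_reward=0.1,
)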

lchenat commented on July 25, 2024

Yes, the optimal score increased to 5000 after I changed the max path length to 500, but the average over all iterations is around 3100. Here are the average returns extracted from debug.log:

[85.1877, 22.4833, 22.2935, 22.4445, 22.561, 22.3393, 22.8141, 22.2145, 22.2697, 22.3604, 100.441, 177.388, 196.363, 183.331, 223.452, 272.554, 293.124, 407.079, 535.813, 619.828, 695.468, 872.355, 1028.65, 952.744, 645.209, 846.002, 601.686, 607.632, 656.687, 697.427, 715.399, 646.103, 646.78, 621.531, 609.173, 629.381, 598.768, 633.524, 603.093, 692.313, 627.032, 665.51, 671.895, 678.046, 721.31, 670.6, 645.387, 603.164, 594.49, 617.101, 676.009, 634.184, 627.533, 658.008, 700.695, 684.835, 622.859, 596.207, 691.321, 615.621, 612.777, 573.243, 598.272, 611.166, 596.099, 598.044, 551.066, 636.267, 740.511, 599.541, 605.533, 615.751, 710.193, 662.288, 619.205, 661.016, 582.386, 582.968, 601.911, 653.29, 617.729, 651.414, 744.331, 714.654, 658.312, 804.903, 841.202, 925.207, 855.179, 1044.97, 895.128, 936.976, 1066.89, 1406.07, 2131.26, 4021.35, 1814.43, 1877.28, 1512.61, 1993.6, 1686.47, 1991.07, 3476.89, 4138.7, 2385.71, 3379.73, 2648.44, 2970.91, 4008.72, 4683.97, 3603.48, 4999.14, 4999.04, 4998.86, 2328.25, 4534.03, 4999.28, 4999.24, 4998.56, 4283.28, 4998.47, 4998.89, 4998.86, 2223.49, 4999.18, 2702.06, 4998.8, 4998.67, 4999.02, 4998.57, 4999.6, 4998.84, 4998.5, 4998.65, 2449.9, 2153.85, 2034.24, 1275.76, 1394.86, 2258.75, 4557.9, 4998.51, 4998.52, 4998.37, 4998.73, 4998.16, 4997.71, 4997.81, 4583.94, 4998.32, 4998.46, 4998.38, 4998.21, 4804.9, 4997.79, 4998.41, 4998.03, 4998.44, 4998.26, 4998.16, 4998.07, 4998.21, 4997.73, 4998.04, 4997.81, 4998.3, 4998.33, 4998.2, 4998.27, 4998.15, 4998.6, 4998.23, 4998.63, 4998.58, 4998.57, 4999.11, 4999.32, 4999.47, 4999.41, 4790.46, 4999.45, 4999.45, 4999.57, 4999.45, 4781.79, 4999.5, 4999.46, 2834.94, 2667.89, 4999.43, 4879.07, 4999.51, 4999.5, 4256.07, 4999.24, 3749.83, 3140.73, 2184.49, 3293.37, 4276.64, 4570.93, 4549.38, 4448.15, 4999.32, 4608.16, 4999.52, 4999.38, 4999.16, 4999.43, 4790.45, 4999.54, 4724.55, 4999.43, 4627.56, 4999.58, 4999.45, 4272.88, 4999.26, 4999.38, 4784.83, 4731.7, 4696.11, 4427.15, 4165.41, 4906.99, 4422.53, 3953.47, 3692.44, 4123.02, 4571.29, 4450.07, 4999.32, 4859.32, 4999.44, 4498.9, 4895.5, 4999.22, 4589.09, 4998.88, 4733.38, 4775.73, 4999.29, 4999.18, 4640.48, 4610.55, 4935.44, 4999.2, 4883.15, 4852.51, 4900.67, 4835.74, 4500.04, 4738.27, 4531.23, 4530.79, 4999.0, 4999.18, 3974.69, 4797.54, 4998.95, 4000.32, 3699.98, 3424.3, 4998.86, 4003.68, 4878.38, 4915.73, 4763.66, 4998.63, 4688.21, 4998.92, 4926.33, 3244.25, 4507.45, 4998.75, 4998.79, 4998.45, 3060.27, 2583.36, 2717.86, 2005.12, 4911.39, 4998.91, 4998.66, 4660.82, 4789.71, 4998.43, 4998.52, 4884.03, 4541.58, 4998.37]
291 results, mean = 3179.83032646
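
That summary is just the length and the numpy mean of the list above:

import numpy

returns = [85.1877, 22.4833, 22.2935]  # truncated here; the full 291-entry list is printed above
print(len(returns))
print(numpy.mean(returns))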

The average return drops from 5000 down to 2000-3000 from time to time. Is that a normal phenomenon in DDPG?

dementrock commented on July 25, 2024

@lchenat The benchmark results were run over 25 million samples, to match the sample complexity used by the other algorithms; this should correspond to roughly 2500 epochs. A good approximation would be to extrapolate the performance of the last few epochs out to the same number of samples and compute the average return over all of that data.
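
A rough sketch of that, purely as an illustration (the 2500-epoch figure and the tail length are placeholders, not the exact benchmark procedure):

import numpy as np

def extrapolated_average(avg_returns, total_epochs=2500, tail=20):
    # Extend the per-epoch average returns to `total_epochs` by repeating the
    # mean of the last `tail` epochs, then average over the padded sequence.
    avg_returns = np.asarray(avg_returns, dtype=float)
    pad = np.full(total_epochs - len(avg_returns), avg_returns[-tail:].mean())
    return np.concatenate([avg_returns, pad]).mean()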

I have also observed that DDPG is sometimes unstable, even on cartpole, so what you're getting seems about right. One thing we didn't try was batch normalization, which we could not get working before the paper deadline; it could be a good thing to try. You can also try other reward scalings (e.g. 0.01), which might stabilize learning further.
