Let me start off by saying thanks for writing such a wonderful, and easy to use librar


Reward either goes down or stays stagnant about trl HOT 10 CLOSED

huggingface commented on September 26, 2024 9

Reward either goes down or stays stagnant

from trl.

Comments (10)

Alymostafa commented on September 26, 2024 3

same problem here with a longer sequence.
@vblagoje
@lvwerra

Alymostafa commented on September 26, 2024 1

@adhitya-synth I used the same configuration you mentioned. With a small batch size I see the behavior you describe, but with a larger batch size, as in the notebook, the reward increases.

hdvvip commented on September 26, 2024 1

Based on the OpenAI experiments in the InstructGPT paper, I think it depends on the dataset you use to train your model. In OpenAI's case, even with the best implementation of PPO, they still failed to improve the rewards when they trained GPT-3 with PPO on the FLAN and T0 datasets.

hdvvip commented on September 26, 2024 1

Well, I think we have a misunderstanding here. I didn't specifically mention you in my post. I just wanted to explain to everyone here that, depending on your task, PPO may or may not work. So it's not your fault when PPO fails on your NLP task. Everyone here has a different task, so my answer had nothing to do with batch size. BTW, OpenAI used a batch size of 128 and still failed.

parshinsh commented on September 26, 2024

I confirm that this issue happens. I'm facing the same problem with my own task. Can anyone help with this?

hdvvip commented on September 26, 2024

Recently, I came across OpenAI's InstructGPT, an upgraded version of GPT-3 that has been trained with reinforcement learning.
The reinforcement learning algorithm they used to train InstructGPT is PPO, which is what this GitHub repository implements.
Regarding the problem of the reward stagnating or going down, I think even OpenAI (the fathers of PPO) faces the same issue. Please see Figure 13 of the paper:
"As shown in Figure 13, the reward saturates after the initial 400k examples of training."

Here is the InstructGPT paper:
https://arxiv.org/pdf/2203.02155.pdf

hdvvip commented on September 26, 2024

Thus, if you use PPO on your task and it doesn't work, don't be surprised! As I said above, PPO will work on some tasks and not on others.

Alymostafa commented on September 26, 2024

Thanks for the clarification. My point was about his observations: with a small batch size, what he described happens, but when I increased the batch size I was able to reproduce the results in the notebook.

lvwerra commented on September 26, 2024

Thanks for the discussion here. Indeed, it can depend a lot on the hyperparameters as well as the task. Great that you found that increasing the batch size works. I think this is still a very underexplored area!
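
For anyone wondering why batch size could matter at all, here is a rough, library-free sketch of one plausible factor (not a confirmed cause of this issue): the mean reward of a batch is a noisier estimate when the batch is small. The unit-variance Gaussian reward and the helper name `batch_mean_spread` are assumptions made up for illustration; nothing here uses trl.

```python
import random
import statistics


def batch_mean_spread(batch_size, num_batches=1000, seed=0):
    """Standard deviation of the batch-mean reward over many simulated batches.

    Rewards are drawn from a unit-variance Gaussian, a stand-in for a noisy
    reward model (an assumption, not the reward used in this issue).
    """
    rng = random.Random(seed)
    means = []
    for _ in range(num_batches):
        rewards = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(statistics.fmean(rewards))
    return statistics.stdev(means)


if __name__ == "__main__":
    # The spread shrinks roughly as 1 / sqrt(batch_size).
    for bs in (4, 16, 64, 256):
        print(f"batch_size={bs:4d}  spread of mean reward={batch_mean_spread(bs):.3f}")
```

With a noisier batch-level signal, a real trend in the reward curve is harder to tell apart from fluctuation, so a flat-looking curve at small batch sizes may partly reflect noise. Whether that explains the runs in this thread would need to be checked on the actual task.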

leoribeiro commented on September 26, 2024

@adhitya-synth I face the same problem when using longer text. Did you figure out a way to overcome this?