Comments (10)
same problem here with a longer sequence.
@vblagoje
@lvwerra
from trl.
@adhitya-synth I used the same configuration you mentioned. With a small batch size I see the behavior you described, but with a larger batch size, as in the notebook, the reward increases.
Thus, based on the OpenAI experiments in the InstructGPT paper, I think it depends on the dataset you use to train your model. In OpenAI's case, even with the best implementation of PPO, they still failed to improve the rewards when they trained GPT-3 using PPO on the FLAN and T0 datasets.
Well, I think we have a misunderstanding here. I didn't specifically mention you in my post. I just wanted to explain to everyone here that, depending on your task, PPO may or may not work. So it's not your fault when PPO fails on your NLP task. Everyone here has a different task, so my answer had nothing to do with batch size. BTW, OpenAI used a batch size of 128 and still failed.
I confirm that this issue happens. I'm facing the same problem with my own task. Can anyone help with this?
Recently, I came across OpenAI's InstructGPT, an upgraded version of GPT-3 that was trained with reinforcement learning.
The reinforcement learning algorithm they used to train InstructGPT is PPO, which is implemented in this GitHub repository.
As for the reward stagnating or going down, I think even OpenAI (the fathers of PPO) face the same issue. See Figure 13 of the paper, which says:
"As shown in Figure 13, the reward saturates after the initial 400k examples of training."
Here is the InstructGPT paper:
https://arxiv.org/pdf/2203.02155.pdf
Thus, if you used PPO on your task and it doesn't work, don't be surprised! Like I said above, PPO will work on some tasks and not on others.
Thanks for the clarification. My point was about his observations: with a small batch size I see what he described, but when I increased the batch size I was able to reproduce the same results as in the notebook.
Thanks for the discussion here. Indeed, it can depend a lot on the hyperparameters as well as the task. Great that you found that increasing the batch size works. I think this is still a very underexplored area!
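For anyone who wants some intuition for why batch size alone can change the picture, here is a toy sketch. It does not use trl or PPO at all: it just assumes (hypothetically) that per-sample rewards are Gaussian noise around a fixed mean, and measures how much the batch-mean reward fluctuates for different batch sizes. The function name `batch_mean_spread` and all of the numbers are made up for illustration. This is only one possible factor and not a diagnosis of the behavior reported in this thread.

```python
import random
import statistics


def batch_mean_spread(batch_size, n_batches=2000, seed=0):
    """Return the std. dev. of the batch-mean reward over simulated batches."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_batches):
        # Assumption: per-sample rewards are Gaussian, mean 0.0 and std 1.0.
        rewards = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(statistics.fmean(rewards))
    return statistics.stdev(means)


if __name__ == "__main__":
    # The spread should shrink roughly like 1 / sqrt(batch_size).
    for bs in (4, 16, 64, 256):
        print(bs, round(batch_mean_spread(bs), 3))
```

Under these assumptions, a small batch gives a much noisier estimate of the mean reward per step, which might make a reward curve look flat or erratic even when the underlying trend is upward.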
@adhitya-synth I face the same problem when using longer text. Did you figure out a way to overcome this?