Comments (3)
Thanks for the issue. Regarding the model: the default `AutoModelForSequenceClassification` implementation in Transformers uses `bias=False` for the classification `nn.Linear`. That is expected because the bias cancels out in the RM loss.
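For intuition, here is a minimal sketch (not TRL's actual code) of why that holds: the pairwise Bradley-Terry RM loss only sees the difference between the chosen and rejected scores, so a shared bias term drops out exactly.

```python
import torch
import torch.nn.functional as F

# Scores are r = h @ w + b; the pairwise loss depends only on
# r_chosen - r_rejected, so any shared bias b cancels.
h_chosen = torch.randn(4, 8)    # pooled hidden states for chosen responses
h_rejected = torch.randn(4, 8)  # pooled hidden states for rejected responses
w = torch.randn(8)
b = torch.tensor(3.14)          # arbitrary bias value

loss_with_bias = -F.logsigmoid((h_chosen @ w + b) - (h_rejected @ w + b)).mean()
loss_without_bias = -F.logsigmoid(h_chosen @ w - h_rejected @ w).mean()
assert torch.allclose(loss_with_bias, loss_without_bias)
```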
Here is a script that trains using the said RM:

```bash
python examples/scripts/ppo/ppo.py \
    --output_dir models/minimal/ppo1 \
    --num_ppo_epochs 4 \
    --num_mini_batches 1 \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 32 \
    --gradient_accumulation_steps 16 \
    --local_rollout_forward_batch_size 32 \
    --total_episodes 100000 \
    --model_name_or_path EleutherAI/pythia-1b-deduped \
    --sft_model_path EleutherAI/pythia-1b-deduped \
    --reward_model_path trl-internal-testing/rm_sentiment_1b \
    --kl_coef 0.1 \
    --stop_token period \
    --non_eos_penalty \
    --min_response_length 13 \
    --penalty_reward_value -3
```
Wandb run here: https://wandb.ai/costa-huang/huggingface/runs/fmof4oxq/workspace?nw=nwusercostahuang
You can see that the model's completions work as intended: the output text becomes more positive.
[Video: Screen.Recording.2024-06-27.at.1.40.49.PM.mov]
> It seems the value model is instantiated separately. Is my understanding correct here?
Yes. The separate value network follows OpenAI's setup in Summarize from Feedback and InstructGPT.
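For reference, a minimal sketch of what that looks like, mirroring the example script above (exact APIs may differ across TRL versions): the value model is its own scalar-head model, typically initialized from the reward model, rather than sharing weights with the policy.

```python
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification

# Sketch: four separate networks, per the OpenAI setup.
policy = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b-deduped")
ref_policy = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b-deduped")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "trl-internal-testing/rm_sentiment_1b", num_labels=1
)
# Value model is instantiated separately, initialized from the RM weights.
value_model = AutoModelForSequenceClassification.from_pretrained(
    "trl-internal-testing/rm_sentiment_1b", num_labels=1
)
```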
P.S. For context, I've been working on a parallel PPO implementation in Torchtune (pytorch/torchtune#1005), and I've found all the empirical work and implementation details here invaluable so far.
that's amazing 💪👍!
Thanks so much for the reply!
> Here is a script that trains using the said RM
I've been hunting for this whilst doing some replication work against PPOv2 - so helpful, thanks :)
> The separate value network follows OpenAI's setup in Summarize from Feedback and InstructGPT.
I'd be really interested to hear if you have any thoughts on reducing the memory footprint of PPO. I noticed you were trying out some PEFT stuff similar to PPOv1; did you end up scaling the PEFT experiments to compare?
Yes, PEFT absolutely helps with the memory. In the N+ implementation-details work, @mnoukhov ran some PEFT experiments and they perform pretty well, too. See the screenshot below (it's missing the 6.9B LoRA checkpoint results, but it's pretty promising).
[Screenshot: PEFT/LoRA experiment results across model sizes]
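As a rough sketch of the kind of setup involved (my own illustration, not the exact configuration used in those runs): attaching a LoRA adapter to the policy means only the low-rank matrices carry gradients and optimizer state, which is where most of the memory savings come from.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical example: wrap the policy in a LoRA adapter so only the
# low-rank matrices are trainable; the base weights stay frozen.
policy = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b-deduped")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # GPT-NeoX/Pythia attention projection
    task_type="CAUSAL_LM",
)
policy = get_peft_model(policy, lora_config)
policy.print_trainable_parameters()  # typically well under 1% of all params
```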