
Comments (3)

vwxyzjn commented on July 17, 2024

Thanks for the issue. Regarding the model:

> The default AutoModelForSequenceClassification implementation in Transformers uses bias=False for the classification nn.Linear

That is expected: the bias adds the same constant to both the chosen and rejected scores, so it cancels out in the pairwise RM loss (there is a short sketch of this further down). Here is a script that trains using that RM:

```
examples/scripts/ppo/ppo.py \
    --output_dir models/minimal/ppo1 \
    --num_ppo_epochs 4 \
    --num_mini_batches 1 \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 32 \
    --gradient_accumulation_steps 16 \
    --local_rollout_forward_batch_size 32 \
    --total_episodes 100000 \
    --model_name_or_path EleutherAI/pythia-1b-deduped \
    --sft_model_path EleutherAI/pythia-1b-deduped \
    --reward_model_path trl-internal-testing/rm_sentiment_1b \
    --kl_coef 0.1 \
    --stop_token period \
    --non_eos_penalty \
    --min_response_length 13 \
    --penalty_reward_value -3
```

wandb here: https://wandb.ai/costa-huang/huggingface/runs/fmof4oxq/workspace?nw=nwusercostahuang

You can see the model's completions work as intended: the output text becomes more positive.

(Attached screen recording: Screen.Recording.2024-06-27.at.1.40.49.PM.mov)
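
To make the bias point above concrete, here is a minimal sketch (my own illustration, not code from TRL) of the standard pairwise reward-model loss; a constant bias added by the classification head drops out of the chosen-minus-rejected difference:

```python
# Minimal sketch: a bias in the RM's final linear layer adds the same constant to every
# score, so it cancels in the pairwise (Bradley-Terry style) loss.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_scores, rejected_scores):
    # loss = -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

chosen = torch.tensor([1.3, 0.2, -0.5])
rejected = torch.tensor([0.7, -0.1, -1.2])
bias = 10.0  # any constant the head's bias term could add

print(torch.allclose(
    pairwise_rm_loss(chosen, rejected),
    pairwise_rm_loss(chosen + bias, rejected + bias),
))  # True: the bias drops out of the difference
```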

> It seems the value model is instantiated separately. Is my understanding correct here?

Yes, the separate value network follows OpenAI's setup in Summarize from Feedback and InstructGPT.
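
For anyone skimming later, the model setup in the example script above looks roughly like the sketch below (paraphrased from memory, so treat the initialization details as illustrative rather than a verbatim excerpt):

```python
# Sketch of the separate policy / reference / reward / value models used for PPO,
# paraphrased from the example script above (not a verbatim excerpt).
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification

sft_model_path = "EleutherAI/pythia-1b-deduped"
reward_model_path = "trl-internal-testing/rm_sentiment_1b"

policy = AutoModelForCausalLM.from_pretrained(sft_model_path)      # trained by PPO
ref_policy = AutoModelForCausalLM.from_pretrained(sft_model_path)  # frozen KL reference

# The reward model stays frozen and scores complete responses; the value model is a
# separate scalar-head network trained alongside the policy to predict returns.
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_path, num_labels=1)
value_model = AutoModelForSequenceClassification.from_pretrained(reward_model_path, num_labels=1)
```

Initializing the value model from the reward model's weights rather than from scratch mirrors what the InstructGPT paper describes, as far as I understand it.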

> P.S. For context, I've been working on a PPO implementation in parallel in Torchtune pytorch/torchtune#1005, and I've found all the empirical work and implementation details invaluable so far.

that's amazing 💪👍!


SalmanMohammadi commented on July 17, 2024

Thanks so much for the reply!

> Here is a script that trains using that RM

I've been hunting for this while doing some replication work against PPOv2, so this is really helpful, thanks :)

> the separate value network follows OpenAI's setup in Summarize from Feedback and InstructGPT

I'd be really interested to hear if you have any thoughts on reducing the memory footprint of PPO. I noticed you were trying out some PEFT approaches similar to PPOv1; did you end up scaling the PEFT experiments for comparison?


vwxyzjn commented on July 17, 2024

Yes, PEFT absolutely helps with memory. In the N+ implementation work, @mnoukhov ran some PEFT experiments and they perform quite well too. See the screenshot below (it's missing the 6.9B LoRA checkpoint results, but it's promising).

(Screenshot of the PEFT results attached.)
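
In case it helps, this is the kind of PEFT setup I mean: a minimal sketch that wraps the policy in a LoRA adapter via peft. The rank, alpha, and target modules below are illustrative defaults, not the settings behind the screenshot:

```python
# Hedged sketch: wrap the policy in a LoRA adapter so only the low-rank matrices are
# trained, which cuts gradient and optimizer-state memory during PPO.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

policy = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b-deduped")

lora_config = LoraConfig(
    r=16,                                # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # attention projection name in GPT-NeoX / Pythia
    task_type="CAUSAL_LM",
)

policy = get_peft_model(policy, lora_config)
policy.print_trainable_parameters()  # only the adapter weights require gradients
```

A nice side effect (as in the PPOv1 PEFT path, if I remember correctly) is that the reference policy's log-probs can be computed by temporarily disabling the adapter, so you don't need to hold a second full copy of the base model in memory.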

