
RLPHF

This is the official GitHub repository for Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging.

Citation:

@article{jang2023personalized,
  title={Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging},
  author={Jang, Joel and Kim, Seungone and Lin, Bill Yuchen and Wang, Yizhong and Hessel, Jack and Zettlemoyer, Luke and Hajishirzi, Hannaneh and Choi, Yejin and Ammanabrolu, Prithviraj},
  journal={arXiv preprint arXiv:2310.11564},
  year={2023}
}

Setup

Install dependencies

pip install -r requirements.txt

Get the data and unzip it

wget https://storage.googleapis.com/personalized-soups/data.zip
unzip data.zip

Step 1 - Generate Rollouts

torchrun --nnodes 1 --nproc_per_node 1 generate_rollouts.py \
    --output_dir $OUTPUT_DIR \
    --base_model $PATH_TO_TULU_CKPT \
    --dataset_name 'data/alpaca_gpt4_10k.json' \
    --prompt 'Generate a response that can be easily understood by an elementary school student.' \
    --batch_size 16 \
    --start_per 0 \
    --end_per 100

To get the Tulu checkpoints, refer to the Tulu repository. Feel free to use any custom prompt from the prompt config.
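Conceptually, rollout generation attaches the chosen preference prompt to every instruction before sampling. A minimal sketch of the idea (the "instruction" field name and the exact template are assumptions, not necessarily the repo's format):

import json

PREFERENCE = ("Generate a response that can be easily understood "
              "by an elementary school student.")

with open("data/alpaca_gpt4_10k.json") as f:
    examples = json.load(f)

# Append the preference prompt to each instruction before generation.
prompts = [f"{ex['instruction']} {PREFERENCE}" for ex in examples]
print(prompts[0])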

Step 2 - Label generated rollouts using GPT-4

cd gpt4_annotate;
python run.py --open_ai_key $OPEN_AI_KEY \
    --input_dir $ROLLOUT_DIR \
    --saving_path $SAVE_PATH \
    --annotators $YAML_FILE_OF_ANNOTATOR_CONFIG

The .yaml files of the GPT-4 annotator configs used for our experiments are provided in the GPT4_b5 directory. First, clone https://github.com/tatsu-lab/alpaca_farm.git. Next, place the GPT4_b5 directory inside alpaca_farm/auto_annotations/annotators and point --annotators at the target .yaml file (e.g. pref1A.yaml). Please refer to the AlpacaFarm code repository for more details.
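For reference, this is roughly what happens under the hood (a sketch following AlpacaFarm's documented PairwiseAutoAnnotator usage; the keys of outputs_pairs are an assumption based on AlpacaFarm's examples):

from alpaca_farm.auto_annotations import PairwiseAutoAnnotator

# Pairs of rollouts to compare: the instruction/input plus two candidate outputs.
outputs_pairs = [
    {"instruction": "Explain photosynthesis.", "input": "",
     "output_1": "rollout A ...", "output_2": "rollout B ..."},
]

annotator = PairwiseAutoAnnotator(annotators_config="GPT4_b5/pref1A.yaml")
annotated = annotator.annotate_pairs(outputs_pairs)  # adds a preference label per pair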

Step 3 - Reward Model Training

Next, we use the GPT-4 annotations for reward model training. An example script is provided below:

torchrun --nnodes 1 --nproc_per_node 4 training_reward_model.py \
    --model_name $PATH_TO_TULU_CKPT \
    --dataset_name $PATH_TO_RM_DATA \
    --eval_dataset_name $EVAL_DATASET_NAME \
    --output_dir $OUTPUT_DIR \
    --per_device_train_batch_size 2 \
    --num_train_epochs 1 \
    --wandb_project $WANDB_PROJECT_NAME \
    --wandb_run_name $WANDB_RUN_NAME

You can find the reward model training data in the data/rm_training directory. You can also create your own custom eval dataset for RM training.
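The reward model is trained on these pairwise comparisons with the standard Bradley-Terry style objective; a minimal sketch of the loss (not necessarily the repo's exact implementation):

import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Maximize the margin between the preferred and rejected rollout scores."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example: scalar rewards for a batch of two comparison pairs.
loss = pairwise_reward_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))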

Step 4 - Policy Model Training

Here are sample scripts you can use to train each model:

Traditional RLHF

torchrun --nnodes 1 --nproc_per_node 4 training/rlhf.py \
    --dataset_name 'data/alpaca_gpt4_10k.json' \
    --model_name $PATH_TO_TULU_CKPT \
    --reward_model_name $DIR_TO_RM \
    --output_dir $OUTPUT_DIR \
    --adafactor False --save_freq 10 --output_max_length 512 --batch_size 16 --gradient_accumulation_steps 8 --batched_gen True --ppo_epochs 8 --learning_rate 1.4e-5 --mini_batch_size 2 \
    --early_stopping True --log_with wandb --val_dataset_name 'data/koala_eval_50_.json' --val_every_n_steps 10 \
    --wandb_project $WANDB_PROJECT_NAME --wandb_run_name $WANDB_RUN_NAME

$DIR_TO_RM is the directory containing the adapter_model.bin produced by the reward model training run.
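If you need to score text with the trained reward model outside the PPO loop, the adapter can be re-attached to the base model. A minimal sketch, assuming the RM was trained as a PEFT/LoRA adapter (which is what produces adapter_model.bin) with a single-label sequence-classification head; the exact model classes may differ from the repo's:

import torch
from peft import PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

BASE = "path/to/tulu_ckpt"      # $PATH_TO_TULU_CKPT
ADAPTER = "path/to/rm_output"   # $DIR_TO_RM, contains adapter_model.bin

tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForSequenceClassification.from_pretrained(
    BASE, num_labels=1, torch_dtype=torch.bfloat16)
reward_model = PeftModel.from_pretrained(base, ADAPTER)

inputs = tokenizer("Instruction ... Response ...", return_tensors="pt")
with torch.no_grad():
    reward = reward_model(**inputs).logits.squeeze()  # scalar reward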

Multitask Training

torchrun --nnodes 1 --nproc_per_node 4 training/multitask_training.py \
    --base_model $PATH_TO_TULU_CKPT \
    --dataset_name 'data/alpaca_gpt4_10k_mt.json' \
    --streaming --lr_scheduler_type 'constant' \
    --learning_rate 1e-5 --max_steps 1000 \
    --output_dir $OUTPUT_DIR \
    --project_name $WANDB_PROJECT_NAME --run_name $WANDB_RUN_NAME

P-MORL

torchrun --nnodes 1 --nproc_per_node 4 training/pmorl.py \
    --dataset_name 'data/alpaca_gpt4_pmorl_8.json' \
    --model_name $PATH_TO_TULU_CKPT \
    --reward_model_name $DIR_TO_RM \
    --output_dir $OUTPUT_DIR \
    --adafactor False --save_freq 10 --output_max_length 512 --batch_size 16 --gradient_accumulation_steps 8 --batched_gen True --ppo_epochs 8 --learning_rate 1.4e-5 --mini_batch_size 2  \
    --early_stopping True --log_with wandb --wandb_project $WANDB_PROJECT_NAME --wandb_run_name $WANDB_RUN_NAME  \
    --val_dataset_name 'data/koala_eval_50_.json' --val_every_n_steps 10

P-Soups

torchrun --nnodes 1 --nproc_per_node 4 training/psoups.py \
    --dataset_name 'data/psoups/alpaca_gpt4_P1A_10k.json' \
    --model_name $PATH_TO_TULU_CKPT \
    --reward_model_name $DIR_TO_RM \
    --output_dir $OUTPUT_DIR \
    --adafactor False --save_freq 10 --output_max_length 512 --batch_size 16 --gradient_accumulation_steps 8 --batched_gen True --ppo_epochs 8 --learning_rate 1.4e-5 --mini_batch_size 2 \
    --early_stopping True --log_with wandb --wandb_project $WANDB_PROJECT_NAME --wandb_run_name $WANDB_RUN_NAME  \
    --val_dataset_name 'data/koala_eval_50_.json' --val_every_n_steps 10

You can choose among the different preference training files in the data/psoups directory.
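P-Soups personalizes by merging the independently trained policy adapters post hoc rather than retraining for every preference combination. A minimal sketch of uniform parameter averaging over LoRA adapter weights (presumably the equivalent of what eval.py does when given multiple --checkpoint_dirs; the directory names below are hypothetical):

import os
import torch

adapter_dirs = ["ppo_P1A", "ppo_P2B", "ppo_P3A"]  # hypothetical policy output dirs
state_dicts = [torch.load(os.path.join(d, "adapter_model.bin"), map_location="cpu")
               for d in adapter_dirs]

# Uniformly average every adapter tensor across the selected policies.
merged = {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(0)
          for k in state_dicts[0]}

os.makedirs("merged_adapter", exist_ok=True)
torch.save(merged, "merged_adapter/adapter_model.bin")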

Step 5 - Generate model outputs

Example of generating outputs using trained policy models (e.g. P-MORL)

torchrun --nnodes 1 --nproc_per_node 1 eval.py \
    --output_dir $OUTPUT_DIR --base_model $PATH_TO_TULU_CKPT \
    --dataset_name 'data/koala_eval_50.json' \
    --prompt "Generate a response that can easily be understandable by an elementary school student. Generate a response that is concise and to the point without being verbose. Generate a response that is friendly witty funny and humorous like a close friend." \
    --batch_size 16 --start_per 0 --end_per 100 \
    --checkpoint_dir $POLICY_MODEL_DIR

Example of generating outputs using P-Soups

torchrun --nnodes 1 --nproc_per_node 1 eval.py \
    --output_dir $OUTPUT_DIR --base_model $PATH_TO_TULU_CKPT \
    --dataset_name 'data/koala_eval_50.json' \
    --prompt "Generate a response that can easily be understandable by an elementary school student. Generate a response that is concise and to the point without being verbose. Generate a response that is friendly witty funny and humorous like a close friend." \
    --batch_size 16 --start_per 0 --end_per 100 \
    --checkpoint_dirs $POLICY_MODEL_DIR_1 \
    --checkpoint_dirs $POLICY_MODEL_DIR_2 \
    --checkpoint_dirs $POLICY_MODEL_DIR_3

You can append any combination of the preference prompts in --prompt that you want to evaluate.

Step 6 - GPT-4 Evaluation

After obtaining the model outputs from the previous step, you can use GPT-4 as an evaluator to measure the win rate across different baselines.

Go to ./gpt4_evaluate and run the following command:

python run.py \
    --input_dir1 $FIRST_OUTPUT_FILE \
    --input_dir2 $SECOND_OUTPUT_FILE \
    --annotators "annotators/criteria_wise_eval_gpt4/p1a.yaml" \
    --saving_path "./eval_results/crit=1A.json"

The demonstrations used for GPT-4 evaluation and the criteria mentioned in the paper are all stored under ./gpt4_evaluate/alpaca_farm/auto_annotators/criteria_wise_eval_gpt4. Feel free to add additional preferences you would like to evaluate!
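Once the annotations are saved, the win rate is just the fraction of comparisons decided in favor of the first model. A minimal sketch, assuming each record carries a "preference" field of 1 or 2 (the actual schema of the saved JSON may differ):

import json

with open("eval_results/crit=1A.json") as f:
    annotations = json.load(f)

# Keep only comparisons with a decided preference, then count model 1's wins.
decided = [a for a in annotations if a.get("preference") in (1, 2)]
wins = sum(a["preference"] == 1 for a in decided)
print(f"Model 1 win rate: {wins / len(decided):.1%}")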


Issues

Cannot run step 2 due to potential version mismatch.

Hi! Thanks for sharing such a great repository!
I ran Step 2 after Step 1, but an error occurs.

The following is the error message:

Traceback (most recent call last):
  File "/mnt/c/Users/qiang/Documents/GitHub/RLPHF/gpt4_annotate/run.py", line 106, in <module>
    main(args)
  File "/mnt/c/Users/qiang/Documents/GitHub/RLPHF/gpt4_annotate/run.py", line 81, in main
    annotator = PairwiseAutoAnnotator(annotators_config = args.annotators,
  File "/mnt/c/Users/qiang/Documents/GitHub/alpaca_farm/src/alpaca_farm/auto_annotations/eval.py", line 170, in __init__
    super().__init__(
  File "/home/mirror/anaconda3/envs/py310/lib/python3.10/site-packages/alpaca_eval/annotators/pairwise_evaluator.py", line 51, in __init__
    super().__init__(*args, **kwargs, primary_keys=self.input_keys + self.output_keys)
  File "/home/mirror/anaconda3/envs/py310/lib/python3.10/site-packages/alpaca_eval/annotators/base.py", line 436, in __init__
    super().__init__(*args, **kwargs)
  File "/home/mirror/anaconda3/envs/py310/lib/python3.10/site-packages/alpaca_eval/annotators/base.py", line 116, in __init__
    self.annotators = self._initialize_annotators()
  File "/home/mirror/anaconda3/envs/py310/lib/python3.10/site-packages/alpaca_eval/annotators/base.py", line 210, in _initialize_annotators
    return {
  File "/home/mirror/anaconda3/envs/py310/lib/python3.10/site-packages/alpaca_eval/annotators/base.py", line 211, in <dictcomp>
    name: self.SingleAnnotator(
  File "/mnt/c/Users/qiang/Documents/GitHub/alpaca_farm/src/alpaca_farm/auto_annotations/eval.py", line 186, in __init__
    super().__init__(*args, **kwargs)
  File "/home/mirror/anaconda3/envs/py310/lib/python3.10/site-packages/alpaca_eval/annotators/pairwise_evaluator.py", line 371, in __init__
    super().__init__(
TypeError: SingleAnnotator.__init__() got an unexpected keyword argument 'prompt_templates'

It seems that the versions of alpaca_farm and alpaca_eval are mismatched.

Could you please share the versions of alpaca_farm and alpaca_eval that you used? Or could you update the code to match the current alpaca_farm and alpaca_eval versions?

Some information that may help:

pip list | grep alpaca
alpaca_eval              0.5.2
alpaca-farm              0.1.12

Thanks!

multitask_training

Thank you for sharing the code. I noticed that multitask_training.py seems to be standard SFT tuning; why is it called multitask_training?
