

πŸ€— Models | πŸ€— Datasets

Self-Play Fine-Tuning (SPIN)


This repository contains the official code for the paper "Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models".

Authors: Zixiang Chen*, Yihe Deng*, Huizhuo Yuan*, Kaixuan Ji, Quanquan Gu

[Webpage] [Huggingface]

πŸ”” News

  • [05/01/2024] SPIN has been accepted by ICML2024!
  • [04/06/2024] We released the training scripts to reproduce our results.
  • [04/04/2024] ❗ We've discovered that the datasets previously uploaded are incorrect. We've re-uploaded the corrected datasets.
  • [02/13/2024] We updated our arXiv v2: https://arxiv.org/abs/2401.01335.
  • [02/09/2024] Our code is open-sourced!
  • [01/02/2024] Our paper is released on arXiv: https://arxiv.org/abs/2401.01335.

❗ Note: We notice that the Alignment Handbook has updated their configuration and SFT checkpoint since our experiments. The configuration and SFT model from the Alignment Handbook that we used in our experiments for data generation and fine-tuning are the older version (Config, Model). The model checkpoint on Hugging Face (alignment-handbook/zephyr-7b-sft-full) has been updated once with the new config. If you wish to use the newest SFT model, you need to either load the checkpoint with revision=ac6e600eefcce74f5e8bae1035d4f66019e93190 or generate your own data instead of using the datasets we provided on Hugging Face.
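
If you want to pin the older SFT checkpoint mentioned above, a minimal sketch with transformers might look like the following (the revision hash is the one quoted in the note):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Pin the older zephyr-7b-sft-full checkpoint used in the SPIN experiments
# (revision hash quoted in the note above).
revision = "ac6e600eefcce74f5e8bae1035d4f66019e93190"
model = AutoModelForCausalLM.from_pretrained("alignment-handbook/zephyr-7b-sft-full", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("alignment-handbook/zephyr-7b-sft-full", revision=revision)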


πŸŒ€ About SPIN

SPIN utilizes a self-play mechanism, allowing an LLM to improve itself by playing against its previous iterations, without needing any additional human-annotated preference data beyond the SFT dataset itself. More specifically, the LLM generates its own training data from its previous iterations, refining its policy by discerning these self-generated responses from the original SFT data.


Average score of SPIN at different iterations on the HuggingFace Open LLM leaderboard.

SPIN can significantly enhance the performance of an LLM after SFT across various benchmarks, outperforming the model trained with direct preference optimization (DPO) on labelled preference datasets. The approach is theoretically grounded, ensuring that the LLM aligns with the target data distribution, and empirically validated through extensive evaluations on multiple datasets.


Performance comparison with DPO training across the six benchmark datasets. SPIN at iteration 0 achieves comparable performance to DPO training with 62k new data. At iteration 1, SPIN has already surpassed DPO training on the majority of datasets.

For more details, you can check our paper here.

Setup

The following steps provide the necessary setup to run our code.

  1. Create a Python virtual environment with Conda:
conda create -n myenv python=3.10
conda activate myenv
  2. Install the Python dependencies required to run the code:
python -m pip install .
python -m pip install flash-attn --no-build-isolation
  3. Log in to your Hugging Face account to download models:
huggingface-cli login --token "${your_access_token}"

Data

We provide the data used in our experiments, along with the synthetic data we generated, both in this repo and on HuggingFace. These data are converted to .parquet format for fine-tuning.

❗ On 04/04/2024, we discovered that the datasets previously uploaded were incorrect. We've re-uploaded the corrected datasets.

Dataset Download
SPIN_iter0 πŸ€— HuggingFace
SPIN_iter1 πŸ€— HuggingFace
SPIN_iter2 πŸ€— HuggingFace
SPIN_iter3 πŸ€— HuggingFace

The input data for our code must be in the following format, where each example contains the attributes below (similar to HuggingFaceH4/ultrafeedback_binarized):

{
    "real": [{"role": "user", "content": <prompt>}, 
               {"role": "assistant", "content": <ground truth>}],
    "generated": [{"role": "user", "content": <prompt>}, 
                 {"role": "assistant", "content": <generation>}]
}

πŸ” Note: During data generation, the content for generated response can be empty, as we only uses prompt to generate model responses.

Model

We also provide our model checkpoints at iterations 0, 1, 2, and 3 on HuggingFace.

Model Download
zephyr-7b-sft-full-SPIN-iter0 πŸ€— HuggingFace
zephyr-7b-sft-full-SPIN-iter1 πŸ€— HuggingFace
zephyr-7b-sft-full-SPIN-iter2 πŸ€— HuggingFace
zephyr-7b-sft-full-SPIN-iter3 πŸ€— HuggingFace

πŸ” Note: With the provided data, you can directly jump to Step 2: Fine-tuning without doing data generation on your own. You may also start from any iteration to reproduce our results using our open-sourced model checkpoints.

Usage

For each SPIN iteration, we generate all synthetic data at once, then fine-tune the LLM on the resulting pairs of real and synthetic data.

Step 0 (optional): Reformatting SFT dataset

python spin/reformat.py [options]

Options

  • --data: directory to the SFT dataset (local or huggingface)
    • default: HuggingFaceH4/ultrachat_200k
  • --output_dir: local directory for the reformatted data files
    • default: UCLA-AGI/SPIN_iter0

πŸ” Note: If choosing to use SPIN on the entire dataset of HuggingFaceH4/ultrachat_200k instead of our 50k subset, one can reformat the original data with spin/reformat.py. To use other datasets, simply convert the data into the same format and resume with the following steps.

Step 1: Generation

accelerate launch spin/generate.py [options]

Options

  • --model: load model checkpoint for generation.
    • default: alignment-handbook/zephyr-7b-sft-full
  • --input_dir: directory to the data files with prompts for generation
    • The code expects data in the format shown below.
    • default: UCLA-AGI/SPIN_iter0
  • --output_dir: directory to save the output data.
  • --batch_size: per device batch size
    • default: 16
  • --data_frac: break the data into fractions so generation can be split across servers or runs.
    • --frac_len: length of each data fraction. The default is 0, which uses the entire dataset for generation. Set frac_len to a positive number to generate for only one fraction of the data. We recommend a smaller frac_len (e.g. 800) so that data is generated in small batches, since generation is very time-consuming and this limits the impact of unexpected crashes.
    • Set data_frac to 0, 1, 2, ... to generate different fractions of length frac_len.
    • Note: keep frac_len the same across runs when splitting generation with data_frac.
  • --split: choose the split for data generation
    • default: train
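
Putting the options together, a hedged example that generates responses for the first 800-prompt fraction of the train split with the original SFT model (the output path is illustrative):

accelerate launch spin/generate.py \
    --model alignment-handbook/zephyr-7b-sft-full \
    --input_dir UCLA-AGI/SPIN_iter0 \
    --output_dir generated/iter0 \
    --batch_size 16 \
    --frac_len 800 \
    --data_frac 0 \
    --split train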

The generated data is in JSON format, where each example contains the following attributes:

{
    "real": [{"role": "user", "content": <prompt>}, 
               {"role": "assistant", "content": <ground truth>}],
    "generated": [{"role": "user", "content": <prompt>}, 
                 {"role": "assistant", "content": <generation>}]
}

Note: the iteration number for data generation depends solely on which model is used for generation (e.g. use the original SFT model for iter0 data generation and the SPIN iter0 model for iter1 data generation). The generate.py script only uses the prompts/questions from the data; the responses come from the specified model.

Example. The following command generates 8k synthetic examples for iteration 0.

bash scripts/generate.sh

πŸš€ Faster generation with vLLM

Alternatively, you can use the following example script to generate LLM responses with a speedup from vLLM. A larger frac_len can be used with vLLM.

bash scripts/generate_vllm.sh

Thanks to @sumo43 for implementing vLLM for generation.

Step 1.5: Gather generations and convert data type

python spin/convert_data.py [options]

Options

  • --num_fracs: number of files to load in.
  • --input_dir: directory to the generated data files.
  • --output_dir: directory for the unified data that will be used for fine-tuning.

The code will produce the two final data files, train_prefs-00000-of-00001.parquet and test_prefs-00000-of-00001.parquet, which will be used for fine-tuning.

Note: make sure to collect the generated data files into the single directory given by --input_dir.

Example.

python spin/convert_data.py --output_dir new_data/iter0 --input_dir generated/iter0 --num_fracs 63

Step 2: Fine-tuning

accelerate launch --config_file configs/multi_gpu.yaml --num_processes=8 --main_process_port 29500 spin/run_spin.py configs/config.yaml

You might need to change the configuration in configs/config.yaml. Here are some key configs you might need to customize:

  • model_name_or_path: load model checkpoint for finetuning.
    • default: alignment-handbook/zephyr-7b-sft-full
  • dataset_mixer: choose data to mix for fine-tuning.
    • default: UCLA-AGI/SPIN_iter0: 1.0
    • For SPIN at iterations 1 and 2, we included both the current iteration and the previous iteration (e.g. for iteration 1 we included both UCLA-AGI/SPIN_iter0: 1.0 and UCLA-AGI/SPIN_iter1: 1.0, summing to 100k examples).
  • output_dir: the output directory of finetuned model and checkpoints.
    • default: outputs
  • per_device_train_batch_size: batch size on one GPU.
    • default: 16
  • gradient_accumulation_steps: make sure that the product per_device_train_batch_size * num_processes * gradient_accumulation_steps equals your target effective batch size.
  • num_train_epochs: the training epochs of this iteration.
    • default: 3
  • beta: beta in SPIN.
    • default: 0.1
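
As an illustration, a hedged sketch of how these keys might look in configs/config.yaml for iteration 1 (the model and output paths are assumptions; the dataset mix follows the description above, and with per_device_train_batch_size 16, 8 processes, and gradient_accumulation_steps 1 the effective batch size would be 16 * 8 * 1 = 128):

model_name_or_path: UCLA-AGI/zephyr-7b-sft-full-SPIN-iter0   # assumed base model for iteration 1
dataset_mixer:
  UCLA-AGI/SPIN_iter0: 1.0
  UCLA-AGI/SPIN_iter1: 1.0
output_dir: outputs/iter1                                    # illustrative
per_device_train_batch_size: 16
gradient_accumulation_steps: 1
num_train_epochs: 3
beta: 0.1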

In our experiments, we do full fine-tuning on a multi-GPU machine with DeepSpeed ZeRO-3 (requires A100 GPUs with 80GB memory).

Example.

bash scripts/finetune.sh

Reproducing Our Results

To help reproduce our results, we have made available the scripts corresponding to all four iterations of our study. These scripts are pre-configured with the exact parameters and model versions used in our paper. For each iteration, the base model is initialized with the version released on πŸ€— HuggingFace, which can be found at the following links:

Model Download
zephyr-7b-sft-full-SPIN-iter0 πŸ€— HuggingFace
zephyr-7b-sft-full-SPIN-iter1 πŸ€— HuggingFace
zephyr-7b-sft-full-SPIN-iter2 πŸ€— HuggingFace
zephyr-7b-sft-full-SPIN-iter3 πŸ€— HuggingFace

To execute the full pipeline using your locally trained models as the base, modify the model_name_or_path parameter in the configuration files to point to your model's path.

To start the full fine-tuning process, run the corresponding script from your terminal:

bash scripts/finetune.sh
bash scripts/finetune_iter1.sh
bash scripts/finetune_iter2.sh
bash scripts/finetune_iter3.sh

By following these steps, you should be able to reproduce our results.


Evaluation

For our evaluation on the Open LLM Leaderboard, please use this lm-evaluation-harness, which is v0.4.0. Also note that we set the number of few-shot examples to match the Leaderboard's instructions. Different evaluation versions result in different scores, but the trend remains the same.
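
For example, a hedged lm-evaluation-harness (v0.4.0) invocation for one leaderboard task, using the 25-shot setting the Leaderboard specifies for ARC (the model name and batch size are illustrative):

lm_eval --model hf \
    --model_args pretrained=UCLA-AGI/zephyr-7b-sft-full-SPIN-iter0,dtype=bfloat16 \
    --tasks arc_challenge \
    --num_fewshot 25 \
    --batch_size 8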

Star History

Star History Chart

Citation

If you find this repo useful for your research, please consider citing the paper

@misc{chen2024selfplay,
      title={Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models}, 
      author={Zixiang Chen and Yihe Deng and Huizhuo Yuan and Kaixuan Ji and Quanquan Gu},
      year={2024},
      eprint={2401.01335},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Acknowledgement

This repo is built upon The Alignment Handbook. We thank the authors for their great work.


spin's Issues

Theoretical Analysis and Idea of SPIN are quite weird (may not make senses)??

  1. At iteration 0, $p_{\theta_0} = p_{\mathrm{SFT}}$, and the global optimum $p_{\theta_1}$ of the following objective after iteration 1 will still be $p_{\mathrm{SFT}}$. Thus, all subsequent iterates $p_{\theta}$ will remain $p_{\mathrm{SFT}}$.
[screenshot of the SPIN objective from the paper]
  2. In your theoretical analysis, you also prove that $p_{\theta^*} = p_{\mathrm{data}}$; however, $p_{\mathrm{SFT}} = p_{\theta^*}$ when minimizing the cross-entropy. I believe the reason why SPIN outperforms $p_{\mathrm{SFT}}$ is not clear. Could you please explain this?

Evaluation results on MT Bench and BBH

Hello,

Thanks for the great work! It seems that SPIN is effective in improving model performance on reasoning and math tasks (TruthfulQA, GSM8K, BB-Causal) but less so on knowledge tasks (MMLU).

I have a couple of questions to the authors:

  1. Which tasks in MT-Bench do you observe contributing the most to the performance gain? I would guess math, reasoning, and extraction?
    1.1. Do you plan to release the responses from your MT-Bench results?
  2. Have you tested on the full suite of BBH? Did you observe performance gains across all tasks or only on the subsets that you reported in the paper?
  3. The results in Figure 5 are intriguing: do you have results for other HF Open LLM Leaderboard benchmarks, i.e., MMLU and GSM8K? Do you also observe the same phenomena on these benchmarks?

I would really appreciate it if the authors could kindly respond to some of my questions.

Kind thanks,
FengTing

use_peft Not working?

Full fine-tuning works well, but use_peft gives an error.

My config is

# Model arguments
model_name_or_path: /workspace/sandbox/model_sealion7b

# Data training arguments
dataset_mixer:
  # HuggingFaceH4/ultrafeedback_binarized: 1.0
  UCLA-AGI/SPIN_iter0: 1.0
dataset_splits:
  - train
  - test
preprocessing_num_workers: 12

# Trainer arguments
bf16: true
beta: 0.1
do_eval: false
evaluation_strategy: "no"
eval_steps: 100
gradient_accumulation_steps: 1
gradient_checkpointing: true
hub_model_id: zephyr-7b-spin
learning_rate: 5.0e-7
log_level: info
logging_steps: 10
lr_scheduler_type: linear
max_length: 1024
max_prompt_length: 512
num_train_epochs: 3
optim: rmsprop
output_dir: outputs
per_device_train_batch_size: 1
per_device_eval_batch_size: 1
push_to_hub: false
save_strategy: "epoch"
save_total_limit: null
seed: 42
warmup_ratio: 0.1
use_peft: True
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules: ["down_proj", "out_proj", "up_proj", "Wqkv"]
trust_remote_code: True

I got error message
File "/workspace/SPIN/spin/run_spin.py", line 169, in main
train_result = spin_trainer.train()
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1860, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2725, in training_step
loss = self.compute_loss(model, inputs)
File "/workspace/SPIN/spin/alignment/trainer.py", line 579, in compute_loss
loss, metrics = self.get_batch_metrics(model, inputs, train_eval="train")
File "/workspace/SPIN/spin/alignment/trainer.py", line 546, in get_batch_metrics
) = self.concatenated_forward(self.ref_model, batch)
File "/workspace/SPIN/spin/alignment/trainer.py", line 504, in concatenated_forward
all_logits = model(
TypeError: 'NoneType' object is not callable

What should I do?

the four reward metrics

I wonder how the four reward metrics rewards/real, rewards/generated, rewards/accuracies, and rewards/margins change during training. Could you show some plots of them to help with reimplementation?

SPIN == DPO in self-iteration?

The following part of the paper explains the difference between SPIN and DPO.

[screenshot of the relevant paragraph from the paper]

It claims that DPO improves the model using instance-level information, while SPIN operates at the distribution level.

However, comparing the respective formulas, the difference is minor when the SFT data in SPIN ($y \sim p_{\mathrm{data}}$) is regarded as the winner $y_w$ in DPO and the LLM outputs in SPIN ($y \sim p_{\theta}$) are regarded as the loser $y_l$ in DPO.

[screenshots of the DPO and SPIN objectives]
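
Since the formula screenshots did not survive, here is a hedged LaTeX reconstruction of the two objectives as they are usually written ($\pi_{\mathrm{ref}}$ is the fixed reference model in DPO, $p_{\theta_t}$ the previous SPIN iterate, $\sigma$ the logistic function, and $\ell$ the logistic loss); the exact notation in the papers may differ:

% DPO objective (standard form)
L_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l) \sim D}\!\left[
  \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

% SPIN objective at iteration t
L_{\mathrm{SPIN}}(\theta) = \mathbb{E}_{x \sim q,\ y \sim p_{\mathrm{data}}(\cdot \mid x),\ y' \sim p_{\theta_t}(\cdot \mid x)}\!\left[
  \ell\!\left( \lambda \log \frac{p_\theta(y \mid x)}{p_{\theta_t}(y \mid x)}
  - \lambda \log \frac{p_\theta(y' \mid x)}{p_{\theta_t}(y' \mid x)} \right) \right]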

How can you explain this?

numpy version warnings

 Γ— python setup.py bdist_wheel did not run successfully.
  β”‚ exit code: 1
  ╰─> [45 lines of output]

      A module that was compiled using NumPy 1.x cannot be run in
      NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
      versions of NumPy, modules must be compiled with NumPy 2.0.
      Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

      If you are a user of the module, the easiest solution will be to
      downgrade to 'numpy<2' or try to upgrade the affected module.
      We expect that some modules will need time to support NumPy 2.

I think the easiest way is to limit the numpy version to <2?

Some detailed questions regarding SPIN

Hi, thanks for open sourcing your work. That's awesome. I have several questions about the details of the implementation.

  1. Why do you use the RMSProp optimizer and the linear schedule instead of the commonly used AdamW optimizer and the cosine schedule?
  2. How to prepare data for the iter 0 of SPIN? It seems that in generate.sh, it starts with the data preparation work for iter 1.
  3. Regarding the prompt template, it seems that you used a different template for synthetic data generation from the one which is used for zephyr. May I know the reason behind that? What if the same prompt is used for synthetic data generation and the SFT training?
  4. For synthetic data generation, the max output length is limited to 256, but during training the max length and the max prompt length are 1024 and 512 respectively. So, the synthetic data is ending with more paddings than the actual data during training. Is that right? Have you tried to extend the max output length for data generation?

I've also skimmed the code. Here are some potential improvements.

  1. https://github.com/uclaml/SPIN/blob/main/spin/generate_vllm.py#L46 For greedy generation, the temperature should be set to zero according to vllm.
  2. https://github.com/uclaml/SPIN/blob/main/spin/alignment/trainer.py#L552-L559 Swap the order of cpu() and mean() for faster metrics collection on slower CPUs

Cannot reproduce generated samples in UCLA-AGI/SPIN_iter0

I use revision="ac6e600eefcce74f5e8bae1035d4f66019e93190" to generate data for iter0, but cannot reproduce the generated samples in UCLA-AGI/SPIN_iter0. Since do_sample defaults to false, the generation should be the same. Below is an example:

#########
How does the location of the Sydney Conservatorium of Music impact the academic and professional opportunities available to music students, and how does the conservatorium support student engagement with the music industry in Australia?

[reproduced with generate.py]: The location of the Sydney Conservatorium of Music in the heart of Sydney's central business district provides students with access to a wide range of academic and professional opportunities. The conservatorium is situated in the historic University of Sydney Quadrangle, ........(omitted)

[provided in UCLA-AGI/SPIN_iter0]: he Conservatorium is located in the heart of the city of Sydney, specifically in the central business district. This provides students with unique access to academic and professional opportunities within the city, including the opportunity to learn from top faculty members in the music ........(omitted)

Generate multiple samples with sampling?

Hello! I'm wondering if it would help to generate multiple samples per prompt per round with sampling (topk or topp). Is this something you have looked into? Is doing so theoretically sound or flawed?

OOM with 8 A800

Hi, I got an OOM error while fine-tuning with qwen-14b-chat and with the default model, using:
accelerate launch --config_file configs/deepspeed_zero3.yaml --multi_gpu --num_processes=8 --main_process_port 29501 spin/run_spin.py configs/config.yaml --num_train_epochs=3 --output_dir="xxx/spin_outputs/iter0-ckpt"


system info

absl-py 2.1.0
accelerate 0.23.0
aiohttp 3.9.5
aioprometheus 23.12.0
aiosignal 1.3.1
annotated-types 0.7.0
anyio 4.4.0
async-timeout 4.0.3
attrs 23.2.0
bitsandbytes 0.41.2.post2
certifi 2024.6.2
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
cmake 3.29.6
contourpy 1.2.1
cycler 0.12.1
datasets 2.14.6
deepspeed 0.12.2
dill 0.3.7
diskcache 5.6.3
dnspython 2.6.1
docstring_parser 0.16
einops 0.8.0
email_validator 2.1.1
evaluate 0.4.0
exceptiongroup 1.2.1
fastapi 0.111.0
fastapi-cli 0.0.4
filelock 3.15.1
flash_attn 2.5.9.post1
fonttools 4.53.0
frozenlist 1.4.1
fsspec 2023.10.0
grpcio 1.64.1
h11 0.14.0
hjson 3.1.0
httpcore 1.0.5
httptools 0.6.1
httpx 0.27.0
huggingface-hub 0.23.3
idna 3.7
interegular 0.3.3
Jinja2 3.1.4
joblib 1.4.2
jsonlines 4.0.0
jsonschema 4.22.0
jsonschema-specifications 2023.12.1
kiwisolver 1.4.5
lark 1.1.9
llvmlite 0.43.0
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 2.1.5
matplotlib 3.9.0
mdurl 0.1.2
mpmath 1.3.0
msgpack 1.0.8
multidict 6.0.5
multiprocess 0.70.15
nest-asyncio 1.6.0
networkx 3.3
ninja 1.11.1.1
numba 0.60.0
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.5.40
nvidia-nvtx-cu12 12.1.105
opencv-python 4.10.0.84
orjson 3.10.5
outlines 0.0.34
packaging 24.1
pandas 2.2.2
peft 0.6.1
pillow 10.4.0
pip 24.0
prometheus_client 0.20.0
protobuf 3.20.2
psutil 5.9.8
py-cpuinfo 9.0.0
py4j 0.10.9.7
pyarrow 16.1.0
pydantic 2.7.4
pydantic_core 2.18.4
Pygments 2.18.0
pynvml 11.5.0
pyparsing 3.1.2
pyspark 3.5.1
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-multipart 0.0.9
pytz 2024.1
PyYAML 6.0.1
quantile-python 1.1
ray 2.24.0
referencing 0.35.1
regex 2024.5.15
requests 2.32.3
responses 0.18.0
rich 13.7.1
rpds-py 0.18.1
safetensors 0.4.3
scipy 1.13.1
seaborn 0.13.2
sentencepiece 0.2.0
setuptools 69.5.1
shellingham 1.5.4
shtab 1.7.1
six 1.16.0
sniffio 1.3.1
spin 0.1.0.dev0
starlette 0.37.2
sympy 1.12.1
tensorboard 2.17.0
tensorboard-data-server 0.7.2
tiktoken 0.6.0
tokenizers 0.15.2
torch 2.1.0
torchvision 0.18.1
tqdm 4.66.4
transformers 4.36.2
transformers-stream-generator 0.0.5
triton 2.1.0
trl 0.7.4
typer 0.12.3
typing_extensions 4.12.2
tyro 0.8.4
tzdata 2024.1
ujson 5.10.0
ultralytics-thop 2.0.0
urllib3 2.2.1
uvicorn 0.30.1
uvloop 0.19.0
vllm 0.3.0
watchfiles 0.22.0
websockets 12.0
Werkzeug 3.0.3
wheel 0.43.0
xformers 0.0.23.post1
xxhash 3.4.1
yarl 1.9.4

Thanks for your help in advance!

the logps decrease

Hi, I have tried SPIN with an agent dataset. As training progresses, the logps become smaller and smaller.

I understand that as the logps decrease, the model becomes increasingly uncertain about its predictions. Am I right?

Have you encountered this situation during fine-tuning with SPIN?

{'loss': 0.683, 'learning_rate': 1.4492753623188405e-07, 'rewards/chosen': -0.04396624490618706, 'rewards/rejected': -0.06808599829673767, 'rewards/accuracies': 0.668749988079071, 'rewards/margins': 0.024119755253195763, 'logps/rejected': -1.543066382408142, 'logps/chosen': -1.1352736949920654, 'logps/rejected_ref': -1.5335168838500977, 'logps/chosen_ref': -1.1292275190353394, 'logits/rejected': -2.7396950721740723, 'logits/chosen': -2.740943431854248, 'epoch': 0.06}
{'loss': 0.6719, 'learning_rate': 1.8115942028985507e-07, 'rewards/chosen': -0.0987342819571495, 'rewards/rejected': -0.14318521320819855, 'rewards/accuracies': 0.675000011920929, 'rewards/margins': 0.04445093870162964, 'logps/rejected': -1.5421949625015259, 'logps/chosen': -1.1360487937927246, 'logps/rejected_ref': -1.5208072662353516, 'logps/chosen_ref': -1.1223704814910889, 'logits/rejected': -2.727785587310791, 'logits/chosen': -2.723764657974243, 'epoch': 0.07}
{'loss': 0.6527, 'learning_rate': 2.1739130434782607e-07, 'rewards/chosen': -0.1672864407300949, 'rewards/rejected': -0.2939082682132721, 'rewards/accuracies': 0.7250000238418579, 'rewards/margins': 0.126621812582016, 'logps/rejected': -1.597663164138794, 'logps/chosen': -1.1704113483428955, 'logps/rejected_ref': -1.555121898651123, 'logps/chosen_ref': -1.1455411911010742, 'logits/rejected': -2.6994194984436035, 'logits/chosen': -2.698334217071533, 'epoch': 0.09}
{'loss': 0.6357, 'learning_rate': 2.536231884057971e-07, 'rewards/chosen': -0.24780841171741486, 'rewards/rejected': -0.41642823815345764, 'rewards/accuracies': 0.6812499761581421, 'rewards/margins': 0.1686197966337204, 'logps/rejected': -1.5889415740966797, 'logps/chosen': -1.1762551069259644, 'logps/rejected_ref': -1.5331004858016968, 'logps/chosen_ref': -1.1385315656661987, 'logits/rejected': -2.608551263809204, 'logits/chosen': -2.595763683319092, 'epoch': 0.1}
{'loss': 0.6215, 'learning_rate': 2.898550724637681e-07, 'rewards/chosen': -0.3489204943180084, 'rewards/rejected': -0.5186128616333008, 'rewards/accuracies': 0.6625000238418579, 'rewards/margins': 0.16969238221645355, 'logps/rejected': -1.4930452108383179, 'logps/chosen': -1.2134768962860107, 'logps/rejected_ref': -1.4123075008392334, 'logps/chosen_ref': -1.1594829559326172, 'logits/rejected': -2.72218656539917, 'logits/chosen': -2.7101638317108154, 'epoch': 0.12}
{'loss': 0.5946, 'learning_rate': 3.260869565217391e-07, 'rewards/chosen': -0.5540724396705627, 'rewards/rejected': -0.7824233174324036, 'rewards/accuracies': 0.6625000238418579, 'rewards/margins': 0.22835083305835724, 'logps/rejected': -1.6317428350448608, 'logps/chosen': -1.18325674533844, 'logps/rejected_ref': -1.5138404369354248, 'logps/chosen_ref': -1.1028623580932617, 'logits/rejected': -2.5669362545013428, 'logits/chosen': -2.5608177185058594, 'epoch': 0.13}
{'loss': 0.5704, 'learning_rate': 3.6231884057971015e-07, 'rewards/chosen': -0.7594733238220215, 'rewards/rejected': -1.1346296072006226, 'rewards/accuracies': 0.6812499761581421, 'rewards/margins': 0.37515631318092346, 'logps/rejected': -1.6324068307876587, 'logps/chosen': -1.2670494318008423, 'logps/rejected_ref': -1.4673458337783813, 'logps/chosen_ref': -1.1539455652236938, 'logits/rejected': -2.371572971343994, 'logits/chosen': -2.3651740550994873, 'epoch': 0.15}
{'loss': 0.5638, 'learning_rate': 3.9855072463768114e-07, 'rewards/chosen': -0.9244664311408997, 'rewards/rejected': -1.3455318212509155, 'rewards/accuracies': 0.6625000238418579, 'rewards/margins': 0.4210655093193054, 'logps/rejected': -1.7487471103668213, 'logps/chosen': -1.3893951177597046, 'logps/rejected_ref': -1.5395921468734741, 'logps/chosen_ref': -1.2470614910125732, 'logits/rejected': -2.358799457550049, 'logits/chosen': -2.351849317550659, 'epoch': 0.16}
{'loss': 0.5619, 'learning_rate': 4.3478260869565214e-07, 'rewards/chosen': -0.9872746467590332, 'rewards/rejected': -1.5160770416259766, 'rewards/accuracies': 0.6312500238418579, 'rewards/margins': 0.5288023352622986, 'logps/rejected': -1.579150676727295, 'logps/chosen': -1.2077702283859253, 'logps/rejected_ref': -1.358649492263794, 'logps/chosen_ref': -1.0587767362594604, 'logits/rejected': -2.2918102741241455, 'logits/chosen': -2.295361042022705, 'epoch': 0.17}

Token indices sequence length is longer than the specified maximum sequence length

Hi, thank you for your excellent work. I meet a problem and really need your help:

When I ran Step 1 (Generation) with the following command:

accelerate launch --main_process_port=2950 spin/generate.py --input_dir /SPIN_iter0 --batch_size 8 --frac_len 800 --data_frac 2 --output_dir /generated/iter1

It returns a warning:

Token indices sequence length is longer than the specified maximum sequence length for this model (2392 > 2048). Running this sequence through the model will result in indexing errors.

I am wondering whether this will affect the following training step? Is there any way to address this issue? Or can I just ignore it?

Thanks in advance!

The data num is wrong

I downloaded [train_prefs-00000-of-00001.parquet](https://huggingface.co/datasets/UCLA-AGI/SPIN_iter0/blob/main/train_prefs-00000-of-00001.parquet) and loaded it with:

import pdb

from datasets import load_dataset

data = load_dataset("/data/oss_bucket_0/yunying/paper/SPIN/data/SPIN_iter0/", split="train")
print(len(data))
all_val = set()
for line in data:
    all_val.add(str(line))
print(len(all_val))
pdb.set_trace()

The len(data) is 99584 and the len(all_val) is 49792. Is that right?

Cannot reproduce the result

Hi authors,

I tried to reproduce the results you report in the paper using zephyr-7b-sft-full.

Following the algorithm you describe in the paper, only the performance of the first iteration increases; it then decreases over the later iterations. I set the number of epochs to 3.

Could you give me some guidance? Thanks!

Difference between training / generation input format.

Hi guys! :)
Thanks for making your code open-source.
One quick question: I've noticed that in run_spin.py you use apply_chat_template, which differs from the format in generate.py and generate_vllm.py, which just tokenize the following formatted string:

"### Instruction: " + data[idx][0]['content'] + "\n\n### Response: " for idx in range(len(data))]

Please correct me if I am wrongπŸ˜…
Thanks in advance!

Question about which datasets are used for each iteration

Hello, thank you for open sourcing the code behind SPIN - it's very clean!

I'm currently working on porting this to trl in huggingface/trl#1344 and am validating everything works on a small Qwen-1.5-0.5b model.

On p.9 of your paper, you state that you combine datasets across each iteration:

In multiple iterations, we leverage the synthetic data from the most recent iteration and add to the newly generated synthetic data, therefore resulting in a synthetic dataset size of 50k at iteration 0 and 100k at iteration 1, 2 and 3. At each iteration, we train our model for 2 epochs.

My question concerns which combination of datasets you used for each SPIN iteration:

  1. Was zephyr-7b-sft-full-SPIN-iter0 trained on UCLA-AGI/SPIN_iter0 (50k samples)?
  2. Was zephyr-7b-sft-full-SPIN-iter1 trained on UCLA-AGI/SPIN_iter0 and UCLA-AGI/SPIN_iter1 (100k samples)?
  3. Was zephyr-7b-sft-full-SPIN-iter2 trained on UCLA-AGI/SPIN_iter1 and UCLA-AGI/SPIN_iter2 (100k samples)?
  4. Etc

In other words, do you combine the generations from the model trained on iteration t with those from t-1?

A related question is whether you always run generation on the same 50k prompts at each iteration or do you generate over 100k prompts for iterations 1-3?

Thanks!

Question about using peft (LoRA)

Hi, thanks for the great project and repo. I'm wondering if you have tried any peft techniques such as LoRA?

I added the following three lines in config.yaml

use_peft: true
lora_r: 16
lora_alpha: 16

and tried to run finetune.sh, but received the following error message:
[screenshot of the error message]

Please let me know if you have any comments or observe the same error (or it's possible that there is something wrong with my device or I modified something...) Thanks!

Confused about iterations

Hi there, great job on the project!

I'm looking to clarify whether the UCLA-AGI/zephyr-7b-sft-full-SPIN-iter1 model was fine-tuned on top of UCLA-AGI/zephyr-7b-sft-full-SPIN-iter0 or alignment-handbook/zephyr-7b-sft-full. The paper suggests that training progresses from $\theta_t$ to $\theta_{t+1}$. However, the description provided at https://huggingface.co/UCLA-AGI/zephyr-7b-sft-full-SPIN-iter1 seems to indicate otherwise.

I would appreciate any clarification on this matter.

Thank you!

GPU Memory question

Hello! Thanks for the open-sourced code release.
I have been trying to run the fine-tuning with a phi-2 3B model on a 40GB A100 GPU, while running
accelerate launch spin/run_spin.py configs/config.yaml
I get GPU out-of-memory errors, which really confuses me, since I have set the batch size to 1 and the number of processes to 1. I cannot imagine what is consuming so much memory:

[INFO|trainer.py:571] 2024-02-29 14:04:29,359 >> Using auto half precision backend
[INFO|trainer.py:1721] 2024-02-29 14:04:32,728 >> ***** Running training *****
[INFO|trainer.py:1722] 2024-02-29 14:04:32,728 >> Num examples = 20
[INFO|trainer.py:1723] 2024-02-29 14:04:32,728 >> Num Epochs = 3
[INFO|trainer.py:1724] 2024-02-29 14:04:32,728 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1727] 2024-02-29 14:04:32,728 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:1728] 2024-02-29 14:04:32,728 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1729] 2024-02-29 14:04:32,728 >> Total optimization steps = 60
[INFO|trainer.py:1730] 2024-02-29 14:04:32,729 >> Number of trainable parameters = 2,779,683,840
0% 0/60 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
[WARNING|modeling_utils.py:1126] 2024-02-29 14:04:34,067 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
Traceback (most recent call last):
File "/content/SPIN/spin/run_spin.py", line 206, in
main()
File "/content/SPIN/spin/run_spin.py", line 169, in main
train_result = spin_trainer.train()
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1917, in _inner_training_loop
self.optimizer.step()
File "/usr/local/lib/python3.10/dist-packages/accelerate/optimizer.py", line 145, in step
self.optimizer.step(closure)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 373, in wrapper
out = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 76, in _use_grad
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/rmsprop.py", line 115, in step
self._init_group(group, params_with_grad, grads, square_avgs, momentum_buffer_list, grad_avgs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/rmsprop.py", line 72, in _init_group
state["square_avg"] = torch.zeros_like(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB. GPU 0 has a total capacty of 39.56 GiB of which 64.81 MiB is free. Process 468145 has 39.49 GiB memory in use. Of the allocated memory 37.54 GiB is allocated by PyTorch, and 1.44 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
0% 0/60 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 986, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 628, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'spin/run_spin.py', 'configs/config.yaml']' returned non-zero exit status 1.

Thesis discussion: Why can the end-to-end algorithm work properly?

In the paper, it seems that you combined the two steps of reinforcement learning into one step, forming an end-to-end training method. The specific algorithm is shown in the figure.

[screenshot of the SPIN algorithm from the paper]

However, I have a problem with this end-to-end method. Let's say we've obtained the opponent player model for round t, and now we want to learn the opponent player model for round t+1 through the end-to-end algorithm. But since $P_{\theta}$ and $P_{\theta_{t}}$ are one model, isn't the resulting loss 0? This would mean we can't obtain a $P_{\theta_{t+1}}$ that makes any progress. Is my understanding wrong?

Hello, I would like to ask, when you are training the model, do you only use the first round of dialogue from the ultrachat_200k?

def load_and_process_data_ultrachat(dataset_name, split):
    try:
        dataset = load_dataset(dataset_name, split=split)
        reformatted_data = [{
            'generated': [message['messages'][0], {"role": "assistant", "content": ""}],
            'real': [message['messages'][0], message['messages'][1]]
        } for message in dataset]
        return reformatted_data
    except Exception as e:
        logging.error(f"Error loading or processing dataset: {e}")
        return []

Unable to reproduce performance

  • base model: alignment-handbook/zephyr-7b-sft-full
  • train data: UCLA-AGI/SPIN_iter0
    I use the default hyper-parameters to train the model, and test the model with HuggingFaceH4/open_llm_leaderboard locally. The results on allenai/ai2_arc are below:
  • base model: 0.5819112627986348
  • epoch1 (step 778): 0.5989761092150171
  • epoch2 (step 1556): 0.5964163822525598
  • epoch3 (step 2334): 0.590443686006826
    which does not match the performance reported in the paper (63.40).

vllm version

Hey @uclaml and @yihedeng9,

I noticed that your setup tools depend on vllm. Given that vllm is still under rapid development, and considering your implementation incorporates the tensor_parallel_size parameter, it might be prudent to specify a particular version of vllm. Additionally, including ray in your setup tools could be beneficial, as vllm's multi-GPU inference relies on it. πŸ€”

Thanks for your attention!

vllm generation issue.

I believe that the argument used in generate_vllm.sh is supposed to be batch_size rather than world_size. When I make that change, the script starts working immediately.

Generate Result

Hi, I reproduced this project. The test results look similar to the paper, but the generation results look bad.

For example:
{'real': [{'role': 'user', 'content': 'Describe the neural pathways that connect the hippocampus to the prefrontal cortex and their role in memory consolidation and retrieval.'}, {'role': 'assistant', 'content': 'The hippocampus is a structure in the brain that plays a crucial role in consolidating and retrieving memories. It sends information to several areas of the brain, but one of the most important connections is with the prefrontal cortex.\n\nThe prefrontal cortex is responsible for a range of functions, including decision-making, planning, and working memory. When we recall a memory, the prefrontal cortex is activated and communicates with the hippocampus to retrieve the information needed.\n\nThe neural pathway connecting the hippocampus and the prefrontal cortex is a complex network of fibers that travel through several brain regions. One of the primary pathways is the hippocampal-prefrontal pathway, which consists of several sub-regions.\n\nThe first sub-region is the entorhinal cortex, which receives input from the hippocampus and is responsible for relaying sensory information to the prefrontal cortex. The second sub-region is the perirhinal cortex, which plays a role in object recognition and is also involved in memory consolidation.\n\nThe third and final sub-region is the medial prefrontal cortex, which is crucial for memory retrieval and decision-making. This region receives input from both the perirhinal cortex and the entorhinal cortex and is responsible for retrieving memories from the hippocampus.\n\nIn summary, the hippocampus and prefrontal cortex are connected by a complex neural pathway, which is involved in memory consolidation and retrieval. This pathway consists of several sub-regions, each of which plays a unique role in the process. Activation of this pathway is essential for remembering past experiences and making decisions based on them.'}], 'generated': [{'role': 'user', 'content': 'Describe the neural pathways that connect the hippocampus to the prefrontal cortex and their role in memory consolidation and retrieval.'}, {'role': 'assistant', 'content': '// // // // // // Content // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // Educ // // // // // // // // // // // // // // // // // // // // // //abel // // // // // // // // // // // // // // // // // // // // //ate // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // //ates // // // // // //ational // // // // //ensus // // // // // // // // // // // // // // // /

Have you tried combinations of SPIN, SFT, and DPO?

This awesome work is quite interesting.

Have you ever tried the following?

  1. train with SPIN only
  2. train with SFT + SPIN + DPO
  3. mix SPIN + DPO, i.e., union the DPO pairs with the SPIN pairs per iteration

I'm also wondering about the difference between SFT and SPIN; here is my high-level intuition from the loss functions:
[screenshots of the SFT and SPIN loss functions]

SPIN not only moves toward the ground truth, but also moves away from the mistakes made by the last checkpoint.
Is this a kind of regularization, or a more aggressive learning strategy?

Very impressive work! If you could share your ideas with us, it would be a huge benefit!
