Giter VIP home page Giter VIP logo

fms-hf-tuning's Introduction

FMS HF Tuning

This repo provides basic tuning scripts with support for specific models. The repo relies on Hugging Face SFTTrainer and PyTorch FSDP. Our approach to tuning is:

  1. Models are loaded from Hugging Face transformers or the foundation-model-stack -- models are either optimized to use Flash Attention v2 directly or through SDPA
  2. Hugging Face SFTTrainer for the training loop
  3. FSDP as the backend for training

Installation

pip install -e .

Note: After installing, if you wish to use FlashAttention, then you need to install these requirements:

pip install -e ".[dev]"
pip install -e ".[flash-attn]"

FlashAttention requires the CUDA Toolit to be pre-installed.

If you wish to use aim, then you need to install it:

pip install -e ".[aim]"

Data format

The data format expectation is a single column text. The trainer is configured to expect a response template as a string. For example, if one wants to prepare the alpaca format data to feed into this trainer, it is quite easy and can be done with the following code.

PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}

def format_alpaca_fn(example):
    prompt_input, prompt_no_input = PROMPT_DICT['prompt_input'], PROMPT_DICT['prompt_no_input']
    output = prompt_input.format_map(example) if example.get("input", "") != "" else prompt_no_input.format_map(example)
    output = f"{output} {example['output']}"
    return {"output": output}

ds = datasets.load_dataset('json', data_files='./stanford_alpaca/alpaca_data.json')

alpaca_ds = ds['train'].map(format_alpaca_fn, remove_columns=['instruction', 'input'])
alpaca_ds.to_json("sft_alpaca_data.json")

The response template corresponding to the above dataset and the Llama tokenizer is: \n### Response:".

The same way can be applied to any dataset, with more info can be found here.

Supported Models

Current supported and tested models are Llama2 (7 and 13B configurations have been tested) and GPTBigCode.

Training

Single GPU

# if you want to use one GPU on multi-gpu machine
export CUDA_VISIBLE_DEVICES=0

# MODEL_PATH=meta-llama/Llama-2-7b-hf # Huggingface model id or path to a checkpoint
# TRAIN_DATA_PATH=twitter_complaints.json # Path to the dataset
# OUTPUT_PATH=out # Path to the output folder where the checkpoints are saved

python tuning/sft_trainer.py  \
--model_name_or_path $MODEL_PATH  \
--training_data_path $TRAIN_DATA_PATH  \
--output_dir $OUTPUT_PATH  \
--num_train_epochs 5  \
--per_device_train_batch_size 4  \
--per_device_eval_batch_size 4  \
--gradient_accumulation_steps 4  \
--evaluation_strategy "no"  \
--save_strategy "epoch"  \
--learning_rate 1e-5  \
--weight_decay 0.  \
--warmup_ratio 0.03  \
--lr_scheduler_type "cosine"  \
--logging_steps 1  \
--include_tokens_per_second  \
--packing False  \
--response_template "\n### Response:"  \
--dataset_text_field "output" 

Multiple GPUs with FSDP

The recommendation is to use huggingface accelerate to launch multi-gpu jobs, in particular when using FSDP:

accelerate launch CLI to be run with specific command line arguments, see example below. Default arguments handled by passing in a --config_file argument; see reference docs and fixtures/accelerate_fsdp_defaults.yaml for sample defaults.

# Please set the environment variables:
# MASTER_PORT=1234 # The port at which the process with rank 0 listens to and should be set to an unused port
# MODEL_PATH=meta-llama/Llama-2-7b-hf # Huggingface model id or path to a checkpoint
# TRAIN_DATA_PATH=twitter_complaints.json # Path to the training dataset
# OUTPUT_PATH=out # Path to the output folder where the checkpoints are saved

accelerate launch \
--main_process_port $MASTER_PORT \
--config_file fixtures/accelerate_fsdp_defaults.yaml \
--num_processes=8 \ 
--main_process_port=$MASTER_PORT \
tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--training_data_path $TRAIN_DATA_PATH \
--torch_dtype bfloat16 \
--output_dir $OUTPUT_PATH \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "epoch" \
--learning_rate 1e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--include_tokens_per_second \
--packing False \
--response_template "\n### Response:" \
--dataset_text_field "output"

To summarize you can pick either python for singleGPU jobs or use accelerate launch for multiGPU jobs. The following tuning techniques can be applied:

Tuning Techniques :

LoRA Tuning Example

Set peft_method = "lora". You can additionally pass any arguments from LoraConfig.

# Args you can pass
r: int =8 
lora_alpha: int = 32
target_modules: List[str] = field(
  default_factory=lambda: ["q_proj", "v_proj"],
      metadata={
            "help": "The names of the modules to apply LORA to. LORA selects modules which either \
            completely match or "
            'end with one of the strings. If the value is ["all-linear"], \
            then LORA selects all linear and Conv1D '
            "modules except for the output layer."
        },
    )
  bias = "none"
  lora_dropout: float = 0.05

Example command to run:

python tuning/sft_trainer.py \
--model_name_or_path $MODEL_PATH \
--training_data_path $TRAIN_DATA_PATH \
--output_dir $OUTPUT_PATH \
--num_train_epochs 40 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--save_strategy "epoch" \
--learning_rate 1e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--include_tokens_per_second \
--packing False \
--response_template "\n### Label:" \
--dataset_text_field "output" \
--use_flash_attn False \
--tokenizer_name_or_path $MODEL_PATH \
--torch_dtype float32 \
--peft_method "lora" \
--logging_strategy "epoch" \
--r 8 \
--lora_dropout 0.05 \
--lora_alpha 16

Notice the target_modules that are set are the default values. target_modules are the names of the modules to apply the adapter to. If this is specified, only the modules with the specified names will be replaced. When passing a list of strings, either an exact match will be performed or it is checked if the name of the module ends with any of the passed strings. If this is specified as all-linear, then all linear/Conv1D modules are chosen, excluding the output layer. If this is not specified, modules will be chosen according to the model architecture. If the architecture is not known, an error will be raised โ€” in this case, you should specify the target modules manually. See HuggingFace docs for more details.

For each model, the target_modules will depend on the type of model architecture. You can specify linear or attention layers to target_modules. To obtain list of target_modules for a model:

from transformers import AutoModelForCausalLM
# load the model
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)
# see the module list
model.modules

# to get just linear layers
import re
model_modules = str(model.modules)
pattern = r'\((\w+)\): Linear'
linear_layer_names = re.findall(pattern, model_modules)

names = []
for name in linear_layer_names:
    names.append(name)
target_modules = list(set(names))

For example for LLaMA model the modules look like:

<bound method Module.modules of LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)>

You can specify attention or linear layers. With the CLI, you can specify layers with --target_modules "q_proj" "v_proj" "k_proj" "o_proj" or --target_modules "all-linear".

Prompt Tuning :

Specify peft_method to 'pt' . You can additionally pass any arguments from PromptTuningConfig.

    # prompt_tuning_init can be either "TEXT" or "RANDOM"
    prompt_tuning_init: str = "TEXT"
    num_virtual_tokens: int = 8
    # prompt_tuning_init_text only applicable if prompt_tuning_init= "TEXT"
    prompt_tuning_init_text: str = "Classify if the tweet is a complaint or not:"
    tokenizer_name_or_path: str = "llama-7b-hf"

Example command you can run:

accelerate launch \
--main_process_port $MASTER_PORT \
--config_file fixtures/accelerate_fsdp_defaults.yaml \
tuning/sft_trainer.py  \
--model_name_or_path $MODEL_PATH  \
--training_data_path $TRAIN_DATA_PATH  \
--output_dir $OUTPUT_PATH  \
--peft_method pt \
--torch_dtype bfloat16 \
--tokenizer_name_or_path $MODEL_PATH  \
--num_train_epochs 5  \
--per_device_train_batch_size 1  \
--per_device_eval_batch_size 1  \
--gradient_accumulation_steps 1  \
--evaluation_strategy "no"  \
--save_strategy "epoch"  \
--learning_rate 1e-5  \
--weight_decay 0.  \
--warmup_ratio 0.03  \
--lr_scheduler_type "cosine"  \
--logging_steps 1  \
--include_tokens_per_second  \
--packing False  \
--response_template "\n### Label:"  \
--dataset_text_field "output" 

Fine Tuning :

Set peft_method = 'None'

Full fine tuning needs more compute resources, so it is advised to use the MultiGPU method

accelerate launch \
--main_process_port $MASTER_PORT \
--config_file fixtures/accelerate_fsdp_defaults.yaml \
tuning/sft_trainer.py  \
--model_name_or_path $MODEL_PATH  \
--training_data_path $TRAIN_DATA_PATH  \
--output_dir $OUTPUT_PATH  \
--peft_method "None" \
--torch_dtype bfloat16 \
--tokenizer_name_or_path $MODEL_PATH  \
--num_train_epochs 5  \
--per_device_train_batch_size 1  \
--per_device_eval_batch_size 1  \
--gradient_accumulation_steps 1  \
--evaluation_strategy "no"  \
--save_strategy "epoch"  \
--learning_rate 1e-5  \
--weight_decay 0.  \
--warmup_ratio 0.03  \
--lr_scheduler_type "cosine"  \
--logging_steps 1  \
--include_tokens_per_second  \
--packing False  \
--response_template "\n### Label:"  \
--dataset_text_field "output" 

Inference

Currently, we do not offer inference support as part of the library, but we provide a standalone script for running inference on tuned models for testing purposes. For a full list of options run python scripts/run_inference.py --help. Note that no data formatting / templating is applied at inference time.

Running a single example

If you want to run a single example through a model, you can pass it with the --text flag.

python scripts/run_inference.py \
--model my_checkpoint \
--text "This is a text the model will run inference on" \
--max_new_tokens 50 \
--out_file result.json

Running multiple examples

To run multiple examples, pass a path to a file containing each source text as its own line. Example:

Contents of source_texts.txt

This is the first text to be processed.
And this is the second text to be processed.
python scripts/run_inference.py \
--model my_checkpoint \
--text_file source_texts.txt \
--max_new_tokens 50 \
--out_file result.json

Inference Results Format

After running the inference script, the specified --out_file will be a JSON file, where each text has the original input string and the predicted output string, as follows. Note that due to the implementation of .generate() in Transformers, in general, the input string will be contained in the output string as well.

[
    {
        "input": "{{Your input string goes here}}",
        "output": "{{Generate result of processing your input string goes here}}"
    },
    ...
]

Changing the Base Model for Inference

If you tuned a model using a local base model, then a machine-specific path will be saved into your checkpoint by Peft, specifically the adapter_config.json. This can be problematic if you are running inference on a different machine than you used for tuning.

As a workaround, the CLI for inference provides an arg for --base_model_name_or_path, where a new base model may be passed to run inference with. This will patch the base_model_name_or_path in your checkpoint's adapter_config.json while loading the model, and restore it to its original value after completion. Alternatively, if you like, you can change the config's value yourself.

NOTE: This can also be an issue for tokenizers (with the tokenizer_name_or_path config entry). We currently do not allow tokenizer patching since the tokenizer can also be explicitly configured within the base model and checkpoint model, but may choose to expose an override for the tokenizer_name_or_path in the future.

Validation

We can use lm-evaluation-harness from EleutherAI for evaluating the generated model. For example, for the Llama-13B model, using the above command and the model at the end of Epoch 5, we evaluated MMLU score to be 53.9 compared to base model to be 52.8.

How to run the validation:

pip install -U transformers
pip install -U datasets
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
python main.py \ 
--model hf-causal \
--model_args pretrained=$MODEL_PATH \ 
--output_path $OUTPUT_PATH/results.json \ 
--tasks boolq,piqa,hellaswag,winogrande,arc_easy,arc_challenge,hendrycksTest-*

The above runs several tasks with hendrycksTest-* being MMLU.

More Examples

Prompt Tuning on Twitter Complaints

fms-hf-tuning's People

Contributors

anhuong avatar dchourasia avatar lchu-ibm avatar jbusche avatar alex-jw-brooks avatar tharapalanivel avatar dushyantbehl avatar tedhtchang avatar ssukriti avatar hickeyma avatar raghukiran1224 avatar kmehant avatar seshapad avatar harikrishnanbalagopal avatar kellyaa avatar dependabot[bot] avatar gkumbhat avatar riprasad avatar fabianlim avatar sutaakar avatar mohammadiiram avatar vassilisvassiliadis avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.