This guide will show you how to fine-tune any LLM quickly using Modal and axolotl.
This repository gives the popular axolotl fine-tuning library a serverless twist. It uses Modal's serverless infrastructure to run your fine-tuning jobs in the cloud, so you can train your models without worrying about building images or idling expensive GPU VMs.
Any application written with Modal can be trivially scaled across many GPUs. This ensures that any fine-tuning job prototyped with this repository is ready for production.
This tutorial uses many of the recommended, state-of-the-art optimizations for efficient training that axolotl supports, including:
- DeepSpeed ZeRO to utilize multiple GPUs during training, according to a strategy you configure.
- Parameter-efficient fine-tuning via LoRA adapters for faster convergence
- Flash attention for fast and memory-efficient attention during training (note: only works with certain hardware, like A100s)
- Gradient checkpointing to reduce VRAM footprint, fit larger batches and get higher training throughput.
This Modal app does not expose all of the CLI arguments that axolotl does, but you can specify any option you need in the config file instead. We find that the interface we provide is sufficient for most users. Any important differences are noted in the documentation below.
Follow the steps to quickly train and test your fine-tuned model:
- Create a Modal account and create a Modal token and HuggingFace secret for your workspace, if not already set up.
Setting up Modal
  - Create a Modal account.
  - Install modal in your current Python virtual environment (pip install modal).
  - Set up a Modal token in your environment (python3 -m modal setup).
  - You need to have a secret named huggingface in your workspace. You can create a new secret with the HuggingFace template in your Modal dashboard, using the same key from HuggingFace (in settings under API tokens) to populate both HUGGING_FACE_HUB_TOKEN and HUGGINGFACE_TOKEN.
  - For some LLaMA models, you need to go to the Hugging Face page and agree to their Terms and Conditions for access (granted instantly).
- Clone this repository and navigate to the finetuning directory:
git clone https://github.com/modal-labs/llm-finetuning.git
cd llm-finetuning
- Launch a training job:
modal run --detach src.train --config=config/mistral.yml --data=data/sqlqa.jsonl
Some notes about the train command:
- The --data flag is used to pass your dataset to axolotl. This dataset is then written to the datasets.path as specified in your config file. If you already have a dataset at datasets.path, you must be careful to also pass the same path to --data to ensure the dataset is correctly loaded.
- Unlike axolotl, you cannot pass additional flags to the train command. However, you can specify all your desired options in the config file instead.
- --no-merge-lora will prevent the LoRA adapter weights from being merged into the base model weights.
- This example training script is opinionated in order to make it easy to get started. For example, a LoRA adapter is used and merged into the base model after training.
- Try the model from a completed training run. You can select a folder via modal volume ls example-runs-vol, and then specify the training folder with the --run-name flag (something like /runs/axo-2023-11-24-17-26-66e8) for inference:
modal run -q src.inference --run-name <run_tag>
Our quickstart example trains a 7B model on a text-to-SQL dataset as a proof of concept (it takes just a few minutes). It uses DeepSpeed ZeRO stage 1 for data parallelism across 2 H100s. Inference on the fine-tuned model shows that it conforms to the output structure ([SQL] ... [/SQL]). To achieve better results, you would need to use more data! Refer to the full development section below.
Tip
You can modify the DeepSpeed stage by changing the configuration path. Modal mounts the deepspeed_configs folder from the axolotl repository. You reference these configurations in your config.yml like so: deepspeed: /root/axolotl/deepspeed_configs/zero3_bf16.json
- Finding your weights
As mentioned earlier, our Modal axolotl trainer automatically merges your LoRA adapter weights into the base model weights. You can browse the artifacts created by your training run with the following command, which is also printed out at the end of your training run in the logs.
modal volume ls example-runs-vol <run id>
# example: modal volume ls example-runs-vol axo-2024-04-13-19-13-05-0fb0
By default, the directory structure will look like this:
$ modal volume ls example-runs-vol axo-2024-04-13-19-13-05-0fb0/
Directory listing of 'axo-2024-04-13-19-13-05-0fb0/' in 'example-runs-vol'
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ filename ┃ type ┃ created/modified ┃ size ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ axo-2024-04-13-19-13-05-0fb0/last_run_prepared │ dir │ 2024-04-13 12:13:39-07:00 │ 32 B │
│ axo-2024-04-13-19-13-05-0fb0/mlruns │ dir │ 2024-04-13 12:14:19-07:00 │ 7 B │
│ axo-2024-04-13-19-13-05-0fb0/lora-out │ dir │ 2024-04-13 12:20:55-07:00 │ 178 B │
│ axo-2024-04-13-19-13-05-0fb0/logs.txt │ file │ 2024-04-13 12:19:52-07:00 │ 133 B │
│ axo-2024-04-13-19-13-05-0fb0/data.jsonl │ file │ 2024-04-13 12:13:05-07:00 │ 1.3 MiB │
│ axo-2024-04-13-19-13-05-0fb0/config.yml │ file │ 2024-04-13 12:13:05-07:00 │ 1.7 KiB │
└────────────────────────────────────────────────┴──────┴───────────────────────────┴─────────┘
The LoRA adapters are stored in lora-out. The merged weights are stored in lora-out/merged. Many inference frameworks can only load the merged weights, so it is handy to know where they are stored.
All the logic lies in train.py. These are the important functions:
- launch prepares a new folder in the /runs volume with the training config and data for a new training job. It also ensures the base model is downloaded from HuggingFace.
- train takes a prepared folder and performs the training job using the config and data.
- Inference.completion can spawn a vLLM inference container for any pre-trained or fine-tuned model from a previous training job.
You can view some example configurations in config for a quick start with different models. See Axolotl's documentation for an overview of its config options. The most important options to consider are:
Model
base_model: mistralai/Mistral-7B-v0.1
Dataset (see Axolotl's documentation for all dataset options)
datasets:
  # This will be the path used for the data when it is saved to the Volume in the cloud.
  - path: data.jsonl
    ds_type: json
    type:
      # JSONL file contains question, context, answer fields per line.
      # This gets mapped to instruction, input, output axolotl tags.
      field_instruction: question
      field_input: context
      field_output: answer
      # Format is used by axolotl to generate the prompt.
      format: |-
        [INST] Using the schema context below, generate a SQL query that answers the question.
        {input}
        {instruction} [/INST]
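To make the field mapping concrete, here is a minimal Python sketch of how one JSONL record could be rendered with the format template above. It only illustrates the mapping with plain string formatting and is not axolotl's implementation; the `render_prompt` helper and the sample record are hypothetical.

```python
import json

# Template copied from the dataset config above.
PROMPT_FORMAT = (
    "[INST] Using the schema context below, generate a SQL query that answers the question.\n"
    "{input}\n"
    "{instruction} [/INST]"
)

# Mirrors field_instruction / field_input / field_output in the config.
FIELD_MAP = {"instruction": "question", "input": "context", "output": "answer"}


def render_prompt(record: dict) -> tuple[str, str]:
    """Return (prompt, expected_completion) for one dataset record (hypothetical helper)."""
    prompt = PROMPT_FORMAT.format(
        instruction=record[FIELD_MAP["instruction"]],
        input=record[FIELD_MAP["input"]],
    )
    return prompt, record[FIELD_MAP["output"]]


# Hypothetical sample line from a JSONL dataset.
line = (
    '{"question": "How many users are there?", '
    '"context": "CREATE TABLE users (id INT)", '
    '"answer": "SELECT COUNT(*) FROM users"}'
)
prompt, completion = render_prompt(json.loads(line))
print(prompt)
print(completion)
```

Printing a few rendered prompts this way is a cheap check that the template reads naturally before you spend GPU time on a run.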
LoRA
adapter: lora # or qlora; leave blank for a full finetune (requires much more GPU memory!)
lora_r: 16
lora_alpha: 32 # alpha = 2 x rank is a good rule of thumb.
lora_dropout: 0.05
lora_target_linear: true # target all linear layers
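As a rough illustration of what these settings imply, the sketch below works through standard LoRA arithmetic: each targeted linear layer gains two low-rank matrices, A with shape (r, d_in) and B with shape (d_out, r), and the update B @ A is scaled by alpha / r. The 4096 x 4096 layer shape is an assumption chosen for illustration, not a value read from any model config, and the helper names are hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * (d_in + d_out)


def lora_scaling(alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


lora_r, lora_alpha = 16, 32  # values from the config above
d_in = d_out = 4096          # illustrative layer shape (assumption)

added = lora_param_count(d_in, d_out, lora_r)
full = d_in * d_out
print(f"LoRA params per layer: {added:,} ({added / full:.2%} of {full:,})")
print(f"scaling factor alpha / r = {lora_scaling(lora_alpha, lora_r)}")
```

Doubling lora_r doubles the adapter size, so with the alpha = 2 x rank rule of thumb the scaling factor stays the same as you change the rank.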
Axolotl supports many dataset formats. We recommend adding your custom dataset as a .jsonl file in the data folder and making the appropriate modifications to your config.
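Before launching a run on a custom dataset, it can help to check locally that every line parses and carries the fields your config maps. The following is a minimal sketch under the assumption that your config uses the question, context, and answer fields from the example config; `find_bad_lines` is a hypothetical helper, not part of this repository or axolotl.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FIELDS = ("question", "context", "answer")  # fields named in the example config


def find_bad_lines(path: Path) -> list[tuple[int, str]]:
    """Return (line_number, problem) pairs for records that would not map cleanly."""
    problems = []
    with path.open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((lineno, "not a JSON object"))
                continue
            missing = [k for k in REQUIRED_FIELDS if k not in record]
            if missing:
                problems.append((lineno, f"missing fields: {', '.join(missing)}"))
    return problems


# Demo on a throwaway file; point this at your own file in the data folder instead.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text(
        '{"question": "q", "context": "c", "answer": "a"}\n'
        '{"question": "q", "context": "c"}\n'
        "not json\n",
        encoding="utf-8",
    )
    report = find_bad_lines(sample)

for lineno, problem in report:
    print(f"line {lineno}: {problem}")
```

If you rename the fields in your config, update REQUIRED_FIELDS to match.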
Multi-GPU training
We recommend DeepSpeed for multi-GPU training, which is easy to set up. Axolotl provides several default DeepSpeed JSON configurations and Modal makes it easy to attach multiple GPUs of any type in code, so all you need to do is specify which of these configs you'd like to use.
In your config.yml:
deepspeed: /root/axolotl/deepspeed_configs/zero3_bf16.json
In train.py:
N_GPUS = 2
GPU_MEM = 40
GPU_CONFIG = modal.gpu.A100(count=N_GPUS, memory=GPU_MEM) # you can also change this to use A10Gs or T4s
Logging with Weights and Biases
To track your training runs with Weights and Biases:
- Create a Weights and Biases secret in your Modal dashboard, if not set up already (only the WANDB_API_KEY is needed, which you can get if you log into your Weights and Biases account and go to the Authorize page).
- Add the Weights and Biases secret to your app, so initializing your stub in common.py should look like:
stub = Stub(APP_NAME, secrets=[Secret.from_name("huggingface"), Secret.from_name("my-wandb-secret")])
- Add your wandb config to your config.yml:
wandb_project: code-7b-sql-output
wandb_watch: gradients
Training
A simple training job can be started with
modal run --detach src.train --config=... --data=...
--detach lets the app continue running even if your client disconnects.
The script reads two local files containing the config information and the dataset. The contents are passed as arguments to the remote launch function, which writes them to the /runs volume. Finally, train reads the config and data from the volume for reproducible training runs.
When you make local changes to either your config or data, they will be used for your next training run.
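The snapshot step described above can be pictured with a small local sketch. This is not the repository's launch function: it uses a temporary directory in place of the /runs volume, and the run-name scheme is only inferred from folder names such as axo-2024-04-13-19-13-05-0fb0, so treat `new_run_name` and `prepare_run` as hypothetical.

```python
import secrets
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def new_run_name(prefix: str = "axo") -> str:
    """Build a name shaped like axo-<UTC timestamp>-<4 hex chars> (assumed scheme)."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H-%M-%S")
    return f"{prefix}-{stamp}-{secrets.token_hex(2)}"


def prepare_run(runs_root: Path, config_text: str, data_text: str) -> Path:
    """Snapshot config and data into a fresh run folder and return its path."""
    run_dir = runs_root / new_run_name()
    run_dir.mkdir(parents=True)
    (run_dir / "config.yml").write_text(config_text, encoding="utf-8")
    (run_dir / "data.jsonl").write_text(data_text, encoding="utf-8")
    return run_dir


# A temporary directory stands in for the /runs volume.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = prepare_run(
        Path(tmp),
        config_text="base_model: mistralai/Mistral-7B-v0.1\n",
        data_text='{"question": "q", "context": "c", "answer": "a"}\n',
    )
    files = sorted(p.name for p in run_dir.iterdir())
    print(run_dir.name, files)
```

Because each run gets its own copy of the config and data, later local edits do not alter what an earlier run was trained on.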
Inference
To try a model from a completed run, you can select a folder via modal volume ls example-runs-vol, and then specify the training folder for inference:
modal run -q src.inference::inference_main --run-folder=...
The training script writes the most recent run name to a local file, .last_run_name, for easy access.
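As a convenience, the saved run name can be turned into a ready-to-run command. The sketch below assumes .last_run_name holds only the run name as plain text, which this guide does not spell out; `volume_ls_command` is a hypothetical helper that builds the modal volume ls invocation shown earlier.

```python
import shlex
import tempfile
from pathlib import Path


def volume_ls_command(last_run_file: Path, volume: str = "example-runs-vol") -> list[str]:
    """Build the `modal volume ls` argv for the most recent run."""
    run_name = last_run_file.read_text(encoding="utf-8").strip()
    if not run_name:
        raise ValueError(f"{last_run_file} is empty; has a training run been launched?")
    return ["modal", "volume", "ls", volume, run_name]


# Demo with a throwaway file; in the repository you would pass Path(".last_run_name").
with tempfile.TemporaryDirectory() as tmp:
    marker = Path(tmp) / ".last_run_name"
    marker.write_text("axo-2024-04-13-19-13-05-0fb0\n", encoding="utf-8")
    cmd = volume_ls_command(marker)

print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run directly if you prefer to execute the command rather than print it.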
CUDA Out of Memory (OOM)
This means your GPU(s) ran out of memory during training. To resolve this, either increase your GPU count or memory capacity with multi-GPU training, or try reducing any of the following in your config.yml: micro_batch_size, eval_batch_size, gradient_accumulation_steps, sequence_len
self.state.epoch = epoch + (step + 1 + steps_skipped) / steps_in_epoch ZeroDivisionError: division by zero
This means your training dataset might be too small.
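One plausible way to see how a small dataset leads to this error: if the number of samples is smaller than one effective batch, the number of optimizer steps per epoch rounds down to zero. The sketch below is a simplified estimate, not the trainer's exact bookkeeping, and the sample counts and settings are hypothetical.

```python
def optimizer_steps_per_epoch(
    num_samples: int,
    micro_batch_size: int,
    gradient_accumulation_steps: int,
    n_gpus: int,
) -> int:
    """Simplified estimate: full effective batches that fit in one pass over the data."""
    effective_batch = micro_batch_size * gradient_accumulation_steps * n_gpus
    return num_samples // effective_batch


# Hypothetical settings: 2 GPUs, micro-batches of 8, 4 accumulation steps,
# so one optimizer step consumes 64 samples.
tiny = optimizer_steps_per_epoch(40, micro_batch_size=8, gradient_accumulation_steps=4, n_gpus=2)
ok = optimizer_steps_per_epoch(4000, micro_batch_size=8, gradient_accumulation_steps=4, n_gpus=2)
print(tiny, ok)
```

If the estimate comes out as zero for your dataset, adding data or lowering micro_batch_size or gradient_accumulation_steps are the usual things to try.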
Missing config option when using modal run in the CLI
Make sure your modal client version is >= 0.55.4164 (upgrade to the latest version using pip install --upgrade modal).
AttributeError: 'Accelerator' object has no attribute 'deepspeed_config'
Try removing the wandb_log_model option from your config. See #4143.