modal-labs / llm-finetuning

Guide for fine-tuning Llama/Mistral/CodeLlama models and more

License: MIT License

Python 100.00%

llm-finetuning's Introduction

Fine-tune any LLM in minutes (featuring Mixtral, LLaMA, Mistral)

This guide will show you how to fine-tune any LLM quickly using Modal and axolotl.

Serverless Axolotl

This repository gives the popular axolotl fine-tuning library a serverless twist. It uses Modal's serverless infrastructure to run your fine-tuning jobs in the cloud, so you can train your models without worrying about building images or idling expensive GPU VMs.

Any application written with Modal can be trivially scaled across many GPUs. This ensures that any fine-tuning job prototyped with this repository is ready for production.

Designed For Efficiency

This tutorial uses many of the recommended, state-of-the-art optimizations for efficient training that axolotl supports, including:

DeepSpeed ZeRO to utilize multiple GPUs during training, according to a strategy you configure.
Parameter-efficient fine-tuning via LoRA adapters for faster convergence
Flash attention for fast and memory-efficient attention during training (note: only works with certain hardware, like A100s)
Gradient checkpointing to reduce VRAM footprint, fit larger batches and get higher training throughput.

Differences From Axolotl

This Modal app does not expose all of the CLI arguments that axolotl does; you can specify any other options in the config file instead. We find that the interface we provide is sufficient for most users. Any important differences are noted in the documentation below.

Quickstart

Follow these steps to quickly train and test your fine-tuned model:

Create a Modal account and create a Modal token and HuggingFace secret for your workspace, if not already set up.
Setting up Modal
1. Create a Modal account.
2. Install modal in your current Python virtual environment (pip install modal)
3. Set up a Modal token in your environment (python3 -m modal setup)
4. You need to have a secret named huggingface in your workspace. You can create a new secret with the HuggingFace template in your Modal dashboard, using the same key from HuggingFace (in settings under API tokens) to populate both HUGGING_FACE_HUB_TOKEN and HUGGINGFACE_TOKEN.
5. For some LLaMA models, you need to go to the Hugging Face page and agree to their Terms and Conditions for access (granted instantly).

Clone this repository and navigate to the finetuning directory:

git clone https://github.com/modal-labs/llm-finetuning.git
cd llm-finetuning

Launch a training job:

modal run --detach src.train --config=config/mistral.yml --data=data/sqlqa.jsonl

Some notes about the train command:

The --data flag is used to pass your dataset to axolotl. This dataset is then written to the datasets.path as specified in your config file. If you already have a dataset at datasets.path, you must be careful to also pass the same path to --data to ensure the dataset is correctly loaded.
Unlike axolotl, you cannot pass additional flags to the train command. However, you can specify all your desired options in the config file instead.
--no-merge-lora will prevent the LoRA adapter weights from being merged into the base model weights.
This example training script is opinionated in order to make it easy to get started. For example, a LoRA adapter is used and merged into the base model after training.

Try the model from a completed training run. You can select a folder via modal volume ls example-runs-vol, and then specify the training folder with the --run-name flag (something like /runs/axo-2023-11-24-17-26-66e8) for inference:

modal run -q src.inference --run-name <run_tag>

Our quickstart example trains a 7B model on a text-to-SQL dataset as a proof of concept (it takes just a few minutes). It uses DeepSpeed ZeRO stage 1 to use data parallelism across 2 H100s. Inference on the fine-tuned model displays conformity to the output structure ([SQL] ... [/SQL]). To achieve better results, you would need to use more data! Refer to the full development section below.
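
A minimal sketch of checking that output structure, assuming completions wrap the query in [SQL] ... [/SQL] tags as described above. The helper name extract_sql is hypothetical and is not part of this repository.

```python
import re
from typing import Optional

# Hypothetical helper (not part of this repo): pull the query out of a
# completion that follows the [SQL] ... [/SQL] output structure.
SQL_BLOCK = re.compile(r"\[SQL\](.*?)\[/SQL\]", re.DOTALL)


def extract_sql(completion: str) -> Optional[str]:
    """Return the text between the tags, or None if the tags are missing."""
    match = SQL_BLOCK.search(completion)
    return match.group(1).strip() if match else None


# Made-up completion, for illustration only.
sample = "[SQL] SELECT name FROM users WHERE age > 30 [/SQL]"
print(extract_sql(sample))
print(extract_sql("no tags here"))
```

A check like this only confirms that the model conforms to the format; it says nothing about whether the query itself is correct.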

Tip

You can modify the DeepSpeed stage by changing the configuration path. Modal mounts the deepspeed_configs folder from the axolotl repository. You reference these configurations in your config.yml like so: deepspeed: /root/axolotl/deepspeed_configs/zero3_bf16.json.

Finding your weights

As mentioned earlier, our Modal axolotl trainer automatically merges your LoRA adapter weights into the base model weights. You can browse the artifacts created by your training run with the following command, which is also printed out at the end of your training run in the logs.

modal volume ls example-runs-vol <run id>
# example: modal volume ls example-runs-vol axo-2024-04-13-19-13-05-0fb0

By default, the directory structure will look like this:

$ modal volume ls example-runs-vol axo-2024-04-13-19-13-05-0fb0/

Directory listing of 'axo-2024-04-13-19-13-05-0fb0/' in 'example-runs-vol'
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ filename                                       ┃ type ┃ created/modified          ┃ size    ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ axo-2024-04-13-19-13-05-0fb0/last_run_prepared │ dir  │ 2024-04-13 12:13:39-07:00 │ 32 B    │
│ axo-2024-04-13-19-13-05-0fb0/mlruns            │ dir  │ 2024-04-13 12:14:19-07:00 │ 7 B     │
│ axo-2024-04-13-19-13-05-0fb0/lora-out          │ dir  │ 2024-04-13 12:20:55-07:00 │ 178 B   │
│ axo-2024-04-13-19-13-05-0fb0/logs.txt          │ file │ 2024-04-13 12:19:52-07:00 │ 133 B   │
│ axo-2024-04-13-19-13-05-0fb0/data.jsonl        │ file │ 2024-04-13 12:13:05-07:00 │ 1.3 MiB │
│ axo-2024-04-13-19-13-05-0fb0/config.yml        │ file │ 2024-04-13 12:13:05-07:00 │ 1.7 KiB │
└────────────────────────────────────────────────┴──────┴───────────────────────────┴─────────┘

The LoRA adapters are stored in lora-out. The merged weights are stored in lora-out/merged. Many inference frameworks can only load the merged weights, so it is handy to know where they are stored.

Development

Code overview

All the logic lies in train.py. These are the important functions:

launch prepares a new folder in the /runs volume with the training config and data for a new training job. It also ensures the base model is downloaded from HuggingFace.
train takes a prepared folder and performs the training job using the config and data.
Inference.completion can spawn a vLLM inference container for any pre-trained or fine-tuned model from a previous training job.

Config

You can view some example configurations in config for a quick start with different models. See an overview of Axolotl's config options here. The most important options to consider are:

Model

base_model: mistralai/Mistral-7B-v0.1

Dataset (You can see all dataset options here)

datasets:
  # This will be the path used for the data when it is saved to the Volume in the cloud.
  - path: data.jsonl
    ds_type: json
    type:
      # JSONL file contains question, context, answer fields per line.
      # This gets mapped to instruction, input, output axolotl tags.
      field_instruction: question
      field_input: context
      field_output: answer
      # Format is used by axolotl to generate the prompt.
      format: |-
        [INST] Using the schema context below, generate a SQL query that answers the question.
        {input}
        {instruction} [/INST]
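
To make the field mapping concrete, here is a rough sketch of how one JSONL record could be turned into a prompt and completion using the format above. This is an illustration only: build_example and FIELD_MAP are hypothetical names, the sample record is made up, and axolotl's actual prompt construction may differ.

```python
import json

# Mirrors the `format` template from the config above.
PROMPT_FORMAT = (
    "[INST] Using the schema context below, generate a SQL query that answers the question.\n"
    "{input}\n"
    "{instruction} [/INST]"
)

# Mirrors field_instruction / field_input / field_output from the config.
FIELD_MAP = {"instruction": "question", "input": "context", "output": "answer"}


def build_example(line: str) -> dict:
    """Hypothetical illustration of the mapping; not axolotl's own code."""
    record = json.loads(line)
    fields = {tag: record[key] for tag, key in FIELD_MAP.items()}
    prompt = PROMPT_FORMAT.format(
        input=fields["input"], instruction=fields["instruction"]
    )
    return {"prompt": prompt, "completion": fields["output"]}


# Made-up record, for illustration only.
line = json.dumps({
    "question": "How many users are there?",
    "context": "CREATE TABLE users (id INT, name TEXT)",
    "answer": "[SQL] SELECT COUNT(*) FROM users [/SQL]",
})
example = build_example(line)
print(example["prompt"])
print(example["completion"])
```

Printing a few assembled prompts like this before launching a run is a cheap way to catch a mismatched field name.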

LoRA

adapter: lora  # or qlora; leave blank for a full fine-tune (requires much more GPU memory!)
lora_r: 16
lora_alpha: 32  # alpha = 2 x rank is a good rule of thumb.
lora_dropout: 0.05
lora_target_linear: true  # target all linear layers
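
As a back-of-the-envelope illustration of what these values mean, the sketch below computes the adapter size and update scaling for a single linear layer. The 4096 x 4096 layer shape is an assumption chosen for illustration, and the alpha / r scaling is the standard LoRA formulation rather than anything specific to this repository.

```python
# LoRA arithmetic for one linear layer.
# The 4096 x 4096 layer shape is an assumption chosen for illustration.
lora_r = 16
lora_alpha = 32
d_in, d_out = 4096, 4096

full_params = d_in * d_out             # weights in the frozen base layer
lora_params = lora_r * (d_in + d_out)  # A is (r x d_in), B is (d_out x r)
scaling = lora_alpha / lora_r          # factor applied to the low-rank update

print(f"full layer params:   {full_params:,}")
print(f"LoRA adapter params: {lora_params:,}")
print(f"trainable fraction:  {lora_params / full_params:.4%}")
print(f"update scaling:      {scaling}")
```

With alpha set to twice the rank, the scaling factor stays at 2 when you change lora_r, which is one reason the rule of thumb is convenient.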

Custom Dataset

Axolotl supports many dataset formats (see more). We recommend adding your custom dataset as a .jsonl file in the data folder and making the appropriate modifications to your config.

Multi-GPU training

We recommend DeepSpeed for multi-GPU training, which is easy to set up. Axolotl provides several default DeepSpeed JSON configurations, and Modal makes it easy to attach multiple GPUs of any type in code, so all you need to do is specify which of these configs you'd like to use.

In your config.yml:

deepspeed: /root/axolotl/deepspeed_configs/zero3_bf16.json

In train.py:

N_GPUS = 2
GPU_MEM = 40
GPU_CONFIG = modal.gpu.A100(count=N_GPUS, memory=GPU_MEM)  # you can also change this to use A10Gs or T4s

Logging with Weights and Biases

To track your training runs with Weights and Biases:

Create a Weights and Biases secret in your Modal dashboard, if not set up already (only the WANDB_API_KEY is needed, which you can get if you log into your Weights and Biases account and go to the Authorize page)
Add the Weights and Biases secret to your app, so initializing your stub in common.py should look like:

stub = Stub(APP_NAME, secrets=[Secret.from_name("huggingface"), Secret.from_name("my-wandb-secret")])

Add your wandb config to your config.yml:

wandb_project: code-7b-sql-output
wandb_watch: gradients

Using the CLI

Training

A simple training job can be started with

modal run --detach src.train --config=... --data=...

--detach lets the app continue running even if your client disconnects.

The script reads two local files containing the config information and the dataset. The contents are passed as arguments to the remote launch function, which writes them to the /runs volume. Finally, train reads the config and data from the volume for reproducible training runs.
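
The flow described above can be sketched with plain local files. This is a simplified stand-in, not the repository's code: the launch and train functions below are hypothetical, a temporary directory plays the role of the /runs volume, and the run folder naming is only modeled on the example listing shown earlier.

```python
import secrets
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def launch(config_text: str, data_text: str, runs_root: Path) -> Path:
    """Hypothetical stand-in for the remote launch step: write the config and
    data into a fresh run folder so training can read them back later."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H-%M-%S")
    run_dir = runs_root / f"axo-{stamp}-{secrets.token_hex(2)}"
    run_dir.mkdir(parents=True)
    (run_dir / "config.yml").write_text(config_text)
    (run_dir / "data.jsonl").write_text(data_text)
    return run_dir


def train(run_dir: Path) -> tuple[str, str]:
    """Read back exactly what launch stored, not the local files."""
    return (run_dir / "config.yml").read_text(), (run_dir / "data.jsonl").read_text()


with tempfile.TemporaryDirectory() as tmp:  # stands in for the /runs volume
    run_dir = launch(
        "base_model: mistralai/Mistral-7B-v0.1\n", '{"question": "q"}\n', Path(tmp)
    )
    config_text, data_text = train(run_dir)
    print(run_dir.name)
    print(config_text, end="")
```

Because training reads the stored copies, a finished run stays reproducible even after the local config or data files change.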

When you make local changes to either your config or data, they will be used for your next training run.

Inference

To try a model from a completed run, you can select a folder via modal volume ls example-runs-vol, and then specify the training folder for inference:

modal run -q src.inference::inference_main --run-folder=...

The training script writes the most recent run name to a local file, .last_run_name, for easy access.

Common Errors

CUDA Out of Memory (OOM)

This means your GPU(s) ran out of memory during training. To resolve this, either increase your GPU count/memory capacity with multi-GPU training, or try reducing any of the following in your config.yml: micro_batch_size, eval_batch_size, gradient_accumulation_steps, sequence_len.

self.state.epoch = epoch + (step + 1 + steps_skipped) / steps_in_epoch ZeroDivisionError: division by zero

This means your training dataset might be too small.
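
One plausible way to end up with zero steps in an epoch is a dataset that yields fewer samples than a single batch across all GPUs. The sketch below illustrates that arithmetic; it is an assumption about the cause rather than something the traceback confirms, and steps_per_epoch is a hypothetical helper that assumes incomplete batches are dropped.

```python
def steps_per_epoch(num_samples: int, micro_batch_size: int, num_gpus: int) -> int:
    """Batches each GPU sees per epoch when incomplete batches are dropped
    (a simplifying assumption for this sketch)."""
    return num_samples // (micro_batch_size * num_gpus)


for num_samples in (3, 64, 1000):
    steps = steps_per_epoch(num_samples, micro_batch_size=2, num_gpus=2)
    status = "would divide by zero" if steps == 0 else "ok"
    print(f"{num_samples:>5} samples -> {steps:>3} steps per epoch ({status})")
```

If your numbers land in the zero case, adding more data or lowering micro_batch_size are the obvious things to try.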

Missing config option when using modal run in the CLI

Make sure your modal client version is >= 0.55.4164 (upgrade to the latest version using pip install --upgrade modal).

AttributeError: 'Accelerator' object has no attribute 'deepspeed_config'

Try removing the wandb_log_model option from your config. See #4143.

llm-finetuning's Issues

TODO: hamel to explore GUIs

I need to dig more into these from the README. I couldn't get it to work, likely am doing something wrong.

Using the GUI

Deploy the training backend with three business functions (launch, train, completion in __init__.py). Then run the Gradio GUI.

modal deploy src
modal run src.gui --config=... --data=...

The *.modal.host link from the latter will take you to the Gradio GUI. There will be three tabs: launch training runs, test out trained models and explore the files on the volume.

Weights & Biases integration

Hi, I'd like to integrate weights & biases into my training code. I'm a bit stuck on how to do that.

I've started by trying to set up wandb and log some default values, but after trying that it's producing an issue.

Here's the start I have so far:
5d1b6e2

Currently that is giving me:

│ /root/train.py:45 in train │
│ │
│ ❱ 45 from torch.distributed.run import elastic_launch, parse_args, config_from_args │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ModuleNotFoundError: No module named 'torch'

Any pointers would be super helpful!

How to download the fine-tuned model from the Modal network file system to local?

Quickstart errors following README with sample data

git clone https://github.com/modal-labs/llm-finetuning.git
cd llm-finetuning

modal run --detach src.train

Note that running a local entrypoint in detached mode only keeps the last triggered Modal function alive after the parent process has been killed or disconnected.
✓ Initialized. View run at https://modal.com/...../MAOLn2nXinIjRIL64vxP
⠏ Creating objects...2023-12-22T15:40:46+0200 Mount of '/home/WorkSpace/modal/llm-finetuning/src' is empty.
✓ Created objects.
├── 🔨 Created mount /home/WorkSpace/modal/llm-finetuning/src
├── 🔨 Created train.
├── 🔨 Created merge.
├── 🔨 Created launch.
└── 🔨 Created Inference.completion.

==========
== CUDA ==

CUDA Version 11.8.0

.....
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .

Traceback (most recent call last):
Runner failed with exception: ModuleNotFoundError("No module named 'src'")
File "/pkg/modal/_container_entrypoint.py", line 336, in handle_user_exception
yield
File "/pkg/modal/_container_entrypoint.py", line 792, in main
imp_fun = import_function(
File "/pkg/modal/_container_entrypoint.py", line 659, in import_function
module = importlib.import_module(function_def.module_name)
File "/root/miniconda3/envs/py3.10/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'src'

[bug] inference.py mkdir

In inference.py, merge() uses os.mkdir() to create intermediate folders, which fails because os.mkdir() does not create missing parent directories.

Use os.makedirs() instead.

https://github.com/modal-labs/llama-finetuning/blob/e675b16b902cef931bbf6490f960b228357d1495/inference.py#L20C1-L21C1
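
A small self-contained sketch of the difference, using a temporary directory and a made-up nested path (the folder names are hypothetical and are not taken from inference.py):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical nested path; none of the parent folders exist yet.
    nested = os.path.join(root, "runs", "axo-example", "merged")

    try:
        os.mkdir(nested)  # only creates the last component
        mkdir_failed = False
    except FileNotFoundError:
        mkdir_failed = True

    os.makedirs(nested, exist_ok=True)  # creates every missing parent
    os.makedirs(nested, exist_ok=True)  # safe to call again
    created = os.path.isdir(nested)

print(mkdir_failed, created)
```

Passing exist_ok=True also avoids an error when the folder is already there, for example when a merge is re-run.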

error downloading Mistral 7B Instruct when running in inference

The training code worked out of the box for Mistral 7B, but running inference raises an error:

Loading mistralai/Mistral-7B-Instruct-v0.1 into GPU ... 
Error: DownloadError
Runner failed with exception: Runner has been initializing for too long: 300 seconds

Proposal: Get Rid of Gradio Interface

I spent some time playing with the Gradio interface for the axolotl trainer. Here are my impressions:

It doesn't add much in the way of ease of use. The dataset is pre-baked, and you still need to carefully fill out a config.yaml file.
We recommend the GUI in the README as the beginner option and the CLI as the advanced option. I would argue that the GUI is more confusing as you need to create a deployment backend as well as the Gradio front end. The only thing the GUI offers you is to press a button to train, but this is not good enough.
The data viewers add no value IMO as they are not formatting the data (and the form factor isn't going to work for many real-world datasets).
The gradio app has bugs in the inference tab, the model fails to load and config/data are not shown.

My suggestion is to get rid of the gradio app as it expands the surface area of this tutorial and application in ways that don't seem worth it. That way, we can have a more focused experience but also make this code more maintainable. I think deleting that part of the code will let modal really shine.

cc: @charlesfrye

Can I use train.py on CodeLlama-13b or CodeLlama-34b

Why add special tokens manually in sql_dataset.py

In datasets/sql_dataset.py you manually add a bunch of special tokens in the prompt instead of relying on the tokenizer to handle this. Is there a good reason for that?

Loop attempt failed and parallelism errors

Hi there, I just ran:
modal run train.py --dataset sql_dataset.py --base chat7 --run-id chat7-sql

I have about 15 USD of credits left.

And I receive a plethora of red errors on the console:

To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

And training never finishes

Training Epoch: 10/10, step 0/9 completed (loss: 0.6004008054733276):  11%|█         | 1/9 [00:47<06:16, 47.11s/it]
Training Epoch: 10/10, step 0/9 completed (loss: 0.6142641305923462):  11%|█         | 1/9 [00:47<06:17, 47.15s/it]
Training Epoch: 10/10, step 0/9 completed (loss: 0.5926907062530518):  11%|█         | 1/9 [00:47<06:17, 47.15s/it]
Training Epoch: 10/10, step 0/9 completed (loss: 0.5915755033493042):  11%|█         | 1/9 [00:47<06:17, 47.13s/it]
Traceback (most recent call last):
  File "/pkg/modal/_container_entrypoint.py", line 366, in handle_input_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 484, in run_inputs
    res = imp_fun.fun(*args, **kwargs)
  File "/root/train.py", line 47, in train
    elastic_launch(
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 736, in run
    result = self._invoke_run(role)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 877, in _invoke_run
    time.sleep(monitor_interval)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 2 got signal: 2
Runner terminated, in-progress inputs will be re-scheduled

And then this is the last message:

grpclib.exceptions.GRPCError: (<Status.FAILED_PRECONDITION: 9>, 'App state is APP_STATE_STOPPED', None)

Any idea what that means?

Does this guide/repo support phi-2 finetuning?

Hi there,

Just thought I'd ask, do you know if this guide supports phi-2 finetuning, https://huggingface.co/microsoft/phi-2?

Many thanks for any help!

Red lines for `download_and_unload_peft`

Hello, I'm wondering if I missed a dependency here. I cloned the repo and imported it into pycharm and got the following red lines.

volume.commit() doesn't work

I tried the simple example below. When I execute

modal run <fname>.py

I get the following error

ExecutionError: Object has not been hydrated and doesn't support lazy hydration. This might happen if
an object is defined on a different stub, or if it's on the same stub but it didn't get created
because it wasn't defined in global scope.

Following is the code

import pathlib
import modal

stub = modal.Stub()

volume = modal.Volume.persisted("my-volume")

p = pathlib.Path("/root/foo/bar.txt")


@stub.function(volumes={"/root/foo": volume})
def f():
    p.write_text("hello")
    print(f"Created {p=}")
    volume.commit()  # Persist changes
    print(f"Committed {p=}")


@stub.function(volumes={"/root/foo": volume})
def g(reload: bool = False):
    if reload:
        volume.reload()  # Fetch latest changes
    if p.exists():
        print(f"{p=} contains '{p.read_text()}'")
    else:
        print(f"{p=} does not exist!")


@stub.local_entrypoint()
def main():
    g.remote()  # 1. container for `g` starts
    f.remote()  # 2. container for `f` starts, commits file
    g.remote(reload=False)  # 3. reuses container for `g`, no reload
    g.remote(reload=True)

Audience For This Repo

Carrying over discussion with @mwaskom from this thread

I think this repo is pretty difficult to reason about if you aren't familiar with axolotl IMO. Like what are these configs? How does it work? How are my prompts assembled exactly? What does the dataset format need to be? Are there other dataset formats? How do I check the prompt construction? etc. I was actually assuming that the user is indeed familiar with axolotl.
If you are very familiar with axolotl, this --data flag was really confusing to me, because a key parameter in my config that I am used to using is being completely ignored with an extra layer of indirection. I actually got stuck on this personally as an experienced axolotl user, so I found the need to provide these two caveats.

cc: @charlesfrye @winglian curious what you think

Originally posted by @hamelsmu in #48 (comment)

getting TypeError while launching the training job

Here is the error I get when I run the following training command:

modal run --detach src.train --config=config/mistral.yml --data=data/sqlqa.jsonl

TypeError: _StatefulObject.from_name() got an unexpected keyword argument 
'create_if_missing'

the same happens when i try to run the

Do not hardcode LoRA merging

Currently in the README

This example training script is opinionated in order to make it easy to get started. For example, a LoRA adapter is used and merged into the base model after training. This merging is currently hardcoded into the train.py script. You will need to modify this script if you do not wish to fine-tune using a LoRA adapter.

Can the code in this repository be used with my own local GPUs?

Thanks for creating this repository. Just a quick question: can the code in this repository be used with a user's own local GPUs? 😄

Help Needed: Fine-tuning codellama-7b-Instruct Model for pinescript Programming language

I'm having a problem with fine-tuning the codellama-7b-Instruct model for a programming language. The issue is that the model seems to focus too much on the new dataset, and its performance isn't great on new tasks. It's not just overfitting; sometimes, it doesn't do well on new tasks either.

For example:

Base model

User: Hey There! How are you
Model: I am good. How can I help ?

Finetuned model

User: Hey There! How are you
Model: Yes, I can fix your pinescript code. Provide me your issue?

I've tried increasing the number of training epochs to make sure it learns properly. I've also prepared my dataset carefully according to codellama's requirements. I used LORA and PEFT for finetuning. My dataset has 60,000 chat examples, each with 1000 tokens in the context. To make the model more robust, I overlapped the examples by 25%.

Here are my training settings:

Epochs: 15
Batch Size: 6
Gradient Accumulation Step: 2
Learning Rate: 4e-4
Warmup Ratio: 0.05

Lora r : 32
Lora alpha: 32
Lora dropout: 0.05
Target Modules: ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "lm_head"] (all linear layers are trainable)

Is there any way I can merge only some adapter layers and check the performance, rather than merging all layers? I don't want to fine-tune again, since fine-tuning takes 10-15 days.

Any help or suggestions would be highly appreciated.

[issue] train.py

train.py prints out
https://github.com/modal-labs/llama-finetuning/blob/e675b16b902cef931bbf6490f960b228357d1495/train.py#L107

But there's no compare.py