LoftQ: LoRA-Fine-Tuning-Aware Quantization

LoftQ helps you fine-tune LLMs with limited GPUs. 🚀 LoftQ finds a good-enough quantized LoRA initialization: a quantized backbone Q and LoRA adapters A and B, given a pre-trained weight W.

This repo implements the paper 🔗: LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models.

Our models are available on the 🤗 LoftQ Huggingface Hub.

News

Quick Start

Requirements

We use bitsandbytes to implement the quantization. This package only supports CUDA >= 11.0 and does not support CPU. However, we also provide fake quantization for fast and parallel training if adequate GPUs are available.

pip install -r requirements.txt
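
As a quick environment check (a minimal sketch, not part of the official scripts), you can confirm that a CUDA-capable GPU and bitsandbytes are available before attempting real quantization:

import torch

# Real (bitsandbytes) quantization needs a CUDA GPU; fake quantization runs without one.
print("CUDA available:", torch.cuda.is_available())

try:
    import bitsandbytes as bnb
    print("bitsandbytes version:", bnb.__version__)
except ImportError:
    print("bitsandbytes is not installed; only fake quantization will be possible.")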

Steps

  1. Apply LoftQ to a full-precision pre-trained weight and save.
  2. Load LoftQ initialization and train.

For step 1, we provide off-the-shelf LoftQ initializations (see the supported model list in the Appendix) on the LoftQ Huggingface Hub. If you want to do it yourself, jump to LoftQ DIY.

For step 2, below is an example of loading 4-bit Mistral-7B with rank-64 LoRA adapters from the Huggingface Hub.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# fetch the MODEL_ID at https://huggingface.co/LoftQ
MODEL_ID = "LoftQ/Mistral-7B-v0.1-4bit-64rank"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, 
    torch_dtype=torch.bfloat16,  # you may change it with different models
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 is recommended
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
)
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_ID,
    subfolder="loftq_init",
    is_trainable=True,
)

# Do training with peft_model ...
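
To continue from here, a minimal training sketch using the Hugging Face Trainer might look like the following; the dataset, sequence length, and hyper-parameters are illustrative placeholders rather than the settings used in the paper, and we assume the Hub repo also provides the tokenizer:

from datasets import load_dataset
from transformers import AutoTokenizer, Trainer, TrainingArguments, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# Illustrative dataset; replace with your own task data.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataset = dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=peft_model,
    args=TrainingArguments(
        output_dir="exp_results/",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=3e-4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()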

LoftQ DIY

Apply LoftQ and save

We provide quantize_save.py as an example of applying LoftQ with different bits (--bits), ranks (--rank), and alternating steps (--iter, a hyper-parameter in LoftQ; see Algorithm 1 in the LoftQ paper). Currently, this example supports llama-2, falcon, mistral, bart, t5, deberta, bert, and roberta.
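
For intuition, the alternating procedure (Algorithm 1 in the paper) repeatedly quantizes the residual backbone and refits the low-rank factors by SVD so that Q + A Bᵀ approximates the pre-trained weight W. Below is a minimal, self-contained sketch of that loop; it uses a simple uniform fake quantizer as a stand-in for the NF4 quantization used in the actual script, so it illustrates the idea rather than reproducing the exact implementation:

import torch

def fake_quantize(weight: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Stand-in per-tensor uniform quantizer (the real code uses NF4 via bitsandbytes).
    scale = weight.abs().max() / (2 ** (bits - 1) - 1)
    return torch.round(weight / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale

def loftq_init(weight: torch.Tensor, rank: int = 16, num_iter: int = 5):
    # Alternate between quantizing the residual backbone and refitting A, B by SVD,
    # so that Q + A @ B.T stays close to the original weight W.
    A = torch.zeros(weight.shape[0], rank)
    B = torch.zeros(weight.shape[1], rank)
    for _ in range(num_iter):
        Q = fake_quantize(weight - A @ B.T)           # quantize what the adapters do not cover
        U, S, Vh = torch.linalg.svd(weight - Q, full_matrices=False)
        A = U[:, :rank] * S[:rank]                    # rank-r factors of the residual W - Q
        B = Vh[:rank, :].T
    return Q, A, B

W = torch.randn(1024, 1024)
Q, A, B = loftq_init(W, rank=16, num_iter=5)
print("relative reconstruction error:", (torch.norm(W - Q - A @ B.T) / torch.norm(W)).item())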

Below is an example of obtaining 4-bit LLaMA-2-7b with rank-16 LoRA adapters using 5 alternating steps.

SAVE_DIR="model_zoo/loftq/"
# --model_name_or_path: the high-precision model id on HF
# --token: your HF token, required if the model is gated, e.g., llama-2
python quantize_save_load.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --token HF_TOKEN \
    --bits 4 \
    --iter 5 \
    --rank 16 \
    --save_dir $SAVE_DIR

The above command creates the model directory under $SAVE_DIR. Specifically, the model directory is named as

MODEL_DIR = SAVE_DIR + f"{args.model_name_or_path.split('/')[-1]}-{args.bits}bits-{args.rank}rank"

In this example, MODEL_DIR="model_zoo/loftq/Llama-2-7b-hf-4bit-16rank". The backbone is stored in $MODEL_DIR, and the LoRA adapters are in the sub-folder $MODEL_DIR/loftq_init.

Load and train

Similar to loading from the Huggingface Hub, we only need to change MODEL_ID to MODEL_DIR.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_DIR = "model_zoo/loftq/Llama-2-7b-hf-4bit-16rank"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR, 
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
)
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_DIR,
    subfolder="loftq_init",
    is_trainable=True,
)
# Do training with peft_model ...
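
After fine-tuning, only the (small) LoRA adapter weights need to be saved; they can later be re-attached to the same quantized backbone. A minimal sketch, with an illustrative output path:

# Save only the trained LoRA adapters (the quantized backbone stays unchanged).
peft_model.save_pretrained("exp_results/llama-2-7b-4bit-16rank-adapter")

# Re-attach the saved adapters to the quantized backbone for inference.
peft_model = PeftModel.from_pretrained(
    base_model,
    "exp_results/llama-2-7b-4bit-16rank-adapter",
    is_trainable=False,
)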

LoftQ Fine-tuning

We also provide an example of fine-tuning LLAMA-2-7b with LoftQ on GSM8K.

python train_gsm8k.py \
    --model_name_or_path LoftQ/Llama-2-7b-hf-4bit-64rank \
    --learning_rate 3e-4 \
    --seed 11 \
    --expt_name gsm8k_llama2_7b_4bit_64rank_loftq \
    --output_dir exp_results/ \
    --num_train_epochs 6 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --weight_decay 0.1 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --do_train \
    --report_to tensorboard

Other Training Files

  • GLUE: glue/run_glue.py
  • Question Answering: glue/run_qa.py
  • Summarization: train_summarization.py
  • WikiText-2: train_clm.py
  • GSM8K: train_gsm8k.py

More example scripts are in scripts.

Quick Evaluation

Here is the command to test GSM8K with the adapters we have fine-tuned. They are stored in the subfolder 'gsm8k' of the target model on the LoftQ Huggingface Hub.

python test_gsm8k.py \
    --model_name_or_path LoftQ/Llama-2-7b-hf-4bit-64rank \
    --batch_size 16
python test_gsm8k.py \
    --model_name_or_path LoftQ/phi-2-4bit-64rank \
    --batch_size 16

Feel free to change batch_size to accommodate your machine.
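
If you prefer to load those fine-tuned adapters manually instead of going through test_gsm8k.py, a minimal sketch (reusing the same 4-bit BitsAndBytesConfig shown earlier) would be:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_ID = "LoftQ/Llama-2-7b-hf-4bit-64rank"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
)
# The adapters fine-tuned on GSM8K live in the 'gsm8k' subfolder of the Hub repo.
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_ID,
    subfolder="gsm8k",
    is_trainable=False,
)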

Main Results

LLAMA-2 on WikiText-2 and GSM8K

| Bit  | LLAMA-2-7b WikiText-2 (ppl) | LLAMA-2-13b WikiText-2 (ppl) | LLAMA-2-7b GSM8K (acc) | LLAMA-2-13b GSM8K (acc) |
|------|-----------------------------|------------------------------|------------------------|-------------------------|
| 16   | 5.08                        | 5.12                         | 36.9                   | 43.1                    |
| 4    | 5.24                        | 5.16                         | 35.0                   | 45.0                    |
| 3    | 5.63                        | 5.13                         | 32.9                   | 44.4                    |
| 2.5  | 5.78                        | 5.22                         | 31.1                   | 41.1                    |
| 2.25 | 6.13                        | 5.45                         | 26.5                   | 38.1                    |
| 2    | 7.85                        | 7.69                         | 20.9                   | 25.4                    |

Models are fine-tuned through causal language modeling on training sets and are tested on validation/test sets.

Phi-2 on GSM8K

| Model | Bits | Rank | LoRA Initial           | GSM8K    |
|-------|------|------|------------------------|----------|
| Phi-2 | 16   | -    | Full model fine-tuning | 66.8±1.2 |
| Phi-2 | 16   | 64   | Gaussian + 0           | 64.8±0.5 |
| Phi-2 | 4    | 64   | Gaussian + 0 (QLoRA)   | 60.2±0.6 |
| Phi-2 | 4    | 64   | LoftQ                  | 64.1±0.7 |

LLAMA-3 on GSM8K

| Model      | Bits | Rank | LoRA Initial           | GSM8K    |
|------------|------|------|------------------------|----------|
| LLAMA-3-8B | 16   | -    | Full model fine-tuning | 70.4±0.7 |
| LLAMA-3-8B | 16   | 64   | Gaussian + 0 (LoRA)    | 69.3±1.5 |
| LLAMA-3-8B | 4    | 64   | Gaussian + 0 (QLoRA)   | 67.4±1.0 |
| LLAMA-3-8B | 4    | 64   | LoftQ                  | 68.0±0.6 |

Models are fine-tuned through causal language modeling on (reformatted) training sets and are tested on validation/test sets.

BART-large on CNN/DailyMail and XSum

| Bit     | Rank | XSum (ROUGE-1/2/L) | CNN/DailyMail (ROUGE-1/2/L) |
|---------|------|--------------------|-----------------------------|
| Lead-3* | -    | 16.30/1.60/11.95   | 40.42/17.62/36.67           |
| 16      | 16   | 43.95/20.72/35.68  | 45.03/21.84/42.15           |
| 4       | 16   | 44.51/21.14/36.18  | 43.96/21.06/40.96           |
| 2       | 16   | 40.81/17.85/32.80  | 42.52/19.81/39.51           |
| 16      | 8    | 43.40/20.20/35.20  | 44.72/21.58/41.84           |
| 4       | 8    | 44.08/20.72/35.89  | 43.81/20.95/40.84           |
| 2       | 8    | 39.63/16.65/31.62  | 42.24/19.44/29.04           |

*: Using the first 3 sentences in the document as the summary

DeBERTa-V3-base on GLUE using Normal Float Datatype

| Bit | Rank | MNLI (m/mm) | QNLI (Acc) | RTE (Acc) | SST (Acc) | MRPC (Acc) | CoLA (Mcc) | QQP (Acc) | STSB (P/S Corr) | SQuAD (EM/F1) | ANLI (Acc) |
|-----|------|-------------|------------|-----------|-----------|------------|------------|-----------|-----------------|---------------|------------|
| 16  | 16   | 90.5/90.6   | 94.0       | 82.0      | 95.3      | 89.5/93.3  | 69.2       | 92.4/89.8 | 91.6/91.1       | 88.5/92.8     | 59.8       |
| 2   | 16   | 84.7/85.1   | 86.6       | 61.4      | 90.2      | 83.8/88.6  | 37.4       | 90.3/86.9 | 87.1/86.9       | 81.5/88.6     | 47.1       |
| 2   | 32   | 86.0/86.1   | 89.9       | 61.7      | 92.0      | 83.6/87.2  | 47.5       | 91.0/87.9 | 87.5/87.0       | 82.9/89.8     | 49.0       |

DeBERTa-V3-base on GLUE using Uniform Quantization Datatype

| Bit | Rank | MNLI (m/mm) | QNLI (Acc) | RTE (Acc) | SST (Acc) | MRPC (Acc) | CoLA (Mcc) | QQP (Acc) | STSB (P/S Corr) | SQuAD (EM/F1) |
|-----|------|-------------|------------|-----------|-----------|------------|------------|-----------|-----------------|---------------|
| 16  | 16   | 90.5/90.6   | 94.0       | 82.0      | 95.3      | 89.5/93.3  | 69.2       | 92.4/89.8 | 91.6/91.1       | 88.5/92.8     |
| 2   | 16   | 87.3/87.1   | 90.6       | 61.1      | 94.0      | 87.0/90.6  | 59.1       | 90.9/88.0 | 87.9/87.6       | 84.4/91.2     |
| 2   | 32   | 88.0/88.1   | 92.2       | 63.2      | 94.7      | 87.5/91.2  | 60.5       | 91.3/88.3 | 89.5/89.2       | 85.2/91.6     |

Citation

@article{li2023loftq,
  title={Loftq: Lora-fine-tuning-aware quantization for large language models},
  author={Li, Yixiao and Yu, Yifan and Liang, Chen and He, Pengcheng and Karampatziakis, Nikos and Chen, Weizhu and Zhao, Tuo},
  journal={arXiv preprint arXiv:2310.08659},
  year={2023}
}

Appendix: Off-the-shelf Model List

| Model Name    | Bits | Ranks |
|---------------|------|-------|
| LLAMA-3-8B    | 4    | 64    |
| CodeLLAMA-7b  | 4    | 64    |
| CodeLLAMA-13b | 4    | 64    |
| Phi-2         | 4    | 64    |
| LLAMA-2-7b    | 4    | 64    |
| LLAMA-2-13b   | 4    | 64    |
| LLAMA-2-70b   | 4    | 64    |
| Mistral       | 4    | 64    |
| Mistral       | 4    | 32    |
| BART-large    | 4    | 8     |
| BART-large    | 4    | 16    |
| BART-large    | 4    | 32    |
| BART-large    | 2    | 8     |


loftq's Issues

try to run quantize.py but get error CUDA out of memory

When I try to use quantize.py to quantize llama2, I use num_bits=4 and leave the other configurations at their defaults, but I get a CUDA out of memory error while quantizing layer 18. What is wrong here? My GPU is an NVIDIA A100 80GB. When I change num_bits to 2 or 8, it breaks immediately at layer 0.
model.model.layers.18.mlp.up_proj Linear(in_features=4096, out_features=11008, bias=False)
asymmetric NormalFloat
asymmetric NormalFloat
Traceback (most recent call last):
  File "quantize.py", line 143, in <module>
    main(args)
  File "quantize.py", line 105, in main
    utils.replace_module(
  File "/data/juicefs_sharing_data/11164126/py-proj/loftQ/LoftQ/utils.py", line 392, in replace_module
    replace_module(immediate_child_module,
  File "/data/juicefs_sharing_data/11164126/py-proj/loftQ/LoftQ/utils.py", line 392, in replace_module
    replace_module(immediate_child_module,
  File "/data/juicefs_sharing_data/11164126/py-proj/loftQ/LoftQ/utils.py", line 392, in replace_module
    replace_module(immediate_child_module,
  [Previous line repeated 1 more time]
  File "/data/juicefs_sharing_data/11164126/py-proj/loftQ/LoftQ/utils.py", line 367, in replace_module
    qlinear_lora.initial_backbone(weight)
  File "/data/juicefs_sharing_data/11164126/py-proj/loftQ/LoftQ/utils.py", line 278, in initial_backbone
    self.qweight, self.absmax, _ = self.quantizer.quantize_block(weight)
  File "/data/juicefs_sharing_data/11164126/py-proj/loftQ/LoftQ/utils.py", line 174, in quantize_block
    qweight = torch.argmin(abs_diff, dim=-1)  # (L, B)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 688.00 MiB (GPU 0; 79.35 GiB total capacity; 77.40 GiB already allocated; 555.12 MiB free; 77.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Error with shape

RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
size mismatch for base_model.model.model.embed_tokens.weight: copying a param with shape torch.Size([32001, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
size mismatch for base_model.model.lm_head.weight: copying a param with shape torch.Size([32001, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).

I used the checkpoints from the training config:
python train_gsm8k.py \
    --model_name_or_path LoftQ/Llama-2-7b-hf-4bit-64rank \
    --learning_rate 3e-4 \
    --seed 11 \
    --expt_name gsm8k_llama2_7b_4bit_64rank_loftq \
    --output_dir exp_results/ \
    --num_train_epochs 6 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --weight_decay 0.1 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --do_train \
    --report_to tensorboard

fake and true quantization don't match

Hi,

As a debugging step, I want to check whether the fake-quantized and truly quantized models' weights have the same values. Here is how I implement it:

config = AutoConfig.from_pretrained("LoftQ/Llama-2-7b-hf-bit4-rank64", trust_remote_code=False)
loftq_fp16 = AutoModelForCausalLM.from_pretrained(
    "LoftQ/Llama-2-7b-hf-bit4-rank64",
    trust_remote_code=False,
    config=config,
    token="xxx"
)

loftq_fp4 = AutoModelForCausalLM.from_pretrained(
    "LoftQ/Llama-2-7b-hf-bit4-rank64",
    config=config,
    low_cpu_mem_usage=True,
    load_in_4bit=True,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        llm_int8_has_fp16_weight=False,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
    token="xxx"
)

Then I print out some weight values as:
print(loftq_fp16.state_dict()['model.layers.0.self_attn.q_proj.weight'])
The output is:

tensor([[-0.0062, -0.0148, -0.0022,  ...,  0.0045,  0.0017, -0.0036],
        [ 0.0142, -0.0043,  0.0028,  ..., -0.0093, -0.0114,  0.0076],
        [-0.0146,  0.0126,  0.0005,  ...,  0.0063,  0.0188, -0.0031],
        ...,
        [ 0.0013,  0.0109, -0.0003,  ...,  0.0098, -0.0298,  0.0097],
        [ 0.0256,  0.0102,  0.0032,  ..., -0.0334, -0.0156, -0.0123],
        [-0.0134, -0.0066,  0.0018,  ...,  0.0181,  0.0166, -0.0082]])

For loftq_fp4, I do it in this way:

import copy
from bitsandbytes.functional import dequantize_4bit
with torch.no_grad():
    for name, module in loftq_fp4.named_modules():
        if name == "model.layers.0.self_attn.q_proj.base_layer":
            quant_state = copy.deepcopy(module.weight.quant_state)
            dtype = torch.float16
            weights = dequantize_4bit(module.weight.data, quant_state=quant_state, quant_type="nf4").to(dtype)
            print(weights)

The output is:

tensor([[-0.0072, -0.0153, -0.0035,  ...,  0.0047,  0.0000, -0.0054],
        [ 0.0116,  0.0000,  0.0000,  ..., -0.0108, -0.0108,  0.0061],
        [-0.0228,  0.0199,  0.0000,  ...,  0.0096,  0.0195,  0.0000],
        ...,
        [ 0.0000,  0.0141,  0.0000,  ...,  0.0124, -0.0305,  0.0124],
        [ 0.0251,  0.0092,  0.0045,  ..., -0.0317, -0.0172, -0.0111],
        [-0.0153, -0.0072,  0.0031,  ...,  0.0188,  0.0144, -0.0079]],
       device='cuda:0', dtype=torch.float16)

We can see they are quite different, which means the fake quantization doesn't truly reflect the true quantization performance.

Embedding layer

Does LoftQ support when we need to train the embedding layer with QLoRA?

How to execute uniform quantization instead of NF4 quantization?

Hi, thanks for your amazing work. I found that the code uses NF4 quantization by default, but there is no support for switching to uniform quantization (UQ). If I have a model quantized by GPTQ, how can I use LoftQ on it?

I tried a GPTQ-quantized model with PEFT, but it raised an exception as follows:

Traceback (most recent call last):
  File "quantize_save.py", line 221, in <module>
    base_dir, lora_dir = quantize_and_save()
  File "quantize_save.py", line 191, in quantize_and_save
    lora_model = get_peft_model(model, lora_config)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/mapping.py", line 133, in get_peft_model
    return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config, adapter_name=adapter_name)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/peft_model.py", line 1043, in __init__
    super().__init__(model, peft_config, adapter_name)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/peft_model.py", line 125, in __init__
    self.base_model = cls(model, {adapter_name: peft_config}, adapter_name)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/lora/model.py", line 111, in __init__
    super().__init__(model, config, adapter_name)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/tuners_utils.py", line 87, in __init__
    self.inject_adapter(self.model, adapter_name)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/tuners_utils.py", line 244, in inject_adapter
    self._create_and_replace(peft_config, adapter_name, target, target_name, parent, **optional_kwargs)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/lora/model.py", line 181, in _create_and_replace
    new_module = self._create_new_module(lora_config, adapter_name, target, **kwargs)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/lora/model.py", line 283, in _create_new_module
    new_module = QuantLinear(target, adapter_name, **kwargs)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/lora/gptq.py", line 40, in __init__
    self.update_layer(adapter_name, r, lora_alpha, lora_dropout, init_lora_weights, use_rslora)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/lora/layer.py", line 96, in update_layer
    self.loftq_init(adapter_name)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/lora/layer.py", line 134, in loftq_init
    weight = self.get_base_layer().weight
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'QuantLinear' object has no attribute 'weight'

Issues when running python test_gsm8k.py using LoftQ for llama

~/LoftQ-main$ python test_gsm8k.py
--model_name_or_path /rhome/yangyj/LoftQ
--batch_size 16 \

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /rhome/yangyj/anaconda3/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /rhome/yangyj/anaconda3/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /rhome/yangyj/anaconda3/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
WARNING:root:Use the checkpoint in HF hub, stored in the subfolder='gsm8k' in target model.
Loading checkpoint shards: 100%|██████████| 3/3 [00:08<00:00, 2.80s/it]
Traceback (most recent call last):
File "/rhome/yangyj/LoftQ-main/test_gsm8k.py", line 281, in
evaluation(model_args, data_args)
File "/rhome/yangyj/LoftQ-main/test_gsm8k.py", line 128, in evaluation
model = PeftModel.from_pretrained(model,
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/peft/peft_model.py", line 278, in from_pretrained
config = PEFT_TYPE_TO_CONFIG_MAPPING[
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/peft/config.py", line 134, in from_pretrained
config = config_cls(**kwargs)
TypeError: LoraConfig.init() got an unexpected keyword argument 'loftq_config'
--batch_size: command not found

Performance worsens versus QLoRA with TinyLlama

When running with LoftQ, performance is worse with TinyLlama than with QLoRA. The performance gets even worse when I do more iterations for initializing the LoftQ adapters (my grad norm gets worse the more iterations I do).

Is there any reason why applying loftq wouldn't work with TinyLlama?

When working with Mistral, I found that:

  • 1 iteration of LoftQ is slightly better than QLoRA.
  • but 3 iterations of LoftQ is worse than QLoRA.

I'm using a rank of 32 and an alpha of 32 as well. My base learning rate is 1e-4.

I am using unsloth:

if use_4bit and config['use_loftq']:
    loftq_config = LoftQConfig(loftq_bits=4, loftq_iter=1)
    init_lora_weights = "loftq"
else:
    loftq_config = None
    init_lora_weights = True

## Apply LoRA (if use_lora is True in the config)
if config.get('use_lora', False):
    model = FastLanguageModel.get_peft_model(
        model,
        r=config['lora_r'],
        lora_alpha=config['lora_alpha'],
        target_modules=config['lora_modules'],
        modules_to_save=config.get('other_trainable', None),
        lora_dropout = 0, # Dropout = 0 is currently optimized
        bias = "none",    # Bias = "none" is currently optimized
        use_gradient_checkpointing = True,
        random_state = 3407,
        use_rslora=True,
        loftq_config=loftq_config,
        init_lora_weights=init_lora_weights,
    )

quantize_save.py script fails saving lora adapter with peft>=0.7.2

Hi, when running quantize_save.py, an OSError is now thrown when it attempts to call lora_model.save_pretrained(lora_model_dir), saying that the config.json file for the base model doesn't exist. I believe it should be a simple fix: have the script unwrap and save the base model and tokenizer first, moving the call to lora_model.save_pretrained() to the end of quantize_and_save(). I assume the latest version of peft requires that the LoRA's base model exist on disk so it can look up the configuration.

I'm just not sure if it's okay to save the LoRA after unwrapping the base model as it kind of changes the flow of the script? Thoughts? Thanks.

Package Versions:

  • peft: 0.7.2.dev0
  • transformers: 4.36.2

Can't reproduce reported results on GSM8K

Thank you for this great work.

I ran your training script train_gsm8k.sh with only one modification, changing --per_device_train_batch_size from 2 to 1 and --gradient_accumulation_steps from 8 to 16, since my A100 only has 40GB of memory. I also use --fake_quantization since it is more stable for optimization.

However, my results are:
epoch 0: accuracy: 0.244882486732373
epoch 1: accuracy: 0.2880970432145565
epoch 2: accuracy: 0.3055344958301744
epoch 3: accuracy: 0.2979529946929492
epoch 4: accuracy: 0.29492039423805916

There is a huge gap between my best result and your reported result of 35.0. May I ask what might be the cause?

Method fails on Gemma-7B model

Hello, I have tried your method on the gemma-7b model. I found that the method works on the GSM8K dataset but fails on the WikiText-2 dataset. This is my training log:

[WARNING|logging.py:329] 2024-05-15 10:23:39,953 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
/home/yujin-wa20/miniconda3/envs/gact/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
{'loss': 51.6576, 'grad_norm': 470.694580078125, 'learning_rate': 0.0003, 'epoch': 0.11}     
{'loss': 47.9403, 'grad_norm': 437.8383483886719, 'learning_rate': 0.00029890633111470807, 'epoch': 0.21}                                                                                 
{'loss': 23.9947, 'grad_norm': 42.98173904418945, 'learning_rate': 0.00029564127261390776, 'epoch': 0.32}                                                                                 
{'loss': 23.057, 'grad_norm': 132.80783081054688, 'learning_rate': 0.00029025243640281223, 'epoch': 0.43}                                                                                 
{'loss': 21.0726, 'grad_norm': 24.4749755859375, 'learning_rate': 0.0002828184038479814, 'epoch': 0.53}                                                                                   
 19%|███████                                | 5/27 [39:06<2:35:50, 425.04s/it]

I didn't change the original code. Do you know why?

Cannot reproduce the result of LoftQ on gsm8k with llama2-7b

Hi,

I tried using this code to test the performance of LoftQ:

python test_gsm8k.py \
    --model_name_or_path LoftQ/Llama-2-7b-hf-4bit-64rank \
    --batch_size 16

The final ACC is about 0.30. The output information is:

prediction [18.0, 4.0, 68000.0, 540.0, 3.5, 128.0, 260.0, 32.0, 500.0, 412.0, 366.0, 8184.0, 83.0, 8.0, 5.0, 3835.0]
ground truth [18.0, 3.0, 70000.0, 540.0, 20.0, 64.0, 260.0, 160.0, 45.0, 460.0, 366.0, 694.0, 13.0, 18.0, 60.0, 125.0, ...]
adapter: None | GSM8K test accuracy: 0.30% | full precision: False

May I ask whether there is a special setting I need to pay attention to?

Best

About the GPU memory

It seems that you use NF fake quantization, so I guess you can't save GPU memory the way QLoRA does. Am I right?

About the test result on gsm8k

Hi,
I tried using this code to evaluate the performance of LoftQ:

python test_gsm8k.py \
    --model_name_or_path LoftQ/Llama-2-7b-hf-4bit-64rank \
    --batch_size 16

I find that the final ACC is about 40.11%, but the reported result in the paper is 35.0%. So I would like to ask whether my test result is correct.

Thanks

SVD Implementation in the LoftQ Algorithm

Thanks for sharing this great work! I learned a lot. I have a question regarding the SVD implementation in the LoftQ initialization.

LoftQ/utils.py

Lines 25 to 26 in e6bdef4

# Use SVD to decompose a matrix, default full_matrices is False to save parameters
U, S, Vh = torch.linalg.svd(weight, full_matrices=False)

Upon reviewing the code, I noticed the use of the "reduced SVD" option (full_matrices=False) in the SVD decomposition. I am wondering about the potential impact of choosing a reduced SVD over a full SVD in the context of LoftQ initialization, especially regarding its effect on the alternating optimization process.

Could you share some insights on whether there are specific advantages or reasons for choosing reduced SVD in this scenario?

Thanks in advance.

Questions about LoRA merge

Thanks for your great work!

After reading your paper and code, I have a question: how do you merge the LoRA weights into the quantized LLM for inference?

Looking forward to your reply!

Regards!

Reproducing the reported LoRA 16-bit result on GSM8K

Hello,
Great work! I want to reproduce the reported LoRA (16-bit) result (36.9) on the GSM8K dataset from your paper.
Could you provide the correct script or more detailed hyper-parameters? Thanks a lot!

Does it support Mixtral 8x7B?

After I modified the code, there was a problem with the size of the gate's LoRA weight. After loading, I found that lora_A was the same as base_layer, and a size-mismatch problem occurred. Thanks!

Bugs when running python test_gsm8k.py using LoftQ for llama

python test_gsm8k.py --model_name_or_path /rhome/yangyj/pre-train/models--LoftQ--Llama-2-7b-hf-4bit-64rank/snapshots/1bb66ebf4f9050bc619f416a4f3327a21426fc6f --batch_size 16

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /rhome/yangyj/anaconda3/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /rhome/yangyj/anaconda3/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /rhome/yangyj/anaconda3/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
WARNING:root:Use the checkpoint in HF hub, stored in the subfolder='gsm8k' in target model.
Loading checkpoint shards: 100%|██████████| 3/3 [00:40<00:00, 13.66s/it]
Traceback (most recent call last):
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/peft/utils/config.py", line 177, in _get_peft_type
config_file = hf_hub_download(
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
validate_repo_id(arg_value)
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/rhome/yangyj/pre-train/models--LoftQ--Llama-2-7b-hf-4bit-64rank/snapshots/1bb66ebf4f9050bc619f416a4f3327a21426fc6f'. Use repo_type argument if needed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/rhome/yangyj/LoftQ-main/test_gsm8k.py", line 281, in
evaluation(model_args, data_args)
File "/rhome/yangyj/LoftQ-main/test_gsm8k.py", line 128, in evaluation
model = PeftModel.from_pretrained(model,
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/peft/peft_model.py", line 244, in from_pretrained
PeftConfig._get_peft_type(
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/peft/utils/config.py", line 183, in _get_peft_type
raise ValueError(f"Can't find '{CONFIG_NAME}' at '{model_id}'")
ValueError: Can't find 'adapter_config.json' at '/rhome/yangyj/pre-train/models--LoftQ--Llama-2-7b-hf-4bit-64rank/snapshots/1bb66ebf4f9050bc619f416a4f3327a21426fc6f'

Why are base weights on HF LoftQ models in 16-bit?

The script quantize_save_load.py generates a quantized model with LoRA adapters.

The base model is then saved and uploaded to LoftQ repos such as this one.

I'm puzzled why the base model weights are 16-bit there, because that implies the base model is somehow upcast (dequantized) in the quantize_save_load.py script, but I don't see that anywhere.

My baseline expectation is that either:
a) The backbone would be stored in nf4, and then loaded with the 16 bit adapters on top, or
b) The backbone would be upcasted to 16-bit, and then quantized in nf4 upon loading with the 16-bit adapters on top. [But then there should be some upcasting code in quantize_save_load.py].

Could someone clarify? Thanks.

LoftQ cannot train with multiple GPUs

When I set:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
it raises the error:
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [42,0,0], thread: [64,0,0] Assertion srcIndex < srcSelectDimSize failed.

return (element == self).any().item() # type: ignore[union-attr]
RuntimeError: CUDA error: device-side assert triggered

How can I fix this?

Loss tends to be NaN or Inf

Thank you for this great work!

May I ask whether the reported results come from fake_quantization=True or fake_quantization=False? When I use fake_quantization=False, training doesn't succeed on the GSM8K task with llama-2-7b, always leading to NaN or Inf loss.

Quantized models issue

Hello and thank you for the work.

I have been attempting to quantize the Mistral, T5, and Falcon models. I can finish the process and save them, but they do not seem to perform inference properly from the saved safetensors model after it is loaded (either in 4-bit or 16-bit). I suspect that I might have made some errors.
The sequence I coded:
1. Load the pretrained model and tokenizer
2. Configure LoraConfig
3. Run utils.replace_module() [screenshot attached]
4. Save the pretrained model

Is that right?
Also, I haven't fine-tuned with PEFT yet after quantization.

While I am familiar with performing PEFT/LoRA and believe I have correctly identified the target modules (referred to as "allow_name" in your documentation), I am struggling to find the correct "block_name" for each model. I'm aware of the default allow_name and block_name lists and tried to investigate using the snippet:

for name, param in pretrained_model.named_parameters():
    print(name, param.shape, param.max(), param.mean(), param.requires_grad)

I suspect some mistake in the block names might be the root of the issue, but you can tell me best.

I would greatly appreciate your help in resolving this issue.

In addition, I have implemented the code in a notebook, following the quantize.py example. I am executing the Lora_Config before calling utils.replace_module, in line with the sequence in your file. Please let me know if there is any aspect of this process that I might be misunderstanding.

Thanks a lot once again.


Failing to converge when using some random seeds

Dear Authors,

Your work is truly exceptional and I am currently attempting to reproduce it. However, I've observed noticeable performance variations when employing different random seeds. For example, during the fine-tuning of Deberta-v3-base on the 'mrpc' task, setting the random seed to '0' results in an evaluation accuracy of 85.05. In contrast, when I choose '71' or '37' as the random seed, the evaluation accuracy significantly drops to 68.38, essentially failing to converge.

Could you possibly offer any guidance regarding this matter? Moreover, I would greatly appreciate it if you could disclose the random seeds you utilized in this work.

Thank you!

A question from a novice.

Hello, if I wish to validate your baseline, what should I do? I am a beginner, and I'm still not quite clear after reading the Quick Start guide.
