LoftQ: LoRA-Fine-Tuning-Aware Quantization

LoftQ helps you fine-tune LLMs with limited GPUs. 🚀 LoftQ finds a good-enough quantized LoRA initialization: a quantized backbone Q and LoRA adapters A and B, given a pre-trained weight W.

This repo implements the paper 🔗: LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models.

Our models are available on the 🤗 LoftQ Huggingface Hub.

News

Quick Start

Requirements

We use bitsandbytes to implement the quantization. This package only supports CUDA >= 11.0 and does not support CPU. However, we also provide fake quantization for fast and parallel training if adequate GPUs are available.

pip install -r requirements.txt
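
As a quick environment check (a minimal sketch, not part of the official scripts), you can confirm that a CUDA-capable GPU and bitsandbytes are available before attempting real quantization:

import torch

# Real (bitsandbytes) quantization needs a CUDA GPU; fake quantization runs without one.
print("CUDA available:", torch.cuda.is_available())

try:
    import bitsandbytes as bnb
    print("bitsandbytes version:", bnb.__version__)
except ImportError:
    print("bitsandbytes is not installed; only fake quantization will be possible.")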

Steps

  1. Apply LoftQ to a full-precision pre-trained weight and save.
  2. Load LoftQ initialization and train.

For step 1, we provide off-the-shelf LoftQ initializations (see the supported model list in the Appendix) on the LoftQ Huggingface Hub. If you want to do it yourself, jump to LoftQ DIY.

For step 2, below is an example of loading 4-bit Mistral-7B with rank-64 LoRA adapters from the Huggingface Hub.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# fetch the MODEL_ID at https://huggingface.co/LoftQ
MODEL_ID = "LoftQ/Mistral-7B-v0.1-4bit-64rank"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, 
    torch_dtype=torch.bfloat16,  # you may change it with different models
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 is recommended
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
)
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_ID,
    subfolder="loftq_init",
    is_trainable=True,
)

# Do training with peft_model ...
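
To continue from here, a minimal training sketch using the Hugging Face Trainer might look like the following; the dataset, sequence length, and hyper-parameters are illustrative placeholders rather than the settings used in the paper, and we assume the Hub repo also provides the tokenizer:

from datasets import load_dataset
from transformers import AutoTokenizer, Trainer, TrainingArguments, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# Illustrative dataset; replace with your own task data.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataset = dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=peft_model,
    args=TrainingArguments(
        output_dir="exp_results/",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=3e-4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()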

LoftQ DIY

Apply LoftQ and save

We provide quantize_save.py as an example of applying LoftQ with different bits (--bits), ranks (--rank), and alternating steps (--iter, a hyper-parameter in LoftQ; see Algorithm 1 in the LoftQ paper). Currently, this example supports llama-2, falcon, mistral, bart, t5, deberta, bert, and roberta.
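
For intuition, the alternating procedure (Algorithm 1 in the paper) repeatedly quantizes the residual backbone and refits the low-rank factors by SVD so that Q + A Bᵀ approximates the pre-trained weight W. Below is a minimal, self-contained sketch of that loop; it uses a simple uniform fake quantizer as a stand-in for the NF4 quantization used in the actual script, so it illustrates the idea rather than reproducing the exact implementation:

import torch

def fake_quantize(weight: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Stand-in per-tensor uniform quantizer (the real code uses NF4 via bitsandbytes).
    scale = weight.abs().max() / (2 ** (bits - 1) - 1)
    return torch.round(weight / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale

def loftq_init(weight: torch.Tensor, rank: int = 16, num_iter: int = 5):
    # Alternate between quantizing the residual backbone and refitting A, B by SVD,
    # so that Q + A @ B.T stays close to the original weight W.
    A = torch.zeros(weight.shape[0], rank)
    B = torch.zeros(weight.shape[1], rank)
    for _ in range(num_iter):
        Q = fake_quantize(weight - A @ B.T)           # quantize what the adapters do not cover
        U, S, Vh = torch.linalg.svd(weight - Q, full_matrices=False)
        A = U[:, :rank] * S[:rank]                    # rank-r factors of the residual W - Q
        B = Vh[:rank, :].T
    return Q, A, B

W = torch.randn(1024, 1024)
Q, A, B = loftq_init(W, rank=16, num_iter=5)
print("relative reconstruction error:", (torch.norm(W - Q - A @ B.T) / torch.norm(W)).item())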

Below is an example of obtaining 4-bit LLaMA-2-7b with rank-16 LoRA adapters using 5 alternating steps.

SAVE_DIR="model_zoo/loftq/"
# --model_name_or_path: the high-precision model id on HF
# --token: your HF token, required if the model is gated, e.g., llama-2
python quantize_save_load.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --token HF_TOKEN \
    --bits 4 \
    --iter 5 \
    --rank 16 \
    --save_dir $SAVE_DIR

The above command creates the model directory under $SAVE_DIR. Specifically, the model directory is named as

MODEL_DIR = SAVE_DIR + f"{args.model_name_or_path.split('/')[-1]}-{args.bits}bits-{args.rank}rank"

In this example, MODEL_DIR="model_zoo/loftq/Llama-2-7b-hf-4bit-16rank". The backbone is stored in $MODEL_DIR, and the LoRA adapters are in the sub-folder $MODEL_DIR/loftq_init.

Load and train

Similar to loading from the Huggingface Hub, we only need to change MODEL_ID to MODEL_DIR.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_DIR = "model_zoo/loftq/Llama-2-7b-hf-4bit-16rank"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR, 
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
)
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_DIR,
    subfolder="loftq_init",
    is_trainable=True,
)
# Do training with peft_model ...
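
After fine-tuning, only the (small) LoRA adapter weights need to be saved; they can later be re-attached to the same quantized backbone. A minimal sketch, with an illustrative output path:

# Save only the trained LoRA adapters (the quantized backbone stays unchanged).
peft_model.save_pretrained("exp_results/llama-2-7b-4bit-16rank-adapter")

# Re-attach the saved adapters to the quantized backbone for inference.
peft_model = PeftModel.from_pretrained(
    base_model,
    "exp_results/llama-2-7b-4bit-16rank-adapter",
    is_trainable=False,
)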

LoftQ Fine-tuning

We also provide an example of fine-tuning LLAMA-2-7b with LoftQ on GSM8K.

python train_gsm8k.py \
    --model_name_or_path LoftQ/Llama-2-7b-hf-4bit-64rank \
    --learning_rate 3e-4 \
    --seed 11 \
    --expt_name gsm8k_llama2_7b_4bit_64rank_loftq \
    --output_dir exp_results/ \
    --num_train_epochs 6 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --weight_decay 0.1 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --do_train \
    --report_to tensorboard

Other Training Files

  • GLUE: glue/run_glue.py
  • Question Answering: glue/run_qa.py
  • Summarization: train_summarization.py
  • WikiText-2: train_clm.py
  • GSM8K: train_gsm8k.py

More example scripts are in scripts.

Quick Evaluation

Here is the command to test GSM8K with the adapters we have fine-tuned. They are stored in the subfolder 'gsm8k' of the target model on the LoftQ Huggingface Hub.

python test_gsm8k.py \
    --model_name_or_path LoftQ/Llama-2-7b-hf-4bit-64rank \
    --batch_size 16
python test_gsm8k.py \
    --model_name_or_path LoftQ/phi-2-4bit-64rank \
    --batch_size 16

Feel free to change batch_size to accommodate your machine.
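
If you prefer to load those fine-tuned adapters manually instead of going through test_gsm8k.py, a minimal sketch (reusing the same 4-bit BitsAndBytesConfig shown earlier) would be:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_ID = "LoftQ/Llama-2-7b-hf-4bit-64rank"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
)
# The adapters fine-tuned on GSM8K live in the 'gsm8k' subfolder of the Hub repo.
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_ID,
    subfolder="gsm8k",
    is_trainable=False,
)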

Main Results

LLAMA-2 on WikiText-2 and GSM8K

| Bit  | LLAMA-2-7b WikiText-2 (ppl) | LLAMA-2-13b WikiText-2 (ppl) | LLAMA-2-7b GSM8K (acc) | LLAMA-2-13b GSM8K (acc) |
|------|-----------------------------|------------------------------|------------------------|-------------------------|
| 16   | 5.08                        | 5.12                         | 36.9                   | 43.1                    |
| 4    | 5.24                        | 5.16                         | 35.0                   | 45.0                    |
| 3    | 5.63                        | 5.13                         | 32.9                   | 44.4                    |
| 2.5  | 5.78                        | 5.22                         | 31.1                   | 41.1                    |
| 2.25 | 6.13                        | 5.45                         | 26.5                   | 38.1                    |
| 2    | 7.85                        | 7.69                         | 20.9                   | 25.4                    |

Models are fine-tuned through causal language modeling on training sets and are tested on validation/test sets.

Phi-2 on GSM8K

| Model | Bits | Rank | LoRA Initial           | GSM8K    |
|-------|------|------|------------------------|----------|
| Phi-2 | 16   | -    | Full model fine-tuning | 66.8±1.2 |
| Phi-2 | 16   | 64   | Gaussian + 0           | 64.8±0.5 |
| Phi-2 | 4    | 64   | Gaussian + 0 (QLoRA)   | 60.2±0.6 |
| Phi-2 | 4    | 64   | LoftQ                  | 64.1±0.7 |

LLAMA-3 on GSM8K

| Model      | Bits | Rank | LoRA Initial           | GSM8K    |
|------------|------|------|------------------------|----------|
| LLAMA-3-8B | 16   | -    | Full model fine-tuning | 70.4±0.7 |
| LLAMA-3-8B | 16   | 64   | Gaussian + 0 (LoRA)    | 69.3±1.5 |
| LLAMA-3-8B | 4    | 64   | Gaussian + 0 (QLoRA)   | 67.4±1.0 |
| LLAMA-3-8B | 4    | 64   | LoftQ                  | 68.0±0.6 |

Models are fine-tuned through causal language modeling on (reformatted) training sets and are tested on validation/test sets.

BART-large on CNN/DailyMail and XSum

| Bit     | Rank | XSum (ROUGE-1/2/L) | CNN/DailyMail (ROUGE-1/2/L) |
|---------|------|--------------------|-----------------------------|
| Lead-3* | -    | 16.30/1.60/11.95   | 40.42/17.62/36.67           |
| 16      | 16   | 43.95/20.72/35.68  | 45.03/21.84/42.15           |
| 4       | 16   | 44.51/21.14/36.18  | 43.96/21.06/40.96           |
| 2       | 16   | 40.81/17.85/32.80  | 42.52/19.81/39.51           |
| 16      | 8    | 43.40/20.20/35.20  | 44.72/21.58/41.84           |
| 4       | 8    | 44.08/20.72/35.89  | 43.81/20.95/40.84           |
| 2       | 8    | 39.63/16.65/31.62  | 42.24/19.44/29.04           |

*: Using the first 3 sentences in the document as the summary

DeBERTa-V3-base on GLUE using Normal Float Datatype

| Bit | Rank | MNLI (m/mm) | QNLI (Acc) | RTE (Acc) | SST (Acc) | MRPC (Acc) | CoLA (Mcc) | QQP (Acc) | STSB (P/S Corr) | SQuAD (EM/F1) | ANLI (Acc) |
|-----|------|-------------|------------|-----------|-----------|------------|------------|-----------|-----------------|---------------|------------|
| 16  | 16   | 90.5/90.6   | 94.0       | 82.0      | 95.3      | 89.5/93.3  | 69.2       | 92.4/89.8 | 91.6/91.1       | 88.5/92.8     | 59.8       |
| 2   | 16   | 84.7/85.1   | 86.6       | 61.4      | 90.2      | 83.8/88.6  | 37.4       | 90.3/86.9 | 87.1/86.9       | 81.5/88.6     | 47.1       |
| 2   | 32   | 86.0/86.1   | 89.9       | 61.7      | 92.0      | 83.6/87.2  | 47.5       | 91.0/87.9 | 87.5/87.0       | 82.9/89.8     | 49.0       |

DeBERTa-V3-base on GLUE using Uniform Quantization Datatype

| Bit | Rank | MNLI (m/mm) | QNLI (Acc) | RTE (Acc) | SST (Acc) | MRPC (Acc) | CoLA (Mcc) | QQP (Acc) | STSB (P/S Corr) | SQuAD (EM/F1) |
|-----|------|-------------|------------|-----------|-----------|------------|------------|-----------|-----------------|---------------|
| 16  | 16   | 90.5/90.6   | 94.0       | 82.0      | 95.3      | 89.5/93.3  | 69.2       | 92.4/89.8 | 91.6/91.1       | 88.5/92.8     |
| 2   | 16   | 87.3/87.1   | 90.6       | 61.1      | 94.0      | 87.0/90.6  | 59.1       | 90.9/88.0 | 87.9/87.6       | 84.4/91.2     |
| 2   | 32   | 88.0/88.1   | 92.2       | 63.2      | 94.7      | 87.5/91.2  | 60.5       | 91.3/88.3 | 89.5/89.2       | 85.2/91.6     |

Citation

@article{li2023loftq,
  title={Loftq: Lora-fine-tuning-aware quantization for large language models},
  author={Li, Yixiao and Yu, Yifan and Liang, Chen and He, Pengcheng and Karampatziakis, Nikos and Chen, Weizhu and Zhao, Tuo},
  journal={arXiv preprint arXiv:2310.08659},
  year={2023}
}

Appendix: Off-the-shelf Model List

| Model Name    | Bits | Ranks |
|---------------|------|-------|
| LLAMA-3-8B    | 4    | 64    |
| CodeLLAMA-7b  | 4    | 64    |
| CodeLLAMA-13b | 4    | 64    |
| Phi-2         | 4    | 64    |
| LLAMA-2-7b    | 4    | 64    |
| LLAMA-2-13b   | 4    | 64    |
| LLAMA-2-70b   | 4    | 64    |
| Mistral       | 4    | 64    |
| Mistral       | 4    | 32    |
| BART-large    | 4    | 8     |
| BART-large    | 4    | 16    |
| BART-large    | 4    | 32    |
| BART-large    | 2    | 8     |


loftq's Issues

try to run quantize.py but get error CUDA out of memory

When I try to use quantize.py to quantize llama2, I use num_bits=4 and leave the other configurations at their defaults, but I get a CUDA out of memory error while quantizing layer 18. What is wrong here? My GPU is an NVIDIA A100 80GB. When I change num_bits to 2 or 8, it breaks immediately at layer 0.
model.model.layers.18.mlp.up_proj Linear(in_features=4096, out_features=11008, bias=False)
asymmetric NormalFloat
asymmetric NormalFloat
Traceback (most recent call last):
  File "quantize.py", line 143, in <module>
    main(args)
  File "quantize.py", line 105, in main
    utils.replace_module(
  File "/data/juicefs_sharing_data/11164126/py-proj/loftQ/LoftQ/utils.py", line 392, in replace_module
    replace_module(immediate_child_module,
  File "/data/juicefs_sharing_data/11164126/py-proj/loftQ/LoftQ/utils.py", line 392, in replace_module
    replace_module(immediate_child_module,
  File "/data/juicefs_sharing_data/11164126/py-proj/loftQ/LoftQ/utils.py", line 392, in replace_module
    replace_module(immediate_child_module,
  [Previous line repeated 1 more time]
  File "/data/juicefs_sharing_data/11164126/py-proj/loftQ/LoftQ/utils.py", line 367, in replace_module
    qlinear_lora.initial_backbone(weight)
  File "/data/juicefs_sharing_data/11164126/py-proj/loftQ/LoftQ/utils.py", line 278, in initial_backbone
    self.qweight, self.absmax, _ = self.quantizer.quantize_block(weight)
  File "/data/juicefs_sharing_data/11164126/py-proj/loftQ/LoftQ/utils.py", line 174, in quantize_block
    qweight = torch.argmin(abs_diff, dim=-1)  # (L, B)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 688.00 MiB (GPU 0; 79.35 GiB total capacity; 77.40 GiB already allocated; 555.12 MiB free; 77.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Error with shape

RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
size mismatch for base_model.model.model.embed_tokens.weight: copying a param with shape torch.Size([32001, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
size mismatch for base_model.model.lm_head.weight: copying a param with shape torch.Size([32001, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).

I used the checkpoints from the training config:
python train_gsm8k.py \
    --model_name_or_path LoftQ/Llama-2-7b-hf-4bit-64rank \
    --learning_rate 3e-4 \
    --seed 11 \
    --expt_name gsm8k_llama2_7b_4bit_64rank_loftq \
    --output_dir exp_results/ \
    --num_train_epochs 6 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --weight_decay 0.1 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --do_train \
    --report_to tensorboard

fake and true quantization don't match

Hi,

As a debugging step, I want to check whether the fake-quantized and truly quantized models' weights have the same values. Here is how I implement it:

config = AutoConfig.from_pretrained("LoftQ/Llama-2-7b-hf-bit4-rank64", trust_remote_code=False)
loftq_fp16 = AutoModelForCausalLM.from_pretrained(
    "LoftQ/Llama-2-7b-hf-bit4-rank64",
    trust_remote_code=False,
    config=config,
    token="xxx"
)

loftq_fp4 = AutoModelForCausalLM.from_pretrained(
    "LoftQ/Llama-2-7b-hf-bit4-rank64",
    config=config,
    low_cpu_mem_usage=True,
    load_in_4bit=True,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        llm_int8_has_fp16_weight=False,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
    token="xxx"
)

Then I print out some weight values as:
print(loftq_fp16.state_dict()['model.layers.0.self_attn.q_proj.weight'])
The output is:

tensor([[-0.0062, -0.0148, -0.0022,  ...,  0.0045,  0.0017, -0.0036],
        [ 0.0142, -0.0043,  0.0028,  ..., -0.0093, -0.0114,  0.0076],
        [-0.0146,  0.0126,  0.0005,  ...,  0.0063,  0.0188, -0.0031],
        ...,
        [ 0.0013,  0.0109, -0.0003,  ...,  0.0098, -0.0298,  0.0097],
        [ 0.0256,  0.0102,  0.0032,  ..., -0.0334, -0.0156, -0.0123],
        [-0.0134, -0.0066,  0.0018,  ...,  0.0181,  0.0166, -0.0082]])

For loftq_fp4, I do it in this way:

import copy
from bitsandbytes.functional import dequantize_4bit
with torch.no_grad():
    for name, module in loftq_fp4.named_modules():
        if name == "model.layers.0.self_attn.q_proj.base_layer":
            quant_state = copy.deepcopy(module.weight.quant_state)
            dtype = torch.float16
            weights = dequantize_4bit(module.weight.data, quant_state=quant_state, quant_type="nf4").to(dtype)
            print(weights)

The output is:

tensor([[-0.0072, -0.0153, -0.0035,  ...,  0.0047,  0.0000, -0.0054],
        [ 0.0116,  0.0000,  0.0000,  ..., -0.0108, -0.0108,  0.0061],
        [-0.0228,  0.0199,  0.0000,  ...,  0.0096,  0.0195,  0.0000],
        ...,
        [ 0.0000,  0.0141,  0.0000,  ...,  0.0124, -0.0305,  0.0124],
        [ 0.0251,  0.0092,  0.0045,  ..., -0.0317, -0.0172, -0.0111],
        [-0.0153, -0.0072,  0.0031,  ...,  0.0188,  0.0144, -0.0079]],
       device='cuda:0', dtype=torch.float16)

We can see they are quite different, which means the fake quantization doesn't truly reflect the true quantization performance.

Embedding layer

Does LoftQ support when we need to train the embedding layer with QLoRA?

How to execute uniform quantization instead of NF4 quantization?

Hi, thanks for your amazing work. I found that the code uses NF4 quantization by default, but there is no support for switching to uniform quantization (UQ). If I have a model quantized by GPTQ, how can I use LoftQ on it?

I tried a GPTQ-quantized model with PEFT, but it raised an exception as follows:

Traceback (most recent call last):
  File "quantize_save.py", line 221, in <module>
    base_dir, lora_dir = quantize_and_save()
  File "quantize_save.py", line 191, in quantize_and_save
    lora_model = get_peft_model(model, lora_config)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/mapping.py", line 133, in get_peft_model
    return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config, adapter_name=adapter_name)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/peft_model.py", line 1043, in __init__
    super().__init__(model, peft_config, adapter_name)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/peft_model.py", line 125, in __init__
    self.base_model = cls(model, {adapter_name: peft_config}, adapter_name)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/lora/model.py", line 111, in __init__
    super().__init__(model, config, adapter_name)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/tuners_utils.py", line 87, in __init__
    self.inject_adapter(self.model, adapter_name)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/tuners_utils.py", line 244, in inject_adapter
    self._create_and_replace(peft_config, adapter_name, target, target_name, parent, **optional_kwargs)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/lora/model.py", line 181, in _create_and_replace
    new_module = self._create_new_module(lora_config, adapter_name, target, **kwargs)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/lora/model.py", line 283, in _create_new_module
    new_module = QuantLinear(target, adapter_name, **kwargs)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/lora/gptq.py", line 40, in __init__
    self.update_layer(adapter_name, r, lora_alpha, lora_dropout, init_lora_weights, use_rslora)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/lora/layer.py", line 96, in update_layer
    self.loftq_init(adapter_name)
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/lora/layer.py", line 134, in loftq_init
    weight = self.get_base_layer().weight
  File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'QuantLinear' object has no attribute 'weight'

Issues when running python test_gsm8k.py using LoftQ for llama

~/LoftQ-main$ python test_gsm8k.py
--model_name_or_path /rhome/yangyj/LoftQ
--batch_size 16 \

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /rhome/yangyj/anaconda3/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /rhome/yangyj/anaconda3/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /rhome/yangyj/anaconda3/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
WARNING:root:Use the checkpoint in HF hub, stored in the subfolder='gsm8k' in target model.
Loading checkpoint shards: 100%|██████████| 3/3 [00:08<00:00, 2.80s/it]
Traceback (most recent call last):
File "/rhome/yangyj/LoftQ-main/test_gsm8k.py", line 281, in
evaluation(model_args, data_args)
File "/rhome/yangyj/LoftQ-main/test_gsm8k.py", line 128, in evaluation
model = PeftModel.from_pretrained(model,
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/peft/peft_model.py", line 278, in from_pretrained
config = PEFT_TYPE_TO_CONFIG_MAPPING[
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/peft/config.py", line 134, in from_pretrained
config = config_cls(**kwargs)
TypeError: LoraConfig.init() got an unexpected keyword argument 'loftq_config'
--batch_size: command not found

Performance worsens versus QLoRA with TinyLlama

When running with LoftQ, performance is worse with TinyLlama than with QLoRA. The performance gets even worse when I do more iterations for initializing the LoftQ adapters (my grad norm gets worse the more iterations I do).

Is there any reason why applying loftq wouldn't work with TinyLlama?

When working with Mistral, I found that:

  • 1 iteration of LoftQ is slightly better than QLoRA.
  • but 3 iterations of LoftQ is worse than QLoRA.

I'm using a rank of 32 and an alpha of 32 as well. My base learning rate is 1e-4.

I am using unsloth:

if use_4bit and config['use_loftq']:
    loftq_config = LoftQConfig(loftq_bits=4, loftq_iter=1)
    init_lora_weights = "loftq"
else:
    loftq_config = None
    init_lora_weights = True

## Apply LoRA (if use_lora is True in the config)
if config.get('use_lora', False):
    model = FastLanguageModel.get_peft_model(
        model,
        r=config['lora_r'],
        lora_alpha=config['lora_alpha'],
        target_modules=config['lora_modules'],
        modules_to_save=config.get('other_trainable', None),
        lora_dropout = 0, # Dropout = 0 is currently optimized
        bias = "none",    # Bias = "none" is currently optimized
        use_gradient_checkpointing = True,
        random_state = 3407,
        use_rslora=True,
        loftq_config=loftq_config,
        init_lora_weights=init_lora_weights,
    )

quantize_save.py script fails saving lora adapter with peft>=0.7.2

Hi, when running quantize_save.py, an OSError is now thrown when it attempts to call lora_model.save_pretrained(lora_model_dir), saying that the config.json file for the base model doesn't exist. I believe it should be a simple fix: have the script unwrap and save the base model and tokenizer first, moving the call to lora_model.save_pretrained() to the end of quantize_and_save(). I assume the latest version of peft requires that the LoRA's base model exist on disk so it can look up the configuration.

I'm just not sure if it's okay to save the LoRA after unwrapping the base model as it kind of changes the flow of the script? Thoughts? Thanks.

Package Versions:

  • peft: 0.7.2.dev0
  • transformers: 4.36.2

Can't reproduce reported results on GSM8K

Thank you for this great work.

I ran your training script train_gsm8k.sh with only one modification, changing --per_device_train_batch_size from 2 to 1 and --gradient_accumulation_steps from 8 to 16, since my A100 only has 40GB of memory. I also use --fake_quantization since it is more stable for optimization.

However, my results are:
epoch 0: accuracy: 0.244882486732373
epoch 1: accuracy: 0.2880970432145565
epoch 2: accuracy: 0.3055344958301744
epoch 3: accuracy: 0.2979529946929492
epoch 4: accuracy: 0.29492039423805916

There is a huge gap between my best result and your reported result of 35.0. May I ask what might be the cause?

Method fails on Gemma-7B model

Hello, I have tried your method on the gemma-7b model. I found that the method works on the GSM8K dataset but fails on the WikiText-2 dataset. This is my training log:

[WARNING|logging.py:329] 2024-05-15 10:23:39,953 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
/home/yujin-wa20/miniconda3/envs/gact/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
{'loss': 51.6576, 'grad_norm': 470.694580078125, 'learning_rate': 0.0003, 'epoch': 0.11}     
{'loss': 47.9403, 'grad_norm': 437.8383483886719, 'learning_rate': 0.00029890633111470807, 'epoch': 0.21}                                                                                 
{'loss': 23.9947, 'grad_norm': 42.98173904418945, 'learning_rate': 0.00029564127261390776, 'epoch': 0.32}                                                                                 
{'loss': 23.057, 'grad_norm': 132.80783081054688, 'learning_rate': 0.00029025243640281223, 'epoch': 0.43}                                                                                 
{'loss': 21.0726, 'grad_norm': 24.4749755859375, 'learning_rate': 0.0002828184038479814, 'epoch': 0.53}                                                                                   
 19%|███████                                | 5/27 [39:06<2:35:50, 425.04s/it]

I didn't change the original code. Do you know why?

Cannot reproduce the result of LoftQ on gsm8k with llama2-7b

Hi,

I tried using this code to test the performance of LoftQ:

python test_gsm8k.py \
    --model_name_or_path LoftQ/Llama-2-7b-hf-4bit-64rank \
    --batch_size 16

The final ACC is about 0.30. The output information is:

prediction [18.0, 4.0, 68000.0, 540.0, 3.5, 128.0, 260.0, 32.0, 500.0, 412.0, 366.0, 8184.0, 83.0, 8.0, 5.0, 3835.0]
ground truth [18.0, 3.0, 70000.0, 540.0, 20.0, 64.0, 260.0, 160.0, 45.0, 460.0, 366.0, 694.0, 13.0, 18.0, 60.0, 125.0, ...]
adapter: None | GSM8K test accuracy: 0.30% | full precision: False

May I ask whether there is a special setting I need to pay attention to?

Best

About the GPU memory

It seems that you use NF fake quantization, so I guess you can't save GPU memory the way QLoRA does. Am I right?

About the test result on gsm8k

Hi,
I tried using this code to evaluate the performance of LoftQ:

python test_gsm8k.py \
    --model_name_or_path LoftQ/Llama-2-7b-hf-4bit-64rank \
    --batch_size 16

I find that the final ACC is about 40.11%, but the reported result in the paper is 35.0%. So I would like to ask whether my test result is correct.

Thanks

SVD Implementation in the LoftQ Algorithm

Thanks for sharing this great work! I learned a lot. I have a question regarding the SVD implementation in the LoftQ initialization.

LoftQ/utils.py

Lines 25 to 26 in e6bdef4

# Use SVD to decompose a matrix, default full_matrices is False to save parameters
U, S, Vh = torch.linalg.svd(weight, full_matrices=False)

Upon reviewing the code, I noticed the use of the "reduced SVD" option (full_matrices=False) in the SVD decomposition. I am wondering about the potential impact of choosing a reduced SVD over a full SVD in the context of LoftQ initialization, especially regarding its effect on the alternating optimization process.

Could you share some insights on whether there are specific advantages or reasons for choosing reduced SVD in this scenario?

Thanks in advance.

Questions about LoRA merge

Thanks for your great work!

After reading your paper and code, I have a question: how do you merge the LoRA weights into the quantized LLM for inference?

Looking forward to your reply!

Regards!

Reproducing the reported LoRA 16-bit result on GSM8K

Hello,
Great work! I want to reproduce the reported LoRA (16-bit) result (36.9) on the GSM8K dataset from your paper.
Could you provide the correct script or more detailed hyper-parameters? Thanks a lot!

Does it support Mixtral 8x7B?

After I modified the code, there was a problem with the size of the gate's LoRA weight. After loading, I found that lora_A was the same as base_layer, and a size-mismatch problem occurred. Thanks!

Bugs when running python test_gsm8k.py using LoftQ for llama

python test_gsm8k.py --model_name_or_path /rhome/yangyj/pre-train/models--LoftQ--Llama-2-7b-hf-4bit-64rank/snapshots/1bb66ebf4f9050bc619f416a4f3327a21426fc6f --batch_size 16

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /rhome/yangyj/anaconda3/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /rhome/yangyj/anaconda3/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /rhome/yangyj/anaconda3/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
WARNING:root:Use the checkpoint in HF hub, stored in the subfolder='gsm8k' in target model.
Loading checkpoint shards: 100%|██████████| 3/3 [00:40<00:00, 13.66s/it]
Traceback (most recent call last):
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/peft/utils/config.py", line 177, in _get_peft_type
config_file = hf_hub_download(
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
validate_repo_id(arg_value)
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/rhome/yangyj/pre-train/models--LoftQ--Llama-2-7b-hf-4bit-64rank/snapshots/1bb66ebf4f9050bc619f416a4f3327a21426fc6f'. Use repo_type argument if needed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/rhome/yangyj/LoftQ-main/test_gsm8k.py", line 281, in
evaluation(model_args, data_args)
File "/rhome/yangyj/LoftQ-main/test_gsm8k.py", line 128, in evaluation
model = PeftModel.from_pretrained(model,
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/peft/peft_model.py", line 244, in from_pretrained
PeftConfig._get_peft_type(
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/peft/utils/config.py", line 183, in _get_peft_type
raise ValueError(f"Can't find '{CONFIG_NAME}' at '{model_id}'")
ValueError: Can't find 'adapter_config.json' at '/rhome/yangyj/pre-train/models--LoftQ--Llama-2-7b-hf-4bit-64rank/snapshots/1bb66ebf4f9050bc619f416a4f3327a21426fc6f'

Why are base weights on HF LoftQ models in 16-bit?

The script quantize_save_load.py generates a quantized model with LoRA adapters.

The base model is then saved and uploaded to LoftQ repos such as this one.

I'm puzzled why the base model weights are 16-bit there, because that implies the base model is somehow upcast (dequantized) in the quantize_save_load.py script, but I don't see that anywhere.

My baseline expectation is that either:
a) The backbone would be stored in nf4, and then loaded with the 16 bit adapters on top, or
b) The backbone would be upcasted to 16-bit, and then quantized in nf4 upon loading with the 16-bit adapters on top. [But then there should be some upcasting code in quantize_save_load.py].

Could someone clarify? Thanks.

LoftQ cannot train with multiple GPUs

When I set:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
it raises the error:
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [42,0,0], thread: [64,0,0] Assertion srcIndex < srcSelectDimSize failed.

return (element == self).any().item() # type: ignore[union-attr]
RuntimeError: CUDA error: device-side assert triggered

How can I fix this?

Loss tends to be NaN or Inf

Thank you for this great work!

May I ask whether the reported results come from fake_quantization=True or fake_quantization=False? When I use fake_quantization=False, training doesn't succeed on the GSM8K task with llama-2-7b, always leading to NaN or Inf loss.

Quantized models issue

Hello and thank you for the work.

I have been attempting to quantize the Mistral, T5, and Falcon models. I can finish the process and save them, but they do not seem to perform inference properly from the saved safetensors model after it is loaded (either in 4-bit or 16-bit). I suspect that I might have made some errors.
The sequence I coded:
1. Load the pretrained model and tokenizer
2. Configure LoraConfig
3. Run utils.replace_module() [screenshot attached]
4. Save the pretrained model

Is that right?
Also, I haven't fine-tuned with PEFT yet after quantization.

While I am familiar with performing PEFT/LoRA and believe I have correctly identified the target modules (referred to as "allow_name" in your documentation), I am struggling to find the correct "block_name" for each model. I'm aware of the default allow_name and block_name lists and tried to investigate using the snippet:

for name, param in pretrained_model.named_parameters():
    print(name, param.shape, param.max(), param.mean(), param.requires_grad)

I suspect some mistake in the block names might be the root of the issue, but you can tell me best.

I would greatly appreciate your help in resolving this issue.

In addition, I have implemented the code in a notebook, following the quantize.py example. I am executing the Lora_Config before calling utils.replace_module, in line with the sequence in your file. Please let me know if there is any aspect of this process that I might be misunderstanding.

Thanks a lot once again.


Failing to converge when using some random seeds

Dear Authors,

Your work is truly exceptional and I am currently attempting to reproduce it. However, I've observed noticeable performance variations when employing different random seeds. For example, during the fine-tuning of Deberta-v3-base on the 'mrpc' task, setting the random seed to '0' results in an evaluation accuracy of 85.05. In contrast, when I choose '71' or '37' as the random seed, the evaluation accuracy significantly drops to 68.38, essentially failing to converge.

Could you possibly offer any guidance regarding this matter? Moreover, I would greatly appreciate it if you could disclose the random seeds you utilized in this work.

Thank you!

A question from a novice.

Hello, if I wish to validate your baseline, what should I do? I am a beginner, and I'm still not quite clear after reading the Quick Start guide.
