yxli2123 / loftq
License: MIT License
Hello, if I wish to validate your baseline, what should I do? I am a beginner, and I'm still not quite clear after reading the Quick Start guide.
python test_gsm8k.py --model_name_or_path /rhome/yangyj/pre-train/models--LoftQ--Llama-2-7b-hf-4bit-64rank/snapshots/1bb66ebf4f9050bc619f416a4f3327a21426fc6f --batch_size 16
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
bin /rhome/yangyj/anaconda3/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /rhome/yangyj/anaconda3/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /rhome/yangyj/anaconda3/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
WARNING:root:Use the checkpoint in HF hub, stored in the subfolder='gsm8k' in target model.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:40<00:00, 13.66s/it]
Traceback (most recent call last):
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/peft/utils/config.py", line 177, in _get_peft_type
config_file = hf_hub_download(
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
validate_repo_id(arg_value)
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/rhome/yangyj/pre-train/models--LoftQ--Llama-2-7b-hf-4bit-64rank/snapshots/1bb66ebf4f9050bc619f416a4f3327a21426fc6f'. Use repo_type argument if needed.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/rhome/yangyj/LoftQ-main/test_gsm8k.py", line 281, in <module>
evaluation(model_args, data_args)
File "/rhome/yangyj/LoftQ-main/test_gsm8k.py", line 128, in evaluation
model = PeftModel.from_pretrained(model,
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/peft/peft_model.py", line 244, in from_pretrained
PeftConfig._get_peft_type(
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/peft/utils/config.py", line 183, in _get_peft_type
raise ValueError(f"Can't find '{CONFIG_NAME}' at '{model_id}'")
ValueError: Can't find 'adapter_config.json' at '/rhome/yangyj/pre-train/models--LoftQ--Llama-2-7b-hf-4bit-64rank/snapshots/1bb66ebf4f9050bc619f416a4f3327a21426fc6f'
After I modified the code, there was a problem with the gate size of lora weight. After loading, I found that lora_a was the same as base_layer, and a size_mismatch problem occurred. Thanks!
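The second traceback says PEFT could not find adapter_config.json at the snapshot root, while the script's own warning mentions subfolder='gsm8k'. A minimal sketch for checking where an adapter config actually lives in a local directory before loading it; `find_adapter_dir` is a hypothetical helper, and the idea that the adapter sits in a subfolder is an assumption drawn only from that warning line.

```python
from pathlib import Path
from typing import Optional

CONFIG_NAME = "adapter_config.json"  # file name taken from the ValueError above


def find_adapter_dir(model_dir: str) -> Optional[Path]:
    """Return the first directory under model_dir that holds an adapter config.

    Hypothetical helper: it only inspects the local file system and does not
    import peft or contact the Hugging Face Hub.
    """
    root = Path(model_dir)
    if (root / CONFIG_NAME).is_file():
        return root
    # Look further down, e.g. <model_dir>/gsm8k/adapter_config.json.
    for config_path in sorted(root.rglob(CONFIG_NAME)):
        return config_path.parent
    return None
```

If the helper returns a subdirectory, pointing the adapter loader at that directory may avoid the lookup failure. A result of None suggests the local snapshot contains no adapter files at all.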
Hello, I have tried your method on the gemma-7b model. It works on the GSM8K dataset but fails on the wikitext-2 dataset. This is my training log:
[WARNING|logging.py:329] 2024-05-15 10:23:39,953 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
/home/yujin-wa20/miniconda3/envs/gact/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
{'loss': 51.6576, 'grad_norm': 470.694580078125, 'learning_rate': 0.0003, 'epoch': 0.11}
{'loss': 47.9403, 'grad_norm': 437.8383483886719, 'learning_rate': 0.00029890633111470807, 'epoch': 0.21}
{'loss': 23.9947, 'grad_norm': 42.98173904418945, 'learning_rate': 0.00029564127261390776, 'epoch': 0.32}
{'loss': 23.057, 'grad_norm': 132.80783081054688, 'learning_rate': 0.00029025243640281223, 'epoch': 0.43}
{'loss': 21.0726, 'grad_norm': 24.4749755859375, 'learning_rate': 0.0002828184038479814, 'epoch': 0.53}
19%|███████████████████████████▏ | 5/27 [39:06<2:35:50, 425.04s/it]
I didn't change the original code. Do you know why?
Hi,
I try to use this code to evaluate the performance of LoftQ:
python test_gsm8k.py \
--model_name_or_path LoftQ/Llama-2-7b-hf-4bit-64rank \
--batch_size 16
I get a final accuracy of about 40.11%, but the result reported in the paper is 35.0%. Is my test result correct?
Thanks
It seems that you use NF fake quantization, so I guess you can't save the GPU memory like QLoRA. Am I right?
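The memory concern can be made concrete with rough arithmetic. The sketch below assumes, as the question supposes, that fake quantization keeps the weights in 16-bit storage, and it counts weight memory only. The 7e9 parameter count is a round figure chosen for illustration.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GiB.

    Ignores activations, optimizer state, LoRA adapters, and quantization
    constants such as per-block scales.
    """
    return num_params * bits_per_weight / 8 / 2**30


params = 7e9  # assumed round figure for a "7B" model
print(f"16-bit weights: {weight_memory_gib(params, 16):.1f} GiB")
print(f" 4-bit weights: {weight_memory_gib(params, 4):.1f} GiB")
```

Under these assumptions the 4-bit layout needs a quarter of the weight memory, so a run that stores 16-bit values would not see that saving even if the values lie on the quantization grid.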
I'm wondering why not just push the adapter model alone? That would seem sufficient?
I found the implementation of 4-bit quantization, but I couldn't find a 2-bit one. Can you tell me how to implement fine-tuning for a 2-bit quantized model?
Thx for your great job!
After reading your paper and code, I have a question: How do you merge LoRA weights to quantized LLM for inference?
Looking forward to your reply!
Regards!
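For reference, the standard LoRA merge adds the scaled low-rank product to the dequantized weight, W + scaling * (B @ A). The toy sketch below shows only that arithmetic in plain Python; the matrices and the scaling value are made up, and nothing here is confirmed as the repository's own merge procedure.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w_dequant: Matrix, lora_b: Matrix, lora_a: Matrix, scaling: float) -> Matrix:
    """Return W + scaling * (B @ A), computed on dequantized weights."""
    delta = matmul(lora_b, lora_a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_dequant, delta)]


# Toy 2x2 layer with a rank-1 adapter; all values are made up.
w = [[0.5, -0.25], [0.0, 1.0]]
b = [[1.0], [2.0]]   # shape (out_features, r)
a = [[0.1, 0.2]]     # shape (r, in_features)
merged = merge_lora(w, b, a, scaling=0.5)
print(merged)
```

The merged values generally do not lie on the 4-bit grid any more, so the merged matrix either stays in 16-bit or has to be quantized again, which reintroduces some error.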
Thank you for this great work.
I ran your training script train_gsm8k.sh with only one modification, changing --per_device_train_batch_size 2 and --gradient_accumulation_steps 8 to 1 and 16, since my A100 only has 40GB memory. I also use the --fake_quantization since it's more stable for optimization.
However, my results are:
epoch 0: accuracy: 0.244882486732373
epoch 1: accuracy: 0.2880970432145565
epoch 2: accuracy: 0.3055344958301744
epoch 3: accuracy: 0.2979529946929492
epoch 4: accuracy: 0.29492039423805916
There is a huge gap between my best result and your reported result 35.0. May I ask what might be the cause?
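As a sanity check on the modified settings, the effective batch size per optimizer step can be computed directly. The sketch assumes a single GPU; if the reference run used several devices, its effective batch size would be larger and the two runs would not be comparable.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of samples contributing to one optimizer step."""
    return per_device * grad_accum * num_devices


original = effective_batch_size(per_device=2, grad_accum=8)
modified = effective_batch_size(per_device=1, grad_accum=16)
print(original, modified)
```

Both settings give the same value on one device, so under that assumption the change in batch size and accumulation steps alone would not be expected to explain the gap.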
Hello and thank you for the work.
I have been attempting to quantize the Mistral, T5, and Falcon models. I can finish the process and save them, but the saved safetensors model does not seem to perform inference properly after it is loaded (either in 4-bit or 16-bit). I suspect that I might have made some errors.
The sequence in my code:
1- Load the pretrained model and tokenizer
2- Configure LoraConfig
3- Run utils.replace_module() [screenshot attached]
4- Save pretrained model
Is that right?
Besides, I haven't fine-tuned with PEFT yet after quantization.
While I am familiar with performing PEFT/LoRA and believe I have correctly identified the target modules (referred to as "allow_name" in your documentation), I am struggling to find the correct "block_name" for each model. I'm aware of the default allow name and block name lists and tried to investigate using this snippet:
for name, param in pretrained_model.named_parameters():
    print(name, param.shape, param.max(), param.mean(), param.requires_grad)
I suspect a mistake in the block names might be the root of the issue, but you would know best.
I would greatly appreciate your help in resolving this issue.
In addition, I have implemented the code in a notebook, following the quantize.py example. I am executing the Lora_Config before calling utils.replace_module, in line with the sequence in your file. Please let me know if there is any aspect of this process that I might be misunderstanding.
Thanks a lot once again.
The script quantize_save_load.py generates a quantized model with LoRA adapters.
The base model is then saved and uploaded to LoftQ repos such as this one.
I'm puzzled why the base model weights are 16-bits there because that implies that the base model is somehow upcasted (dequantized) in the quantize_save_load.py script, but I don't see that anywhere.
My baseline expectation is that either:
a) The backbone would be stored in nf4, and then loaded with the 16 bit adapters on top, or
b) The backbone would be upcasted to 16-bit, and then quantized in nf4 upon loading with the 16-bit adapters on top. [But then there should be some upcasting code in quantize_save_load.py].
Could someone clarify? Thanks.
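One property relevant to option (b) can be checked on a toy example: with block-wise absmax scaling and a codebook that contains both -1 and 1, quantizing values that are already dequantized reproduces the same codes. The sketch below uses approximate NF4 levels (the rounded numbers are an assumption), plain Python floats, and a single block. Whether the released checkpoints and the loading path meet the same conditions (block size, float16 rounding) is not confirmed here.

```python
from typing import List, Tuple

# Approximate NF4 levels (rounded); treat the exact numbers as an assumption.
LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
          0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]


def quantize_block(block: List[float]) -> Tuple[List[int], float]:
    """Absmax-scale one block and map each value to its nearest level index."""
    absmax = max(abs(x) for x in block)
    if absmax == 0.0:
        return [LEVELS.index(0.0)] * len(block), 0.0
    codes = [min(range(len(LEVELS)), key=lambda i: abs(x / absmax - LEVELS[i]))
             for x in block]
    return codes, absmax


def dequantize_block(codes: List[int], absmax: float) -> List[float]:
    """Map level indices back to real values using the block scale."""
    return [LEVELS[i] * absmax for i in codes]


block = [0.013, -0.0298, 0.0005, 0.0188, -0.0062, 0.0256, -0.0114, 0.0031]
codes, absmax = quantize_block(block)
fake_quantized = dequantize_block(codes, absmax)  # what a 16-bit file would hold
codes_again, absmax_again = quantize_block(fake_quantized)
print(codes == codes_again, absmax == absmax_again)
```

If the saved 16-bit backbone already holds dequantized values, this would be one reason no separate upcasting step is needed before quantizing again at load time, subject to the caveats above.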
hey! thanks for the repo and the recent updates on Llama-3 results. i am wondering are you finetuning the Base LM or the instruct-tuned one?
Hi,
As a debugging step, I want to check whether the fake-quantized and truly quantized models have the same weight values. Here is how I implement it:
import torch
from transformers import AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

config = AutoConfig.from_pretrained("LoftQ/Llama-2-7b-hf-bit4-rank64", trust_remote_code=False)
loftq_fp16 = AutoModelForCausalLM.from_pretrained(
    "LoftQ/Llama-2-7b-hf-bit4-rank64",
    trust_remote_code=False,
    config=config,
    token="xxx"
)
loftq_fp4 = AutoModelForCausalLM.from_pretrained(
    "LoftQ/Llama-2-7b-hf-bit4-rank64",
    config=config,
    low_cpu_mem_usage=True,
    load_in_4bit=True,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        llm_int8_has_fp16_weight=False,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
    token="xxx"
)
Then I print out some weight values as:
print(loftq_fp16.state_dict()['model.layers.0.self_attn.q_proj.weight'])
The output is:
tensor([[-0.0062, -0.0148, -0.0022, ..., 0.0045, 0.0017, -0.0036],
[ 0.0142, -0.0043, 0.0028, ..., -0.0093, -0.0114, 0.0076],
[-0.0146, 0.0126, 0.0005, ..., 0.0063, 0.0188, -0.0031],
...,
[ 0.0013, 0.0109, -0.0003, ..., 0.0098, -0.0298, 0.0097],
[ 0.0256, 0.0102, 0.0032, ..., -0.0334, -0.0156, -0.0123],
[-0.0134, -0.0066, 0.0018, ..., 0.0181, 0.0166, -0.0082]])
For loftq_fp4, I do it in this way:
import copy
from bitsandbytes.functional import dequantize_4bit
with torch.no_grad():
    for name, module in loftq_fp4.named_modules():
        if name == "model.layers.0.self_attn.q_proj.base_layer":
            quant_state = copy.deepcopy(module.weight.quant_state)
            dtype = torch.float16
            weights = dequantize_4bit(module.weight.data, quant_state=quant_state, quant_type="nf4").to(dtype)
            print(weights)
The output is:
tensor([[-0.0072, -0.0153, -0.0035, ..., 0.0047, 0.0000, -0.0054],
[ 0.0116, 0.0000, 0.0000, ..., -0.0108, -0.0108, 0.0061],
[-0.0228, 0.0199, 0.0000, ..., 0.0096, 0.0195, 0.0000],
...,
[ 0.0000, 0.0141, 0.0000, ..., 0.0124, -0.0305, 0.0124],
[ 0.0251, 0.0092, 0.0045, ..., -0.0317, -0.0172, -0.0111],
[-0.0153, -0.0072, 0.0031, ..., 0.0188, 0.0144, -0.0079]],
device='cuda:0', dtype=torch.float16)
We can see they are quite different, which means the fake quantization doesn't truly reflect the true quantization performance.
~/LoftQ-main$ python test_gsm8k.py
--model_name_or_path /rhome/yangyj/LoftQ
--batch_size 16 \
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
bin /rhome/yangyj/anaconda3/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /rhome/yangyj/anaconda3/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /rhome/yangyj/anaconda3/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
WARNING:root:Use the checkpoint in HF hub, stored in the subfolder='gsm8k' in target model.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:08<00:00, 2.80s/it]
Traceback (most recent call last):
File "/rhome/yangyj/LoftQ-main/test_gsm8k.py", line 281, in <module>
evaluation(model_args, data_args)
File "/rhome/yangyj/LoftQ-main/test_gsm8k.py", line 128, in evaluation
model = PeftModel.from_pretrained(model,
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/peft/peft_model.py", line 278, in from_pretrained
config = PEFT_TYPE_TO_CONFIG_MAPPING[
File "/rhome/yangyj/anaconda3/lib/python3.10/site-packages/peft/config.py", line 134, in from_pretrained
config = config_cls(**kwargs)
TypeError: LoraConfig.__init__() got an unexpected keyword argument 'loftq_config'
--batch_size: command not found
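An "unexpected keyword argument 'loftq_config'" error usually means the installed config class does not define that field, which would be consistent with a PEFT version that predates LoftQ support (an assumption, since the log does not show the installed version). A minimal sketch of a check that can be run before loading; `accepts_kwarg` is a hypothetical helper built on the standard inspect module.

```python
import inspect


def accepts_kwarg(cls: type, name: str) -> bool:
    """True if cls.__init__ names the keyword or takes arbitrary **kwargs."""
    params = inspect.signature(cls.__init__).parameters
    if name in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Intended use (requires peft to be installed):
#   from peft import LoraConfig
#   print(accepts_kwarg(LoraConfig, "loftq_config"))
```

If the check prints False, upgrading PEFT is the first thing to try before changing the evaluation script.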
C:\Users\41936>git clone https://huggingface.co/LoftQ/Llama-2-7b-hf-4bit-64rank
Cloning into 'Llama-2-7b-hf-4bit-64rank'...
fatal: unable to access 'https://huggingface.co/LoftQ/Llama-2-7b-hf-4bit-64rank/': Failed to connect to huggingface.co port 443 after 21077 ms: Couldn't connect to server
as the title
Hello,
Great work! I want to reproduce the LoRA (16-bit) result (36.9) on the GSM8K dataset reported in your paper.
Could you provide the correct script or more detailed hyper-parameters? Thx a lot!
Hi, thanks for your amazing work. I found that the code uses NF4 quantization by default and doesn't provide an option to switch to UQ (uniform quantization). If I have a model quantized by GPTQ, how can I use LoftQ with it?
I tried a GPTQ-quantized model with PEFT, but it raised the following exception:
Traceback (most recent call last):
File "quantize_save.py", line 221, in <module>
base_dir, lora_dir = quantize_and_save()
File "quantize_save.py", line 191, in quantize_and_save
lora_model = get_peft_model(model, lora_config)
File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/mapping.py", line 133, in get_peft_model
return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config, adapter_name=adapter_name)
File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/peft_model.py", line 1043, in __init__
super().__init__(model, peft_config, adapter_name)
File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/peft_model.py", line 125, in __init__
self.base_model = cls(model, {adapter_name: peft_config}, adapter_name)
File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/lora/model.py", line 111, in __init__
super().__init__(model, config, adapter_name)
File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/tuners_utils.py", line 87, in __init__
self.inject_adapter(self.model, adapter_name)
File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/tuners_utils.py", line 244, in inject_adapter
self._create_and_replace(peft_config, adapter_name, target, target_name, parent, **optional_kwargs)
File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/lora/model.py", line 181, in _create_and_replace
new_module = self._create_new_module(lora_config, adapter_name, target, **kwargs)
File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/lora/model.py", line 283, in _create_new_module
new_module = QuantLinear(target, adapter_name, **kwargs)
File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/lora/gptq.py", line 40, in __init__
self.update_layer(adapter_name, r, lora_alpha, lora_dropout, init_lora_weights, use_rslora)
File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/lora/layer.py", line 96, in update_layer
self.loftq_init(adapter_name)
File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/peft/tuners/lora/layer.py", line 134, in loftq_init
weight = self.get_base_layer().weight
File "/root/miniconda3/envs/chatglm3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'QuantLinear' object has no attribute 'weight'
When I use quantize.py to quantize Llama 2 with num_bits=4 and all other settings at their defaults, I get a CUDA out-of-memory error while quantizing layer 18. My GPU is an NVIDIA A100 80GB. When I change num_bits to 2 or 8, it fails quickly at layer 0. What is going wrong?
model.model.layers.18.mlp.up_proj Linear(in_features=4096, out_features=11008, bias=False)
['T_destination', '__annotations__', '__call__', '__class__', '__constants__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_apply', '_backward_hooks', '_buffers', '_call_impl', '_forward_hooks', '_forward_pre_hooks', '_get_backward_hooks', '_get_name', '_is_full_backward_hook', '_is_hf_initialized', '_load_from_state_dict', '_load_state_dict_post_hooks', '_load_state_dict_pre_hooks', '_maybe_warn_non_full_backward_hook', '_modules', '_named_members', '_non_persistent_buffers_set', '_parameters', '_register_load_state_dict_pre_hook', '_register_state_dict_hook', '_replicate_for_data_parallel', '_save_to_state_dict', '_slow_forward', '_state_dict_hooks', '_version', 'add_module', 'apply', 'bfloat16', 'bias', 'buffers', 'children', 'cpu', 'cuda', 'double', 'dump_patches', 'eval', 'extra_repr', 'float', 'forward', 'get_buffer', 'get_extra_state', 'get_parameter', 'get_submodule', 'half', 'in_features', 'ipu', 'load_state_dict', 'modules', 'named_buffers', 'named_children', 'named_modules', 'named_parameters', 'out_features', 'parameters', 'register_backward_hook', 'register_buffer', 'register_forward_hook', 'register_forward_pre_hook', 'register_full_backward_hook', 'register_load_state_dict_post_hook', 'register_module', 'register_parameter', 'requires_grad_', 'reset_parameters', 'set_extra_state', 'share_memory', 'state_dict', 'to', 'to_empty', 'train', 'training', 'type', 'weight', 'xpu', 'zero_grad']
asymmetric NormalFloat
asymmetric NormalFloat
Traceback (most recent call last):
File "quantize.py", line 143, in <module>
main(args)
File "quantize.py", line 105, in main
utils.replace_module(
File "/data/juicefs_sharing_data/11164126/py-proj/loftQ/LoftQ/utils.py", line 392, in replace_module
replace_module(immediate_child_module,
File "/data/juicefs_sharing_data/11164126/py-proj/loftQ/LoftQ/utils.py", line 392, in replace_module
replace_module(immediate_child_module,
File "/data/juicefs_sharing_data/11164126/py-proj/loftQ/LoftQ/utils.py", line 392, in replace_module
replace_module(immediate_child_module,
[Previous line repeated 1 more time]
File "/data/juicefs_sharing_data/11164126/py-proj/loftQ/LoftQ/utils.py", line 367, in replace_module
qlinear_lora.initial_backbone(weight)
File "/data/juicefs_sharing_data/11164126/py-proj/loftQ/LoftQ/utils.py", line 278, in initial_backbone
self.qweight, self.absmax, _ = self.quantizer.quantize_block(weight)
File "/data/juicefs_sharing_data/11164126/py-proj/loftQ/LoftQ/utils.py", line 174, in quantize_block
qweight = torch.argmin(abs_diff, dim=-1) # (L, B)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 688.00 MiB (GPU 0; 79.35 GiB total capacity; 77.40 GiB already allocated; 555.12 MiB free; 77.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
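The traceback ends inside quantize_block at an argmin over abs_diff. Assuming abs_diff holds one entry per weight per quantization level (a guess based on the variable name and the "(L, B)" comment, not on the source of utils.py), its size for this layer can be estimated as follows. `broadcast_tensor_mib` is a hypothetical helper.

```python
def broadcast_tensor_mib(out_features: int, in_features: int,
                         num_levels: int, bytes_per_element: int) -> float:
    """Size in MiB of a tensor holding one entry per (weight, level) pair."""
    elements = out_features * in_features * num_levels
    return elements * bytes_per_element / 2**20


# up_proj shape from the log above; 16 levels corresponds to num_bits=4.
for name, nbytes in [("1 byte", 1), ("float16", 2), ("float32", 4)]:
    size = broadcast_tensor_mib(11008, 4096, 2**4, nbytes)
    print(f"{name:>8}: {size:,.0f} MiB per intermediate")
```

The 688.00 MiB in the error message equals this element count at one byte per element, which may or may not be a coincidence. In any case a single intermediate of this size does not account for 77.40 GiB already allocated, so memory retained from earlier layers is also worth checking.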
Thanks for sharing great work! Learned a lot. I have a question regarding SVD implementation in loftQ init.
Lines 25 to 26 in e6bdef4
Upon reviewing the code, I noticed that the SVD decomposition uses the "reduced SVD" option (full_matrices=False). I am wondering about the potential impact of choosing a reduced SVD over a full SVD in the context of LoftQ initialization, especially regarding its effect on the alternating optimization process.
Could you share some insights on whether there are specific advantages or reasons for choosing reduced SVD in this scenario?
Thanks in advance.
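For context on this question, the standard relationship between the two SVD forms can be written out. The symbols are assumptions for illustration: M stands for the residual being decomposed (written here as W - Q), and r for the adapter rank.

```latex
\begin{aligned}
M &= W - Q \in \mathbb{R}^{m \times n}, \qquad k = \min(m, n), \\
\text{full SVD:}\quad M &= U \Sigma V^{\top}, \quad
  U \in \mathbb{R}^{m \times m},\ \Sigma \in \mathbb{R}^{m \times n},\ V \in \mathbb{R}^{n \times n}, \\
\text{reduced SVD:}\quad M &= U_k \Sigma_k V_k^{\top}, \quad
  U_k \in \mathbb{R}^{m \times k},\ \Sigma_k \in \mathbb{R}^{k \times k},\ V_k \in \mathbb{R}^{n \times k}, \\
\text{rank-}r \text{ truncation } (r \le k):\quad M_r &= \sum_{i=1}^{r} \sigma_i\, u_i v_i^{\top}.
\end{aligned}
```

The truncation uses only the first r singular triplets, and those appear in both forms, so the rank-r approximation is the same either way (up to the usual sign and ordering ambiguities of singular vectors), however the singular values are split between the two factors. The extra columns in the full form multiply zero blocks of the singular value matrix, so the reduced form mainly saves computation and memory.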
Hi,
I try to use this code to test the performance of LoftQ:
python test_gsm8k.py \
--model_name_or_path LoftQ/Llama-2-7b-hf-4bit-64rank \
--batch_size 16
The final ACC is about 0.30. The output information is:
prediction [18.0, 4.0, 68000.0, 540.0, 3.5, 128.0, 260.0, 32.0, 500.0, 412.0, 366.0, 8184.0, 83.0, 8.0, 5.0, 3835.0]
ground truth [18.0, 3.0, 70000.0, 540.0, 20.0, 64.0, 260.0, 160.0, 45.0, 460.0, 366.0, 694.0, 13.0, 18.0, 60.0, 125.0, 230.0, 57500.0, 7.0, 6.0, 15.0, 14.0, 7.0, 8.0, 26.0, 2.0, 243.0, 16.0, 25.0, 104.0, 109.0, 80.0, 35.0, 70.0, 23.0, 9.0, 75.0, 2.0, 10.0, 18.0, 8.0, 200.0, 26.0, 48.0, 20.0, 104.0, 163.0, 800.0, 8.0, 30.0, 294.0, 5.0, 15.0, 40.0, 40.0, 14.0, 3.0, 83.0, 57.0, 187.0, 17.0, 1430.0, 25000.0, 1596.0, 300.0, 36.0, 48.0, 595.0, 36.0, 60.0, 7425.0, 60.0, 221.0, 255.0, 88.0, 60.0, 5.0, 100.0, 6.0, 70.0, 10.0, 17.0, 623.0, 600.0, 15.0, 44.0, 22.0, 9360.0, 8000.0, 24.0, 225.0, 28.0, 4.0, 36.0, 348.0, 40.0, 3.0, 12.0, 5.0, 58.0, 175.0, 6.0, 26.0, 140.0, 500.0, 20.0, 72.0, 3.0, 50.0, 28.0, 45.0, 16.0, 24.0, 25.0, 6.0, 90.0, 42.0, 360.0, 4.0, 95200.0, 240.0, 27.0, 48.0, 50.0, 10.0, 10.0, 82.0, 120.0, 880.0, 10000.0, 30.0, 940.0, 60.0, 13.0, 720.0, 40.0, 6.0, 29.0, 105.0, 70.0, 20.0, 400.0, 140.0, 16.0, 20.0, 4000.0, 2125.0, 75.0, 30.0, 16.0, 4.0, 5.0, 4.0, 48.0, 272.0, 280.0, 1400.0, 80.0, 34.0, 15.0, 16.0, 32.0, 92.0, 50.0, 15.0, 77.0, 5.0, 16.0, 18.0, 120.0, 150.0, 1210.0, 51.0, 18000.0, 95.0, 15.0, 100.0, 350.0, 122.0, 130.0, 20.0, 160.0, 23.0, 2.0, 25.0, 30.0, 5.0, 106.0, 50.0, 34.0, 360.0, 5.0, 91.0, 24.0, 10.0, 12.0, 120.0, 6277.0, 320.0, 7500.0, 55.0, 114200.0, 100.0, 31.0, 98.0, 98.0, 860.0, 2600.0, 76.0, 145.0, 10.0, 4.0, 5.0, 250.0, 8.0, 44.0, 220.0, 15.0, 45.0, 54.0, 70.0,...]
adapter: None | GSM8K test accuracy: 0.30% | full precision: False
May I ask whether there is a special setting I need to pay attention to?
Best
Does LoftQ support training the embedding layer together with QLoRA?
Hi Team,
I need your help to optimize VFMs like OWL-ViT v2 and Grounding DINO to reduce their memory size and deploy them on edge devices.
Your feedback and reference code will be helpful.
with thanks
Hi, when running quantize_save.py where it attempts to call lora_model.save_pretrained(lora_model_dir), an OSError exception is now being thrown saying that the config.json file for the base model doesn't exist. I believe it should be a simple fix by having the script unwrap and save the base model and tokenizer first, moving the call to lora_model.save_pretrained() to the end of quantize_and_save(). I assume the latest version of peft requires that the LoRA's base model exist on disk so it can look up the configuration.
I'm just not sure if it's okay to save the LoRA after unwrapping the base model as it kind of changes the flow of the script? Thoughts? Thanks.
Package Versions:
When running with LoftQ, performance with TinyLlama is worse than with QLoRA. It gets even worse when I run more iterations to initialize the LoftQ adapters (my grad norm gets worse the more iterations I do).
Is there any reason why applying LoftQ wouldn't work with TinyLlama?
When working with Mistral, I found that:
I'm using a rank of 32 and alpha of 32 as well. My base learning rate is 1e-4 .
I am using unsloth:
if use_4bit and config['use_loftq']:
    loftq_config = LoftQConfig(loftq_bits=4, loftq_iter=1)
    init_lora_weights = "loftq"
else:
    loftq_config = None
    init_lora_weights = True
## Apply LoRA (if use_lora is True in the config)
if config.get('use_lora', False):
    model = FastLanguageModel.get_peft_model(
        model,
        r=config['lora_r'],
        lora_alpha=config['lora_alpha'],
        target_modules=config['lora_modules'],
        modules_to_save=config.get('other_trainable', None),
        lora_dropout = 0, # Dropout = 0 is currently optimized
        bias = "none", # Bias = "none" is currently optimized
        use_gradient_checkpointing = True,
        random_state = 3407,
        use_rslora=True,
        loftq_config=loftq_config,
        init_lora_weights=init_lora_weights,
    )
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
size mismatch for base_model.model.model.embed_tokens.weight: copying a param with shape torch.Size([32001, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
size mismatch for base_model.model.lm_head.weight: copying a param with shape torch.Size([32001, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
I used the checkpoints from the training config:
python train_gsm8k.py \
  --model_name_or_path LoftQ/Llama-2-7b-hf-4bit-64rank \
  --learning_rate 3e-4 \
  --seed 11 \
  --expt_name gsm8k_llama2_7b_4bit_64rank_loftq \
  --output_dir exp_results/ \
  --num_train_epochs 6 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --evaluation_strategy "no" \
  --save_strategy "epoch" \
  --weight_decay 0.1 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 10 \
  --do_train \
  --report_to tensorboard
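The two shapes in the error differ by exactly one row (32001 in the checkpoint, 32000 in the current model), which would be consistent with one token, for example a padding token, having been added to the tokenizer during training. That cause is an assumption. A minimal sketch of aligning the embedding size before loading the adapter; `align_vocab` is a hypothetical helper, while `get_input_embeddings` and `resize_token_embeddings` are existing methods on Transformers models.

```python
def align_vocab(model, tokenizer) -> int:
    """Resize the model's embeddings to the tokenizer length if they differ.

    Expects a Transformers-style model and tokenizer. Call this before
    loading adapter weights that were saved with the larger vocabulary.
    """
    target_rows = len(tokenizer)
    current_rows = model.get_input_embeddings().weight.shape[0]
    if target_rows != current_rows:
        model.resize_token_embeddings(target_rows)
    return target_rows
```

This only helps if the tokenizer passed in is the one saved with the checkpoint, so that its length matches the 32001 rows the checkpoint expects.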
I want to use LoftQ initialization for GPTQ or AWQ baseline model.
Is that possible?
Thank you for this great work!
May I ask whether the reported results are from fake_quantization=True or fake_quantization=False? When I use fake_quantization=False, the training doesn't succeed for the gsm8k task for llama-2-7b, always leading to nan or inf loss.
Dear Authors,
Your work is truly exceptional and I am currently attempting to reproduce it. However, I've observed noticeable performance variations when employing different random seeds. For example, during the fine-tuning of Deberta-v3-base on the 'mrpc' task, setting the random seed to '0' results in an evaluation accuracy of 85.05. In contrast, when I choose '71' or '37' as the random seed, the evaluation accuracy significantly drops to 68.38, essentially failing to converge.
Could you possibly offer any guidance regarding this matter? Moreover, I would greatly appreciate it if you could disclose the random seeds you utilized in this work.
Thank you!
When I set:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
the following error is raised:
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [42,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
return (element == self).any().item() # type: ignore[union-attr]
RuntimeError: CUDA error: device-side assert triggered
how can I do this?
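The assertion srcIndex < srcSelectDimSize fires when an index used in an index-select is not smaller than the size of the dimension being indexed. A common case is a token id that is not smaller than the number of embedding rows, though whether that is the cause here is an assumption. A small sketch for checking a batch of ids on the CPU before it reaches the GPU; `out_of_range_ids` is a hypothetical helper.

```python
from typing import Iterable, List


def out_of_range_ids(input_ids: Iterable[int], num_embeddings: int) -> List[int]:
    """Return the ids that would index past an embedding table of this size."""
    return [i for i in input_ids if i < 0 or i >= num_embeddings]


# Made-up example: id 32000 is invalid for a table with 32000 rows (valid: 0..31999).
bad = out_of_range_ids([1, 15043, 32000, 2], num_embeddings=32000)
print(bad)
```

Running the same step on the CPU, or with the environment variable CUDA_LAUNCH_BLOCKING=1, usually gives a traceback that points at the failing operation more precisely.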