
SVD-LLM: Singular Value Decomposition for Large Language Model Compression


Introduction

SVD-LLM: Singular Value Decomposition for Large Language Model Compression [arXiv]
Xin Wang¹, Yu Zheng², Zhongwei Wan¹, Mi Zhang¹
¹The Ohio State University, ²Michigan State University

Key Designs

  • Truncation-Aware Data Whitening: ensures a direct mapping between singular values and compression loss, so that truncating the smallest singular values incurs the lowest compression loss (see the sketch below).
  • Layer-Wise Closed-Form Update: compensates for the accuracy degradation under high compression ratios.
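
The whitening idea fits in a few lines of PyTorch. Below is a minimal sketch of the concept with hypothetical names (see SVDLLM.py for the actual implementation); it assumes the layer computes y = Wx and that X stacks calibration activations column-wise:

    import torch

    def whiten_and_truncate(W, X, ratio):
        # W: (out_features, in_features) weight; X: (in_features, n_tokens)
        # calibration activations. The Cholesky factor S satisfies
        # S @ S.T == X @ X.T, so S^-1 @ X is whitened (identity covariance).
        # Assumes X @ X.T is positive definite.
        S = torch.linalg.cholesky(X @ X.T)
        U, sigma, Vt = torch.linalg.svd(W @ S, full_matrices=False)
        k = int(sigma.numel() * (1 - ratio))  # rank kept after truncation
        # With whitened inputs, ||W @ X - W_k @ X||_F equals the l2 norm of
        # the truncated singular values, so dropping the smallest sigma_i
        # provably minimizes the compression loss.
        W_k = U[:, :k] @ torch.diag(sigma[:k]) @ Vt[:k, :] @ torch.linalg.inv(S)
        return W_k

In practice the two low-rank factors (e.g., U_k and diag(sigma_k) V_k^T S^{-1}) are stored separately rather than multiplied back into a dense matrix; that factorization is where the parameter savings come from.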

Abstract

The advancements in Large Language Models (LLMs) have been hindered by their substantial sizes, which necessitate LLM compression methods for practical deployment. Singular Value Decomposition (SVD) offers a promising solution for LLM compression. However, state-of-the-art SVD-based LLM compression methods have two key limitations: truncating smaller singular values may lead to higher compression loss, and they do not update the compressed weights after SVD truncation. In this work, we propose SVD-LLM, a new SVD-based LLM compression method that addresses the limitations of existing methods. SVD-LLM incorporates a truncation-aware data whitening strategy to ensure a direct mapping between singular values and compression loss. Moreover, SVD-LLM adopts a layer-wise closed-form model parameter update strategy to compensate for accuracy degradation under high compression ratios. We evaluate SVD-LLM on a total of 10 datasets and eight models from three different LLM families at four different scales. Our results demonstrate the superiority of SVD-LLM over state-of-the-art methods, especially at high model compression ratios.

Quick Start

Installation

Please keep the transformers package at exactly version 4.35.2, since the SVD-compressed LLM has a slightly modified model structure (defined in the component/ folder).

pip install -r requirement.txt

Quick Example

bash compress_llama.sh

This script compresses the LLaMA-7B model at a 20% compression ratio and automatically runs the evaluation code, reporting both the perplexity and the efficiency of the compressed model.

Step-by-Step Instructions

We implement SVD-LLM with two different pipelines:

  • Truncation-Aware Data Whitening + SVD Compression (used under low compression ratio)
  • Truncation-Aware Data Whitening + SVD Compression + Layer-Wise Closed-Form Update (used under high compression ratio)

1. Truncation-Aware Data Whitening + SVD Compression (Used under low compression ratio)

Under a low compression ratio (recommended ratio <= 0.3), we first run data whitening on the LLM and save the weights along with the whitening information.

python SVDLLM.py \
--step 1  \
--ratio COMPRESSION_RATIO \
--model HUGGINGFACE_MODEL_REPO \
--whitening_nsamples WHITENING_SAMPLE_NUMBER \
--dataset WHITENING_DATASET \
--seed SAMPLING_SEED \
--model_seq_len MODEL_SEQ_LEN \
--save_path WHITENING_INFO_SAVING_PATH
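
For example, a plausible step-1 invocation at a 20% compression ratio might look like the following (the model repo, dataset name, and sample count are illustrative assumptions, not prescribed settings):

python SVDLLM.py \
--step 1 \
--ratio 0.2 \
--model meta-llama/Llama-2-7b-hf \
--whitening_nsamples 256 \
--dataset wikitext2 \
--seed 42 \
--model_seq_len 2048 \
--save_path ./whitening_info/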

2. Truncation-Aware Data Whitening + SVD Compression + Layer-Wise Closed-Form Update (Used under high compression ratio)

Under a high compression ratio (recommended ratio > 0.3), we can further apply the layer-wise closed-form update after the first pipeline to update the weight matrices and improve accuracy.

python SVDLLM.py \
--step 2  \
--ratio COMPRESSION_RATIO \
--model HUGGINGFACE_MODEL_REPO \
--whitening_nsamples WHITENING_SAMPLE_NUMBER \
--updating_nsamples UPDATING_SAMPLE_NUMBER \
--dataset WHITENING_DATASET \
--seed SAMPLING_SEED \
--model_seq_len MODEL_SEQ_LEN \
--save_path WHITENING_INFO_SAVING_PATH
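
Conceptually, the closed-form update solves a layer-wise least-squares problem: given calibration inputs X and the original layer outputs Y = WX, it picks the replacement weight that minimizes ||Y - W'X||_F. A minimal sketch of that idea (our paraphrase with a hypothetical damping term, not the repository's exact code):

    import torch

    def closed_form_update(Y, X, damp=1e-4):
        # Normal-equation solution of min ||Y - W_new @ X||_F over W_new.
        # The small ridge term `damp` (an assumption) keeps X @ X.T
        # invertible on rank-deficient calibration data.
        G = X @ X.T
        G = G + damp * torch.eye(G.shape[0], dtype=G.dtype, device=G.device)
        return Y @ X.T @ torch.linalg.inv(G)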

3. SVD Compression + Layer-Wise Closed-Form Update (not the best option, but still better than existing baselines)

We also provide an implementation that runs only the layer-wise closed-form update in SVD-LLM. Although this version is not as good as the two pipelines above, it still outperforms the existing baselines.

python SVDLLM.py \
--step 3  \
--ratio COMPRESSION_RATIO \
--model HUGGINGFACE_MODEL_REPO \
--updating_nsamples UPDATING_SAMPLE_NUMBER \
--dataset WHITENING_DATASET \
--seed SAMPLING_SEED \
--model_seq_len MODEL_SEQ_LEN \
--save_path WHITENING_INFO_SAVING_PATH

4. LoRA Fine-Tuning

The compressed model from either of the two pipelines above can also be combined with LoRA fine-tuning to obtain better accuracy. The LoRA fine-tuning code is borrowed from LLM-Pruner and uses the same configuration.

python LoRA.py \
--prune_model COMPRESSED_MODEL_PATH \
--data_path yahma/alpaca-cleaned \
--output_dir LORA_OUTPUT_PATH  \
--lora_r 8 \
--num_epochs 2 \
--learning_rate 1e-4 \
--batch_size 64
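
After fine-tuning, the adapter can be loaded back onto the compressed model with the peft library. A hedged sketch (paths are placeholders, and it assumes the compressed checkpoint was saved as a whole pickled model, as the scripts above do):

    import torch
    from peft import PeftModel

    base = torch.load("COMPRESSED_MODEL_PATH")           # whole-model checkpoint
    model = PeftModel.from_pretrained(base, "LORA_OUTPUT_PATH")
    model = model.merge_and_unload()                     # fold the LoRA deltas into the weights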

5. SVD-LLM + GPTQ

SVD-LLM can also be integrated with quantization methods to achieve better compression. Here is an example of integrating SVD-LLM (20% compression ratio) with 4-bit GPTQ to compress LLaMA-7B:

bash svdllm_gptq.sh
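
Quantization composes with low-rank compression because the two SVD factors are ordinary weight matrices that a quantizer can process. The toy round-to-nearest quantizer below only illustrates that point; GPTQ itself uses a smarter, error-compensating scheme and lives in its own codebase:

    import torch

    def quant4_rtn(w):
        # Naive symmetric round-to-nearest 4-bit quantization (range [-7, 7]),
        # shown for intuition only; it would be applied to each low-rank factor.
        scale = w.abs().max() / 7
        q = torch.clamp(torch.round(w / scale), -7, 7)
        return q * scale  # dequantized weights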

6. Evaluation

  • Perplexity Evaluation:
python SVDLLM.py \
--step 4 \
--model_path COMPRESSED_MODEL_SAVING_PATH

We use the same c4 dataset as in SparseGPT. Since the original download link is invalid, please download it directly from this link and add the two JSON files under the utils/ folder.
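
For reference, perplexity is the exponentiated average token-level cross-entropy. A generic sketch of the computation over fixed-length segments (not the repository's exact evaluation code):

    import torch

    @torch.no_grad()
    def perplexity(model, input_ids, seq_len=2048):
        # input_ids: (1, n_tokens) tokenized evaluation corpus.
        nlls = []
        for i in range(0, input_ids.size(1) - seq_len + 1, seq_len):
            chunk = input_ids[:, i : i + seq_len]
            out = model(chunk, labels=chunk)   # HF models return the mean CE loss
            nlls.append(out.loss * seq_len)
        return torch.exp(torch.stack(nlls).sum() / (len(nlls) * seq_len))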

  • Efficiency Evaluation:
python SVDLLM.py \
--step 5 \
--model_path COMPRESSED_MODEL_SAVING_PATH
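
Efficiency here typically covers inference speed and memory; a rough tokens-per-second sketch (an assumed metric, since step 5 may measure things differently):

    import time
    import torch

    @torch.no_grad()
    def tokens_per_second(model, input_ids, new_tokens=128):
        torch.cuda.synchronize()
        start = time.time()
        model.generate(input_ids, max_new_tokens=new_tokens, do_sample=False)
        torch.cuda.synchronize()
        return new_tokens / (time.time() - start)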

Citation

If you find this work useful, please cite

@article{wang2024svd,
  title={{SVD-LLM}: Truncation-Aware Singular Value Decomposition for Large Language Model Compression},
  author={Wang, Xin and Zheng, Yu and Wan, Zhongwei and Zhang, Mi},
  journal={arXiv preprint arXiv:2403.07378},
  year={2024}
}


svd-llm's Issues

Compressed Model Produces Random and Repetitive Output - Request for Compressed Weights

Description
When running the code, we successfully obtain a compressed model. However, when prompted with an input, the model generates random and repetitive outputs, often repeating the same letters or phrases. This significantly impacts the usability and reliability of the model.

Request
What could be the reason for this issue? Could you please share your compressed weights to help diagnose and resolve the problem?

Thank you!

fail to apply on llama-13b

Hello, I have some trouble reproducing the results on LLaMA-13B. The following error occurs on line 203, in the whitening function:

    scaling_matrix_inv = torch.linalg.inv(scaling_diag_matrix)
    torch._C._LinAlgError: linalg.inv: The diagonal element 6940 is zero, the inversion could not be completed because the input matrix is singular

How can I solve this problem? Thanks.

Converting Compressed LLaMA2 Model to Hugging Face-Compatible Format

Description

We have successfully compressed a LLaMA2 model to 4.4 billion parameters. However, I am encountering issues when trying to convert the compressed model to a Hugging Face-compatible format. Specifically, when I use the model.save_pretrained(output_dir) and tokenizer.save_pretrained(output_dir) methods, the model parameters revert to the original 6.7 billion, and the output becomes worse and incoherent.

Steps to Reproduce

  1. Compress a LLaMA2 model to 4.4 billion parameters.

  2. Use the following code to save the model:

    import torch
    from transformers import AutoTokenizer  # AutoModelForCausalLM is not needed here
    
    def save_compressed_model(model, tokenizer, output_dir):
        # Save the model and tokenizer using Hugging Face's save_pretrained method
        model.save_pretrained(output_dir, safe_serialization=True)
        tokenizer.save_pretrained(output_dir)
    
    # Load your compressed model
    model_path = "path_to_your_compressed_model"
    tokenizer_path = "path_to_your_tokenizer"
    output_dir = "path_to_output_directory"
    
    model = torch.load(model_path)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    
    # Save the model and tokenizer
    save_compressed_model(model, tokenizer, output_dir)
  3. Attempt to use the model from the output directory.

Observed Behavior

  • The model parameters revert to the original 6.7 billion.
  • The model output becomes worse and generates random gibberish.

Expected Behavior

  • The model should retain its compressed state with 4.4 billion parameters.
  • The model output should remain coherent and consistent with the compressed model's performance.

Additional Context

I have also attempted to convert the model to GGUF format, but encountered similar issues. Any guidance on correctly converting and saving the compressed model for Hugging Face would be greatly appreciated.

Thank you for your assistance!

still fail to apply on llama-13b

Hi, thank you for your reply, but I still get the same problem as mentioned before:

    Traceback (most recent call last):
      File "/home/xxx/SVD-LLM/SVDLLM_new.py", line 193, in whitening
        scaling_matrix_inv = torch.linalg.inv(scaling_diag_matrix)
    torch._C._LinAlgError: linalg.inv: The diagonal element 6940 is zero, the inversion could not be completed because the input matrix is singular

My Python environment is built from requirements.txt, and I run the code on two RTX 3090 GPUs.

Question about ASVD and FWSVD

Dear author, I'm not sure if this is a proper request. Do you have ASVD or FWSVD code that has been integrated into this repository? If there is, it will be very convenient for other people to compare different methods. Thanks!

Request for Code Integration of SVD-LLM with GPTQ

Hello,

Firstly, I want to express my gratitude for the fascinating work you've been doing. It's been inspiring.

I've recently come across your paper where you describe the integration of SVD-LLM with GPTQ, and I'm eager to explore the implementation further.
Could you please share the code where you've integrated SVD-LLM with GPTQ as described in the paper?

Your assistance in providing access to this code would be appreciated. Thank you for your time and consideration.

Incorrect Model Responses after compression

I tried to use the provided scripts to compress LLaMA 2 with a 0.2 compression ratio. The evaluation script shows a perplexity of 7.2 on WikiText, but the model responses are mostly incoherent. I am getting responses like:

Instruction: tell me about you==\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\ selecting\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\

whereas the original model gives decent responses.

Is there any modification to be made to the inference script or the tokenizer after model compression? Is there an inference script within the repository?

Thanks for your help

Question about C4 dataset

Hello, thanks for this great project. I found:
test_data = load_dataset("json", data_files="utils/c4-validation.json")['train']
in the code but there is no "utils/c4-validation.json" file. Where should I find this?
