
SVD-LLM: Singular Value Decomposition for Large Language Model Compression


Introduction

SVD-LLM: Singular Value Decomposition for Large Language Model Compression [arXiv]
Xin Wang¹, Yu Zheng², Zhongwei Wan¹, Mi Zhang¹
¹The Ohio State University, ²Michigan State University

Key Designs

  • Truncation-Aware Data Whitening: ensures a direct mapping between singular values and compression loss, so that truncating the smallest singular values incurs the lowest compression loss (see the sketch below).
  • Layer-Wise Closed-Form Update: compensates for the accuracy degradation under high compression ratios.
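
The whitening idea fits in a few lines of PyTorch. Below is a minimal sketch of the concept with hypothetical names (see SVDLLM.py for the actual implementation); it assumes the layer computes y = Wx and that X stacks calibration activations column-wise:

    import torch

    def whiten_and_truncate(W, X, ratio):
        # W: (out_features, in_features) weight; X: (in_features, n_tokens)
        # calibration activations. The Cholesky factor S satisfies
        # S @ S.T == X @ X.T, so S^-1 @ X is whitened (identity covariance).
        # Assumes X @ X.T is positive definite.
        S = torch.linalg.cholesky(X @ X.T)
        U, sigma, Vt = torch.linalg.svd(W @ S, full_matrices=False)
        k = int(sigma.numel() * (1 - ratio))  # rank kept after truncation
        # With whitened inputs, ||W @ X - W_k @ X||_F equals the l2 norm of
        # the truncated singular values, so dropping the smallest sigma_i
        # provably minimizes the compression loss.
        W_k = U[:, :k] @ torch.diag(sigma[:k]) @ Vt[:k, :] @ torch.linalg.inv(S)
        return W_k

In practice the two low-rank factors (e.g., U_k and diag(sigma_k) V_k^T S^{-1}) are stored separately rather than multiplied back into a dense matrix; that factorization is where the parameter savings come from.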

Abstract

The advancements in Large Language Models (LLMs) have been hindered by their substantial sizes, which necessitate LLM compression methods for practical deployment. Singular Value Decomposition (SVD) offers a promising solution for LLM compression. However, state-of-the-art SVD-based LLM compression methods have two key limitations: truncating smaller singular values may lead to higher compression loss, and they do not update the compressed weights after SVD truncation. In this work, we propose SVD-LLM, a new SVD-based LLM compression method that addresses the limitations of existing methods. SVD-LLM incorporates a truncation-aware data whitening strategy to ensure a direct mapping between singular values and compression loss. Moreover, SVD-LLM adopts a layer-wise closed-form model parameter update strategy to compensate for accuracy degradation under high compression ratios. We evaluate SVD-LLM on a total of 10 datasets and eight models from three different LLM families at four different scales. Our results demonstrate the superiority of SVD-LLM over state-of-the-art methods, especially at high model compression ratios.

Quick Start

Installation

Please keep the transformers package at exactly version 4.35.2, since the SVD-compressed LLM has a slightly modified model structure (defined in the component/ folder).

pip install -r requirement.txt

Quick Example

bash compress_llama.sh

This script compresses the LLaMA-7B model at a 20% compression ratio and automatically runs the evaluation code, reporting both the perplexity and the efficiency of the compressed model.

Step-by-Step Instructions

We implement SVD-LLM with two different pipelines:

  • Truncation-Aware Data Whitening + SVD Compression (used under low compression ratio)
  • Truncation-Aware Data Whitening + SVD Compression + Layer-Wise Closed-Form Update (used under high compression ratio)

1. Truncation-Aware Data Whitening + SVD Compression (Used under low compression ratio)

Under a low compression ratio (recommended ratio <= 0.3), we first run data whitening on the LLM and save the weights along with the whitening information.

python SVDLLM.py \
--step 1  \
--ratio COMPRESSION_RATIO \
--model HUGGINGFACE_MODEL_REPO \
--whitening_nsamples WHITENING_SAMPLE_NUMBER \
--dataset WHITENING_DATASET \
--seed SAMPLING_SEED \
--model_seq_len MODEL_SEQ_LEN \
--save_path WHITENING_INFO_SAVING_PATH
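
For example, a plausible step-1 invocation at a 20% compression ratio might look like the following (the model repo, dataset name, and sample count are illustrative assumptions, not prescribed settings):

python SVDLLM.py \
--step 1 \
--ratio 0.2 \
--model meta-llama/Llama-2-7b-hf \
--whitening_nsamples 256 \
--dataset wikitext2 \
--seed 42 \
--model_seq_len 2048 \
--save_path ./whitening_info/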

2. Truncation-Aware Data Whitening + SVD Compression + Layer-Wise Closed-Form Update (Used under high compression ratio)

Under a high compression ratio (recommended ratio > 0.3), we can further apply the layer-wise closed-form update after the first pipeline to update the weight matrices and improve accuracy.

python SVDLLM.py \
--step 2  \
--ratio COMPRESSION_RATIO \
--model HUGGINGFACE_MODEL_REPO \
--whitening_nsamples WHITENING_SAMPLE_NUMBER \
--updating_nsamples UPDATING_SAMPLE_NUMBER \
--dataset WHITENING_DATASET \
--seed SAMPLING_SEED \
--model_seq_len MODEL_SEQ_LEN \
--save_path WHITENING_INFO_SAVING_PATH
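
Conceptually, the closed-form update solves a layer-wise least-squares problem: given calibration inputs X and the original layer outputs Y = WX, it picks the replacement weight that minimizes ||Y - W'X||_F. A minimal sketch of that idea (our paraphrase with a hypothetical damping term, not the repository's exact code):

    import torch

    def closed_form_update(Y, X, damp=1e-4):
        # Normal-equation solution of min ||Y - W_new @ X||_F over W_new.
        # The small ridge term `damp` (an assumption) keeps X @ X.T
        # invertible on rank-deficient calibration data.
        G = X @ X.T
        G = G + damp * torch.eye(G.shape[0], dtype=G.dtype, device=G.device)
        return Y @ X.T @ torch.linalg.inv(G)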

3. SVD Compression + Layer-Wise Closed-Form Update (not the best option, but still better than existing baselines)

We also provide an implementation that runs only the layer-wise closed-form update in SVD-LLM. Although this version is not as good as the two pipelines above, it still outperforms the existing baselines.

python SVDLLM.py \
--step 3  \
--ratio COMPRESSION_RATIO \
--model HUGGINGFACE_MODEL_REPO \
--updating_nsamples UPDATING_SAMPLE_NUMBER \
--dataset WHITENING_DATASET \
--seed SAMPLING_SEED \
--model_seq_len MODEL_SEQ_LEN \
--save_path WHITENING_INFO_SAVING_PATH

4. LoRA Fine-Tuning

The compressed model from either of the two pipelines above can also be combined with LoRA fine-tuning to obtain better accuracy. The LoRA fine-tuning code is borrowed from LLM-Pruner and uses the same configuration.

python LoRA.py \
--prune_model COMPRESSED_MODEL_PATH \
--data_path yahma/alpaca-cleaned \
--output_dir LORA_OUTPUT_PATH  \
--lora_r 8 \
--num_epochs 2 \
--learning_rate 1e-4 \
--batch_size 64
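
After fine-tuning, the adapter can be loaded back onto the compressed model with the peft library. A hedged sketch (paths are placeholders, and it assumes the compressed checkpoint was saved as a whole pickled model, as the scripts above do):

    import torch
    from peft import PeftModel

    base = torch.load("COMPRESSED_MODEL_PATH")           # whole-model checkpoint
    model = PeftModel.from_pretrained(base, "LORA_OUTPUT_PATH")
    model = model.merge_and_unload()                     # fold the LoRA deltas into the weights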

5. SVD-LLM + GPTQ

SVD-LLM can also be integrated with quantization methods to achieve better compression. Here is an example of integrating SVD-LLM (20% compression ratio) with 4-bit GPTQ to compress LLaMA-7B:

bash svdllm_gptq.sh
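
Quantization composes with low-rank compression because the two SVD factors are ordinary weight matrices that a quantizer can process. The toy round-to-nearest quantizer below only illustrates that point; GPTQ itself uses a smarter, error-compensating scheme and lives in its own codebase:

    import torch

    def quant4_rtn(w):
        # Naive symmetric round-to-nearest 4-bit quantization (range [-7, 7]),
        # shown for intuition only; it would be applied to each low-rank factor.
        scale = w.abs().max() / 7
        q = torch.clamp(torch.round(w / scale), -7, 7)
        return q * scale  # dequantized weights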

6. Evaluation

  • Perplexity Evaluation:
python SVDLLM.py \
--step 4 \
--model_path COMPRESSED_MODEL_SAVING_PATH

We use the same c4 dataset as in SparseGPT. Since the original download link is invalid, please download it directly from this link and add the two JSON files under the utils/ folder.
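
For reference, perplexity is the exponentiated average token-level cross-entropy. A generic sketch of the computation over fixed-length segments (not the repository's exact evaluation code):

    import torch

    @torch.no_grad()
    def perplexity(model, input_ids, seq_len=2048):
        # input_ids: (1, n_tokens) tokenized evaluation corpus.
        nlls = []
        for i in range(0, input_ids.size(1) - seq_len + 1, seq_len):
            chunk = input_ids[:, i : i + seq_len]
            out = model(chunk, labels=chunk)   # HF models return the mean CE loss
            nlls.append(out.loss * seq_len)
        return torch.exp(torch.stack(nlls).sum() / (len(nlls) * seq_len))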

  • Efficiency Evaluation:
python SVDLLM.py \
--step 5 \
--model_path COMPRESSED_MODEL_SAVING_PATH
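
Efficiency here typically covers inference speed and memory; a rough tokens-per-second sketch (an assumed metric, since step 5 may measure things differently):

    import time
    import torch

    @torch.no_grad()
    def tokens_per_second(model, input_ids, new_tokens=128):
        torch.cuda.synchronize()
        start = time.time()
        model.generate(input_ids, max_new_tokens=new_tokens, do_sample=False)
        torch.cuda.synchronize()
        return new_tokens / (time.time() - start)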

Citation

If you find this work useful, please cite

@article{wang2024svd,
  title={{SVD-LLM}: Truncation-Aware Singular Value Decomposition for Large Language Model Compression},
  author={Wang, Xin and Zheng, Yu and Wan, Zhongwei and Zhang, Mi},
  journal={arXiv preprint arXiv:2403.07378},
  year={2024}
}


svd-llm's Issues

Compressed Model Produces Random and Repetitive Output - Request for Compressed Weights

Description
When running the code, we successfully obtain a compressed model. However, when prompted with an input, the model generates random and repetitive outputs, often repeating the same letters or phrases. This significantly impacts the usability and reliability of the model.

Request
What could be the reason for this issue? Could you please share your compressed weights to help diagnose and resolve the problem?

Thank you!

fail to apply on llama-13b

Hello, I have some trouble reproducing the results on LLaMA-13B. The following error occurs on line 203, in the whitening function:

    scaling_matrix_inv = torch.linalg.inv(scaling_diag_matrix)
    torch._C._LinAlgError: linalg.inv: The diagonal element 6940 is zero, the inversion could not be completed because the input matrix is singular

How can I solve this problem? Thanks.

Converting Compressed LLaMA2 Model to Hugging Face-Compatible Format

Description

We have successfully compressed a LLaMA2 model to 4.4 billion parameters. However, I am encountering issues when trying to convert the compressed model to a Hugging Face-compatible format. Specifically, when I use the model.save_pretrained(output_dir) and tokenizer.save_pretrained(output_dir) methods, the model parameters revert to the original 6.7 billion, and the output becomes worse and incoherent.

Steps to Reproduce

  1. Compress a LLaMA2 model to 4.4 billion parameters.

  2. Use the following code to save the model:

    import torch
    from transformers import AutoTokenizer  # AutoModelForCausalLM is not needed here
    
    def save_compressed_model(model, tokenizer, output_dir):
        # Save the model and tokenizer using Hugging Face's save_pretrained method
        model.save_pretrained(output_dir, safe_serialization=True)
        tokenizer.save_pretrained(output_dir)
    
    # Load your compressed model
    model_path = "path_to_your_compressed_model"
    tokenizer_path = "path_to_your_tokenizer"
    output_dir = "path_to_output_directory"
    
    model = torch.load(model_path)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    
    # Save the model and tokenizer
    save_compressed_model(model, tokenizer, output_dir)
  3. Attempt to use the model from the output directory.

Observed Behavior

  • The model parameters revert to the original 6.7 billion.
  • The model output becomes worse and generates random gibberish.

Expected Behavior

  • The model should retain its compressed state with 4.4 billion parameters.
  • The model output should remain coherent and consistent with the compressed model's performance.

Additional Context

I have also attempted to convert the model to GGUF format, but encountered similar issues. Any guidance on correctly converting and saving the compressed model for Hugging Face would be greatly appreciated.

Thank you for your assistance!

still fail to apply on llama-13b

Hi, thank you for your reply, but I still get the same problem as mentioned before:

    Traceback (most recent call last):
      File "/home/xxx/SVD-LLM/SVDLLM_new.py", line 193, in whitening
        scaling_matrix_inv = torch.linalg.inv(scaling_diag_matrix)
    torch._C._LinAlgError: linalg.inv: The diagonal element 6940 is zero, the inversion could not be completed because the input matrix is singular

My Python environment is built from requirements.txt, and I run the code on two RTX 3090 GPUs.

Question about ASVD and FWSVD

Dear author, I'm not sure if this is a proper request. Do you have ASVD or FWSVD code that has been integrated into this repository? If there is, it will be very convenient for other people to compare different methods. Thanks!

Request for Code Integration of SVD-LLM with GPTQ

Hello,

Firstly, I want to express my gratitude for the fascinating work you've been doing. It's been inspiring.

I've recently come across your paper where you describe the integration of SVD-LLM with GPTQ, and I'm eager to explore the implementation further.
Could you please share the code where you've integrated SVD-LLM with GPTQ as described in the paper?

Your assistance in providing access to this code would be appreciated. Thank you for your time and consideration.

Incorrect Model Responses after compression

I tried to use the provided scripts to compress LLaMA 2 with a 0.2 compression ratio. The evaluation script shows a perplexity of 7.2 on WikiText, but the model responses are mostly incoherent. I am getting responses like:

Instruction: tell me about you==\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\ selecting\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\

whereas the original model gives decent responses.

Is there any modification to be made to the inference script or the tokenizer after model compression? Is there an inference script within the repository?

Thanks for your help

Question about C4 dataset

Hello, thanks for this great project. I found:
test_data = load_dataset("json", data_files="utils/c4-validation.json")['train']
in the code but there is no "utils/c4-validation.json" file. Where should I find this?
