
compressed-tensors's Introduction

compressed-tensors

The compressed-tensors library extends the safetensors format, providing a versatile and efficient way to store and manage compressed tensor data. This library supports various quantization and sparsity schemes, making it a unified format for handling different model optimizations like GPTQ, AWQ, SmoothQuant, INT8, FP8, SparseGPT, and more.

Why compressed-tensors?

As model compression becomes increasingly important for efficient deployment of LLMs, the landscape of quantization and compression techniques has become increasingly fragmented. Each method often comes with its own storage format and loading procedures, making it challenging to work with multiple techniques or switch between them. compressed-tensors addresses this by providing a single, extensible format that can represent a wide variety of compression schemes.

  • Unified Checkpoint Format: Supports various compression schemes in a single, consistent format.
  • Wide Compatibility: Works with popular quantization methods like GPTQ, SmoothQuant, and FP8. See llm-compressor.
  • Flexible Quantization Support:
    • Weight-only quantization (e.g., W4A16, W8A16, WnA16)
    • Activation quantization (e.g., W8A8)
    • KV cache quantization
    • Non-uniform schemes (different layers can be quantized in different ways!)
  • Sparsity Support: Handles both unstructured and semi-structured (e.g., 2:4) sparsity patterns.
  • Open-Source Integration: Designed to work seamlessly with Hugging Face models and PyTorch.

This allows developers and researchers to easily experiment with composing different quantization methods, simplify model deployment pipelines, and reduce the overhead of supporting multiple compression formats in inference engines.
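
As an illustration of the non-uniform support above, here is a minimal sketch (not taken from the library documentation) of two schemes that could coexist in a single checkpoint. It assumes the QuantizationArgs and QuantizationScheme classes exported from compressed_tensors.quantization and the "re:" regex-target prefix; the down_proj layer pattern is purely hypothetical.

from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme

# W4A16: 4-bit weights, activations left at 16 bit
w4a16 = QuantizationScheme(
    targets=["Linear"],
    weights=QuantizationArgs(num_bits=4, symmetric=True),
)

# W8A8: 8-bit weights and 8-bit input activations,
# scoped to a hypothetical regex target
w8a8 = QuantizationScheme(
    targets=["re:.*down_proj"],
    weights=QuantizationArgs(num_bits=8, symmetric=True),
    input_activations=QuantizationArgs(num_bits=8, symmetric=True),
)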

Installation

From PyPI

Stable release:

pip install compressed-tensors

Nightly release:

pip install compressed-tensors-nightly

From Source

git clone https://github.com/neuralmagic/compressed-tensors
cd compressed-tensors
pip install -e .

Getting started

Saving/Loading Compressed Tensors (Bitmask Compression)

The function save_compressed uses the compression_format argument to apply compression to tensors. The function load_compressed reverses the process, converting the compressed weights on disk back into decompressed weights in device memory.

from compressed_tensors import save_compressed, load_compressed, BitmaskConfig
from torch import Tensor
from typing import Dict

# the example BitmaskConfig format efficiently compresses
# tensors with a large number of zero entries
compression_config = BitmaskConfig()

tensors: Dict[str, Tensor] = {"tensor_1": Tensor(
    [[0.0, 0.0, 0.0], 
     [1.0, 1.0, 1.0]]
)}
# compress tensors using BitmaskConfig compression format (save them efficiently on disk)
save_compressed(tensors, "model.safetensors", compression_format=compression_config.format)

# decompress tensors (load_compressed returns a generator for memory efficiency)
decompressed_tensors = {}
for tensor_name, tensor in load_compressed("model.safetensors", compression_config=compression_config):
    decompressed_tensors[tensor_name] = tensor
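
To see what actually lands on disk, the compressed file can be opened with the standard safetensors API. This is only an inspection sketch; the key names and layout of the bitmask format are implementation details and may change between releases.

from safetensors import safe_open

# list the tensors stored in the compressed file
with safe_open("model.safetensors", framework="pt") as f:
    for key in f.keys():
        t = f.get_tensor(key)
        print(key, t.dtype, tuple(t.shape))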

Saving/Loading Compressed Models (Bitmask Compression)

We can apply bitmask compression to a whole model. For a more detailed example, see the example directory.

from compressed_tensors import save_compressed_model, load_compressed, BitmaskConfig
from transformers import AutoModelForCausalLM

model_name = "neuralmagic/llama2.c-stories110M-pruned50"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

original_state_dict = model.state_dict()

compression_config = BitmaskConfig()

# save compressed model weights
save_compressed_model(model, "compressed_model.safetensors", compression_format=compression_config.format)

# load compressed model weights (`dict` turns generator into a dictionary)
state_dict = dict(load_compressed("compressed_model.safetensors", compression_config))
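
Since bitmask compression only encodes away stored zeros, the round trip should be lossless. A quick sanity check (a sketch, assuming load_compressed restores the original parameter names):

import torch

# every decompressed tensor should be identical to the original
for name, tensor in state_dict.items():
    assert torch.equal(original_state_dict[name], tensor), f"mismatch in {name}"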

For a more in-depth tutorial on bitmask compression, refer to the notebook.

Saving a Compressed Model with PTQ

We can use compressed-tensors to run basic post-training quantization (PTQ) and save the quantized model compressed on disk.

model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda:0", torch_dtype="auto")

config = QuantizationConfig.parse_file("./examples/bit_packing/int4_config.json")
config.quantization_status = QuantizationStatus.CALIBRATION
apply_quantization_config(model, config)

dataset = load_dataset("ptb_text_only")["train"]
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["sentence"], padding=False, truncation=True, max_length=1024)

tokenized_dataset = dataset.map(tokenize_function, batched=True)
data_loader = DataLoader(tokenized_dataset, batch_size=1, collate_fn=DefaultDataCollator())

# run calibration forward passes so the observers can collect quantization statistics
with torch.no_grad():
    for idx, sample in tqdm(enumerate(data_loader), desc="Running calibration"):
        sample = {key: value.to(device) for key,value in sample.items()}
        _ = model(**sample)

        if idx >= 512:
            break

# freeze the calibrated quantization parameters and pack the quantized weights
model.apply(freeze_module_quantization)
model.apply(compress_quantized_weights)

output_dir = "./ex_llama1.1b_w4a16_packed_quantize"
compressor = ModelCompressor(quantization_config=config)
compressed_state_dict = compressor.compress(model)
model.save_pretrained(output_dir, state_dict=compressed_state_dict)
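
As a rough check (a sketch; the size comparison is illustrative), the packed INT4 checkpoint written to output_dir should be considerably smaller than the original FP16 weights:

import os

# total size of the saved safetensors shards in the output directory
size_bytes = sum(
    os.path.getsize(os.path.join(output_dir, f))
    for f in os.listdir(output_dir)
    if f.endswith(".safetensors")
)
print(f"compressed checkpoint size: {size_bytes / 1e9:.2f} GB")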

For a more in-depth tutorial on quantization compression, refer to the notebook.


compressed-tensors's Issues

Regular expression failure

When I try to use a regular expression in the "ignore" field of GPTQModifier, it fails to match the intended layers.

Example:
ignore: ["lm_head", "re:.*gate"]
Model: Mixtral-8x7B-Instruct-v0.1

Some layers that were to be ignored were not found in the model: {'re:.*gate'}
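
For context, a minimal sketch of what the reporter expects the "re:" prefix to do; the module name below is illustrative of Mixtral's MoE router gates:

import re

pattern = ".*gate"  # the part after the "re:" prefix
name = "model.layers.0.block_sparse_moe.gate"  # hypothetical module name
print(bool(re.match(pattern, name)))  # True, so the module should be ignored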

Why does `QuantizationScheme.default_scheme` provide an INT8 W8A8 scheme?

Reference:

@classmethod
def default_scheme(
    cls,
    targets: Optional[List[str]] = None,
):
    if targets is None:
        # default to quantizing all Linear layers
        targets = ["Linear"]
    # default to 8 bit integer symmetric quantization
    # for weights
    weights = QuantizationArgs(num_bits=8, symmetric=True)
    # default to 8 bit integer asymmetric quantization
    input_activations = QuantizationArgs(num_bits=8, symmetric=True)
    # Do not quantize the output activations
    # by default
    output_activations = None
    return cls(
        targets=targets,
        weights=weights,
        input_activations=input_activations,
        output_activations=output_activations,
    )

I'm not sure this is the best default, especially in the case where I might want to produce quantized kv cache scales for an FP16 model.

I would expect this recipe to keep the model otherwise unquantized while adding the quantized kv cache scales. Currently it produces an INT8 W8A8 model with kv cache scales:

quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true

[Feature] Change `kv_cache_scheme` to HF QuantizedCache rather than Linear.output_scale

Some models in Transformers have merged qkv_proj Linear modules, like Phi-3, so our current scheme of adding output activation observers to separate k_proj and v_proj Linear modules will not work.

We should be able to use the Cache and QuantizedCache classes in the HF Transformers library; Cache has been added to most modeling definitions as the class that manages past_key_values: https://github.com/huggingface/transformers/blob/8820fe8b8c4b9da94cf1e4761876f85c562e0efe/src/transformers/cache_utils.py#L770

This allows us to implement our own QuantizedCache with a quantize and dequantize function, where we can calculate the statistics needed for our kv_cache_quant scheme.

Prototype QuantizedCacheConfig impl #87
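
For illustration only, here is a rough standalone sketch of a symmetric per-tensor quantize/dequantize pair such a cache would need; it is not the actual HF QuantizedCache integration, and the INT8 choice is just an example (the kv_cache_scheme above uses 8-bit float).

import torch

def quantize_kv(x: torch.Tensor, num_bits: int = 8):
    # per-tensor symmetric scale -- the statistic the kv cache scheme needs
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().amax().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), min=-qmax - 1, max=qmax).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # reconstruct an approximation of the original keys/values
    return q.to(torch.float32) * scale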
