
sparsebit's Introduction

News

  • 2023.04.27: 🔥 Pipeline parallelism is supported for alpaca-qlora, enabling fine-tuning of llama-65b on 8*2080ti within 13 hours.
  • 2023.04.15: 🔥 We release alpaca-qlora, which reduces the model's GPU-memory footprint by about half compared with alpaca-lora. With alpaca-qlora, you can instruction fine-tune llama-7b/13b on a single 2080ti.
  • 2023.03.20: 🔥 We implemented a GPTQ CUDA kernel with a groupsize feature and added --single_device_mode so that all quantized LLaMAs can run on a single GPU (e.g. 2080ti). GPTQ for LLaMA.
  • 2023.03.08: Release a mixed-precision quantization method based on GPTQ for LLaMA.
  • 2023.02.23: Release a PTQ example of GPT2 on WikiText2.
  • 2022.12.26: Release a QAT example of BEVDet4D.
  • 2022.12.14: Release a QAT example of BEVDepth.
  • 2022.12.13: Release some examples of BERT.
  • 2022.11.24: Release a QAT example of BEVDet.

Introduction

Sparsebit is a toolkit with pruning and quantization capabilities. It is designed to help researchers compress and accelerate neural network models by modifying only a few lines of code in an existing PyTorch project.

Quantization

Quantization converts full-precision parameters into low-bit parameters, which can compress and accelerate a model without changing its structure. This toolkit supports the two common quantization paradigms, Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT), with the following features (a minimal usage sketch follows the list):

  • Benefiting from the support of torch.fx, Sparsebit operates on a QuantModel, and each operation becomes a QuantModule.
  • Sparsebit can easily be extended by users to accommodate their own research. Users can register their own implementations of key objects such as QuantModule, Quantizer, and Observer.
  • Exporting QDQ-ONNX is supported, which can be loaded and deployed by backends such as TensorRT and OnnxRuntime.
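
The snippet below is a minimal usage sketch assembled from the example code quoted in the issues later on this page; the qconfig path, model choice, and export arguments are placeholders and may differ between releases.

import torch
import torchvision

from sparsebit.quantization import QuantModel, parse_qconfig

# Minimal sketch (placeholder config path and model): parse a YAML qconfig describing
# weight/activation bit-widths and quantizer types, wrap the model, run it, export QDQ-ONNX.
qconfig = parse_qconfig("qconfig.yaml")
model = torchvision.models.resnet18(pretrained=True)
qmodel = QuantModel(model, config=qconfig)  # traced with torch.fx; operations become QuantModules
qmodel.eval()

dummy_input = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    out = qmodel(dummy_input)               # in practice, run calibration data through the model
    qmodel.export_onnx(dummy_input, name="qresnet18.onnx")  # QDQ-ONNX for TensorRT / OnnxRuntime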

Results

  • PTQ results on ImageNet-1k: link
  • PTQ results of Vision Transformer on ImageNet-1k: link
  • PTQ results of YOLO related works on COCO: link
  • QAT results on ImageNet-1k: link

Sparse

In deep learning, "sparse" usually refers to reducing network parameters or network computation. At present, the sparsity support in this toolkit has the following characteristics (a toy illustration of the L1-norm criterion follows the list):

  • Supports two types of pruning: structured/unstructured;
  • Supports a variety of operation objects including: weights, activations, model-blocks, model-layers, etc.;
  • Supports multiple pruning algorithms: L1-norm/L0-norm/Fisher-pruning/Hrank/Slimming...
  • Users can easily extend a custom pruning algorithm by defining a Sparser;
  • Uses ONNX as the export format for pruned models.
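
As a toy illustration of the L1-norm criterion mentioned above (plain PyTorch, not Sparsebit's actual Sparser API):

import torch.nn as nn

# Score each output filter of a conv layer by the L1 norm of its weights and keep the
# largest half; structured pruning would then remove the unselected filters.
conv = nn.Conv2d(16, 32, kernel_size=3)
l1_per_filter = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # one score per output channel
k = conv.out_channels // 2
keep_idx = l1_per_filter.topk(k).indices  # indices of the filters to keep
print("keeping filters:", sorted(keep_idx.tolist()))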

Resources

Documentations

Detailed usage and development guidance can be found in the documentation. Refer to: docs

CV-Master

  • We maintain a public course on quantization at Bilibili, introducing the basics of quantization and our latest work. Interested users can join the course: video
  • To help users better understand and apply model-compression techniques, we designed homework assignments based on Sparsebit. Interested users can complete them on their own: quantization_homework

Plan to re-implement

Join Us

  • You are welcome to join our team (as a member or an intern) if you are interested in Quantization, Pruning, Distillation, Self-Supervised Learning, or Model Deployment.
  • Submit your resume to: [email protected]

Acknowledgement

Sparsebit was inspired by several open source projects. We are grateful for these excellent projects and list them as follows:

License

Sparsebit is released under the Apache 2.0 license.

sparsebit's People

Contributors

cnbeining, hediw, hych2020, jiang-stan, jixiege, lz02k, peiqinsun, pingguanhua, wesleysanjose, work-zhangzhe, zhber, zhiqwang, zsc

sparsebit's Issues

Dimension mismatch when QAT model export to onnx

This issue can be easily reproduced with the QAT example, with quant min/max disabled in the PyTorch ONNX operator. The error message comes from quantizers/quant_tensor.py, saying the dimensions of scale and zero_point are inconsistent with the input tensor. I checked the scale and zero_point shapes of the first convolution layer and got [3136, 1, 1, 1], where 3136 = 64*7*7, instead of [64, 1, 1, 1]...

QAT cifar10 example error

Running https://github.com/megvii-research/Sparsebit/blob/main/examples/quantization_aware_training/cifar10/basecase/main.py raises an error.

  • Command
python3 main.py qconfig_lsq.yaml --epochs=0
  • Error
Traceback (most recent call last):
  File "main.py", line 428, in <module>
    main()
  File "main.py", line 219, in main
    qmodel.export_onnx(
  File "/data/Project/Sparsebit/sparsebit/quantization/quant_model.py", line 254, in export_onnx
    self.add_extra_info_to_onnx(name)
  File "/data/Project/Sparsebit/sparsebit/quantization/quant_model.py", line 298, in add_extra_info_to_onnx
    input_dequant = nodes[tensor_inputs[onnx_op.input[0]][0]]
KeyError: 'conv1.weight_quantizer.scale'

PTQ for swin transformer

code

 B = int(windows.shape[0] / (H * W / window_size / window_size))

error

TypeError: int() argument must be a string, a bytes-like object or a number, not 'Proxy'
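
A possible workaround sketch (my assumption, not from the repo): drop the int() call and use floor division, which keeps the value symbolic when windows.shape[0] is a torch.fx Proxy, assuming H, W, and window_size are plain Python ints at trace time.

# hypothetical rewrite of the failing line; no int() conversion of the Proxy is needed
B = windows.shape[0] // (H * W // window_size // window_size)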

KeyError: 'onnx::QuantizeLinear_711'

File "main.py", line 282, in
main()
File "main.py", line 148, in main
qmodel.export_onnx(
File "/home/hongyang/codebase/quantization_code/Sparsebit/sparsebit/quantization/quant_model.py", line 260, in export_onnx
self.add_extra_info_to_onnx(name)
File "/home/hongyang/codebase/quantization_code/Sparsebit/sparsebit/quantization/quant_model.py", line 304, in add_extra_info_to_onnx
input_dequant = nodes[tensor_inputs[onnx_op.input[0]][0]]
KeyError: 'onnx::QuantizeLinear_711'

I'm very interested in your 2080ti pipeline

This looks like a low-cost setup. For training llama-7b, is a 2080ti strictly required? It has 11GB of memory; would 8 cards with 8GB each also work? Could you share the full configuration of your machine?

Error when running generate.py under alpaca-qlora

It shows the following error:

Traceback (most recent call last):
  File "/home/missa/dev/Sparsebit/large_language_models/alpaca-qlora/generate.py", line 29, in <module>
    model = PeftQModel.from_pretrained(
  File "/home/missa/miniconda3/envs/sparsebitv6/lib/python3.9/site-packages/peft/peft_model.py", line 135, in from_pretrained
    config = PEFT_TYPE_TO_CONFIG_MAPPING[PeftConfig.from_pretrained(model_id).peft_type].from_pretrained(model_id)
  File "/home/missa/miniconda3/envs/sparsebitv6/lib/python3.9/site-packages/peft/utils/config.py", line 95, in from_pretrained
    if os.path.isfile(os.path.join(pretrained_model_name_or_path, CONFIG_NAME)):
  File "/home/missa/miniconda3/envs/sparsebitv6/lib/python3.9/posixpath.py", line 76, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

I see CHECKPOINT_PATH = None in generate.py, is this expected?

torchvision densenet121 cannot be converted to a Sparsebit QuantModel

Running the following code raises an error:

import torchvision
import torch

from sparsebit.quantization import QuantModel, parse_qconfig


qconfig_path = "./qconfig.yaml"
# BACKEND: virtual
# W:
#   QSCHEME: per-channel-symmetric
#   QUANTIZER: 
#     TYPE: lsq
#     BIT: 4
# A:
#   QSCHEME: per-tensor-affine
#   QUANTIZER:
#     TYPE: lsq
#     BIT: 4
#   QADD:
#     ENABLE_QUANT: true

model = torchvision.models.densenet121(pretrained=True)
qconfig = parse_qconfig(qconfig_path)
model = QuantModel(model, config=qconfig)
inp = torch.randn(2, 3, 224, 224)
out = model(inp)

errors using qViT.onnx to do inference

Without any modification, I used your PTQ code to export DeiT ONNX models, but an error occurs when running inference on the exported ONNX model with onnxruntime.

onnxruntime.capi.onnxruntime_pybind11_state.InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : This is an invalid model. Error in Node:QuantizeLinear_2 : No Op registered for QuantizeLinear with domain_version of 13

torchvision mobilenet_v2 4w4f ONNX export error

  • Export code
import torchvision
import torch

from sparsebit.quantization import QuantModel, parse_qconfig


qconfig_path = "./qconfig_lsq.yaml"

# BACKEND: virtual
# W:
#   QSCHEME: per-channel-symmetric
#   QUANTIZER: 
#     TYPE: lsq
#     BIT: 4
# A:
#   QSCHEME: per-tensor-affine
#   QUANTIZER:
#     TYPE: lsq
#     BIT: 4
#   QADD:
#     ENABLE_QUANT: true

model = torchvision.models.mobilenet_v2(pretrained=True)
qconfig = parse_qconfig(qconfig_path)
qmodel = QuantModel(model, config=qconfig)
qmodel.eval()
inp = torch.randn(2, 3, 224, 224)
out = qmodel(inp)
print(out.shape)
with torch.no_grad():
    qmodel.export_onnx(
        inp, name="mobilenet_v2_4w4f.onnx", extra_info=True
    )
  • Error message
Traceback (most recent call last):
  File "dump_onnx.py", line 17, in <module>
    qmodel.export_onnx(
  File "/data/Project/Sparsebit/sparsebit/quantization/quant_model.py", line 256, in export_onnx
    self.add_extra_info_to_onnx(name)
  File "/data/Project/Sparsebit/sparsebit/quantization/quant_model.py", line 294, in add_extra_info_to_onnx
    onnx_op = onnx_model.graph.node[op_pos]
IndexError: list index (914) out of range

English Docs

Thank you for this great share!

Do you have any plans to add English Readme/Docs?

Update quantizer preprocessing when exporting ONNX

  1. Quantizers such as dorefa apply extra operations to the quantization parameters (scale / zero_point) during forward, so the weight, scale, and zero_point written by export onnx are not the parameters actually used during forward, which causes the exported ONNX to run incorrectly.
  2. Please add a preprocessing step so that the two stay consistent.

How could I reproduce the results of QAT_DeiT on ImageNet?

I rewrote the related code in main.py as follows:

    # set head and tail of model is 8bit
    model.model.patch_embed_proj.weight_quantizer.set_bit(bit=8)
    model.model.head.input_quantizer.set_bit(bit=8)
    model.model.head.weight_quantizer.set_bit(bit=8)
    # model.model.conv1.weight_quantizer.set_bit(bit=8)
    # model.model.fc.input_quantizer.set_bit(bit=8)
    # model.model.fc.weight_quantizer.set_bit(bit=8)

and trained for 90 epochs, but I can't get the result reported in the readme.

An error about "x_dq = self._forward(x, scale, zero_point) "

The following error is raised when I run cifar10_qat_pact/main.py:

Traceback (most recent call last):
  File "/root/miniconda3/envs/sb/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/sb/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
 ..........
  File "/root/Sparsebit/examples/cifar10_qat_pact/main.py", line 311, in <module>
    train(
  File "/root/Sparsebit/examples/cifar10_qat_pact/main.py", line 149, in train
    output = model(images)
  File "/root/miniconda3/envs/sb/lib/python3.8/site-packages/torch-1.11.0-py3.8-linux-x86_64.egg/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/Sparsebit/sparsebit/quantization/quant_model.py", line 198, in forward
    return self.model.forward(*args)
  File "<eval_with_key>.129", line 8, in forward
  File "/root/miniconda3/envs/sb/lib/python3.8/site-packages/torch-1.11.0-py3.8-linux-x86_64.egg/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/Sparsebit/sparsebit/quantization/modules/conv.py", line 39, in forward
    x_in = self.input_quantizer(x_in)
  File "/root/miniconda3/envs/sb/lib/python3.8/site-packages/torch-1.11.0-py3.8-linux-x86_64.egg/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/Sparsebit/sparsebit/quantization/quantizers/base.py", line 54, in forward
    x_dq = self._forward(x, scale, zero_point)
TypeError: _forward() takes 2 positional arguments but 4 were given

I found that PACT's def _forward(self, x): accepts only two parameters, while DoReFa's def _forward(self, x, scale, zero_point): accepts four, and in fact DoReFa only needs two of them as well. So I think there are two solutions:

  • either: Change def _forward(self, x): of pact to def _forward(self, x, scale, zero_point):

  • or: Change def _forward(self, x, scale, zero_point): of dorefa to def _forward(self, x):, and change x_dq = self._forward(x, scale, zero_point) to:

if self.TYPE == "PACT" or self.TYPE == "DoReFa":
    x_dq = self._forward(x)
else:
    x_dq = self._forward(x, scale, zero_point)

The error will be solved!
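
A third, signature-tolerant option could look like the sketch below (hypothetical, not code from the repo): dispatch on the number of parameters _forward actually accepts instead of hard-coding quantizer names.

import inspect

# on a bound method the signature excludes self, so PACT-style _forward(self, x) reports 1 parameter
n_params = len(inspect.signature(self._forward).parameters)
x_dq = self._forward(x) if n_params == 1 else self._forward(x, scale, zero_point)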

error in homework Q3 code.

In the homework answer, Q3 code:
data.transpose(self.qdesc.ch_axis, 0)
Here, 0 should be 1: according to the code, the first axis (0) is the calibration batch size and the second axis (1) is the channel axis.

Also, self.qdesc.ch_axis == 0 means the observer is for weights, which should not be handled here.
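
For reference, the fix described above as a single line (same context as the snippet quoted from the answer):

data = data.transpose(self.qdesc.ch_axis, 1)  # keep axis 0 as the calibration batch dimension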

RuntimeError: Ninja is required to load C++ extensions

/opt/python3.8.6/bin/python /home/hongyang/codebase/quantization_code/Sparsebit/examples/quantization_aware_training/cifar10/basecase/main.py
Traceback (most recent call last):
  File "/opt/python3.8.6/bin/ninja", line 33, in <module>
    sys.exit(load_entry_point('ninja', 'console_scripts', 'ninja')())
  File "/opt/python3.8.6/lib/python3.8/site-packages/ninja-1.11.1-py3.8-linux-x86_64.egg/ninja/__init__.py", line 51, in ninja
    raise SystemExit(_program('ninja', sys.argv[1:]))
  File "/opt/python3.8.6/lib/python3.8/site-packages/ninja-1.11.1-py3.8-linux-x86_64.egg/ninja/__init__.py", line 47, in _program
    return subprocess.call([os.path.join(BIN_DIR, name)] + args, close_fds=False)
  File "/opt/python3.8.6/lib/python3.8/subprocess.py", line 340, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/opt/python3.8.6/lib/python3.8/subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/opt/python3.8.6/lib/python3.8/subprocess.py", line 1592, in _execute_child
    self._posix_spawn(args, executable, env, restore_signals,
  File "/opt/python3.8.6/lib/python3.8/subprocess.py", line 1543, in _posix_spawn
    self.pid = os.posix_spawn(executable, args, env, **kwargs)
PermissionError: [Errno 13] Permission denied: '/opt/python3.8.6/lib/python3.8/site-packages/ninja-1.11.1-py3.8-linux-x86_64.egg/ninja/data/bin/ninja'
Traceback (most recent call last):
  File "/home/hongyang/codebase/quantization_code/Sparsebit/examples/quantization_aware_training/cifar10/basecase/main.py", line 23, in <module>
    from sparsebit.quantization import QuantModel, parse_qconfig
  File "/home/hongyang/codebase/quantization_code/Sparsebit/sparsebit/quantization/__init__.py", line 1, in <module>
    from .quant_model import *
  File "/home/hongyang/codebase/quantization_code/Sparsebit/sparsebit/quantization/quant_model.py", line 18, in <module>
    from sparsebit.quantization.modules import *
  File "/home/hongyang/codebase/quantization_code/Sparsebit/sparsebit/quantization/modules/__init__.py", line 17, in <module>
    from .base import QuantOpr, MultipleInputsQuantOpr
  File "/home/hongyang/codebase/quantization_code/Sparsebit/sparsebit/quantization/modules/base.py", line 4, in <module>
    from sparsebit.quantization.quantizers import build_quantizer
  File "/home/hongyang/codebase/quantization_code/Sparsebit/sparsebit/quantization/quantizers/__init__.py", line 9, in <module>
    from .base import Quantizer
  File "/home/hongyang/codebase/quantization_code/Sparsebit/sparsebit/quantization/quantizers/base.py", line 4, in <module>
    from sparsebit.quantization.observers import build_observer
  File "/home/hongyang/codebase/quantization_code/Sparsebit/sparsebit/quantization/observers/__init__.py", line 10, in <module>
    from . import minmax, percentile, mse, moving_average, kl_histogram, aciq
  File "/home/hongyang/codebase/quantization_code/Sparsebit/sparsebit/quantization/observers/mse.py", line 6, in <module>
    from sparsebit.quantization.quantizers.quant_tensor import STE
  File "/home/hongyang/codebase/quantization_code/Sparsebit/sparsebit/quantization/quantizers/quant_tensor.py", line 13, in <module>
    fake_quant_kernel = load(
  File "/opt/python3.8.6/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1202, in load
    return _jit_compile(
  File "/opt/python3.8.6/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1425, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/opt/python3.8.6/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1506, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "/opt/python3.8.6/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1562, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions

ONNX cannot be loaded by TensorRT 8.2.5 & 8.0.03

For homework Q4, I exported the ONNX file and tried to convert it to a TRT engine. I used trtexec --workspace=4096 --int8 --onnx=./qresnet18.onnx with TensorRT 8.2.5 and 8.0.03.
But I encountered an unsupported-op error, as follows:
[07/30/2022-08:13:33] [I] TensorRT version: 8003
[07/30/2022-08:13:33] [I] [TRT] [MemUsageChange] Init CUDA: CPU +250, GPU +0, now: CPU 257, GPU 482 (MiB)
[07/30/2022-08:13:33] [I] Start parsing network model
[07/30/2022-08:13:33] [I] [TRT] ----------------------------------------------------------------
[07/30/2022-08:13:33] [I] [TRT] Input filename: ./qresnet18.onnx
[07/30/2022-08:13:33] [I] [TRT] ONNX IR version: 0.0.7
[07/30/2022-08:13:33] [I] [TRT] Opset version: 13
[07/30/2022-08:13:33] [I] [TRT] Producer name: pytorch
[07/30/2022-08:13:33] [I] [TRT] Producer version: 1.12.0
[07/30/2022-08:13:33] [I] [TRT] Domain:
[07/30/2022-08:13:33] [I] [TRT] Model version: 0
[07/30/2022-08:13:33] [I] [TRT] Doc string:
[07/30/2022-08:13:33] [I] [TRT] ----------------------------------------------------------------
[07/30/2022-08:13:33] [E] Error[3]: onnx::QuantizeLinear_710: invalid weights type of Int8
[07/30/2022-08:13:33] [E] [TRT] ModelImporter.cpp:720: While parsing node number 0 [Identity -> "onnx::QuantizeLinear_872"]:
[07/30/2022-08:13:33] [E] [TRT] ModelImporter.cpp:721: --- Begin node ---
[07/30/2022-08:13:33] [E] [TRT] ModelImporter.cpp:722: input: "onnx::QuantizeLinear_710" output: "onnx::QuantizeLinear_872" name: "Identity_0" op_type: "Identity"

Do I need to add some other settings?

QAT cifar10 example with QADD quant enabled raises an error

Running https://github.com/megvii-research/Sparsebit/blob/main/examples/quantization_aware_training/cifar10/basecase/main.py raises an error.

  • Command
python3 main.py qconfig_lsq.yaml --epochs=0
  • Content of qconfig_lsq.yaml:
BACKEND: virtual
W:
  QSCHEME: per-channel-symmetric
  QUANTIZER: 
    TYPE: lsq
    BIT: 4
A:
  QSCHEME: per-tensor-affine
  QUANTIZER:
    TYPE: lsq
    BIT: 4
  QADD:
    ENABLE_QUANT: true
  • Error
Traceback (most recent call last):
  File "main.py", line 439, in <module>
    main()
  File "main.py", line 223, in main
    qmodel.export_onnx(
  File "/data/Project/Sparsebit/sparsebit/quantization/quant_model.py", line 257, in export_onnx
    self.add_extra_info_to_onnx(name)
  File "/data/Project/Sparsebit/sparsebit/quantization/quant_model.py", line 313, in add_extra_info_to_onnx
    weight_dequant = nodes[tensor_inputs[onnx_op.input[1]][0]]
IndexError: list index (1) out of range

ImportError

My setting:
cuda 10.2
python=3.8

Install Sparsebit with:

git clone https://github.com/megvii-research/Sparsebit.git
cd sparsebit
python3 setup.py develop --user
pip3 install tensorrt-8.2.5.1-cp38-none-linux_x86_64.whl

After installation:
I ran /root/Sparsebit/examples/cifar10_ptq/main.ipynb, which raised RuntimeError: Ninja is required to load C++ extensions, so I ran pip install Ninja and ran main.ipynb again, which then raised the following error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
/root/Sparsebit/examples/cifar10_ptq/main.ipynb Cell 2 in <cell line: 24>()
     21 import torchvision.datasets as datasets
     22 from model import resnet20
---> 24 from sparsebit.quantization import QuantModel, parse_qconfig

File ~/Sparsebit/sparsebit/quantization/__init__.py:1, in <module>
----> 1 from .quant_model import *
      2 from .quant_config import parse_qconfig

File ~/Sparsebit/sparsebit/quantization/quant_model.py:18, in <module>
     15 import onnx
     17 from sparsebit.utils import update_config
---> 18 from sparsebit.quantization.modules import *
     19 from sparsebit.quantization.observers import Observer
     20 from sparsebit.quantization.quantizers import Quantizer

File ~/Sparsebit/sparsebit/quantization/modules/__init__.py:16, in <module>
     12     return real_register
     15 # 将需要注册的module文件填写至此
---> 16 from .base import QuantOpr, MultipleInputsQuantOpr
     17 from .activations import *
     18 from .conv import *
...
-> 1775     module = importlib.util.module_from_spec(spec)
   1776     assert isinstance(spec.loader, importlib.abc.Loader)
   1777     spec.loader.exec_module(module)

ImportError: /root/Sparsebit/sparsebit/quantization/torch_extensions/build/fake_quant.so: cannot open shared object file: No such file or directory

I examined the build directory and found it empty.

A bug may have been resolved.

There is a bug here; I thought of a simple way to fix it, which applies to QAT of ViT.

elif 'input_quantizer.scale' in dict(_module.state_dict()).keys():
    _module.input_quantizer.set_fake_fused()  # bug: quant_state switches back and forth
else:
    print("no_set_fake_fused:", _user.name, _module.input_quantizer_generated)

How to deal with layers constructed by bool operations?

For example, there is a layer constructed as follows:

while len(emb_out.shape) < len(h.shape):
    emb_out = emb_out[..., None]

When quantizing this layer, we get this error:

torch.fx.proxy.TraceError: symbolically traced variables cannot be used as inputs to control flow
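
One common workaround, sketched below under the assumption that Sparsebit traces the model with the standard torch.fx tracer (expand_emb is a hypothetical helper name): move the shape-dependent loop into a module-level function and mark it as a leaf with torch.fx.wrap, so the tracer records a single call node instead of tracing through the control flow.

import torch.fx

def expand_emb(emb_out, h):
    # runs eagerly at execution time; torch.fx records the call as one opaque node
    while len(emb_out.shape) < len(h.shape):
        emb_out = emb_out[..., None]
    return emb_out

torch.fx.wrap("expand_emb")  # must be called at module level, in the module defining expand_emb

# inside the model's forward():
#     emb_out = expand_emb(emb_out, h)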

Import Error

The code hangs at “from sparsebit.quantization import QuantModel, parse_qconfig”, as shown in the attached screenshot.
