
PTQ4ViT

Post-Training Quantization Framework for Vision Transformers. We use a twin uniform quantization method to reduce the quantization error on post-softmax and post-GELU activation values, and a Hessian-guided metric to evaluate scaling-factor candidates, which improves calibration accuracy at a small cost. The quantized vision transformers (ViT, DeiT, and Swin) achieve near-lossless prediction accuracy (less than 0.5% drop at 8-bit quantization) on the ImageNet classification task. Please read the paper for details.
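
For intuition, the snippet below sketches the twin-uniform idea for post-softmax activations: a fine scale covers the many small probabilities while a coarse scale covers the few large ones, and each value is quantized with whichever range it falls into. The names delta_r1 and k and the simple range-selection rule are illustrative assumptions, not the exact implementation in this repository.

    import torch

    def twin_uniform_quant_softmax(x, delta_r1, k=8):
        # R1 covers the many small post-softmax values with the fine scale delta_r1;
        # R2 covers roughly [0, 1] with a coarse scale; one bit of the k-bit code
        # selects which range a value belongs to
        levels = 2 ** (k - 1)              # codes available inside each range
        delta_r2 = 1.0 / levels            # coarse scale so that R2 spans roughly [0, 1]
        q1 = torch.clamp(torch.round(x / delta_r1), 0, levels - 1)
        q2 = torch.clamp(torch.round(x / delta_r2), 0, levels - 1)
        in_r2 = x >= levels * delta_r1     # values falling outside R1 use the coarse range
        return torch.where(in_r2, q2 * delta_r2, q1 * delta_r1)

    probs = torch.softmax(torch.randn(4, 16), dim=-1)
    print(twin_uniform_quant_softmax(probs, delta_r1=1 / 512))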

Updates

19/07/2022 Added a discussion of Base PTQ and more ablation study results.

Number of Calibration Images

Model W8A8 #ims=32 W6A6 #ims=32 W8A8 #ims=128 W6A6 #ims=128
ViT-S/224/32 75.58 71.91 75.54 72.29
ViT-S/224 81.00 78.63 80.99 78.44
ViT-B/224 84.25 81.65 84.27 81.84
ViT-B/384 85.83 83.35 85.81 83.84
DeiT-S/224 79.47 76.28 79.41 76.51
DeiT-B/224 81.48 80.25 81.54 80.30
DeiT-B/384 82.97 81.55 83.01 81.67
Swin-T/224 81.25 80.47 81.27 80.30
Swin-S/224 83.11 82.38 83.15 82.38
Swin-B/224 85.15 84.01 85.17 84.15
Swin-B/384 86.39 85.39 86.36 85.45
Model Time #ims=32 Time #ims=128
ViT-S/224/32 2 min 5 min
ViT-S/224 3 min 7 min
ViT-B/224 4 min 13 min
ViT-B/384 12 min 43 min
DeiT-S/224 3 min 7 min
DeiT-B/224 4 min 16 min
DeiT-B/384 14 min 52 min
Swin-T/224 3 min 9 min
Swin-S/224 8 min 17 min
Swin-B/224 10 min 23 min
Swin-B/384 25 min 69 min

One of the goals of PTQ4ViT is to quantize a vision transformer quickly. We pre-compute the output and output gradient of each layer and compute the influence of the scaling-factor candidates in batches to reduce the quantization time. As the second table shows, PTQ4ViT can quantize most vision transformers in a few minutes using 32 calibration images; using 128 calibration images increases the quantization time significantly.
As the first table shows, the Top-1 accuracy varies only slightly with the number of calibration images, demonstrating that PTQ4ViT is not very sensitive to it.
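
The sketch below shows the idea behind this candidate search. The function and argument names, and the quantized-layer callable, are illustrative; the real implementation additionally evaluates many candidates in parallel along an extra tensor dimension. The score follows the Hessian-guided metric described in the paper: squared output gradients weighting squared output errors.

    import torch

    def search_layer_scale(raw_input, raw_out, raw_grad, candidates, run_layer_quantized):
        # pre-computed raw input/output/gradient mean that each candidate only needs
        # one extra pass through the current layer
        best_scale, best_score = None, float("inf")
        for scale in candidates:
            quant_out = run_layer_quantized(raw_input, scale)
            score = ((raw_grad ** 2) * (raw_out - quant_out) ** 2).sum().item()
            if score < best_score:
                best_scale, best_score = scale, score
        return best_scale

    # toy usage with an identity "layer" that just fake-quantizes its input
    x = torch.randn(32, 197, 384)
    quantize = lambda inp, s: torch.clamp(torch.round(inp / s), -128, 127) * s
    print(search_layer_scale(x, x, torch.randn_like(x), [0.01, 0.02, 0.04], quantize))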

Base PTQ

Base PTQ is a simple quantization strategy that serves as a baseline in our experiments. Like PTQ4ViT, we quantize all weights and inputs of fully-connected layers (including the first projection layer and the last prediction layer), as well as both input matrices of matrix multiplication operations. For fully-connected layers, we use layerwise scaling factors $\Delta_W$ for weight quantization and $\Delta_X$ for input quantization; for matrix multiplication operations, we use $\Delta_A$ and $\Delta_B$ for the quantization of A and B, respectively.

To get the best scaling factors, we apply a linear grid search over the search space. Following EasyQuant and Liu et al., we take the hyper-parameters $\alpha=0.5$, $\beta=1.2$, one search round, and cosine distance as the metric. Note that in PTQ4ViT we change the hyper-parameters to $\alpha=0$, $\beta=1.2$, and three search rounds, which slightly improves the performance.
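
As a rough illustration of this grid search (the candidate count, spacing, and symmetric signed quantization below are assumptions for the sketch, not the repository's exact settings):

    import torch
    import torch.nn.functional as F

    def base_ptq_grid_search(w, alpha=0.5, beta=1.2, n=100, k=8):
        # candidates are spaced linearly between alpha*max|w| and beta*max|w|; the one
        # minimizing the cosine distance between w and its quantize-dequantize copy wins
        qmax = 2 ** (k - 1) - 1
        w_absmax = w.abs().max()
        best_scale, best_dist = None, float("inf")
        for i in range(1, n + 1):
            scale = (alpha + (beta - alpha) * i / n) * w_absmax / qmax
            w_hat = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
            dist = 1 - F.cosine_similarity(w.flatten(), w_hat.flatten(), dim=0).item()
            if dist < best_dist:
                best_scale, best_dist = scale, dist
        return best_scale

    print(base_ptq_grid_search(torch.randn(768, 768)))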

Note that Base PTQ adopts a parallel quantization paradigm, which makes it essentially different from sequential quantization paradigms such as EasyQuant. In sequential quantization, the input of the layer currently being quantized is generated with the weights and activations of all previous layers already quantized; in parallel quantization, the input of the current layer is simply the raw (full-precision) output of the previous layer.

In practice, we found that sequential quantization of vision transformers suffers from significant accuracy degradation on small calibration datasets, while parallel quantization remains robust. Therefore, we choose parallel quantization for both Base PTQ and PTQ4ViT.

More Ablation Study

We supply more ablation studies of the hyper-parameters. It is sufficient to set the number of quantization intervals to $\ge 20$ (accuracy change $< 0.3\%$). It is sufficient to set the upper bound of $m$ to $\ge 15$ (no accuracy change). The best settings of $\alpha$ and $\beta$ vary across layers. Setting $\alpha=0$ and $\beta=1/2^{k-1}$ is appropriate and has little impact on search efficiency. We also observe that the number of search rounds has little impact on the prediction accuracy (accuracy change $< 0.05\%$ when using more than one search round).

We randomly sample 32 calibration images and quantize each model 20 times; the observed fluctuation is not significant. The mean/std of the accuracies are: ViT-S/32 75.55%/0.055%, ViT-S 80.96%/0.046%, ViT-B 84.12%/0.068%, DeiT-S 79.45%/0.094%, and Swin-S 83.11%/0.035%.

15/01/2022 Added saved quantized models produced with PTQ4ViT.

model link
ViT-S/224/32 Google
ViT-S/224 Google
ViT-B/224 Google
ViT-B/384 Google
DeiT-S/224 Google
DeiT-B/224 Google
DeiT-B/384 Google
Swin-T/224 Google
Swin-S/224 Google
Swin-B/224 Google
Swin-B/384 Google

10/12/2021 Added utils/integer.py; you can now:

  1. convert a calibrated fp32 model into int8;
  2. register a pre-forward hook in the model and fetch activations in int8 (we use uint8 to store the results of twin quantization; please refer to the paper for the bit layout); a minimal hook sketch is shown below.
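
The sketch below shows point 2 using standard PyTorch pre-forward hooks. The actual helpers in utils/integer.py may expose a different interface, and the uint8 cast here is only a placeholder for the twin-quantization codes described in the paper.

    import timm
    import torch

    model = timm.create_model("vit_small_patch16_224", pretrained=False).eval()
    captured = {}

    def make_hook(name):
        def hook(module, inputs):
            x = inputs[0]
            # placeholder cast; the real code stores the twin-quantization codes
            captured[name] = torch.round(x.clamp(0, 255)).to(torch.uint8).cpu()
        return hook

    # attach a pre-forward hook to every linear layer
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            module.register_forward_pre_hook(make_hook(name))

    with torch.no_grad():
        model(torch.randn(1, 3, 224, 224))
    print(list(captured)[:3])   # names of the first few captured activations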

Install

Requirement

  • python>=3.5
  • pytorch>=1.5
  • matplotlib
  • pandas
  • timm

Datasets

To run the example tests, you should put your ImageNet2012 dataset at /datasets/imagenet.

We use ViTImageNetLoaderGenerator in utils/datasets.py to initialize our DataLoader. If your ImageNet dataset is stored elsewhere, you will need to pass its root path as an argument when instantiating a ViTImageNetLoaderGenerator.

Usage

1. Run example quantization

To test on all models with BasePTQ/PTQ4ViT, run

python example/test_all.py

To run ablation testing, run

python example/test_ablation.py

You can run the testing scripts with multiple GPUs. For example, calling

python example/test_all.py --multigpu --n_gpu 6

will use 6 GPUs to run the test.

2. Download quantized model checkpoints

(Coming soon)

Results

Results of BasePTQ

model original w8a8 w6a6
ViT-S/224/32 75.99 73.61 60.144
ViT-S/224 81.39 80.468 70.244
ViT-B/224 84.54 83.896 75.668
ViT-B/384 86.00 85.352 46.886
DeiT-S/224 79.80 77.654 72.268
DeiT-B/224 81.80 80.946 78.786
DeiT-B/384 83.11 82.33 68.442
Swin-T/224 81.39 80.962 78.456
Swin-S/224 83.23 82.758 81.742
Swin-B/224 85.27 84.792 83.354
Swin-B/384 86.44 86.168 85.226

Results of PTQ4ViT

model original w8a8 w6a6
ViT-S/224/32 75.99 75.582 71.908
ViT-S/224 81.39 81.002 78.63
ViT-B/224 84.54 84.25 81.65
ViT-B/384 86.00 85.828 83.348
DeiT-S/224 79.80 79.474 76.282
DeiT-B/224 81.80 81.482 80.25
DeiT-B/384 83.11 82.974 81.55
Swin-T/224 81.39 81.246 80.47
Swin-S/224 83.23 83.106 82.38
Swin-B/224 85.27 85.146 84.012
Swin-B/384 86.44 86.394 85.388

Results of Ablation

  • ViT-S/224 (original top-1 accuracy 81.39%)

    | Hessian Guided | Softmax Twin | GELU Twin | W8A8  | W6A6  |
    |----------------|--------------|-----------|-------|-------|
    | ×              | ×            | ×         | 80.47 | 70.24 |
    | √              | ×            | ×         | 80.93 | 77.20 |
    | √              | √            | ×         | 81.11 | 78.57 |
    | √              | ×            | √         | 80.84 | 76.93 |
    | ×              | √            | √         | 79.25 | 74.07 |
    | √              | √            | √         | 81.00 | 78.63 |
  • ViT-B/224 (original top-1 accuracy 84.54%)

    | Hessian Guided | Softmax Twin | GELU Twin | W8A8  | W6A6  |
    |----------------|--------------|-----------|-------|-------|
    | ×              | ×            | ×         | 83.90 | 75.67 |
    | √              | ×            | ×         | 83.97 | 79.90 |
    | √              | √            | ×         | 84.07 | 80.76 |
    | √              | ×            | √         | 84.10 | 80.82 |
    | ×              | √            | √         | 83.40 | 78.86 |
    | √              | √            | √         | 84.25 | 81.65 |
  • ViT-B/384 (original top-1 accuracy 86.00%)

    | Hessian Guided | Softmax Twin | GELU Twin | W8A8  | W6A6  |
    |----------------|--------------|-----------|-------|-------|
    | ×              | ×            | ×         | 85.35 | 46.89 |
    | √              | ×            | ×         | 85.42 | 79.99 |
    | √              | √            | ×         | 85.67 | 82.01 |
    | √              | ×            | √         | 85.60 | 82.21 |
    | ×              | √            | √         | 84.35 | 80.86 |
    | √              | √            | √         | 85.89 | 83.19 |

Citation

@article{PTQ4ViT_arixv2022,
    title={PTQ4ViT: Post-Training Quantization Framework for Vision Transformers},
    author={Yuan, Zhihang and Xue, Chenhao and Chen, Yiqi and Wu, Qiang and Sun, Guangyu},
    journal={arXiv preprint arXiv:2111.12293},
    year={2022},
}

ptq4vit's People

Contributors

hahnyuan, supervan-young


ptq4vit's Issues

Why can't I get the accuracy reported in the README?

I just ran the code without any changes and got different accuracies, for example:
0.7599 for vit_small_patch32_224 (float)
PTQ4ViT: 0.7541 for vit_small_patch32_224 (w8, a8); BasePTQ: 0.74856 for vit_small_patch32_224 (w8, a8)
PTQ4ViT: 0.7174 for vit_small_patch32_224 (w6, a6); BasePTQ: 0.64272 for vit_small_patch32_224 (w6, a6)

Other models show the same problem, and I did not change anything in the original code.
Is it because we're using different library versions? Mine are:
Python==3.10.6
timm==0.6.13
torch==1.12.1

Consistency of quantization results across tasks

After quantization, is the accuracy degradation on classification roughly consistent with that on other downstream tasks such as object detection and semantic segmentation? Also, do you have any thoughts on the fact that most current quantization works validate their effectiveness only on classification?

How to load the quantized weights file

I obtained the weight file after a successful quantization, but I found that the quantized weight file is inconsistent with the one before quantization. How can I read the int8 weight file correctly?

TaskLoss

I am wondering why the task loss in the code is kl_div(log_softmax(quant_out), softmax(fp32_out)), which seems inconsistent with the formula $\frac{\partial L}{\partial O^l}$ in the paper.
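
For reference, the loss the issue describes corresponds to the following PyTorch computation (the logits below are random placeholders):

    import torch
    import torch.nn.functional as F

    fp32_out = torch.randn(8, 1000)    # placeholder logits of the full-precision model
    quant_out = torch.randn(8, 1000)   # placeholder logits of the quantized model
    # F.kl_div expects log-probabilities as input and probabilities as target
    loss = F.kl_div(F.log_softmax(quant_out, dim=-1),
                    F.softmax(fp32_out, dim=-1),
                    reduction="batchmean")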

Shape of saved quantized model parameter

Hi, thanks for sharing the work!

I met a problem when trying to load "vit_base_patch16_224.pth". The shape of 'blocks.0.attn.qkv' in the pth file is torch.Size([3, 1, 2304, 768]), whereas the shape of 'blocks.0.attn.qkv.weight' in the model should be torch.Size([2304, 768]). What do the first and second dimensions in torch.Size([3, 1, 2304, 768]) mean? I think it should be torch.Size([2304, 768]).

Setting of self.sequential in HessianQuantCalibrator

In the batching_quant_calib() function of the HessianQuantCalibrator class in quant_calib.py,
if self.sequential is set to False (the default value in the provided code), the quantization interval is not reflected in the calibration step of the next module after calibrating the current module.
In this case, when calculating loss = KL_div(y, y_hat), instead of computing it between the output y of the full-precision (FP) model and the output y_hat of the quantized model, it would be calculated between two outputs of the FP model (i.e., KL_div(y, y)).

So, should self.sequential be set to True?
However, when setting self.sequential=True, the accuracy is reported to be very low.
Could you please clarify this issue and provide guidance on the appropriate setting for self.sequential?

Constrain the scaling factors of the two ranges

First of all, thank you for the great work and the official code.

I have one question.

Where is the code that constrains the scaling factors for the post-softmax and post-GELU values, i.e., $\Delta_{R2} = 2^m \Delta_{R1}$ where m is an unsigned integer, to enable efficient processing?

Thank you again for providing the code.
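
As background on why such a constraint helps (an illustrative sketch, not code from this repository): when $\Delta_{R2} = 2^m \Delta_{R1}$, codes from the coarse range can be rescaled into fine-range units by multiplying by $2^m$ (a left shift by m bits), so values from both ranges can share a single scaling factor during accumulation.

    import torch

    m = 3
    delta_r1 = 1 / 1024                   # fine scale for range R1
    delta_r2 = (2 ** m) * delta_r1        # coarse scale, constrained to 2^m * delta_r1

    q_r1 = torch.tensor([5, 17, 42], dtype=torch.int32)   # codes quantized in R1
    q_r2 = torch.tensor([3, 9, 12], dtype=torch.int32)    # codes quantized in R2

    # rescale R2 codes into R1 units by multiplying by 2^m (a left shift by m bits);
    # afterwards every code shares the single scale delta_r1
    unified_codes = torch.cat([q_r1, q_r2 * (2 ** m)])
    dequantized = unified_codes.float() * delta_r1
    print(dequantized)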

calibration parameters

Hi, may I ask for more details on what this function does?

    def _initialize_calib_parameters(self):
        """
        Set parameters for feeding calibration data.
        """
        self.calib_size = int(self.raw_input.shape[0])
        self.calib_batch_size = int(self.raw_input.shape[0])
        while True:
            # estimated fp32 elements held on the GPU for one calibration batch:
            # 2 * (input + output elements per sample) * calib_batch_size
            numel = (2 * (self.raw_input.numel() + self.raw_out.numel()) /
                     self.calib_size * self.calib_batch_size)
            # number of scaling-factor candidates that fit into roughly 15 GiB
            # of GPU memory, assuming 4 bytes per element
            self.parallel_eq_n = int((15 * 1024 * 1024 * 1024 / 4) // numel)
            if self.parallel_eq_n <= 1:
                # not even one candidate fits; halve the calibration batch size and retry
                self.calib_need_batching = True
                self.calib_batch_size //= 2
            else:
                break

I am adapting your code for 1D convolution, and self.parallel_eq_n gets multiplied by the output channels of the weights, causing memory issues. If you could provide further details, it would be really helpful. Thank you. Ed

Program results differ slightly from the reported ones

Dear author, thank you for your outstanding contribution. However, I ran into some issues when running the program:
I used the ImageNet2012 dataset with your program and did not change any parameters (I found that the α of PTQ4ViT is 0.01 in the program), but the results of several attempts differ slightly from the original results. For example, in the PTQ4ViT results, your ViT-B/224 accuracy at w8a8 is 84.25 but mine is 84.148; your ViT-B/224 accuracy at w6a6 is 81.65 but mine is 81.844; and so on. I use an Nvidia RTX 3090 24 GB.
Did I do something wrong? I look forward to your reply. Thanks again!

The saved quantized model cannot be loaded

I downloaded one of the quantized models, "vit_small_patch16_224.pth", but when I tried the following code:

pthfile = "vit_small_patch16_224.pth"
net.load_state_dict(torch.load(pthfile))

I got this error:

RuntimeError: Error(s) in loading state_dict for VisionTransformer:
Missing key(s) in state_dict: "cls_token", "pos_embed", "patch_embed.proj.weight", "patch_embed.proj.bias", "blocks.0.norm1.weight", "blocks.0.norm1.bias", "blocks.0.attn.qkv.weight", "blocks.0.attn.qkv.bias", "blocks.0.attn.proj.weight", "blocks.0.attn.proj.bias",...

I guess this may be caused by the bias not being saved? I was also wondering about the correct way to load that model.
Thank you very much for your help!

Which dataset is used during inference?

Hi, is the ImageNet validation set used during inference? I see that the datasets in the code are split into train and val parts; is a test set not needed? Thanks.

Questions about the inference process

Thank you for your excellent work. I have a question about the inference process.
The floating-point computation is:
y = x * w + b
and the quantized inference would be:
y = (x_int * w_int) * (w_interval * a_interval)
which allows the x * w product to be computed in int8.

However, in the source code I noticed that quantized linear layers such as MinMaxQuantLinear() implement the quant_weight_bias(self) and quant_input(self, x) methods as:
w_sim = w.mul_(self.w_interval)
x_sim.mul_(self.a_interval)
This means that after quantizing the activations and weights, they are immediately dequantized back to float32, so the actual computation is still performed in float32. This kind of inference seems to have no acceleration effect. Is my understanding correct?
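
For context, what the issue describes is commonly called simulated ("fake") quantization: tensors are rounded to integer codes and immediately rescaled back to float32 so that accuracy can be evaluated with ordinary float kernels, while real speedups require integer GEMM kernels. A minimal sketch of the distinction (illustrative, not this repository's classes):

    import torch

    def fake_quant(x, scale, k=8):
        # round to k-bit integer codes, then immediately rescale back to float32
        qmax = 2 ** (k - 1) - 1
        return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

    x = torch.randn(4, 16)                # activations
    w = torch.randn(16, 8)                # weights
    a_scale = x.abs().max() / 127
    w_scale = w.abs().max() / 127

    # simulated quantization: quantized values, but float32 arithmetic
    y_sim = fake_quant(x, a_scale) @ fake_quant(w, w_scale)

    # a genuine int8 pipeline would instead compute (x_int @ w_int) * (a_scale * w_scale)
    # with integer GEMM kernels; that part is what the simulation does not provide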

Are the quantized models differentiable

Hi,
Are the quantized models differentiable (i.e., can we get gradients through them using backprop), or is that not possible due to the actual INT8 quantization? Please reply ASAP.

How to load the quantized models with PTQ4ViT into the net?

Hi!
Thanks for your great work! There was a problem when I tried to load the quantized model 'vit_base_patch16_384.pth' (which you provided) into the net created by timm.create_model. The error is as follows.

RuntimeError: Error(s) in loading state_dict for VisionTransformer:
Missing key(s) in state_dict: "cls_token", "pos_embed", "patch_embed.proj.weight", "patch_embed.proj.bias", "blocks.0.norm1.weight", "blocks.0.norm1.bias", "blocks.0.attn.qkv.weight", "blocks.0.attn.qkv.bias", "blocks.0.attn.proj.weight", "blocks.0.attn.proj.bias", "blocks.0.norm2.weight", "blocks.0.norm2.bias", "blocks.0.mlp.fc1.weight", "blocks.0.mlp.fc1.bias", "blocks.0.mlp.fc2.weight", "blocks.0.mlp.fc2.bias".......

Would you please provide the correct method to load your quantized model? Thank you a lot!
