Comments (14)
Hey, it works with a really great result: Whisper int4.
A simple experiment:
model: whisper-medium
setup: a 15s audio clip, transcribed 10 times.
| device | precision | running time | CPU cores / GPU utilization | memory |
|---|---|---|---|---|
| CPU | int8 | 45s | i9, 24 cores | 4.1 GB |
| CPU | int4 | 63s | i9, 24 cores | 2.9 GB |
| GPU | int8 | 57s | 4090, 39% | 2.7 GB |
| GPU | int4 | 7s | 4090, 80~90% | 3.8 GB |
In conclusion:
- On the CPU, int4 seems to be limited by compute speed: it runs about 40% slower than int8 while using roughly 1 GB less memory.
- On the GPU, int4 runs 7 to 8 times faster than int8, with utilization close to 100% and about 1 GB more memory; int8 cannot fully utilize the 4090.
- Overall, int4 should be faster but requires more compute resources.
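For anyone who wants to reproduce this kind of timing, here is a minimal onnxruntime sketch. The model path and provider list are assumptions, and the input names of the Olive whisper export vary by configuration, so inspect them before building the feed dict.

```python
import time
import onnxruntime as ort

# Hypothetical path to the Olive-exported model; adjust to your output directory.
sess = ort.InferenceSession(
    "whisper_medium_int4.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# The Olive whisper export bundles audio decoding and beam-search parameters,
# so input names differ from a plain encoder/decoder export; list them first.
for inp in sess.get_inputs():
    print(inp.name, inp.type, inp.shape)

def benchmark(session, feeds, runs=10):
    """Average wall-clock latency over `runs` executions, after one warm-up."""
    session.run(None, feeds)
    start = time.perf_counter()
    for _ in range(runs):
        session.run(None, feeds)
    return (time.perf_counter() - start) / runs
```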
Hey, I ran a rough test.
Driver Description: NVIDIA GeForce GTX 1660 Ti with Max-Q Design
Video Memory: 6144 MBytes of GDDR6 SDRAM [Micron]
File: 1272-141231-0002.mp3 (duration: 13s 428ms)
whisper-large-v2-int8
- average transcription time: 16s 713ms
- realtime ratio: 0.80
whisper-large-v2-int4
- average transcription time: 2s 963ms
- realtime ratio: 4.53 (about 5.64x faster than int8)
I tested with a C# demo app.
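For reference, the realtime ratio here is audio duration divided by transcription time; the numbers above follow directly:

```python
# Realtime ratio = audio duration / transcription time (higher is better).
audio_s = 13.428                  # 1272-141231-0002.mp3
int8_s, int4_s = 16.713, 2.963    # average transcription times

print(audio_s / int8_s)           # ~0.80 -> int8 is slower than realtime
print(audio_s / int4_s)           # ~4.53 -> int4 runs ~4.5x realtime
print(int8_s / int4_s)            # ~5.64 -> int4 speedup over int8
```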
"blockwise_quant_int4":{
"type": "OnnxMatMul4Quantizer",
"disable_search": true
},
"bnb_quantization": {
"type": "OnnxBnb4Quantization",
"config": {
"quant_type": "fp4",
"save_as_external_data": true,
"all_tensors_to_one_file": true
}
},
"bnb_quantization": {
"type": "OnnxBnb4Quantization",
"config": {
"quant_type": "nf4",
"save_as_external_data": true,
"all_tensors_to_one_file": true
}
},
bnb_quantization with fp4 performs better than nf4 and blockwise_quant_int4; compared with blockwise_quant_int4, performance improves by about 40%.
Setup: a 15s audio clip, run 50 times on an NVIDIA 4090.
| precision | model size | running time | memory | GPU utilization |
|---|---|---|---|---|
| fp4 | 636 MB | 22s | 3.62 GB | 95% |
| nf4 | 636 MB | 24s | 3.67 GB | 95% |
| int4 | 681 MB | 34s | 3.64 GB | 85~95% |
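For intuition about what these passes do, here is a simplified numpy sketch of symmetric blockwise 4-bit quantization. It is only illustrative: the real OnnxMatMul4Quantizer and bitsandbytes fp4/nf4 kernels differ in codebooks, zero-point handling, and bit packing.

```python
import numpy as np

def quantize_blockwise_4bit(weights, block_size=32):
    """Symmetric absmax 4-bit quantization: one float scale per block of values."""
    flat = weights.reshape(-1, block_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                         # avoid division by zero
    q = np.clip(np.round(flat / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_blockwise_4bit(q, scales, shape):
    return (q.astype(np.float32) * scales).reshape(shape)

w = np.random.randn(64, 64).astype(np.float32)        # toy weight matrix
q, s = quantize_blockwise_4bit(w)
w_hat = dequantize_blockwise_4bit(q, s, w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```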
I did not compare latency with fp32/fp16, but int4 should work for the Olive whisper example; you can just replace the int8 dynamic quantization pass with the config above. Roughly (a sketch follows the list):
- run the whisper prepare-config step to generate the fp32/fp16/int8 configs;
- pick one config and add the int4 pass to it;
- run it with Olive as the whisper README shows.
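A minimal sketch of the last two steps in Python. The config filenames and the key of the int8 pass inside the generated config are assumptions; adjust them to whatever the prepare step actually produced.

```python
import json

from olive.workflows import run as olive_run

# Hypothetical filename of a config produced by the whisper example.
with open("whisper_gpu_config.json") as f:
    config = json.load(f)

# Drop the int8 dynamic-quantization pass (its exact key depends on the
# generated config) and add the 4-bit pass from the thread above.
config["passes"].pop("onnx_dynamic_quantization", None)
config["passes"]["bnb_quantization"] = {
    "type": "OnnxBnb4Quantization",
    "config": {
        "quant_type": "fp4",
        "save_as_external_data": True,
        "all_tensors_to_one_file": True,
    },
}

with open("whisper_gpu_fp4.json", "w") as f:
    json.dump(config, f, indent=2)

# Roughly equivalent to the CLI: python -m olive.workflows.run --config whisper_gpu_fp4.json
olive_run("whisper_gpu_fp4.json")
```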
@trajepl Thanks, I will try it later.
> Hey, it works with a really great result: Whisper int4.
Do you mind pasting some perf numbers? Would love to see the exciting results. :)
We are currently adding other int4 support, which may help if you run into any performance issues.
Wow, great to see these big improvements!
I ran inference with the original model (GPU) using the command below. For a 5-minute audio clip, the speed is similar to that of the int4 ONNX model (GPU), while memory usage is about twice as much. Is there any problem with the int8 ONNX model (GPU) that prevents full use of GPU resources?
GPU inference (4090 utilization 90%):
time whisper 1.wav --language zh --model medium
real 0m28.003s
user 0m40.866s
sys 0m3.852s
Thanks for the comparison! Currently, int8 is not supported very well in onnxruntime on GPU, but fortunately that support is coming.
I do not know the detailed reason why int8 does not fully utilize the GPU, but I suspect some (quantization) operations are not optimized for GPU.
Also, we have had similar experiences before, which led to the following conventions (see the sketch after this list):
- For GPU, int4 and fp16 are better; avoid int8 quantization for now.
- For CPU, avoid fp16. For int4 there are several passes that can produce an int4 model, so we need to search for the best-performing one.
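As an illustration of the GPU-side advice, a pass entry along these lines converts the model to fp16; this mirrors the transformers-optimization pass used in the Olive whisper GPU configs, but the key name and exact options here are assumptions to check against the pass documentation.

```python
# Sketch of a "passes" entry for fp16 on GPU (key name is arbitrary; the
# "use_gpu"/"float16" options are assumed to match OrtTransformersOptimization).
fp16_pass = {
    "transformers_optimization": {
        "type": "OrtTransformersOptimization",
        "config": {
            "use_gpu": True,   # optimize for the CUDA execution provider
            "float16": True,   # convert the optimized graph to fp16
        },
    }
}
```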
Excellent! Will try to add an official example to Olive later.
The whisper variants I currently use are faster-whisper, whisper.cpp, and olive-whisper. Sorted by inference speed (GPU, NVIDIA 4090), the three are almost the same. Sorted by memory usage (from smallest to largest): faster-whisper -> whisper.cpp -> olive-whisper.
The memory footprint of whisper.cpp or faster-whisper is half that of olive-whisper.
@trajepl Is there a way to optimize the memory footprint of olive-whisper?
I prefer using the ONNX model and don't want to give up the Olive whisper model.
I am not sure how to optimize the memory footprint.
It might also not be a fair comparison: factors such as input data size, batch size, and memory caching can all affect the memory footprint.
Closing this issue since int4 quantization is already supported.
Memory footprint is out of scope for Olive and is handled by onnxruntime: https://github.com/microsoft/onnxruntime.
There have been multiple int4 improvements in ORT, so this might have already been addressed there.