Comments (5)
Act-order can be easily supported using the same trick as ExLlama. I did a few tests converting off-the-shelf GPTQ models to the gpt-fast format, using the sample code here:

python convert_hf_checkpoint.py --checkpoint_dir Llama-2-7B-GPTQ --model_name llama-7B

and got 193 tokens/s on a single A100, which is really impressive.
It should share the same group-size support and such; I'm not sure about activation order. One note: for 4-bit support we do require the weights to be packed in a certain order, so we will need to preprocess the weights a bit. I'll look into it.
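(For illustration only, not code from the thread.) A minimal sketch of the unpacking half of that preprocessing, assuming the common AutoGPTQ layout in which qweight is an int32 tensor of shape (in_features // 8, out_features) with eight 4-bit values packed per int32 along the input-channel dimension; the packing order gpt-fast expects isn't stated here, so the repacking step is left as a comment:

```python
import torch

def unpack_gptq_int4(qweight: torch.Tensor) -> torch.Tensor:
    """Unpack AutoGPTQ-style 4-bit weights to one row per input channel.

    qweight: int32 tensor of shape (in_features // 8, out_features),
             eight 4-bit values per int32 along dim 0.
    Returns an int32 tensor of shape (in_features, out_features),
    one quantized value (0..15) per element.
    """
    shifts = torch.arange(0, 32, 4, dtype=torch.int32, device=qweight.device)
    # Broadcast each packed word against all eight nibble positions:
    # (K//8, 1, N) >> (1, 8, 1) -> (K//8, 8, N), then mask the low nibble.
    nibbles = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF
    # Flatten so channel i = 8*p + j lands on row i.
    return nibbles.reshape(-1, qweight.shape[1])

# Repacking into gpt-fast's required order would go here; that layout is
# exactly what the comment above says still needs to be worked out.
```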
Test
Oh this is very interesting! Thanks for the ping. The performance numbers look very impressive! It's great to see the PyTorch team working on this.
And yes it'd be awesome if this could also support the thousands of existing GPTQ models out there.
All my recent GPTQs have act_order / desc_act (as AutoGPTQ calls it) enabled. In the early days of releasing models I also put out GPTQs without act_order, because back then some clients/libraries didn't support it in combination with group size, or had performance issues with it. But I stopped doing that 2-3 months ago.
This project reminds me of @turboderp's ExLlama and ExLlamaV2: they are PyTorch-only inference systems with highly optimised performance. Turboderp got act_order working with no performance drop, so it should definitely be possible.
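(A sketch for illustration, not code from ExLlama or gpt-fast.) The act_order trick referred to here is usually described as sorting input channels by their GPTQ group index once at load time and moving the inverse permutation onto the activations, so the matmul kernel never does a per-row g_idx lookup. Assuming the weight is already unpacked to one row per input channel, and with dequantize as a hypothetical placeholder:

```python
import torch

def fold_act_order(weight_rows: torch.Tensor, g_idx: torch.Tensor):
    """Sort input channels so quantization groups are contiguous.

    weight_rows: (in_features, out_features) quantized weight,
                 one row per input channel.
    g_idx:       (in_features,) group index per input channel;
                 non-monotonic when act_order is on.
    """
    perm = torch.argsort(g_idx)        # channels reordered by group
    return weight_rows[perm], perm     # groups now contiguous in memory

# At inference time the permutation applies to the activations instead
# (`dequantize` is a placeholder, not a real gpt-fast function):
#   y = x[:, perm] @ dequantize(sorted_rows, scales, zeros)
# which is correct because (x[:, perm] @ W[perm]) == x @ W.
```

The sort is a one-time cost at load, and x[:, perm] is a single gather per matmul, which would be consistent with the "no performance drop" observation above.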
Great to hear!
Related Issues (20)
- repeat sentence and non-complete sentence in the end
- 'Triton Error [CUDA]: device kernel image is invalid' while compiling
- 'Device-side assertions' error when speculative decoding with different length of prompts.
- Does `gpt-fast` work on V100 GPUs?
- TypeError: __init__() got an unexpected keyword argument 'mmap'
- Error when running convert_hf_checkpoint.py for TinyLlama-1.1B-intermediate-step-480k-1T
- Inference on a dataset instead of an individual prompt
- Code is extremely slow!
- torch.compile leads to OOM with different prompts.
- How is llama-7b trained, what is the verification accuracy?
- RuntimeError: CUDA error: named symbol not found
- Size mismatch error occurs when loading models quantized by GPTQ
- `eval.py` uses older version of lm_eval
- Can GPT-Fast support larger batch sizes
- I try to speed up with llava,but this it slower then eager mode,why?
- pass@1 score extremely low using GPT-fast API
- AssertionError: assert model_map_json.is_file()
- token/s speed
- Problem with NVLink setup
- Bandwidth achieved for INT8 is much smaller than FP16