Comments (8)
Same problem here as well.
@Narsil do you have any thoughts on why this could be happening?
@OlivierDehaene maybe you have some insights on this matter?
Thank you for your time!
Hey @Venkat2811, maybe you could enlighten me in this area? Would really appreciate it!
Hey @martinigoyanes,
Just taking a stab at the issue here. Without knowing the actual model config, here's the rough math, as per this article:
The amount of GPU memory consumed scales with the base model size + the length of the token sequence.
Base model size: 14 GB (7B params * 2 bytes in fp16); 80 GB - 14 GB = 66 GB available for inference.
During prefill: 53 requests * 4k seq_len * ~1 MB/token = 53 * 4 GB = 212 GB (per the article, 1 token requires ~1 MB).
Resulting in an error during prefill:
2024-04-30T10:35:01.733254Z ERROR prefill{id=0 size=53}:prefill{id=0 size=53}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.32 GiB. GPU 0 has a total capacty of 79.14 GiB of which 3.98 GiB is free. Process 2218444 has 75.15 GiB memory in use. Of the allocated memory 73.66 GiB is allocated by PyTorch, and 963.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
This worked for you:
- I have also tried with lower batch sizes: 50, 40, 32 and they also lead to an OOM error. I could get batch_size=2 to work though.
With 66 GB available for inference:
Mem of prefill + decode: 2 requests * (4k seq_len + 4k output_len) * ~1 MB/token = 2 * 8 GB = 16 GB.
Did you try batch sizes of 3 and 4? They should be possible, I think; a quick sketch of this arithmetic follows.
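As a minimal sketch of that back-of-envelope arithmetic (assuming the ~1 MB/token heuristic from the article, fp16 weights, and an 80 GB card; none of these numbers come from TGI itself):

```python
# Back-of-envelope GPU memory math using the ~1 MB/token heuristic.
# Assumptions (not measured from TGI): 7B params at fp16, 80 GB GPU,
# 1 GB = 1000 MB to match the round numbers in the thread.

GPU_MEM_GB = 80
WEIGHTS_GB = 7 * 2                 # 7B params * 2 bytes (fp16) ~= 14 GB
FREE_GB = GPU_MEM_GB - WEIGHTS_GB  # ~66 GB left for inference
MB_PER_TOKEN = 1                   # heuristic: 1 token ~ 1 MB

def batch_mem_gb(batch_size: int, tokens_per_request: int) -> float:
    """Rough memory needed to hold a batch's tokens, in GB."""
    return batch_size * tokens_per_request * MB_PER_TOKEN / 1000

print(batch_mem_gb(53, 4_000))  # 212.0 GB prefill -> OOM on a 66 GB budget
print(batch_mem_gb(2, 8_000))   # 16.0 GB for 4k prompt + 4k output -> fits
print(FREE_GB * 1000 // 8_000)  # ~8 full 8k-token requests fit on paper
```

On paper the heuristic allows roughly 8 concurrent 8k-token requests in 66 GB, so 3 or 4 should have headroom; in practice activations and allocator fragmentation push the real ceiling lower.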
- I have also tried to not bypass the router and send 60 concurrent requests with sequence+decode lengths = 8000, and this DOES WORK.
From the graph, it took ~10 minutes. If that was the time to complete inference for all 60 requests, I think the math above holds, i.e., concurrent processing of roughly 2-4 requests per batch.
I haven't looked into tgi_batch_current_max_tokens & MAX_BATCH_TOTAL_TOKENS yet.
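If it helps, here's a quick way to eyeball those gauges; this assumes a local TGI instance exposing its Prometheus metrics at /metrics (adjust host/port for your deployment):

```python
# Dump TGI's batch-related gauges from the Prometheus metrics endpoint.
# The URL is an assumption for a local deployment; change it as needed.
import urllib.request

with urllib.request.urlopen("http://localhost:8080/metrics") as resp:
    metrics = resp.read().decode()

for line in metrics.splitlines():
    # tgi_batch_current_max_tokens is the gauge mentioned above; print
    # every tgi_batch_* line to see the related counters too.
    if line.startswith("tgi_batch_"):
        print(line)
```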
Thanks for the response, yeah the math makes sense, I completely agree. My problem is that I also did the math, but TGI tells me MAX_BATCH_TOTAL_TOKENS=425472, and if you actually try to run that many tokens you get OOM errors. If 1 token ~ 1 MB, that implies I'd need ~425k MB, which is about 425 GB of free VRAM.
In theory, MAX_BATCH_TOTAL_TOKENS is the total number of tokens that can fit in a batch sent to the LLM. See text-generation-inference/router/src/queue.rs, line 185 at commit d348d2b.
Also for reference:
--max-batch-total-tokens <MAX_BATCH_TOTAL_TOKENS>
**IMPORTANT** This is one critical control to allow maximum usage of the available hardware.
This represents the total amount of potential tokens within a batch. When using padding (not recommended) this would be equivalent of `batch_size` * `max_total_tokens`.
However in the non-padded (flash attention) version this can be much finer.
For `max_batch_total_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens.
Overall this number should be the largest possible amount that fits the remaining memory (after the model is loaded). Since the actual memory overhead depends on other parameters like if you're using quantization, flash attention or the model implementation, text-generation-inference cannot infer this number automatically.
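To make the non-padded accounting concrete, here is a tiny illustration (my own sketch, not TGI's actual scheduler) of the budgeting rule those docs describe, where only the sum of each request's total tokens matters:

```python
# Flash-attention token budgeting as described in the docs above:
# a batch fits as long as the sum of each request's total_tokens
# (prompt + max new tokens) stays within max_batch_total_tokens.

def fits(max_batch_total_tokens: int, total_tokens_per_request: list[int]) -> bool:
    return sum(total_tokens_per_request) <= max_batch_total_tokens

print(fits(1000, [100] * 10))  # True: ten queries of total_tokens=100
print(fits(1000, [1000]))      # True: a single 1000-token query
print(fits(1000, [100] * 11))  # False: 1100 tokens exceed the budget
```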
Maybe related: #1286
+1