Comments (8)
Same problem here as well.
@Narsil do you have any thoughts on why this could be happening?
@OlivierDehaene maybe you have some insights on this matter?
Thank you for your time!
Hey @Venkat2811, maybe you could enlighten me in this area? Would really appreciate it!
Hey @martinigoyanes,
Just taking a stab at the issue here. Without knowing the actual model config, here's the rough math, as per this article:
The amount of GPU memory consumed scales with the base model size + the length of the token sequence.
Base model size: 14 GB (7B params * 2 bytes in fp16); 80 GB - 14 GB = 66 GB available for inference.
During prefill: 53 requests * 4k seq_len * ~1 MB/token = 53 * 4 GB = 212 GB (per the article, 1 token requires ~1 MB).
Resulting in an error during prefill:
2024-04-30T10:35:01.733254Z ERROR prefill{id=0 size=53}:prefill{id=0 size=53}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.32 GiB. GPU 0 has a total capacty of 79.14 GiB of which 3.98 GiB is free. Process 2218444 has 75.15 GiB memory in use. Of the allocated memory 73.66 GiB is allocated by PyTorch, and 963.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
This worked for you:
- I have also tried with lower batch sizes: 50, 40, 32 and they also lead to an OOM error. I could get batch_size=2 to work though.
With 66 GB available for inference:
Mem of prefill + decode: 2 requests * (4k seq_len + 4k output_len) * ~1 MB/token = 2 * 8 GB = 16 GB.
Did you try batch sizes of 3 and 4? They should be possible, I think; a quick sketch of this arithmetic follows.
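As a minimal sketch of that back-of-envelope arithmetic (assuming the ~1 MB/token heuristic from the article, fp16 weights, and an 80 GB card; none of these numbers come from TGI itself):

```python
# Back-of-envelope GPU memory math using the ~1 MB/token heuristic.
# Assumptions (not measured from TGI): 7B params at fp16, 80 GB GPU,
# 1 GB = 1000 MB to match the round numbers in the thread.

GPU_MEM_GB = 80
WEIGHTS_GB = 7 * 2                 # 7B params * 2 bytes (fp16) ~= 14 GB
FREE_GB = GPU_MEM_GB - WEIGHTS_GB  # ~66 GB left for inference
MB_PER_TOKEN = 1                   # heuristic: 1 token ~ 1 MB

def batch_mem_gb(batch_size: int, tokens_per_request: int) -> float:
    """Rough memory needed to hold a batch's tokens, in GB."""
    return batch_size * tokens_per_request * MB_PER_TOKEN / 1000

print(batch_mem_gb(53, 4_000))  # 212.0 GB prefill -> OOM on a 66 GB budget
print(batch_mem_gb(2, 8_000))   # 16.0 GB for 4k prompt + 4k output -> fits
print(FREE_GB * 1000 // 8_000)  # ~8 full 8k-token requests fit on paper
```

On paper the heuristic allows roughly 8 concurrent 8k-token requests in 66 GB, so 3 or 4 should have headroom; in practice activations and allocator fragmentation push the real ceiling lower.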
- I have also tried to not bypass the router and send 60 concurrent requests with sequence+decode lengths = 8000, and this DOES WORK.
From the graph, it took ~10 minutes. If that was the time to complete inference for all 60 requests, I think the math above holds, i.e., concurrent processing of roughly 2-4 requests per batch.
I haven't looked into tgi_batch_current_max_tokens & MAX_BATCH_TOTAL_TOKENS yet.
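If it helps, here's a quick way to eyeball those gauges; this assumes a local TGI instance exposing its Prometheus metrics at /metrics (adjust host/port for your deployment):

```python
# Dump TGI's batch-related gauges from the Prometheus metrics endpoint.
# The URL is an assumption for a local deployment; change it as needed.
import urllib.request

with urllib.request.urlopen("http://localhost:8080/metrics") as resp:
    metrics = resp.read().decode()

for line in metrics.splitlines():
    # tgi_batch_current_max_tokens is the gauge mentioned above; print
    # every tgi_batch_* line to see the related counters too.
    if line.startswith("tgi_batch_"):
        print(line)
```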
Thanks for the response, yeah the math makes sense, I completely agree. My problem is that I also did the math, but TGI tells me MAX_BATCH_TOTAL_TOKENS=425472, and if you actually try to run that many tokens you get OOM errors. If 1 token ~ 1 MB, that implies I'd need ~425k MB, which is about 425 GB of free VRAM.
In theory, MAX_BATCH_TOTAL_TOKENS is the total number of tokens that can fit in a batch sent to the LLM. See text-generation-inference/router/src/queue.rs, line 185 at commit d348d2b.
Also for reference:
--max-batch-total-tokens <MAX_BATCH_TOTAL_TOKENS>
**IMPORTANT** This is one critical control to allow maximum usage of the available hardware.
This represents the total amount of potential tokens within a batch. When using padding (not recommended) this would be equivalent of `batch_size` * `max_total_tokens`.
However in the non-padded (flash attention) version this can be much finer.
For `max_batch_total_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens.
Overall this number should be the largest possible amount that fits the remaining memory (after the model is loaded). Since the actual memory overhead depends on other parameters like if you're using quantization, flash attention or the model implementation, text-generation-inference cannot infer this number automatically.
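To make the non-padded accounting concrete, here is a tiny illustration (my own sketch, not TGI's actual scheduler) of the budgeting rule those docs describe, where only the sum of each request's total tokens matters:

```python
# Flash-attention token budgeting as described in the docs above:
# a batch fits as long as the sum of each request's total_tokens
# (prompt + max new tokens) stays within max_batch_total_tokens.

def fits(max_batch_total_tokens: int, total_tokens_per_request: list[int]) -> bool:
    return sum(total_tokens_per_request) <= max_batch_total_tokens

print(fits(1000, [100] * 10))  # True: ten queries of total_tokens=100
print(fits(1000, [1000]))      # True: a single 1000-token query
print(fits(1000, [100] * 11))  # False: 1100 tokens exceed the budget
```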
Maybe related: #1286
+1