Comments (8)

fhkingma commented on June 11, 2024

Same problem here as well.

martinigoyanes commented on June 11, 2024

@Narsil do you have any thoughts on why this could be happening?

martinigoyanes commented on June 11, 2024

@OlivierDehaene maybe you have some insights on this matter?

Thank you for your time!

martinigoyanes commented on June 11, 2024

Hey @Venkat2811, maybe you could enlighten me on this area? I would really appreciate it!

Venkat2811 commented on June 11, 2024

Hey @martinigoyanes,

Just taking a stab at the issue here. Without knowing the actual model config, going by this article:
The amount of GPU memory consumed scales with the base model size + the length of the token sequence.

Base model size: ~14 GB (7B params * 2 bytes in fp16), so 80 GB - 14 GB = 66 GB is left for inference.
During prefill: 53 requests * 4k seq_len = 53 * 4 * 1 GB = 212 GB (per the article, 1 token requires ~1 MB).

Resulting in an error during prefill:

2024-04-30T10:35:01.733254Z ERROR prefill{id=0 size=53}:prefill{id=0 size=53}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.32 GiB. GPU 0 has a total capacty of 79.14 GiB of which 3.98 GiB is free. Process 2218444 has 75.15 GiB memory in use. Of the allocated memory 73.66 GiB is allocated by PyTorch, and 963.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This is consistent with what worked for you:

  • I have also tried with lower batch sizes: 50, 40, 32 and they also lead to an OOM error. I could get batch_size=2 to work though.

With 66 GB available for inference:
Memory for prefill + decode at batch size 2: 2 * (4k seq_len + 4k output_len) = 2 * 8 * 1 GB = 16 GB.

Did you try batch sizes of 3 and 4? I think those should fit as well; see the rough sketch below.
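
For concreteness, a rough back-of-envelope sketch of the arithmetic above. This is a sketch only: it assumes the article's ~1 MB-of-KV-cache-per-token rule of thumb, an fp16 7B model, and an 80 GB GPU, and it ignores activation and CUDA overhead, so treat the numbers as rough estimates:

  # Rough back-of-envelope memory estimate, not an exact accounting.
  GIB = 1024**3
  MIB_PER_TOKEN = 1024**2              # ~1 MB of KV cache per token (rule of thumb)

  model_weights = 7e9 * 2              # 7B params * 2 bytes (fp16) ~= 14 GB
  free_for_kv = 80 * GIB - model_weights

  def batch_kv_bytes(batch_size, seq_len, output_len=0):
      # KV cache for a whole batch: each request holds its prompt + generated tokens.
      return batch_size * (seq_len + output_len) * MIB_PER_TOKEN

  print(f"Free after weights:          {free_for_kv / GIB:.0f} GB")                      # ~67 GB
  print(f"Prefill of 53 x 4k tokens:   {batch_kv_bytes(53, 4096) / GIB:.0f} GB")          # 212 GB -> OOM
  print(f"Batch of 2, 4k in + 4k out:  {batch_kv_bytes(2, 4096, 4096) / GIB:.0f} GB")     # 16 GB -> fits
  print(f"Batch of 4, 4k in + 4k out:  {batch_kv_bytes(4, 4096, 4096) / GIB:.0f} GB")     # 32 GB -> fits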

  • I have also tried to not bypass the router and send 60 concurrent requests with sequence+decode lengths = 8000 and this DOES WORK,

From the graph, it took ~10 minutes. If that was the time to complete inference for all 60 requests, I think the above math holds, i.e. concurrent processing of roughly 2-4 requests per batch.

I haven't looked into tgi_batch_current_max_tokens & MAX_BATCH_TOTAL_TOKENS yet.

martinigoyanes commented on June 11, 2024

Thanks for the response. Yeah, the math makes sense, I completely agree. My problem is that I did the same math, and yet TGI reports MAX_BATCH_TOTAL_TOKENS = 425472; if you actually try to send that many tokens, you get OOM errors. If 1 token ~ 1 MB, that implies I would need ~425k MB, which is about 425 GB of free VRAM.

Whereas, in theory, MAX_BATCH_TOTAL_TOKENS is the total number of tokens that can fit in a single batch sent to the LLM.

Also for reference:

 --max-batch-total-tokens <MAX_BATCH_TOTAL_TOKENS>
          **IMPORTANT** This is one critical control to allow maximum usage of the available hardware.
          
          This represents the total amount of potential tokens within a batch. When using padding (not recommended) this would be equivalent of `batch_size` * `max_total_tokens`.
          
          However in the non-padded (flash attention) version this can be much finer.
          
          For `max_batch_total_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens.
          
          Overall this number should be the largest possible amount that fits the remaining memory (after the model is loaded). Since the actual memory overhead depends on other parameters like if you're using quantization, flash attention or the model implementation, text-generation-inference cannot infer this number automatically.
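
For 425472 tokens to genuinely fit in the ~66 GB left after the weights, the KV-cache cost per token would have to be far below 1 MB, which is possible for some architectures (e.g. grouped-query attention) but not others. Here is a rough sketch of that arithmetic; the model config numbers are placeholders for a 7B GQA model, not my actual deployment, and this is not how TGI itself computes the budget:

  # Sketch: per-token KV-cache cost implied by the advertised budget, versus
  # what a placeholder 7B grouped-query-attention config would actually need.
  GIB = 1024**3

  free_for_kv = 80 * GIB - 7e9 * 2                 # ~66 GB left after fp16 7B weights
  advertised_budget = 425_472                      # MAX_BATCH_TOTAL_TOKENS reported by TGI

  implied_per_token = free_for_kv / advertised_budget
  print(f"Implied KV cost per token: {implied_per_token / 1024:.0f} KiB")    # ~165 KiB, well below 1 MiB

  def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_el=2):
      # Keys + values, for every layer, stored in fp16.
      return 2 * n_layers * n_kv_heads * head_dim * bytes_per_el

  # Placeholder config: 32 layers, 8 KV heads (GQA), head_dim 128 -> 128 KiB/token.
  per_token = kv_bytes_per_token(32, 8, 128)
  print(f"Per-token KV cache for that config: {per_token / 1024:.0f} KiB")
  print(f"Tokens that would fit: {free_for_kv / per_token:,.0f}")            # ~550k

If the real per-token cost is closer to the 1 MB rule of thumb (e.g. full multi-head attention), the same 425472-token budget would need several hundred GB, which would line up with the OOMs I am seeing.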

fxmarty commented on June 11, 2024

maybe related #1286

fxmarty commented on June 11, 2024

+1
