Comments (5)
Could you please explain why this happens? @OlivierDehaene @Narsil Thank you!
@martinigoyanes Could you share the TGI startup logs for serving Mistral 7B? Are you sure no other process is running and using the GPU? `nvtop` is super useful to see more details of the processes using the GPU.
Update: Apologies, I misread the question and assumed the KV cache was already taken into account. See: #1863 (comment)
Hello @Venkat2811, I do not have access to the logs right now, but I am confident no other processes were running on that GPU; it was reserved for serving TGI.

I think I found what "the problem" is. Basically, when TGI warms up the model, it allocates memory for the KV cache, so those 67 GB of GPU memory come from the ~14 GB of Mistral 7B weights plus the remaining gigabytes allocated for the KV cache.

This is the piece of code I am referring to, from `flash_causal_lm.py`:
```python
try:
    cache_manager = set_cache_manager(
        batch.blocks,
        self.num_layers,
        self.num_kv_heads,
        self.head_size,
        self.sliding_window is not None,
        self.dtype,
        self.device,
    )
    max_bt = batch.max_blocks
    max_s = max_bt * get_cache_manager().block_size
    _, batch, _ = self.generate_token(batch)
except torch.cuda.OutOfMemoryError as e:
    raise RuntimeError(
        f"Not enough memory to handle {len(batch.input_ids)} prefill tokens. "
        f"You need to decrease `--max-batch-prefill-tokens`"
    ) from e

...

# Inspired by the original implementation in [vllm](https://github.com/vllm-project/vllm)
# Calculate the number of blocks that can be allocated with the free memory
dtype_size = torch.tensor([], dtype=self.dtype).element_size()
cache_block_size = BLOCK_SIZE * self.num_kv_heads * self.head_size
total_cache_size = self.num_layers * cache_block_size * 2 * dtype_size

if IS_CUDA_SYSTEM or IS_ROCM_SYSTEM:
    total_free_memory, _ = torch.cuda.mem_get_info(self.device)
    total_gpu_memory = torch.cuda.get_device_properties(
        self.device
    ).total_memory
    free_memory = max(
        0, total_free_memory - (1 - MEMORY_FRACTION) * total_gpu_memory
    )
elif IS_XPU_SYSTEM:
    total_gpu_memory = torch.xpu.get_device_properties(self.device).total_memory
    free_memory = int(total_gpu_memory * 0.5)
else:
    raise NotImplementedError("FlashModel is only available on GPU")

num_blocks = (
    # Leave 5% for some wiggle room
    int((free_memory * 0.95) // total_cache_size)
    # Add batch.blocks as we allocated it above, so it is included in the peak memory.
    + cache_manager.num_blocks
)

del batch
del cache_manager

set_cache_manager(
    num_blocks,
    self.num_layers,
    self.num_kv_heads,
    self.head_size,
    self.sliding_window is not None,
    self.dtype,
    self.device,
)

if CUDA_GRAPHS:
    try:
        logger.info(f"Cuda Graphs are enabled for sizes {CUDA_GRAPHS}")
        # Warmup cuda graphs
        for bs in CUDA_GRAPHS:
            if self.speculate is None or self.speculate + 1 <= bs:
                self.cuda_graph_warmup(bs, max_s, max_bt)
    except torch.cuda.OutOfMemoryError:
        logger.exception(f"Decode cuda graph warmup failed")
else:
    logger.info(f"Cuda Graphs are disabled (CUDA_GRAPHS={CUDA_GRAPHS}).")

return int(num_blocks * BLOCK_SIZE)
```
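To sanity-check this, here is a back-of-the-envelope version of the same arithmetic. This is a sketch, not TGI code: the model dimensions are Mistral 7B's published config, while the 80 GB card and the fp16 weight footprint are my assumptions, and the exact total also depends on `MEMORY_FRACTION`.

```python
# Back-of-the-envelope sketch (not TGI code): how much KV cache fits
# next to Mistral 7B's weights. Model dimensions are from the Mistral 7B
# config; the 80 GB GPU and the fp16 weight size are assumptions.
dtype_size = 2        # fp16/bf16: 2 bytes per element
num_layers = 32
num_kv_heads = 8      # grouped-query attention
head_size = 128
BLOCK_SIZE = 16       # tokens per cache block, as in the code above

# Bytes for one block: BLOCK_SIZE tokens * kv heads * head dim per layer,
# times num_layers, times 2 for the separate key and value tensors.
cache_block_size = BLOCK_SIZE * num_kv_heads * head_size
total_cache_size = num_layers * cache_block_size * 2 * dtype_size

weights = 14.5e9                 # ~7.2B params in fp16
free_memory = 80e9 - weights     # assuming an 80 GB GPU with nothing else resident
num_blocks = int((free_memory * 0.95) // total_cache_size)

print(total_cache_size)          # 2097152 bytes, i.e. 2 MiB per block
print(num_blocks)                # 29671 blocks
print(num_blocks * BLOCK_SIZE)   # 474736 = max_batch_total_tokens
```

So after loading the ~14 GB of weights, nearly all of the remaining free VRAM is reserved as KV cache blocks during warmup, which is why the process ends up occupying tens of GB beyond the model weights themselves.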
However, I am missing a couple of things:

- What is `BLOCK_SIZE` referring to? Why is it hardcoded to 16, and why is it used to scale all the calculations?
- What exactly is `total_cache_size`?
- Why is `max_batch_total_tokens` computed as `num_blocks * BLOCK_SIZE`? Is it because one block of memory is able to store 16 (`block_size`) tokens?
Hey @martinigoyanes,

> Basically, when TGI warms up the model, it allocates memory for the KV cache, so those 67 GB of GPU memory come from the ~14 GB of Mistral 7B weights plus the remaining gigabytes allocated for the KV cache.

Yes, as discussed in other threads, the KV cache is essential for inference; without it, there is no inference. So there is no problem here.

- KV cache memory calculation - #1831 (comment)
- #1875 (comment)

Would you consider closing this issue, as it's no longer an issue?
> However, I am missing a couple of things:
>
> - What is `BLOCK_SIZE` referring to? Why is it hardcoded to 16, and why is it used to scale all the calculations?
> - What exactly is `total_cache_size`?
> - Why is `max_batch_total_tokens` computed as `num_blocks * BLOCK_SIZE`? Is it because one block of memory is able to store 16 (`block_size`) tokens?
Regarding your questions, I don't know CUDA, but my high-level understanding is that it's related to efficient GPU computation (thread blocks, warps, threads). Depending on the underlying hardware architecture and the model architecture, `BLOCK_SIZE` needs to be adjusted for efficient GPU memory bandwidth and compute utilization. Maybe move this to Discussions?
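That said, from reading the vLLM-inspired code above, my understanding is that a "block" here is a page of the paged KV cache rather than a CUDA thread block: each block stores the key/value vectors for `BLOCK_SIZE` (16) tokens, `total_cache_size` is the number of bytes one such block occupies across all layers for both keys and values, and `max_batch_total_tokens = num_blocks * BLOCK_SIZE` because each block contributes exactly 16 token slots. A toy sketch of that accounting (illustrative only, not TGI's actual `CacheManager`):

```python
# Toy sketch of paged KV-cache accounting (illustrative only, not TGI's
# actual CacheManager). Each block is a fixed-size page that holds the
# key/value vectors for BLOCK_SIZE tokens, so the serving capacity in
# tokens is num_blocks * BLOCK_SIZE.
BLOCK_SIZE = 16

class ToyBlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self, num_tokens: int) -> list[int]:
        """Reserve enough whole blocks to hold num_tokens tokens."""
        needed = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        if needed > len(self.free_blocks):
            raise RuntimeError("out of KV cache blocks")
        blocks = self.free_blocks[:needed]
        self.free_blocks = self.free_blocks[needed:]
        return blocks

allocator = ToyBlockAllocator(num_blocks=29671)
seq_blocks = allocator.allocate(num_tokens=1000)
print(len(seq_blocks))        # 63 = ceil(1000 / 16)
print(29671 * BLOCK_SIZE)     # 474736 token slots in total
```

In the vLLM paper's framing, the block size is a trade-off: larger blocks waste more memory to internal fragmentation in each sequence's last, partially filled block, while smaller blocks make the attention kernels less efficient; 16 is the default the authors settled on.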
Thank you so much for your reply! I will close the issue and rely on discussion #1897.