Comments (5)
Could you please explain why this happens? @OlivierDehaene @Narsil Thank you!
@martinigoyanes Could you share the TGI startup logs for serving Mistral 7B? Are you sure no other process is running and using the GPU? `nvtop` is super useful to see more details of the processes using the GPU.
Update: Apologies, I misread the question and assumed the KV cache was already taken into account. See: #1863 (comment)
Hello @Venkat2811, I do not have access to the logs right now, but I am confident no other processes were running on that GPU; it was reserved for serving TGI.

I think I found what "the problem" is. Basically, when TGI warms up the model, it allocates memory for the KV cache, so those 67 GB of GPU memory come from the ~14 GB of Mistral 7B weights plus the remaining gigabytes allocated for the KV cache.

This is the piece of code I am referring to, from `flash_causal_lm.py`:
```python
try:
    cache_manager = set_cache_manager(
        batch.blocks,
        self.num_layers,
        self.num_kv_heads,
        self.head_size,
        self.sliding_window is not None,
        self.dtype,
        self.device,
    )
    max_bt = batch.max_blocks
    max_s = max_bt * get_cache_manager().block_size
    _, batch, _ = self.generate_token(batch)
except torch.cuda.OutOfMemoryError as e:
    raise RuntimeError(
        f"Not enough memory to handle {len(batch.input_ids)} prefill tokens. "
        f"You need to decrease `--max-batch-prefill-tokens`"
    ) from e

...

# Inspired by the original implementation in [vllm](https://github.com/vllm-project/vllm)
# Calculate the number of blocks that can be allocated with the free memory
dtype_size = torch.tensor([], dtype=self.dtype).element_size()
cache_block_size = BLOCK_SIZE * self.num_kv_heads * self.head_size
total_cache_size = self.num_layers * cache_block_size * 2 * dtype_size

if IS_CUDA_SYSTEM or IS_ROCM_SYSTEM:
    total_free_memory, _ = torch.cuda.mem_get_info(self.device)
    total_gpu_memory = torch.cuda.get_device_properties(
        self.device
    ).total_memory
    free_memory = max(
        0, total_free_memory - (1 - MEMORY_FRACTION) * total_gpu_memory
    )
elif IS_XPU_SYSTEM:
    total_gpu_memory = torch.xpu.get_device_properties(self.device).total_memory
    free_memory = int(total_gpu_memory * 0.5)
else:
    raise NotImplementedError("FlashModel is only available on GPU")

num_blocks = (
    # Leave 5% for some wiggle room
    int((free_memory * 0.95) // total_cache_size)
    # Add batch.blocks as we allocated it above, so it is included in the peak memory.
    + cache_manager.num_blocks
)

del batch
del cache_manager

set_cache_manager(
    num_blocks,
    self.num_layers,
    self.num_kv_heads,
    self.head_size,
    self.sliding_window is not None,
    self.dtype,
    self.device,
)

if CUDA_GRAPHS:
    try:
        logger.info(f"Cuda Graphs are enabled for sizes {CUDA_GRAPHS}")
        # Warmup cuda graphs
        for bs in CUDA_GRAPHS:
            if self.speculate is None or self.speculate + 1 <= bs:
                self.cuda_graph_warmup(bs, max_s, max_bt)
    except torch.cuda.OutOfMemoryError:
        logger.exception(f"Decode cuda graph warmup failed")
else:
    logger.info(f"Cuda Graphs are disabled (CUDA_GRAPHS={CUDA_GRAPHS}).")

return int(num_blocks * BLOCK_SIZE)
```
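To sanity-check this, here is a back-of-the-envelope version of the same arithmetic. This is a sketch, not TGI code: the model dimensions are Mistral 7B's published config, while the 80 GB card and the fp16 weight footprint are my assumptions, and the exact total also depends on `MEMORY_FRACTION`.

```python
# Back-of-the-envelope sketch (not TGI code): how much KV cache fits
# next to Mistral 7B's weights. Model dimensions are from the Mistral 7B
# config; the 80 GB GPU and the fp16 weight size are assumptions.
dtype_size = 2        # fp16/bf16: 2 bytes per element
num_layers = 32
num_kv_heads = 8      # grouped-query attention
head_size = 128
BLOCK_SIZE = 16       # tokens per cache block, as in the code above

# Bytes for one block: BLOCK_SIZE tokens * kv heads * head dim per layer,
# times num_layers, times 2 for the separate key and value tensors.
cache_block_size = BLOCK_SIZE * num_kv_heads * head_size
total_cache_size = num_layers * cache_block_size * 2 * dtype_size

weights = 14.5e9                 # ~7.2B params in fp16
free_memory = 80e9 - weights     # assuming an 80 GB GPU with nothing else resident
num_blocks = int((free_memory * 0.95) // total_cache_size)

print(total_cache_size)          # 2097152 bytes, i.e. 2 MiB per block
print(num_blocks)                # 29671 blocks
print(num_blocks * BLOCK_SIZE)   # 474736 = max_batch_total_tokens
```

So after loading the ~14 GB of weights, nearly all of the remaining free VRAM is reserved as KV cache blocks during warmup, which is why the process ends up occupying tens of GB beyond the model weights themselves.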
However, I am missing a couple of things:

- What is `BLOCK_SIZE` referring to? Why is it hardcoded to 16, and why is it used to scale all the calculations?
- What exactly is `total_cache_size`?
- Why is `max_batch_total_tokens` computed as `num_blocks * BLOCK_SIZE`? Is it because one block of memory is able to store 16 (`block_size`) tokens?
Hey @martinigoyanes,

> Basically, when TGI warms up the model, it allocates memory for the KV cache, so those 67 GB of GPU memory come from the ~14 GB of Mistral 7B weights plus the remaining gigabytes allocated for the KV cache.

Yes, as discussed in other threads, the KV cache is essential for inference; without it, there is no inference. So there is no problem here.

- KV cache memory calculation - #1831 (comment)
- #1875 (comment)

Would you consider closing this issue, as it's no longer an issue?
> However, I am missing a couple of things:
>
> - What is `BLOCK_SIZE` referring to? Why is it hardcoded to 16, and why is it used to scale all the calculations?
> - What exactly is `total_cache_size`?
> - Why is `max_batch_total_tokens` computed as `num_blocks * BLOCK_SIZE`? Is it because one block of memory is able to store 16 (`block_size`) tokens?
Regarding your questions, I don't know CUDA, but my high-level understanding is that it's related to efficient GPU computation (thread blocks, warps, threads). Depending on the underlying hardware architecture and the model architecture, `BLOCK_SIZE` needs to be adjusted for efficient GPU memory bandwidth and compute utilization. Maybe move this to Discussions?
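That said, from reading the vLLM-inspired code above, my understanding is that a "block" here is a page of the paged KV cache rather than a CUDA thread block: each block stores the key/value vectors for `BLOCK_SIZE` (16) tokens, `total_cache_size` is the number of bytes one such block occupies across all layers for both keys and values, and `max_batch_total_tokens = num_blocks * BLOCK_SIZE` because each block contributes exactly 16 token slots. A toy sketch of that accounting (illustrative only, not TGI's actual `CacheManager`):

```python
# Toy sketch of paged KV-cache accounting (illustrative only, not TGI's
# actual CacheManager). Each block is a fixed-size page that holds the
# key/value vectors for BLOCK_SIZE tokens, so the serving capacity in
# tokens is num_blocks * BLOCK_SIZE.
BLOCK_SIZE = 16

class ToyBlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self, num_tokens: int) -> list[int]:
        """Reserve enough whole blocks to hold num_tokens tokens."""
        needed = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        if needed > len(self.free_blocks):
            raise RuntimeError("out of KV cache blocks")
        blocks = self.free_blocks[:needed]
        self.free_blocks = self.free_blocks[needed:]
        return blocks

allocator = ToyBlockAllocator(num_blocks=29671)
seq_blocks = allocator.allocate(num_tokens=1000)
print(len(seq_blocks))        # 63 = ceil(1000 / 16)
print(29671 * BLOCK_SIZE)     # 474736 token slots in total
```

In the vLLM paper's framing, the block size is a trade-off: larger blocks waste more memory to internal fragmentation in each sequence's last, partially filled block, while smaller blocks make the attention kernels less efficient; 16 is the default the authors settled on.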
Thank you so much for your reply! I will close the issue and rely on discussion #1897.