
Comments (5)

martinigoyanes commented on July 1, 2024

Could you please explain why this happens? @OlivierDehaene @Narsil Thank you!


Venkat2811 commented on July 1, 2024

@martinigoyanes Could you share the TGI startup logs for serving Mistral-7B? Are you sure no other process is running and using the GPU? nvtop is super useful for seeing more details of the processes using the GPU.

Update: Apologies, I misread the question and assumed the KV cache was already taken into account. See: #1863 (comment)


martinigoyanes commented on July 1, 2024

Hello @Venkat2811, I do not have access to the logs right now, but I am confident no other processes were running on that GPU; it was reserved for serving TGI.

I think I found what "the problem" is. Basically, when TGI warms up the model it allocates memory for the KV cache, so those 67 GB of memory come from the ~14 GB of Mistral-7B weights plus the memory allocated for the KV cache.
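As a rough sanity check (a minimal sketch; the Mistral-7B shapes below are assumptions taken from the public model config, not read from TGI), the per-token KV-cache footprint works out to roughly 128 KiB, so tens of GB of cache on top of the weights is expected:

    # Assumed Mistral-7B hyper-parameters (from the public config).
    num_layers = 32     # decoder layers
    num_kv_heads = 8    # grouped-query attention KV heads
    head_size = 128     # per-head dimension
    dtype_size = 2      # bytes per element for float16/bfloat16

    # K and V are both cached, for every layer, for every token.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_size * dtype_size
    print(bytes_per_token)                    # 131072 bytes ≈ 128 KiB per token

    # Memory figures from this issue: ~67 GB observed minus ~14 GB of weights.
    kv_cache_bytes = (67 - 14) * 1024**3
    print(kv_cache_bytes // bytes_per_token)  # ≈ 434k tokens fit in the cache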

This is the piece of code I refer to, from flash_causal_lm.py:

        try:
            cache_manager = set_cache_manager(
                batch.blocks,
                self.num_layers,
                self.num_kv_heads,
                self.head_size,
                self.sliding_window is not None,
                self.dtype,
                self.device,
            )
            max_bt = batch.max_blocks
            max_s = max_bt * get_cache_manager().block_size
            _, batch, _ = self.generate_token(batch)
        except torch.cuda.OutOfMemoryError as e:
            raise RuntimeError(
                f"Not enough memory to handle {len(batch.input_ids)} prefill tokens. "
                f"You need to decrease `--max-batch-prefill-tokens`"
            ) from e

        ...

        # Inspired by the original implementation in [vllm](https://github.com/vllm-project/vllm)
        # Calculate the number of blocks that can be allocated with the free memory
        dtype_size = torch.tensor([], dtype=self.dtype).element_size()
        cache_block_size = BLOCK_SIZE * self.num_kv_heads * self.head_size
        total_cache_size = self.num_layers * cache_block_size * 2 * dtype_size

        if IS_CUDA_SYSTEM or IS_ROCM_SYSTEM:
            total_free_memory, _ = torch.cuda.mem_get_info(self.device)
            total_gpu_memory = torch.cuda.get_device_properties(
                self.device
            ).total_memory

            free_memory = max(
                0, total_free_memory - (1 - MEMORY_FRACTION) * total_gpu_memory
            )
        elif IS_XPU_SYSTEM:
            total_gpu_memory = torch.xpu.get_device_properties(self.device).total_memory
            free_memory = int(total_gpu_memory * 0.5)
        else:
            raise NotImplementedError("FlashModel is only available on GPU")

        num_blocks = (
            # Leave 5% for some wiggle room
            int((free_memory * 0.95) // total_cache_size)
            # Add batch.blocks as we allocated it above, so it is included in the peak memory.
            + cache_manager.num_blocks
        )

        del batch
        del cache_manager

        set_cache_manager(
            num_blocks,
            self.num_layers,
            self.num_kv_heads,
            self.head_size,
            self.sliding_window is not None,
            self.dtype,
            self.device,
        )

        if CUDA_GRAPHS:
            try:
                logger.info(f"Cuda Graphs are enabled for sizes {CUDA_GRAPHS}")
                # Warmup cuda graphs
                for bs in CUDA_GRAPHS:
                    if self.speculate is None or self.speculate + 1 <= bs:
                        self.cuda_graph_warmup(bs, max_s, max_bt)
            except torch.cuda.OutOfMemoryError:
                logger.exception(f"Decode cuda graph warmup failed")
        else:
            logger.info(f"Cuda Graphs are disabled (CUDA_GRAPHS={CUDA_GRAPHS}).")

        return int(num_blocks * BLOCK_SIZE)
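Plugging concrete numbers into that last part (a minimal sketch; the Mistral-7B shapes and the ~53 GB free-memory figure are the same assumptions as above):

    BLOCK_SIZE = 16   # hard-coded constant in TGI

    # Same assumed Mistral-7B shapes as above.
    num_layers, num_kv_heads, head_size, dtype_size = 32, 8, 128, 2

    # Bytes needed to cache one block across all layers (K and V).
    cache_block_size = BLOCK_SIZE * num_kv_heads * head_size
    total_cache_size = num_layers * cache_block_size * 2 * dtype_size  # 2 MiB per block

    # Assumed: ~53 GB of GPU memory left free after weights and warmup.
    free_memory = 53 * 1024**3
    num_blocks = int((free_memory * 0.95) // total_cache_size)         # ≈ 25.8k blocks

    max_batch_total_tokens = num_blocks * BLOCK_SIZE                   # ≈ 413k tokens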

However, I am missing a couple things:

  • What is BLOCK_SIZE referring to? Why is it hardcoded to 16 and why is it used to scale all calculations?
  • What is exactly total_cache_size?
  • Why is max_batch_total_tokens computed as num_blocks * BLOCK_SIZE? Is it because one block of memory is able to store 16 (block_size) tokens?


Venkat2811 commented on July 1, 2024

Hey @martinigoyanes

Basically, when TGI warms up the model it allocates memory for the KV cache, so those 67 GB of memory come from the ~14 GB of Mistral-7B weights plus the memory allocated for the KV cache.

Yes, as discussed in other threads, the KV cache is essential for inference; without it there is no inference. So there is no problem here.

Would you consider closing this issue, as it's no longer an issue?

However, I am missing a couple things:

  • What is BLOCK_SIZE referring to? Why is it hardcoded to 16 and why is it used to scale all calculations?
  • What is exactly total_cache_size?
  • Why is max_batch_total_tokens computed as num_blocks * BLOCK_SIZE? Is it because one block of memory is able to store 16 (block_size) tokens?

Regarding your questions, I don't know CUDA, but my high-level understanding is that it's related to efficient GPU computation (thread blocks, warps, threads). Depending on the underlying hardware architecture and the model architecture, BLOCK_SIZE needs to be adjusted for efficient GPU memory bandwidth and compute utilization. Maybe move this to Discussions?


martinigoyanes commented on July 1, 2024

Thank you so much for your reply! I will close the issue and rely on discussion #1897.

