Comments (5)
Hi @emilwallner,
thanks for the extensive issue report.
My thoughts on this are:
- You're looking at the server after the crash, right? Meaning that the worker process has died, was restarted, and thus memory is back to normal.
- I can't find the line from your stack trace in your code, but I assume it's basically the next line after your code. detach() does not create a copy of the data, so you should still have only a single batch on the device.
- You're resizing the images with a resolution coming from the requests and then re-resizing the tensor in preprocess_and_stack_images to (3,768,768). Then you're stacking them along the channel dimension, creating e.g. (6,768,768), before you add a batch dimension with unsqueeze. Not sure about your model, but maybe it does something funky when it gets (1,6,768,768) instead of (2,3,768,768).
- What is your batch size? Did you try using batch_size=1 for some time?
- In the video there are multiple processes on the GPU. Do you use multiple workers for the same model?
That's all I have for now, but I'm happy to keep spitballing and iterating on this until you find a solution!
Best
Matthias
from serve.
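To make the third bullet concrete, here is a minimal sketch (shapes only, independent of the actual model) of how concatenating two 3-channel images along the channel dimension differs from stacking them into a batch:

```python
import torch

imgs = [torch.zeros(3, 768, 768), torch.zeros(3, 768, 768)]

# Concatenating along dim 0 merges the channel dimensions of both images,
# then unsqueeze adds a batch dimension of 1:
cat_batch = torch.cat(imgs, dim=0).unsqueeze(0)
print(cat_batch.shape)    # torch.Size([1, 6, 768, 768])

# Stacking introduces a new leading batch dimension instead:
stack_batch = torch.stack(imgs, dim=0)
print(stack_batch.shape)  # torch.Size([2, 3, 768, 768])
```

A model expecting 3-channel inputs would see one 6-channel sample in the first case but two 3-channel samples in the second.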
Really, really appreciate your input, @mreso!
- The worker crashes and returns 507 and doesn't recover.
- Yeah, I added detach() to make sure requires_grad is set to False.
- Yeah, that could be it
- I switched the batch size to 1 following your suggestion. I also check that the batch has the correct dtype and final shape.
- Yes, multiple workers per model.
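On the detach() point: it returns a view that shares storage with the original tensor, so it drops autograd tracking without copying any data. A quick sketch confirming both properties:

```python
import torch

x = torch.zeros(2, 3, 768, 768, requires_grad=True)
y = x.detach()

# detach() drops autograd tracking...
print(y.requires_grad)               # False
# ...but shares the same underlying storage, so no extra memory is used:
print(y.data_ptr() == x.data_ptr())  # True
```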
I also realized CUDA_LAUNCH_BLOCKING=1 reduces performance by about 70%, so I'll turn it off for now.
Here's my updated check:
def preprocess_and_stack_images(self, images):
    preprocessed_images = []
    for i, img in enumerate(images):
        try:
            preprocessed_img = self.resize_tensor(img)
            # Validate shape, value range, and dtype of each image
            if (
                preprocessed_img.shape != (3, 768, 768)
                or preprocessed_img.min() < 0
                or preprocessed_img.max() > 1
                or preprocessed_img.dtype != torch.float32
            ):
                # Log information about the image that doesn't meet the requirements
                logger.info(f"Image {i} does not meet the requirements. Replacing with a blank image.")
                preprocessed_img = torch.zeros((3, 768, 768))
        except Exception as e:
            # Log the error message and load a blank image
            logger.error(f"Error occurred while processing Image {i}: {str(e)}. Loading a blank image.")
            preprocessed_img = torch.zeros((3, 768, 768))
        preprocessed_images.append(preprocessed_img)
    images_batch = torch.stack(preprocessed_images, dim=0)
    if len(images_batch.shape) == 3:
        images_batch = images_batch.unsqueeze(0)
    # Final check: the batch must have shape (1, 3, 768, 768)
    if images_batch.shape != (1, 3, 768, 768):
        # Log information about the batch that doesn't meet the requirements
        logger.info(f"Batch shape {images_batch.shape} does not match the required shape (1, 3, 768, 768). Replacing with a blank batch.")
        images_batch = torch.zeros((1, 3, 768, 768))
    return images_batch
Again, really appreciate the brainstorming — let’s keep at it until we crack this!
Yeah, performance will suffer significantly with CUDA_LAUNCH_BLOCKING, as kernels will no longer run asynchronously. So only activate it when it's really necessary for debugging.
You could try to run the model in a notebook with a (1,6,768,768) input and observe the memory usage compared to (2,3,768,768). I'm wondering why this actually seems to work in the first place.
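As a starting point for that experiment (a sketch, no GPU required): the two input layouts occupy identical memory on their own, so any divergence observed in the notebook would come from the model's intermediate activations, not the inputs themselves.

```python
import torch

a = torch.zeros(1, 6, 768, 768)  # channels concatenated into one sample
b = torch.zeros(2, 3, 768, 768)  # a proper batch of two images

# Both layouts hold the same number of float32 elements, hence the same
# input memory; any difference in torch.cuda.memory_allocated() during a
# forward pass would come from per-layer activations.
bytes_a = a.numel() * a.element_size()
bytes_b = b.numel() * b.element_size()
print(bytes_a == bytes_b)  # True
```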
I haven’t tried the (1,6,768,768) input yet, but since our model is based on three channels, it should throw an error during execution.
Now, I double-check the size (1,3,768,768) and the dtype, and I ensure the values are in the correct range. Despite that, I'm still hitting a CUDA error: device-side assert triggered when moving the batch with images_batch = images_batch.to(self.device).detach().
Got any more suggestions on what might be causing this?
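That expectation can be checked without the full model: any first layer with in_channels=3 (the Conv2d below is a hypothetical stand-in, not the actual model) rejects a 6-channel input outright, so if a (1,6,768,768) batch ever ran to completion, something unexpected must have happened to the extra channels.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a model's first layer expecting 3 channels.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)

out = conv(torch.zeros(2, 3, 32, 32))  # proper batch of two 3-channel images
print(out.shape)                       # torch.Size([2, 8, 30, 30])

try:
    conv(torch.zeros(1, 6, 32, 32))    # channels merged into one sample
except RuntimeError as e:
    print("rejected:", type(e).__name__)
```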