
pytorch-memory-utils's Introduction

Pytorch-Memory-Utils

These utilities help you track GPU memory usage while training with PyTorch.

A blog post that explains this tool in detail: https://oldpan.me/archives/pytorch-gpu-memory-usage-track

Usage:

Put modelsize_estimate.py or gpu_mem_track.py under your current working directory and import them.

The following shows sample output.

  • Calculate the memory usage of a single model
Model Sequential : params: 0.450304M
Model Sequential : intermediate variables: 336.089600 M (without backward)
Model Sequential : intermediate variables: 672.179200 M (with backward)
  • Track the amount of GPU memory usage
# 30-Apr-21-20:25:29-gpu_mem_track.txt

GPU Memory Track | 30-Apr-21-20:25:29 | Total Tensor Used Memory:0.0    Mb Total Used Memory:0.0    Mb


At main.py line 10: <module>                          Total Tensor Used Memory:0.0    Mb Total Allocated Memory:0.0    Mb

+ | 1 * Size:(64, 64, 3, 3)       | Memory: 0.1406 M | <class 'torch.nn.parameter.Parameter'> | torch.float32
+ | 1 * Size:(128, 128, 3, 3)     | Memory: 0.5625 M | <class 'torch.nn.parameter.Parameter'> | torch.float32
+ | 1 * Size:(256, 128, 3, 3)     | Memory: 1.125 M | <class 'torch.nn.parameter.Parameter'> | torch.float32
+ | 1 * Size:(512, 256, 3, 3)     | Memory: 4.5 M | <class 'torch.nn.parameter.Parameter'> | torch.float32
+ | 3 * Size:(256, 256, 3, 3)     | Memory: 6.75 M | <class 'torch.nn.parameter.Parameter'> | torch.float32
+ | 8 * Size:(512,)               | Memory: 0.0156 M | <class 'torch.nn.parameter.Parameter'> | torch.float32
+ | 2 * Size:(64,)                | Memory: 0.0004 M | <class 'torch.nn.parameter.Parameter'> | torch.float32
+ | 7 * Size:(512, 512, 3, 3)     | Memory: 63.0 M | <class 'torch.nn.parameter.Parameter'> | torch.float32
+ | 4 * Size:(256,)               | Memory: 0.0039 M | <class 'torch.nn.parameter.Parameter'> | torch.float32
+ | 1 * Size:(128, 64, 3, 3)      | Memory: 0.2812 M | <class 'torch.nn.parameter.Parameter'> | torch.float32
+ | 2 * Size:(128,)               | Memory: 0.0009 M | <class 'torch.nn.parameter.Parameter'> | torch.float32
+ | 1 * Size:(64, 3, 3, 3)        | Memory: 0.0065 M | <class 'torch.nn.parameter.Parameter'> | torch.float32

At main.py line 12: <module>                          Total Tensor Used Memory:76.4   Mb Total Allocated Memory:76.4   Mb

+ | 1 * Size:(60, 3, 512, 512)    | Memory: 180.0 M | <class 'torch.Tensor'> | torch.float32
+ | 1 * Size:(40, 3, 512, 512)    | Memory: 120.0 M | <class 'torch.Tensor'> | torch.float32
+ | 1 * Size:(30, 3, 512, 512)    | Memory: 90.0 M | <class 'torch.Tensor'> | torch.float32

At main.py line 18: <module>                          Total Tensor Used Memory:466.4  Mb Total Allocated Memory:466.4  Mb

+ | 1 * Size:(120, 3, 512, 512)   | Memory: 360.0 M | <class 'torch.Tensor'> | torch.float32
+ | 1 * Size:(80, 3, 512, 512)    | Memory: 240.0 M | <class 'torch.Tensor'> | torch.float32

At main.py line 23: <module>                          Total Tensor Used Memory:1066.4 Mb Total Allocated Memory:1066.4 Mb

- | 1 * Size:(40, 3, 512, 512)    | Memory: 120.0 M | <class 'torch.Tensor'> | torch.float32
- | 1 * Size:(120, 3, 512, 512)   | Memory: 360.0 M | <class 'torch.Tensor'> | torch.float32

At main.py line 29: <module>                          Total Tensor Used Memory:586.4  Mb Total Allocated Memory:586.4  Mb

How to use

Track the amount of GPU memory usage

A simple example:

import torch

from torchvision import models
from gpu_mem_track import MemTracker

device = torch.device('cuda:0')

gpu_tracker = MemTracker()         # define a GPU tracker

gpu_tracker.track()                     # call between lines of code that use the GPU
cnn = models.vgg19(pretrained=True).features.to(device).eval()
gpu_tracker.track()                     # call between lines of code that use the GPU

dummy_tensor_1 = torch.randn(30, 3, 512, 512).float().to(device)  # 30*3*512*512*4/1024/1024 = 90.00M
dummy_tensor_2 = torch.randn(40, 3, 512, 512).float().to(device)  # 40*3*512*512*4/1024/1024 = 120.00M
dummy_tensor_3 = torch.randn(60, 3, 512, 512).float().to(device)  # 60*3*512*512*4/1024/1024 = 180.00M

gpu_tracker.track()

dummy_tensor_4 = torch.randn(120, 3, 512, 512).float().to(device)  # 120*3*512*512*4/1024/1024 = 360.00M
dummy_tensor_5 = torch.randn(80, 3, 512, 512).float().to(device)  # 80*3*512*512*4/1024/1024 = 240.00M

gpu_tracker.track()

dummy_tensor_4 = dummy_tensor_4.cpu()
dummy_tensor_2 = dummy_tensor_2.cpu()
gpu_tracker.clear_cache() # or torch.cuda.empty_cache()

gpu_tracker.track()

This writes a .txt file to the current directory; its content matches the sample output shown above.
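Estimate the memory usage of a single model

A minimal sketch for the model-size estimate shown at the top, assuming modelsize_estimate.py exposes a modelsize(model, input, type_size=4) helper (check the file for the exact signature):

import torch
from torchvision import models
from modelsize_estimate import modelsize

model = models.vgg16().features                 # any nn.Module
dummy_input = torch.randn(1, 3, 224, 224)
modelsize(model, dummy_input)                   # prints params and intermediate variable sizes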

FAQs

  1. Why is Total Tensor Used Memory much smaller than Total Allocated Memory?
  • Total Allocated Memory is the peak memory usage. When you delete tensors, PyTorch does not release the space back to the device until you call gpu_tracker.clear_cache(), as in the example script.

  • The CUDA kernels themselves take some space. See pytorch/pytorch#12873

  2. Why does Total Allocated Memory stay unchanged?
  • See Q1.
  3. I deleted some tensors. Why do they still appear in the tracker's output?
  • Make sure you have released all references to the tensor objects, then call "import gc; gc.collect()" to tell Python to collect the unreferenced tensors (see the sketch below).
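The snippet below illustrates answers 1 and 3 with plain PyTorch APIs; the exact numbers depend on your device and allocator state:

import gc
import torch

x = torch.randn(1024, 1024, device='cuda:0')   # ~4 MB of float32
print(torch.cuda.memory_allocated())           # tensor memory in use
print(torch.cuda.memory_reserved())            # memory held by the caching allocator

del x                                          # release the last reference
gc.collect()                                   # collect the unreferenced tensor
print(torch.cuda.memory_allocated())           # drops back down
print(torch.cuda.memory_reserved())            # unchanged: still cached ...

torch.cuda.empty_cache()                       # ... until the cache is cleared
print(torch.cuda.memory_reserved())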

References

Part of the code is referenced from:

http://jacobkimmel.github.io/pytorch_estimating_model_size/
https://gist.github.com/MInner/8968b3b120c95d3f50b8a22a74bf66bc

pytorch-memory-utils's People

Contributors

feifeibear, hzhwcmhf, oldpan


pytorch-memory-utils's Issues

Arguments of gpu_profile()

In track_mem_use.py.py, what is the frame input argument of gpu_profile()? What should I pass in?

.txt in Ubuntu 16.04

Hi, why can't I find the output .txt file? It isn't in the current directory of my code.
I ran your example code on Ubuntu 16.04.

Is the value different from nvidia-smi?

Hi, I use the following to track my GPU usage:

from gpu_mem_track import MemTracker

gpu_tracker = MemTracker()
gpu_tracker.track()
model = Model()
model.to("cuda:0")
gpu_tracker.track()

The tracker tells me the total GPU memory used is 109 Mb.

But when I use nvidia-smi, I find that this process uses 499 MB.

Why are they different?
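Yes, a difference is expected: nvidia-smi reports the whole process, including the CUDA context PyTorch creates on first GPU use (often several hundred MB), while the tracker only counts tensors (see FAQ Q1 above). A quick way to see the context overhead:

import torch

torch.cuda.init()                        # creates the CUDA context; nvidia-smi jumps
print(torch.cuda.memory_allocated())     # 0 -- no tensors exist yet
x = torch.randn(1, device='cuda:0')
print(torch.cuda.memory_allocated())     # a few hundred bytes, far below nvidia-smi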

How to check the memory of the others?

My code runs successfully during the first few epochs but then suddenly OOMs. I added your code to check the tensor memory in every epoch. The tensors seem to have little influence on memory, but the memory still grows, and I don't know which part takes most of it.
I use gpu_tracker.track() at the beginning of each epoch, like:
for epoch in range(self.nepoch):
    # a = torch.cuda.memory_allocated(device=0)
    # print(a)
    gpu_tracker.track()
    for i in range(0, self.ntrain, self.batch_size):

The .txt is as follows:

GPU Memory Track | 16-Nov-18-14:29:37 | Total Used Memory:1038.09024Mb

  • | 1 * Size:(7057, 2048) | Memory: 57.810944 M | <class 'torch.Tensor'>
  • | 1 * Size:(5, 624) | Memory: 0.01248 M | <class 'torch.Tensor'>
  • | 2 * Size:(1024, 624) | Memory: 5.111808 M | <class 'torch.Tensor'>
  • | 6 * Size:(150,) | Memory: 0.0036 M | <class 'torch.nn.parameter.Parameter'>
  • | 6 * Size:(150,) | Memory: 0.0036 M | <class 'torch.Tensor'>
  • | 1 * Size:(250,) | Memory: 0.001 M | <class 'torch.Tensor'>
  • | 20 * Size:(1024,) | Memory: 0.08192 M | <class 'torch.nn.parameter.Parameter'>
  • | 5 * Size:(200,) | Memory: 0.004 M | <class 'torch.Tensor'>
  • | 2 * Size:(1024, 2048) | Memory: 16.777216 M | <class 'torch.nn.parameter.Parameter'>
  • | 1 * Size:(200, 1024) | Memory: 0.8192 M | <class 'torch.nn.parameter.Parameter'>
  • | 2 * Size:(5, 312) | Memory: 0.01248 M | <class 'torch.Tensor'>
  • | 1 * Size:(200, 312) | Memory: 0.2496 M | <class 'torch.Tensor'>
  • | 5 * Size:(200,) | Memory: 0.004 M | <class 'torch.nn.parameter.Parameter'>
  • | 1 * Size:(5,) | Memory: 2e-05 M | <class 'torch.Tensor'>
  • | 2 * Size:(2048, 1024) | Memory: 16.777216 M | <class 'torch.Tensor'>
  • | 1 * Size:(1, 312) | Memory: 0.001248 M | <class 'torch.Tensor'>
  • | 1 * Size:(7307,) | Memory: 0.029228 M | <class 'torch.Tensor'>
  • | 1 * Size:(150, 1024) | Memory: 0.6144 M | <class 'torch.nn.parameter.Parameter'>
  • | 1 * Size:(250, 2048) | Memory: 2.048 M | <class 'torch.Tensor'>
  • | 1 * Size:(1764,) | Memory: 0.007056 M | <class 'torch.Tensor'>
  • | 2 * Size:(7057,) | Memory: 0.056456 M | <class 'torch.Tensor'>
  • | 1 * Size:(5, 2048) | Memory: 0.04096 M | <class 'torch.Tensor'>
  • | 1 * Size:(50,) | Memory: 0.0002 M | <class 'torch.Tensor'>
  • | 1 * Size:() | Memory: 4e-06 M | <class 'torch.Tensor'>
  • | 1 * Size:(2967,) | Memory: 0.011868 M | <class 'torch.Tensor'>
  • | 20 * Size:(1024,) | Memory: 0.08192 M | <class 'torch.Tensor'>
  • | 2 * Size:(1024, 624) | Memory: 5.111808 M | <class 'torch.nn.parameter.Parameter'>
  • | 1 * Size:(7307, 2048) | Memory: 59.858944 M | <class 'torch.Tensor'>
  • | 1 * Size:(2967, 2048) | Memory: 24.305664 M | <class 'torch.Tensor'>
  • | 10 * Size:(2048,) | Memory: 0.08192 M | <class 'torch.nn.parameter.Parameter'>
  • | 2 * Size:(2048, 1024) | Memory: 16.777216 M | <class 'torch.nn.parameter.Parameter'>
  • | 10 * Size:(2048,) | Memory: 0.08192 M | <class 'torch.Tensor'>
  • | 1 * Size:(1764, 2048) | Memory: 14.450688 M | <class 'torch.Tensor'>

At classifier_zsl : line 44 Total Used Memory:1038.09024Mb

GPU Memory Track | 16-Nov-18-14:29:37 | Total Used Memory:1400.897536Mb

  • | 1 * Size:(64, 2048) | Memory: 0.524288 M | <class 'torch.Tensor'>
  • | 2 * Size:(7307,) | Memory: 0.058456 M | <class 'torch.Tensor'>
  • | 2 * Size:() | Memory: 8e-06 M | <class 'torch.Tensor'>
  • | 1 * Size:(64, 200) | Memory: 0.0512 M | <class 'torch.Tensor'>
  • | 2 * Size:(7307, 2048) | Memory: 119.717888 M | <class 'torch.Tensor'>
  • | 2 * Size:(64,) | Memory: 0.000512 M | <class 'torch.Tensor'>
  • | 1 * Size:(7307, 2048) | Memory: 59.858944 M | <class 'torch.Tensor'>
  • | 1 * Size:() | Memory: 4e-06 M | <class 'torch.Tensor'>
  • | 1 * Size:(7307,) | Memory: 0.029228 M | <class 'torch.Tensor'>

At classifier_zsl : line 44 Total Used Memory:1400.897536Mb

GPU Memory Track | 16-Nov-18-14:29:37 | Total Used Memory:1461.714944Mb

At classifier_zsl : line 44 Total Used Memory:1461.714944Mb

GPU Memory Track | 16-Nov-18-14:29:37 | Total Used Memory:1573.781504Mb

At classifier_zsl : line 44 Total Used Memory:1573.781504Mb

GPU Memory Track | 16-Nov-18-14:29:37 | Total Used Memory:1695.41632Mb

At classifier_zsl : line 44 Total Used Memory:1695.41632Mb

GPU Memory Track | 16-Nov-18-14:29:37 | Total Used Memory:1817.051136Mb

At classifier_zsl : line 44 Total Used Memory:1817.051136Mb

GPU Memory Track | 16-Nov-18-14:29:38 | Total Used Memory:1938.685952Mb

At classifier_zsl : line 44 Total Used Memory:1938.685952Mb

What else takes the majority of the memory?
Best wishes!
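One hedged way to localize growth like this, without per-tensor detail, is to log the allocated and reserved deltas once per epoch (a sketch; nepoch and train_one_epoch are placeholders for your own loop):

import torch

prev = torch.cuda.memory_allocated()
for epoch in range(nepoch):                     # placeholder for self.nepoch
    train_one_epoch()                           # placeholder for the inner batch loop
    cur = torch.cuda.memory_allocated()
    res = torch.cuda.memory_reserved()
    print(f'epoch {epoch}: +{(cur - prev) / 1024**2:.1f} MB allocated, '
          f'{res / 1024**2:.1f} MB reserved')
    prev = cur

A steadily growing allocated number usually means some tensor is retained across epochs, a classic case being a loss accumulated with its graph (total += loss instead of total += loss.item()).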

No detailed use info, only Total Tensor Used Memory

Hi, I'm using your code with:
torch 1.10.0+cu113

I used the example code as follows:

import torch
import inspect

from torchvision import models
from gpu_mem_track import MemTracker  # import the GPU memory tracker

device = torch.device('cuda:0')

frame = inspect.currentframe()     
gpu_tracker = MemTracker(frame)      # create the memory tracker object

gpu_tracker.track()                  # start tracking


dummy_tensor_1 = torch.randn(30, 3, 512, 512).float().to(device)  # 30*3*512*512*4/1000/1000 = 94.37M
dummy_tensor_2 = torch.randn(40, 3, 512, 512).float().to(device)  # 40*3*512*512*4/1000/1000 = 125.82M
dummy_tensor_3 = torch.randn(60, 3, 512, 512).float().to(device)  # 60*3*512*512*4/1000/1000 = 188.74M

gpu_tracker.track()                  # track again

and got the following output in the .txt file:

GPU Memory Track | 27-Oct-21-18:37:56 | Total Tensor Used Memory:0.0 Mb Total Allocated Memory:0.0 Mb
At run.py line 12: Total Tensor Used Memory:0.0 Mb Total Allocated Memory:0.0 Mb
At run.py line 19: Total Tensor Used Memory:390.0 Mb Total Allocated Memory:390.0 Mb

As you mentioned in the Readme.md, there should be detailed information about each tensor's GPU usage (lines beginning with '+').

What should I do?
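Depending on your copy of gpu_mem_track.py, the per-tensor lines may be gated behind a flag: the constructor signature quoted in another issue here is __init__(self, frame, detail, path, verbose, device). One thing to try (an assumption, not a confirmed fix):

gpu_tracker = MemTracker(frame, detail=True)   # assumption: detail=True enables the per-tensor '+' lines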

A question about the use of the code you provide

Thank you for the code.
Can this monitor the change in GPU memory in real time while the program is running? If I want to measure the memory of a model, do I have to create a new .py file and import the model into it? Or do I add the statements from your example directly into the part of the program I want to monitor?
Thank you.

memory info doesn't match

The previously displayed total used memory plus the tensor memory listed in between does not equal the following total used memory.

'__file__' is incompatible with interactive Python

When executing MemTracker(frame) in ipython, it throws a KeyError:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-8-fcbd53b39126> in <module>
----> 1 gpu_tracker = MemTracker(frame)

./gpu_mem_track.py in __init__(self, frame, detail, path, verbose, device)
     27
     28         self.func_name = frame.f_code.co_name
---> 29         self.filename = frame.f_globals["__file__"]
     30         if (self.filename.endswith(".pyc") or
     31                 self.filename.endswith(".pyo")):

KeyError: '__file__'

It seems that the member variable filename is never used, so these lines could safely be removed:

self.filename = frame.f_globals["__file__"]
if (self.filename.endswith(".pyc") or
        self.filename.endswith(".pyo")):
    self.filename = self.filename[:-1]
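If self.filename is kept, a hedged patch is to fall back when __file__ is absent (as it is in ipython) rather than raising:

self.filename = frame.f_globals.get("__file__", "<interactive>")  # avoids the KeyError in ipython
if (self.filename.endswith(".pyc") or
        self.filename.endswith(".pyo")):
    self.filename = self.filename[:-1]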

Always show the first gpu?

On a machine with multiple GPUs, how can I make the program show the memory of GPUs other than the first?
I have tried CUDA_VISIBLE_DEVICES and os.environ['VCUDA_VISIBLE_DEVICES'], but they had no effect.
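Two things worth checking. First, os.environ['VCUDA_VISIBLE_DEVICES'] has a stray leading 'V'; the variable is CUDA_VISIBLE_DEVICES, and it must be set before the first CUDA call. Second, the MemTracker constructor quoted in another issue accepts a device argument, which may select the GPU to report (an assumption; verify against your copy of gpu_mem_track.py). A sketch:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'   # must be set before torch touches CUDA
import torch
from gpu_mem_track import MemTracker

gpu_tracker = MemTracker(device=0)          # cuda:0 now maps to physical GPU 1
gpu_tracker.track()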

Can you add the feature of printing variable's name

Good job on this project!
Could you add a feature that prints each variable's name? (Intermediate variables do not belong to the network architecture.) When I want to track how my GPU memory is used, the tool clearly shows how much memory I've used between lines; printing variable names would be even more helpful!

Best regards

A problem when training a network

Thank you for your script, it's really great work, and a great tool during the debugging phase! My network was very memory-intensive at the beginning, but I quickly located the problem with your script and optimized the memory usage. Thank you very much!!!

But there may be a problem: it seems this script can make network training slow. I forgot to disable it after debugging, which made training dozens of times slower than before.

Therefore, we need to turn off this script before actually starting the training task.
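A simple way to guarantee the tracker is off for real runs is to gate it behind a flag with a no-op stand-in (a sketch, not part of the library):

from gpu_mem_track import MemTracker

DEBUG_MEM = False                       # flip to True only while debugging memory

class NullTracker:
    def track(self):                    # same interface, does nothing
        pass

gpu_tracker = MemTracker() if DEBUG_MEM else NullTracker()
gpu_tracker.track()                     # free when DEBUG_MEM is False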

torch.short memory consumption

Hi there,
Thank you for your nice repo.
Just one simple question: in the dict that defines the memory consumption of the different PyTorch data types, I found that torch.short is 16/6.
But the short type should consume 2 bytes, right? Why isn't it 16/8 in this case?

Thank you,
Joee
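For reference, PyTorch can report element sizes directly, so any entry in that dict can be checked; torch.short is indeed 2 bytes, i.e. the entry should be 16/8:

import torch
print(torch.tensor([], dtype=torch.short).element_size())    # 2
print(torch.tensor([], dtype=torch.float32).element_size())  # 4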

OSError

Hello, when I tested the example you offered, I got the following error message:

  File "G:/0_mycode/empty_pc/git/Empty/run_test.py", line 125, in <module>
    gpu_tracker.track()
  File "G:\0_mycode\empty_pc\git\Empty\gpu_mem_track.py", line 89, in track
    with open(self.gpu_profile_fn, 'a+') as f:
OSError: [Errno 22] Invalid argument: '04-Oct-22-14:23:54-gpu_mem_track.txt'

Could you please help me?
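The ':' characters in the timestamp are the problem: colons are not allowed in Windows filenames (the G:\ path shows this run is on Windows). A hedged workaround, assuming the filename in gpu_mem_track.py is built with strftime, is to switch to a colon-free format:

import time
# e.g. '04-Oct-22-14-23-54-gpu_mem_track.txt' instead of '04-Oct-22-14:23:54-...'
gpu_profile_fn = time.strftime('%d-%b-%y-%H-%M-%S') + '-gpu_mem_track.txt'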

How to use it in another .py file?

Hi, I use your code and it runs successfully, but I can't get any result when I use it in a sub-module: it prints nothing when the function is called, e.g.:

def dis_update(self, x_a, x_b, hyperparameters):
        gpu_tracker.track()
        self.dis_opt.zero_grad()
        x_a_1,x_a_2,h_a, n_a = self.gen_a.encode(x_a)
        x_b_1,x_b_2,h_b, n_b = self.gen_b.encode(x_b)
        # decode (cross domain)
        x_ba = self.gen_a.decode(x_b_1,x_b_2,h_b + n_b)
        x_ab = self.gen_b.decode(x_a_1,x_a_2,h_a + n_a)
        self.loss_dis_total = ......
        self.loss_dis_total.backward()
        self.dis_opt.step()
        gpu_tracker.track()

Can you help me with the problem?
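One way to make a sub-module log to the same tracker is to create it once at module scope and import that instance wherever track() is called (a sketch; the file and class names are illustrative):

# trackers.py -- hypothetical shared module
from gpu_mem_track import MemTracker
gpu_tracker = MemTracker()

# trainer.py
from trackers import gpu_tracker

class Trainer:
    def dis_update(self, x_a, x_b, hyperparameters):
        gpu_tracker.track()
        # ... forward/backward as above ...
        gpu_tracker.track()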

How to perform distributed graphics card memory tracking

Hello, I would like to ask how to track memory across distributed multi-GPU setups.
I am currently using wandb.

As described in the documentation, is it possible to use it this way?
device0 = torch.device('cuda:0')
device1 = torch.device('cuda:1')
device2 = torch.device('cuda:2')
device3 = torch.device('cuda:3')
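If your copy of MemTracker accepts the device argument seen in its constructor signature elsewhere in these issues, one hedged pattern is a tracker per device (an assumption that device selects which GPU is reported):

import torch
from gpu_mem_track import MemTracker

trackers = [MemTracker(device=i) for i in range(torch.cuda.device_count())]
for t in trackers:
    t.track()   # each tracker would log its own device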

Why **2

Hi,
I want to ask about your code:
new_tensor_sizes = {(type(x), tuple(x.size()), ts_list.count(x.size()), np.prod(np.array(x.size()))*4/1000**2)

What does the **2 operation mean?
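** is Python's exponentiation operator, so 1000**2 is 1,000,000: the expression multiplies the element count by 4 bytes (float32) and divides by 10^6 to get decimal megabytes. For example:

import numpy as np

size = (30, 3, 512, 512)
mb = np.prod(np.array(size)) * 4 / 1000**2   # 23,592,960 elements * 4 bytes = 94.37 MB
print(mb)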

The Total Used Memory stays unchanged among .py files

When I track the GPU usage, the used memory stays unchanged wherever I put gpu_tracker.track(). Here is the output for one .py file:

GPU Memory Track | 03-Nov-20-21:34:31 | Total Used Memory:972.7  Mb


At flcore.servers.serveravg __init__: line 23        Total Used Memory:972.7  Mb


At flcore.servers.serveravg __init__: line 30        Total Used Memory:972.7  Mb

In the other files, the output is still "972.7 Mb".
