
Comments (4)

kpertsch commented on August 23, 2024

The TFDS data loader dynamically allocates memory during training and we have observed similar "memory grows over time" behavior. Here are a few knobs you can turn to control memory usage:

  • Reduce data-loading parallelism by lowering traj_transform_threads, traj_read_threads, and num_parallel_calls inside frame_transform_kwargs (see examples/05_dataloading.ipynb for explanations, and the config sketch below) -- this reduces memory usage but may also slow training.
  • Remove eval during training -- running eval uses memory, and for some of our larger runs we disable it during training to avoid memory overflow and then run it post hoc to visualize the model's performance.
  • Reduce the size of the shuffle buffer -- this is the most direct way to cut memory usage, but be careful: too small a shuffle buffer hurts training dynamics, and we found models trained with a shuffle buffer of only 100k samples to perform worse.

All that being said, we train our vit_s models on a TPUv4-8 machine with 400GB of memory, so a machine with 512GB should be able to fit it; maybe start by disabling evaluation during training.
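
A minimal sketch of how these knobs could be set, assuming a dataset-kwargs dict like the ones Octo's train configs use (the exact nesting and the shuffle_buffer_size key name may differ in your config; see examples/05_dataloading.ipynb):

```python
# Hypothetical override dict -- key names follow the knobs mentioned above;
# check your own train config / examples/05_dataloading.ipynb for the exact nesting.
dataset_kwargs_overrides = dict(
    traj_transform_threads=8,     # fewer trajectory-transform threads -> less dynamic memory, possibly slower loading
    traj_read_threads=8,          # fewer parallel dataset readers
    frame_transform_kwargs=dict(
        num_parallel_calls=16,    # parallelism of per-frame transforms (resizing, augmentation, ...)
    ),
    shuffle_buffer_size=250_000,  # most direct memory lever; buffers as small as 100k were reported to perform worse
)
```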


yangsizhe commented on August 23, 2024

Hello, I have encountered a similar issue. I am training on four A100 GPUs using PyTorch's DataLoader (https://github.com/octo-models/octo/blob/main/examples/06_pytorch_oxe_dataloader.py) with a shuffle buffer size of 500K. I found that each process uses over 200GB of memory, and the total memory usage of the four processes exceeds 1TB, leading to out-of-memory errors. Do you have any experience with this? How can I resolve it?

Thanks in advance!


kpertsch commented on August 23, 2024

It's important that you keep the number of workers at 0, so that all the parallelism is handled by TFDS. Basically, you only want a single data-loading process (and shuffle buffer) per compute node, i.e. in your case only one. I'd expect total memory usage for a single loader with a 500k shuffle buffer not to exceed 300GB.
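
A minimal sketch of that setup, loosely following the wrapper approach in examples/06_pytorch_oxe_dataloader.py (the class and variable names here are illustrative; train_dataset is assumed to be the interleaved, already-shuffled tf.data.Dataset built by Octo's data pipeline):

```python
import torch

class TorchRLDSDataset(torch.utils.data.IterableDataset):
    """Thin wrapper that streams an already-shuffled TFDS/RLDS dataset into PyTorch."""

    def __init__(self, rlds_dataset):
        self._rlds_dataset = rlds_dataset  # tf.data.Dataset; TFDS handles all threading internally

    def __iter__(self):
        # Single-process iteration: the one shuffle buffer lives inside TFDS.
        yield from self._rlds_dataset.as_numpy_iterator()

loader = torch.utils.data.DataLoader(
    TorchRLDSDataset(train_dataset),
    batch_size=256,  # illustrative; batching can also be done inside the TFDS pipeline
    num_workers=0,   # crucial: no extra worker processes, so only one shuffle buffer per node
)
```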


yangsizhe commented on August 23, 2024

Thank you for your response!

I have one more question: I am using PyTorch's DistributedDataParallel for multi-GPU training. How can I do data loading in only one process? I noticed that in the DDP examples, each process has its own dataloader.


