
Comments (4)

kpertsch commented on August 23, 2024

The TFDS data loader dynamically allocates memory during training and we have observed similar "memory grows over time" behavior. Here are a few knobs you can turn to control memory usage:

  • Reduce data-loading parallelism by lowering traj_transform_threads, traj_read_threads, and num_parallel_calls inside frame_transform_kwargs (see examples/05_dataloading.ipynb for explanations, and the config sketch below) -- this reduces memory usage but may also slow training.
  • Remove eval during training -- running eval uses memory, and for some of our larger runs we disable it during training to avoid memory overflow and then run it post hoc to visualize the model's performance.
  • Reduce the size of the shuffle buffer -- this is the most direct way to cut memory usage, but be careful: too small a shuffle buffer hurts training dynamics, and we found models trained with a shuffle buffer of only 100k samples to perform worse.

All that being said, we train our vit_s models on a TPUv4-8 machine with 400GB of memory, so a machine with 512GB should be able to fit it; maybe start by disabling evaluation during training.
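
A minimal sketch of how these knobs could be set, assuming a dataset-kwargs dict like the ones Octo's train configs use (the exact nesting and the shuffle_buffer_size key name may differ in your config; see examples/05_dataloading.ipynb):

```python
# Hypothetical override dict -- key names follow the knobs mentioned above;
# check your own train config / examples/05_dataloading.ipynb for the exact nesting.
dataset_kwargs_overrides = dict(
    traj_transform_threads=8,     # fewer trajectory-transform threads -> less dynamic memory, possibly slower loading
    traj_read_threads=8,          # fewer parallel dataset readers
    frame_transform_kwargs=dict(
        num_parallel_calls=16,    # parallelism of per-frame transforms (resizing, augmentation, ...)
    ),
    shuffle_buffer_size=250_000,  # most direct memory lever; buffers as small as 100k were reported to perform worse
)
```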


yangsizhe commented on August 23, 2024

Hello, I have encountered a similar issue. I am training on four A100 GPUs using PyTorch's DataLoader (https://github.com/octo-models/octo/blob/main/examples/06_pytorch_oxe_dataloader.py) with a shuffle buffer size of 500K. I found that each process uses over 200GB of memory, and the total memory usage of the four processes exceeds 1TB, leading to out-of-memory errors. Do you have any experience with this? How can I resolve it?

Thanks in advance!


kpertsch commented on August 23, 2024

It's important that you keep the number of workers at 0, so that all the parallelism is handled by TFDS. Basically, you only want a single data-loading process (and shuffle buffer) per compute node, i.e. in your case only one. I'd expect total memory usage for a single loader with a 500k shuffle buffer not to exceed 300GB.
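
A minimal sketch of that setup, loosely following the wrapper approach in examples/06_pytorch_oxe_dataloader.py (the class and variable names here are illustrative; train_dataset is assumed to be the interleaved, already-shuffled tf.data.Dataset built by Octo's data pipeline):

```python
import torch

class TorchRLDSDataset(torch.utils.data.IterableDataset):
    """Thin wrapper that streams an already-shuffled TFDS/RLDS dataset into PyTorch."""

    def __init__(self, rlds_dataset):
        self._rlds_dataset = rlds_dataset  # tf.data.Dataset; TFDS handles all threading internally

    def __iter__(self):
        # Single-process iteration: the one shuffle buffer lives inside TFDS.
        yield from self._rlds_dataset.as_numpy_iterator()

loader = torch.utils.data.DataLoader(
    TorchRLDSDataset(train_dataset),
    batch_size=256,  # illustrative; batching can also be done inside the TFDS pipeline
    num_workers=0,   # crucial: no extra worker processes, so only one shuffle buffer per node
)
```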


yangsizhe commented on August 23, 2024

Thank you for your response!

I have one more question: I am using PyTorch's DistributedDataParallel for multi-GPU training. How can I do data loading in only one process? I noticed that in the DDP examples, each process has its own dataloader.


