
Comments (8)

ToucheSir commented on May 27, 2024

Reading through the last couple of threads on that discussion page, new Nvidia driver versions may either fix the issue or expose a flag to disable the spilling to shared memory (globally or for a particular exe like python or julia). I think this would be better documented on the CUDA.jl side because it could benefit more than DL workloads. Then we'd just link to the relevant documentation there.


chengchingwen commented on May 27, 2024

The shared GPU memory seems to be a Windows OS feature that allows the GPU to use CPU memory, so the 24.0 GB are real GPU memory and the 31.9 GB are CPU memory. I don't understand how that works, but it's definitely something you would want to disable when doing model training.


ToucheSir commented on May 27, 2024

I'm afraid we have little control over how GPU memory allocations are managed. Almost all of that happens behind the scenes in CUDA.jl. Setting that memory limit is the best knob we have to cap memory usage, but here are a couple other things you can try to keep the average usage lower:

  1. Use the batching iterator (CuIterator): https://cuda.juliagpu.org/stable/usage/memory/#Batching-iterator. The MWE doesn't require this, but in practice you'd likely be using something like DataLoader, which would benefit from it.
  2. Call GC.gc() every few loop iterations (my default is 10, but YMMV) to run a garbage collection. This can really help with keeping memory usage low. The reason it works is that most of the GPU memory you see in use is actually already garbage, but the Julia GC doesn't feel any memory pressure from the GPU side, so it never triggers a collection on its own (see the sketch after this list).
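
A minimal sketch combining both workarounds in a training loop, assuming the newer Flux optimiser API (Flux.setup/Flux.update!); the model, loss, and dataset here are illustrative placeholders, not code from this issue:

```julia
using CUDA, Flux

# Hypothetical model, optimiser, and in-memory dataset; only the two
# memory-management tricks are the point here.
model = Dense(10 => 1) |> gpu
opt   = Flux.setup(Adam(), model)
data  = [(randn(Float32, 10, 64), randn(Float32, 1, 64)) for _ in 1:100]

# (1) CuIterator uploads one batch at a time and frees the previous one
#     eagerly, instead of keeping the whole dataset on the GPU.
for (k, (x, y)) in enumerate(CuIterator(data))
    grads = Flux.gradient(m -> Flux.mse(m(x), y), model)
    Flux.update!(opt, model, grads[1])
    # (2) Periodic collection; GC.gc(false) would run an incremental
    #     (cheaper) collection instead of a full one.
    k % 10 == 0 && GC.gc()
end
```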

So the bad news is that this issue has been known for many years, and most of the solutions we have are the workarounds mentioned above. Thankfully, we are finally getting some movement from folks working on the Julia internals and the GPU libraries to find proper solutions to this problem of GPU memory not being freed early enough. If you're curious, hop on the #gpu channel on Slack and I can point you to the discussion. In the meantime, I'll be closing this since there's little actionable on the FluxML side.


koenvos commented on May 27, 2024

Thanks!

The third trick that helps for me is calling CUDA.reclaim(), which seems to do something different from GC.gc().

While this is getting resolved on the GPU side, would it be possible to add a little warning about shared memory allocations, and suggest the above methods - perhaps in https://fluxml.ai/Flux.jl/stable/gpu/?

It took me a long time to realize 1) the speed was far lower than it should be, 2) this was caused by CUDA allocating shared memory, and 3) these solutions fix it. Being aware of this problem is pretty much essential to getting Flux to work in real life, but right now we let every developer discover it for themselves.


ToucheSir commented on May 27, 2024

OK, now I'm not sure what you mean by "shared memory". CUDA.jl by default only allocates device-side memory (GPU VRAM), which is not shared with the CPU. You can ask it to allocate "unified" memory instead, but Flux's gpu function does not do that.
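
For illustration, a sketch of the default versus an explicitly requested unified allocation, assuming a recent CUDA.jl (5.x) where cu accepts a unified keyword:

```julia
using CUDA

x = cu(rand(Float32, 1024))                # default: device memory (VRAM only)
y = cu(rand(Float32, 1024); unified=true)  # opt-in: unified (CPU-visible) memory
```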

In terms of tips for speed, my experience is that calling GC.gc() eagerly and letting the GC auto-trigger end up pretty close in overall time spent: the latter results in long pauses, the former in more frequent ones. In either case, CUDA.reclaim() is generally only necessary when you allocate for a long stretch (e.g. multiple training loop iterations) without running a GC and want to clear out a bunch of memory before proceeding, or when you see code OOMing without it.
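
A minimal sketch of that pairing; both calls exist as written, and only the placement (after a long allocation-heavy stretch) is the judgment call:

```julia
using CUDA

# After many iterations without a collection:
GC.gc()           # free Julia-side garbage, including dead CuArray wrappers
CUDA.reclaim()    # return the now-unused pooled memory to the driver
```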

For documentation, the easiest approach would be to link to pages like https://cuda.juliagpu.org/stable/usage/memory/#Garbage-collection. The main problems I see are that we only have one for CUDA.jl, and different GPU libraries may require different tricks. Feel free to drop a PR with your ideas though and I can take a look.


koenvos commented on May 27, 2024

Shared GPU memory as opposed to dedicated GPU memory, in Windows Task Manager:
[Screenshot: Windows Task Manager GPU panel showing dedicated vs. shared GPU memory usage]

I have not been able to make this happen with plain CUDA code (lots of temp allocations). My suspicion is that the bug is triggered somewhere in Zygote. [EDIT: I now did manage to make it happen with plain CUDA allocations, so it's unrelated to Zygote.]

I benchmarked the code at the top:

  • As is: 436.7 sec
  • With k % 10 == 0 && GC.gc() inside the loop: 27.7 sec
  • With ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "23GiB" at the top of the script (and no GC.gc()): 12.1 sec

So I would not recommend GC.gc() as the first approach to tackle this. I'm happy to propose a doc section.
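
For reference, a sketch of the memory-limit approach; the variable is read when CUDA.jl initializes, so it needs to be set at the top of the script, before CUDA.jl (or anything that loads it, like Flux) is imported:

```julia
# Cap the CUDA.jl pool below physical VRAM so the driver never spills
# into shared (CPU) memory; "23GiB" here assumes a 24 GB card.
ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "23GiB"

using CUDA, Flux
# ... rest of the training script unchanged ...
```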

FWIW, of two colleagues with pretty much the same setup, one sees the exact same problem and the other has never encountered it.


koenvos commented on May 27, 2024

Right, apparently introduced as an NVIDIA driver feature earlier this year: vladmandic/automatic#1285


chengchingwen commented on May 27, 2024

Might be worth mentioning in the docs?


