
Comments (8)

ToucheSir commented on May 27, 2024

Reading through the last couple of threads on that discussion page, new Nvidia driver versions may either fix the issue or expose a flag to disable the spilling to shared memory (globally or for a particular exe like python or julia). I think this would be better documented on the CUDA.jl side because it could benefit more than DL workloads. Then we'd just link to the relevant documentation there.


chengchingwen commented on May 27, 2024

The shared GPU memory seems to be a Windows OS feature that allows the GPU to use CPU memory, so the 24.0 GB are real GPU memory and the 31.9 GB are CPU memory. I don't understand how that works, but it's definitely something you would want to disable when doing model training.


ToucheSir commented on May 27, 2024

I'm afraid we have little control over how GPU memory allocations are managed. Almost all of that happens behind the scenes in CUDA.jl. Setting that memory limit is the best knob we have to cap memory usage, but here are a couple other things you can try to keep the average usage lower:

  1. Use the batching iterator (CuIterator): https://cuda.juliagpu.org/stable/usage/memory/#Batching-iterator. The MWE doesn't require this, but in practice you'd likely be using something like DataLoader, which would benefit from it.
  2. Call GC.gc() every few loop iterations (my default is 10, but YMMV) to run a garbage collection. This can really help with keeping memory usage low. The reason it works is that most of the GPU memory you see in use is actually already garbage, but the Julia GC doesn't feel any memory pressure from the GPU side, so it never triggers a collection on its own (see the sketch after this list).
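
A minimal sketch combining both workarounds in a training loop, assuming the newer Flux optimiser API (Flux.setup/Flux.update!); the model, loss, and dataset here are illustrative placeholders, not code from this issue:

```julia
using CUDA, Flux

# Hypothetical model, optimiser, and in-memory dataset; only the two
# memory-management tricks are the point here.
model = Dense(10 => 1) |> gpu
opt   = Flux.setup(Adam(), model)
data  = [(randn(Float32, 10, 64), randn(Float32, 1, 64)) for _ in 1:100]

# (1) CuIterator uploads one batch at a time and frees the previous one
#     eagerly, instead of keeping the whole dataset on the GPU.
for (k, (x, y)) in enumerate(CuIterator(data))
    grads = Flux.gradient(m -> Flux.mse(m(x), y), model)
    Flux.update!(opt, model, grads[1])
    # (2) Periodic collection; GC.gc(false) would run an incremental
    #     (cheaper) collection instead of a full one.
    k % 10 == 0 && GC.gc()
end
```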

So the bad news is that this issue has been known for many years, and most of the solutions we have are the workarounds mentioned above. Thankfully, we are finally getting some movement from folks working on the Julia internals and the GPU libraries to find proper solutions to this problem of GPU memory not being freed early enough. If you're curious, hop on the #gpu channel on Slack and I can point you to the discussion. In the meantime, I'll be closing this since there's little actionable on the FluxML side.


koenvos commented on May 27, 2024

Thanks!

The third trick that helps for me is calling CUDA.reclaim(), which seems to do something different from GC.gc().

While this is getting resolved on the GPU side, would it be possible to add a little warning about shared memory allocations, and suggest the above methods - perhaps in https://fluxml.ai/Flux.jl/stable/gpu/?

It took me a long time to realize 1) the speed was far lower than it should be, 2) this was caused by CUDA allocating shared memory, and 3) these solutions fix it. Being aware of this problem is pretty much essential to getting Flux to work in real life, but right now we let every developer discover it for themselves.


ToucheSir commented on May 27, 2024

OK, now I'm not sure what you mean by "shared memory". CUDA.jl by default only allocates device-side memory (GPU VRAM), which is not shared with the CPU. You can ask it to allocate "unified" memory instead, but Flux's gpu function does not do that.
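
For illustration, a sketch of the default versus an explicitly requested unified allocation, assuming a recent CUDA.jl (5.x) where cu accepts a unified keyword:

```julia
using CUDA

x = cu(rand(Float32, 1024))                # default: device memory (VRAM only)
y = cu(rand(Float32, 1024); unified=true)  # opt-in: unified (CPU-visible) memory
```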

In terms of tips for speed, my experience is that calling GC.gc() eagerly and letting the GC auto-trigger end up pretty close in overall time spent: the latter results in long pauses, the former in more frequent ones. In either case, CUDA.reclaim() is generally only necessary when you allocate for a long stretch (e.g. multiple training loop iterations) without running a GC and want to clear out a bunch of memory before proceeding, or when you see code OOMing without it.
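
A minimal sketch of that pairing; both calls exist as written, and only the placement (after a long allocation-heavy stretch) is the judgment call:

```julia
using CUDA

# After many iterations without a collection:
GC.gc()           # free Julia-side garbage, including dead CuArray wrappers
CUDA.reclaim()    # return the now-unused pooled memory to the driver
```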

For documentation, the easiest approach would be to link to pages like https://cuda.juliagpu.org/stable/usage/memory/#Garbage-collection. The main problems I see are that we only have one for CUDA.jl, and different GPU libraries may require different tricks. Feel free to drop a PR with your ideas though and I can take a look.


koenvos commented on May 27, 2024

Shared GPU memory as opposed to dedicated GPU memory, in Windows Task Manager:
[Screenshot: Windows Task Manager GPU panel showing dedicated vs. shared GPU memory usage]

I have not been able to make this happen with plain CUDA code (lots of temp allocations). My suspicion is that the bug is triggered somewhere in Zygote. [EDIT: I now did manage to make it happen with plain CUDA allocations, so it's unrelated to Zygote.]

I benchmarked the code at the top:

  • As is: 436.7 sec
  • With k % 10 == 0 && GC.gc() inside the loop: 27.7 sec
  • With ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "23GiB" at the top of the script (and no GC.gc()): 12.1 sec

So I would not recommend GC.gc() as the first approach to tackle this. I'm happy to propose a doc section.
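
For reference, a sketch of the memory-limit approach; the variable is read when CUDA.jl initializes, so it needs to be set at the top of the script, before CUDA.jl (or anything that loads it, like Flux) is imported:

```julia
# Cap the CUDA.jl pool below physical VRAM so the driver never spills
# into shared (CPU) memory; "23GiB" here assumes a 24 GB card.
ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "23GiB"

using CUDA, Flux
# ... rest of the training script unchanged ...
```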

FWIW, of two colleagues with pretty much the same setup, one sees the exact same problem and the other has never encountered it.


koenvos commented on May 27, 2024

Right, apparently introduced as an NVIDIA driver feature earlier this year: vladmandic/automatic#1285


chengchingwen commented on May 27, 2024

Might be worth mentioning in the docs?


