Comments (8)
Reading through the last couple of threads on that discussion page, new Nvidia driver versions may either fix the issue or expose a flag to disable the spilling to shared memory (globally, or for a particular executable such as `python` or `julia`). I think this would be better documented on the CUDA.jl side, because it could benefit more than just DL workloads. Then we'd just link to the relevant documentation there.
from flux.jl.
The shared GPU memory seems to be a Windows OS feature that allows the GPU to use CPU memory, so the 24.0 GB are real GPU memory and the 31.9 GB are CPU memory. I don't understand how that works, but it's definitely something you would want to disable when doing model training.
from flux.jl.
I'm afraid we have little control over how GPU memory allocations are managed; almost all of that happens behind the scenes in CUDA.jl. Setting that memory limit is the best knob we have to cap memory usage, but here are a couple of other things you can try to keep the average usage lower:
- Use the batching iterator from https://cuda.juliagpu.org/stable/usage/memory/#Batching-iterator. The MWE doesn't require this, but in practice you'd likely be using something like `DataLoader`, which would benefit from it.
- Call `GC.gc()` every few loop iterations (my default is 10, but YMMV) to run an incremental garbage collection. This can really help keep memory usage low. It works because most of the GPU memory you see is actually already-unused garbage, but the Julia GC doesn't feel any memory pressure from the GPU side that would automatically trigger a collection.
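The second tip can be sketched as follows. This is a minimal, hypothetical training loop (the model, optimizer, and data are stand-ins, not from the original MWE), assuming a working CUDA.jl + Flux setup:

```julia
using CUDA, Flux

# Hypothetical model and data; the point is the periodic GC call.
model = Dense(1024 => 1024) |> gpu
opt_state = Flux.setup(Adam(), model)

for k in 1:1000
    x = CUDA.rand(Float32, 1024, 64)
    grads = Flux.gradient(m -> sum(m(x)), model)
    Flux.update!(opt_state, model, grads[1])
    # Every 10 iterations, free GPU buffers that are already garbage but
    # invisible to the CPU-side GC heuristics. `GC.gc(false)` requests an
    # incremental (young-generation) collection, cheaper than a full sweep.
    k % 10 == 0 && GC.gc(false)
end
```

The interval is a trade-off: collecting more often keeps peak usage lower but adds more (shorter) pauses.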
So the bad news is that this issue has been known for many years, and most of the solutions we have are the workarounds mentioned above. Thankfully, we are finally getting some movement from folks working on the Julia internals and the GPU libraries to find proper solutions to the problem of GPU memory not being freed early enough. If you're curious, hop on the #gpu channel on Slack and I can point you to the discussion. In the meantime, I'll be closing this since there's little actionable on the FluxML side to resolve this issue.
from flux.jl.
Thanks!
The third trick that helps for me is calling `CUDA.reclaim()`, which seems to do something different from `GC.gc()`.
While this is getting resolved on the GPU side, would it be possible to add a little warning about shared memory allocations, and suggest the above methods - perhaps in https://fluxml.ai/Flux.jl/stable/gpu/?
It took me a long time to realize that 1) the speed was far lower than it should be, 2) this was caused by CUDA allocating shared memory, and 3) these solutions fix it. Being aware of this problem is pretty much essential to get Flux to work in real life, but right now we let every developer discover it for themselves.
from flux.jl.
Ok, I'm now not sure what you mean by "shared memory". CUDA.jl by default only allocates from device-side memory (GPU VRAM), which is not shared with the CPU. You can ask it to allocate from "unified" memory, but `gpu` does not.
In terms of tips for speed, my experience is that calling `GC.gc()` early and letting it auto-trigger are pretty close when it comes to overall time spent. The latter results in long pauses, while the former results in more frequent pauses. In either case, `reclaim` is generally only necessary when you allocate memory for a long time (e.g. multiple training loop iterations) without running a GC and you want to clear out a bunch of memory before proceeding, or when you see code OOMing without it.
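A minimal sketch of the GC-then-reclaim pattern described above, assuming CUDA.jl is installed and a GPU is available:

```julia
using CUDA

# After a GC pass, freed CuArray buffers go back to CUDA.jl's memory pool
# rather than to the driver, so other processes (or a large upcoming
# allocation) may still see the VRAM as occupied.
GC.gc(true)        # full collection: mark unreachable GPU buffers as free
CUDA.reclaim()     # return the pooled, now-free memory to the CUDA driver
```

Running `reclaim` without a preceding GC does little, since buffers that are garbage but not yet collected still count as live from the pool's point of view.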
For documentation, the easiest approach would be to link to pages like https://cuda.juliagpu.org/stable/usage/memory/#Garbage-collection. The main problems I see are that we only have one for CUDA.jl, and different GPU libraries may require different tricks. Feel free to drop a PR with your ideas though and I can take a look.
from flux.jl.
Shared GPU memory, as opposed to dedicated GPU memory, as shown in the Windows Task Manager (screenshot omitted).
I have not been able to make this happen with plain CUDA code (lots of temp allocations). My suspicion is that the bug is triggered somewhere in Zygote. [EDIT: I now did manage to make it happen with plain CUDA allocations, so it's unrelated to Zygote.]
I benchmarked the code at the top:
- As is: 436.7 sec
- With `k % 10 == 0 && GC.gc()` inside the loop: 27.7 sec
- With `ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "23GiB"` at the top of the script (and no `GC.gc()`): 12.1 sec
So I would not recommend `GC.gc()` as the first approach to tackle this. I'm happy to propose a doc section.
FWIW, of two colleagues with pretty much the same setup, one sees the exact same problem and the other has never encountered it.
from flux.jl.
Right, apparently introduced as an NVIDIA driver feature earlier this year: vladmandic/automatic#1285
from flux.jl.
Might be worth mentioning in the docs?
from flux.jl.
Related Issues (20)
- Does `withgradient` have lower precision than simply calling the function? HOT 3
- Warning: sort(d::Dict; args...) is deprecated, use sort!(OrderedDict(d); args...) instead. HOT 1
- `show` is confused by shared parameters
- No cudnn implementation of Conv((1,) N=>M) HOT 12
- Flux docs missing withgradient() call for multi-objective loss functions HOT 6
- can't use masks in multi-head-attention layer HOT 6
- Adapt saving & loading example to CuArrays HOT 2
- Segmentation fault when doing a forward pass with a model saved with BSON HOT 2
- Flux new explicit API does not work but old implicit API works for a simple RNN HOT 4
- Intel Arc GPU support. HOT 3
- `using Flux, cuDNN` freezes, but `using Flux, CUDA, cuDNN` works HOT 1
- Convolutional network slower than tensorflow on CPU HOT 13
- Problem with RNN and CUDA. HOT 7
- precompilation issue on Julia 1.10 HOT 1
- Android/iOS support HOT 1
- since new version: Flux throws error when for train! / update! even on quick start problem HOT 4
- Illegal Memory Access Error During Gradient Calculation of predefined losses on GPU RTX 4050 HOT 1
- Flux installation error under Julia 1.10 on Apple Silicon HOT 2
- Given that DataLoader implements `length` shouldn't it also be able to provide size? HOT 4