Comments (6)
@zxti - that's odd, this shouldn't be the case. Could you share a sample docker container from NGC that you're using so I can test and try to repro on our side?
from deepspeed.
@loadams I'll check which docker container I was specifically using.
But for now - you can also repro in Colab, let me know if this helps: https://colab.research.google.com/drive/1vpmay34Wfc31ilOHSB4G6WIGsmw8H0-0?usp=sharing
It was nvcr.io/nvidia/pytorch:23.12-py3
I encountered the same issue when running the CIFAR example inside the Docker image. The fix was to run the container in --privileged mode:
$ docker run -it --rm --privileged ...
@zxti - does that help resolve the issue for you?
Hi @loadams, unfortunately no. As I mentioned in the initial post, you can work around the issue if you have privileged access to the host, but in many environments (such as RunPod, Colab, etc.) you don't.