Comments (7)
After I pip install ultralytics, when I run the training command it always gets stuck at
Transferred 469/475 items from pretrained weights
DDP: debug command /home/qiuzx/miniconda3/envs/yolov8/bin/python -m torch.distributed.run --nproc_per_node 4 --master_port 48025 /home/qiuzx/.config/Ultralytics/DDP/_temp_71rgtm97139853097381840.py
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Ultralytics YOLOv8.2.9 🚀 Python-3.8.13 torch-1.12.1+cu113 CUDA:0 (NVIDIA GeForce RTX 4090, 24217MiB)
CUDA:1 (NVIDIA GeForce RTX 4090, 24217MiB)
CUDA:2 (NVIDIA GeForce RTX 4090, 24217MiB)
CUDA:3 (NVIDIA GeForce RTX 4090, 24217MiB)
WARNING ⚠️ Upgrade to torch>=2.0.0 for deterministic training.
Overriding model.yaml nc=80 with nc=2
Transferred 469/475 items from pretrained weights
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed ✅
and does not proceed. In addition, to make it easier to modify the code, I ran pip uninstall ultralytics, and now I keep getting errors during multi-GPU training:
File "/home/qiuzx/.config/Ultralytics/DDP/_temp_18uxnvwv139801431181536.py", line 6, in <module>
    from ultralytics.models.yolo.detect.train import DetectionTrainer
ModuleNotFoundError: No module named 'ultralytics'
How can I solve these two problems?
from ultralytics.
@blue-q hello! It seems like you're encountering two separate issues here.
- Training getting stuck at AMP checks: If your training consistently stops after "AMP: checks passed ✅" without proceeding, this could be due to insufficient resources or a configuration oversight. First, ensure there are no resource limitations or I/O bottlenecks. Also check whether updating to torch>=2.0.0, as the warning suggests, improves the situation; newer versions of torch have better support and optimizations for multi-GPU setups.
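One thing worth trying before deeper debugging: DDP startup hangs on multi-GPU consumer cards are sometimes worked around by disabling NCCL peer-to-peer and InfiniBand transports via environment variables. This is a hedged sketch, not a confirmed fix for this specific stall; NCCL_P2P_DISABLE and NCCL_IB_DISABLE are real NCCL variables, but whether they help here is an assumption.

```python
import os

# Hedged workaround sketch: NCCL peer-to-peer transfers can hang on some
# consumer GPU setups. Set these BEFORE importing torch/ultralytics or
# launching training; whether they fix this particular stall is untested.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")  # fall back to host-staged copies
os.environ.setdefault("NCCL_IB_DISABLE", "1")   # skip InfiniBand transport probing

# ...then start training as usual, e.g.:
# from ultralytics import YOLO
# YOLO("yolov8n.pt").train(data="coco8.yaml", device=[0, 1, 2, 3])
```

If this unblocks the run, the stall was in NCCL's transport negotiation rather than in Ultralytics itself.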
- Errors after uninstalling the Ultralytics library: When you uninstall the package, Python can no longer find the module in your environment, which raises the ModuleNotFoundError. If you need to modify the code frequently, work from source instead: clone the GitHub repository and run pip install -e . inside the repository directory for an editable install. This avoids repeatedly uninstalling and reinstalling the package.
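To confirm which installation Python will actually import after switching to an editable install, a small stdlib helper can report where a module resolves from (the helper name module_location is illustrative, not part of any library):

```python
import importlib.util

def module_location(name):
    """Return the file a module would be imported from, or None if not importable."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None

# A found module reports its path; a missing one returns None, which is
# exactly the condition that raises ModuleNotFoundError at import time.
print(module_location("json"))         # stdlib module, always resolvable
print(module_location("no_such_pkg"))  # None
```

After pip install -e ., module_location("ultralytics") should point into your cloned repository (e.g. /home/qiuzx/ultralytics) rather than site-packages.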
For both issues, ensuring that all dependencies are correctly installed and updating to the latest versions where possible often helps. If the problem persists, providing more specific logs or error messages could help in diagnosing the issue further!
Hi @glenn-jocher, my CUDA version is now 12.1 and I have reinstalled torch 2.1.2. My environment information is as follows:
Package Version Editable project location
------------------------ -------------------- -------------------------
certifi 2024.2.2
charset-normalizer 3.3.2
cmake 3.29.2
contourpy 1.1.1
cycler 0.12.1
filelock 3.14.0
fonttools 4.51.0
fsspec 2024.3.1
idna 3.7
importlib_resources 6.4.0
Jinja2 3.1.4
kiwisolver 1.4.5
lit 18.1.4
MarkupSafe 2.1.5
matplotlib 3.7.5
mpmath 1.3.0
networkx 3.1
numpy 1.24.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.1.105
opencv-python 4.9.0.80
packaging 24.0
pandas 2.0.3
pillow 10.3.0
pip 23.3.1
psutil 5.9.8
py-cpuinfo 9.0.0
pyparsing 3.1.2
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
requests 2.31.0
scipy 1.10.1
seaborn 0.13.2
setuptools 68.2.2
six 1.16.0
sympy 1.12
thop 0.1.1.post2209072238
torch 2.1.2
torchaudio 2.1.2
torchvision 0.16.2
tqdm 4.66.4
triton 2.1.0
typing_extensions 4.11.0
tzdata 2024.1
ultralytics 8.1.44 /home/qiuzx/ultralytics
urllib3 2.2.1
wheel 0.43.0
zipp 3.18.1
My training command is model.train(data='/home/qiuzx/ultralytics/ultralytics/cfg/datasets/20240506_flame_smoke_class2.yaml', epochs=500, imgsz=640, batch=128, device=[0,1,2,3]), and it still gets stuck at
DDP: debug command /home/qiuzx/miniconda3/envs/yolov8/bin/python -m torch.distributed.run --nproc_per_node 4 --master_port 37947 /home/qiuzx/.config/Ultralytics/DDP/_temp_mak_nap2139734965595248.py
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Ultralytics YOLOv8.1.44 🚀 Python-3.8.13 torch-2.1.2+cu121 CUDA:0 (NVIDIA GeForce RTX 4090, 24217MiB)
CUDA:1 (NVIDIA GeForce RTX 4090, 24217MiB)
CUDA:2 (NVIDIA GeForce RTX 4090, 24217MiB)
CUDA:3 (NVIDIA GeForce RTX 4090, 24217MiB)
Overriding model.yaml nc=80 with nc=2
Transferred 469/475 items from pretrained weights
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed ✅
Meanwhile, I checked the GPU status with nvidia-smi and found that the utilization of all four GPUs was at 100%.
Could this be caused by torch.backends.cudnn.benchmark? After setting torch.backends.cudnn.enabled=False, I can run with two GPUs, but with four GPUs it still gets stuck at
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed ✅
Hello! It sounds like you may be hitting an issue related to cuDNN benchmarking when using multiple GPUs.
Disabling torch.backends.cudnn.benchmark
can indeed help in some cases: it turns off autotuning optimizations that generally improve performance but can cause deadlocks in specific situations, especially when the workload varies between batches.
As you noticed, setting:
torch.backends.cudnn.enabled = False
helps when using two GPUs but doesn't solve the issue with four GPUs.
It could be beneficial to ensure all GPUs synchronize properly. You might want to try setting:
torch.backends.cudnn.benchmark = False
torch.cuda.synchronize()
before your training loop or right after the AMP check, to ensure all devices are in sync.
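The settings above can be wrapped in a small guard that applies them only when torch is importable, so the same script runs in environments with or without GPU support (the helper name configure_cudnn is illustrative):

```python
def configure_cudnn(benchmark=False, enabled=True):
    """Apply cuDNN flags before training; return the applied values, or None if torch is absent."""
    try:
        import torch
    except ImportError:
        return None  # torch not installed in this environment
    # Disabling the autotuner avoids per-shape kernel searches that can
    # stall when batch shapes vary across DDP ranks.
    torch.backends.cudnn.benchmark = benchmark
    torch.backends.cudnn.enabled = enabled
    return {"benchmark": benchmark, "enabled": enabled}

settings = configure_cudnn(benchmark=False)
```

Call this once at the top of the training script, before model construction, so every DDP worker process inherits the same cuDNN behavior.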
If the issue persists, please provide more details about your specific setup or configurations that might be contributing to this behavior! Happy coding! 🚀
DDP: debug command /home/qiuzx/miniconda3/envs/yolov8/bin/python -m torch.distributed.run --nproc_per_node 4 --master_port 37947 /home/qiuzx/.config/Ultralytics/DDP/_temp_mak_nap2139734965595248.py
WARNING:__main__:
Setting OMP_NUM_THREADS environment
Does this actually have any effect?
@TomZhongJie hello! It looks like you're inquiring about the impact of the OMP_NUM_THREADS
environment setting during your DDP (Distributed Data Parallel) training with YOLOv8. Setting OMP_NUM_THREADS=1
is generally recommended for avoiding potential issues with overly aggressive thread usage by PyTorch, which can lead to inefficient CPU usage in multi-threading environments, especially when using multiple GPUs. It can help to stabilize your training process by ensuring that parallel execution doesn't become a bottleneck.
If you're experiencing particular issues or slowdowns, you might consider adjusting this setting to better fit your hardware capabilities, balancing between CPU threads and GPU workload. Here's how you can experiment with it:
import os
os.environ['OMP_NUM_THREADS'] = '4' # Adjust this as necessary for your machine
Add this to your script before importing any major libraries like PyTorch or starting the training process to see if it impacts performance. Happy experimenting! 🚀
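One hedged way to choose a value rather than guessing is to split the available CPU cores evenly across the DDP worker processes (one per GPU). The helper name omp_threads_per_worker below is illustrative, not a PyTorch or Ultralytics API:

```python
import os

def omp_threads_per_worker(num_workers, cpus=None):
    """Split available CPU cores evenly across DDP worker processes (minimum 1 each)."""
    cpus = cpus if cpus is not None else (os.cpu_count() or 1)
    return max(1, cpus // num_workers)

# e.g. a 16-core machine driving 4 GPUs -> 4 OpenMP threads per process
os.environ["OMP_NUM_THREADS"] = str(omp_threads_per_worker(4, cpus=16))
```

Set this before importing torch so the OpenMP runtime picks it up; oversubscribing (threads × workers > cores) is what the default of 1 is guarding against.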