ModelZoo-PyTorch/PyTorch/contrib/cv/classification/MobileNetV3_large_100_for_PyTorch# bash ./test/train_full_8p.sh --data_path=./tiny-imagenet-200
Using NVIDIA APEX AMP. Training in mixed precision.
Using NVIDIA APEX DistributedDataParallel.
Scheduled epochs: 12
./tiny-imagenet-200/train
./tiny-imagenet-200/val
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-0.pth.tar', 0.0)
Train: 1 [ 23/24 (100%)] Loss: 317.145294 (162.0302) Time: 0.203s, 20132.34/s (0.204s, 20120.48/s) LR: 1.875e-01 Data: 0.000 (0.000) FPS: 20120.483 Batch_Size:512.0
Current checkpoints:
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-1.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-0.pth.tar', 0.0)
Train: 2 [ 23/24 (100%)] Loss: 12993.997070 (6542.2793) Time: 0.201s, 20416.47/s (0.201s, 20367.03/s) LR: 1.067e+00 Data: 0.000 (0.000) FPS: 20367.028 Batch_Size:512.0
Current checkpoints:
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-1.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-2.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-0.pth.tar', 0.0)
Train: 3 [ 23/24 (100%)] Loss: 13072.549805 (13072.2168) Time: 0.200s, 20458.09/s (0.201s, 20406.06/s) LR: 1.000e-05 Data: 0.000 (0.000) FPS: 20406.057 Batch_Size:512.0
Current checkpoints:
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-1.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-2.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-3.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-0.pth.tar', 0.0)
Train: 4 [ 23/24 (100%)] Loss: 13074.948242 (13074.1235) Time: 0.200s, 20464.24/s (0.201s, 20388.30/s) LR: 1.000e-05 Data: 0.000 (0.000) FPS: 20388.300 Batch_Size:512.0
Current checkpoints:
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-1.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-2.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-3.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-4.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-0.pth.tar', 0.0)
Train: 5 [ 23/24 (100%)] Loss: 13074.291016 (13074.4219) Time: 0.200s, 20436.34/s (0.200s, 20444.53/s) LR: 1.000e-05 Data: 0.000 (0.000) FPS: 20444.529 Batch_Size:512.0
Current checkpoints:
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-1.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-2.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-3.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-4.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-5.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-0.pth.tar', 0.0)
Train: 6 [ 23/24 (100%)] Loss: 13075.957031 (13075.1313) Time: 0.201s, 20424.17/s (0.201s, 20404.91/s) LR: 1.000e-05 Data: 0.000 (0.000) FPS: 20404.910 Batch_Size:512.0
Current checkpoints:
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-1.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-2.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-3.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-4.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-5.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-6.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-0.pth.tar', 0.0)
Train: 7 [ 23/24 (100%)] Loss: 13076.270508 (13076.5000) Time: 0.201s, 20398.41/s (0.200s, 20433.25/s) LR: 1.000e-05 Data: 0.000 (0.000) FPS: 20433.251 Batch_Size:512.0
Current checkpoints:
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-1.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-2.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-3.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-4.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-5.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-6.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-7.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-0.pth.tar', 0.0)
Train: 8 [ 23/24 (100%)] Loss: 13076.963867 (13077.1807) Time: 0.201s, 20409.10/s (0.201s, 20424.42/s) LR: 1.000e-05 Data: 0.000 (0.000) FPS: 20424.422 Batch_Size:512.0
Current checkpoints:
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-1.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-2.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-3.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-4.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-5.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-6.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-7.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-8.pth.tar', 100.0)
('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-0.pth.tar', 0.0)
-----------------------------------8-card train_1.log reports an error
Traceback (most recent call last):
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 61, in wrapper
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 58, in wrapper
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 275, in task_distribute
    key, func_name, detail = resource_proxy[TASK_QUEUE].get()
  File "<string>", line 2, in get
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/managers.py", line 819, in _callmethod
    kind, result = conn.recv()
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/connection.py", line 250, in recv
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
EOFError
-----------------------------------8-card train_2.log reports an error
Traceback (most recent call last):
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 61, in wrapper
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 58, in wrapper
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 275, in task_distribute
    key, func_name, detail = resource_proxy[TASK_QUEUE].get()
  File "<string>", line 2, in get
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/managers.py", line 819, in _callmethod
    kind, result = conn.recv()
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/connection.py", line 250, in recv
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
EOFError
/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 91 leaked semaphores to clean up at shutdown
  len(cache))
-----------------------------------8-card train_3.log still waiting
-----------------------------------8-card train_4.log reports an error
Traceback (most recent call last):
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 61, in wrapper
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 58, in wrapper
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 275, in task_distribute
    key, func_name, detail = resource_proxy[TASK_QUEUE].get()
  File "<string>", line 2, in get
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/managers.py", line 819, in _callmethod
    kind, result = conn.recv()
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/connection.py", line 250, in recv
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
EOFError
/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 91 leaked semaphores to clean up at shutdown
  len(cache))
-----------------------------------8-card train_5.log reports an error
Traceback (most recent call last):
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 61, in wrapper
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 58, in wrapper
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 275, in task_distribute
    key, func_name, detail = resource_proxy[TASK_QUEUE].get()
  File "<string>", line 2, in get
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/managers.py", line 819, in _callmethod
    kind, result = conn.recv()
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/connection.py", line 250, in recv
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
EOFError
-----------------------------------8-card train_6.log reports an error
Traceback (most recent call last):
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 61, in wrapper
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 58, in wrapper
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 275, in task_distribute
    key, func_name, detail = resource_proxy[TASK_QUEUE].get()
  File "<string>", line 2, in get
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/managers.py", line 819, in _callmethod
    kind, result = conn.recv()
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/connection.py", line 250, in recv
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
EOFError
/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 91 leaked semaphores to clean up at shutdown
  len(cache))
-----------------------------------8-card train_7.log waiting
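Checking the eight per-device logs one by one can be scripted. A minimal sketch, assuming the logs are named `train_0.log` .. `train_7.log` (this helper and the path pattern are not part of the repository; point it at wherever train_full_8p.sh writes its per-device output):

```shell
# Hypothetical helper: report which per-device training logs contain a
# Python traceback. Logs with no traceback are either clean or still running.
scan_worker_logs() {
  dir="$1"
  for f in "$dir"/train_*.log; do
    [ -f "$f" ] || continue
    if grep -q "Traceback" "$f"; then
      echo "$(basename "$f"): error reported"
    else
      echo "$(basename "$f"): no traceback (clean or still running)"
    fi
  done
}
```

Usage would be e.g. `scan_worker_logs ./test/output` (directory is an assumption).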
more /root/ascend/log/debug/plog/plog-83032_20230727024927917.log
[TRACE] GE(83032,python3):2023-07-27-02:49:27.848.095 [status:INIT] [ge_api.cc:200]83032 GEInitializeImpl:GEInitialize start
[TRACE] GE(83032,python3):2023-07-27-02:49:28.073.557 [status:RUNNING] [ge_api.cc:266]83032 GEInitializeImpl:Initializing environment
[TRACE] GE(83032,python3):2023-07-27-02:49:36.094.724 [status:STOP] [ge_api.cc:309]83032 GEInitializeImpl:GEInitialize finished
[TRACE] GE(83032,python3):2023-07-27-02:49:36.095.523 [status:INIT] [ge_api.cc:200]83032 GEInitializeImpl:GEInitialize start
[TRACE] HCCL(83032,python3):2023-07-27-02:49:57.407.898 [status:init] [op_base.cc:267][hccl-83032-0-1690426197-hccl_world_group][7]HcclCommInitRootInfo success,take time [2890202]us, rankNum[8], rank[7],rootInfo identifier[10.0.48.200%enp61s0f3_60000_0_1690426193976808], server[10.0.48.200%enp61s0f3], device[7]
Several of these look normal.
The three most recent ones (by timestamp):
(base) root@hw:/media/sda/datastore/dataset/detect_dataset# more /root/ascend/log/debug/plog/plog-84469_20230727030241041.log
[ERROR] TBE(84469,python3):2023-07-27-03:02:41.035.597 [../../../../../../latest/python/site-packages/tbe/common/repository_manager/utils/repository_manager_log.py:30][log] [../../../../../../latest/python/site-packages/tbe/common/repository_manager/utils/common.py:100][repository_manager] The main process does not exist. We would kill multiprocess manager process: 84068.
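The same triage over the plog directory can be sketched as a small helper (a sketch under assumptions; this function is not part of the Ascend toolkit, and the plog path comes from the session above):

```shell
# Hypothetical sketch: among the most recently modified plog files, print
# the names of those containing [ERROR] lines, plus the first such line
# from each, so the failing device processes can be found quickly.
find_error_plogs() {
  dir="$1"; n="${2:-3}"
  ls -t "$dir"/plog-*.log 2>/dev/null | head -n "$n" | while read -r f; do
    if grep -q "\[ERROR\]" "$f"; then
      echo "$f"
      grep -m1 "\[ERROR\]" "$f"
    fi
  done
}
# Example: find_error_plogs /root/ascend/log/debug/plog 3
```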