Comments (8)
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
It seems nvidia-container-toolkit is not installed correctly, so Docker cannot mount GPUs through the --gpus
argument.
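A quick way to narrow this down is to check whether the toolkit's helper binaries are visible on the Docker host. The sketch below is only a rough check under stated assumptions: it assumes the toolkit installs `nvidia-container-runtime-hook` and `nvidia-container-cli` on the PATH, and the function name `missing_toolkit_binaries` is made up for illustration.

```python
import shutil

# Assumption: these are the helper binaries the toolkit places on the host PATH.
TOOLKIT_BINARIES = ("nvidia-container-runtime-hook", "nvidia-container-cli")


def missing_toolkit_binaries(names=TOOLKIT_BINARIES, which=shutil.which):
    """Return the entries of `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_toolkit_binaries()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Toolkit binaries found on PATH.")
```

If both binaries are present and `--gpus` still fails, restarting the Docker daemon after the installation is usually worth trying before digging further.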
from superbenchmark.
I discovered that, but the NVIDIA Container Toolkit is not working either:
NVIDIA/nvidia-container-toolkit#60
I managed to get the NVIDIA Container Toolkit working, but I am still getting an error. See the log below:
=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2023.04.15 13:15:08 =~=~=~=~=~=~=~=~=~=~=~=
sudo sb run -f local.ini -c resnet.yaml 2>&1 | sudo tee ~/gg/log/sb.run.log
[2023-04-15 13:15:10,635 guyen-MS-7B22:8261][ansible.py:60][INFO] {'host_pattern': 'all', 'cmdline': '--forks 1 --inventory /home/guyen/gg/git/superbenchmark/local.ini'}
[2023-04-15 13:15:10,647 guyen-MS-7B22:8261][runner.py:45][INFO] Runner uses config: {'superbench': {'benchmarks': {'bert_models': {'enable': True,
'frameworks': ['pytorch'],
'models': ['bert-base',
'bert-large'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}],
'parameters': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']}},
'computation-communication-overlap': {'enable': True,
'frameworks': ['pytorch'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}]},
'cpu-memory-bw-latency': {'enable': False,
'modes': [{'name': 'local',
'parallel': False,
'proc_num': 1}],
'parameters': {'tests': ['bandwidth_matrix',
'latency_matrix',
'max_bandwidth']}},
'cublas-function': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}]},
'cudnn-function': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}]},
'densenet_models': {'enable': True,
'frameworks': ['pytorch'],
'models': ['densenet169',
'densenet201'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}],
'parameters': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']}},
'disk-benchmark': {'enable': False,
'modes': [{'name': 'local',
'parallel': False,
'proc_num': 1}],
'parameters': {'block_devices': ['/dev/nvme0n1']}},
'gemm-flops': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}]},
'gpcnet-network-load-test': {'enable': False,
'modes': [{'env': {'UCX_NET_DEVICES': 'mlx5_0:1'},
'mca': {'btl': '^uct',
'btl_tcp_if_include': 'eth0',
'pml': 'ucx'},
'name': 'mpi',
'proc_num': 1}]},
'gpcnet-network-test': {'enable': False,
'modes': [{'env': {'UCX_NET_DEVICES': 'mlx5_0:1'},
'mca': {'btl': '^uct',
'btl_tcp_if_include': 'eth0',
'pml': 'ucx'},
'name': 'mpi',
'proc_num': 1}]},
'gpt_models': {'enable': True,
'frameworks': ['pytorch'],
'models': ['gpt2-small',
'gpt2-large'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}],
'parameters': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']}},
'gpu-burn': {'enable': True,
'modes': [{'name': 'local',
'parallel': False,
'proc_num': 1}],
'parameters': {'doubles': True,
'tensor_core': True,
'time': 300}},
'gpu-copy-bw:correctness': {'enable': True,
'modes': [{'name': 'local',
'parallel': False}],
'parameters': {'check_data': True,
'copy_type': ['sm',
'dma'],
'mem_type': ['htod',
'dtoh',
'dtod'],
'num_loops': 1,
'num_warm_up': 0,
'size': 4096}},
'gpu-copy-bw:perf': {'enable': True,
'modes': [{'name': 'local',
'parallel': False}],
'parameters': {'copy_type': ['sm',
'dma'],
'mem_type': ['htod',
'dtoh',
'dtod']}},
'ib-loopback': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'PROC_RANK={proc_rank} '
'IB_DEVICES=0,2,4,6 '
'NUMA_NODES=1,0,3,2',
'proc_num': 4},
{'name': 'local',
'parallel': True,
'prefix': 'PROC_RANK={proc_rank} '
'IB_DEVICES=1,3,5,7 '
'NUMA_NODES=1,0,3,2',
'proc_num': 4}]},
'ib-traffic': {'enable': False,
'modes': [{'name': 'mpi',
'proc_num': 8}],
'parameters': {'gpu_dev': '$LOCAL_RANK',
'ib_dev': 'mlx5_$LOCAL_RANK',
'msg_size': 8388608,
'numa_dev': '$((LOCAL_RANK/2))'}},
'kernel-launch': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}]},
'lstm_models': {'enable': True,
'frameworks': ['pytorch'],
'models': ['lstm'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}],
'parameters': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']}},
'matmul': {'enable': True,
'frameworks': ['pytorch'],
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}]},
'mem-bw': {'enable': True,
'modes': [{'name': 'local',
'parallel': False,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank} '
'numactl -N '
'$(({proc_rank}/2))',
'proc_num': 8}]},
'nccl-bw:default': {'enable': True,
'modes': [{'name': 'local',
'parallel': False,
'proc_num': 1}],
'parameters': {'ngpus': 8}},
'nccl-bw:gdr-only': {'enable': True,
'modes': [{'env': {'NCCL_IB_DISABLE': '0',
'NCCL_IB_PCI_RELAXED_ORDERING': '1',
'NCCL_MIN_NCHANNELS': '16',
'NCCL_NET_GDR_LEVEL': '5',
'NCCL_P2P_DISABLE': '1',
'NCCL_SHM_DISABLE': '1'},
'name': 'local',
'parallel': False,
'proc_num': 1}],
'parameters': {'ngpus': 8}},
'ort-inference': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}],
'parameters': {'batch_size': 1}},
'resnet_models': {'enable': True,
'frameworks': ['pytorch'],
'models': ['resnet50',
'resnet101',
'resnet152'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}],
'parameters': {'batch_size': 128,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']}},
'sharding-matmul': {'enable': True,
'frameworks': ['pytorch'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}]},
'tcp-connectivity': {'enable': False,
'modes': [{'name': 'local',
'parallel': False}],
'parameters': {'port': 22}},
'tensorrt-inference': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}],
'parameters': {'batch_size': 1,
'precision': 'int8',
'pytorch_models': ['resnet50',
'resnet101',
'resnet152',
'densenet169',
'densenet201',
'bert-base',
'bert-large'],
'seq_length': 224}},
'vgg_models': {'enable': True,
'frameworks': ['pytorch'],
'models': ['vgg11',
'vgg13',
'vgg16',
'vgg19'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}],
'parameters': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']}}},
'enable': ['resnet_models'],
'monitor': {'enable': True,
'sample_duration': 1,
'sample_interval': 10},
'var': {'common_model_config': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']},
'default_local_mode': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}]},
'default_pytorch_mode': {'enable': True,
'frameworks': ['pytorch'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}]}}},
'version': 'v0.8'}.
[2023-04-15 13:15:10,648 guyen-MS-7B22:8261][runner.py:46][INFO] Runner writes to: /home/guyen/gg/git/superbenchmark/outputs/2023-04-15_13-15-10.
[2023-04-15 13:15:10,674 guyen-MS-7B22:8261][runner.py:51][INFO] Runner will run: ['resnet_models']
[2023-04-15 13:15:10,674 guyen-MS-7B22:8261][runner.py:202][INFO] Checking SuperBench environment.
[2023-04-15 13:15:10,699 guyen-MS-7B22:8261][ansible.py:127][INFO] Run playbook check_env.yaml ...
PLAY [Runtime Environment Check] ***********************************************
TASK [Checking container status] ***********************************************
changed: [localhost]
TASK [fail] ********************************************************************
skipping: [localhost]
PLAY [Runtime Environment Update] **********************************************
TASK [Gathering Facts] *********************************************************
ok: [localhost]
TASK [Ensure Workspace] ********************************************************
ok: [localhost]
TASK [Updating Config] *********************************************************
ok: [localhost]
TASK [Updating Env Variables] **************************************************
ok: [localhost] => (item=/root/sb-workspace/sb.env)
ok: [localhost] => (item=/tmp/sb.env)
TASK [Updating Hostfile to Remote] *********************************************
ok: [localhost]
TASK [Generating Hostfile to Local] ********************************************
changed: [localhost -> localhost]
PLAY RECAP *********************************************************************
localhost : ok=7 changed=2 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
[2023-04-15 13:15:15,163 guyen-MS-7B22:8261][ansible.py:79][INFO] Run succeed, return code 0.
[2023-04-15 13:15:15,165 guyen-MS-7B22:8261][runner.py:414][INFO] Runner is going to run resnet_models in torch.distributed mode, proc rank 0.
[2023-04-15 13:15:15,165 guyen-MS-7B22:8261][ansible.py:109][INFO] Run docker exec --env-file /tmp/sb.env sb-workspace bash -c 'python3 -m torch.distributed.launch --use_env --no_python --nproc_per_node=8 sb exec --output-dir outputs/2023-04-15_13-15-10 -c sb.config.yaml -C superbench.enable=resnet_models superbench.benchmarks.resnet_models.parameters.distributed_impl=ddp superbench.benchmarks.resnet_models.parameters.distributed_backend=nccl' on remote ...
[2023-04-15 13:15:15,165 guyen-MS-7B22:8261][ansible.py:73][INFO] Run as sudo ...
localhost | CHANGED | rc=0 >>
[2023-04-15 20:15:17,142 guyen-MS-7B22:298][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.
[2023-04-15 20:15:17,172 guyen-MS-7B22:297][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.
[2023-04-15 20:15:17,235 guyen-MS-7B22:301][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.
[2023-04-15 20:15:17,235 guyen-MS-7B22:296][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.
[2023-04-15 20:15:17,249 guyen-MS-7B22:312][monitor.py:118][INFO] Start monitoring.
[2023-04-15 20:15:17,299 guyen-MS-7B22:300][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.
[2023-04-15 20:15:17,314 guyen-MS-7B22:302][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.
[2023-04-15 20:15:17,590 guyen-MS-7B22:303][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.
[2023-04-15 20:15:17,623 guyen-MS-7B22:299][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.
[2023-04-15 20:15:17,909 guyen-MS-7B22:297][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:17,909 guyen-MS-7B22:297][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.
[2023-04-15 20:15:17,968 guyen-MS-7B22:298][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:17,970 guyen-MS-7B22:298][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.
[2023-04-15 20:15:18,151 guyen-MS-7B22:302][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:18,152 guyen-MS-7B22:302][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.
[2023-04-15 20:15:18,165 guyen-MS-7B22:296][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:18,166 guyen-MS-7B22:296][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.
[2023-04-15 20:15:18,216 guyen-MS-7B22:301][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:18,217 guyen-MS-7B22:301][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.
[2023-04-15 20:15:18,241 guyen-MS-7B22:300][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:18,241 guyen-MS-7B22:300][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.
[2023-04-15 20:15:18,393 guyen-MS-7B22:303][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:18,394 guyen-MS-7B22:303][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.
[2023-04-15 20:15:18,477 guyen-MS-7B22:299][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:18,477 guyen-MS-7B22:299][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.
[2023-04-15 20:15:23,074 guyen-MS-7B22:303][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
[2023-04-15 20:15:23,074 guyen-MS-7B22:296][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
[2023-04-15 20:15:23,074 guyen-MS-7B22:301][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
[2023-04-15 20:15:23,075 guyen-MS-7B22:296][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:23,075 guyen-MS-7B22:300][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
[2023-04-15 20:15:23,075 guyen-MS-7B22:298][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
[2023-04-15 20:15:23,075 guyen-MS-7B22:296][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.
[2023-04-15 20:15:23,075 guyen-MS-7B22:301][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:23,075 guyen-MS-7B22:299][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
[2023-04-15 20:15:23,075 guyen-MS-7B22:301][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.
[2023-04-15 20:15:23,075 guyen-MS-7B22:300][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:23,075 guyen-MS-7B22:299][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:23,075 guyen-MS-7B22:296][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.
[2023-04-15 20:15:23,075 guyen-MS-7B22:297][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
[2023-04-15 20:15:23,075 guyen-MS-7B22:300][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.
[2023-04-15 20:15:23,075 guyen-MS-7B22:299][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.
[2023-04-15 20:15:23,075 guyen-MS-7B22:297][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:23,076 guyen-MS-7B22:301][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.
[2023-04-15 20:15:23,076 guyen-MS-7B22:297][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.
[2023-04-15 20:15:23,076 guyen-MS-7B22:300][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.
[2023-04-15 20:15:23,076 guyen-MS-7B22:299][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.
[2023-04-15 20:15:23,076 guyen-MS-7B22:297][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.
[2023-04-15 20:15:23,076 guyen-MS-7B22:303][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:23,077 guyen-MS-7B22:303][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.
[2023-04-15 20:15:23,077 guyen-MS-7B22:303][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.
[2023-04-15 20:15:23,077 guyen-MS-7B22:298][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:23,078 guyen-MS-7B22:296][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:23,078 guyen-MS-7B22:296][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.
[2023-04-15 20:15:23,078 guyen-MS-7B22:298][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.
[2023-04-15 20:15:23,079 guyen-MS-7B22:299][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:23,079 guyen-MS-7B22:298][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.
[2023-04-15 20:15:23,079 guyen-MS-7B22:299][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.
[2023-04-15 20:15:23,079 guyen-MS-7B22:301][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:23,079 guyen-MS-7B22:301][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.
[2023-04-15 20:15:23,079 guyen-MS-7B22:302][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
[2023-04-15 20:15:23,079 guyen-MS-7B22:300][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:23,079 guyen-MS-7B22:302][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:23,080 guyen-MS-7B22:300][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.
[2023-04-15 20:15:23,080 guyen-MS-7B22:302][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.
[2023-04-15 20:15:23,080 guyen-MS-7B22:302][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.
[2023-04-15 20:15:23,080 guyen-MS-7B22:299][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!
[2023-04-15 20:15:23,080 guyen-MS-7B22:299][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:23,081 guyen-MS-7B22:299][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.
[2023-04-15 20:15:23,081 guyen-MS-7B22:301][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!
[2023-04-15 20:15:23,081 guyen-MS-7B22:297][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:23,081 guyen-MS-7B22:301][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:23,081 guyen-MS-7B22:299][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.
[2023-04-15 20:15:23,081 guyen-MS-7B22:301][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.
[2023-04-15 20:15:23,081 guyen-MS-7B22:297][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.
[2023-04-15 20:15:23,082 guyen-MS-7B22:301][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.
[2023-04-15 20:15:23,082 guyen-MS-7B22:298][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:23,082 guyen-MS-7B22:297][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!
[2023-04-15 20:15:23,082 guyen-MS-7B22:297][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:23,082 guyen-MS-7B22:298][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.
[2023-04-15 20:15:23,082 guyen-MS-7B22:297][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.
[2023-04-15 20:15:23,082 guyen-MS-7B22:297][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.
[2023-04-15 20:15:23,083 guyen-MS-7B22:302][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:23,083 guyen-MS-7B22:302][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.
[2023-04-15 20:15:23,083 guyen-MS-7B22:303][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:23,083 guyen-MS-7B22:300][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!
[2023-04-15 20:15:23,084 guyen-MS-7B22:302][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!
[2023-04-15 20:15:23,084 guyen-MS-7B22:300][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:23,084 guyen-MS-7B22:302][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:23,084 guyen-MS-7B22:300][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.
[2023-04-15 20:15:23,084 guyen-MS-7B22:302][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.
[2023-04-15 20:15:23,084 guyen-MS-7B22:299][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:23,084 guyen-MS-7B22:300][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.
[2023-04-15 20:15:23,084 guyen-MS-7B22:302][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.
[2023-04-15 20:15:23,084 guyen-MS-7B22:303][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.
[2023-04-15 20:15:23,084 guyen-MS-7B22:299][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.
[2023-04-15 20:15:23,085 guyen-MS-7B22:301][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:23,085 guyen-MS-7B22:298][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!
[2023-04-15 20:15:23,085 guyen-MS-7B22:301][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.
[2023-04-15 20:15:23,085 guyen-MS-7B22:298][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:23,085 guyen-MS-7B22:298][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.
[2023-04-15 20:15:23,085 guyen-MS-7B22:297][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:23,086 guyen-MS-7B22:297][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.
[2023-04-15 20:15:23,086 guyen-MS-7B22:298][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.
[2023-04-15 20:15:23,087 guyen-MS-7B22:302][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:23,087 guyen-MS-7B22:300][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:23,087 guyen-MS-7B22:302][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.
[2023-04-15 20:15:23,087 guyen-MS-7B22:300][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.
[2023-04-15 20:15:23,088 guyen-MS-7B22:303][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!
[2023-04-15 20:15:23,088 guyen-MS-7B22:303][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:23,088 guyen-MS-7B22:303][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.
[2023-04-15 20:15:23,089 guyen-MS-7B22:298][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:23,089 guyen-MS-7B22:298][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.
[2023-04-15 20:15:23,089 guyen-MS-7B22:303][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.
[2023-04-15 20:15:23,090 guyen-MS-7B22:296][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!
[2023-04-15 20:15:23,091 guyen-MS-7B22:296][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:23,091 guyen-MS-7B22:296][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.
[2023-04-15 20:15:23,091 guyen-MS-7B22:296][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.
[2023-04-15 20:15:23,092 guyen-MS-7B22:303][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:23,092 guyen-MS-7B22:303][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.
[2023-04-15 20:15:23,094 guyen-MS-7B22:296][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.
[2023-04-15 20:15:23,094 guyen-MS-7B22:296][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.
[2023-04-15 20:15:24,085 guyen-MS-7B22:299][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!
[2023-04-15 20:15:24,086 guyen-MS-7B22:301][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!
[2023-04-15 20:15:24,086 guyen-MS-7B22:299][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:24,087 guyen-MS-7B22:299][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.
[2023-04-15 20:15:24,087 guyen-MS-7B22:301][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:24,087 guyen-MS-7B22:297][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!
[2023-04-15 20:15:24,088 guyen-MS-7B22:301][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.
[2023-04-15 20:15:24,088 guyen-MS-7B22:297][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:24,088 guyen-MS-7B22:297][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.
[2023-04-15 20:15:24,088 guyen-MS-7B22:302][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!
[2023-04-15 20:15:24,089 guyen-MS-7B22:302][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:24,090 guyen-MS-7B22:302][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.
[2023-04-15 20:15:24,090 guyen-MS-7B22:300][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!
[2023-04-15 20:15:24,090 guyen-MS-7B22:298][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!
[2023-04-15 20:15:24,090 guyen-MS-7B22:300][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:24,090 guyen-MS-7B22:298][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:24,091 guyen-MS-7B22:298][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.
[2023-04-15 20:15:24,091 guyen-MS-7B22:300][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.
[2023-04-15 20:15:24,101 guyen-MS-7B22:303][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!
[2023-04-15 20:15:24,102 guyen-MS-7B22:303][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:24,102 guyen-MS-7B22:303][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.
[2023-04-15 20:15:24,111 guyen-MS-7B22:296][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!
[2023-04-15 20:15:24,112 guyen-MS-7B22:296][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:15:24,113 guyen-MS-7B22:296][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.
[2023-04-15 13:15:27,814 guyen-MS-7B22:8261][ansible.py:79][INFO] Run succeed, return code 0.
[2023-04-15 13:15:27,815 guyen-MS-7B22:8261][ansible.py:127][INFO] Run playbook fetch_results.yaml ...
PLAY [Fetch Results] ***********************************************************
TASK [Gathering Facts] *********************************************************
ok: [localhost]
TASK [Synchronize Output Directory] ********************************************
changed: [localhost]
PLAY RECAP *********************************************************************
localhost : ok=2 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
[2023-04-15 13:15:29,883 guyen-MS-7B22:8261][ansible.py:79][INFO] Run succeed, return code 0.
(venv) guyen@guyen-MS-7B22:~/gg/git/superbenchmark$
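Editorial note: the log above repeats the same failure once per worker process, which makes it hard to see how many distinct problems there are. The following is a minimal sketch, not part of SuperBench, that groups the failing PIDs by benchmark and message. The regular expression is an assumption based only on the line layout visible in this log, and `summarize_failures` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the log lines in this thread:
# [DATE TIME HOST:PID][base.py:LINE][ERROR] Run benchmark failed - benchmark: NAME, message: TEXT
FAILURE_RE = re.compile(
    r"^\[(?P<time>\S+ \S+) (?P<host>[^\]:]+):(?P<pid>\d+)\]"
    r"\[base\.py:\d+\]\[ERROR\] Run benchmark failed - "
    r"benchmark: (?P<benchmark>[^,]+), message: (?P<message>.*)$"
)


def summarize_failures(lines):
    """Group the PIDs of failing workers by (benchmark, message)."""
    failures = defaultdict(set)
    for line in lines:
        match = FAILURE_RE.match(line.strip())
        if match:
            key = (match["benchmark"], match["message"])
            failures[key].add(int(match["pid"]))
    # Sort the PIDs so the summary is stable across runs.
    return {key: sorted(pids) for key, pids in failures.items()}


def print_summary(lines):
    """Print one line per distinct failure instead of one per worker."""
    for (benchmark, message), pids in summarize_failures(lines).items():
        print(f"{benchmark}: {message} (workers: {pids})")
```

Calling `print_summary(open(path))` on a saved log such as the `sb.run.log` file written by the `tee` command at the top of this comment should collapse the repeated lines into one entry per distinct error.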
from superbenchmark.
I left only resnet101 enabled and now it runs out of memory on an RTX 2070 Super (8 GB).
What is the memory requirement for resnet_models: resnet101?
Is there any smaller training workload that can fit in 8 GB?
worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
[2023-04-15 20:32:44,711 guyen-MS-7B22:237][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 7.78 GiB total capacity; 6.58 GiB already allocated; 127.62 MiB free; 6.71 GiB reserved in total by PyTorch)
[2023-04-15 20:32:44,712 guyen-MS-7B22:237][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:32:44,712 guyen-MS-7B22:237][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.
[2023-04-15 20:32:44,713 guyen-MS-7B22:237][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.
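Editorial note: the out-of-memory message above already contains the numbers needed to see why the allocation failed (196.00 MiB requested, 127.62 MiB free). Below is a minimal sketch that extracts those fields so they can be compared programmatically. The pattern is an assumption that matches only the wording shown in this log; other PyTorch versions may phrase the message differently, and `parse_cuda_oom` is a hypothetical helper name.

```python
import re

# Conversion factors to MiB for the units that appear in the message.
_TO_MIB = {"KiB": 1.0 / 1024, "MiB": 1.0, "GiB": 1024.0}

_SIZE = r"(?P<{0}>[\d.]+) (?P<{0}_unit>[KMG]iB)"

# Assumed wording, copied from the log line in this thread.
OOM_RE = re.compile(
    r"Tried to allocate " + _SIZE.format("requested") + r" "
    r"\(GPU (?P<gpu>\d+); "
    + _SIZE.format("total") + r" total capacity; "
    + _SIZE.format("allocated") + r" already allocated; "
    + _SIZE.format("free") + r" free; "
    + _SIZE.format("reserved") + r" reserved in total by PyTorch\)"
)


def parse_cuda_oom(message):
    """Return the sizes from a CUDA OOM message in MiB, or None if no match."""
    match = OOM_RE.search(message)
    if match is None:
        return None
    info = {"gpu": int(match["gpu"])}
    for field in ("requested", "total", "allocated", "free", "reserved"):
        info[field] = float(match[field]) * _TO_MIB[match[field + "_unit"]]
    # How much more memory the failed allocation needed than was free.
    info["shortfall"] = info["requested"] - info["free"]
    return info
```

A positive `shortfall` means the request could not be satisfied from free memory. Note that `allocated` is already about 6.58 GiB of the 7.78 GiB total here, so the failing 196 MiB request is only the last step; the bulk of the usage comes from the model and batch already resident on the GPU.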
from superbenchmark.
Sorry, SuperBenchmark doesn't officially support the GeForce series at all, so running on it will introduce many unexpected issues. There is no plan to add GeForce series support in upcoming releases.
For GeForce-related code and configuration settings (e.g. RTX 2070), it would be great if you could contribute them.
from superbenchmark.
It was due to memory size; I did manage to run some of the smaller training workloads. You can close this.
from superbenchmark.
Regarding configuration: I don't think a contribution is necessary. These are just the same CUDA chips under a different brand name, with smaller sizes, plus gaming chips. Performance will be slower (GDDR instead of HBM, 8 GB vs. 32 GB, etc.), but I had no issue running scaled-down workloads.
from superbenchmark.
Thanks for the discussion, we really appreciate it. We are closing this issue.
from superbenchmark.