What's the issue, what's expected?

superbench failed at the default, most typical run config (microsoft/superbenchmark)

Comments (8)

abuccts commented on May 22, 2024

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

It seems nvidia-container-toolkit is not installed correctly, so Docker cannot mount GPUs via the --gpus argument.
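
A quick way to confirm this before running sb again might look like the sketch below. It starts a throwaway container with --gpus all and looks for the error text quoted above in Docker's stderr. The function names are made up for illustration, and `image` is an assumption: pass any CUDA-capable image you already have locally.

```python
import subprocess

# Substring taken from the Docker error message quoted above.
DRIVER_ERROR = "could not select device driver"


def gpu_flag_rejected(stderr: str) -> bool:
    """Return True if Docker's stderr shows that --gpus could not be honoured."""
    return DRIVER_ERROR in stderr


def check_docker_gpus(image: str) -> bool:
    """Run nvidia-smi in a throwaway container started with --gpus all.

    `image` is an assumption: any CUDA-capable image available locally.
    Returns True only if the container started and nvidia-smi exited with 0.
    """
    result = subprocess.run(
        ["docker", "run", "--rm", "--gpus", "all", image, "nvidia-smi"],
        capture_output=True,
        text=True,
    )
    if gpu_flag_rejected(result.stderr):
        print("Docker cannot mount GPUs; check the nvidia-container-toolkit install.")
        return False
    return result.returncode == 0
```

If `check_docker_gpus` returns False with the message above, the problem is on the Docker side rather than in SuperBench itself.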


jdgh000 commented on May 22, 2024

I discovered that, but the NVIDIA Container Toolkit is not working either:
NVIDIA/nvidia-container-toolkit#60


jdgh000 commented on May 22, 2024

I managed to get the NVIDIA Container Toolkit working, but I am still getting an error. See the log below:

=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2023.04.15 13:15:08 =~=~=~=~=~=~=~=~=~=~=~=
sudo sb run -f local.ini -c resnet.yaml 2>&1 | sudo tee ~/gg/log/sb.run.log 

[2023-04-15 13:15:10,635 guyen-MS-7B22:8261][ansible.py:60][INFO] {'host_pattern': 'all', 'cmdline': '--forks 1 --inventory /home/guyen/gg/git/superbenchmark/local.ini'}
[2023-04-15 13:15:10,647 guyen-MS-7B22:8261][runner.py:45][INFO] Runner uses config: {'superbench': {'benchmarks': {'bert_models': {'enable': True,
                                               'frameworks': ['pytorch'],
                                               'models': ['bert-base',
                                                          'bert-large'],
                                               'modes': [{'name': 'torch.distributed',
                                                          'node_num': 1,
                                                          'proc_num': 8}],
                                               'parameters': {'batch_size': 1,
                                                              'duration': 0,
                                                              'model_action': ['train'],
                                                              'num_steps': 128,
                                                              'num_warmup': 16,
                                                              'precision': ['float32',
                                                                            'float16']}},
                               'computation-communication-overlap': {'enable': True,
                                                                     'frameworks': ['pytorch'],
                                                                     'modes': [{'name': 'torch.distributed',
                                                                                'node_num': 1,
                                                                                'proc_num': 8}]},
                               'cpu-memory-bw-latency': {'enable': False,
                                                         'modes': [{'name': 'local',
                                                                    'parallel': False,
                                                                    'proc_num': 1}],
                                                         'parameters': {'tests': ['bandwidth_matrix',
                                                                                  'latency_matrix',
                                                                                  'max_bandwidth']}},
                               'cublas-function': {'enable': True,
                                                   'modes': [{'name': 'local',
                                                              'parallel': True,
                                                              'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
                                                              'proc_num': 8}]},
                               'cudnn-function': {'enable': True,
                                                  'modes': [{'name': 'local',
                                                             'parallel': True,
                                                             'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
                                                             'proc_num': 8}]},
                               'densenet_models': {'enable': True,
                                                   'frameworks': ['pytorch'],
                                                   'models': ['densenet169',
                                                              'densenet201'],
                                                   'modes': [{'name': 'torch.distributed',
                                                              'node_num': 1,
                                                              'proc_num': 8}],
                                                   'parameters': {'batch_size': 1,
                                                                  'duration': 0,
                                                                  'model_action': ['train'],
                                                                  'num_steps': 128,
                                                                  'num_warmup': 16,
                                                                  'precision': ['float32',
                                                                                'float16']}},
                               'disk-benchmark': {'enable': False,
                                                  'modes': [{'name': 'local',
                                                             'parallel': False,
                                                             'proc_num': 1}],
                                                  'parameters': {'block_devices': ['/dev/nvme0n1']}},
                               'gemm-flops': {'enable': True,
                                              'modes': [{'name': 'local',
                                                         'parallel': True,
                                                         'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
                                                         'proc_num': 8}]},
                               'gpcnet-network-load-test': {'enable': False,
                                                            'modes': [{'env': {'UCX_NET_DEVICES': 'mlx5_0:1'},
                                                                       'mca': {'btl': '^uct',
                                                                               'btl_tcp_if_include': 'eth0',
                                                                               'pml': 'ucx'},
                                                                       'name': 'mpi',
                                                                       'proc_num': 1}]},
                               'gpcnet-network-test': {'enable': False,
                                                       'modes': [{'env': {'UCX_NET_DEVICES': 'mlx5_0:1'},
                                                                  'mca': {'btl': '^uct',
                                                                          'btl_tcp_if_include': 'eth0',
                                                                          'pml': 'ucx'},
                                                                  'name': 'mpi',
                                                                  'proc_num': 1}]},
                               'gpt_models': {'enable': True,
                                              'frameworks': ['pytorch'],
                                              'models': ['gpt2-small',
                                                         'gpt2-large'],
                                              'modes': [{'name': 'torch.distributed',
                                                         'node_num': 1,
                                                         'proc_num': 8}],
                                              'parameters': {'batch_size': 1,
                                                             'duration': 0,
                                                             'model_action': ['train'],
                                                             'num_steps': 128,
                                                             'num_warmup': 16,
                                                             'precision': ['float32',
                                                                           'float16']}},
                               'gpu-burn': {'enable': True,
                                            'modes': [{'name': 'local',
                                                       'parallel': False,
                                                       'proc_num': 1}],
                                            'parameters': {'doubles': True,
                                                           'tensor_core': True,
                                                           'time': 300}},
                               'gpu-copy-bw:correctness': {'enable': True,
                                                           'modes': [{'name': 'local',
                                                                      'parallel': False}],
                                                           'parameters': {'check_data': True,
                                                                          'copy_type': ['sm',
                                                                                        'dma'],
                                                                          'mem_type': ['htod',
                                                                                       'dtoh',
                                                                                       'dtod'],
                                                                          'num_loops': 1,
                                                                          'num_warm_up': 0,
                                                                          'size': 4096}},
                               'gpu-copy-bw:perf': {'enable': True,
                                                    'modes': [{'name': 'local',
                                                               'parallel': False}],
                                                    'parameters': {'copy_type': ['sm',
                                                                                 'dma'],
                                                                   'mem_type': ['htod',
                                                                                'dtoh',
                                                                                'dtod']}},
                               'ib-loopback': {'enable': True,
                                               'modes': [{'name': 'local',
                                                          'parallel': True,
                                                          'prefix': 'PROC_RANK={proc_rank} '
                                                                    'IB_DEVICES=0,2,4,6 '
                                                                    'NUMA_NODES=1,0,3,2',
                                                          'proc_num': 4},
                                                         {'name': 'local',
                                                          'parallel': True,
                                                          'prefix': 'PROC_RANK={proc_rank} '
                                                                    'IB_DEVICES=1,3,5,7 '
                                                                    'NUMA_NODES=1,0,3,2',
                                                          'proc_num': 4}]},
                               'ib-traffic': {'enable': False,
                                              'modes': [{'name': 'mpi',
                                                         'proc_num': 8}],
                                              'parameters': {'gpu_dev': '$LOCAL_RANK',
                                                             'ib_dev': 'mlx5_$LOCAL_RANK',
                                                             'msg_size': 8388608,
                                                             'numa_dev': '$((LOCAL_RANK/2))'}},
                               'kernel-launch': {'enable': True,
                                                 'modes': [{'name': 'local',
                                                            'parallel': True,
                                                            'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
                                                            'proc_num': 8}]},
                               'lstm_models': {'enable': True,
                                               'frameworks': ['pytorch'],
                                               'models': ['lstm'],
                                               'modes': [{'name': 'torch.distributed',
                                                          'node_num': 1,
                                                          'proc_num': 8}],
                                               'parameters': {'batch_size': 1,
                                                              'duration': 0,
                                                              'model_action': ['train'],
                                                              'num_steps': 128,
                                                              'num_warmup': 16,
                                                              'precision': ['float32',
                                                                            'float16']}},
                               'matmul': {'enable': True,
                                          'frameworks': ['pytorch'],
                                          'modes': [{'name': 'local',
                                                     'parallel': True,
                                                     'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
                                                     'proc_num': 8}]},
                               'mem-bw': {'enable': True,
                                          'modes': [{'name': 'local',
                                                     'parallel': False,
                                                     'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank} '
                                                               'numactl -N '
                                                               '$(({proc_rank}/2))',
                                                     'proc_num': 8}]},
                               'nccl-bw:default': {'enable': True,
                                                   'modes': [{'name': 'local',
                                                              'parallel': False,
                                                              'proc_num': 1}],
                                                   'parameters': {'ngpus': 8}},
                               'nccl-bw:gdr-only': {'enable': True,
                                                    'modes': [{'env': {'NCCL_IB_DISABLE': '0',
                                                                       'NCCL_IB_PCI_RELAXED_ORDERING': '1',
                                                                       'NCCL_MIN_NCHANNELS': '16',
                                                                       'NCCL_NET_GDR_LEVEL': '5',
                                                                       'NCCL_P2P_DISABLE': '1',
                                                                       'NCCL_SHM_DISABLE': '1'},
                                                               'name': 'local',
                                                               'parallel': False,
                                                               'proc_num': 1}],
                                                    'parameters': {'ngpus': 8}},
                               'ort-inference': {'enable': True,
                                                 'modes': [{'name': 'local',
                                                            'parallel': True,
                                                            'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
                                                            'proc_num': 8}],
                                                 'parameters': {'batch_size': 1}},
                               'resnet_models': {'enable': True,
                                                 'frameworks': ['pytorch'],
                                                 'models': ['resnet50',
                                                            'resnet101',
                                                            'resnet152'],
                                                 'modes': [{'name': 'torch.distributed',
                                                            'node_num': 1,
                                                            'proc_num': 8}],
                                                 'parameters': {'batch_size': 128,
                                                                'duration': 0,
                                                                'model_action': ['train'],
                                                                'num_steps': 128,
                                                                'num_warmup': 16,
                                                                'precision': ['float32',
                                                                              'float16']}},
                               'sharding-matmul': {'enable': True,
                                                   'frameworks': ['pytorch'],
                                                   'modes': [{'name': 'torch.distributed',
                                                              'node_num': 1,
                                                              'proc_num': 8}]},
                               'tcp-connectivity': {'enable': False,
                                                    'modes': [{'name': 'local',
                                                               'parallel': False}],
                                                    'parameters': {'port': 22}},
                               'tensorrt-inference': {'enable': True,
                                                      'modes': [{'name': 'local',
                                                                 'parallel': True,
                                                                 'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
                                                                 'proc_num': 8}],
                                                      'parameters': {'batch_size': 1,
                                                                     'precision': 'int8',
                                                                     'pytorch_models': ['resnet50',
                                                                                        'resnet101',
                                                                                        'resnet152',
                                                                                        'densenet169',
                                                                                        'densenet201',
                                                                                        'bert-base',
                                                                                        'bert-large'],
                                                                     'seq_length': 224}},
                               'vgg_models': {'enable': True,
                                              'frameworks': ['pytorch'],
                                              'models': ['vgg11',
                                                         'vgg13',
                                                         'vgg16',
                                                         'vgg19'],
                                              'modes': [{'name': 'torch.distributed',
                                                         'node_num': 1,
                                                         'proc_num': 8}],
                                              'parameters': {'batch_size': 1,
                                                             'duration': 0,
                                                             'model_action': ['train'],
                                                             'num_steps': 128,
                                                             'num_warmup': 16,
                                                             'precision': ['float32',
                                                                           'float16']}}},
                'enable': ['resnet_models'],
                'monitor': {'enable': True,
                            'sample_duration': 1,
                            'sample_interval': 10},
                'var': {'common_model_config': {'batch_size': 1,
                                                'duration': 0,
                                                'model_action': ['train'],
                                                'num_steps': 128,
                                                'num_warmup': 16,
                                                'precision': ['float32',
                                                              'float16']},
                        'default_local_mode': {'enable': True,
                                               'modes': [{'name': 'local',
                                                          'parallel': True,
                                                          'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
                                                          'proc_num': 8}]},
                        'default_pytorch_mode': {'enable': True,
                                                 'frameworks': ['pytorch'],
                                                 'modes': [{'name': 'torch.distributed',
                                                            'node_num': 1,
                                                            'proc_num': 8}]}}},
 'version': 'v0.8'}.
[2023-04-15 13:15:10,648 guyen-MS-7B22:8261][runner.py:46][INFO] Runner writes to: /home/guyen/gg/git/superbenchmark/outputs/2023-04-15_13-15-10.
[2023-04-15 13:15:10,674 guyen-MS-7B22:8261][runner.py:51][INFO] Runner will run: ['resnet_models']
[2023-04-15 13:15:10,674 guyen-MS-7B22:8261][runner.py:202][INFO] Checking SuperBench environment.
[2023-04-15 13:15:10,699 guyen-MS-7B22:8261][ansible.py:127][INFO] Run playbook check_env.yaml ...


PLAY [Runtime Environment Check] ***********************************************


TASK [Checking container status] ***********************************************
changed: [localhost]


TASK [fail] ********************************************************************
skipping: [localhost]


PLAY [Runtime Environment Update] **********************************************


TASK [Gathering Facts] *********************************************************
ok: [localhost]


TASK [Ensure Workspace] ********************************************************
ok: [localhost]


TASK [Updating Config] *********************************************************
ok: [localhost]


TASK [Updating Env Variables] **************************************************
ok: [localhost] => (item=/root/sb-workspace/sb.env)
ok: [localhost] => (item=/tmp/sb.env)


TASK [Updating Hostfile to Remote] *********************************************
ok: [localhost]


TASK [Generating Hostfile to Local] ********************************************
changed: [localhost -> localhost]


PLAY RECAP *********************************************************************

localhost                  : ok=7    changed=2    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
[2023-04-15 13:15:15,163 guyen-MS-7B22:8261][ansible.py:79][INFO] Run succeed, return code 0.
[2023-04-15 13:15:15,165 guyen-MS-7B22:8261][runner.py:414][INFO] Runner is going to run resnet_models in torch.distributed mode, proc rank 0.
[2023-04-15 13:15:15,165 guyen-MS-7B22:8261][ansible.py:109][INFO] Run docker exec --env-file /tmp/sb.env sb-workspace bash -c 'python3 -m torch.distributed.launch --use_env --no_python --nproc_per_node=8 sb exec --output-dir outputs/2023-04-15_13-15-10 -c sb.config.yaml -C superbench.enable=resnet_models superbench.benchmarks.resnet_models.parameters.distributed_impl=ddp superbench.benchmarks.resnet_models.parameters.distributed_backend=nccl' on remote ...
[2023-04-15 13:15:15,165 guyen-MS-7B22:8261][ansible.py:73][INFO] Run as sudo ...
localhost | CHANGED | rc=0 >>

[2023-04-15 20:15:17,142 guyen-MS-7B22:298][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.

[2023-04-15 20:15:17,172 guyen-MS-7B22:297][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.

[2023-04-15 20:15:17,235 guyen-MS-7B22:301][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.

[2023-04-15 20:15:17,235 guyen-MS-7B22:296][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.

[2023-04-15 20:15:17,249 guyen-MS-7B22:312][monitor.py:118][INFO] Start monitoring.

[2023-04-15 20:15:17,299 guyen-MS-7B22:300][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.

[2023-04-15 20:15:17,314 guyen-MS-7B22:302][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.

[2023-04-15 20:15:17,590 guyen-MS-7B22:303][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.

[2023-04-15 20:15:17,623 guyen-MS-7B22:299][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet50.

[2023-04-15 20:15:17,909 guyen-MS-7B22:297][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:17,909 guyen-MS-7B22:297][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.

[2023-04-15 20:15:17,968 guyen-MS-7B22:298][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:17,970 guyen-MS-7B22:298][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.

[2023-04-15 20:15:18,151 guyen-MS-7B22:302][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:18,152 guyen-MS-7B22:302][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.

[2023-04-15 20:15:18,165 guyen-MS-7B22:296][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:18,166 guyen-MS-7B22:296][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.

[2023-04-15 20:15:18,216 guyen-MS-7B22:301][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:18,217 guyen-MS-7B22:301][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.

[2023-04-15 20:15:18,241 guyen-MS-7B22:300][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:18,241 guyen-MS-7B22:300][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.

[2023-04-15 20:15:18,393 guyen-MS-7B22:303][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:18,394 guyen-MS-7B22:303][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.

[2023-04-15 20:15:18,477 guyen-MS-7B22:299][model_base.py:201][INFO] Model placement - model: pytorch-resnet50, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:18,477 guyen-MS-7B22:299][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet50, distributed implementation: ddp.

[2023-04-15 20:15:23,074 guyen-MS-7B22:303][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3

ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

[2023-04-15 20:15:23,074 guyen-MS-7B22:296][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3

ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

[2023-04-15 20:15:23,074 guyen-MS-7B22:301][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3

ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

[2023-04-15 20:15:23,075 guyen-MS-7B22:296][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,075 guyen-MS-7B22:300][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3

ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

[2023-04-15 20:15:23,075 guyen-MS-7B22:298][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3

ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

[2023-04-15 20:15:23,075 guyen-MS-7B22:296][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.

[2023-04-15 20:15:23,075 guyen-MS-7B22:301][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,075 guyen-MS-7B22:299][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3

ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

[2023-04-15 20:15:23,075 guyen-MS-7B22:301][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.

[2023-04-15 20:15:23,075 guyen-MS-7B22:300][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,075 guyen-MS-7B22:299][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,075 guyen-MS-7B22:296][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,075 guyen-MS-7B22:297][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3

ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

[2023-04-15 20:15:23,075 guyen-MS-7B22:300][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.

[2023-04-15 20:15:23,075 guyen-MS-7B22:299][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.

[2023-04-15 20:15:23,075 guyen-MS-7B22:297][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,076 guyen-MS-7B22:301][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,076 guyen-MS-7B22:297][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.

[2023-04-15 20:15:23,076 guyen-MS-7B22:300][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,076 guyen-MS-7B22:299][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,076 guyen-MS-7B22:297][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,076 guyen-MS-7B22:303][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,077 guyen-MS-7B22:303][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.

[2023-04-15 20:15:23,077 guyen-MS-7B22:303][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,077 guyen-MS-7B22:298][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,078 guyen-MS-7B22:296][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,078 guyen-MS-7B22:296][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.

[2023-04-15 20:15:23,078 guyen-MS-7B22:298][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.

[2023-04-15 20:15:23,079 guyen-MS-7B22:299][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,079 guyen-MS-7B22:298][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,079 guyen-MS-7B22:299][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.

[2023-04-15 20:15:23,079 guyen-MS-7B22:301][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,079 guyen-MS-7B22:301][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.

[2023-04-15 20:15:23,079 guyen-MS-7B22:302][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3

ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

[2023-04-15 20:15:23,079 guyen-MS-7B22:300][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,079 guyen-MS-7B22:302][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,080 guyen-MS-7B22:300][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.

[2023-04-15 20:15:23,080 guyen-MS-7B22:302][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.

[2023-04-15 20:15:23,080 guyen-MS-7B22:302][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,080 guyen-MS-7B22:299][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!

[2023-04-15 20:15:23,080 guyen-MS-7B22:299][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,081 guyen-MS-7B22:299][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,081 guyen-MS-7B22:301][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!

[2023-04-15 20:15:23,081 guyen-MS-7B22:297][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,081 guyen-MS-7B22:301][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,081 guyen-MS-7B22:299][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.

[2023-04-15 20:15:23,081 guyen-MS-7B22:301][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,081 guyen-MS-7B22:297][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.

[2023-04-15 20:15:23,082 guyen-MS-7B22:301][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.

[2023-04-15 20:15:23,082 guyen-MS-7B22:298][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,082 guyen-MS-7B22:297][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!

[2023-04-15 20:15:23,082 guyen-MS-7B22:297][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,082 guyen-MS-7B22:298][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.

[2023-04-15 20:15:23,082 guyen-MS-7B22:297][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,082 guyen-MS-7B22:297][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.

[2023-04-15 20:15:23,083 guyen-MS-7B22:302][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,083 guyen-MS-7B22:302][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.

[2023-04-15 20:15:23,083 guyen-MS-7B22:303][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,083 guyen-MS-7B22:300][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!

[2023-04-15 20:15:23,084 guyen-MS-7B22:302][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!

[2023-04-15 20:15:23,084 guyen-MS-7B22:300][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,084 guyen-MS-7B22:302][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,084 guyen-MS-7B22:300][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,084 guyen-MS-7B22:302][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,084 guyen-MS-7B22:299][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,084 guyen-MS-7B22:300][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.

[2023-04-15 20:15:23,084 guyen-MS-7B22:302][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.

[2023-04-15 20:15:23,084 guyen-MS-7B22:303][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.

[2023-04-15 20:15:23,084 guyen-MS-7B22:299][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.

[2023-04-15 20:15:23,085 guyen-MS-7B22:301][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,085 guyen-MS-7B22:298][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!

[2023-04-15 20:15:23,085 guyen-MS-7B22:301][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.

[2023-04-15 20:15:23,085 guyen-MS-7B22:298][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,085 guyen-MS-7B22:298][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,085 guyen-MS-7B22:297][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,086 guyen-MS-7B22:297][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.

[2023-04-15 20:15:23,086 guyen-MS-7B22:298][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.

[2023-04-15 20:15:23,087 guyen-MS-7B22:302][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,087 guyen-MS-7B22:300][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,087 guyen-MS-7B22:302][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.

[2023-04-15 20:15:23,087 guyen-MS-7B22:300][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.

[2023-04-15 20:15:23,088 guyen-MS-7B22:303][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!

[2023-04-15 20:15:23,088 guyen-MS-7B22:303][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,088 guyen-MS-7B22:303][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,089 guyen-MS-7B22:298][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,089 guyen-MS-7B22:298][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.

[2023-04-15 20:15:23,089 guyen-MS-7B22:303][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.

[2023-04-15 20:15:23,090 guyen-MS-7B22:296][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!

[2023-04-15 20:15:23,091 guyen-MS-7B22:296][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:23,091 guyen-MS-7B22:296][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.

[2023-04-15 20:15:23,091 guyen-MS-7B22:296][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet152.

[2023-04-15 20:15:23,092 guyen-MS-7B22:303][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,092 guyen-MS-7B22:303][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.

[2023-04-15 20:15:23,094 guyen-MS-7B22:296][model_base.py:201][INFO] Model placement - model: pytorch-resnet152, GPU availablility: True, pin memory: False, force fp32: False.

[2023-04-15 20:15:23,094 guyen-MS-7B22:296][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet152, distributed implementation: ddp.

[2023-04-15 20:15:24,085 guyen-MS-7B22:299][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!

[2023-04-15 20:15:24,086 guyen-MS-7B22:301][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!

[2023-04-15 20:15:24,086 guyen-MS-7B22:299][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:24,087 guyen-MS-7B22:299][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.

[2023-04-15 20:15:24,087 guyen-MS-7B22:301][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:24,087 guyen-MS-7B22:297][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!

[2023-04-15 20:15:24,088 guyen-MS-7B22:301][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.

[2023-04-15 20:15:24,088 guyen-MS-7B22:297][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:24,088 guyen-MS-7B22:297][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.

[2023-04-15 20:15:24,088 guyen-MS-7B22:302][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!

[2023-04-15 20:15:24,089 guyen-MS-7B22:302][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:24,090 guyen-MS-7B22:302][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.

[2023-04-15 20:15:24,090 guyen-MS-7B22:300][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!

[2023-04-15 20:15:24,090 guyen-MS-7B22:298][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!

[2023-04-15 20:15:24,090 guyen-MS-7B22:300][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:24,090 guyen-MS-7B22:298][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:24,091 guyen-MS-7B22:298][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.

[2023-04-15 20:15:24,091 guyen-MS-7B22:300][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.

[2023-04-15 20:15:24,101 guyen-MS-7B22:303][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!

[2023-04-15 20:15:24,102 guyen-MS-7B22:303][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:24,102 guyen-MS-7B22:303][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.

[2023-04-15 20:15:24,111 guyen-MS-7B22:296][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet152, message: trying to initialize the default process group twice!

[2023-04-15 20:15:24,112 guyen-MS-7B22:296][executor.py:131][INFO] benchmark: pytorch-resnet152, return code: 4, result: {'return_code': [4]}.

[2023-04-15 20:15:24,113 guyen-MS-7B22:296][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet152.
[2023-04-15 13:15:27,814 guyen-MS-7B22:8261][ansible.py:79][INFO] Run succeed, return code 0.
[2023-04-15 13:15:27,815 guyen-MS-7B22:8261][ansible.py:127][INFO] Run playbook fetch_results.yaml ...


PLAY [Fetch Results] ***********************************************************


TASK [Gathering Facts] *********************************************************
ok: [localhost]


TASK [Synchronize Output Directory] ********************************************
changed: [localhost]


PLAY RECAP *********************************************************************

localhost                  : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
[2023-04-15 13:15:29,883 guyen-MS-7B22:8261][ansible.py:79][INFO] Run succeed, return code 0.
(venv) guyen@guyen-MS-7B22:~/gg/git/superbenchmark$
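
With eight worker processes interleaving their output, the log above is hard to read. A minimal sketch for tallying the failures per benchmark and reason follows. It assumes only the line shape visible in the log (`[timestamp host:pid][file:line][LEVEL] message`); the function name `summarize_failures` and the regexes are illustrative and are not part of SuperBench. Continuation lines such as the `ncclInvalidUsage:` text are ignored because they carry no log prefix.

```python
import re
from collections import Counter

# Matches lines shaped like: [timestamp host:pid][file.py:lineno][LEVEL] message
LINE_RE = re.compile(
    r"^\[(?P<ts>[\d\-: ,]+) (?P<host>[^\]:]+):(?P<pid>\d+)\]"
    r"\[(?P<src>[^\]]+)\]\[(?P<level>[A-Z]+)\] (?P<msg>.*)$"
)
# Matches the "Run benchmark failed" message body seen in the log.
FAIL_RE = re.compile(
    r"Run benchmark failed - benchmark: (?P<bench>[\w\-]+), message: (?P<reason>.*)$"
)


def summarize_failures(lines):
    """Count (benchmark, reason) pairs over all worker processes."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("level") != "ERROR":
            continue  # skip INFO lines and unprefixed continuation lines
        f = FAIL_RE.search(m.group("msg"))
        if f:
            counts[(f.group("bench"), f.group("reason"))] += 1
    return counts


# Sample lines copied from the log above.
sample = [
    "[2023-04-15 20:15:23,075 guyen-MS-7B22:296][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.",
    "[2023-04-15 20:15:23,075 guyen-MS-7B22:300][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:839, invalid usage, NCCL version 21.0.3",
    "ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).",
    "[2023-04-15 20:15:23,080 guyen-MS-7B22:299][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!",
    "[2023-04-15 20:15:23,081 guyen-MS-7B22:301][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: trying to initialize the default process group twice!",
]

summary = summarize_failures(sample)
for (bench, reason), n in summary.items():
    print(f"{n}x {bench}: {reason}")
```

Grouping this way makes the pattern in the log easier to see: resnet50 fails with the NCCL error, and the later models fail with "trying to initialize the default process group twice!".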


jdgh000 commented on May 22, 2024

I left only resnet101 enabled and now it runs out of memory on an RTX 2070 Super (8 GB).
What is the memory requirement for resnet_models: resnet101?
Is there any smaller training workload that can fit in 8 GB?


 worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
[2023-04-15 20:32:44,711 guyen-MS-7B22:237][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet50, message: CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 7.78 GiB total capacity; 6.58 GiB already allocated; 127.62 MiB free; 6.71 GiB reserved in total by PyTorch)
[2023-04-15 20:32:44,712 guyen-MS-7B22:237][executor.py:131][INFO] benchmark: pytorch-resnet50, return code: 4, result: {'return_code': [4]}.
[2023-04-15 20:32:44,712 guyen-MS-7B22:237][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet50.
[2023-04-15 20:32:44,713 guyen-MS-7B22:237][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.
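
The numbers in the out-of-memory message can be read off directly. A minimal sketch, assuming only the message format shown in the log above (the regex and the helper `to_mib` are illustrative, not a PyTorch API):

```python
import re

# Message text copied from the log above.
msg = (
    "CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 7.78 GiB total "
    "capacity; 6.58 GiB already allocated; 127.62 MiB free; 6.71 GiB reserved "
    "in total by PyTorch)"
)

OOM_RE = re.compile(
    r"Tried to allocate (?P<req>[\d.]+) (?P<req_u>[MG]iB) "
    r"\(GPU (?P<gpu>\d+); (?P<total>[\d.]+) (?P<total_u>[MG]iB) total capacity; "
    r"(?P<alloc>[\d.]+) (?P<alloc_u>[MG]iB) already allocated; "
    r"(?P<free>[\d.]+) (?P<free_u>[MG]iB) free"
)


def to_mib(value, unit):
    """Convert a size given in MiB or GiB to MiB."""
    return float(value) * (1024.0 if unit == "GiB" else 1.0)


m = OOM_RE.search(msg)
requested = to_mib(m.group("req"), m.group("req_u"))
free = to_mib(m.group("free"), m.group("free_u"))
total = to_mib(m.group("total"), m.group("total_u"))
allocated = to_mib(m.group("alloc"), m.group("alloc_u"))

# How much more memory the failing allocation needed than was free.
shortfall = round(requested - free, 2)
# Share of the card already held by tensors when the allocation failed.
used_fraction = allocated / total

print(f"shortfall: {shortfall} MiB, already allocated: {used_fraction:.0%} of the card")
```

The failing request itself is small (196 MiB); the card was already almost full when it was made, which is consistent with the workload simply being too large for 8 GB.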


cp5555 commented on May 22, 2024

Sorry, SuperBenchmark doesn't officially support the GeForce series at all, so running on it can introduce many unexpected issues. There is no plan to add GeForce series support in upcoming releases.

For GeForce-related code and configuration settings (e.g. RTX 2070), it would be great if you could contribute them.


jdgh000 commented on May 22, 2024

It was due to memory size; I did manage to run some of the smaller training workloads. You can close this.


jdgh000 commented on May 22, 2024

configuration

I don't think a contribution is necessary. These are just the same CUDA chips under a different brand name, with smaller memory sizes, plus gaming chips. Performance will be slower (GDDR5 instead of HBM, 8 GB vs. 32 GB, etc.), but I had no issue running scaled-down workloads.
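
What "scaled down" means is not spelled out in the thread. One plausible reading, sketched below, is lowering the process count and batch size in the benchmark config. The key names (`superbench`, `benchmarks`, `modes`, `proc_num`, `parameters`, `batch_size`) are taken from the runner config dump earlier in the thread; the helper `scale_down`, the `resnet_models` parameter values, and the chosen limits are hypothetical.

```python
import copy


def scale_down(config, proc_num, batch_size=None):
    """Return a copy of config with proc_num and batch_size capped (hypothetical helper)."""
    scaled = copy.deepcopy(config)  # leave the caller's config untouched
    for bench in scaled["superbench"]["benchmarks"].values():
        for mode in bench.get("modes", []):
            if "proc_num" in mode:
                mode["proc_num"] = min(mode["proc_num"], proc_num)
        params = bench.get("parameters", {})
        if batch_size is not None and "batch_size" in params:
            params["batch_size"] = min(params["batch_size"], batch_size)
    return scaled


# Illustrative config fragment; the batch_size of 32 is an assumed value.
example = {
    "superbench": {
        "benchmarks": {
            "resnet_models": {
                "enable": True,
                "modes": [{"name": "torch.distributed", "node_num": 1, "proc_num": 8}],
                "parameters": {"batch_size": 32, "num_steps": 128},
            }
        }
    }
}

# One process for a single-GPU machine, and a smaller batch for an 8 GB card.
small = scale_down(example, proc_num=1, batch_size=8)
print(small["superbench"]["benchmarks"]["resnet_models"])
```

Capping `proc_num` at the number of GPUs actually present matters as much as the batch size here, since the default mode in the config dump launches 8 processes.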


cp5555 commented on May 22, 2024

Thanks for the discussion, we really appreciate it. Closing this issue.

