
microsoft / superbenchmark

A validation and profiling tool for AI infrastructure

Home Page: https://aka.ms/superbench

License: MIT License

Dockerfile 2.55% Python 65.32% Makefile 0.04% Shell 0.08% CMake 1.00% Cuda 5.41% Jinja 0.01% C++ 22.10% JavaScript 0.43% CSS 0.09% HTML 0.27% Batchfile 0.07% HLSL 2.62%
benchmark ai-system superbench azure hacktoberfest

superbenchmark's Introduction

SuperBench


SuperBench is a validation and profiling tool for AI infrastructure.

📢 v0.10.0 has been released!

Check aka.ms/superbench for more details.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

superbenchmark's People

Contributors

abuccts, cp5555, dependabot[bot], guoshzhao, hpourreza, jaredamd, jeffdaily, jeseszhang1010, kaiyux, kawaii-ghost, microsoftopensource, pnunna93, quge009, rafsalas19, ryoyang, tobeyqin, yangpanms, yukirora, yzygitzh


superbenchmark's Issues

why '/dev/nvidia-uvm' is a required file check for nvidia GPU?

What's the issue, what's expected?:
This is v0.6.0. In the superbench code gpu.py (https://github.com/microsoft/superbenchmark/blob/main/superbench/common/devices/gpu.py#L24), it checks whether a GPU is NVIDIA by checking for both '/dev/nvidiactl' and '/dev/nvidia-uvm'. The question is: why does it require /dev/nvidia-uvm? I found that some GPU types, such as 'Tesla K80' with CUDA 11.4, don't have this file.

How to reproduce it?:
Log on to a machine with 'Tesla K80' + CUDA 11.4.
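
For reference, a minimal sketch (based only on the behavior described above, not the actual gpu.py code; the helper name is hypothetical) of this kind of device-file check:

from pathlib import Path

def looks_like_nvidia_gpu():
    # Mirrors the check described above: both device files must exist.
    # On some setups (e.g. Tesla K80 with CUDA 11.4) /dev/nvidia-uvm may be
    # absent even though an NVIDIA GPU is present, so this check would fail.
    return Path('/dev/nvidiactl').exists() and Path('/dev/nvidia-uvm').exists()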

V0.7.0 Test Plan

Test Cases

single-node test

| Machine Type | #Node * #GPU * GPU Type | PyTorch Version | Accelerated Computing Toolkit | Status |
| --- | --- | --- | --- | --- |
| ND A100 v4 | 1 * 8 * A100 40GB SXM | PyTorch 1.8 | CUDA 11.1 | Done |
| NDm A100 v4 | 1 * 8 * A100 80GB SXM | PyTorch 1.8 | CUDA 11.1 | Done |
| Hopper | 1 * 8 * H100 | PyTorch 1.x | CUDA 11.8 | Done |

single-node Micro-benchmark Test

  1. tensorrt-inference
  • Fix Transformers version to avoid TensorRT-inference failure (#441)
  2. cublas-function/cudnn-function
  • Support list of custom config string in cudnn-functions and cublas-functions (#414)
  • Support correctness check in cublas-functions (#450, #452)
  3. mem-bw
  • Add wait time option to resolve mem-bw unstable issue (#438)

SuperBench Improvement

  • Support non-zero return code (#410, #411,#425)
  • Support log flushing to the result file during runtime (#445)
  • Update sb version to include revision hash and date (#427)

Hopper GPU and FP8 related benchmarks

  1. docker building
  • Add CUDA 11.8 Docker image for Nvidia arch90 GPUs (#449)
  2. micro-benchmark
  • Support GEMM-FLOPS for Nvidia arch90 GPUs (#456)
  • Support cuBLASLt FP16 and FP8 GEMM (#451, #455)
  • Debug some cuBLAS and cuDNN kernel crash issues
  3. model-benchmark
  • Support FP8 in BERT model training (#446)

New in bug bash

  • [x]
  • [x]

multiple-node test

Test Table

| Machine Type | #Node * #GPU * GPU Type | PyTorch Version | Accelerated Computing Toolkit | Status |
| --- | --- | --- | --- | --- |
| NDm A100 v4 | 32 * 8 * A100 80GB SXM | PyTorch 1.8 | CUDA 11.1 | Done |

distributed Micro-benchmark test

  1. ib-traffic
  • Support pair-wise pattern in IB validation benchmark (#453)
  • Support 'pattern' in 'mpi' mode to run tasks in parallel (#447)
  2. nccl-bw
  • Support topo-aware, all-pair, and K-batch pattern in 'mpi' mode (#437, #458)
  • Support topo-aware, pair-wise, and K-batch pattern in nccl-bw benchmark (#454)

New in bug bash

  • [x]
  • [x]

V0.4.0 Release Plan

Release Manager

@cp5555

Endgame

Code freeze: Dec. 12th, 2021
Bug Bash date: Dec. 13th, 2021
Release date: Dec. 24th, 2021

Main Features

Microbenchmark

    • CPU Memory Validation (Tool: Intel Memory Latency Checker) (#126)

      | Metrics | Unit | Description |
      | --- | --- | --- |
      | cpu-memory-bw-latency/mem_bandwidth_matrix_numa_[0-9]+_[0-9]+_bw | bandwidth (GB/s) | Former NUMA to latter NUMA memory bandwidth. |
      | cpu-memory-bw-latency/mem_bandwidth_matrix_numa_[0-9]+_[0-9]+_lat | time (us) | Former NUMA to latter NUMA memory latency. |
      | cpu-memory-bw-latency/mem_max_bandwidth_all_reads_bw | bandwidth (GB/s) | Whole-CPU maximum memory bandwidth, full read. |
      | cpu-memory-bw-latency/mem_max_bandwidth_3_1_reads-writes_bw | bandwidth (GB/s) | Whole-CPU maximum memory bandwidth, read : write = 3 : 1. |
      | cpu-memory-bw-latency/mem_max_bandwidth_2_1_reads-writes_bw | bandwidth (GB/s) | Whole-CPU maximum memory bandwidth, read : write = 2 : 1. |
      | cpu-memory-bw-latency/mem_max_bandwidth_1_1_reads-writes_bw | bandwidth (GB/s) | Whole-CPU maximum memory bandwidth, read : write = 1 : 1. |
      | cpu-memory-bw-latency/mem_max_bandwidth_stream-triad_like_bw | bandwidth (GB/s) | Whole-CPU maximum memory bandwidth, with stream-triad like pattern. |
    • GPU Copy Bandwidth (Tool: Built by MSRA) (#230)

      | Metrics | Unit | Description |
      | --- | --- | --- |
      | gpu-copy-bw/cpu_to_gpu[0-9]+_by_gpu[0-9]+_using_(sm|dma)_under_numa[0-9]+_bw | GB/s | The bandwidth reading from all NUMA nodes' host memory using DMA engine or GPU SM by all GPUs |
      | gpu-copy-bw/gpu[0-9]+_to_cpu_by_gpu[0-9]+_using_(sm|dma)_under_numa[0-9]+_bw | GB/s | The bandwidth writing to all NUMA nodes' host memory using DMA engine or GPU SM by all GPUs |
      | gpu-copy-bw/gpu[0-9]+_to_gpu[0-9]+_by_gpu[0-9]+_using_(sm|dma)_under_numa[0-9]+_bw | GB/s | The bandwidth reading from or writing to all GPUs using DMA engine or GPU SM by all GPUs with peer communication enabled |
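
      As an aside (illustrative only, not part of the release plan), the metric names above follow a regular pattern and can be parsed with a small regex, e.g.:

      import re

      # Hypothetical helper: parse a gpu-copy-bw metric name such as
      # 'gpu-copy-bw/cpu_to_gpu0_by_gpu0_using_sm_under_numa1_bw'.
      GPU_COPY_METRIC = re.compile(
          r'gpu-copy-bw/(?P<src>cpu|gpu\d+)_to_(?P<dst>cpu|gpu\d+)'
          r'_by_gpu(?P<worker>\d+)_using_(?P<engine>sm|dma)_under_numa(?P<numa>\d+)_bw'
      )

      def parse_gpu_copy_metric(name):
          m = GPU_COPY_METRIC.match(name)
          return m.groupdict() if m else None

      print(parse_gpu_copy_metric('gpu-copy-bw/cpu_to_gpu0_by_gpu0_using_sm_under_numa1_bw'))
      # {'src': 'cpu', 'dst': 'gpu0', 'worker': '0', 'engine': 'sm', 'numa': '1'}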

Distributed Networking Benchmarks

    • Support IB Networking Validation (#191)
      | Metrics | Unit | Description |
      | --- | --- | --- |
      | ib-traffic/${command}${line}${pair}${server}${client}_bw | GB/s | the average bandwidth of the ib command (ib_write_bw, ib_send_bw, ib_read_bw) run between the <pair>th node pair in the <line>th line of the config |
      | ib-traffic/${command}${line}${pair}${server}${client}_lat | usec | the max latency of the ib command (ib_write_lat, ib_send_lat, ib_read_lat) run between the <pair>th node pair in the <line>th line of the config |
    • Support TCP Validation (Tool: TCPing) (#217)

      | Metrics | Unit | Description |
      | --- | --- | --- |
      | tcp-connectivity/${hostname/ip}_successed_count | | succeeded count of tcp connections between current node and other nodes |
      | tcp-connectivity/${hostname/ip}_failed_count | | failed count of tcp connections between current node and other nodes |
      | tcp-connectivity/${hostname/ip}_success_rate | | success rate (succeeded/total) of tcp connections between current node and other nodes |
      | tcp-connectivity/${hostname/ip}_time_min | ms | minimum latency of tcp connections between current node and other nodes |
      | tcp-connectivity/${hostname/ip}_time_max | ms | maximum latency of tcp connections between current node and other nodes |
      | tcp-connectivity/${hostname/ip}_time_avg | ms | average latency of tcp connections between current node and other nodes |
    • Support GPCNet Validation (#228 and #229)
      | Metrics | Unit | Description |
      | --- | --- | --- |
      | gpcnet-network-test/rr_two-sided_lat_${stat} | time (us) | statistical values (min, max, avg, 99%, 99.9%) obtained by all nodes using the 'random ring communication pattern two-sided latency' algorithm for network testing |
      | gpcnet-network-test/rr_two-sided+sync_bw_${stat} | bandwidth (MiB/s/rank) | statistical values (min, max, avg, 99%, 99.9%) obtained by all nodes using the 'random ring communication pattern two-sided bandwidth with barrier' algorithm for network testing |
      | gpcnet-network-test/multiple_allreduce_time_${stat} | time (us) | statistical values (min, max, avg, 99%, 99.9%) obtained by all nodes using the 'multiple allreduce bandwidth' algorithm for network testing |
      | gpcnet-network-test/rr_get_lat_${stat} | bandwidth (MiB/s/rank) | statistical values (min, max, avg, 99%, 99.9%) obtained by all nodes using the 'RR GetLat (8 B)' algorithm for network testing |
      | gpcnet-network-test/rr_two-sided_bw_${stat} | bandwidth (MiB/s/rank) | statistical values (min, max, avg, 99%, 99.9%) obtained by all nodes using the 'RR Two-sidedBW (131072 B)' algorithm for network testing |
      | gpcnet-network-test/nat_two-sided_bw_${stat} | bandwidth (MiB/s/rank) | statistical values (min, max, avg, 99%, 99.9%) obtained by all nodes using the 'Nat Two-sidedBW (131072 B)' algorithm for network testing |
      | gpcnet-network-test/multiple_alltoall_bw_${stat} | bandwidth (MiB/s/rank) | statistical values (min, max, avg, 99%, 99.9%) obtained by all nodes using the 'Multiple Alltoall (4096 B)' algorithm for network testing |
      | gpcnet-network-load-test/rr_two-sided_lat_x_${stat} | factor (x) | summary of the congestion impact factor of the network test algorithm |
      | gpcnet-network-load-test/rr_two-sided+sync_bw_x_${stat} | factor (x) | summary of the congestion impact factor of the network test algorithm |
      | gpcnet-network-load-test/multiple_allreduce_x_${stat} | factor (x) | summary of the congestion impact factor of the network test algorithm |

SuperBench Improvement -- @guoshzhao

    • Add pipeline for AMD docker (#194)
    • Integrate system config info script with SuperBench (#199)
    • Support FP32 mode without TF32 (#213)
    • Refine the UT for microbenchmark (#268)
    • Unify metric names for all benchmarks (#252)

More E2E Models for AMD and Inference -- @lynex

    • Add ORT Model on AMD GPU platform (#227)
      Model: Bert-large, Distilbert-base, GPT-2, facebook/Bart-large and Roberta-large
      | Metrics | Unit | Description |
      | --- | --- | --- |
      | onnxruntime-ort-models/bert_large_uncased_ngpu_1_throughput | samples/s | The throughput of bert large uncased model on 1 GPU |
      | onnxruntime-ort-models/bert_large_uncased_ngpu_8_throughput | samples/s | The throughput of bert large uncased model on 8 GPUs |
      | onnxruntime-ort-models/distilbert_base_uncased_ngpu_1_throughput | samples/s | The throughput of distilbert base uncased model on 1 GPU |
      | onnxruntime-ort-models/distilbert_base_uncased_ngpu_8_throughput | samples/s | The throughput of distilbert base uncased model on 8 GPUs |
      | onnxruntime-ort-models/gpt2_ngpu_1_throughput | samples/s | The throughput of gpt2 model on 1 GPU |
      | onnxruntime-ort-models/gpt2_ngpu_8_throughput | samples/s | The throughput of gpt2 model on 8 GPUs |
      | onnxruntime-ort-models/facebook_bart_large_ngpu_1_throughput | samples/s | The throughput of facebook bart large model on 1 GPU |
      | onnxruntime-ort-models/facebook_bart_large_ngpu_8_throughput | samples/s | The throughput of facebook bart large model on 8 GPUs |
      | onnxruntime-ort-models/roberta_large_ngpu_1_throughput | samples/s | The throughput of roberta large model on 1 GPU |
      | onnxruntime-ort-models/roberta_large_ngpu_8_throughput | samples/s | The throughput of roberta large model on 8 GPUs |
    • Add Inference Backend TensorRT (#236, #254)
      | Name | Unit | Description |
      | --- | --- | --- |
      | tensorrt-inference/${model}_gpu_time_mean | time (ms) | The mean GPU latency to execute the kernels for a query. |
      | tensorrt-inference/${model}_gpu_time_99 | time (ms) | The 99th percentile GPU latency to execute the kernels for a query. |
      | tensorrt-inference/${model}_host_time_mean | time (ms) | The mean H2D, GPU, and D2H latency to execute the kernels for a query. |
      | tensorrt-inference/${model}_host_time_99 | time (ms) | The 99th percentile H2D, GPU, and D2H latency to execute the kernels for a query. |
      | tensorrt-inference/${model}_end_to_end_time_mean | time (ms) | The mean duration from when the H2D of a query is called to when the D2H of the same query is completed. |
      | tensorrt-inference/${model}_end_to_end_time_99 | time (ms) | The P99 duration from when the H2D of a query is called to when the D2H of the same query is completed. |
    • Add Inference Backend ORT for Nvidia (#245)
      | Name | Unit | Description |
      | --- | --- | --- |
      | ort-inference/{precision}_{model}_time | time (ms) | The mean latency to execute one batch of inference. |

Data Diagnosis & Analysis -- @yukirora

    • Support baseline-based data diagnosis(#242)
    • Support basic analysis feature (boxplot figure, outlier detection, etc.)(#248)

Monitor -- @guoshzhao

    • Add monitor framework for CPU, memory, disk, GPU, etc. (#240)
    • Integrate monitor with SuperBench (#259)

Document

    • Monitor Document (#265)
    • Data Diagnosis Document(#249)

Backlogs

SuperBench Improvement

    • Improve Output Interface
    • Auto kill all processes on all nodes
    • Add Heart beat to monitor process health

Found no NVIDIA driver on your system.

What's the issue, what's expected?:
PyTorch inside the Docker container cannot find my GPU.

How to reproduce it?:
Install superbenchmark as normal.

sb run -f local.ini -c resnet.yaml --host-password=mypassword

GPU: Quadro RTX 6000, driver is 530.30.02
nvidia-smi
(screenshot of nvidia-smi output attached)

Log message or snapshot?:
/opt/conda/lib/python3.8/site-packages/torch/cuda/init.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system.

Additional information:
OS: Ubuntu 22.04.02

How can I fix this problem?

V0.6.0 Test Plan


Test Cases

single-node test

| Machine Type | #Node * #GPU * GPU Type | PyTorch Version | Accelerated Computing Toolkit | Status |
| --- | --- | --- | --- | --- |
| ND A100 v4 | 1 * 8 * A100 40GB SXM | PyTorch 1.8 | CUDA 11.1 | Done |
| NDm A100 v4 | 1 * 8 * A100 80GB SXM | PyTorch 1.8 | CUDA 11.1 | Done |
| Hayabusa | 1 * 16 * MI200 | PyTorch 1.9 | ROCm 5.1 | Done |

single-node Micro-benchmark Test

  1. ib-loopback
  • Fix issues in ib loopback benchmark (#369)
  • Fix stability issue in ib loopback benchmark (#386)
  • Fix port conflict in ib loopback (#375)
  2. Rccl-test/nccl-test
  • Update Dockerfile for NCCL/RCCL version, tag name, and verbose output (#371)
  • Support node_num=1 in mpi mode (#372)

SuperBench Improvement

  • Support running on host directly without Docker(#356, #358, #362)
  • Support automatic configuration yaml selection on Azure VM
  • Add return code for Timeout(#383,#385)
  • Support ROCm 5.1.1 (#353, #354), Support ROCm 5.1.3 (#361)

Tools

  1. data diagnosis
  • Fix bugs in data diagnosis (#355)
  • Add failure check function in data_diagnosis.py (#378)
  • Support Json and Jsonl in Diagnosis. (#388)
  • Add support to store values of metrics in data diagnosis. (#392)

New in bug bash

  • Make baseline file optional in data diagnosis and fix bugs (#399)
  • Update error handling to support exit code of sb result diagnosis (#403)
  • Format int type and unify empty value to N/A in diagnosis output files (#406)
  • Upgrade colorlog for NO_COLOR support (#404)
  • Enhance timeout cleanup to avoid possible hanging (#405)

multiple-node test

Test Table

| Machine Type | #Node * #GPU * GPU Type | PyTorch Version | Accelerated Computing Toolkit | Status |
| --- | --- | --- | --- | --- |
| ND A100 v4 | 32 * 8 * A100 40GB SXM | PyTorch 1.8 | CUDA 11.1 | Done |

distributed Micro-benchmark test

  1. ib-traffic
  • Support multiple IB/GPU Pair-wise IB benchmark (#363)
  • Bug Fix in IB benchmark in all-pair mode(#370, #377)
  • Topology-aware IB benchmark (#373, #381)

New in bug bash

  • Auto generate ibstat file by pssh (#402)
  • Enable latency test in ib traffic validation distributed benchmark(#396)

Installation failing when following instruction [forcefully closed without fixing]

Ubuntu1804


nonroot@nonroot-MS-7B22:~/superbenchmark$ pwd
/home/nonroot/superbenchmark
nonroot@nonroot-MS-7B22:~/superbenchmark$ git remote -v
origin	https://github.com/microsoft/superbenchmark (fetch)
origin	https://github.com/microsoft/superbenchmark (push)
nonroot@nonroot-MS-7B22:~/superbenchmark$ 

nonroot@nonroot-MS-7B22:~/superbenchmark$ sudo python3 -m pip install .
Keyring is skipped due to an exception: org.freedesktop.DBus.Error.NoServer: Failed to connect to socket /tmp/dbus-9Lms1YHOlb: Connection refused
WARNING: The directory '/home/nonroot/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.
Processing /home/nonroot/superbenchmark
  Preparing metadata (setup.py) ... done
Collecting ansible_runner>=2.0.0rc1
  Downloading ansible_runner-2.1.1-py3-none-any.whl (83 kB)
     |████████████████████████████████| 83 kB 4.7 MB/s             
Collecting colorlog>=4.7.2
  Downloading colorlog-6.6.0-py2.py3-none-any.whl (11 kB)
Collecting jinja2>=2.10.1
  Downloading Jinja2-3.0.3-py3-none-any.whl (133 kB)
     |████████████████████████████████| 133 kB 12.2 MB/s            
Requirement already satisfied: joblib>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from superbench==0.4.0) (1.1.0)
Collecting jsonlines>=2.0.0
  Downloading jsonlines-3.0.0-py3-none-any.whl (8.5 kB)
Collecting knack>=0.7.2
  Downloading knack-0.9.0-py3-none-any.whl (59 kB)
     |████████████████████████████████| 59 kB 94.4 MB/s            
Requirement already satisfied: matplotlib>=3.0.0 in /usr/local/lib/python3.6/dist-packages (from superbench==0.4.0) (3.3.4)
Collecting natsort>=7.1.1
  Downloading natsort-8.0.2-py3-none-any.whl (37 kB)
Collecting omegaconf==2.0.6
  Downloading omegaconf-2.0.6-py3-none-any.whl (36 kB)
Collecting openpyxl>=3.0.7
  Downloading openpyxl-3.0.9-py2.py3-none-any.whl (242 kB)
     |████████████████████████████████| 242 kB 44.7 MB/s            
Requirement already satisfied: pandas>=1.1.5 in /usr/local/lib/python3.6/dist-packages (from superbench==0.4.0) (1.1.5)
Collecting pyyaml>=5.3
  Downloading PyYAML-6.0-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (603 kB)
     |████████████████████████████████| 603 kB 59.7 MB/s            
Collecting seaborn>=0.11.2
  Downloading seaborn-0.11.2-py3-none-any.whl (292 kB)
     |████████████████████████████████| 292 kB 62.9 MB/s            
Collecting tcping>=0.1.1rc1
  Downloading tcping-0.1.1rc1.tar.gz (4.1 kB)
  Preparing metadata (setup.py) ... done
Collecting xlrd>=2.0.1
  Downloading xlrd-2.0.1-py2.py3-none-any.whl (96 kB)
     |████████████████████████████████| 96 kB 45.4 MB/s            
Collecting xlsxwriter>=1.3.8
  Downloading XlsxWriter-3.0.2-py3-none-any.whl (149 kB)
     |████████████████████████████████| 149 kB 69.3 MB/s            
Collecting xmltodict>=0.12.0
  Downloading xmltodict-0.12.0-py2.py3-none-any.whl (9.2 kB)
Collecting ansible_base>=2.10.9
  Downloading ansible-base-2.10.16.tar.gz (6.1 MB)
     |████████████████████████████████| 6.1 MB 25.8 MB/s            
  Preparing metadata (setup.py) ... done
Requirement already satisfied: dataclasses in /home/nonroot/.local/lib/python3.6/site-packages (from omegaconf==2.0.6->superbench==0.4.0) (0.8)
Requirement already satisfied: typing-extensions in /home/nonroot/.local/lib/python3.6/site-packages (from omegaconf==2.0.6->superbench==0.4.0) (4.0.1)
Requirement already satisfied: cryptography in /usr/lib/python3/dist-packages (from ansible_base>=2.10.9->superbench==0.4.0) (2.1.4)
Collecting packaging
  Downloading packaging-21.3-py3-none-any.whl (40 kB)
     |████████████████████████████████| 40 kB 72.5 MB/s            
Requirement already satisfied: six in /home/nonroot/.local/lib/python3.6/site-packages (from ansible_runner>=2.0.0rc1->superbench==0.4.0) (1.16.0)
Collecting pexpect>=4.5
  Downloading pexpect-4.8.0-py2.py3-none-any.whl (59 kB)
     |████████████████████████████████| 59 kB 63.9 MB/s            
Collecting python-daemon
  Downloading python_daemon-2.3.0-py2.py3-none-any.whl (35 kB)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.6/dist-packages (from jinja2>=2.10.1->superbench==0.4.0) (2.0.1)
Collecting attrs>=19.2.0
  Downloading attrs-21.4.0-py2.py3-none-any.whl (60 kB)
     |████████████████████████████████| 60 kB 63.3 MB/s            
Collecting argcomplete
  Downloading argcomplete-2.0.0-py2.py3-none-any.whl (37 kB)
Collecting jmespath
  Downloading jmespath-0.10.0-py2.py3-none-any.whl (24 kB)
Requirement already satisfied: tabulate in /usr/local/lib/python3.6/dist-packages (from knack>=0.7.2->superbench==0.4.0) (0.8.9)
Collecting pygments
  Downloading Pygments-2.11.2-py3-none-any.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 99.9 MB/s            
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0.0->superbench==0.4.0) (2.8.2)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0.0->superbench==0.4.0) (0.11.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0.0->superbench==0.4.0) (1.3.1)
Requirement already satisfied: numpy>=1.15 in /home/nonroot/.local/lib/python3.6/site-packages (from matplotlib>=3.0.0->superbench==0.4.0) (1.19.5)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0.0->superbench==0.4.0) (8.4.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0.0->superbench==0.4.0) (3.0.7)
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Requirement already satisfied: pytz>=2017.2 in /usr/lib/python3/dist-packages (from pandas>=1.1.5->superbench==0.4.0) (2018.3)
Requirement already satisfied: scipy>=1.0 in /usr/local/lib/python3.6/dist-packages (from seaborn>=0.11.2->superbench==0.4.0) (1.5.4)
Collecting click
  Downloading click-8.0.3-py3-none-any.whl (97 kB)
     |████████████████████████████████| 97 kB 64.2 MB/s            
Collecting prettytable
  Downloading prettytable-2.5.0-py3-none-any.whl (24 kB)
Requirement already satisfied: ptyprocess>=0.5 in /usr/local/lib/python3.6/dist-packages (from pexpect>=4.5->ansible_runner>=2.0.0rc1->superbench==0.4.0) (0.7.0)
Requirement already satisfied: importlib-metadata<5,>=0.23 in /usr/local/lib/python3.6/dist-packages (from argcomplete->knack>=0.7.2->superbench==0.4.0) (4.8.3)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.6/dist-packages (from prettytable->tcping>=0.1.1rc1->superbench==0.4.0) (0.2.5)
Requirement already satisfied: docutils in /usr/local/lib/python3.6/dist-packages (from python-daemon->ansible_runner>=2.0.0rc1->superbench==0.4.0) (0.18.1)
Requirement already satisfied: lockfile>=0.10 in /usr/local/lib/python3.6/dist-packages (from python-daemon->ansible_runner>=2.0.0rc1->superbench==0.4.0) (0.12.2)
Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from python-daemon->ansible_runner>=2.0.0rc1->superbench==0.4.0) (39.0.1)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.6/dist-packages (from importlib-metadata<5,>=0.23->argcomplete->knack>=0.7.2->superbench==0.4.0) (3.6.0)
Building wheels for collected packages: superbench, ansible-base, tcping
  Building wheel for superbench (setup.py) ... done
  Created wheel for superbench: filename=superbench-0.4.0-py3-none-any.whl size=139599 sha256=f08e0e47cd18aadcd92ad96228b346a44daa7d569eb24ad2269e8d3ebb472819
  Stored in directory: /tmp/pip-ephem-wheel-cache-dsy5_fgx/wheels/dc/bb/c9/0181c21c034d8eb5089301c92e8e8b249ecc42b0a8569ef352
  Building wheel for ansible-base (setup.py) ... done
  Created wheel for ansible-base: filename=ansible_base-2.10.16-py3-none-any.whl size=1871330 sha256=550d11df3e25bf809017d59c36152caf22db6031f3823c725064ac06325d9600
  Stored in directory: /tmp/pip-ephem-wheel-cache-dsy5_fgx/wheels/ac/3a/eb/1953c987dfe9515f0b3c0770e22520361beedf030ec746b716
  Building wheel for tcping (setup.py) ... done
  Created wheel for tcping: filename=tcping-0.1.1rc1-py3-none-any.whl size=6400 sha256=8e0a41e98d6c0f4d7ea34b176f95b1cc7444edd7eeddf591d3a7fcc948d4f381
  Stored in directory: /tmp/pip-ephem-wheel-cache-dsy5_fgx/wheels/79/b4/79/cd1464d78ff94847f17dde162e88301861ffcdbae7b57279f0
Successfully built superbench ansible-base tcping
Installing collected packages: pyyaml, python-daemon, pygments, prettytable, pexpect, packaging, jmespath, jinja2, et-xmlfile, click, attrs, argcomplete, xmltodict, xlsxwriter, xlrd, tcping, seaborn, openpyxl, omegaconf, natsort, knack, jsonlines, colorlog, ansible-runner, ansible-base, superbench
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.12
ERROR: Cannot uninstall 'PyYAML'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.
nonroot@nonroot-MS-7B22:~/superbenchmark$ sudo pip3 install --upgrade PyYAML
WARNING: The directory '/home/nonroot/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.
Requirement already satisfied: PyYAML in /usr/lib/python3/dist-packages (3.12)
Collecting PyYAML
  Downloading PyYAML-6.0-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (603 kB)
     |████████████████████████████████| 603 kB 4.2 MB/s            
Installing collected packages: PyYAML
  Attempting uninstall: PyYAML
    Found existing installation: PyYAML 3.12
ERROR: Cannot uninstall 'PyYAML'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.
nonroot@nonroot-MS-7B22:~/superbenchmark$ sudo python3 -m pip install .^C
nonroot@nonroot-MS-7B22:~/superbenchmark$ pwd
/home/nonroot/superbenchmark
nonroot@nonroot-MS-7B22:~/superbenchmark$ git remote -v
origin	https://github.com/microsoft/superbenchmark (fetch)
origin	https://github.com/microsoft/superbenchmark (push)
nonroot@nonroot-MS-7B22:~/superbenchmark$ 
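
For reference, a common workaround (not from the original thread) for this "distutils installed project" error is pip's --ignore-installed flag, e.g.:

sudo python3 -m pip install --ignore-installed PyYAML
sudo python3 -m pip install .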

why is it probing for nvidia when running on MI?

The documentation does not say anything specific about running on either NVIDIA or AMD, but the platform apparently cannot make the distinction!

[2023-04-16 07:09:53,869 abys245:321][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.
[0]: /opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
[2023-04-16 07:09:54,404 abys245:321][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: False, pin memory: False, force fp32: False.
[2023-04-16 07:09:54,405 abys245:321][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.
[1]: /opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
[2023-04-16 07:09:54,440 abys245:322][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: False, pin memory: False, force fp32: False.
[2023-04-16 07:09:54,440 abys245:322][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.
[2023-04-16 07:09:54,442 abys245:322][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
[2023-04-16 07:09:54,442 abys245:322][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.
[2023-04-16 07:09:54,442 abys245:322][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.
[2023-04-16 07:09:54,448 abys245:321][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
[2023-04-16 07:09:54,448 abys245:321][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.
[2023-04-16 07:09:54,448 abys245:321][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.
[2023-04-16 07:09:54,905 abys245:74026][ansible.py:79][INFO] Run succeed, return code 0.
[2023-04-16 07:09:54,906 abys245:74026][ansible.py:127][INFO] Run playbook fetch_results.yaml ...

PLAY [Fetch Results] ***********************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [Synchronize Output Directory] ********************************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
[2023-04-16 07:09:57,458 abys245:74026][ansible.py:79][INFO] Run succeed, return code 0.
(venv) jd@lab101:~/nm/git/superbenchmark$ 

sb should return a non-zero exit code when executor.py fails

What's the issue, what's expected?:
This is using the v0.6.0 release. The gemm-flops benchmark is run on a platform where the GPU is probably not supported (Tesla K80). SuperBench reports an internal error like "Executor failed in gemm-flops, invalid context." However, at the end, it returns exit code 0.

Expected: it should return a non-zero exit code for this type of error.
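
To illustrate why this matters (a hypothetical wrapper, not part of SuperBench; file names are placeholders): any automation around sb can only rely on the process exit code, so a failure that exits with 0 goes unnoticed:

import subprocess

# Hypothetical CI wrapper around sb.
result = subprocess.run(['sb', 'run', '-f', 'host.ini', '-c', 'config.yaml'])
if result.returncode != 0:
    raise SystemExit('SuperBench failed with exit code %d' % result.returncode)
# With the behavior described above, this check never trips even though the
# executor logged "Executor failed in gemm-flops, invalid context."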

How to reproduce it?:
On a VM with Tesla K80 GPU (or CPU), run gemm-flops benchmark.

Log message or snapshot?:
[2022-09-09 20:38:20,578 N000000:38919][runner.py:392][INFO] Runner is going to run gemm-flops in local mode, proc rank 1.
[2022-09-09 20:38:20,580 N000000:38919][ansible.py:107][INFO] Run docker exec --env-file /tmp/sb.env sb-workspace bash -c 'PROC_RANK=1 CUDA_VISIBLE_DEVICES=1 timeout 1200 sb exec --output-dir outputs/2022-09-09_20-38-14 -c sb.config.yaml -C superbench.enable=gemm-flops' on remote ...
[2022-09-09 20:38:20,580 N000000:38919][ansible.py:72][INFO] Run as sudo ...

localhost | CHANGED | rc=0 >>
[2022-09-09 20:38:22,577 N000000:246][executor.py:235][INFO] Executor is going to execute gemm-flops.
[2022-09-09 20:38:23,363 N000000:246][registry.py:255][WARNING] Benchmark has no implementation, name: gemm-flops, platform: CPU
[2022-09-09 20:38:23,364 N000000:246][executor.py:132][ERROR] Executor failed in gemm-flops, invalid context.

localhost | CHANGED | rc=0 >>
[2022-09-09 20:38:22,702 N000000:260][executor.py:235][INFO] Executor is going to execute gemm-flops.
[2022-09-09 20:38:23,479 N000000:260][registry.py:255][WARNING] Benchmark has no implementation, name: gemm-flops, platform: CPU
[2022-09-09 20:38:23,479 N000000:260][executor.py:132][ERROR] Executor failed in gemm-flops, invalid context.
[2022-09-09 20:38:23,731 N000000:38918][ansible.py:78][INFO] Run succeed, return code 0.
[2022-09-09 20:38:23,860 N000000:38919][ansible.py:78][INFO] Run succeed, return code 0.
[2022-09-09 20:38:23,862 N000000:38433][ansible.py:125][INFO] Run playbook fetch_results.yaml ...

PLAY [Fetch Results] ***********************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [Synchronize Output Directory] ********************************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost : ok=2 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
[2022-09-09 20:38:26,514 N000000:38433][ansible.py:78][INFO] Run succeed, return code 0.
[2022-09-09 20:38:26,516 N000000:38433][runner.py:256][ERROR] Invalid content in JSON file: /home/aiscadmin/superbench/outputs/2022-09-09_20-38-14/nodes/N000000/benchmarks/gemm-flops/rank0/results.json
[2022-09-09 20:38:26,516 N000000:38433][runner.py:256][ERROR] Invalid content in JSON file: /home/aiscadmin/superbench/outputs/2022-09-09_20-38-14/nodes/N000000/benchmarks/gemm-flops/rank1/results.json
2022-09-09 20:38:26.746808: Command exit code: 0
Finished all. errors=0, runtime=13.1 s

Additional information:

Support for multi-NIC concurrent test - RDMA

What would you like to be added:
Support for multi-NIC concurrent testing of GPU-GPU RDMA

Why is this needed:
A concurrent multi-NIC RDMA test will be able to validate capability and stability at the node level.

Without this feature, how does the current superbenchmark work:
The current RDMA test is only capable of validating NIC #0.

Components that may involve changes:
https://github.com/microsoft/superbenchmark/blob/main/superbench/benchmarks/micro_benchmarks/ib_validation_performance/ib_validation_performance.cc

Brief description of your proposal if any:
Add support for a user-assigned NIC number and for concurrent NICs during the RDMA test.

Fail to run on Ubuntu1804 nvidia RTX2070

What's the issue, what's expected?:
Failed to run the benchmark, see the log.
The sb deploy log shows: "could not select device driver"
How to reproduce it?:

Ubuntu1804
kernel: 5.4.0-80-generic
python --version
Python 3.8.8

01:00.0 VGA compatible controller: NVIDIA Corporation Device 1e84 (rev a1)
	Subsystem: NVIDIA Corporation Device 139e
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

Tue Jul 27 23:01:58 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| 41%   35C    P8    14W / 215W |      1MiB /  7979MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Log message or snapshot?:
(see attached)

Additional information:

V0.3.0 Release Plan

Release Manager

@TobeyQin

Endgame

Code freeze: 9/1/2021
Bug Bash date: 9/2/2021
Release date: 9/17/2021

Main Features

SuperBench Framework

SB Runner -- @abuccts

  • MPI mode implementation
    PR: #146

SB Benchmarks -- @guoshzhao

Single-node Validation

Micro-benchmarks -- @guoshzhao @yukirora

    • Memory (Tool: Nvidia Bandwidth Test Tool) -- @yukirora ETA: 5/28/2021
      PR: #114
      Metrics Unit Description
      H2D_Mem_BW_<GPU ID> GB/s host-to-GPU bandwidth for each GPU
      D2H_Mem_BW_<GPU ID> GB/s GPU-to-host bandwidth for each GPU
    • Device P2P Bandwidth (Tool: Nvidia p2pBandwidthLatencyTest Tool) -- Delayed

      Metrics Unit Description
      P2P_BW_Max GB/s The maximum bandwidth in Bidirectional P2P=Enabled Bandwidth Matrix for all GPUs
      P2P_BW_Min GB/s The minimum bandwidth
      P2P_BW_Avg GB/s The average bandwidth
    • IBLoopback (Tool: PerfTest – Standard RDMA Test Tool) -- @yukirora ETA: 7/30/2021
      PR: #112 and #129
      Metrics Unit Description
      IB_Write MB/s The IB write loopback throughput with different message size
      IB_Read MB/s The IB read loopback throughput with different message size
      IB_Send MB/s The IB send loopback throughput with different message size
    • NCCL (Tool: Nvidia NCCL Test) -- @yukirora ETA: 7/30/2021
      PR: #113 and #128
      Metrics Unit Description
      NCCL_AllReduce GB/s The NCCL AllReduce performance with different message size
      NCCL_AllGather GB/s The NCCL AllGather performance with different message size
      NCCL_broadcast GB/s The NCCL Broadcast performance with different message size
      NCCL_reduce GB/s The NCCL Reduce performance with different message size
      NCCL_reduce_scatter GB/s The NCCL ReduceScatter performance with different message size
    • Disk (Tool: FIO – Standard Disk Performance Tool) -- @yzygitzh ETA: 7/30/2021
      PR: #127 and #132 and #161
      Metrics Unit Description
      Seq_Read MB/s Sequential read performance
      Seq_Write MB/s Sequential write performance
      Rand_Read MB/s Random read performance
      Rand_Write MB/s Random write performance
      Seq_R/W_Read MB/s Read performance in sequential read/write, fixed measurement (read:write = 4:1)
      Seq_R/W_Write MB/s Write performance in sequential read/write (read:write = 4:1)
      Rand_R/W_Read MB/s Read performance in random read/write (read:write = 4:1)
      Rand_R/W_Write MB/s Write performance in random read/write (read:write = 4:1)
    • H2D/D2H SM Transmission Bandwidth (Tool: MSR-A build) -- @yzygitzh ETA: 8/6/2021
      PR: #162 and #169
      Metrics Unit Description
      H2D_SM_BW_<GPU ID> GB/s host-to-GPU bandwidth using GPU kernel for each GPU
      D2H_SM_BW_<GPU ID> GB/s GPU-to-host bandwidth using GPU kernel for each GPU

Support AMD

Docker Image Support -- @guoshzhao ETA: 7/16/2021

  • ROCm 4.2 PyTorch 1.7 PR: #164
  • ROCm 4.0 PyTorch 1.7 PR: #164

Micro Benchmarks

    • Kernel Launch (Tool: MSR-A build) -- @yukirora ETA: 7/30/2021
      PR: #137 and #136
      Metrics Unit Description
      Kernel_Launch_Event_Time Time (ms) Dispatch latency measured in GPU time using hipEventRecord()
      Kernel_Launch_Wall_Time Time (ms) Dispatch latency measured in CPU time
    • RCCL (Tool: AMD RCCL Test) -- @yukirora ETA: 7/30/2021
      PR: #139 and #143
      Metrics Unit Description
      RCCL_AllReduce GB/s The RCCL AllReduce performance with different message size
      RCCL_AllGather GB/s The RCCL AllGather performance with different message size
      RCCL_broadcast GB/s The RCCL Broadcast performance with different message size
      RCCL_reduce GB/s The RCCL Reduce performance with different message size
      RCCL_reduce_scatter GB/s The RCCL ReduceScatter performance with different message size
    • GEMM FLOPS (Tool: AMD rocblas-bench Tool) -- @yukirora ETA: 8/27/2021
      PR: #144 and #165
      Metrics Unit Description
      FP64 GFLOPS FP64 FLOPS without MatrixCore
      FP32 GFLOPS FP32 FLOPS without MatrixCore
      FP16 GFLOPS FP16 FLOPS without MatrixCore
      FP32(MC) GFLOPS TF32 FLOPS with MatrixCore
      FP16(MC) GFLOPS FP16 FLOPS with MatrixCore
      BF16(MC) GFLOPS BF16 FLOPS with MatrixCore
      INT8(MC) GOPS INT8 FLOPS with MatrixCore
      INT4(MC) GOPS INT4 FLOPS with MatrixCore
    • Memory (Tool: HIP Bandwidth Test Tool) -- @yukirora ETA: 8/27/2021
      PR: #159 and #153
      Metrics Unit Description
      H2D_Mem_BW_<GPU ID> GB/s host-to-GPU bandwidth for each GPU
      D2H_Mem_BW_<GPU ID> GB/s GPU-to-host bandwidth for each GPU

E2E Benchmarks -- @guoshzhao ETA: 7/16/2021

    • CNN models -- Use PyTorch TORCHVISION.MODELS sub-package
      • ResNet: ResNet-50, ResNet-101, ResNet-152
      • DenseNet: DenseNet-169, DenseNet-201
      • VGG: VGG-11, VGG-13, VGG-16, VGG-19
    • BERT -- Use huggingface Transformers
      • BERT
      • BERT LARGE
    • LSTM -- Use PyTorch TORCH.NN sub-package
    • GPT-2 -- Use huggingface Transformers

Result Summary -- @cp5555

  • Generate a report to summarize the results -- @guoshzhao ETA: 7/30/2021
    PR: #147, #149, and #157
  • Support basic analysis feature (boxplot figure, outlier detection, etc.)

Bug Fix

  • VGG models failed on A100 GPU with batch_size=128 #115
    PR: #134

Other Improvement

  1. Contribution related -- @lynex

    • Contribute rule (#131)
    • system information collection (#160)
  2. Document -- @TobeyQin

    • Add release process doc (#130)
    • Add design documents (#125)
    • Add developer guide doc for coding style (#155)
    • Add contribution rules (#131)
    • Add docker image list (#154)
    • Add initial validation results
    • Add metric reasoning doc -- @cp5555 @guoshzhao
  3. Process monitor

    • Add Heart beat to monitor process health
    • Auto kill all processes on all nodes
  4. Coding style -- @abuccts

    • Add vscode online

Backlogs

Multi-Node Benchmarks

  • Mellanox ClusterKit
  • GPCNeT

UI Design

VGG models failed on A100 GPU with batch_size=128

What's the issue, what's expected?:

VGG models failed on A100 GPU with batch_size=128, reporting an NCCL error or OS crash. All the commands are run in a Python 3.8 venv.

How to reproduce it?:

  1. copy default.yaml to the current working directory (root directory of the superbenchmark repo) and name it "config.yaml"
  2. change the parameter in config.yaml, setting the vgg_models' batch_size=128
  3. set "enable: vgg_models" to run VGG models only
  4. run "sb run -f ./host.ini -c config.yaml"
  5. the process fails with an NCCL error

Log message or snapshot?:

(screenshot attached)

Additional information:

NVIDIA Driver version: 460.39
Python version: 3.8
PyTorch version: 1.8

pytorch cannot find libopen-orted-mpir.so

What's the issue, what's expected?:
pytorch cannot find libopen-orted-mpir.so

Log message or snapshot?:

ERROR: libopen-orted-mpir.so: cannot open shared object file: No such file or directory
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/knack/cli.py", line 233, in invoke
cmd_result = self.invocation.execute(args)
File "/usr/local/lib/python3.8/dist-packages/knack/invocation.py", line 224, in execute
cmd_result = parsed_args.func(params)
File "/usr/local/lib/python3.8/dist-packages/knack/commands.py", line 146, in call
return self.handler(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/knack/commands.py", line 253, in _command_handler
result = op(client, **command_args) if client else op(**command_args)
File "/usr/local/lib/python3.8/dist-packages/superbench/cli/_handler.py", line 208, in exec_command_handler
executor.exec()
File "/usr/local/lib/python3.8/dist-packages/superbench/executor/executor.py", line 247, in exec
context = BenchmarkRegistry.create_benchmark_context(
File "/usr/local/lib/python3.8/dist-packages/superbench/common/utils/lazy_import.py", line 42, in getattr
self._import()
File "/usr/local/lib/python3.8/dist-packages/superbench/common/utils/lazy_import.py", line 31, in _import
self._callback()
File "/usr/local/lib/python3.8/dist-packages/superbench/benchmarks/init.py", line 15, in
'superbench.benchmarks.registry', 'BenchmarkRegistry', lambda: list(
File "/usr/lib/python3.8/importlib/init.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1014, in _gcd_import
File "", line 991, in _find_and_load
File "", line 975, in _find_and_load_unlocked
File "", line 671, in _load_unlocked
File "", line 848, in exec_module
File "", line 219, in _call_with_frames_removed
File "/usr/local/lib/python3.8/dist-packages/superbench/benchmarks/model_benchmarks/init.py", line 7, in
from superbench.benchmarks.model_benchmarks.pytorch_bert import PytorchBERT
File "/usr/local/lib/python3.8/dist-packages/superbench/benchmarks/model_benchmarks/pytorch_bert.py", line 6, in
import torch
File "/usr/local/lib/python3.8/dist-packages/torch/init.py", line 191, in
_load_global_deps()
File "/usr/local/lib/python3.8/dist-packages/torch/init.py", line 153, in _load_global_deps
ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
File "/usr/lib/python3.8/ctypes/init.py", line 373, in init
self._handle = _dlopen(self._name, mode)
OSError: libopen-orted-mpir.so: cannot open shared object file: No such file or directory
non-zero return code

Additional information:
Ubuntu 20.04, Python 3.8, OpenMPI 4.04.
I have libopen-orted-mpir.so in /usr/lib/x86_64-linux-gnu/openmpi/lib/libopen-orted-mpir.so
and I have added that path in ~/.bashrc.
I have tried asking ChatGPT-4, and it cannot solve this issue.
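
A general note (not from the original report): the dynamic loader resolves shared libraries through LD_LIBRARY_PATH or the ldconfig cache, not through PATH, so the ~/.bashrc entry would need to be along the lines of

export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/openmpi/lib:$LD_LIBRARY_PATH

and the shell re-sourced (or the session restarted) before retrying.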

V0.4.0 Test Plan


Test Table

| Machine Type | #Node * #GPU * GPU Type | PyTorch Version | Accelerated Computing Toolkit | Status |
| --- | --- | --- | --- | --- |
| ND A100 v4 | 1 * 8 * A100 40GB SXM | PyTorch 1.8 | CUDA 11.1 | Succeeded |
| NDm A100 v4 | 1 * 8 * A100 80GB SXM | PyTorch 1.8 | CUDA 11.1 | Done |
| ND A100 v4 | 2 * 8 * A100 40GB SXM | PyTorch 1.8 | CUDA 11.1 | Succeeded |
| HPE | 1 * 8 * MI100 | PyTorch 1.7 | ROCm 4.2 | Succeeded |
| HPE | 2 * 8 * MI100 | PyTorch 1.7 | ROCm 4.2 | Succeeded |
| NC64as_T4_v3 | 4 * T4 | PyTorch 1.8 | CUDA 11.1 | Done |

Test Cases

Benchmarks Test

  1. Model Benchmarks
  • ORT model on AMD
  • Support FP32 mode without TF32
  2. Micro Benchmarks
  • GPU Memory Validation
  • GPU Copy Bandwidth
  • ORT inference on Nvidia
  • TensorRT inference on Nvidia
  3. Micro Benchmarks (distributed)
  • IB validation
  • TCP validation
  • GPCNet validation
  • NCCL/RCCL

Tools

    • Monitor
    • Data diagnosis

Other Features

    • Docker Images Check

docker image /root/hostfile cannot be updated with redeploy

This is using superbench v0.6.0-rc1-cuda11.1.1.

After I deployed and ran some tests on a set of nodes (e.g. n1,n2), I wanted to change the nodes to (n1,n3), so I removed the running containers and re-ran deploy with (n1,n3). However, the /root/hostfile inside the docker image is still the old (n1,n2).

Expected: there should be an easy way to switch to the new (n1,n3), either using sb deploy or sb run.

[Enhancement] maybe algo argument can be omitted in cudnn-function?

When using the SuperBench cudnn-function micro-benchmark to benchmark conv2d, it's hard to choose the algo argument in the cudnn-function config, because different shapes reach their best performance with different algos. I'm currently traversing all seven algos to test the conv2d benchmark, but it really takes a lot of time.

# Assumes BenchmarkRegistry/Platform are importable as in superbench, and that
# algos, conv_info, input_strides, output_strides, pad_top, pad_left are defined.
from superbench.benchmarks import BenchmarkRegistry, Platform

for algo in algos:
    # Build the cudnnConvolutionForward config JSON for this algo.
    custom_config_str = '{' + '"algo":{2},"arrayLength":2,"convType":0,"dilationA":[{0},{0}],"filterStrideA":[{1},{1}],'.format(conv_info['D'], conv_info['S'], algo) \
        + '"filterDims":[{0},{1},{2},{2}],"inputDims":[{0},{3},{4},{5}],"inputStride":[{6},{7},{8},{9}],"inputType":0,'.format(conv_info['N'], conv_info['F'], conv_info['K'], conv_info['C'], conv_info['H'], conv_info['W'], input_strides[0], input_strides[1], input_strides[2], input_strides[3]) \
        + '"mode":1,"name":"cudnnConvolutionForward","outputDims":[{0},{1},{2},{3}],'.format(conv_info['N'], conv_info['F'], conv_info['HO'], conv_info['WO']) \
        + '"outputStride":[{0},{1},{2},{3}],"padA":[{4},{4}],"tensorOp":false'.format(output_strides[0], output_strides[1], output_strides[2], output_strides[3], pad_top, pad_left) \
        + '}'
    parameters = '--num_warmup 8 --num_steps 100 --num_in_step 1000 --config_json_str {0}'.format(custom_config_str)
    context = BenchmarkRegistry.create_benchmark_context(
        'cudnn-function', platform=Platform.CUDA, parameters=parameters
    )

From my point of view, we can omit the algo argument because the cuDNN library already provides functions to automatically choose the best algo:

cudnnFindConvolutionForwardAlgorithm
cudnnFindConvolutionBackwardDataAlgorithm
cudnnFindConvolutionBackwardFilterAlgorithm

V0.5.0 Test Plan


Test Table

| Machine Type | #Node * #GPU * GPU Type | PyTorch Version | Accelerated Computing Toolkit | Status |
| --- | --- | --- | --- | --- |
| ND A100 v4 | 1 * 8 * A100 40GB SXM | PyTorch 1.8 | CUDA 11.1 | |
| NDm A100 v4 | 1 * 8 * A100 80GB SXM | PyTorch 1.8 | CUDA 11.1 | |
| NDm A100 v4 | 2 * 8 * A100 80GB SXM | PyTorch 1.8 | CUDA 11.1 | |
| Hayabusa | 1 * 16 * MI200 | PyTorch 1.9 | ROCm 5.0 | |
| Hayabusa | 2 * 16 * MI200 | PyTorch 1.9 | ROCm 5.0 | |
| NC96_v4 | 1 * 4 * A100 PCIe | PyTorch 1.8 | CUDA 11.1 | |

Test Cases

Micro-benchmark Test

  1. Nccl-tests (NVIDIA only)
    • GDR-only nccl-tests
  2. Rccl-test (ROCm only)
    • Update submodule to fix divide-by-zero error
  3. GPU-copy
  • Support bi-directional bandwidth benchmark
  • Support data checking and make it optional
  4. GEMM benchmark (NVIDIA only)
  • Support T4 and A10 in GEMM benchmark
  5. GPU-burn benchmark (NVIDIA only)

Model-benchmark Test

  1. Pytorch models
  • Sync results on root rank for e2e model benchmarks in distributed mode
  • Support customized env in local and torch.distributed mode
  • Add support for pytorch>=1.9.0
  • Keep BatchNorm as fp32 for pytorch cnn models cast to fp16
  • Remove FP16 samples type converting time
  2. Support FAMBench

Inference Benchmark Improvement

  • Add percentile metrics for ort and pytorch inference benchmarks
  • Add configuration with inference benchmark

SuperBench Improvement

  • Add command to support listing all optional parameters for benchmarks.
  • Unify benchmark naming convention and support multiple tests with same benchmark and different parameters/options in one configuration file
  • Support timeout to detect the benchmark failure and stop the process automatically
  • Improve Output Interface

Tools

  1. data diagnosis
    • Support multi-benchmark check
    • Support output in excel and html format
    • Support result output for all nodes in data diagnosis
  2. Support result summary in Excel, MD, and HTML format

/sys/fs/cgroup/cpuacct/cpuacct missing causing superbench failures

Docker container: nvidia/cuda:11.6.1-cudnn8-devel-ubuntu20.04
GPU 0: NVIDIA A100 80GB PCIe

[2022-11-23T15:02:17.260Z] Running model...

[2022-11-23T15:02:17.260Z] > docker exec dd7780c3a5f9 bash -c "cd superbenchmark && bash run.sh atoa_small_hayabusa.yaml atoa_small_hayabusa_performance.csv"

[2022-11-23T15:18:24.479Z] NVIDIA GPU detected.

[2022-11-23T15:18:24.479Z] sb exec --config-file   atoa_small_ndv4.yaml    2>&1 | tee log.txt

[2022-11-23T15:18:24.479Z] [2022-11-23 15:02:18,200 rocm-framework-a100-1:400][executor.py:224][INFO] Executor is going to execute gpt_models/pytorch-gpt2-small.

[2022-11-23T15:18:24.480Z] [2022-11-23 15:02:18,202 rocm-framework-a100-1:466][monitor.py:100][INFO] Start monitoring.

[2022-11-23T15:18:24.480Z] [2022-11-23 15:02:18,203 rocm-framework-a100-1:466][monitor.py:226][ERROR] Failed to read process cpu ticks information - error message: [Errno 2] No such file or directory: '/sys/fs/cgroup/cpuacct/cpuacct.stat'

[2022-11-23T15:18:24.480Z] [2022-11-23 15:02:19,205 rocm-framework-a100-1:466][monitor.py:226][ERROR] Failed to read process cpu ticks information - error message: [Errno 2] No such file or directory: '/sys/fs/cgroup/cpuacct/cpuacct.stat'

[2022-11-23T15:18:24.480Z] [2022-11-23 15:02:19,206 rocm-framework-a100-1:466][monitor.py:105][ERROR] Failed to launch the monitor process - error message: unsupported operand type(s) for -: 'NoneType' and 'NoneType'

[2022-11-23T15:18:24.480Z] Process Monitor-1:

[2022-11-23T15:18:24.480Z] Traceback (most recent call last):

[2022-11-23T15:18:24.480Z]   File "/usr/local/lib/python3.8/dist-packages/superbench/monitor/monitor.py", line 102, in run

[2022-11-23T15:18:24.480Z]     self.__sample()

[2022-11-23T15:18:24.480Z]   File "/usr/local/lib/python3.8/dist-packages/superbench/monitor/monitor.py", line 126, in __sample

[2022-11-23T15:18:24.480Z]     self.__sample_host_metrics(record)

[2022-11-23T15:18:24.480Z]   File "/usr/local/lib/python3.8/dist-packages/superbench/monitor/monitor.py", line 152, in __sample_host_metrics

[2022-11-23T15:18:24.480Z]     cpu_usage = (container_ticks_e -

[2022-11-23T15:18:24.480Z] TypeError: unsupported operand type(s) for -: 'NoneType' and 'NoneType'

[2022-11-23T15:18:24.480Z] 

[2022-11-23T15:18:24.480Z] During handling of the above exception, another exception occurred:

[2022-11-23T15:18:24.480Z] 

[2022-11-23T15:18:24.480Z] Traceback (most recent call last):

[2022-11-23T15:18:24.480Z]   File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap

[2022-11-23T15:18:24.480Z]     self.run()

[2022-11-23T15:18:24.480Z]   File "/usr/local/lib/python3.8/dist-packages/superbench/monitor/monitor.py", line 106, in run

[2022-11-23T15:18:24.480Z]     self.stop()

[2022-11-23T15:18:24.480Z]   File "/usr/local/lib/python3.8/dist-packages/superbench/monitor/monitor.py", line 117, in stop

[2022-11-23T15:18:24.480Z]     self.join()

[2022-11-23T15:18:24.480Z]   File "/usr/lib/python3.8/multiprocessing/process.py", line 147, in join

[2022-11-23T15:18:24.480Z]     assert self._parent_pid == os.getpid(), 'can only join a child process'

[2022-11-23T15:18:24.480Z] AssertionError: can only join a child process

self._cpu_file = '/sys/fs/cgroup/cpuacct/cpuacct.stat'
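
One possible direction (a sketch only, assuming the monitor reads this file directly; not the actual monitor.py fix) is to guard the cgroup read so a missing cpuacct file yields None instead of crashing the monitor process:

import os

def read_container_cpu_ticks(cpu_file='/sys/fs/cgroup/cpuacct/cpuacct.stat'):
    # Return total CPU ticks from the cgroup v1 cpuacct controller, or None
    # when the file does not exist (e.g. on hosts that only mount cgroup v2).
    if not os.path.exists(cpu_file):
        return None
    total = 0
    with open(cpu_file) as f:
        for line in f:
            # each line looks like 'user 12345' or 'system 6789'
            _, value = line.split()
            total += int(value)
    return total

The caller would then skip the cpu_usage computation for a sample whenever either tick value is None, instead of subtracting two None values.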

[Suggestion] Why tensorrt backend uses trtexec instead of tensorrt python interface?

From my point of view, with the Python interface we can insert cudaProfilerStart() and cudaProfilerStop() to profile our program more precisely. If we use trtexec, SuperBench starts another process to execute it and nvprof cannot correctly profile the real command; moreover, profiling trtexec directly will capture both the engine-build (compilation) phase and the runtime phase, while in most cases we only need the latter.

tensorrt python interface example:

import tensorrt as trt
import common
import time
import pycuda.driver as cuda
import torch
import os

TRT_LOGGER = trt.Logger()


def inference(context, test_data):
    inputs, outputs, bindings, stream = common.allocate_buffers(context.engine)
    result = []
    inputs[0].host = test_data

    _, elapsed_time = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)

    return result, elapsed_time

# This function builds an engine from a Onnx model.
def build_engine(model_file, batch_size=32):
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(common.EXPLICIT_BATCH) as network, trt.OnnxParser(network, TRT_LOGGER) as parser, builder.create_builder_config() as trt_config:

        # Note: max_batch_size should be set to 1 because of the implementation of allocate_buffers
        builder.max_batch_size = 1
        # builder.max_workspace_size = common.GiB(1)
        trt_config.max_workspace_size = common.GiB(4)

        
        # Parse onnx model
        with open(model_file, 'rb') as model:
            if not parser.parse(model.read()):
                print ('ERROR: Failed to parse the ONNX file.')
                for error in range(parser.num_errors):
                    print (parser.get_error(error))
                return None


        # This design may not be correct if output more than one
        """
        for i in range(network.num_layers):
            layer = network.get_layer(i)
            layer.precision = trt.int8
            layer.set_output_type(0, trt.int8)
        """


        # network.mark_output(model_tensors.find(ModelData.OUTPUT_NAME))
        # Build engine and do int8 calibration.
        # engine = builder.build_cuda_engine(network)
        engine = builder.build_engine(network, trt_config)
        return engine

onnx_path = "/workspace/v-leiwang3/benchmark/nnfusion_models/resnet50.float32.1.onnx"
dummy_input = torch.rand(1, 3, 224, 224).numpy()

engine = build_engine(onnx_path)
context = engine.create_execution_context()

# warmup
for i in range(5):
    _, time = inference(context, dummy_input)

# iteration
time_set = []
for i in range(100):
    _, time = inference(context, dummy_input)
    time_set.append(time)

print(f'average time: {sum(time_set)/len(time_set)* 1000} ms')

V0.2.0 Release Plan

Release Manager

@TobeyQin

Endgame

Feature freeze: May. 10
Code freeze: May. 28
Demo date: May. 28
Bug Bash date: May. 28
Release date: Jun. 4

Main Features

SuperBench Framework Implementation

SB Benchmarks Implementation -- @guoshzhao

    • Design Doc
  1. SB Benchmark Base
    • Benchmark Base
    • Model Base
    • Microbenchmark Base
  2. Environment Build Pipeline ETA:
    • Design Doc
    • Implementation

SB Agent Implementation -- @abuccts

    • Design Doc
  1. SB CLI
    • SB CLI Implementation
    • Integration with SB Runner
    • Integration with SB Executor
  2. SB Executor
    • SB Executor Implementation
  3. SB Runner ETA:
    • SB Runner Implementation
    • Integration with SB Executor

Benchmark Tasks

E2E Benchmarks (including metrics: _float, _half, _float_throughput, _half_throughput) -- @guoshzhao

  • CNN models -- Use PyTorch TORCHVISION.MODELS sub-package
  • ResNet: ResNet-50, ResNet-101, ResNet-152
  • DenseNet: DenseNet-169, DenseNet-201 ​
  • VGG: VGG-11, VGG-13, VGG-16, VGG-19​
  • BERT -- Use huggingface Transformers
  • BERT
  • BERT LARGE
  • LSTM -- Use PyTorch TORCH.NN sub-package
  • GPT-2 -- Use huggingface Transformers

Micro Benchmarks

    • GEMM FLOPS (Tool: Nvidia Cutlass Tool) -- @guoshzhao ETA: May 21
    Metrics Unit Description
    FP64 GFLOPS FP64 FLOPS without TensorCore
    FP32 GFLOPS FP32 FLOPS without TensorCore
    FP16 GFLOPS FP16 FLOPS without TensorCore
    FP64(TC) GFLOPS FP64 FLOPS with TensorCore
    TF32(TC) GFLOPS TF32 FLOPS with TensorCore
    FP16(TC) GFLOPS FP16 FLOPS with TensorCore
    BF16(TC) GFLOPS BF16 FLOPS with TensorCore
    INT8(TC) GOPS INT8 FLOPS with TensorCore
    INT4(TC) GOPS INT4 FLOPS with TensorCore
    • KernelLaunch (Tool: MSR-A build) -- @guoshzhao ETA: May 21
    Metrics Unit Description
    Kernel_Launch_Event_Time Time (ms) Dispatch latency measured in GPU time using cudaEventRecord()/hipEventRecord()
    Kernel_Launch_Wall_Time Time (ms) Dispatch latency measured in CPU time
    • Kernel (Tool: MSR-A build) -- @yukirora ETA: May 21
    Metrics Unit Description
    cublasSgemm Time (ms) Cublas Kernel Process Time for cublasSgemm
    cublasSgemmStridedBatched Time (ms) Time for cublasSgemmStridedBatched
    cublasGemmStridedBatchedEx Time (ms) Time for cublasGemmStridedBatchedEx
    cublasGemmEx Time (ms) Time for cublasGemmEx
    cublasCgemm3mStridedBatched Time (ms) Time for cublasCgemm3mStridedBatched
    cublasCgemm Time (ms) Time for cublasCgemm
    Mul_During_NCCL Time (ms) Time for Mul_During_NCCL
    MatMul_During_NCCL Time (ms) Time for MatMul_During_NCCL
    MM_AllReduce_Opsharding Time (ms) Time for MM_AllReduce_Opsharding
    MM_AllGather_Concat_Opsharing Time (ms) Time for MM_AllGather_Concat_Opsharing

Document refine -- @TobeyQin ETA: May 28

  1. Update README file
  • Add SuperBenchmark architecture and refine the goals
  • Define results format

Test Plan

Test Overall Pipeline

  • Set Environment and SuperBench Installation Test
  • Run through the entire benchmarking process
  • CLI commands execute test
  • E2E model benchmark test
  • Micro benchmark test

Utils

  1. Unit-Test platform
  • Add pipelines for CPU/GPU tests

V0.6.0 Release Plan

Release Manager

@cp5555

Endgame

  • Code freeze: August 22nd
  • Bug Bash date: August 22nd
  • Release date: September 4th

Main Features

SuperBench Improvement

    • Support running on host directly without Docker (#358, #362)
    • Support running sb command inside docker image (#356)
    • Support ROCm 5.1.3 (#361)
    • Fix bugs in data diagnosis (#355)
    • Fix cmake and build issues (#360)
    • Support automatic configuration yaml selection on Azure VM (#365)
    • Refine error message when GPU is not detected. (#368)
    • Add return code for Timeout (#383)
    • Update Dockerfile for NCCL/RCCL version, tag name, and verbose output. (#371)
    • Support node_num=1 in mpi mode (#372)
    • Update Python setup for require packages (#387)
    • Enhance parameter parsing to allow spaces in value (#397)
    • Support NO_COLOR for SuperBench output (#404)

Micro-benchmark Improvement

    • Fix issues in ib loopback benchmark (#369)
    • Fix stability issue in ib loopback benchmark (#386)

Distributed Benchmark Improvement

    • Pair-wise IB benchmark (#363)
    • Topology-aware IB benchmark (#373, #381)

Data Diagnosis & Analysis

    • Add failure check function in data_diagnosis.py (#378)
    • Support Json and Jsonl in Diagnosis. (#388)
    • Add support to store values of metrics in data diagnosis. (#392, #399)
    • Support exit code of sb result diagnosis (#403)
    • Format int type and unify empty value to N/A in diagnosis output files (#406)

Backlog

Inference Benchmark Improvement

  1. Support VGG, LSTM, and GPT-2 small in TensorRT Inference Backend
  2. Support VGG, LSTM, and GPT-2 small in ORT Inference Backend
  3. Support more TensorRT parameters (Related to #366)

Data Diagnosis & Analysis

  1. Support boxplot and outlier analysis

Document

  1. Metric Reasoning Doc

Run benchmark failed (superbenchmark-0.8.0)

What's the issue, what's expected?:

PLAY RECAP *********************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
[2023-04-14 19:17:19,195 u22:21920][ansible.py:79][INFO] Run succeed, return code 0.
[2023-04-14 19:17:19,199 u22:21920][runner.py:275][ERROR] Invalid content in JSON file: /home/edison/Downloads/superbenchmark-0.8.0/outputs/2023-04-14_19-16-21/nodes/u22/benchmarks/cublas-function/rank0/results.json
[2023-04-14 19:17:19,199 u22:21920][runner.py:275][ERROR] Invalid content in JSON file: /home/edison/Downloads/superbenchmark-0.8.0/outputs/2023-04-14_19-16-21/nodes/u22/benchmarks/cudnn-function/rank0/results.json
[2023-04-14 19:17:19,199 u22:21920][runner.py:275][ERROR] Invalid content in JSON file: /home/edison/Downloads/superbenchmark-0.8.0/outputs/2023-04-14_19-16-21/nodes/u22/benchmarks/gemm-flops/rank0/results.json
[2023-04-14 19:17:19,199 u22:21920][runner.py:275][ERROR] Invalid content in JSON file: /home/edison/Downloads/superbenchmark-0.8.0/outputs/2023-04-14_19-16-21/nodes/u22/benchmarks/gpu-burn/rank0/results.json
[2023-04-14 19:17:19,199 u22:21920][runner.py:275][ERROR] Invalid content in JSON file: /home/edison/Downloads/superbenchmark-0.8.0/outputs/2023-04-14_19-16-21/nodes/u22/benchmarks/mem-bw/rank0/results.json
[2023-04-14 19:17:19,199 u22:21920][runner.py:275][ERROR] Invalid content in JSON file: /home/edison/Downloads/superbenchmark-0.8.0/outputs/2023-04-14_19-16-21/nodes/u22/benchmarks/nccl-bw:default/rank0/results.json
[2023-04-14 19:17:19,199 u22:21920][runner.py:275][ERROR] Invalid content in JSON file: /home/edison/Downloads/superbenchmark-0.8.0/outputs/2023-04-14_19-16-21/nodes/u22/benchmarks/nccl-bw:gdr-only/rank0/results.json
[2023-04-14 19:17:19,199 u22:21920][runner.py:275][ERROR] Invalid content in JSON file: /home/edison/Downloads/superbenchmark-0.8.0/outputs/2023-04-14_19-16-21/nodes/u22/benchmarks/ort-inference/rank0/results.json
[2023-04-14 19:17:19,199 u22:21920][runner.py:275][ERROR] Invalid content in JSON file: /home/edison/Downloads/superbenchmark-0.8.0/outputs/2023-04-14_19-16-21/nodes/u22/benchm

How to reproduce it?:
OS: ubuntu 22.04.02
GPU: GeForce RTX 3060 x1

wget https://github.com/microsoft/superbenchmark/archive/refs/tags/v0.8.0.tar.gz
tar xf v0.8.0.tar.gz
cd superbenchmark-0.8.0/
python3 -m venv --system-site-packages ./venv
source ./venv/bin/activate
python3 -m pip install .
python3 -m pip install --upgrade pip setuptools==65.7
make postinstall
cp superbench/config/default.yaml sb.yaml # and change the proc_num: 8 to proc_num: 1
nano local.ini
set +H
sb deploy -f local.ini --host-password=mysshpassword

docker images # check docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
superbench/superbench latest 36fe2cd49200 2 hours ago 19.5GB

docker run -it --rm --gpus all -e NVIDIA_VISIBLE_DEVICES=0 --shm-size=1g --ulimit memlock=-1 superbench/superbench
nvidia-smi #it works.
exit

sb run -f local.ini -c sb.yaml --host-password=mysshpassword

Log message or snapshot?:

see attached

Additional information:
2023-04-14_19-16-21.tar.gz

V0.8.0 Release Plan

Release Manager

@cp5555

Endgame

  • Code freeze: March 28th, 2023
  • Bug Bash date: March 29th, 2023
  • Release date: April 7th, 2023

Main Features

SuperBench Improvement

    • Support SuperBench Executor running on Windows (#475)
    • Remove fixed rccl version in rocm5.1.x docker file (#476)
    • Upgrade networkx version to fix installation compatibility issue (#478)
    • Pin setuptools version to v65.7.0 (#483)
    • Limit ansible_runner version for Python3.6 (#485)
    • Support cgroup V2 when read system metrics in Monitor (#491, #502)
    • Fix analyzer bug in python3.8 due to pandas api change (#504)
    • Collect real-time GPU power in Monitor (#507)
    • Remove unreachable condition when write host list (#512)
    • Update to cuda12.1, nccl 2.17.1, hpcx 2.14, and mlc 3.10 (#513)
    • Fix wrong unit of cpu-memory-bw-latency in doc (#515)

Micro-benchmark Improvement

    • Add STREAM benchmark for sustainable memory bandwidth and the corresponding computation rate. (#473)
    • Add HPL Benchmark for HPC Linpack Benchmark. (#482)
    • Support flexible warmup and non-random data initialization in cublas-benchmark (#479)
    • Support error tolerance in micro-benchmark for CuDNN function (#490, #506)
    • Add distributed inference benchmark (#493 and #505)
    • Support tensor core precisions (e.g., FP8) and batch/shape range in cublaslt gemm (#492, #494, and #503)

Model Benchmark Improvement

    • Fix torch.dist init issue with multiple models (#495)
    • Support TE FP8 in BERT/GPT2 models (#496, #499)
    • Add num_workers configurable in model benchmark (#511)

V0.9.0 Release Plan

Release Manager

@cp5555

Endgame

  • Code freeze: July 5th, 2023
  • Bug Bash date: July 8th, 2023
  • Release date: July 19th, 2023

Main Features

SuperBench Improvement

    • Support Ctrl+C and interrupt to stop all SuperBench testing. (#530)
    • Support CPU docker (#480)
    • Support Windows Docker for VDI/Gaming GPU (#534)
    • Support DirectX for Nvidia and AMD GPU (#536)
    • Add System Config Info feature in SB runner. (#532)
    • Support DirectX test pipeline (#545)

Micro-benchmark Improvement

    • Add DirectXGPUCopyBw Benchmark to measure HtoD/DtoH bandwidth (#486 and #546)
    • Add DirectXGPUCoreFLops Benchmark to measure peak FLOPS (#488 and #542)
    • Add DirectXGPUMemBw Benchmark to measure GPU memory bandwidth (#487 and #547)
    • Add DirectXVCNEncodingLatency Benchmark to measure the VCN hardware encoding latency (#543 and #548)
    • Support best algorithm selection in cudnn-function. (Related to #384) (#540)
    • Revise step time collection in distributed inference benchmark (#524)

Model Benchmark Improvement

    • Fix early stop logic due to num_steps. (#522)
    • Support TensorRT models on Nvidia H100 (#541)

Documentation

    • Improve documentation for System Config Info. (#532)
    • Update outdate references (#539)
    • Update outdate references in micro-benchmarks.md (#544)

Backlog

Micro-benchmark Improvement

  1. Add HPL random generator to gemm-flops with ROCm (Related to #518)
  2. Support Monitoring for AMD GPUs
  3. Support cuDNN Backend API in cudnn-function.
  4. Add DirectXGPURenderFPS Benchmark to measure the FPS of rendering simple frames

Inference Benchmark Improvement

  1. Support VGG, LSTM, and GPT-2 small in TensorRT Inference Backend
  2. Support VGG, LSTM, and GPT-2 small in ORT Inference Backend
  3. Support more TensorRT parameters (Related to #366)

'sb result diagnosis' should exit with a valid error code when the input data format has an error

What's the issue, what's expected?:
This is superbench release/0.6. The 'sb result diagnosis' command does not exit with a proper error code when the input data file has the wrong format. For example, for an empty data file it detects the error correctly (as can be seen in stdout/stderr), but the exit code is still 0; see the repro steps.

While an empty file is a rare case, this problem reflects a general issue in superbench.

How to reproduce it?:
Create an empty data file. Create valid rule and baseline files. Then run the following (for privacy reasons, the timestamp and hostname were removed from the log):

$ sb result diagnosis --data-file outputs/b/results-summary.jsonl --rule-file rule1.y --baseline-file baseline1.json --output-file-format json --output-all --output-dir diag

...[file_handler.py:41][ERROR] Analyzer: invalid raw data fomat - 'node'
...[rule_base.py:106][ERROR] RuleBase: empty raw data
...[data_diagnosis.py:405][INFO] DataDiagnosis: Begin to process 0 nodes
...[data_diagnosis.py:111][ERROR] DataDiagnosis: get criteria failed
...[data_diagnosis.py:407][INFO] DataDiagnosis: Processed finished
...[data_diagnosis.py:428][INFO] DataDiagnosis: Output results to diag1/diagnosis_summary.json

$ echo $?
0

cublaslt_gemm microbenchmark fails when running with large matrix sizes.

When I run the cublaslt_gemm microbenchmark with B=64, M=8192, K=8192, N=8192, it fails in cudaMalloc.

I debugged this and found that the products B * M * K, B * M * N, and B * K * N are all more than 4 GB in BF16, so a 32-bit int data type cannot hold the result of the multiplication. I managed to make local changes to fix this, but once I get past cudaMalloc, the cublasCreate(&handle) call fails with CUBLAS_STATUS_NOT_INITIALIZED.
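
For reference, the arithmetic behind this report can be checked with a few lines of Python (assuming bf16, i.e. 2 bytes per element): with B=64 and M=K=N=8192 each buffer holds 2^32 elements, which already exceeds what a signed 32-bit int can represent.

INT32_MAX = 2**31 - 1
B, M, K, N = 64, 8192, 8192, 8192
bytes_per_elem = 2  # bf16

for name, elems in (('A = B*M*K', B * M * K),
                    ('B = B*K*N', B * K * N),
                    ('C = B*M*N', B * M * N)):
    size_gib = elems * bytes_per_elem / 2**30
    print(f'{name}: {elems} elements, {size_gib:.0f} GiB, overflows int32: {elems > INT32_MAX}')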

These are the steps to reproduce the error:
cd superbench/benchmarks/microbencmarks/cublaslt_gemm
cmake -S ./
make
./cublaslt_gemm -b 64 -m 8192 -k 8192 -n 8192 -i 1000 -t bf16

docker image does not run

Ubuntu 18.04. After downloading the image, it fails to run. nvidia-docker is installed, but the container erroneously states that it is not installed.


(venv) nonroot@nonroot-MS-7B22:~/superbenchmark$ sudo docker run superbench/superbench:v0.4.0-cuda11.1.1
Unable to find image 'superbench/superbench:v0.4.0-cuda11.1.1' locally
v0.4.0-cuda11.1.1: Pulling from superbench/superbench
6a5697faee43: Pulling fs layer 
ba13d3bc422b: Pulling fs layer 
a254829d9e55: Pulling fs layer 
ff2daf3cdab6: Pulling fs layer 
9867a212b99b: Pulling fs layer 
da2dc255298e: Pulling fs layer 
45c66138abc4: Pulling fs layer 
69f8f14337fe: Pulling fs layer 
ca6a80844c87: Pulling fs layer 
f1cef55f2f91: Pulling fs layer 
7da10256993e: Pulling fs layer 
6b2e44626eea: Pulling fs layer 
6e6939188865: Pulling fs layer 
e935c7b1a998: Pulling fs layer 
6ff8cf358a74: Pulling fs layer 
590ea530411f: Pulling fs layer 
e5e48f41197f: Pulling fs layer 
125a5da70c41: Pulling fs layer 
9279dc6b257d: Pulling fs layer 
f594d963eb87: Pulling fs layer 
1b18685e6f7a: Pulling fs layer 
9591a6fe4536: Pulling fs layer 
e935c7b1a998: Waiting 
bdb38838130b: Pulling fs layer 
f9b433e418df: Pulling fs layer 
1115e15a521a: Pulling fs layer 
10b0575de683: Pulling fs layer 
6ff8cf358a74: Waiting 
40ad6f0e66a8: Pulling fs layer 
922bdc233ecf: Pulling fs layer 
9867a212b99b: Waiting 
2fed69baa886: Pulling fs layer 
f594d963eb87: Waiting 
590ea530411f: Waiting 
1b18685e6f7a: Waiting 
da2dc255298e: Waiting 
e5e48f41197f: Waiting 
9591a6fe4536: Waiting 
24d6f5e10b64: Pulling fs layer 
28cb839879c0: Pulling fs layer 
125a5da70c41: Waiting 
45c66138abc4: Waiting 
9279dc6b257d: Waiting 
1a297942cbed: Pull complete 
7088b2887299: Pull complete 
0cd61a107eb7: Pull complete 
f4abf5809bbd: Pull complete 
5ce8cc51c2d6: Pull complete 
c19296d8b165: Pull complete 
65b415727830: Pull complete 
92ef4ed872d5: Pull complete 
8a6a6385784d: Pull complete 
7159e82f10c2: Pull complete 
8903926b1920: Pull complete 
b497b1efac36: Pull complete 
e0466816640f: Pull complete 
2992460da256: Pull complete 
a5ff18c9283b: Pull complete 
e039b86398d9: Pull complete 
802a59289df4: Pull complete 
f65af8b3e314: Pull complete 
828253d36d6c: Pull complete 
6543d2035b7e: Pull complete 
b16beecccd1d: Pull complete 
205f3c109cf1: Pull complete 
9fdd474bb4ec: Pull complete 
bde9957e1552: Pull complete 
c173907422c4: Pull complete 
ab107e9c96c0: Pull complete 
0b6e632691f0: Pull complete 
426d52c2e7a6: Pull complete 
Digest: sha256:80661452672edbd2017d36f8fc9033bb3083a32120f35efed1191339c6437482
Status: Downloaded newer image for superbench/superbench:v0.4.0-cuda11.1.1


=============
== PyTorch ==
=============

NVIDIA Release 20.12 (build 17950526)
PyTorch Version 1.8.0a0+1606899

Container image Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.

Copyright (c) 2014-2020 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

NVIDIA Deep Learning Profiler (dlprof) Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use 'nvidia-docker run' to start this container; see
   https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker .

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for PyTorch.  NVIDIA recommends the use of the following flags:
   nvidia-docker run --ipc=host ...
(venv) nonroot@nonroot-MS-7B22:~/superbenchmark$ which nvidia-docker
/usr/bin/nvidia-docker


Where is the cudnn_benchmark binary located?

# python3 examples/benchmarks/cudnn_function.py
[2022-07-31 05:32:08,788 ac5e130cece6:26637][micro_base.py:129][ERROR] The binary does not exist - benchmark: cudnn-function, binary name: cudnn_benchmark, binary directory: None.
[2022-07-31 05:32:08,788 ac5e130cece6:26637][cudnn_function.py:21][INFO] benchmark: cudnn-function, return code: 31, result: {'return_code': [31]}

'sb deploy' is expected to exit with a non-zero code when it fails

Summary

This issue was found using the v0.6.0 release. On the system, ansible was not set up properly because of a test environment issue. When I ran 'sb deploy -f local.ini -i superbench/superbench:v0.6.0-cuda11.1.1', it produced the error message below. Although the error message says ansible.py returned code 127, the sb program exited with 0.

[2022-09-09 18:49:10,573 N000000:30359][runner.py:43][INFO] Runner writes to: /home/aiscadmin/superbench/outputs/2022-09-09_18-49-10.
[2022-09-09 18:49:10,622 N000000:30359][runner.py:48][INFO] Runner will run: ['gpu-burn', 'nccl-bw:default', 'nccl-bw:gdr-only', 'ib-loopback', 'mem-bw', 'gpu-copy-bw:correctness', 'gpu-copy-bw:perf', 'kernel-launch', 'gemm-flops', 'cudnn-function', 'cublas-function', 'matmul', 'sharding-matmul', 'computation-communication-overlap', 'ort-inference', 'tensorrt-inference', 'gpt_models', 'bert_models', 'lstm_models', 'resnet_models', 'densenet_models', 'vgg_models']
[2022-09-09 18:49:10,622 N000000:30359][runner.py:165][INFO] Preparing SuperBench environment.
[2022-09-09 18:49:10,622 N000000:30359][ansible.py:125][INFO] Run playbook deploy.yaml ...
The command was not found or was not executable: ansible-playbook.
[2022-09-09 18:49:10,628 N000000:30359][ansible.py:80][WARNING] Run failed, return code 127.

$ echo $?
0
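
A minimal sketch of the expected behavior (an illustration only, not SuperBench's actual code): propagate a non-zero return code from the playbook run to the process exit code.

import subprocess
import sys

def deploy():
    # Hypothetical wrapper: run the deploy playbook and surface its return code.
    result = subprocess.run('ansible-playbook deploy.yaml', shell=True)
    if result.returncode != 0:
        print(f'Run failed, return code {result.returncode}.', file=sys.stderr)
    return result.returncode

if __name__ == '__main__':
    # A failed run (e.g. 127 when ansible-playbook is missing) becomes the process
    # exit code, so `echo $?` after `sb deploy` reflects the failure.
    sys.exit(deploy())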

How to repro

Set up superbench normally. Before running 'sb deploy', remove the ~/.ansible directory. Then run 'sb deploy' as above.

V0.7.0 Release Plan

Release Manager

@cp5555

Endgame

  • Code freeze: Jan. 3rd, 2023
  • Bug Bash date: Jan 13th, 2023
  • Release date: Jan 20th, 2023

Main Features

SuperBench Improvement

    • Support non-zero return code when “sb deploy” or “sb run” fails in Ansible (Related to #410 and #411) (#425)
    • Support log flushing to the result file during runtime (Related to #390) (#445)
    • Update version to include revision hash and date (#427)
    • Support 'pattern' in 'mpi' mode to run tasks in parallel (#430, #458)
    • Support topo-aware, all-pair, and K-batch pattern in 'mpi' mode (#437, #447)
    • Fix Transformers version to avoid Tensorrt failure (#441)
    • Add CUDA11.8 Docker image for Nvidia arch90 GPUs (#449)
    • Support sb deploy without docker pulling (#466)

Micro-benchmark Improvement

    • Support list of custom config string in cudnn-functions and cublas-functions (#414)
    • Support correctness check in cublas-functions (#450, #452)
    • Support GEMM-FLOPS for Nvidia arch90 GPUs (#456)
    • Add wait time option to resolve mem-bw unstable issue (#438)
    • Fix bug for incorrect datatype judgement in cublas-function source code. (#462)

Model-benchmark Improvement

    • Support FP8 in Bert model training (#446, #461)

Distributed Benchmark Improvement

    • Support pair-wise pattern in IB validation benchmark. (#453)
    • Support topo-aware, pair-wise, and K-batch pattern in nccl-bw benchmark. (#454)

Backlog

Inference Benchmark Improvement

  1. Support VGG, LSTM, and GPT-2 small in TensorRT Inference Backend
  2. Support VGG, LSTM, and GPT-2 small in ORT Inference Backend
  3. Support more TensorRT parameters (Related to #366)

Document

  1. Metric Reasoning Doc

V0.3.0 Test Plan


Test Table

Machine Type #Node * #GPU * GPU Type PyTorch Version Accelerated Computing Toolkit Status
NDv4 1 * 1 * A100 PyTorch 1.8 CUDA 11.1 Succeeded
NDv4 1 * 8 * A100 PyTorch 1.8 CUDA 11.1 Succeeded
NDv4 2 * 8 * A100 PyTorch 1.8 CUDA 11.1 Done
1 * 2 * V100 PyTorch 1.8 CUDA 11.1 Succeeded
HPE 1 * 8 * MI100 PyTorch 1.7 ROCm 4.2 Done

Test Cases

Benchmarks Test

  1. E2E Benchmarks
  • CNN models
  • BERT
  • LSTM
  • GPT-2
  2. Micro Benchmarks
  • Kernel Launch
  • GEMM FLOPS
  • Memory
  • NCCL/RCCL
  • IB
  • Disk

Other Features

    • Docker Images Check
    • Document Correctness Check

superbench runtime needs to flush log to the result file

When superbench (version 0.6.0-rc1) runs a test, the test output is kept in memory and not flushed to the output file. This makes it hard for users to track test progress, especially for interactive tests and long-running tests (>= 5 minutes).

This feature is especially important because sometimes a test may hang. Without output, it is hard to tell whether it has actually hung or not.

Expected: superbench should flush the outputs to the log file immediately.
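
A minimal sketch of the expected behavior, assuming the benchmark is launched as a subprocess (the helper name and arguments are illustrative, not SuperBench's actual API): stream the child's output to the log file line by line instead of collecting everything in memory and writing it at the end.

import subprocess

def run_and_stream(command, log_path):
    # Line-buffered log file; every line is flushed as soon as it is produced.
    with open(log_path, 'a', buffering=1) as log_file:
        proc = subprocess.Popen(
            command, shell=True,
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
        for line in proc.stdout:
            log_file.write(line)
            log_file.flush()  # make progress visible immediately, even if the test hangs later
        return proc.wait()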

superbench logging: 'color' output should be optional

In the current superbench code, the logger has a hard-coded 'color' mode. While this design is good for interactive use, it is problematic when processing the logs programmatically in systems like Geneva.

The following is superbench output's raw text:
^[[0;33m[1,4]:^[[0m[^[[36m2022-08-29 22:50:16,377 node1:287^[[0m][^[[34mexecutor.py:125^[[0m][^[[32mINFO^[[0m] Executor succeeded in nccl-bw:nvlink.^[[0m^[[0m^M

Expected: the color mode should be an option of the 'sb' command, or it should be possible to turn it off via an environment variable.
The following is the expected raw text:

[1,4]:[2022-08-29 22:50:16,377 ND40rsv201000009:287][executor.py:125][INFO] Executor succeeded in nccl-bw:nvlink.
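
A minimal sketch of one possible fix (illustrative only, not the actual superbench logger): colorize only when writing to a TTY and when the NO_COLOR environment variable is not set.

import logging
import os

def supports_color(stream):
    # Colorize only for interactive terminals and only if NO_COLOR is unset.
    return stream.isatty() and 'NO_COLOR' not in os.environ

handler = logging.StreamHandler()
if supports_color(handler.stream):
    fmt = '\033[36m[%(asctime)s]\033[0m [%(levelname)s] %(message)s'
else:
    fmt = '[%(asctime)s] [%(levelname)s] %(message)s'
handler.setFormatter(logging.Formatter(fmt))

logger = logging.getLogger('sb')
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info('Executor succeeded in nccl-bw:nvlink.')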

[Bug Report] ONNX export failed on adaptive_avg_pool2d at tensorrt micro bench.

I am currently working in the superbench/superbench:v0.4.0-cuda11.1.1 docker workspace to measure benchmarks.

To benchmark different models with TensorRT, I customized superbenchmark/examples/benchmarks/tensorrt_inference_performance.py as below:

# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

"""Micro benchmark example for TensorRT inference performance.

Commands to run:
    python3 examples/benchmarks/tensorrt_inference_performance.py
"""
import sys
from statistics import mode
from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.common.utils import logger

if __name__ == '__main__':
    batch = int(sys.argv[1])
    model = sys.argv[2]
    precision = sys.argv[3]
    parameters = '--batch_size {0} --pytorch_models {1} --precision {2} --seq_length 8 --iterations 105'.format(batch, model, precision)

    context = BenchmarkRegistry.create_benchmark_context('tensorrt-inference', platform=Platform.CUDA, parameters=parameters)
    benchmark = BenchmarkRegistry.launch_benchmark(context)
    if benchmark:
        logger.info(
            'benchmark: {}, return code: {}, result: {}'.format(
                benchmark.name, benchmark.return_code, benchmark.result
            )
        )

execution:

nvprof --log-file benches/TensorRT/vgg11/fp32_batch_1_prof.txt /opt/conda/bin/python /opt/superbench/examples/benchmarks/tensorrt_inference_performance.py 1 vgg11 fp32 | tee benches/TensorRT/vgg11/fp32_batch_1_time.txt

log :

root@616b67a69ab7:/opt/superbench# nvprof --log-file benches/TensorRT/vgg11/fp32_batch_1_prof.txt /opt/conda/bin/python /opt/superbench/examples/benchmarks/tensorrt_inference_performance.py 1 vgg11 fp32 | tee benches/TensorRT/vgg11/fp32_batch_1_time.txt
/opt/conda/lib/python3.8/site-packages/torch/onnx/utils.py:256: UserWarning: `add_node_names' can be set to True only when 'operator_export_type' is `ONNX`. Since 'operator_export_type' is not set to 'ONNX', `add_node_names` argument will be ignored.
warnings.warn("`{}' can be set to True only when 'operator_export_type' is "
/opt/conda/lib/python3.8/site-packages/torch/onnx/utils.py:256: UserWarning: `do_constant_folding' can be set to True only when 'operator_export_type' is `ONNX`. Since 'operator_export_type' is not set to 'ONNX', `do_constant_folding` argument will be ignored.
warnings.warn("`{}' can be set to True only when 'operator_export_type' is "
/opt/conda/lib/python3.8/site-packages/torch/onnx/symbolic_helper.py:182: UserWarning: ONNX export failed on adaptive_avg_pool2d because input size not accessible not supported
warnings.warn("ONNX export failed on " + op + " because " + msg + " not supported")
[2022-05-06 12:33:25,995 616b67a69ab7:18330][micro_base.py:167][INFO] Execute command - round: 0, benchmark: tensorrt-inference, command: /opt/tensorrt/bin/trtexec --onnx=/root/.cache/torch/hub/onnx/vgg11.onnx --explicitBatch --optShapes=input:1x3x224x224 --workspace=8192 --iterations=105 --percentile=99.
[2022-05-06 12:33:40,844 616b67a69ab7:18330][micro_base.py:176][ERROR] Microbenchmark execution failed - round: 0, benchmark: tensorrt-inference, error message: &&&& RUNNING TensorRT.trtexec # /opt/tensorrt/bin/trtexec --onnx=/root/.cache/torch/hub/onnx/vgg11.onnx --explicitBatch --optShapes=input:1x3x224x224 --workspace=8192 --iterations=105 --percentile=99
[05/06/2022-12:33:26] [I] === Model Options ===
[05/06/2022-12:33:26] [I] Format: ONNX
[05/06/2022-12:33:26] [I] Model: /root/.cache/torch/hub/onnx/vgg11.onnx
[05/06/2022-12:33:26] [I] Output:
[05/06/2022-12:33:26] [I] === Build Options ===
[05/06/2022-12:33:26] [I] Max batch: explicit
[05/06/2022-12:33:26] [I] Workspace: 8192 MiB
[05/06/2022-12:33:26] [I] minTiming: 1
[05/06/2022-12:33:26] [I] avgTiming: 8
[05/06/2022-12:33:26] [I] Precision: FP32
[05/06/2022-12:33:26] [I] Calibration:
[05/06/2022-12:33:26] [I] Refit: Disabled
[05/06/2022-12:33:26] [I] Safe mode: Disabled
[05/06/2022-12:33:26] [I] Save engine:
[05/06/2022-12:33:26] [I] Load engine:
[05/06/2022-12:33:26] [I] Builder Cache: Enabled
[05/06/2022-12:33:26] [I] NVTX verbosity: 0
[05/06/2022-12:33:26] [I] Tactic sources: Using default tactic sources
[05/06/2022-12:33:26] [I] Input(s)s format: fp32:CHW
[05/06/2022-12:33:26] [I] Output(s)s format: fp32:CHW
[05/06/2022-12:33:26] [I] Input build shape: input=1x3x224x224+1x3x224x224+1x3x224x224
[05/06/2022-12:33:26] [I] Input calibration shapes: model
[05/06/2022-12:33:26] [I] === System Options ===
[05/06/2022-12:33:26] [I] Device: 0
[05/06/2022-12:33:26] [I] DLACore:
[05/06/2022-12:33:26] [I] Plugins:
[05/06/2022-12:33:26] [I] === Inference Options ===
[05/06/2022-12:33:26] [I] Batch: Explicit
[05/06/2022-12:33:26] [I] Input inference shape: input=1x3x224x224
[05/06/2022-12:33:26] [I] Iterations: 105
[05/06/2022-12:33:26] [I] Duration: 3s (+ 200ms warm up)
[05/06/2022-12:33:26] [I] Sleep time: 0ms
[05/06/2022-12:33:26] [I] Streams: 1
[05/06/2022-12:33:26] [I] ExposeDMA: Disabled
[05/06/2022-12:33:26] [I] Data transfers: Enabled
[05/06/2022-12:33:26] [I] Spin-wait: Disabled
[05/06/2022-12:33:26] [I] Multithreading: Disabled
[05/06/2022-12:33:26] [I] CUDA Graph: Disabled
[05/06/2022-12:33:26] [I] Separate profiling: Disabled
[05/06/2022-12:33:26] [I] Skip inference: Disabled
[05/06/2022-12:33:26] [I] Inputs:
[05/06/2022-12:33:26] [I] === Reporting Options ===
[05/06/2022-12:33:26] [I] Verbose: Disabled
[05/06/2022-12:33:26] [I] Averages: 10 inferences
[05/06/2022-12:33:26] [I] Percentile: 99
[05/06/2022-12:33:26] [I] Dump refittable layers:Disabled
[05/06/2022-12:33:26] [I] Dump output: Disabled
[05/06/2022-12:33:26] [I] Profile: Disabled
[05/06/2022-12:33:26] [I] Export timing to JSON file:
[05/06/2022-12:33:26] [I] Export output to JSON file:
[05/06/2022-12:33:26] [I] Export profile to JSON file:
[05/06/2022-12:33:26] [I]
[05/06/2022-12:33:26] [I] === Device Information ===
[05/06/2022-12:33:26] [I] Selected Device: NVIDIA Tesla V100-PCIE-16GB
[05/06/2022-12:33:26] [I] Compute Capability: 7.0
[05/06/2022-12:33:26] [I] SMs: 80
[05/06/2022-12:33:26] [I] Compute Clock Rate: 1.38 GHz
[05/06/2022-12:33:26] [I] Device Global Memory: 16160 MiB
[05/06/2022-12:33:26] [I] Shared Memory per SM: 96 KiB
[05/06/2022-12:33:26] [I] Memory Bus Width: 4096 bits (ECC enabled)
[05/06/2022-12:33:26] [I] Memory Clock Rate: 0.877 GHz
[05/06/2022-12:33:26] [I]
----------------------------------------------------------------
Input filename: /root/.cache/torch/hub/onnx/vgg11.onnx
ONNX IR version: 0.0.6
Opset version: 10
Producer name: pytorch
Producer version: 1.8
Domain:
Model version: 0
Doc string:
----------------------------------------------------------------
[05/06/2022-12:33:40] [W] [TRT] /workspace/TensorRT/parsers/onnx/onnx2trt_utils.cpp:218: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[05/06/2022-12:33:40] [I] [TRT] /workspace/TensorRT/parsers/onnx/ModelImporter.cpp:139: No importer registered for op: adaptive_avg_pool2d. Attempting to import as plugin.
[05/06/2022-12:33:40] [I] [TRT] /workspace/TensorRT/parsers/onnx/builtin_op_importers.cpp:3716: Searching for plugin: adaptive_avg_pool2d, plugin_version: 1, plugin_namespace:
[05/06/2022-12:33:40] [E] [TRT] INVALID_ARGUMENT: getPluginCreator could not find plugin adaptive_avg_pool2d version 1
While parsing node number 22 [adaptive_avg_pool2d]:
ERROR: /workspace/TensorRT/parsers/onnx/builtin_op_importers.cpp:3718 In function importFallbackPluginImporter:
[8] Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"
[05/06/2022-12:33:40] [E] Failed to parse onnx file
[05/06/2022-12:33:40] [E] Parsing model failed
[05/06/2022-12:33:40] [E] Engine creation failed
[05/06/2022-12:33:40] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec # /opt/tensorrt/bin/trtexec --onnx=/root/.cache/torch/hub/onnx/vgg11.onnx --explicitBatch --optShapes=input:1x3x224x224 --workspace=8192 --iterations=105 --percentile=99
.
[2022-05-06 12:33:40,844 616b67a69ab7:18330][tensorrt_inference_performance.py:23][INFO] benchmark: tensorrt-inference, return code: 32, result: {'return_code': [32]}

It seems that the TensorRT ONNX importer does not support the adaptive_avg_pool2d op?

Please cc.
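
One possible workaround, sketched here as an assumption rather than a confirmed fix: for a fixed 224x224 input, the adaptive average pool in torchvision's VGG is effectively a no-op (the feature map is already 7x7), so replacing it with nn.Identity() before export avoids emitting the unsupported adaptive_avg_pool2d op.

import torch
import torch.nn as nn
from torchvision import models

model = models.vgg11().eval()
model.avgpool = nn.Identity()  # features already produce 512x7x7 for a 224x224 input
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, 'vgg11.onnx', opset_version=11,
                  input_names=['input'], output_names=['output'])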

Executor/Run benchmark failed messages while running vgg models with superbench

What's the issue, what's expected?:
I tried to run some VGG models with superbench on an 8-GPU A100-80G machine, but some of them failed with the messages shown below.

How to reproduce it?:
Run Command:
sb run --no-docker -l localhost -c --output-dir

Log message or snapshot?:
Executor is going to execute model-benchmarks:vgg:float/pytorch-vgg16.
Model placement - model: pytorch-vgg16, GPU availablility: True, pin memory: False, force fp32: False.
Distributed training is enabled - model: pytorch-vgg16, distributed implementation: ddp.
Run benchmark failed - benchmark: pytorch-vgg19, message: trying to initialize the default process group twice!
benchmark: pytorch-vgg19, return code: 4, result: {'return_code': [4]}.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg19.
Run benchmark failed - benchmark: pytorch-vgg19, message: trying to initialize the default process group twice!
benchmark: pytorch-vgg19, return code: 4, result: {'return_code': [4]}.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg19.
Run benchmark failed - benchmark: pytorch-vgg19, message: trying to initialize the default process group twice!
benchmark: pytorch-vgg19, return code: 4, result: {'return_code': [4]}.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg19.
Run benchmark failed - benchmark: pytorch-vgg19, message: trying to initialize the default process group twice!
Run benchmark failed - benchmark: pytorch-vgg19, message: trying to initialize the default process group twice!
benchmark: pytorch-vgg19, return code: 4, result: {'return_code': [4]}.
benchmark: pytorch-vgg19, return code: 4, result: {'return_code': [4]}.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg19.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg19.
Run benchmark failed - benchmark: pytorch-vgg16, message: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=8, worker_count=3, timeout=0:05:00)
benchmark: pytorch-vgg16, return code: 4, result: {'return_code': [4]}.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg16.
Executor is going to execute model-benchmarks:vgg:float/pytorch-vgg19.
Model placement - model: pytorch-vgg19, GPU availablility: True, pin memory: False, force fp32: False.
Distributed training is enabled - model: pytorch-vgg19, distributed implementation: ddp.
Run benchmark failed - benchmark: pytorch-vgg19, message: trying to initialize the default process group twice!
benchmark: pytorch-vgg19, return code: 4, result: {'return_code': [4]}.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg19.
Run benchmark failed - benchmark: pytorch-vgg16, message: Timed out initializing process group in store based barrier on rank: 5, for key: store_based_barrier_key:1 (world_size=8, worker_count=3, timeout=0:05:00)
benchmark: pytorch-vgg16, return code: 4, result: {'return_code': [4]}.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg16.
Executor is going to execute model-benchmarks:vgg:float/pytorch-vgg19.
Model placement - model: pytorch-vgg19, GPU availablility: True, pin memory: False, force fp32: False.
Distributed training is enabled - model: pytorch-vgg19, distributed implementation: ddp.
Run benchmark failed - benchmark: pytorch-vgg16, message: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=8, worker_count=3, timeout=0:05:00)
Run benchmark failed - benchmark: pytorch-vgg19, message: trying to initialize the default process group twice!
benchmark: pytorch-vgg19, return code: 4, result: {'return_code': [4]}.
benchmark: pytorch-vgg16, return code: 4, result: {'return_code': [4]}.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg19.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg16.
Executor is going to execute model-benchmarks:vgg:float/pytorch-vgg19.
Model placement - model: pytorch-vgg19, GPU availablility: True, pin memory: False, force fp32: False.
Distributed training is enabled - model: pytorch-vgg19, distributed implementation: ddp.
Run benchmark failed - benchmark: pytorch-vgg19, message: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29501 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29501 (errno: 98 - Address already in use).
benchmark: pytorch-vgg19, return code: 4, result: {'return_code': [4]}.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg19.
[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:29501 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:29501 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.

Additional information:
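
For context, the "trying to initialize the default process group twice" message typically comes from calling torch.distributed.init_process_group more than once in the same process; a minimal guard looks like the sketch below (an illustration only, not SuperBench's actual model-benchmark code).

import torch.distributed as dist

def init_distributed(backend='nccl'):
    # Initialize the default process group only once per process;
    # init_process_group reads RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT from the environment.
    if not dist.is_initialized():
        dist.init_process_group(backend=backend)
    return dist.get_rank(), dist.get_world_size()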

superbench 0.6.0-rc1: setup requires requests>=2.28.1, which is not available for Python 3.6

On a computer with Python 3.6.9, 'python -m pip install --upgrade pip' will report the following error:

ERROR: Could not find a version that satisfies the requirement requests>=2.28.1 (from superbench) (from versions: 0.2.0, 0.2.1, 0.2.2, 0.2.3, 0.2.4, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.4.0, 0.4.1, 0.5.0, 0.5.1, 0.6.0, 0.6.1, 0.6.2, 0.6.3, 0.6.4, 0.6.5, 0.6.6, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.7.4, 0.7.5, 0.7.6, 0.8.0, 0.8.1, 0.8.2, 0.8.3, 0.8.4, 0.8.5, 0.8.6, 0.8.7, 0.8.8, 0.8.9, 0.9.0, 0.9.1, 0.9.2, 0.9.3, 0.10.0, 0.10.1, 0.10.2, 0.10.3, 0.10.4, 0.10.6, 0.10.7, 0.10.8, 0.11.1, 0.11.2, 0.12.0, 0.12.1, 0.13.0, 0.13.1, 0.13.2, 0.13.3, 0.13.4, 0.13.5, 0.13.6, 0.13.7, 0.13.8, 0.13.9, 0.14.0, 0.14.1, 0.14.2, 1.0.0, 1.0.1, 1.0.2, 1.0.3, 1.0.4, 1.1.0, 1.2.0, 1.2.1, 1.2.2, 1.2.3, 2.0.0, 2.0.1, 2.1.0, 2.2.0, 2.2.1, 2.3.0, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.5.0, 2.5.1, 2.5.2, 2.5.3, 2.6.0, 2.6.1, 2.6.2, 2.7.0, 2.8.0, 2.8.1, 2.9.0, 2.9.1, 2.9.2, 2.10.0, 2.11.0, 2.11.1, 2.12.0, 2.12.1, 2.12.2, 2.12.3, 2.12.4, 2.12.5, 2.13.0, 2.14.0, 2.14.1, 2.14.2, 2.15.1, 2.16.0, 2.16.1, 2.16.2, 2.16.3, 2.16.4, 2.16.5, 2.17.0, 2.17.1, 2.17.2, 2.17.3, 2.18.0, 2.18.1, 2.18.2, 2.18.3, 2.18.4, 2.19.0, 2.19.1, 2.20.0, 2.20.1, 2.21.0, 2.22.0, 2.23.0, 2.24.0, 2.25.0, 2.25.1, 2.26.0, 2.27.0, 2.27.1)
ERROR: No matching distribution found for requests>=2.28.1

Expected: the setup should allow a lower version of requests so that lower versions of Python 3 work.
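
A minimal sketch of one way to keep Python 3.6 installable, assuming the requirement is declared in setup.py's install_requires (the package metadata and exact pins below are illustrative); requests 2.27.x is the last series that supports Python 3.6, so an environment marker can select it automatically.

from setuptools import setup

setup(
    name='superbench-example',  # illustrative package metadata only
    version='0.0.0',
    install_requires=[
        'requests>=2.28.1; python_version >= "3.7"',
        'requests>=2.27.1,<2.28; python_version < "3.7"',  # last series supporting Python 3.6
    ],
)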

superbench mpi job should use the proper Ethernet interface

Several superbench tests (e.g. nccl-tests, ib-traffic) use Open MPI to launch the tests on multiple nodes. Some node types have multiple Ethernet interfaces (e.g. azure2, eth0, docker0, ib0, ib1, etc.), and the working IPv4 Ethernet interface is not the default 'eth0' (e.g. it is azure2).

While a user can manually check the node type and figure out which Ethernet interface to use for MPI (e.g. --mca btl_tcp_if_include azure2 --mca oob_tcp_if_include azure2), this is not generic across different node types.

Expected: because superbench launches the MPI command, superbench should detect the proper Ethernet interface to use and add it to the Open MPI command line.

The following is one way to find this interface using bash; it would be much simpler to do this in Python (see the sketch after the script).

get_eth_interfaces() {
    IPV4List=$(ip -4 -f inet a | grep mtu | awk '{print $2}' | sed ':a; N; $!ba; s/\n//g')
    for ifname in $(ls /sys/class/net); do
        if [[ -f /sys/class/net/$ifname/type && $(cat /sys/class/net/$ifname/type) -eq 1 && ! -f /sys/class/net/$ifname/bridge ]]; then
            isIPV4=$(echo ${IPV4List} | grep "$ifname:" | wc -l)
            isDocker=$(echo $ifname | grep docker | wc -l)
            if [[ "${isIPV4}" == "1" && "${isDocker}" == "0" ]]; then
                echo $ifname
            fi
        fi
    done
}
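
A Python sketch of the same idea, assuming psutil is available (the filtering rules below, e.g. skipping loopback and docker bridges, are illustrative and would need to match the bash logic above):

import socket
import psutil

def get_eth_interfaces():
    # Return interfaces that have an IPv4 address, skipping loopback and docker bridges.
    candidates = []
    for name, addrs in psutil.net_if_addrs().items():
        if name == 'lo' or name.startswith('docker'):
            continue
        if any(addr.family == socket.AF_INET for addr in addrs):
            candidates.append(name)
    return candidates

if __name__ == '__main__':
    print(get_eth_interfaces())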

third_party build error when using CUDA Dockerfile

When I try to build the docker image using dockerfile/cuda11.1.1.dockerfile, I get the following error:

~/superbenchmark main !1 ?2 ❯ docker buildx build \             
  --platform linux/amd64 --cache-to type=inline,mode=max \
  --tag superbench-dev --file dockerfile/cuda11.1.1.dockerfile .
[+] Building 665.5s (16/18)
 => [internal] load build definition from cuda11.1.1.dockerfile                        0.0s
 => => transferring dockerfile: 4.00kB                                                 0.0s
 => [internal] load .dockerignore                                                      0.0s
 => => transferring context: 35B                                                       0.0s
 => [internal] load metadata for nvcr.io/nvidia/pytorch:20.12-py3                      1.4s
 => [internal] load build context                                                      0.6s
 => => transferring context: 788.47kB                                                  0.5s
 => [ 1/14] FROM nvcr.io/nvidia/pytorch:20.12-py3@sha256:cc14c0cf580989bb1ff39fa78ca697b77a8860b17acead4a60b853bb45499f8d  0.0s
 => CACHED [ 2/14] RUN apt-get update &&     apt-get install -y --no-install-recommends     autoconf     automake     build-essential     curl     dmidecode     git     jq     libaio-dev     lib...  0.0s
 => [ 3/14] RUN cd /tmp &&     wget https://download.docker.com/linux/static/stable/x86_64/docker-20.10.8.tgz -O docker.tgz &&     tar --extract --file docker.tgz --strip-components 1 --director...  9.5s
 => [ 4/14] RUN mkdir -p /root/.ssh &&     touch /root/.ssh/authorized_keys &&     mkdir -p /var/run/sshd &&     sed -i "s/[# ]*PermitRootLogin prohibit-password/PermitRootLogin yes/" /etc/ssh/s...  0.6s
 => [ 5/14] RUN cd /tmp &&     wget -q http://content.mellanox.com/ofed/MLNX_OFED-5.2-2.2.3.0/MLNX_OFED_LINUX-5.2-2.2.3.0-ubuntu20.04-x86_64.tgz &&     tar xzf MLNX_OFED_LINUX-5.2-2.2.3.0-ubun...  277.4s
 => [ 6/14] RUN cd /opt &&     wget -q https://azhpcstor.blob.core.windows.net/azhpc-images-store/hpcx-v2.8...  62.9s
 => [ 7/14] RUN cd /tmp &&     git clone https://github.com/Mellanox/nccl-rdma-sharp-plugins.git &&     cd n...  22.1s
 => [ 8/14] RUN cd /tmp &&     git clone -b v2.10.3-1 https://github.com/NVIDIA/nccl.git &&     cd nccl &&...  264.6s
 => [ 9/14] RUN cd /tmp &&     mkdir -p mlc &&     cd mlc &&     wget --user-agent="Mozilla/5.0 (X11; Fedora;...  0.8s
 => [10/14] WORKDIR /opt/superbench                                                    0.1s
 => [11/14] ADD third_party third_party                                                0.1s
 => ERROR [12/14] RUN make -j 40 -C third_party cuda                                  25.8s
------
 > [12/14] RUN make -j 40 -C third_party cuda:
#0 0.415 make: Entering directory '/opt/superbench/third_party'
#0 0.415 mkdir -p /opt/superbench/bin
#0 0.418 mkdir -p /opt/superbench/lib
#0 0.445 if [ -d cuda-samples ]; then rm -rf cuda-samples; fi
#0 0.445 bash -c "source /opt/hpcx/hpcx-init.sh && hpcx_load && make CC=mpicc -C GPCNET all && hpcx_unload"
#0 0.465 git clone -b v11.1 https://github.com/NVIDIA/cuda-samples.git ./cuda-samples
#0 0.468 Cloning into './cuda-samples'...
#0 0.493 make[1]: Entering directory '/opt/superbench/third_party'
#0 0.493 make[1]: warning: jobserver unavailable: using -j1.  Add '+' to parent make rule.
#0 0.495 make[1]: Leaving directory '/opt/superbench/third_party/GPCNET'
#0 0.495 make[1]: *** No rule to make target 'all'.  Stop.
#0 0.496 make: *** [Makefile:98: gpcnet] Error 2
#0 0.496 make: *** Waiting for unfinished jobs....
#0 20.08 Note: switching to 'c4e2869a2becb4b6d9ce5f64914406bf5e239662'.
#0 20.08
#0 20.08 You are in 'detached HEAD' state. You can look around, make experimental
#0 20.08 changes and commit them, and you can discard any commits you make in this
#0 20.08 state without impacting any branches by switching back to a branch.
#0 20.08
#0 20.08 If you want to create a new branch to retain commits you create, you may
#0 20.08 do so (now or later) by using -c with the switch command. Example:
#0 20.08
#0 20.08   git switch -c <new-branch-name>
#0 20.08
#0 20.08 Or undo this operation with:
#0 20.08
#0 20.08   git switch -
#0 20.08
#0 20.08 Turn off this advice by setting config variable advice.detachedHead to false
#0 20.08
#0 20.56 cd ./cuda-samples/Samples/bandwidthTest && make clean && make TARGET_ARCH=x86_64 SMS="70 75 80 86"
#0 20.59 make[1]: warning: jobserver unavailable: using -j1.  Add '+' to parent make rule.
#0 20.59 make[1]: Entering directory '/opt/superbench/third_party/cuda-samples/Samples/bandwidthTest'
#0 20.61 rm -f bandwidthTest bandwidthTest.o
#0 20.62 rm -rf ../../bin/x86_64/linux/release/bandwidthTest
#0 20.62 make[1]: Leaving directory '/opt/superbench/third_party/cuda-samples/Samples/bandwidthTest'
#0 20.62 make[1]: warning: jobserver unavailable: using -j1.  Add '+' to parent make rule.
#0 20.62 make[1]: Entering directory '/opt/superbench/third_party/cuda-samples/Samples/bandwidthTest'
#0 20.65 /usr/local/cuda/bin/nvcc -ccbin g++ -I../../Common  -m64    -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o bandwidthTest.o -c bandwidthTest.cu
#0 25.31 /usr/local/cuda/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o bandwidthTest bandwidthTest.o
#0 25.62 mkdir -p ../../bin/x86_64/linux/release
#0 25.62 cp bandwidthTest ../../bin/x86_64/linux/release
#0 25.63 make[1]: Leaving directory '/opt/superbench/third_party/cuda-samples/Samples/bandwidthTest'
#0 25.63 cp -v ./cuda-samples/Samples/bandwidthTest/bandwidthTest /opt/superbench/bin/
#0 25.63 './cuda-samples/Samples/bandwidthTest/bandwidthTest' -> '/opt/superbench/bin/bandwidthTest'
#0 25.63 make: Leaving directory '/opt/superbench/third_party'
------
error: failed to solve: executor failed running [/bin/sh -c make -j ${NUM_MAKE_JOBS} -C third_party cuda]: exit code: 2

I cloned the recent main branch; the commit hash is a9634ef.

The problem is in step 12 of the docker build.

Please help. Thanks.

TensorRT parameter passing can be enhanced

trtexec has a lot of arguments, but superbench only covers a fraction of them, and the default values trtexec sets are not suitable for benchmarking our program. For example, when I use superbench to profile a resnet50.onnx with the TensorRT backend, the command that superbench generates is:

/opt/tensorrt/bin/trtexec --onnx=/workspace/v-leiwang3/.torch/hub/onnx/resnet50.onnx --explicitBatch --optShapes=input:1x3x224x224 --workspace=8192 --iterations=105 --percentile=99.

However, I found this command ran more than 200 executions on our V100 GPU. This is caused by the default argument --duration being set to 3, which means trtexec will profile the model for at least 3 s; but 100 iterations on resnet50 only take about 1.5 s, so the default value of --duration should be set to 0 to strictly execute the given number of iterations.

Also, for warm-up, trtexec provides the --warmUp option, so my expected command would be:

/opt/tensorrt/bin/trtexec --onnx=/workspace/v-leiwang3/.torch/hub/onnx/resnet50.onnx --explicitBatch --optShapes=input:1x3x224x224 --workspace=8192 --fp16 --avgRuns=10 --warmUp=5 --iterations=100 --percentile=99. --duration=0 
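
A minimal sketch of how such a command could be assembled (the helper and its defaults are illustrative, not superbench's actual implementation; the flag values mirror the expected command above):

def build_trtexec_command(onnx_path, batch=1, precision='fp16', iterations=100):
    # Compose a trtexec command that runs strictly the requested iterations.
    args = [
        '/opt/tensorrt/bin/trtexec',
        f'--onnx={onnx_path}',
        '--explicitBatch',
        f'--optShapes=input:{batch}x3x224x224',
        '--workspace=8192',
        '--avgRuns=10',
        '--warmUp=5',              # warm-up phase before the timed iterations
        f'--iterations={iterations}',
        '--percentile=99',
        '--duration=0',            # do not extend the run beyond the requested iterations
    ]
    if precision == 'fp16':
        args.append('--fp16')
    return ' '.join(args)

print(build_trtexec_command('/path/to/resnet50.onnx'))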

Some tests do not support compute capability 8.9 (RTX 4080/4090)

What's the issue, what's expected?:
[2023-04-16 12:26:24,006 u22:880][cuda_gemm_flops_performance.py:77][ERROR] Unsupported architecture - benchmark: gemm-flops, compute capability: 8.9, supports 7.0 7.5 8.0 8.6 9.0

How to reproduce it?:
Run superbenchmark with RTX 4080/4090.

Log message or snapshot?:

Additional information:

sb-exec.log

Passing multiple test configurations to cublas microbenchmark

What would you like to be added:
Support multiple configurations as a list in the YAML config file for cublas testing

Why is this needed:
It enables creating a sweep of custom tests to run.

Without this feature, how does the current superbenchmark work?
It uses the default list of test configs or accepts only one custom test config.

Components that may involve changes:
The logic in https://github.com/microsoft/superbenchmark/blob/main/superbench/benchmarks/micro_benchmarks/cublas_function.py#L248 would need to be extended to handle a list of different dictionaries.

Brief description of your proposal if any:
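A minimal, purely illustrative sketch (helper and field names here are hypothetical, not the current superbench API) of how the preprocessing step could accept either a single config dict or a list of them:

import json

def parse_custom_configs(config_json_str):
    """Accept either a single JSON object or a JSON list of objects."""
    parsed = json.loads(config_json_str)
    configs = parsed if isinstance(parsed, list) else [parsed]
    for cfg in configs:
        # Each entry would describe one cublas test case, e.g. a GEMM shape.
        if not isinstance(cfg, dict):
            raise ValueError('every custom config entry must be a dict')
    return configs

# Example: a sweep of two GEMM shapes passed as one string from the YAML config.
sweep = parse_custom_configs(
    '[{"name": "cublasGemmEx", "m": 4096, "n": 4096, "k": 4096},'
    ' {"name": "cublasGemmEx", "m": 8192, "n": 8192, "k": 8192}]'
)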

Run failed (Failed to get information on remote file)

What's the issue, what's expected?:
sb deploy -f local.ini

TASK [Copying Context] *********************************************************
fatal: [localhost]: FAILED! => {"msg": "Failed to get information on remote file (/home/edison/sb-workspace/.ssh/config): sudo: a password is required\n"}

PLAY RECAP *********************************************************************
localhost : ok=8 changed=2 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
[2022-11-21 17:31:49,645 u22:6029][ansible.py:80][WARNING] Run failed, return code 2.

log message:

[2022-11-21 18:27:12,121 u22:6465][file_handler.py:79][INFO] No benchmark config provided, using config file /home/edison/.local/lib/python3.10/site-packages/superbench/config/default.yaml.
[2022-11-21 18:27:12,156 u22:6465][ansible.py:59][INFO] {'host_pattern': 'all', 'cmdline': '--forks 1 --inventory /home/edison/Downloads/superbenchmark/local.ini'}
[2022-11-21 18:27:12,163 u22:6465][runner.py:42][INFO] Runner uses config: {'superbench': {'benchmarks': {'bert_models': {'enable': True,
'frameworks': ['pytorch'],
'models': ['bert-base',
'bert-large'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}],
'parameters': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']}},
'computation-communication-overlap': {'enable': True,
'frameworks': ['pytorch'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}]},
'cpu-memory-bw-latency': {'enable': False,
'modes': [{'name': 'local',
'parallel': False,
'proc_num': 1}],
'parameters': {'tests': ['bandwidth_matrix',
'latency_matrix',
'max_bandwidth']}},
'cublas-function': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}]},
'cudnn-function': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}]},
'densenet_models': {'enable': True,
'frameworks': ['pytorch'],
'models': ['densenet169',
'densenet201'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}],
'parameters': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']}},
'disk-benchmark': {'enable': False,
'modes': [{'name': 'local',
'parallel': False,
'proc_num': 1}],
'parameters': {'block_devices': ['/dev/nvme0n1']}},
'gemm-flops': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}]},
'gpcnet-network-load-test': {'enable': False,
'modes': [{'env': {'UCX_NET_DEVICES': 'mlx5_0:1'},
'mca': {'btl': '^uct',
'btl_tcp_if_include': 'eth0',
'pml': 'ucx'},
'name': 'mpi',
'proc_num': 1}]},
'gpcnet-network-test': {'enable': False,
'modes': [{'env': {'UCX_NET_DEVICES': 'mlx5_0:1'},
'mca': {'btl': '^uct',
'btl_tcp_if_include': 'eth0',
'pml': 'ucx'},
'name': 'mpi',
'proc_num': 1}]},
'gpt_models': {'enable': True,
'frameworks': ['pytorch'],
'models': ['gpt2-small',
'gpt2-large'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}],
'parameters': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']}},
'gpu-burn': {'enable': True,
'modes': [{'name': 'local',
'parallel': False,
'proc_num': 1}],
'parameters': {'doubles': True,
'tensor_core': True,
'time': 300}},
'gpu-copy-bw:correctness': {'enable': True,
'modes': [{'name': 'local',
'parallel': False}],
'parameters': {'check_data': True,
'copy_type': ['sm',
'dma'],
'mem_type': ['htod',
'dtoh',
'dtod'],
'num_loops': 1,
'num_warm_up': 0,
'size': 4096}},
'gpu-copy-bw:perf': {'enable': True,
'modes': [{'name': 'local',
'parallel': False}],
'parameters': {'copy_type': ['sm',
'dma'],
'mem_type': ['htod',
'dtoh',
'dtod']}},
'ib-loopback': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'PROC_RANK={proc_rank} '
'IB_DEVICES=0,2,4,6 '
'NUMA_NODES=1,0,3,2',
'proc_num': 4},
{'name': 'local',
'parallel': True,
'prefix': 'PROC_RANK={proc_rank} '
'IB_DEVICES=1,3,5,7 '
'NUMA_NODES=1,0,3,2',
'proc_num': 4}]},
'ib-traffic': {'enable': False,
'modes': [{'name': 'mpi',
'proc_num': 8}],
'parameters': {'gpu_dev': '$LOCAL_RANK',
'ib_dev': 'mlx5_$LOCAL_RANK',
'msg_size': 8388608,
'numa_dev': '$((LOCAL_RANK/2))'}},
'kernel-launch': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}]},
'lstm_models': {'enable': True,
'frameworks': ['pytorch'],
'models': ['lstm'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}],
'parameters': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']}},
'matmul': {'enable': True,
'frameworks': ['pytorch'],
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}]},
'mem-bw': {'enable': True,
'modes': [{'name': 'local',
'parallel': False,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank} '
'numactl -N '
'$(({proc_rank}/2))',
'proc_num': 8}]},
'nccl-bw:default': {'enable': True,
'modes': [{'name': 'local',
'parallel': False,
'proc_num': 1}],
'parameters': {'ngpus': 8}},
'nccl-bw:gdr-only': {'enable': True,
'modes': [{'env': {'NCCL_IB_DISABLE': '0',
'NCCL_IB_PCI_RELAXED_ORDERING': '1',
'NCCL_MIN_NCHANNELS': '16',
'NCCL_NET_GDR_LEVEL': '5',
'NCCL_P2P_DISABLE': '1',
'NCCL_SHM_DISABLE': '1'},
'name': 'local',
'parallel': False,
'proc_num': 1}],
'parameters': {'ngpus': 8}},
'ort-inference': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}],
'parameters': {'batch_size': 1}},
'resnet_models': {'enable': True,
'frameworks': ['pytorch'],
'models': ['resnet50',
'resnet101',
'resnet152'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}],
'parameters': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']}},
'sharding-matmul': {'enable': True,
'frameworks': ['pytorch'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}]},
'tcp-connectivity': {'enable': False,
'modes': [{'name': 'local',
'parallel': False}],
'parameters': {'port': 22}},
'tensorrt-inference': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}],
'parameters': {'batch_size': 1,
'precision': 'int8',
'pytorch_models': ['resnet50',
'resnet101',
'resnet152',
'densenet169',
'densenet201',
'bert-base',
'bert-large'],
'seq_length': 224}},
'vgg_models': {'enable': True,
'frameworks': ['pytorch'],
'models': ['vgg11',
'vgg13',
'vgg16',
'vgg19'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}],
'parameters': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']}}},
'enable': None,
'monitor': {'enable': True,
'sample_duration': 1,
'sample_interval': 10},
'var': {'common_model_config': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']},
'default_local_mode': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}]},
'default_pytorch_mode': {'enable': True,
'frameworks': ['pytorch'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}]}}},
'version': 'v0.6'}.
[2022-11-21 18:27:12,163 u22:6465][runner.py:43][INFO] Runner writes to: /home/edison/Downloads/superbenchmark/outputs/2022-11-21_18-27-12.
[2022-11-21 18:27:12,179 u22:6465][runner.py:48][INFO] Runner will run: ['gpu-burn', 'nccl-bw:default', 'nccl-bw:gdr-only', 'ib-loopback', 'mem-bw', 'gpu-copy-bw:correctness', 'gpu-copy-bw:perf', 'kernel-launch', 'gemm-flops', 'cudnn-function', 'cublas-function', 'matmul', 'sharding-matmul', 'computation-communication-overlap', 'ort-inference', 'tensorrt-inference', 'gpt_models', 'bert_models', 'lstm_models', 'resnet_models', 'densenet_models', 'vgg_models']
[2022-11-21 18:27:12,179 u22:6465][runner.py:165][INFO] Preparing SuperBench environment.
[2022-11-21 18:27:12,179 u22:6465][ansible.py:125][INFO] Run playbook deploy.yaml ...

PLAY [Facts Gathering] *********************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

PLAY [Context Preparation] *****************************************************

TASK [Generating SSH Config] ***************************************************
changed: [localhost]

TASK [Generating SSH Key Pair] *************************************************
changed: [localhost]

PLAY [Check GPU Environment] ***************************************************

TASK [Checking NVIDIA GPU Environment] *****************************************
ok: [localhost] => (item=/dev/nvidiactl)
ok: [localhost] => (item=/dev/nvidia-uvm)

TASK [Checking AMD GPU Environment] ********************************************
ok: [localhost] => (item=/dev/kfd)
ok: [localhost] => (item=/dev/dri)

TASK [Set GPU Facts] ***********************************************************
ok: [localhost]

TASK [Print GPU Checking Result] ***********************************************
ok: [localhost] => {
"msg": [
"NVIDIA GPU detected",
"AMD GPU not operational, pls confirm amdgpu kernel module is loaded"
]
}

PLAY [Remote Deployment] *******************************************************

TASK [Creating Workspace] ******************************************************
ok: [localhost] => (item=/home/edison/sb-workspace)
ok: [localhost] => (item=/home/edison/sb-workspace/.ssh)

TASK [Copying Context] *********************************************************
fatal: [localhost]: FAILED! => {"msg": "Failed to get information on remote file (/home/edison/sb-workspace/.ssh/config): sudo: a password is required\n"}

PLAY RECAP *********************************************************************
localhost : ok=8 changed=2 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
[2022-11-21 18:27:14,383 u22:6465][ansible.py:80][WARNING] Run failed, return code 2.

How to reproduce it?:
sudo apt-get install sshpass
git clone -b v0.6.0 https://github.com/microsoft/superbenchmark
cd superbenchmark
python3 -m pip install .
make postinstall

create local.ini
[all]
localhost ansible_connection=local

sb deploy -f local.ini

Log message or snapshot?:

Additional information:
sb --version

0.6.0
Python (Linux) 3.10.6 (main, Nov 2 2022, 18:53:38) [GCC 11.3.0]
Python location '/usr/bin/python3'

sshpass -V

sshpass 1.09

OS: Ubuntu 22.04
NVIDIA Driver: 525.53
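
One possible workaround (untested here, assuming the deploy playbook honors standard Ansible become variables) is to supply the sudo password directly in local.ini:

[all]
localhost ansible_connection=local ansible_become_password=<your sudo password>

Alternatively, configuring passwordless sudo for the user running sb deploy avoids the prompt.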

V0.5.0 Release Plan

Release Manager

@cp5555

Endgame

Code freeze: April 10th, 2022
Bug Bash date: April 11th, 2022
Release date: April 22nd, 2022

Main Features

Micro-benchmark Improvement

    • Support nccl bandwidth benchmark only with NIC in NCCL/RCCL Bandwidth Test (#299)
    • Support bi-directional bandwidth benchmark in GPU Copy Bandwidth Test (#285, #298, #302)
    • Support data checking in GPU Copy Bandwidth Test (#301)
    • Update rccl-tests submodule to fix divide by zero error (#306)
    • Add GPU-Burn as microbenchmark (#324)

Model-benchmark Improvement

    • Sync results on root rank for e2e model benchmarks in distributed mode (#287)
    • Support customized env in local and torch.distributed mode (#295)
    • Add support for pytorch>=1.9.0 (#305)
    • Keep BatchNorm as fp32 for pytorch cnn models cast to fp16 (#322)
    • Remove FP16 samples type converting time (#330, #332)
    • Support FAMBench (#338)

Inference Benchmark Improvement

    • Revise the default setting for inference benchmark (#311, #329)
    • Add percentile metrics for inference benchmarks (#283)
    • Support T4 and A10 in GEMM benchmark (#294)
    • Add configuration with inference benchmark (#311)

SuperBench Improvement

    • Add command to support listing all optional parameters for benchmarks. (#279)
    • Unify benchmark naming convention and support multiple tests with same benchmark and different parameters/options in one configuration file (#284)
    • Support timeout to detect the benchmark failure and stop the process automatically (#288)
    • Add rocm5.0 dockerfile (#307)
    • Improve Output Interface (#333)

Data Diagnosis & Analysis

    • Support multi-benchmark check (#289)
    • Support result summary in md, excel and html format (#320, #321, #335)
    • Support data diagnosis in md and html format (#325)
    • Support result output for all nodes in data diagnosis (#336, #339)
    • Add document for result summary usage (#337)

Backlogs

SuperBench Improvement

    • Support automatic configuration yaml selection on Azure VM

Inference Benchmark Improvement

    • Support VGG, LSTM, and GPT-2 small in TensorRT Inference Backend
    • Support VGG, LSTM, and GPT-2 small in ORT Inference Backend

Data Diagnosis & Analysis

    • Support boxplot and outlier analysis

Document

    • Metric Reasoning Doc

ib validation benchmark should support mixed IB device naming schema

This is for the latest superbench code.

The current superbench ib validation benchmark is designed around consistent IB device names across the nodes. A user must specify this name, or a default one is used. The following code constructs the command that is passed to the IB tool (e.g. ib_write_bw):
https://github.com/microsoft/superbenchmark/blob/main/superbench/benchmarks/micro_benchmarks/ib_validation_performance.py#L310

This design has a problem: in some environments the IB device naming is not consistent, e.g. one VM calls the IB device mlx5_0 while another calls it mlx5_ib0. There is no way to run the ib-validation benchmark on these VMs together.

Expected:
The IB validation benchmark should work even when one VM calls the IB device mlx5_0 and another calls it mlx5_ib0 (or some other name). One possible design: in the run config yaml, the user specifies the index of the IB device (e.g. 0, 1, 2), and superbench figures out the actual physical device name at runtime on each VM (e.g. mlx5_0, mlx5_ib0, etc.); 'ibstat -l' can list the IB device names, as sketched below.
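
A minimal sketch of such a lookup, assuming 'ibstat -l' is available on each node (the function name is illustrative, not existing superbench code):

import subprocess

def get_local_ib_device(index):
    """Resolve an IB device index to the device name reported on this node."""
    output = subprocess.run(['ibstat', '-l'], capture_output=True, text=True, check=True).stdout
    # Sort for a deterministic order so the same index maps consistently on each node.
    devices = sorted(line.strip() for line in output.splitlines() if line.strip())
    if index >= len(devices):
        raise IndexError('IB device index %d not found, only %d devices present' % (index, len(devices)))
    # Returns e.g. 'mlx5_0' on one VM and 'mlx5_ib0' on another for the same index.
    return devices[index]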

[Enhancement] - Add HPL random generator to gemm-flops with ROCm

What would you like to be added:
For the GEMM-FLOPs test with ROCm, pass flags to rocblas-bench to specify which random tensor generation is used.
--initialization rand_int
--initialization hpl

Why is this needed:
--initialization rand_int uses simple random tensor generation, which produces relatively more zero values
--initialization hpl uses HPL-style random tensor generation, which produces relatively fewer zero values

Without this feature, how does current superbenchmark work
More zero values yield better (inflated) benchmark results, which is the default behavior for ROCm before the 5.1 release.

Components that may involve changes:
GEMM-FLOPs

Brief description of your proposal if any:
reference:
https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.1.0
https://ontrack.amd.com/browse/MSRCHA-325
ROCm/rocBLAS@9e9ced4
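
A rough sketch of how the benchmark could expose this (argument and function names are hypothetical, not the existing superbench code):

import argparse

def add_initialization_arg(parser):
    # Hypothetical knob forwarded to rocblas-bench; only the two values from this proposal are listed.
    parser.add_argument(
        '--initialization',
        type=str,
        default='rand_int',
        choices=['rand_int', 'hpl'],
        help='Random tensor generation used by rocblas-bench.',
    )

def build_command(base_command, initialization):
    # base_command would be the rocblas-bench invocation superbench already generates.
    return '%s --initialization %s' % (base_command, initialization)

parser = argparse.ArgumentParser()
add_initialization_arg(parser)
args = parser.parse_args(['--initialization', 'hpl'])
print(build_command('rocblas-bench -f gemm', args.initialization))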

superbench failed at default most typical run config

What's the issue, what's expected?:


TASK [Starting Container] ******************************************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": "docker rm --force sb-workspace ||: && docker run -itd --name=sb-workspace  --privileged --net=host --ipc=host  --gpus=all    -w /root -v /root/sb-workspace:/root -v /mnt:/mnt  -v /var/run/docker.sock:/var/run/docker.sock  --entrypoint /bin/bash superbench/superbench && docker exec sb-workspace bash -c  \"chown -R root:root ~ && \\\n  sed -i 's/[# ]*Port.*/Port 22066/g' /etc/ssh/sshd_config && \\\n  service ssh restart && sb help\"\n", "delta": "0:00:36.069805", "end": "2023-04-15 03:01:11.455660", "msg": "non-zero return code", "rc": 125, "start": "2023-04-15 03:00:35.385855", "stderr": "Error response from daemon: No such container: sb-workspace\ndocker: Error response from daemon: could not select device driver \"\" with capabilities: [[gpu]].", "stderr_lines": ["Error response from daemon: No such container: sb-workspace", "docker: Error response from daemon: could not select device driver \"\" with capabilities: [[gpu]]."], "stdout": "28784ba8358530ee44bf82ec37213a691e8573b1b52231c794533c0db781483c", "stdout_lines": ["28784ba8358530ee44bf82ec37213a691e8573b1b52231c794533c0db781483c"]}

PLAY RECAP *********************************************************************
localhost                  : ok=10   changed=5    unreachable=0    failed=1    skipped=1    rescued=0    ignored=0
[2023-04-15 03:01:11,663 jd-MS-7B22:26239][ansible.py:82][WARNING] Run failed, return code 2.
jd@jd-MS-7B22:~/gg/git/superbenchmark$
jd@jd-MS-7B22:~/gg/git/superbenchmark$
jd@jd-MS-7B22:~/gg/git/superbenchmark$ sudo docker container list
[sudo] password for jd:
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
jd@jd-MS-7B22:~/gg/git/superbenchmark$ sudo docker container list --all
CONTAINER ID   IMAGE                   COMMAND       CREATED       STATUS    PORTS     NAMES
28784ba83585   superbench/superbench   "/bin/bash"   5 hours ago   Created             sb-workspace
jd@jd-MS-7B22:~/gg/git/superbenchmark$

How to reproduce it?:
Follow the instructions at https://aka.ms/superbench.

Log message or snapshot?:
above

Additional information:
ubuntu 22.04 bare metal, gtx 2070, cuda 12.x

Executor/Run benchmark failed messages while running bert-large model with superbench

What's the issue, what's expected?:
The bert-large model failed with the error messages shown below when run with superbench on an A100-80GB machine with 8 GPUs. Please help provide suggestions to fix this.

How to reproduce it?:
Run Command:
sb run --no-docker -l localhost -c --output-dir

Log message or snapshot?:
Model placement - model: pytorch-bert-large, GPU availablility: True, pin memory: True, force fp32: False.
Distributed training is enabled - model: pytorch-bert-large, distributed implementation: ddp.
Executor is going to execute bert_models/pytorch-bert-large.
Model placement - model: pytorch-bert-large, GPU availablility: True, pin memory: True, force fp32: False.
Distributed training is enabled - model: pytorch-bert-large, distributed implementation: ddp.
Run benchmark failed - benchmark: pytorch-bert-large, message: Broken pipe
benchmark: pytorch-bert-large, return code: 4, result: {'return_code': [4]}.
Executor failed in bert_models/pytorch-bert-large.
Run benchmark failed - benchmark: pytorch-bert-large, message: Broken pipe
benchmark: pytorch-bert-large, return code: 4, result: {'return_code': [4]}.
Executor failed in bert_models/pytorch-bert-large.
benchmark: pytorch-bert-base, return code: 0, result: {'return_code': [0]}.
Executor succeeded in bert_models/pytorch-bert-base.
Run benchmark failed - benchmark: pytorch-bert-large, message: Broken pipe
benchmark: pytorch-bert-large, return code: 4, result: {'return_code': [4]}.
Executor failed in bert_models/pytorch-bert-large.
Executor is going to execute bert_models/pytorch-bert-large.
Model placement - model: pytorch-bert-large, GPU availablility: True, pin memory: True, force fp32: False.
Distributed training is enabled - model: pytorch-bert-large, distributed implementation: ddp.
benchmark: pytorch-bert-base, return code: 0, result: {'return_code': [0]}.
Executor succeeded in bert_models/pytorch-bert-base.
Executor is going to execute bert_models/pytorch-bert-large.
benchmark: pytorch-bert-base, return code: 0, result: {'return_code': [0], 'fp32_train_step_time': [96.17431712523103], 'fp32_train_throughput': [333.46321017523894], 'fp16_train_step_time': [66.33570070005953], 'fp16_train_throughput': [484.904031859023]}.
Executor succeeded in bert_models/pytorch-bert-base.
Model placement - model: pytorch-bert-large, GPU availablility: True, pin memory: True, force fp32: False.
Distributed training is enabled - model: pytorch-bert-large, distributed implementation: ddp.
benchmark: pytorch-bert-base, return code: 0, result: {'return_code': [0]}.
Executor succeeded in bert_models/pytorch-bert-base.
Executor is going to execute bert_models/pytorch-bert-large.
Model placement - model: pytorch-bert-large, GPU availablility: True, pin memory: True, force fp32: False.
Distributed training is enabled - model: pytorch-bert-large, distributed implementation: ddp.
Executor is going to execute bert_models/pytorch-bert-large.
Model placement - model: pytorch-bert-large, GPU availablility: True, pin memory: True, force fp32: False.
Distributed training is enabled - model: pytorch-bert-large, distributed implementation: ddp.
Run benchmark failed - benchmark: pytorch-bert-large, message: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=8, worker_count=4, timeout=0:05:00)
benchmark: pytorch-bert-large, return code: 4, result: {'return_code': [4]}.
Executor failed in bert_models/pytorch-bert-large.
Run benchmark failed - benchmark: pytorch-bert-large, message: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=8, worker_count=4, timeout=0:05:00)
benchmark: pytorch-bert-large, return code: 4, result: {'return_code': [4]}.
Executor failed in bert_models/pytorch-bert-large.
Run benchmark failed - benchmark: pytorch-bert-large, message: Timed out initializing process group in store based barrier on rank: 4, for key: store_based_barrier_key:1 (world_size=8, worker_count=4, timeout=0:05:00)
benchmark: pytorch-bert-large, return code: 4, result: {'return_code': [4]}.
Executor failed in bert_models/pytorch-bert-large.
Run benchmark failed - benchmark: pytorch-bert-large, message: Timed out initializing process group in store based barrier on rank: 7, for key: store_based_barrier_key:1 (world_size=8, worker_count=4, timeout=0:05:00)
benchmark: pytorch-bert-large, return code: 4, result: {'return_code': [4]}.
Executor failed in bert_models/pytorch-bert-large.
Run benchmark failed - benchmark: pytorch-bert-large, message: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=8, worker_count=5, timeout=0:05:00)
benchmark: pytorch-bert-large, return code: 4, result: {'return_code': [4]}.
Executor failed in bert_models/pytorch-bert-large.
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated

Additional information:

V0.8.0 Test Plan

Test Cases

single-node test

Machine Type #Node * #GPU * GPU Type PyTorch Version Accelerated Computing Toolkit Status
NDv5 SXM 1* 8 * H100 PyTorch 1.x CUDA11.8 Done
ND A100 v4/NDm A100 v4 1 * 8 * A100 80GB SXM PyTorch 1.x CUDA 11.8
ND A100 v4/NDm A100 v4 1 * 8 * A100 40GB SXM PyTorch 1.8 CUDA 11.1

Hopper GPU and FP8 related benchmarks

  1. microbenchmark
  • Add distributed inference benchmark (#493)
  • Support tensor core precisions (e.g., FP8) and batch/shape range in cublaslt gemm (#492 and #494)
  2. e2e benchmark
  • Support TE FP8 in BERT/GPT2 models (#496, #499)

SuperBench existing benchmark improvement

  1. microbenchmark improvement
  • Support flexible warmup and non-random data initialization in cublas-benchmark (#479)
  • Support error tolerance in micro-benchmark for CuDNN function (#490)
  2. e2e benchmark improvement
  • Fix torch.dist init issue with multiple models (#495)

CPU benchmark

  • Add STREAM benchmark for sustainable memory bandwidth and the corresponding computation rate. (#473)
  • Add HPL Benchmark for HPC Linpack Benchmark. (#482)

SuperBench Improvement

  1. install pipeline
  • Remove fixed rccl version in rocm5.1.x docker file (#476)
  • Upgrade networkx version to fix installation compatibility issue (#478)
  • Pin setuptools version to v65.7.0 (#483)
  • Limit ansible_runner version for Python3.6 (#485)
  2. monitor
  • Support cgroup V2 when reading system metrics in Monitor

multi-node test

Machine Type #Node * #GPU * GPU Type PyTorch Version Accelerated Computing Toolkit Status
NDv5 SXM 2* 8 * H100 PyTorch 1.x CUDA11.8

Hopper GPU and FP8 related benchmarks

  1. microbenchmark
  • Add distributed inference benchmark (#493)
