linux-rdma / perftest Goto Github PK

Infiniband Verbs Performance Tests

License: Other

Makefile 0.70% Shell 0.70% M4 1.97% C 96.63%

perftest's Introduction

	     Open Fabrics Enterprise Distribution (OFED)
                	Performance Tests README



===============================================================================
Table of Contents
===============================================================================
1. Overview
2. Installation
3. Notes on Testing Methodology
4. Test Descriptions
5. Running Tests
6. Known Issues

===============================================================================
1. Overview
===============================================================================
This is a collection of tests written over uverbs intended for use as a
performance micro-benchmark. The tests may be used for HW or SW tuning
as well as for functional testing.

The collection contains a set of bandwidth and latency benchmark such as:

	* Send        - ib_send_bw and ib_send_lat
	* RDMA Read   - ib_read_bw and ib_read_lat
	* RDMA Write  - ib_write_bw and ib_write_lat
	* RDMA Atomic - ib_atomic_bw and ib_atomic_lat
	* Native Ethernet (when working with MOFED2) - raw_ethernet_bw, raw_ethernet_lat

Please post results/observations to the openib-general mailing list.
See "Contact Us" at http://openib.org/mailman/listinfo/openib-general and
http://www.openib.org.

===============================================================================
2. Installation
===============================================================================
-After cloning the repository a perftest directory should appear in your current
 directory

-Cloning example :
git clone <URL>, In our situation its --> git clone https://github.com/linux-rdma/perftest.git

-After cloning, Follow this commands:

-cd perftest/

-./autogen.sh

-./configure    Note:If you want to install in a specific directory use the optional flag --prefix=<Directory path> , e.g: ./configure --prefix=<Directory path>

-make

-make install

-All of the tests will appear in the  perftest directory and in the install directory.
===============================================================================
3. Notes on Testing Methodology
===============================================================================
- The benchmarks use the CPU cycle counter to get time stamps without context
  switch.  Some CPU architectures (e.g., Intel's 80486 or older PPC) do not
  have such capability.

- The latency benchmarks measure round-trip time but report half of that as one-way
  latency. This means that the results may not be accurate for asymmetrical configurations.

- On all unidirectional bandwidth benchmarks, the client measures the bandwidth.
  On bidirectional bandwidth benchmarks, each side measures the bandwidth of
  the traffic it initiates, and at the end of the measurement period, the server
  reports the result to the client, who combines them together.

- Latency tests report minimum, median and maximum latency results. 
  The median latency is typically less sensitive to high latency variations,
  compared to average latency measurement.
  Typically, the first value measured is the maximum value, due to warmup effects.

- Long sampling periods have very limited impact on measurement accuracy.
  The default value of 1000 iterations is pretty good.
  Note that the program keeps data structures with memory footprint proportional
  to the number of iterations. Setting a very high number of iteration may
  have negative impact on the measured performance which are not related to
  the devices under test.
  If a high number of iterations is strictly necessary, it is recommended to
  use the -N flag (No Peak).

- Bandwidth benchmarks may be run for a number of iterations, or for a fixed duration.
  Use the -D flag to instruct the test to run for the specified number of seconds.
  The --run_infinitely flag instructs the program to run until interrupted by
  the user, and print the measured bandwidth every 5 seconds. 

- The "-H" option in latency benchmarks dumps a histogram of the results.
  See xgraph, ygraph, r-base (http://www.r-project.org/), PSPP, or other 
  statistical analysis programs.

  *** IMPORTANT NOTE:
      When running the benchmarks over an Infiniband fabric,
      a Subnet Manager must run on the switch or on one of the
      nodes in your fabric, prior to starting the benchmarks.

Architectures tested:	i686, x86_64, ia64

===============================================================================
4. Benchmarks Description
===============================================================================

The benchmarks generate a synthetic stream of operations, which is very useful
for hardware and software benchmarking and analysis.
The benchmarks are not designed to emulate any real application traffic.
Real application traffic may be affected by many parameters, and hence
might not be predictable based only on the results of those benchmarks.

ib_send_lat 	latency test with send transactions
ib_send_bw 	bandwidth test with send transactions
ib_write_lat 	latency test with RDMA write transactions
ib_write_bw 	bandwidth test with RDMA write transactions
ib_read_lat 	latency test with RDMA read transactions
ib_read_bw 	bandwidth test with RDMA read transactions
ib_atomic_lat	latency test with atomic transactions
ib_atomic_bw 	bandwidth test with atomic transactions

Raw Ethernet interface benchmarks:
raw_ethernet_send_lat  latency test over raw Ethernet interface
raw_ethernet_send_bw   bandwidth test over raw Ethernet interface

===============================================================================
5. Running Tests
===============================================================================

Prerequisites:
	kernel 2.6
	(kernel module) matches libibverbs
	(kernel module) matches librdmacm
	(kernel module) matches libibumad
	(kernel module) matches libmath (lm)
	(linux kernel module) matches pciutils (lpci).

Server:		./<test name> <options>
Client:		./<test name> <options> <server IP address>

		o  <server address> is IPv4 or IPv6 address. You can use the IPoIB
                   address if IPoIB is configured.
		o  --help lists the available <options>

  *** IMPORTANT NOTE:
      The SAME OPTIONS must be passed to both server and client.

Common Options to all tests:
----------------------------
  -h, --help				Display this help message screen
  -p, --port=<port>			Listen on/connect to port <port> (default: 18515)
  -R, --rdma_cm				Connect QPs with rdma_cm and run test on those QPs
  -z, --comm_rdma_cm			Communicate with rdma_cm module to exchange data - use regular QPs
  -m, --mtu=<mtu>			QP Mtu size (default: active_mtu from ibv_devinfo)
  -c, --connection=<type>		Connection type RC/UC/UD/XRC/DC/SRD (default RC).
  -d, --ib-dev=<dev>			Use IB device <dev> (default: first device found)
  -i, --ib-port=<port>			Use network port <port> of IB device (default: 1)
  -s, --size=<size>			Size of message to exchange (default: 1)
  -a, --all				Run sizes from 2 till 2^23
  -n, --iters=<iters>			Number of exchanges (at least 100, default: 1000)
  -x, --gid-index=<index>		Test uses GID with GID index taken from command
  -V, --version				Display version number
  -e, --events				Sleep on CQ events (default poll)
  -F, --CPU-freq			Do not fail even if cpufreq_ondemand module
  -I, --inline_size=<size>		Max size of message to be sent in inline mode
  -u, --qp-timeout=<timeout>		QP timeout = (4 uSec)*(2^timeout) (default: 14)
  -S, --sl=<sl>				Service Level (default 0)
  -r, --rx-depth=<dep>			Receive queue depth (default 600)

Options for latency tests:
--------------------------

  -C, --report-cycles			Report times in CPU cycle units
  -H, --report-histogram		Print out all results (Default: summary only)
  -U, --report-unsorted			Print out unsorted results (default sorted)

Options for BW tests:
---------------------

  -b, --bidirectional			Measure bidirectional bandwidth (default uni)
  -N, --no peak-bw			Cancel peak-bw calculation (default with peak-bw)
  -Q, --cq-mod				Generate Cqe only after <cq-mod> completion
  -t, --tx-depth=<dep>			Size of tx queue (default: 128)
  -O, --dualport			Run test in dual-port mode (2 QPs). Both ports must be active (default OFF)
  -D, --duration=<sec> 			Run test for <sec> period of seconds
  -f, --margin=<sec> 			When in Duration, measure results within margins (default: 2)
  -l, --post_list=<list size>		Post list of send WQEs of <list size> size (instead of single post)
      --recv_post_list=<list size>	Post list of receive WQEs of <list size> size (instead of single post)
  -q, --qp=<num of qp's>		Num of QPs running in the process (default: 1)
      --run_infinitely			Run test until interrupted by user, print results every 5 seconds

SEND tests (ib_send_lat or ib_send_bw) flags: 
---------------------------------------------

  -r, --rx-depth=<dep>			Size of receive queue (default: 512 in BW test)
  -g, --mcg=<num_of_qps> 		Send messages to multicast group with <num_of_qps> qps attached to it
  -M, --MGID=<multicast_gid>		In multicast, uses <multicast_gid> as the group MGID

WRITE latency (ib_write_lat) flags:
-----------------------------------

  --write_with_imm				Use write-with-immediate verb instead of write

ATOMIC tests (ib_atomic_lat or ib_atomic_bw) flags: 
---------------------------------------------------

  -A, --atomic_type=<type>		type of atomic operation from {CMP_AND_SWAP,FETCH_AND_ADD}
  -o, --outs=<num>			Number of outstanding read/atomic requests - also on READ tests

Options for raw_ethernet_send_bw:
---------------------------------
  -B, --source_mac			source MAC address by this format XX:XX:XX:XX:XX:XX (default take the MAC address form GID)
  -E, --dest_mac			destination MAC address by this format XX:XX:XX:XX:XX:XX **MUST** be entered
  -J, --server_ip			server ip address by this format X.X.X.X (using to send packets with IP header)
  -j, --client_ip			client ip address by this format X.X.X.X (using to send packets with IP header)
  -K, --server_port			server udp port number (using to send packets with UDP header)
  -k, --client_port			client udp port number (using to send packets with UDP header)
  -Z, --server				choose server side for the current machine (--server/--client must be selected)
  -P, --client				choose client side for the current machine (--server/--client must be selected)

----------------------------------------------
Special feature detailed explanation in tests:
----------------------------------------------

  1. Usage of post_list feature (-l, --post_list=<list size> and --recv_post_list=<list size>)
     In this case, each QP will prepare <list size> WQEs (instead of 1), and will chain them to each other.
     In chaining we mean allocating <list_size> array, and setting 'next' pointer of each WQE in the array
     to point to the following element in the array. the last WQE in the array will point to NULL.
     In this case, when posting the first WQE in the list, will instruct the HW to post all of those WQEs.
     Which means each post send/recv will post <list_size> messages.
     This feature is good if we want to know the maximum message rate of QPs in a single process.
     Since we are limited to SW posts (for example, on post_send ~ 10 Mpps, since we have ~ 500 ns between
     each SW post_send), we can see the true HW message rate when setting <list_size> of 64 (for example)
     since it's not depended on SW limitations.

  2. RDMA Connected Mode (CM)
     You can add the "-R" flag to all tests to connect the QPs from each side with the rdma_cm library.
     In this case, the library will connect the QPs and will use the IPoIB interface for doing it.
     It helps when you don't have Ethernet connection between the 2 nodes.
     You must supply the IPoIB interface as the server IP.

  3. Multicast support in ib_send_lat and in ib_send_bw
     Send tests have built in feature of testing multicast performance, in verbs level.
     You can use "-g" to specify the number of QPs to attach to this multicast group.
     "-M" flag allows you to choose the multicast group address.

  4. GPUDirect usage:
     To utilize GPUDirect feature, perftest should be compiled as:
     ./autogen.sh && ./configure CUDA_H_PATH=<path to cuda.h> && make -j, e.g.:
     ./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j

     Thus --use_cuda=<gpu_index> flag will be available to add to a command line:
     ./ib_write_bw -d ib_dev --use_cuda=<gpu index> -a

    CUDA DMA-BUF requierments:
        1) CUDA Toolkit 11.7 or later.
        2) NVIDIA Open-Source GPU Kernel Modules version 515 or later.
           installation instructions: http://us.download.nvidia.com/XFree86/Linux-x86_64/515.43.04/README/kernel_open.html
        3) Configuration / Usage:
          export the following environment variables:
            1- export LD_LIBRARY_PATH.
              e.g: export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
            2- export LIBRARY_PATH.
              e.g: export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
          perform compilation as decribe in the begining of section 4 (GPUDirect usage).

  5. AES_XTS (encryption/decryption)
     In perftest repository there are two files as follow:
      1) gen_data_enc_key.c

      2) encrypt_credentials.c

      gen_data_enc_key.c file should be compiled with the following command:
      #gcc gen_data_enc_key.c -o gen_data_enc_key -lcrypto

      encrypt_credentials.c file should be compiled with the following command:
      #gcc encrypt_credentials.c -o encrypt_credentials -lcrypto

      You must provide the plaintext credentials and the kek in seperate files in hex format.

      for example:

      credential_file:
        0x00
        0x00
        0x00
        0x00
        0x10
        etc..

      kek_file:
        0x00
        0x00
        0x11
        0x22
        0x55
        etc..

      Notes:
        1) You should run the encrypt_credentials program and give paths as parameters
        to the plaintext credential_file, kek_file and the path you want the encrypted
        credentials to be in (credentials_file first).
        for example:
          #./encrypt_credentials <PATH>/credential_file <PATH>/kek_file
          <PATH>/encrypted_credentials_file_name

        The output of this is a text file that you must provide its path
        as a parameter to the perftest application with --credentials_path <PATH>

        2)Both encrypt_credentials.c and gen_data_enc_key.c should be compiled
          before using the perftest application.

        3)gen_data_enc_key.c compiled program path must be provided to the perftest
          application with --data_enc_key_app_path <PATH> and the kek file should be
          provided with --kek_path <PATH>

        4) This feature supported only on RC qp type, and on ib_write_bw, ib_read_bw,
           ib_send_bw, ib_read_lat, ib_send_lat.

        5) You should load the kek and credentials you want to the device in the following way:
          #sudo mlxreg -d <pci address> --reg_name CRYPTO_OPERATIONAL --set "credential[0]
          =0x00000000,credential[1]=0x10000000,credential[2]=0x10000000,
          credential[3]=0x10000000,credential[4]=0x10000000,credential[5]=0x10000000
          ,credential[6]=0x10000000,credential[7]=0x10000000,credential[8]=0x10000000
          ,credential[9]=0x10000000,kek[0]=0x00001122,kek[1]=0x55556633,kek[2]=0x33447777,kek[3]=0x22337777"

  6. Payload modification
        Using the --payload_file_path you can pass a text file, which contains a pattern,
        as a parameter to perftest, and use the pattern as the payload of the RDMA verb.

        You must provide the pattern in DWORD's seperated by comma and in hex format.

        for example:
        0xddccbbaa,0xff56f00d,0xffffffff,0x21ab025b, etc...

        Notes:
          1) Perftest parse the pattern and save it in LE format.
          2) The feature available for ib_write_bw, ib_read_bw, ib_send_bw, ib_read_lat and ib_send_lat.
          3) 0 size pattern is not allow.


===============================================================================
7. Known Issues
===============================================================================

 1. Multicast support in ib_send_lat and in ib_send_bw is not stable.
    The benchmark program may hang or exhibit other unexpected behavior.

 2. Bidirectional support in ib_send_bw test, when running in UD or UC mode.
    In rare cases, the benchmark program may hang.
    perftest-2.3 release includes a feature for hang detection, which will exit test after 2 mins in those situations.

 3. Different versions of perftest may not be compatible with each other.
    Please use the same perftest version on both sides to ensure consistency of benchmark results.

 4. Test version 5.3 and above won't work with previous versions of perftest. As well as 5.70 and above.

 5. This perftest package won't compile on MLNX_OFED-2.1 due to API changes in MLNX_OFED-2.2
    In order to compile it properly, please do:
    ./configure --disable-verbs_exp
    make

 6. In the x390x platform virtualized environment the results shown by package test applications can be incorrect.

 7. perftest-2.3 release includes support for dualport VPI test - port1-Ethernet , port2-IB. (in addition to Eth:Eth, IB:IB)
    Currently, running dualport when port1-IB , port2-Ethernet still not working.

 8. If GPUDirect is not working, (e.g. you see "Couldn't allocate MR" error message), consider disabling Scatter to CQE feature. Set the environmental variable MLX5_SCATTER_TO_CQE=0. E.g.:
    MLX5_SCATTER_TO_CQE=0 ./ib_write_bw -d ib_dev --use_cuda=<gpu index> -a

perftest's People

Contributors

Stargazers

Watchers

Forkers

rleon larrystevenwise zoharbenaharon alexschm nmorey shamoya deadbeefh bdrung btamiller sbates130272 mellanox yanivbl6 sharadmehrotra2 skcoulter aron-silverton anandbibhuti yongjianxu baiwei0427 simonraviv tatyana-en cyrilleverrier alex--m zwaston initat sujinzhao vcgege dsharma283 lmingcsce zorrohahaha 1997cui tcm0116 wangshuaizs dimasique1 newamber yilongli shangguanshiyuan marlooonvon qzan9 weiny2 yossike syedsk vishaldeyiiest zhyh329 augustjo drizzleyang wxlg1117 hasan3050 wenqingwu junglefive firasj gal-pressman pengn117 pakmarkthub ovn-open-virtual-networks juhlee-microsoft shines001 spotluri lravich tu-cao johnny5188 manjugv z00413562 artpol84 jungan rhiswell takahashi0801 souravzzz sindhu-devale qzwlinux saadrahim yucl80 andremelk honggang-li mkalderon yuvalbason dyingfair planeta lastweek wjtian smajumder-nvidia amaro grom72 dmonakhov hwpplayers matanhamilis shsym plsmaop ruihong000 ossic dariusgrassi caseywang2 lele121314 zettastorage hhlrank ruihong123 yanghaomai lyndonmario zjyst2019 dkkranz twywleo

perftest's Issues

ib_send_lat doesn't work on AWS EFA card

I run ib_send_lat on AWS EFA card, but server desn't recv any packet

Command:

client: sudo ./ib_send_lat -d efa_0 -s a 172.31.6.59 -c SRD
server: sudo ./ib_send_lat -d efa_0 -s a -c SRD

Statistics of EFA hw_counters:

client

alloc_pd_err :0
alloc_ucontext_err :0
cmds_err :0
completed_cmds :105
create_ah_err :0
create_cq_err :0
create_qp_err :0
keep_alive_rcvd :87
lifespan :12
mmap_err :0
no_completion_cmds :0
rdma_read_bytes :0
rdma_read_resp_bytes :0
rdma_read_wr_err :0
rdma_read_wrs :0
recv_bytes :0
recv_wrs :0
reg_mr_err :0
rx_bytes :0
rx_drops :0
rx_pkts :0
send_bytes :2
send_wrs :1
submitted_cmds :108
tx_bytes :2
tx_pkts :1

server

alloc_pd_err :0
alloc_ucontext_err :0
cmds_err :0
completed_cmds :39
create_ah_err :0
create_cq_err :0
create_qp_err :0
keep_alive_rcvd :91
lifespan :12
mmap_err :0
no_completion_cmds :0
rdma_read_bytes :0
rdma_read_resp_bytes :0
rdma_read_wr_err :0
rdma_read_wrs :0
recv_bytes :0
recv_wrs :0
reg_mr_err :0
rx_bytes :0
rx_drops :0
rx_pkts :0
send_bytes :0
send_wrs :0
submitted_cmds :42
tx_bytes :0
tx_pkts :0

Allocate correct size buffer

In ctx_alloc_credit function in perftest_resources.c file, ctrl_buf is allocated a size of user_param->num_of_qps however it is being memset to a size of buf_size:

ALLOCATE(ctx->ctrl_buf,uint32_t,user_param->num_of_qps);
memset(&ctx->ctrl_buf[0],0,buf_size);

Fix this by allocating the right size ctrl_buf:

ALLOCATE(ctx->ctrl_buf,uint32_t,buf_size);

Default write inline value (220) could be too large

In perftest, default WRITE_INLINE is 220:

perftest/src/perftest_parameters.h

Line 137 in ed18d32

#define DEF_INLINE_WRITE (220)

However, 220 max_inline_data could be too large for some devices (e.g., Intel E810 only support max_inline_data less than 102).
Pertest has option (-I) to change the value, but users may not be aware of the option, and output is a bit unfriendly.

Output of ./ib_write_lat -s 64 -c RC -d irdma0 -n 1000 -i 1:


************************************
* Waiting for client to connect... *
************************************
Unable to create QP.
Failed to create QP.
 Couldn't create IB resources

"Couldn't allocate MR" errors on latest WITH_CUDA perftest

When run with use_cuda, perftests fail with error "Couldn't allocate MR":

root@u1:/dev/shm/pt/bin# /dev/shm/pt/bin/ib_write_bw -d mlx5_3 -a --use_cuda

************************************
* Waiting for client to connect... *
************************************
initializing CUDA
There are 2 devices supporting CUDA, picking first...
[pid = 4869, dev = 0] device name = [Tesla K80]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 16777216 bytes GPU buffer
allocated GPU buffer address at 0000002305040000 pointer=0x2305040000
Couldn't allocate MR
failed to create mr
Failed to create MR
 Couldn't create IB resources

Without use_cuda, the test works well:

root@u1:/dev/shm/pt/bin# /dev/shm/pt/bin/ib_write_bw -d mlx5_3 -a 

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_3
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x0e QPN 0x0086 PSN 0xb72682 RKey 0x0056fe VAddr 0x007fb28132d000
 remote address: LID 0x10 QPN 0x0086 PSN 0x316d10 RKey 0x003aae VAddr 0x007f895f3a0000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 8388608    5000             11682.08            11672.99		   0.001459
---------------------------------------------------------------------------------------

===================================================
The code was built by
CUDA_H_PATH=/usr/local/cuda/include/cuda.h ./configure --prefix=/dev/shm/pt/; make -j 20 install

root@u1:/dev/shm/perftest# git show
commit 198473181e0365f97c5840b8fd406ff52af6335b
Author: Zohar Ben Aharon <[email protected]>
Date:   Wed Mar 14 10:33:59 2018 +0200

Software stack is as following:

root@u2:/dev/shm/pt/bin# ofed_info -s
MLNX_OFED_LINUX-4.3-1.0.1.0:
root@u2:/dev/shm/pt/bin# uname -a
Linux u2 4.4.0-119-generic #143-Ubuntu SMP Mon Apr 2 16:08:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
root@u2:/dev/shm/pt/bin# /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85


CA 'mlx5_3'
	CA type: MT4119
	Number of ports: 1
	Firmware version: 16.22.1002
	Hardware version: 0
	Node GUID: 0x248a070300a80a00
	System image GUID: 0x248a070300a80a00
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 16
		LMC: 0
		SM lid: 1
		Capability mask: 0x2651e848
		Port GUID: 0x248a070300a80a00
		Link layer: InfiniBand

Any ideas?
Thanks

Unable to create QP.

Command that I have tried:
Server:

./ib_send_lat  -x 0 -c UD -F -I 0 -a

Client:

./ib_send_lat  -x 0 -c UD -F -I 0 -a

Server output:

************************************
* Waiting for client to connect... *
************************************
Unable to create QP.
Failed to create QP.
Couldn't create IB resources

Client output:

Unable to create QP.
Failed to create QP.
Couldn't create IB resources.

I ran “lspci” on the instances where I ran the test and it seems that I have the efa_device installed:

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
00:04.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061
00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:06.0 Ethernet controller: Amazon.com, Inc. Elastic Fabric Adapter (EFA)

Then I hacked into the code, and found that ibv_create_qp() (https://github.com/linux-rdma/perftest/blob/master/src/perftest_resources.c#L1848) is not called.

This is because the HAVE_IBV_WR_API flag was not set(https://github.com/linux-rdma/perftest/blob/master/src/perftest_resources.c#L1818)

ib_read_bw over VF generate SegV fault

on server side

ibv_devices
device node GUID
------ ----------------
mlx5_2 0000000000000000
mlx5_0 506b4b0300285028
mlx5_3 0000000000000000
mlx5_1 506b4b0300285034

address 1.1.1.2/24 defined for mlx5_2
run: ib_read_bw --report_gbits -s 1048576 -D5 -F -R

on client side

ibv_devices
device node GUID
------ ----------------
mlx5_0 248a0703004bf0a8
mlx5_1 0000000000000000

address 1.1.1.1/24 defined for mlx5_0
run: ib_read_bw --report_gbits -s 1048576 -D5 -F -R 1.1.1.2

result

server (running ib_read_bw on VF) failed with segV fault when the client tries to connect.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f7c7f0dfe64 in ibv_alloc_pd () from /lib64/libibverbs.so.1
(gdb) bt
#0  0x00007f7c7f0dfe64 in ibv_alloc_pd () from /lib64/libibverbs.so.1
#1  0x0000000000417267 in ctx_init (ctx=0x1c75e10, user_param=0x1c75ad0)
    at src/perftest_resources.c:1795
#2  0x0000000000406ca4 in rdma_server_connect (ctx=0x1c75e10, user_param=0x1c75ad0)
    at src/perftest_communication.c:1110
#3  0x00000000004074ea in establish_connection (comm=0x7ffc9c28f320)
    at src/perftest_communication.c:1243
#4  0x0000000000403230 in main (argc=7, argv=0x7ffc9c28f8d8) at src/read_bw.c:115

Further debugging:

1102		ctx->context = ctx->cm_id->verbs;
(gdb) p ctx->context
$15 = (struct ibv_context *) 0x0

connect retry fails

The perftest_communication is written so that the timing between server performing rdma_listen
and client calling rdma_connect is sometimes out of sync.
The connect flow implements a retry attempt if the rdma_connect fails, but the retry fails
on a return code before and the retry attempt eventually does not happen
repro steps:
This was exposed on iWARP, where the rdma_listen takes longer.
Can be easily reproducible by delaying the rdma_listen, or simply running the client before server, no retry occurs. (This also happens on RoCE)

Function proc_get_cpu_mhz() is valid for only a few architectures

It happened that we run your tests on RISC-V arch and found out that some metrics on the final output are zero. Delving into the code showed that proc_get_cpu_mhz() always returns 0, just because this function reads CPU MHz information from /proc/cpuinfo but only a few arches print out this information into this file. To be honest it is difficult to add sth into /proc/cpuinfo.
See requests to solve this problem:

Why not to get this information from /sys/devices/system/cpu/cpufreq/ ?
It is a portable and clear way to get it.

Negative value of BW average and t_avg observevd

Version used
lehi-smoke:~ # ib_send_lat --version
Version: 5.60
lehi-smoke:~ # ib_send_bw --version
Version: 5.60
lehi-smoke:~ #

Test Method
4 hosts with MLX HCA inter-connected with ethernet switches.
Each host launches 12 server processes (3 ib_send_bw, 3 ib_write_bw, 3 ib_send_lat, 3 ib_write_lat).
Each host launches 12 client processes (3 ib_send_bw, 3 ib_write_bw, 3 ib_send_lat, 3 ib_write_lat).
It was a full-cross test, each host act as both server and client for other 3 hosts.
And no rate limitation was used.

Problem
Negative value observed in the log of some bw or lat process.
Following is some of the logs.

Fri, 20 Mar 2020 22:35:36 +0800

taskset -c 17 ib_send_bw -d mlx5_bond_0 -z -f 2 -F --report_gbits -D 205200 -p 20003 -x 3 192.168.219.6

                Send BW Test

Dual-port : OFF Device : mlx5_bond_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : rdma_cm

local address: LID 0000 QPN 0x1295 PSN 0xe41d73
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:05
remote address: LID 0000 QPN 0x126e PSN 0x9f23a6
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:06

#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 18446744072490473291 0.000000 -3.122397 -0.005955

Mon, 23 Mar 2020 07:35:37 +0800

Fri, 20 Mar 2020 22:34:20 +0800

taskset -c 1 ib_send_bw -d mlx5_bond_0 -z -f 2 -F --report_gbits -D 205200 -p 20001 -x 3

Waiting for client to connect... *

                Send BW Test

Dual-port : OFF Device : mlx5_bond_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
RX depth : 512
CQ Moderation : 1
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : rdma_cm

local address: LID 0000 QPN 0x12a1 PSN 0x2955ce
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:05
remote address: LID 0000 QPN 0x127e PSN 0x2afc3c
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:06

#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 1906054058 0.00 4.88 0.009312

Mon, 23 Mar 2020 07:35:47 +0800

Fri, 20 Mar 2020 22:34:20 +0800

taskset -c 7 ib_send_lat -d mlx5_bond_0 -z -f 2 -F --report_gbits -D 205200 -p 40001 -x 3

Waiting for client to connect... *

                Send Latency Test

Dual-port : OFF Device : mlx5_bond_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
RX depth : 512
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 236[B]
rdma_cm QPs : OFF
Data ex. method : rdma_cm

local address: LID 0000 QPN 0x12a4 PSN 0xb2115c
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:05
remote address: LID 0000 QPN 0x1280 PSN 0x6966f2
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:06

#bytes #iterations t_avg[usec] tps average
2 -267885825 -382.06 -1308.69

Mon, 23 Mar 2020 07:35:48 +0800

Fri, 20 Mar 2020 22:34:20 +0800

taskset -c 8 ib_send_lat -d mlx5_bond_0 -z -f 2 -F --report_gbits -D 205200 -p 40002 -x 3

Waiting for client to connect... *

                Send Latency Test

Dual-port : OFF Device : mlx5_bond_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
RX depth : 512
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 236[B]
rdma_cm QPs : OFF
Data ex. method : rdma_cm

local address: LID 0000 QPN 0x12ac PSN 0xc7fee4
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:05
remote address: LID 0000 QPN 0x128b PSN 0x57659e
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:07

#bytes #iterations t_avg[usec] tps average
2 1158940729 88.31 5661.71

Mon, 23 Mar 2020 07:35:57 +0800

ib_send_bw test stuck

When running the following, for example:
ib_send_bw -b -e -q 16

I think that the root cause is wrong usage of notifications - the cq isn't emptied when the application is woken up. The argument passed to ibv_poll_cq is user_param->rx_depth, but we might have more completions than that, since we use multiple QPs.

There are two options I can think of for a fix:

Use a larger completion buffer when calling ibv_poll_cq (user_param->rx_depth * number_of_qps)
As done in run_iter_be_server - loop over ibv_poll_cq as long as there are completions in the cq

Completion with error with status 4, IBV_WC_LOC_PROT_ERR after upgrading OFED version

hello, I failed gpu direct rdma test over infiniband after upgrading OFED version.

I tested several times, and

got the following error messages
or hangs

got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 01005104 08001a78 000013d2
 Completion with error at client
 Failed status 4: wr_id 0 syndrom 0x51
scnt=128, ccnt=0
 Failed to complete run_iter_bw function successfully

status 4 indicates IBV_WC_LOC_PROT_ERR, and it looks like
"This event is generated when a user attempts to access an address outside of the registered memory region."
from https://www.ibm.com/docs/en/sdk-java-technology/8?topic=jverbs-workcompletionstatus

without GPU direct, by deleting the --use_cuda=0 argument, it worked well.

I would really appreciate any help.

The full log is given as

$./runme remote_host numactl -N 3 ib_write_bw -a -b --use_cuda=0 -d mlx5_1
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 07:00
CUDA device 1: PCIe address is 0B:00
CUDA device 2: PCIe address is 48:00
CUDA device 3: PCIe address is 4C:00
CUDA device 4: PCIe address is 88:00
CUDA device 5: PCIe address is 8B:00
CUDA device 6: PCIe address is C8:00
CUDA device 7: PCIe address is CB:00

Picking device No. 0
[pid = 574, dev = 0] device name = [A100 Graphics Device]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 16777216 bytes GPU buffer
allocated GPU buffer address at 00007f7b4e000000 pointer=0x7f7b4e000000
---------------------------------------------------------------------------------------
                    RDMA_Write Bidirectional BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x07 QPN 0x1a78 PSN 0xbe6767 RKey 0x00cf39 VAddr 0x007f7b4e800000
 remote address: LID 0x3f QPN 0x1544 PSN 0xebc222 RKey 0x00ba05 VAddr 0x007f670a800000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
mlx5: test-sleep-worker-0: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 01005104 08001544 000020d2
 Completion with error at client
 Failed status 4: wr_id 0 syndrom 0x51
scnt=128, ccnt=0
 Failed to complete run_iter_bw function successfully

************************************
* Waiting for client to connect... *
************************************
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 07:00
CUDA device 1: PCIe address is 0B:00
CUDA device 2: PCIe address is 48:00
CUDA device 3: PCIe address is 4C:00
CUDA device 4: PCIe address is 88:00
CUDA device 5: PCIe address is 8B:00
CUDA device 6: PCIe address is C8:00
CUDA device 7: PCIe address is CB:00

Picking device No. 0
[pid = 872, dev = 0] device name = [A100 Graphics Device]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 16777216 bytes GPU buffer
allocated GPU buffer address at 00007f670a000000 pointer=0x7f670a000000
---------------------------------------------------------------------------------------
                    RDMA_Write Bidirectional BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x3f QPN 0x1544 PSN 0xebc222 RKey 0x00ba05 VAddr 0x007f670a800000
 remote address: LID 0x07 QPN 0x1a78 PSN 0xbe6767 RKey 0x00cf39 VAddr 0x007f7b4e800000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]






mlx5: test-sleep-master-0: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 01005104 08001a78 000013d2
 Completion with error at client
 Failed status 4: wr_id 0 syndrom 0x51
scnt=128, ccnt=0
 Failed to complete run_iter_bw function successfully

And the below is ofed information

air@cl-rndcgpu-a12:~$ ofed_info
MLNX_OFED_LINUX-5.3-1.0.0.1 (OFED-5.3-1.0.0):

ar_mgr:
osm_plugins/ar_mgr/ar_mgr-1.0-5.8.2.MLNX20210321.g58d33bf.tar.gz

clusterkit:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.2-1.0.4.0/SRPMS/clusterkit-1.0.36-1.52104.src.rpm

dapl:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.2-1.0.4.0/SRPMS/dapl-2.1.10.1.mlnx-OFED.4.9.0.1.4.52104.src.rpm

dpcp:
/sw/release/sw_acceleration/dpcp/dpcp-1.1.2-1.src.rpm

dump_pr:
osm_plugins/dump_pr//dump_pr-1.0-5.8.2.MLNX20210321.g58d33bf.tar.gz

fabric-collector:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.2-1.0.4.0/SRPMS/fabric-collector-1.1.0.MLNX20170103.89bb2aa-0.1.52104.src.rpm

hcoll:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.2-1.0.4.0/SRPMS/hcoll-4.7.3189-1.52104.src.rpm

ibdump:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.2-1.0.4.0/SRPMS/ibdump-6.0.0-1.52104.src.rpm

ibsim:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.2-1.0.4.0/SRPMS/ibsim-0.10-1.52104.src.rpm

ibutils2:
ibutils2/ibutils2-2.1.1-0.132.MLNX20210321.g84ba964.tar.gz

iser:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_3
commit 19d48787e91c989a7a0a856d04d0c80bca648423
isert:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_3
commit 19d48787e91c989a7a0a856d04d0c80bca648423
kernel-mft:
mlnx_ofed_mft/kernel-mft-4.16.3-12.src.rpm

knem:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.2-1.0.4.0/SRPMS/knem-1.1.4.90mlnx1-OFED.5.1.2.5.0.1.src.rpm

libvma:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.2-1.0.4.0/SRPMS/libvma-9.2.2-0.src.rpm

mlnx-dpdk:
https://github.com/Mellanox/dpdk.org mlnx_dpdk_20.11_last_stable
commit f05f7352848c8906a9d8dfd607687d52b5f7fcaf
mlnx-en:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_3
commit 19d48787e91c989a7a0a856d04d0c80bca648423

mlnx-ethtool:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.2-1.0.4.0/SRPMS/mlnx-ethtool-5.8-1.52104.src.rpm

mlnx-iproute2:
mlnx_ofed/iproute2.git mlnx_ofed_5_3
commit 130d9fd2320a4ece4a6fc330624be2bc341f569a
mlnx-nfsrdma:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_3
commit 19d48787e91c989a7a0a856d04d0c80bca648423
mlnx-nvme:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_3
commit 19d48787e91c989a7a0a856d04d0c80bca648423
mlnx-ofa_kernel:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_3
commit 19d48787e91c989a7a0a856d04d0c80bca648423

mpi-selector:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.2-1.0.4.0/SRPMS/mpi-selector-1.0.3-1.52104.src.rpm

mpitests:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.2-1.0.4.0/SRPMS/mpitests-3.2.20-5d20b49.52104.src.rpm

mstflint:
mlnx_ofed_mstflint/mstflint-4.16.0-1.tar.gz

multiperf:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.2-1.0.4.0/SRPMS/multiperf-3.0-0.14.g5f0fd0e.52104.src.rpm

ofed-docs:
docs.git mlnx_ofed-4.0
commit 3d1b0afb7bc190ae5f362223043f76b2b45971cc

openmpi:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.2-1.0.4.0/SRPMS/openmpi-4.1.0rc5-1.52104.src.rpm

opensm:
mlnx_ofed_opensm/opensm-5.8.2.MLNX20210321.2958ab8.tar.gz

openvswitch:
openvswitch.git mlnx_ofed_5_3
commit b5c772fad3a4b3345805fb4dc19b9e181931b23d
perftest:
mlnx_ofed_perftest/perftest-4.5-0.1.g23b8f9c.tar.gz

rdma-core:
mlnx_ofed/rdma-core.git mlnx_ofed_5_3
commit 696ecf5f1be2fbc7a3b72787c6a4d455d8f47904
rshim:
mlnx_ofed_soc/rshim-2.0.5-10.g0ae03b4.src.rpm

sharp:
mlnx_ofed_sharp/sharp-2.4.5.MLNX20210302.7c3c223.tar.gz

sockperf:
/sw/release/mlnx_ofed/IBHPC/MLNX_OFED_LINUX-5.2-1.0.4.0/SRPMS/sockperf-3.7-0.gita1e8e835a689.52104.src.rpm

srp:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_5_3
commit 19d48787e91c989a7a0a856d04d0c80bca648423
ucx:
mlnx_ofed_ucx/ucx-1.10.0-1.src.rpm

xpmem:
xpmem.git mellanox-master
commit 740afed1b7647b241a5a3a57b13ce11ec882d9d5

Installed Packages:
-------------------
ii  ar-mgr                               1.0-5.8.2.MLNX20210321.g58d33bf.53100   amd64        Adaptive Routing Manager
ii  dapl2-utils                          2.1.10.1.mlnx-OFED.4.9.0.1.4.53100      amd64        Utilities for use with the DAPL libraries
ii  dpcp                                 1.1.2-1.53100                           amd64        Direct Packet Control Plane (DPCP) is a library to use Devx
ii  dump-pr                              1.0-5.8.2.MLNX20210321.g58d33bf.53100   amd64        Dump PathRecord Plugin
ii  hcoll                                4.7.3189-1.53100                        amd64        Hierarchical collectives (HCOLL)
ii  ibacm                                52mlnx1-1.53100                         amd64        InfiniBand Communication Manager Assistant (ACM)
ii  ibdump                               6.0.0-1.53100                           amd64        Mellanox packets sniffer tool
ii  ibsim                                0.10-1.53100                            amd64        InfiniBand fabric simulator for management
ii  ibsim-doc                            0.10-1.53100                            all          documentation for ibsim
ii  ibutils2                             2.1.1-0.132.MLNX20210321.g84ba964.53100 amd64        OpenIB Mellanox InfiniBand Diagnostic Tools
ii  ibverbs-providers:amd64              52mlnx1-1.53100                         amd64        User space provider drivers for libibverbs
ii  ibverbs-utils                        52mlnx1-1.53100                         amd64        Examples for the libibverbs library
ii  infiniband-diags                     52mlnx1-1.53100                         amd64        InfiniBand diagnostic programs
ii  iser-dkms                            5.3-OFED.5.3.0.3.8.1                    all          DKMS support fo iser kernel modules
ii  isert-dkms                           5.3-OFED.5.3.0.3.8.1                    all          DKMS support fo isert kernel modules
ii  kernel-mft-dkms                      4.16.3-12                               all          DKMS support for kernel-mft kernel modules
ii  knem                                 1.1.4.90mlnx1-OFED.5.1.2.5.0.1          amd64        userspace tools for the KNEM kernel module
ii  knem-dkms                            1.1.4.90mlnx1-OFED.5.1.2.5.0.1          all          DKMS support for mlnx-ofed kernel modules
ii  libdapl-dev                          2.1.10.1.mlnx-OFED.4.9.0.1.4.53100      amd64        Development files for the DAPL libraries
ii  libdapl2                             2.1.10.1.mlnx-OFED.4.9.0.1.4.53100      amd64        The Direct Access Programming Library (DAPL)
ii  libibmad-dev:amd64                   52mlnx1-1.53100                         amd64        Development files for libibmad
ii  libibmad5:amd64                      52mlnx1-1.53100                         amd64        Infiniband Management Datagram (MAD) library
ii  libibnetdisc5:amd64                  52mlnx1-1.53100                         amd64        InfiniBand diagnostics library
ii  libibumad-dev:amd64                  52mlnx1-1.53100                         amd64        Development files for libibumad
ii  libibumad3:amd64                     52mlnx1-1.53100                         amd64        InfiniBand Userspace Management Datagram (uMAD) library
ii  libibverbs-dev:amd64                 52mlnx1-1.53100                         amd64        Development files for the libibverbs library
ii  libibverbs1:amd64                    52mlnx1-1.53100                         amd64        Library for direct userspace use of RDMA (InfiniBand/iWARP)
ii  libibverbs1-dbg:amd64                52mlnx1-1.53100                         amd64        Debug symbols for the libibverbs library
ii  libopensm                            5.8.2.MLNX20210321.2958ab8-0.1.53100    amd64        Infiniband subnet manager libraries
ii  libopensm-devel                      5.8.2.MLNX20210321.2958ab8-0.1.53100    amd64        Developement files for OpenSM
ii  librdmacm-dev:amd64                  52mlnx1-1.53100                         amd64        Development files for the librdmacm library
ii  librdmacm1:amd64                     52mlnx1-1.53100                         amd64        Library for managing RDMA connections
ii  mlnx-ethtool                         5.8-1.53100                             amd64        This utility allows querying and changing settings such as speed,
ii  mlnx-iproute2                        5.10.0-1.53100                          amd64        This utility allows querying and changing settings such as speed,
ii  mlnx-ofed-kernel-dkms                5.3-OFED.5.3.1.0.0.1                    all          DKMS support for mlnx-ofed kernel modules
ii  mlnx-ofed-kernel-utils               5.3-OFED.5.3.1.0.0.1                    amd64        Userspace tools to restart and tune mlnx-ofed kernel modules
ii  mpitests                             3.2.20-5d20b49.53100                    amd64        Set of popular MPI benchmarks and tools IMB 2018 OSU benchmarks ver 4.0.1 mpiP-3.3 IPM-2.0.6
ii  mstflint                             4.16.0-1.53100                          amd64        Mellanox firmware burning application
ii  openmpi                              4.1.0rc5-1.53100                        all          Open MPI
ii  opensm                               5.8.2.MLNX20210321.2958ab8-0.1.53100    amd64        An Infiniband subnet manager
ii  opensm-doc                           5.8.2.MLNX20210321.2958ab8-0.1.53100    amd64        Documentation for opensm
ii  perftest                             4.5-0.1.g23b8f9c.53100                  amd64        Infiniband verbs performance tests
ii  rdma-core                            52mlnx1-1.53100                         amd64        RDMA core userspace infrastructure and documentation
ii  rdmacm-utils                         52mlnx1-1.53100                         amd64        Examples for the librdmacm library
ii  rshim                                2.0.5-10.g0ae03b4.53100                 amd64        driver for Mellanox BlueField SoC
ii  sharp                                2.4.5.MLNX20210302.7c3c223-1.53100      amd64        SHArP switch collectives
ii  srp-dkms                             5.3-OFED.5.3.0.3.8.1                    all          DKMS support fo srp kernel modules
ii  srptools                             52mlnx1-1.53100                         amd64        Tools for Infiniband attached storage (SRP)
ii  ucx                                  1.10.0-1.53100                          amd64        Unified Communication X

Perftest tools use shared CQ for mutiple QPs

Hi,
I see that with multiple QP support to perftest (-q), single uber CQ is created and shared by all the QPs in an instance (ib_read_lat/ib_write_bw etc.). If more QPs are used per instance say -q 1000, CQ becomes too big and run out of max device supported cqe per CQ and may fail creating CQ or fail polling the CQ with limited size (device supported max size) for more number QPs.

Any reason behind using single large CQ for multiple QPs? Please let me know if creating CQ per QP makes sense.

Thanks

Use of Device Memory feature in perftests

Hello!

I was wondering if the Device Memory API (ibv_alloc_dm, ibv_memcpy_to_dm, ibv_reg_dm_mr, etc.) could be added to the IB perftests as a runtime option? I was trying to implement it myself, but there's a snag about local protection errors that I'm currently debugging (I have an idea, but I just need the time to get to it).

Thank you!

"Failed to disconnect RDMA CM connection" seen on client while running ib_write_bw

Hi all,

With latest perftest while running ib_write_bw the below errors are seen on client side
Failed to disconnect RDMA CM connection.
ERRNO: Connection reset by peer.
Failed to disconnect RDMA CM nodes.

No errors observed in dmesg

Current behavior:

Server
#ib_write_bw -n 10 -R -s 2048 -D 10 -p 11000

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : cxgb4_0
Number of qps : 1 Transport type : IW
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : OFF
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 0
Max inline data : 0[B]
rdma_cm QPs : ON
Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
Waiting for client rdma_cm QP to connect
Please run the same command with the IB/RoCE interface IP
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x040a PSN 0x95f56e
GID: 00:07:67:60:01:112:00:00:00:00:00:00:00:00:00:00
remote address: LID 0000 QPN 0x040a PSN 0xc476a6
GID: 00:07:67:62:204:144:00:00:00:00:00:00:00:00:00:00
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
2048 6143300 0.00 1999.78 1.023887
---------------------------------------------------------------------------------------

Client
------
#ib_write_bw -n 10 -R -s 2048 -D 10 -p 11000 102.1.1.245
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : cxgb4_0
Number of qps : 1 Transport type : IW
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 1024[B]
Link type : Ethernet
GID index : 0
Max inline data : 0[B]
rdma_cm QPs : ON
Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x040a PSN 0xc476a6
GID: 00:07:67:62:204:144:00:00:00:00:00:00:00:00:00:00
remote address: LID 0000 QPN 0x040a PSN 0x95f56e
GID: 00:07:67:60:01:112:00:00:00:00:00:00:00:00:00:00
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Conflicting CPU frequency values detected: 2636.683000 != 2179.229000. CPU Frequency is not max.
2048 6143300 0.00 1999.78 1.023887
---------------------------------------------------------------------------------------
Failed to disconnect RDMA CM connection.
ERRNO: Connection reset by peer.
Failed to disconnect RDMA CM nodes.

Expected behavior
Error messages on the client should not be seen

Observation:

The issue is seen intermittently
Issue did not hit with perftest-4.2-0.8

Context:
OS: RHEL 7.9 (3.10.0-1160.el7.x86_64)
perftest version : tot(7504ce4)

Thanks
Smit Kothari

configure: error: pciutils header files not found, consider installing pciutils-devel

I got this configure: error: pciutils header files not found, consider installing pciutils-devel when compiling the perftest.

~/perftest# ./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j

libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, 'config'.
libtoolize: copying file 'config/ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIRS, 'm4'.
libtoolize: copying file 'm4/libtool.m4'
libtoolize: copying file 'm4/ltoptions.m4'
libtoolize: copying file 'm4/ltsugar.m4'
libtoolize: copying file 'm4/ltversion.m4'
libtoolize: copying file 'm4/lt~obsolete.m4'
libtoolize: 'AC_PROG_RANLIB' is rendered obsolete by 'LT_INIT'
configure.ac:55: installing 'config/compile'
configure.ac:36: installing 'config/missing'
Makefile.am: installing 'config/depcomp'
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... no
checking for mawk... mawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether gcc understands -c and -o together... yes
checking for style of include used by make... GNU
checking dependency style of gcc... gcc3
checking for g++... g++
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking dependency style of g++... gcc3
checking dependency style of gcc... gcc3
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking how to print strings... printf
checking for a sed that does not truncate output... /bin/sed
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for fgrep... /bin/grep -F
checking for ld used by gcc... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 1572864
checking how to convert x86_64-unknown-linux-gnu file names to x86_64-unknown-linux-gnu format... func_convert_file_noop
checking how to convert x86_64-unknown-linux-gnu file names to toolchain format... func_convert_file_noop
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for dlltool... no
checking how to associate runtime and link libraries... printf %s\n
checking for ar... ar
checking for archiver @file support... @
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from gcc object... ok
checking for sysroot... no
checking for a working dd... /bin/dd
checking how to truncate binary pipes... /bin/dd bs=4096 count=1
checking for mt... mt
checking if mt is a manifest tool... no
checking how to run the C preprocessor... gcc -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc supports -fno-rtti -fno-exceptions... no
checking for gcc option to produce PIC... -fPIC -DPIC
checking if gcc PIC flag -fPIC -DPIC works... yes
checking if gcc static flag -static works... yes
checking if gcc supports -c -o file.o... yes
checking if gcc supports -c -o file.o... (cached) yes
checking whether the gcc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking how to run the C++ preprocessor... g++ -E
checking for ld used by g++... /usr/bin/ld -m elf_x86_64
checking if the linker (/usr/bin/ld -m elf_x86_64) is GNU ld... yes
checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking for g++ option to produce PIC... -fPIC -DPIC
checking if g++ PIC flag -fPIC -DPIC works... yes
checking if g++ static flag -static works... yes
checking if g++ supports -c -o file.o... yes
checking if g++ supports -c -o file.o... (cached) yes
checking whether the g++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking dynamic linker characteristics... (cached) GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking for ranlib... (cached) ranlib
checking for ANSI C header files... (cached) yes
checking infiniband/verbs.h usability... yes
checking infiniband/verbs.h presence... yes
checking for infiniband/verbs.h... yes
checking for ibv_get_device_list in -libverbs... yes
checking for rdma_create_event_channel in -lrdmacm... yes
checking for umad_init in -libumad... yes
checking for log in -lm... yes
checking pci/pci.h usability... no
checking pci/pci.h presence... no
checking for pci/pci.h... no
configure: error: pciutils header files not found, consider installing pciutils-devel

I am using ubuntu 18.04.
How do I get pciutils-devel installed properly?

The device name printed via ctx_print_test_info is not the same as the actual used device name when rdma_cm is enabled

I modified the code to print the device name through the parameters inside the struct ibv_context.I found the device name printed by ctx_print_test_info is not the same as the parameters inside the struct ibv_context, further I found ctx_print_test_info just print the user parameter... Why is this?
My code is like this:

The execution result is as follows
server:

client:

Looking forward to reply from you when it is convenient, any reply will be helpful for me, thanks!!

Unable to run perftest with multiple qp for more than1025 QPs

I see the following errors on running ib_read_bw or ib_write_bw with cxgb4.
It is seen with the commit that adds multiple QP support for rdma-cm 5d1895ae602.

[root@saptharishi perftest]# ./ib_write_bw -R -F -n10 -q1026 -s32 --report_gbits 102.1.1.11
RDMA CM event error:
Event: RDMA_CM_EVENT_REJECTED; error: -111.

ERRNO: No such file or directory.
Failed to handle RDMA CM event.
ERRNO: No such file or directory.
Failed to connect RDMA CM events.
ERRNO: No such file or directory.

rdma-core-24.0-1.el7.x86_64

# uname -r
4.19.36

--use_cuda silently ignored when building without GPU support

This is bad, as people are convinced to be running a build with GPU support, so for example they report unreasonable performance.

perfermance problem about the ib_send_bw test

I test ib_send_bw for two case on ARM services and IB card.
Case1:numactl -C 64 ib_send_bw -d mlx5_0 -a --iters=1000
Case2:numactl -C 64 ib_send_bw -d mlx5_0 --size=1024 --iters=1000

Case1 perfermance data:
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
1024 1000 0.00 5544.28 5.677340

Case2 perfermance data:
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
1024 1000 0.00 920.13 0.942218

The performance of the case1 is 6 times of case2 performance. What may cause the performance gap?

memcpy to pointer address instead of pointer location in perftest_communication.c

perftest/src/perftest_communication.c

Line 519 in 13f7177

memcpy(&rem_dest,comm->rdma_ctx->buf[0],sizeof(struct pingpong_dest));

The line should read:
memcpy(rem_dest, comm->rdma_ctx->buf[0], sizeof(struct pingpong_dest));

RPM spec file Source and URL tags are out-of-date

In the spec file, we have:

Source: http://www.openfabrics.org/downloads/%{name}-%{version}.tar.gz
Url: http://www.openfabrics.org

But OpenFabrics points to GitHub in https://openfabrics.org/downloads/MAINTAINERS.

The spec file also indicates that this is v4.2-%{release} but it doesn't look like a matching %{name}-%{version}.tar.gz (perftest-4.2.tar.gz) is available as a downloadable artifact from GitHub.

support benchmarking of CUDA Unified Memory

perftest should support benchmarking of these new kinds of memory.

There are two basic variants of CUDA Unified Memory:

managed memory, as allocated via cudaMallocManaged()
system allocated memory, as allocated via malloc()

On IBM machines based POWER9, where the GPU is attached to the CPU via NVLINK, e.g. AC922 servers, the CUDA runtime supports GPUDirect RDMA on both variants.
For that to work, ODP must be enabled.

On other systems, like x86_64, support is still missing.

some question about multicast test

i want to test multicast. 1-to-2
server1:
ib_send_bw -d mlx5_1 -F -x 3 -c UD --run_infinitely --report_gbits -M 225.1.1.20
server2:
ib_send_bw -d mlx5_1 -F -x 3 -c UD --run_infinitely --report_gbits -M 225.1.1.20
client:
ib_send_bw -d mlx5_1 -F -x 3 -c UD --run_infinitely --report_gbits -g -M 225.1.1.20

and in this way, i failed.

i don't know correct parameters to test multicast. can anyone help me?

With two peers using different HCAs, modify QP to RTS on ib_read_lat fails due to out_reads exceeding max_qp_rd_atom

I'm using running ib_read_lat between two different HCAs, and modify QP to RTS fails on the side with a lower max_qp_rd_atom because it appears that the code uses the out_reads of the peer when modifying the QP (I think in ctx_modify_qp_to_rts, where we set attr->max_rd_atomic = dest->out_reads;). Shouldn't this just be the device's max_rd_atomic?

How can I compile the code using C++?

I would like to revise the code a little bit to add my feature, and I want to use some containers from C++ STL. So how can I compile the code using C++?

Fails to build on mips*/armel/armhf: get_cycles not implemented for this architecture

perftest fails to build from source on mips*/armel/armhf (on Debian):

[...]
make  all-am
make[2]: Entering directory '/«BUILDDIR»/perftest-3.4+0.6.gc3435c2'
gcc -DHAVE_CONFIG_H -I.   -Wdate-time -D_FORTIFY_SOURCE=2  -g -Wall -D_GNU_SOURCE -O3 -g -O2 -fdebug-prefix-map=/«BUILDDIR»/perftest-3.4+0.6.gc3435c2=. -fstack-protector-strong -Wformat -Werror=format-security -c -o src/get_clock.o src/get_clock.c
In file included from /usr/include/arm-linux-gnueabihf/sys/time.h:21:0,
                 from src/get_clock.c:43:
/usr/include/features.h:148:3: warning: #warning "_BSD_SOURCE and _SVID_SOURCE are deprecated, use _DEFAULT_SOURCE" [-Wcpp]
 # warning "_BSD_SOURCE and _SVID_SOURCE are deprecated, use _DEFAULT_SOURCE"
   ^~~~~~~
In file included from src/get_clock.c:48:0:
src/get_clock.h:101:2: warning: #warning get_cycles not implemented for this architecture: attempt asm/timex.h [-Wcpp]
 #warning get_cycles not implemented for this architecture: attempt asm/timex.h
  ^~~~~~~
src/get_clock.h:102:23: fatal error: asm/timex.h: No such file or directory
 #include <asm/timex.h>
                       ^
compilation terminated.
Makefile:786: recipe for target 'src/get_clock.o' failed
make[2]: *** [src/get_clock.o] Error 1

Debian bug: https://bugs.debian.org/863158

What does the "user_param->wait_destroy" field mean?

int destroy_ctx(struct pingpong_context *ctx,
		struct perftest_parameters *user_param)
{
	int i, first, dereg_counter, rc;
	int test_result = 0;
	int num_of_qps = user_param->num_of_qps;

	if (user_param->wait_destroy) {
		printf(" Waiting %u seconds before releasing resources...\n",
		       user_param->wait_destroy);
		sleep(user_param->wait_destroy);
	}
        ......
}

The field has default value 0, and I want to know why should I wait for x seconds before destroy_ctx(). Does this have any special significance?

infinite ib_write_bw runs do not work when using GPUs on the client side

the problem is in run_iter_bw_infinitely(), where the call to pthread_sigmask happens too late, when the CUDA driver has been initialized and its worker thread launched.
the solution is to move the call to pthread_sigmask somewhere before ctx_init().

How to specify service level with rdma_cm

Hi,

I want to run ib_write_bw with service level 2 when I am running based on rdma_cm (i.e., with -R), but it doesn't take effect. When I remove -R, it takes effect.

So -R and -S cannot be used simultaneously? How can I specify service level with rdma_cm?

Failed to deallocate PD - Device or resource busy

Reported by the UNH-IOL validation group: "When testing with OFED-4.17-rc2 on Kernel 4.17.14 with iWARP, RDMA fails with a variety of vendor cards with all having the same error below on "small/large rdma send. To confirm this is an OFED-4.17-rc2 issue I backtracked to OFED-4.17-rc1 on Kernel 4.17.14 and saw a full pass.

Commands used:

server: ib_send_bw -d i40iw1 -i 1 -s 1 -n 25000 -m 2048 --dont_xchg_versions -R -x 0 -r 510 -F

client: ib_send_bw -d qedr0 -i 1 -s 1 -n 25000 -m 2048 aegaeon-iw.ofa.iol.unh.edu -F --dont_xchg_versions -R -x 0 -r 510

Failed to deallocate PD - Device or resource busy
Couldn't destroy all SEND resources"

It has been confirmed that this is a perftest regression with perftest-4.4-0.5 included with RC2 and OFED 4.17 GA. The perftest-4.4-0.3 that was included with RC1 is working and it also works with OFED 4.17 GA.

ib_write_bw --cuda will lead to system deallock

client
mlx5 nic
./ib_write_bw -d mlx5_0 -i 1 --use_cuda=0 server_ip_address -a

server
mlx5 NIC
run: ./ib_write_bw -d mlx5_0 -i 1 --use_cuda=0

when pressing ctrl+c to kill the process, the hole system will crash and report system deadlock.

it will not happened if we don't use the param --use_cuda;

Please avoid dashes in version

Debian uses dashes to separate the upstream version from the Debian revision. Please avoid dashes in version. Is there a reason why you use the dash? Why do you use 4.1-0.2 as version instead of just 4.1 or 4.1.2 or 4.1.0.2?

send_lat test fails when rx_depth is less than iteration count

The reason is this commit: 239df74. ctx->rx_buffer_addr isn't initialized for the send_lat test. I'm not sure if the fix is just to add if (user_param->tst == BW) before it as I'm not clear on what ctx->rx_buffer_addr is for.

Apologies if this isn't how issues are reported. Let me know what I should do instead if that is the case.

"ib_write_bw" and "ib_read_bw" run into error due to different gid index at both sides

I encountered an error when using the following perftest commands to do an environment test:
server side: ib_write_bw -d mlx5_0 -D 10
client side: ib_write_bw -d mlx5_0 -D 10 10.0.13.4

I reinstalled the OFED driver without any modification to the default configuration, but the same error still occurs. I then noticed that these two machines are using different gid index by default. The error is fixed by specifying the gid index as follows:
server side: ib_write_bw -d mlx5_0 -D 10 -x 3
client side: ib_write_bw -d mlx5_0 -D 10 10.0.13.4 -x 3

My question: Why gid index matters? And why the default gid index is different?

Why does read latency not equal to write latency?

According to my understanding, rdma read and rdma write is the same except data flow direction. I think latency of ib_read_lat should be equal to that of ib_write_lat.

Test command: numactl --cpunodebind=0 --membind=0 ./ib_read/write_lat -d mlx5_0 -R ip_address

Result:

Read latency : 1.91 us.
Raw write latency : 0.93 us
Write latency after disabling IBV_SEND_INLINE: 1.32us
Write latency after disabling IBV_SEND_INLINE and DDIO: 1.32us

Are there any other reasons that may cause the difference?

Initializing huge buffers with rand() takes too much time causing iWARP MPA timeout.

During MPA negotiation, when passive side gets MPA receive packet, the RDMA-CM stack
creates connection resources (PD, MR, CQ, QP) and initiates MPA offload.
The active side, who initiated the MPA connection sets a timer that times out in case of huge buffers
(initiating those buffers takes more than reasonable MPA timeout).
I think that in this case, perftest should support an option to skip buffer initialization -
for (i = 0; i < ctx->buff_size; i++) {
((char*)ctx->buf[qp_index])[i] = (char)rand();

Should disable PCIe Relaxed Ordering by default?

If perftest was built with a machine which PCIe Relaxed Ordering compliant, and run over two machines which are NOT PCIe Relaxed Ordering compliant, we will see significantly performance degradation for large size message.

https://bugzilla.redhat.com/show_bug.cgi?id=1902855

The perftest of "ib_send_lat RC" test on data size of 4194304 bytes and 8388608 bytes had test time increase of about 250 fold when compared with the same test on other MLX4 ROCE device, like MLX5 CX-3.

So, should we disable PCIe RO by default, and let the user to enable it or not when run the perftest benchmarks?

Thanks

CUDA latency tests

I noticed in perftest_parameters.c:1367: fprintf(stderr," Perftest supports CUDA only in BW tests\n");

It seems to work if I uncomment for read_lat, but I wanted to see if there is a good not to use the test?

print_thread using sigset allocated on the stack

see

perftest/src/perftest_resources.c

Line 4222 in 6369e62

 if (pthread_create(&print_thread, NULL, &handle_signal_print_thread,(void *)&set) != 0){ 

Limit on Number of qp?

When we use ib_send_bw or ib_write_bw with about 96 QP q=96 and n=1000, we notice out of sequence or out of buffers or packet sequence errors while running these tests between 2 different machines using mellanox ofed 5.

If we run the client and server tests both on the same machine then we observe no errors.the issue is seen only when we go across machines.

Is that an issue with the code in ibsendbw or ib_write_bw or is the issue some where else?
Thanks

ib_write_bw -d mlx5_0 -F -R -q 2 --use_cuda=0 <IP> - Couldn't allocate MR

Hi,

we tried to test GPUDirect RDMA.

Test pod deployed from https://github.com/Mellanox/k8s-images

we deployed 2 pods:

Server pod:

root@rdma-cuda-test-pod-1:~# ib_write_bw -d mlx5_0 -F -R -q 2 --use_cuda=0

Waiting for client to connect... *

Client pod:

root@rdma-cuda-test-pod-1:~# ib_write_bw -d mlx5_0 -F -R -q 2 --use_cuda=0 192.168.111.1
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 02:00

Picking device No. 0
[pid = 56, dev = 0] device name = [NVIDIA A30-8C]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 262144 bytes GPU buffer
allocated GPU buffer address at 0000010013000000 pointer=0x10013000000
Couldn't allocate MR
failed to create mr
Failed to create MR
Failed to initialize RDMA contexts.
ERRNO: Bad address.
Failed to handle RDMA CM event.
ERRNO: Bad address.
Failed to connect RDMA CM events.
ERRNO: Bad address.
Segmentation fault (core dumped)

what does "Couldn't allocate MR" mean?

thanks in advance

"Couldn't allocate MR" errors on latest perftest

I get immediate failures on the latest perftest when running ib_send_bw or ib_write_bw over either mlx5 or cxgb4. git bisect points to this commit as the culprit:

[root@stevo1 perftest]# ./ib_send_bw -a -R -d cxgb4_0 172.16.2.2
Couldn't allocate MR
failed to create mr
Failed to create MR
Unable to create the resources needed by comm struct
Unable to perform rdma_client function
Unable to init the socket connection
[root@stevo1 perftest]# git bisect bad
3d9815e is the first bad commit
commit 3d9815e
Author: Zohar Ben Aharon [email protected]
Date: Mon Feb 26 14:14:48 2018 +0200

Fix for commit a9df5ed7f04484b6838ee2b48246be1ab66513f1

FreeBSD not supporting contig memory allocation

Signed-off-by: Zohar Ben Aharon <[email protected]>

:040000 040000 f01b099fc410e017aadb5967c2e0cdb815ba6c52 b3c0d0589c2daff7acc14b752ad1f4175e2d673f M src
[root@stevo1 perftest]#

ib_read_bw issue- Swapped VA in RETH

We are testing ibverbs on newly developed IB driver on our microblaze(32 bit) platform with xilinx rdma nic HW. While running ib_read_bw (server on microblaze and client on mlnx connectx-4), I found the client feeding a swapped VA in RETH. I narrowed it down to struct pingpong_dest. The size of this structure is 52bytes on 32bit system and 56bytes on 64bit x86 system with connect-x4.

I did a quick test with attribute((packed)) macro on struct pingpong_dest and I now see correct VA coming back from client via RETH.

ib_read_bw --version shows 5.5. In config.h I have HAVE_ENDIAN 1.

latency test with duration flag doesn't report min/max/median latency

Hi, when running latency tests with "iterations" (-n) flag, the tool reports min/max/avg/std_dev/99/99.9 latencies. The same latency test when run with "duration" (-D) flag reports only avg latency. How to get the same report for "duration" based tests?
Iteration test : ib_write_lat -F -d mlx5_0 -I 64 -n 100000 -s 1 20.2.1.2
Duration test : ib_write_lat -F -d mlx5_0 -I 64 -D 20 -s 1 20.2.1.2

ib_send_bw has very poor incoming performance with split traffic between two nodes.

I have 3 nodes, Mellanox ConnectX-5 ethernet card with RoCe. Nodes are A, B, C.

I start two copies of ib_send_bw on node A (server side):

taskset -c 4 ib_send_bw -d mlx5_0 -s 4194304 --run_infinitely -p 5001

taskset -c 5 ib_send_bw -d mlx5_0 -s 4194304 --run_infinitely -p 5002

Then on node B, I start ib_send_bw (client side) to connect to port 5001 on node A:

taskset -c 1 ib_send_bw -d mlx5_0 -s 4194304 -D 20 --run_infinitely -p 5001 10.254.28.96

again, on node C, I start another ib_send_bw (client side) to connect to port 5002 on node A:

taskset -c 1 ib_send_bw -d mlx5_0 -s 4194304 -D 20 --run_infinitely -p 5002 10.254.28.96

10.254.28.96 is one of the ip address on node A.

The performance is very poor:
Node B output:
4194304 4826 0.00 965.20 0.000241
4194304 7695 0.00 1539.00 0.000385
4194304 2596 0.00 519.20 0.000130
4194304 7186 0.00 1437.20 0.000359
4194304 14261 0.00 2852.20 0.000713
4194304 6801 0.00 1360.20 0.000340
Node C output:
4194304 2456 0.00 491.19 0.000123
4194304 162 0.00 32.40 0.000008
4194304 4125 0.00 824.99 0.000206
4194304 6177 0.00 1235.39 0.000309
4194304 498 0.00 99.60 0.000025
4194304 80 0.00 16.00 0.000004

If I either start ib_send_bw on node B or on node C, it can generate very good number:
4194304 14612 0.00 11689.36 0.002922
4194304 14612 0.00 11689.35 0.002922
4194304 14612 0.00 11689.35 0.002922
4194304 14612 0.00 11689.35 0.002922
4194304 14612 0.00 11689.36 0.002922

Or, I exchange the client/server role (reverse the traffic direction), the BW is evenly split:
4194304 7306 0.00 5844.68 0.001461
4194304 7306 0.00 5844.68 0.001461
4194304 7306 0.00 5844.68 0.001461
4194304 7309 0.00 5847.08 0.001462
4194304 7306 0.00 5844.68 0.001461

What I hoped is that the BW is evenly split in the poor performance case.
Want to get confirm if this is problem of ib_send_bw itself, or general RoCE problem.

reporter: [email protected]

[--rate_limit] A stable relative error (~7.5%) observed

Hi all:

I'm very interested in your work. And I personally tested --rate_limit using a Python script between two machines in an IB network. The network traffic limits were set ranging from 10MBps to 5000MBps+. Whether it is ib_send_bw,ib_read_bw or ib_write_bw, a stable relative error was observed.
The rate_limit_type is SW.:

I wonder what mechanism causes this stable relative error， thanks!

Please add man pages to all binaries

Currently lintian complains about missing man pages for following binaries:

ib_atomic_bw
ib_atomic_lat
raw_ethernet_burst_lat
raw_ethernet_bw
raw_ethernet_lat
run_perftest_loopback

Please write man pages for these tools.

ib_write_bw work normally but ib_write_bw -R failed

This is output of 'ib_write_bw -a -d mlx5_0 --report_gbits node1', seems to work fine:

[root@node3 bin]# ib_write_bw -a -d mlx5_0 --report_gbits  node1
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_0
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x02 QPN 0x0032 PSN 0x5f2841 RKey 0x002440 VAddr 0x007f2e5f64b000
 remote address: LID 0x01 QPN 0x003a PSN 0xc0fd7e RKey 0x002442 VAddr 0x007f443b2f8000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 2          5000           0.042350            0.042185            2.636533
 4          5000           0.084507            0.084457            2.639276
 8          5000             0.17               0.17   		   2.641399
 16         5000             0.34               0.34   		   2.640952
 32         5000             0.68               0.68   		   2.638957
 64         5000             1.35               1.35   		   2.638629
 128        5000             2.71               2.71   		   2.643606
 256        5000             5.42               5.42   		   2.644112
 512        5000             10.78              10.77  		   2.629186
 1024       5000             21.38              21.37  		   2.608802
 2048       5000             42.13              42.09  		   2.568967
 4096       5000             83.97              83.91  		   2.560721
 8192       5000             186.89             149.84 		   2.286319
 16384      5000             195.18             169.98 		   1.296822
 32768      5000             196.21             185.39 		   0.707209
 65536      5000             196.25             190.26 		   0.362886
 131072     5000             196.33             193.93 		   0.184945
 262144     5000             195.49             195.03 		   0.092996
 524288     5000             196.25             196.25 		   0.046789
 1048576    5000             196.48             196.48 		   0.023422
 2097152    5000             196.62             196.59 		   0.011718
 4194304    5000             196.67             196.63 		   0.005860
 8388608    5000             196.63             196.58 		   0.002929
---------------------------------------------------------------------------------------

But it would fail if I plus '-R', like:

[root@node3 bin]# ib_write_bw -a -d mlx5_0 --report_gbits  -R node1
Received 10 times ADDR_ERROR
 Unable to perform rdma_client function
 Unable to init the socket connection

And I read source code and have known it's caused by RDMA_CM_EVENT_ADDR_ERROR, but I don't known why.

This is output about 'lscpi -vvv':

[root@node3 bin]#  lspci -vvv | grep Mellanox  -A 65
41:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
	Subsystem: Mellanox Technologies Device 0007
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 1125
	NUMA node: 0
	Region 0: Memory at 2807e000000 (64-bit, prefetchable) [size=32M]
	Expansion ROM at b4400000 [disabled] [size=1M]
	Capabilities: [60] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 512 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 <4us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 16GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [48] Vital Product Data
		Product Name: ConnectX-6 VPI adapter card, HDR IB (200Gb/s) and 200GbE, single-port QSFP56                                                                                                          
		Read-only fields:
			[PN] Part number: MCX653105A-HDAT          
			[EC] Engineering changes: AE
			[V2] Vendor specific: MCX653105A-HDAT          
			[SN] Serial number: MT2130T07644   
			[V3] Vendor specific: 92a87ffbcbeaeb118000b8cef6f7f1c0
			[VA] Vendor specific: MLX:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0:MODL=CX653105A      
			[V0] Vendor specific: PCIeGen4 x16 
			[VU] Vendor specific: MT2130T07644MLNXS0D0F0 
			[RV] Reserved: checksum good, 1 byte(s) reserved
		End
	Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00003000
	Capabilities: [c0] Vendor Specific Information: Len=18 <?>
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		AERCap:	First Error Pointer: 04, GenCap+ CGenEn+ ChkCap+ ChkEn+
	Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [1c0 v1] #19
	Capabilities: [320 v1] #27
	Capabilities: [370 v1] #26
	Capabilities: [420 v1] #25
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core

42:00.0 Non-Volatile memory controller: Intel Corporation NVMe DC SSD [3DNAND, Beta Rock Controller] (prog-if 02 [NVM Express])
	Subsystem: Intel Corporation Device 8008

Any clue about what happened? look forward to your reply, thanks!

support --use_cuda in send_lat

send_lat could easily support CUDA device memory as source/sink.
write_lat cannot do the same as easily, as it relies on direct memory polling from the CPU.

Set up issues

Hi,
I'm trying to run a latency test for the first time but I get an error when I try to execute it.

My steps:

Downloaded and unzipped the repo on the box on which I want to run the server for the test
Executed:
./autogen
./configure
make clean && make V=1
./ib_send_lat --duration=30 -H

Port number 1 state is Down
Couldn't set the link layer
Couldn't get context for the device

What am I doing wrong?

Running other tests using RDMA (eg. the ones in Accelio) the system shows no problems.

Thanks

linux-rdma / perftest Goto Github PK

perftest's Introduction

perftest's People

Contributors

Stargazers

Watchers

Forkers

perftest's Issues

Command:

Statistics of EFA hw_counters:

client

server

on server side

on client side

result

Dual-port : OFF Device : mlx5_bond_0 Number of qps : 1 Transport type : IB Connection type : RC Using SRQ : OFF TX depth : 128 CQ Moderation : 1 Mtu : 4096[B] Link type : Ethernet GID index : 3 Max inline data : 0[B] rdma_cm QPs : OFF Data ex. method : rdma_cm

local address: LID 0000 QPN 0x1295 PSN 0xe41d73 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:05 remote address: LID 0000 QPN 0x126e PSN 0x9f23a6 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:06

#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] 65536 18446744072490473291 0.000000 -3.122397 -0.005955

Dual-port : OFF Device : mlx5_bond_0 Number of qps : 1 Transport type : IB Connection type : RC Using SRQ : OFF RX depth : 512 CQ Moderation : 1 Mtu : 4096[B] Link type : Ethernet GID index : 3 Max inline data : 0[B] rdma_cm QPs : OFF Data ex. method : rdma_cm

local address: LID 0000 QPN 0x12a1 PSN 0x2955ce GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:05 remote address: LID 0000 QPN 0x127e PSN 0x2afc3c GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:06

#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] 65536 1906054058 0.00 4.88 0.009312

Dual-port : OFF Device : mlx5_bond_0 Number of qps : 1 Transport type : IB Connection type : RC Using SRQ : OFF RX depth : 512 Mtu : 4096[B] Link type : Ethernet GID index : 3 Max inline data : 236[B] rdma_cm QPs : OFF Data ex. method : rdma_cm

local address: LID 0000 QPN 0x12a4 PSN 0xb2115c GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:05 remote address: LID 0000 QPN 0x1280 PSN 0x6966f2 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:06

#bytes #iterations t_avg[usec] tps average 2 -267885825 -382.06 -1308.69

Dual-port : OFF Device : mlx5_bond_0 Number of qps : 1 Transport type : IB Connection type : RC Using SRQ : OFF RX depth : 512 Mtu : 4096[B] Link type : Ethernet GID index : 3 Max inline data : 236[B] rdma_cm QPs : OFF Data ex. method : rdma_cm

local address: LID 0000 QPN 0x12ac PSN 0xc7fee4 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:05 remote address: LID 0000 QPN 0x128b PSN 0x57659e GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:07

#bytes #iterations t_avg[usec] tps average 2 1158940729 88.31 5661.71

taskset -c 4 ib_send_bw -d mlx5_0 -s 4194304 --run_infinitely -p 5001

taskset -c 5 ib_send_bw -d mlx5_0 -s 4194304 --run_infinitely -p 5002

taskset -c 1 ib_send_bw -d mlx5_0 -s 4194304 -D 20 --run_infinitely -p 5001 10.254.28.96

taskset -c 1 ib_send_bw -d mlx5_0 -s 4194304 -D 20 --run_infinitely -p 5002 10.254.28.96

Recommend Projects

Recommend Topics

Recommend Org

Dual-port : OFF Device : mlx5_bond_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 1
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : rdma_cm

local address: LID 0000 QPN 0x1295 PSN 0xe41d73
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:05
remote address: LID 0000 QPN 0x126e PSN 0x9f23a6
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:06

#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 18446744072490473291 0.000000 -3.122397 -0.005955

Dual-port : OFF Device : mlx5_bond_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
RX depth : 512
CQ Moderation : 1
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : rdma_cm

local address: LID 0000 QPN 0x12a1 PSN 0x2955ce
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:05
remote address: LID 0000 QPN 0x127e PSN 0x2afc3c
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:06

#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 1906054058 0.00 4.88 0.009312

Dual-port : OFF Device : mlx5_bond_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
RX depth : 512
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 236[B]
rdma_cm QPs : OFF
Data ex. method : rdma_cm

local address: LID 0000 QPN 0x12a4 PSN 0xb2115c
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:05
remote address: LID 0000 QPN 0x1280 PSN 0x6966f2
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:06

#bytes #iterations t_avg[usec] tps average
2 -267885825 -382.06 -1308.69

Dual-port : OFF Device : mlx5_bond_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
RX depth : 512
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 236[B]
rdma_cm QPs : OFF
Data ex. method : rdma_cm

local address: LID 0000 QPN 0x12ac PSN 0xc7fee4
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:05
remote address: LID 0000 QPN 0x128b PSN 0x57659e
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:219:07

#bytes #iterations t_avg[usec] tps average
2 1158940729 88.31 5661.71