
l2fwd-nv's Introduction

l2fwd-nv

In the vanilla l2fwd DPDK example, each thread (namely, each DPDK core) receives a burst (set) of packets, swaps the src/dst MAC addresses and transmits back the same burst of modified packets. l2fwd-nv extends l2fwd to show the usage of an mbuf pool with GPU data buffers using the vanilla DPDK API. The overall flow of the app is organised as follows:

  • Create a number of pipelines, each composed of:
    • one core that receives and accumulates bursts of packets (RX core) from a dedicated DPDK queue (RX queue)
    • a dedicated GPU/CPU workload entity which processes the bursts
    • one core that transmits the bursts of modified packets (TX core) using a dedicated DPDK queue (TX queue)
  • For each pipeline, in a loop:
    • The RX core accumulates packets in bursts
    • The RX core triggers (asynchronously) the work (MAC swapping) on the received burst using CUDA kernel(s)
    • The TX core waits for the completion of the work
    • The TX core sends the burst of modified packets

Please note that a single mempool is used for all the DPDK RX/TX queues. Using different command line options (see the example invocations below) it's possible to:

  • Create the mempool either in GPU memory or CPU pinned memory
  • Decide how to do the MAC swapping in the packets:
    • No workload: MAC addresses are not swapped, l2fwd-nv is doing basic I/O forwarding
    • CPU workload: the CPU does the swap
    • GPU workload: a new CUDA kernel is triggered for each burst of accumulated packets
    • GPU persistent workload: a persistent CUDA kernel is launched at startup on the CUDA stream dedicated to each pipeline. The CPU has to notify this kernel every time a new burst of packets has to be processed
    • GPU workload with CUDA graphs: a number of CUDA kernels are triggered for the next 8 bursts of packets
  • Enable the buffer split feature: each received packet is split in two mbufs: the first 60B go into a CPU memory mbuf, the remaining bytes are stored into a GPU memory mbuf. The workload in this case is swapping some random bytes.

Please note that not all the combinations give the best performance. This app should be considered a showcase that exposes all the possible combinations when dealing with GPUDirect RDMA and DPDK; l2fwd-nv has a trivial workload that doesn't really require the use of CUDA kernels.
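As a concrete sketch (the flags are documented in the Benchmarks section below; the PCIe addresses are the ones used later in this README and must be adjusted to your system), the following two illustrative invocations select a GPU memory mempool with a per-burst CUDA kernel, and a CPU pinned memory mempool with the CPU doing the swap:

# GPU memory mempool (-m 1), one CUDA kernel per burst of packets (-w 2)
sudo GDRCOPY_PATH_L=./external/gdrcopy/src ./build/l2fwdnv -l 0-9 -n 8 -a b5:00.1,txq_inline_max=0 -a b6:00.0 -- -m 1 -w 2 -b 64 -p 4 -v 0 -z 0
# CPU pinned memory mempool (-m 0), CPU swap (-w 1)
sudo ./build/l2fwdnv -l 0-9 -n 8 -a b5:00.1,txq_inline_max=0 -a b6:00.0 -- -m 0 -w 1 -b 64 -p 4 -v 0 -z 0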

Changelog

03/11/2022

  • Updated to DPDK 22.03
  • GDRCopy direct calls removed in favour of new gpudev cpu_map functions
  • Code cleanup

11/26/2021

  • Updated to the latest DPDK 21.11 release
  • Introduced the new gpudev library
  • Benchmarks updated to latest MOFED 5.4, DPDK 21.11 and CUDA 11.4 with V100 and A100
  • Benchmarks executed using testpmd as packet generator

System configuration

Please note that DPDK 22.03 is included as a submodule of this project and is built locally with l2fwd-nv.

Kernel configuration

Ensure that your kernel parameters include the following list:

default_hugepagesz=1G hugepagesz=1G hugepages=16 tsc=reliable clocksource=tsc intel_idle.max_cstate=0 mce=ignore_ce processor.max_cstate=0 audit=0 idle=poll isolcpus=2-21 nohz_full=2-21 rcu_nocbs=2-21 rcu_nocb_poll nosoftlockup iommu=off intel_iommu=off

Note that 2-21 corresponds to the list of CPUs you intend to use for the DPDK application; the value of this parameter needs to be changed depending on your HW configuration.

To permanently include these items in the kernel parameters, open /etc/default/grub with your favourite text editor and add them to the variable named GRUB_CMDLINE_LINUX_DEFAULT. Save this file, install the new GRUB configuration and reboot the server:

$ sudo vim /etc/default/grub
$ sudo update-grub
$ sudo reboot
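For reference, the edited variable in /etc/default/grub could look like the following (adjust the isolated CPU list and hugepage count to your hardware):

GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1G hugepagesz=1G hugepages=16 tsc=reliable clocksource=tsc intel_idle.max_cstate=0 mce=ignore_ce processor.max_cstate=0 audit=0 idle=poll isolcpus=2-21 nohz_full=2-21 rcu_nocbs=2-21 rcu_nocb_poll nosoftlockup iommu=off intel_iommu=off"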

After reboot, verify that the changes have been applied. As an example, to verify the system has 1 GB hugepages:

$ cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-5.4.0-53-lowlatency root=/dev/mapper/ubuntu--vg-ubuntu--lv ro maybe-ubiquity default_hugepagesz=1G hugepagesz=1G hugepages=16 tsc=reliable clocksource=tsc intel_idle.max_cstate=0 mce=ignore_ce processor.max_cstate=0 idle=poll isolcpus=2-21 nohz_full=2-21 rcu_nocbs=2-21 nosoftlockup iommu=off intel_iommu=off
$ grep -i huge /proc/meminfo
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:      16
HugePages_Free:       15
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:        16777216 kB

Mellanox network card

You need to follow a few steps to configure your Mellanox network card.

  • Download Mellanox OFED 5.4 from here
  • Enable CQE compression: mlxconfig -d <NIC PCIe address> set CQE_COMPRESSION=1

If the Mellanox NIC supports IB and Ethernet mode (VPI adapters):

  • Set the IB card as an Ethernet card: mlxconfig -d <NIC PCIe address> set LINK_TYPE_P1=2 LINK_TYPE_P2=2
  • Reboot the server, or run mlxfwreset -d <NIC PCIe address> reset followed by /etc/init.d/openibd restart
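To double-check that the settings have been applied after the reset, you can query the NIC configuration (the exact output format depends on your mlxconfig version):

mlxconfig -d <NIC PCIe address> query | grep -E "CQE_COMPRESSION|LINK_TYPE"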

NVIDIA GPU

Download and install the latest CUDA toolkit from here.

Install meson

DPDK 22.03 requires Meson > 0.49.2.

sudo apt-get install python3-setuptools ninja-build
wget https://github.com/mesonbuild/meson/releases/download/0.56.0/meson-0.56.0.tar.gz
tar xvfz meson-0.56.0.tar.gz
cd meson-0.56.0
sudo python3 setup.py install
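You can verify that the installed version satisfies the requirement with:

meson --version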

Enable GPUDirect RDMA

In order to enable GPUDirect RDMA with a Mellanox network card you need an additional kernel module.

If your system has CUDA 11.4 or newer, you need to load the nvidia_peermem module that comes with the NVIDIA CUDA Toolkit.

sudo modprobe nvidia-peermem

More info here.
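A quick way to confirm the module is loaded:

lsmod | grep nvidia_peermem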

If your system has an older CUDA version you need to manually build and install the nv_peer_memory module.

git clone https://github.com/Mellanox/nv_peer_memory.git
cd nv_peer_memory
make
sudo insmod nv_peer_mem.ko
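As above, confirm the module is loaded before running the application:

lsmod | grep nv_peer_mem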

Build the project

You can use cmake to build everything.

git clone --recurse-submodules https://github.com/NVIDIA/l2fwd-nv.git
cd l2fwd-nv
mkdir build
cd build
cmake ..
make -j$(nproc --all)

GDRCopy & gdrdrv

Starting from DPDK 22.03, GDRCopy support has been embedded in DPDK and is exposed through the rte_gpu_mem_cpu_map function. The CMakeLists.txt file automatically builds the GDRCopy libgdrapi.so library. After the build stage, you still need to load the gdrdrv kernel module on the system.

cd external/gdrcopy
sudo ./insmod.sh
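You can confirm that gdrdrv has been loaded with:

lsmod | grep gdrdrv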

Please note that, to enable GDRCopy in l2fwd-nv at runtime, you need to set the env var GDRCOPY_PATH_L to the directory containing the libgdrapi.so library, which is /path/to/l2fwd-nv/external/gdrcopy/src.
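For example (the exact path depends on where the repository was cloned):

export GDRCOPY_PATH_L=/path/to/l2fwd-nv/external/gdrcopy/src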

Benchmarks

Running l2fwd-nv with -h shows the usage and all the possible options:

./build/l2fwdnv [EAL options] -- b|c|d|e|g|m|s|t|w|B|E|N|P|W
 -b BURST SIZE: how many pkts x burst to RX
 -d DATA ROOM SIZE: mbuf payload size
 -g GPU DEVICE: GPU device ID
 -m MEMP TYPE: allocate mbufs payloads in 0: host pinned memory, 1: GPU device memory
 -n CUDA PROFILER: Enable CUDA profiler with NVTX for nvvp
 -p PIPELINES: how many pipelines (each with 1 RX and 1 TX cores) to use
 -s BUFFER SPLIT: enable buffer split, 64B CPU, remaining bytes GPU
 -t PACKET TIME: force workload time (nanoseconds) per packet
 -v PERFORMANCE PKTS: packets to be received before closing the application. If 0, l2fwd-nv keeps running until the CTRL+C
 -w WORKLOAD TYPE: who is in charge to swap the MAC address, 0: No swap, 1: CPU, 2: GPU with one dedicated CUDA kernel for each burst of received packets, 3: GPU with a persistent CUDA kernel, 4: GPU with CUDA Graphs
 -z WARMUP PKTS: wait this amount of packets before starting to measure performance

To run l2fwd-nv in an infinite loop, options -v and -z must be set to 0. To simulate a heavier workload per packet, the -t parameter can be used to set the number of nanoseconds spent on each packet. This should help you evaluate the best workload approach for your algorithm by combining the processing time per packet (-t) with the number of packets per burst (-b).
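For example, a hedged variation of the benchmark command used later in this README (addresses and values are only illustrative) that processes bursts of 128 packets and forces about 500 ns of work per packet:

sudo GDRCOPY_PATH_L=./external/gdrcopy/src ./build/l2fwdnv -l 0-9 -n 8 -a b5:00.1,txq_inline_max=0 -a b6:00.0 -- -m 1 -w 2 -b 128 -t 500 -p 4 -v 0 -z 0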

In the following benchmarks we report the forwarding throughput: assuming the packet generator is able to transmit packets at the full line rate of 100 Gbps, we're interested in the network throughput l2fwd-nv can reach when retransmitting the packets.

Performance

In this section we report some performance analysis to highlight different l2fwd-nv configurations. Benchmarks were executed between two machines connected back-to-back, one running l2fwd-nv and the other running testpmd.

We didn't observe any performance regression upgrading from DPDK 21.11 to DPDK 22.03.

l2fwd-nv machine

HW features:

  • GIGABYTE E251-U70
  • CPU Xeon Gold 6240R. 2.4GHz. 24C48T
  • NIC ConnectX-6 Dx (MT4125 - MCX623106AE-CDAT)
  • NVIDIA GPU V100-PCIE-32GB
  • NVIDIA GPU A100-PCIE-40GB
  • PCIe bridge between NIC and GPU: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s)

HW topology between NIC and GPU:

-+-[0000:b2]-+-00.0-[b3-b6]----00.0-[b4-b6]--+-08.0-[b5]--+-00.0  Mellanox Technologies MT28841
 |           |                               |            \-00.1  Mellanox Technologies MT28841
 |           |                               \-10.0-[b6]----00.0  NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB]

SW features:

  • Ubuntu 18.04 LTS
  • Linux kernel 5.4.0-58-lowlatency
  • GCC: 8.4.0 (Ubuntu 8.4.0-1ubuntu1~18.04)
  • Mellanox OFED 5.4-3.1.0.0
  • DPDK version: 21.11
  • CUDA 11.4

Suggested system configuration, assuming a Mellanox network card with bus id b5:00.0 and network interface enp181s0f0:

mlxconfig -d b5:00.0 set CQE_COMPRESSION=1
mlxfwreset -d b5:00.0 r -y
ifconfig enp181s0f0 mtu 8192 up
ifconfig enp181s0f1 mtu 8192 up
ethtool -A enp181s0f0 rx off tx off
ethtool -A enp181s0f1 rx off tx off
sysctl -w vm.zone_reclaim_mode=0
sysctl -w vm.swappiness=0

PCIe Max Read Request: the commands below raise the NIC Max Read Request size from 512 to 4096 bytes by updating the PCIe Device Control register (the word at offset 0x68, whose bits 14:12 encode the Max Read Request size):

$ sudo setpci -s b5:00.0 68.w
2930
$ sudo setpci -s b5:00.0 68.w=5930
$ sudo lspci -s b5:00.0 -vvv | egrep "MaxRead"
		MaxPayload 256 bytes, MaxReadReq 4096 bytes

Packet generator

In the following performance report, we used the testpmd packet generator that comes with the DPDK 21.11 code. The set of commands used to run and start testpmd is:

cd l2fwd-nv/external/dpdk/x86_64-native-linuxapp-gcc/app

sudo ./dpdk-testpmd -l 2-10 --main-lcore=2 -a b5:00.0 -- --port-numa-config=0,0 --socket-num=0 --burst=64 --txd=1024 --rxd=1024 --mbcache=512 --rxq=8 --txq=8 --forward-mode=txonly -i --nb-cores=8 --txonly-multi-flow

testpmd> set txpkts <pkt size>
start

Throughput measurement

In order to measure the network throughput, we used the mlnx_perf application that comes with a regular installation of MOFED. The command line for mlnx_perf is:

mlnx_perf -i enp181s0f1

This tool reads the network card hardware counters to determine the number of transmitted and received bytes and calculates the data rate:

tx_bytes_phy: 12,371,821,688 Bps = 98,974.57 Mbps
rx_bytes_phy: 12,165,283,124 Bps = 97,322.26 Mbps

I/O forwarding

In this test, GPU memory is used only to receive packets and transmit them back without any workload (I/O forwarding only) in an infinite loop (no performance or warmup max packets). The number of packets received per workload burst (burst size, -b) is fixed at 64 packets.

Assuming a system with Mellanox network card bus id b5:00.0 and an NVIDIA GPU with bus id b6:00.0, the command line used is:

sudo GDRCOPY_PATH_L=./external/gdrcopy/src ./build/l2fwdnv -l 0-9 -n 8 -a b5:00.1,txq_inline_max=0 -a b6:00.0 -- -m 1 -w 0 -b 64 -p 4 -v 0 -z 0

Please note that, if libcuda.so is not installed in the default system location, you need to specify its path through the CUDA_PATH_L=/path/to/libcuda.so env var.
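In that case, the full command line becomes something like the following (the library path is only a placeholder):

sudo CUDA_PATH_L=/path/to/libcuda.so GDRCOPY_PATH_L=./external/gdrcopy/src ./build/l2fwdnv -l 0-9 -n 8 -a b5:00.1,txq_inline_max=0 -a b6:00.0 -- -m 1 -w 0 -b 64 -p 4 -v 0 -z 0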

Network throughput measured with mlnx_perf:

Packet bytes | Testpmd throughput | CPU memory throughput | GPU V100 memory throughput | GPU A100 memory throughput
64           | 74 Gbps            | 18 Gbps               | 19 Gbps                    | 19 Gbps
128          | 82 Gbps            | 36 Gbps               | 37 Gbps                    | 37 Gbps
256          | 82 Gbps            | 68 Gbps               | 67 Gbps                    | 67 Gbps
512          | 97 Gbps            | 97 Gbps               | 94 Gbps                    | 95 Gbps
1024         | 98 Gbps            | 98 Gbps               | 94 Gbps                    | 97 Gbps

Please note that l2fwd-nv performance depends on the number of packets/sec rather than bytes/sec, because the I/O (and the workload) doesn't depend on the length of the packets. In order to keep up with line rate, with smaller packets the generator has to send many more packets/sec than with 1kB packets: at 100 Gbps, 64B packets correspond to roughly 148 Mpps (accounting for the 20B of preamble and inter-frame gap each packet adds on the wire), while 1024B packets correspond to roughly 12 Mpps.

Comparing GPU workloads

Here we compare the I/O forwarding throughput using different GPU workloads:

  • CUDA kernel (-w 2)
  • CUDA persistent kernel (-w 3)
  • CUDA Graph (-w 4)

Packet size is always 1kB, the testpmd send throughput is ~98 Gbps and the memory type is GPU memory (-m 1).

Benchmarks with V100:

Burst size | CUDA kernel throughput | CUDA persistent kernel throughput | CUDA Graphs throughput
16         | 18 Gbps                | 50 Gbps                           | 48 Gbps
32         | 37 Gbps                | 88 Gbps                           | 62 Gbps
64         | 90 Gbps                | 90 Gbps                           | 90 Gbps
128        | 90 Gbps                | 90 Gbps                           | 90 Gbps

Benchmarks with A100:

Burst size | CUDA kernel throughput | CUDA persistent kernel throughput | CUDA Graphs throughput
16         | 23 Gbps                | 50 Gbps                           | 30 Gbps
32         | 49 Gbps                | 97 Gbps                           | 85 Gbps
64         | 97 Gbps                | 97 Gbps                           | 97 Gbps
128        | 97 Gbps                | 97 Gbps                           | 97 Gbps

Caveats

Packet size

If the packet generator is sending non-canonical packet sizes (e.g. 1514B), cache alignment problems may slow down performance when using GPU memory. To improve performance you can try adding the rxq_pkt_pad_en=1 device argument to the command line, e.g. -a b5:00.1,txq_inline_max=0,rxq_pkt_pad_en=1.

References

More info in the NVIDIA GTC'21 session S31972 - Accelerate DPDK Packet Processing Using GPU, E. Agostini.

l2fwd-nv's People

Contributors

e-ago, eagonv


l2fwd-nv's Issues

Compiler error

Package check was not found in the pkg-config search path.
Perhaps you should add the directory containing `check.pc'
to the PKG_CONFIG_PATH environment variable
No package 'check' found
Package check was not found in the pkg-config search path.
Perhaps you should add the directory containing `check.pc'
to the PKG_CONFIG_PATH environment variable
No package 'check' found
In file included from copybw.cpp:35:
common.hpp:29:10: fatal error: check.h: No such file or directory
   29 | #include <check.h>
      |          ^~~~~~~~~
compilation terminated.
make[4]: *** [<builtin>: copybw.o] Error 1
make[3]: *** [Makefile:62: exes] Error 2
make[2]: *** [CMakeFiles/gdrcopy_lib.dir/build.make:70: CMakeFiles/gdrcopy_lib] Error 2
make[1]: *** [CMakeFiles/Makefile2:112: CMakeFiles/gdrcopy_lib.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

NO GPU FOUND

DPDK found 0 GPUs:
rte_gpu_count_avail returns 0, I don't know why.

L2fwd using Trex

Hi,

I have been testing the L2fwd application of DPDK by sending packets from Trex v2.82 in a loopback manner. The issue faced here is that if the L2fwd application is started on the DUT first, starting the Trex traffic generator afterwards gives 'Invalid MAC'. This must be due to L2fwd taking ownership of the interfaces which Trex is trying to send packets to.

However, if I reverse the order and start Trex first and then the L2fwd application on the DUT, Trex continues to send packets. In this case the L2fwd statistics displayed at regular intervals don't show the actual number of packets sent from Trex, but a much lower number.

How did you send traffic from Trex to the L2fwd and manage to get the statistics? Could you please suggest your method of configuration?

With regards,
Shashank

Testing it on a p4d aws instance

Will this work on a p4d AWS instance? Are there any specific instructions? Do you have any hints to make it work on a p4d instance?

NVCC Error

Hi,

In running the last step, I'm seeing this error:
[ 9%] Built target dpdk_target
[ 18%] Building C object CMakeFiles/l2fwdnv.dir/external/gdrcopy/src/memcpy_sse41.c.o
[ 27%] Building CUDA object CMakeFiles/l2fwdnv.dir/src/kernel.cu.o
nvcc fatal : Unknown option '-march=native'
make[2]: *** [CMakeFiles/l2fwdnv.dir/build.make:63: CMakeFiles/l2fwdnv.dir/src/kernel.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:78: CMakeFiles/l2fwdnv.dir/all] Error 2
make: *** [Makefile:141: all] Error 2

Is there an error in the build configuration...?

Multiple errors when compiling l2fwd-nv

I get multiple errors when trying to compile the l2fwd-nv example.

This is my CUDA driver and version:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+

The errors I have encountered are detailed as follows:

First, the link to download Mellanox OFED 5.4 (http://www.mellanox.com/page/products_dyn?product_family=26) is broken, so it cannot be installed.

Second, after following all the steps described in the readme (except the previous one), I got the following warning when executing the cmake .. command:

CMake Warning (dev) in CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

  CUDA_ARCHITECTURES is empty for target "l2fwdnv".
This warning is for project developers.  Use -Wno-dev to suppress it.

Guessing that it did not affect the compilation, I continued with the readme commands, and after running the make -j$(nproc --all) command I got the following error:

[ 33%] Building CUDA object CMakeFiles/l2fwdnv.dir/src/kernel.cu.o
/home/user/Documentos/l2fwd-nv/external/dpdk/x86_64-native-linuxapp-gcc/install/include/rte_common.h(879): warning #1217-D: unrecognized format function type "gnu_printf" ignored

/home/user/Documentos/l2fwd-nv/external/dpdk/x86_64-native-linuxapp-gcc/install/include/rte_log.h(291): warning #1217-D: unrecognized format function type "gnu_printf" ignored

/home/user/Documentos/l2fwd-nv/external/dpdk/x86_64-native-linuxapp-gcc/install/include/rte_log.h(320): warning #1217-D: unrecognized format function type "gnu_printf" ignored

/home/user/Documentos/l2fwd-nv/external/dpdk/x86_64-native-linuxapp-gcc/install/include/rte_debug.h(69): warning #1217-D: unrecognized format function type "gnu_printf" ignored

/usr/lib/gcc/x86_64-linux-gnu/11/include/serializeintrin.h(41): error: identifier "__builtin_ia32_serialize" is undefined

1 error detected in the compilation of "/home/user/Documentos/l2fwd-nv/src/kernel.cu".
make[2]: *** [CMakeFiles/l2fwdnv.dir/build.make:76: CMakeFiles/l2fwdnv.dir/src/kernel.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:139: CMakeFiles/l2fwdnv.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

Guessing that I could comment out that source code line, I modified the file /usr/lib/gcc/x86_64-linux-gnu/11/include/serializeintrin.h, commented out the call to the __builtin_ia32_serialize function at line 41 and, despite all the previous errors, compiled again and it "worked".

Third, when I ran ./l2fwdnv -h I got an output entirely different from the one shown in the readme of this GitHub repo (in fact, the arguments provided in this repo do not work):

************ L2FWD-NV ************

EAL: Detected CPU lcores: 24
EAL: Detected NUMA nodes: 1

Usage: ./l2fwdnv [options]

EAL common options:
  -c COREMASK         Hexadecimal bitmask of cores to run on
  -l CORELIST         List of cores to run on
                      The argument format is <c1>[-c2][,c3[-c4],...]
                      where c1, c2, etc are core indexes between 0 and 128
  --lcores COREMAP    Map lcore set to physical cpu set
                      The argument format is
                            '<lcores[@cpus]>[<,lcores[@cpus]>...]'
                      lcores and cpus list are grouped by '(' and ')'
                      Within the group, '-' is used for range separator,
                      ',' is used for single number separator.
                      '( )' can be omitted for single element group,
                      '@' can be omitted if cpus and lcores have the same value
  -s SERVICE COREMASK Hexadecimal bitmask of cores to be used as service cores
  --main-lcore ID     Core ID that is used as main
  --mbuf-pool-ops-name Pool ops name for mbuf to use
  -n CHANNELS         Number of memory channels
  -m MB               Memory to allocate (see also --socket-mem)
  -r RANKS            Force number of memory ranks (don't detect)
  -b, --block         Add a device to the blocked list.
                      Prevent EAL from using this device. The argument
                      format for PCI devices is <domain:bus:devid.func>.
  -a, --allow         Add a device to the allow list.
                      Only use the specified devices. The argument format
                      for PCI devices is <[domain:]bus:devid.func>.
                      This option can be present several times.
                      [NOTE: allow cannot be used with block option]
  --vdev              Add a virtual device.
                      The argument format is <driver><id>[,key=val,...]
                      (ex: --vdev=net_pcap0,iface=eth2).
  --iova-mode   Set IOVA mode. 'pa' for IOVA_PA
                      'va' for IOVA_VA
  -d LIB.so|DIR       Add a driver or driver directory
                      (can be used multiple times)
  --vmware-tsc-map    Use VMware TSC map instead of native RDTSC
  --proc-type         Type of this process (primary|secondary|auto)
  --syslog            Set syslog facility
  --log-level=<level> Set global log level
  --log-level=<type-match>:<level>
                      Set specific log level
  --log-level=help    Show log types and levels
  --trace=<regex-match>
                      Enable trace based on regular expression trace name.
                      By default, the trace is disabled.
                      User must specify this option to enable trace.
  --trace-dir=<directory path>
                      Specify trace directory for trace output.
                      By default, trace output will created at
                      $HOME directory and parameter must be
                      specified once only.
  --trace-bufsz=<int>
                      Specify maximum size of allocated memory
                      for trace output for each thread. Valid
                      unit can be either 'B|K|M' for 'Bytes',
                      'KBytes' and 'MBytes' respectively.
                      Default is 1MB and parameter must be
                      specified once only.
  --trace-mode=<o[verwrite] | d[iscard]>
                      Specify the mode of update of trace
                      output file. Either update on a file can
                      be wrapped or discarded when file size
                      reaches its maximum limit.
                      Default mode is 'overwrite' and parameter
                      must be specified once only.
  -v                  Display version information on startup
  -h, --help          This help
  --in-memory   Operate entirely in memory. This will
                      disable secondary process support
  --base-virtaddr     Base virtual address
  --telemetry   Enable telemetry support (on by default)
  --no-telemetry   Disable telemetry support
  --force-max-simd-bitwidth Force the max SIMD bitwidth

EAL options for DEBUG use only:
  --huge-unlink[=existing|always|never]
                      When to unlink files in hugetlbfs
                      ('existing' by default, no value means 'always')
  --no-huge           Use malloc instead of hugetlbfs
  --no-pci            Disable PCI
  --no-hpet           Disable HPET
  --no-shconf         No shared config (mmap'd files)

EAL Linux options:
  --socket-mem        Memory to allocate on sockets (comma separated values)
  --socket-limit      Limit memory allocation on sockets (comma separated values)
  --huge-dir          Directory where hugetlbfs is mounted
  --file-prefix       Prefix for hugepage filenames
  --create-uio-dev    Create /dev/uioX (usually done by hotplug)
  --vfio-intr         Interrupt mode for VFIO (legacy|msi|msix)
  --vfio-vf-token     VF token (UUID) shared between SR-IOV PF and VFs
  --legacy-mem        Legacy memory mode (no dynamic allocation, contiguous segments)
  --single-file-segments Put all hugepage memory in single files
  --match-allocations Free hugepages exactly as allocated

The expected output is:

./build/l2fwdnv [EAL options] -- b|c|d|e|g|m|s|t|w|B|E|N|P|W
 -b BURST SIZE: how many pkts x burst to RX
 -d DATA ROOM SIZE: mbuf payload size
 -g GPU DEVICE: GPU device ID
 -m MEMP TYPE: allocate mbufs payloads in 0: host pinned memory, 1: GPU device memory
 -n CUDA PROFILER: Enable CUDA profiler with NVTX for nvvp
 -p PIPELINES: how many pipelines (each with 1 RX and 1 TX cores) to use
 -s BUFFER SPLIT: enable buffer split, 64B CPU, remaining bytes GPU
 -t PACKET TIME: force workload time (nanoseconds) per packet
 -v PERFORMANCE PKTS: packets to be received before closing the application. If 0, l2fwd-nv keeps running until the CTRL+C
 -w WORKLOAD TYPE: who is in charge to swap the MAC address, 0: No swap, 1: CPU, 2: GPU with one dedicated CUDA kernel for each burst of received packets, 3: GPU with a persistent CUDA kernel, 4: GPU with CUDA Graphs
 -z WARMUP PKTS: wait this amount of packets before starting to measure performance

I checked the utils.cpp file inside the src folder and I can see the correct getopt options:

void l2fwdnv_usage(const char *prgname)
{
        printf("\n\n%s [EAL options] -- b|c|d|e|g|m|s|t|w|B|E|N|P|W\n"
               " -b BURST SIZE: how many pkts x burst to RX\n"
               " -d DATA ROOM SIZE: mbuf payload size\n"
               " -g GPU DEVICE: GPU device ID\n"
               " -m MEMP TYPE: allocate mbufs payloads in 0: host pinned memory, 1: GPU device memory\n"
                   " -n CUDA PROFILER: Enable CUDA profiler with NVTX for nvvp\n"
                   " -p PIPELINES: how many pipelines (each with 1 RX and 1 TX cores) to use\n"
                   " -s BUFFER SPLIT: enable buffer split, 64B CPU, remaining bytes GPU\n"
                   " -t PACKET TIME: force exec time (nanoseconds) per packet\n"
               " -v PERFORMANCE PKTS: packets to be received before closing the application. If 0, l2fwd-nv keeps running until the CTRL+C\n"
                   " -w WORKLOAD TYPE: who is in charge to swap the MAC address, 0: No swap, 1: CPU, 2: GPU with one dedicated CUDA kernel for each burst of received packets,>
               " -z WARMUP PKTS: wait this amount of packets before starting to measure performance\n",
               prgname);
}

Therefore, I am compiling the correct source code but, for some reason, the resulting binary is not behaving correctly.

So, the questions are the following:

  • How can I download Mellanox OFED 5.4?
  • Are the warnings I got when running cmake important and do they need to be solved? If so, how can I fix them?
  • Why am I getting an error on the __builtin_ia32_serialize function and how can I solve it?
  • Why am I getting a totally different help message, and why do the arguments provided in this repo (and in the source file) not work?

About the last question, I guess that I am running (or compiling) a different example or a different version than the one shown in this repo. However, I followed all the steps in this repo, so maybe the readme is missing some key step needed to compile and run the proper example.

Thank you

Add .editorconfig for Correct Tab Space

Would you add a .editorconfig like the following?
Since the GitHub default tab spacing is 8, it needs to be changed to 4 via an .editorconfig file or other methods.

[*]
indent_style = space
indent_size = 4

The performance is not reach to 100Gbps

I am not able to reach 100Gbps and only get 55Gbps.

The Linux dmesg log reports the maximum as only 63 Gb/s, as shown below:

[10380.315885] mlx5_core 0000:b3:00.1: 63.008 Gb/s available PCIe bandwidth, limited by 8 GT/s x8 link at 0000:b2:00.0 (capable of 252.048 Gb/s with 16 GT/s x16 link)

Could you please comment on my configuration and direct me on how to improve the performance?

System Information:

Host: 5.0.0-37-generic #40~18.04.1-Ubuntu SMP Thu Nov 14 12:06:39 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
CPU: Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
Mellanox PN: MCX623106AN-CDAT
Mellanox NIC FW: 22.32.2004
DPDK: v21.11-rc4
OFED: MLNX_OFED_LINUX-5.5-1.0.3.2:

DPDK Testpmd:

sudo ./dpdk-testpmd -l 2-10 --main-lcore=2 -a b3:00.0 -- --port-numa-config=0,0 --socket-num=0 --burst=64 --txd=1024 --rxd=1024 --mbcache=512 --rxq=4 --txq=4 --forward-mode=txonly -i --nb-cores=8 --txonly-multi-flow
EAL: Detected CPU lcores: 24
EAL: Detected NUMA nodes: 1
EAL: Detected static linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'VA'
EAL: VFIO support initialized
EAL: Probe PCI driver: mlx5_pci (15b3:101d) device: 0000:b3:00.1 (socket 0)
TELEMETRY: No legacy callbacks, legacy socket not created
Set txonly packet forwarding mode
Interactive-mode selected
testpmd: create a new mbuf pool <mb_pool_0>: n=278528, size=2048, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc

Warning! port-topology=paired and odd forward ports number, the last port will pair with itself.

Configuring Port 0 (socket 0)
Port 0: 0C:42:A1:B1:C1:D1
Checking link statuses...
Done

testpmd> set txpkts 1024
testpmd> start
txonly packet forwarding - ports=1 - cores=4 - streams=4 - NUMA support enabled, MP allocation mode: native
Logical Core 3 (socket 0) forwards packets on 1 streams:
RX P=0/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00
Logical Core 4 (socket 0) forwards packets on 1 streams:
RX P=0/Q=1 (socket 0) -> TX P=0/Q=1 (socket 0) peer=02:00:00:00:00:00
Logical Core 5 (socket 0) forwards packets on 1 streams:
RX P=0/Q=2 (socket 0) -> TX P=0/Q=2 (socket 0) peer=02:00:00:00:00:00
Logical Core 6 (socket 0) forwards packets on 1 streams:
RX P=0/Q=3 (socket 0) -> TX P=0/Q=3 (socket 0) peer=02:00:00:00:00:00

txonly packet forwarding packets/burst=64
packet len=1024 - nb packet segments=1
nb forwarding cores=8 - nb forwarding ports=1
port 0: RX queue number: 4 Tx queue number: 4
Rx offloads=0x0 Tx offloads=0x10000
RX queue: 0
RX desc=1024 - RX free threshold=64
RX threshold registers: pthresh=0 hthresh=0 wthresh=0
RX Offloads=0x0
TX queue: 0
TX desc=1024 - TX free threshold=0
TX threshold registers: pthresh=0 hthresh=0 wthresh=0
TX offloads=0x10000 - TX RS bit threshold=0
testpmd> stop
Telling cores to stop...
Waiting for lcores to finish...

------- Forward Stats for RX Port= 0/Queue= 0 -> TX Port= 0/Queue= 0 -------
RX-packets: 0 TX-packets: 239418112 TX-dropped: 5639512704

------- Forward Stats for RX Port= 0/Queue= 1 -> TX Port= 0/Queue= 1 -------
RX-packets: 0 TX-packets: 239418240 TX-dropped: 5641563328

------- Forward Stats for RX Port= 0/Queue= 2 -> TX Port= 0/Queue= 2 -------
RX-packets: 0 TX-packets: 239418176 TX-dropped: 5642059072

------- Forward Stats for RX Port= 0/Queue= 3 -> TX Port= 0/Queue= 3 -------
RX-packets: 0 TX-packets: 239418112 TX-dropped: 5629728832

---------------------- Forward statistics for port 0 ----------------------
RX-packets: 0 RX-dropped: 0 RX-total: 0
TX-packets: 957672640 TX-dropped: 22552863936 TX-total: 23510536576

+++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
RX-packets: 0 RX-dropped: 0 RX-total: 0
TX-packets: 957672640 TX-dropped: 22552863936 TX-total: 23510536576
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Done.

Mellanox perf log:


  tx_vport_unicast_packets: 6,695,119
    tx_vport_unicast_bytes: 6,855,801,856 Bps    = 54,846.41 Mbps      
            tx_packets_phy: 6,695,110
              tx_bytes_phy: 6,882,575,972 Bps    = 55,060.60 Mbps      
            tx_prio0_bytes: 6,882,515,512 Bps    = 55,060.12 Mbps      
          tx_prio0_packets: 6,695,053
                      UP 0: 55,060.12            Mbps = 100.00%
                      UP 0: 6,695,053            Tran/sec = 100.00%

The lspci log:

b3:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
Subsystem: Mellanox Technologies MT28841
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 66
NUMA node: 0
Region 0: Memory at e4000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at e1000000 [disabled] [size=1M]
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 <4us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [48] Vital Product Data
Product Name: ConnectX-6 Dx EN adapter card, 100GbE, Dual-port QSFP56, PCIe 4.0 x16, No Crypto
Read-only fields:
[PN] Part number: MCX623106AN-CDAT
[EC] Engineering changes: A6
[V2] Vendor specific: MCX623106AN-CDAT
[SN] Serial number: MT2006X19556
[V3] Vendor specific: de46691c904cea1180000c42a11dd12a
[VA] Vendor specific: MLX:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0:MODL=CX623106A
[V0] Vendor specific: PCIeGen4 x16
[RV] Reserved: checksum good, 1 byte(s) reserved
End
Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
AERCap: First Error Pointer: 04, GenCap+ CGenEn+ ChkCap+ ChkEn+
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
IOVSta: Migration-
Initial VFs: 8, Total VFs: 8, Number of VFs: 0, Function Dependency Link: 00
VF offset: 2, stride: 1, Device ID: 101e
Supported Page Size: 000007ff, System Page Size: 00000001
Region 0: Memory at 00000000e6800000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [1c0 v1] #19
Capabilities: [230 v1] Access Control Services
ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
Capabilities: [320 v1] #27
Capabilities: [370 v1] #26
Capabilities: [420 v1] #25
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core

The ibstat log:

CA 'mlx5_0'
CA type: MT4125
Number of ports: 1
Firmware version: 22.32.2004
Hardware version: 0
Node GUID: 0x0c42a103001dd12a
System image GUID: 0x0c42a103001dd12a
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0e42a1fffe1dd12a
Link layer: Ethernet
CA 'mlx5_1'
CA type: MT4125
Number of ports: 1
Firmware version: 22.32.2004
Hardware version: 0
Node GUID: 0x0c42a103001dd12b
System image GUID: 0x0c42a103001dd12a
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0e42a1fffe1dd12b
Link layer: Ethernet
