
bam's Introduction

GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture

This is the open-source implementation of the BaM system (ASPLOS'23). Contributions to the codebase are most welcome.

Abstract

Graphics Processing Units (GPUs) have traditionally relied on the CPU to orchestrate access to data storage. This approach is well-suited for GPU applications with known data access patterns that enable partitioning their data sets to be processed in a pipelined fashion in the GPU. However, many emerging applications, such as graph and data analytics, recommender systems, or graph neural networks, require fine-grained, data-dependent access to storage. CPU orchestration of storage access is unsuitable for these applications due to high CPU-GPU synchronization overheads, I/O traffic amplification, and long CPU processing latencies. GPU self-orchestrated storage access avoids these overheads by removing the CPU from the storage control path and, thus, can potentially support these applications at a much higher speed. However, there is a lack of systems architecture and software stack that enables efficient self-orchestrated storage access by the GPUs.

In this work, we present a novel system architecture, BaM, that offers mechanisms for GPU code to efficiently access storage and enables GPU self-orchestrated storage access. BaM features a fine-grained software cache to coalesce data storage requests while minimizing I/O amplification effects. This software cache communicates with the storage system through high-throughput queues that enable the massive number of concurrent threads in modern GPUs to generate I/O requests at a high-enough rate to fully utilize the storage devices and the system interconnect. Experimental results show that GPU self-orchestrated storage access running on BaM delivers 1.0$\times$ and 1.49$\times$ end-to-end speedup for the BFS and CC graph analytics benchmarks while reducing hardware costs by up to 21.7$\times$. Our experiments also show that GPU self-orchestrated storage access speeds up data-analytics workloads by 5.3$\times$ when running on the same hardware.

Hardware/System Requirements

This code base requires specific hardware and system configuration to be functional and performant.

Hardware Requirements

  • An x86 system supporting PCIe P2P
  • An NVMe SSD. Any NVMe SSD will do.
    • Please make sure there is no needed data on this SSD, as the system can write to the SSD if the application requests it.
  • An NVIDIA Tesla/Datacenter-grade GPU from the Volta or newer generation. A Tesla V100/A100/H100 fits both of these requirements.
    • A Tesla-grade GPU is needed as it can expose all of its memory for P2P accesses over PCIe. (The NVIDIA Tesla T4 does not work as it only provides 256MB of BAR space.)
    • A Volta or newer generation of GPU is needed as we rely on memory synchronization primitives only supported since Volta.
  • A system that can support Above 4G Decoding for PCIe devices.
    • This is needed to address more than 4GB of memory for PCIe devices, specifically GPU memory.
    • This is a feature that might need to be ENABLED in the BIOS of the system.
    • For high throughput implementation, we recommend using a PCIe switch to connect the GPU and SSDs in your server. Going over IOMMU degrades performance.

System Configurations

  • As mentioned above, Above 4G Decoding needs to be ENABLED in the BIOS
  • The system's IOMMU should be disabled for ease of debugging.
    • In Intel systems, this requires disabling VT-d in the BIOS
    • In AMD systems, this requires disabling the IOMMU in the BIOS
  • IOMMU support in Linux must be disabled too; it can be checked and disabled following the instructions below.
  • In the system's BIOS, ACS must be disabled if the option is available
  • A relatively new Linux kernel (i.e., 5.x).
  • CMake 3.10 or newer and the FindCUDA package for CMake
  • GCC version 5.4.0 or newer. Compiler must support C++11 and POSIX threads.
  • CUDA 12.3 or newer
  • Nvidia driver (440.33 or newer)
  • The kernel version we have tested is 5.8.x. A newer kernel like 6.x may not work with BaM as the kernel APIs have dramatically changed.
  • Kernel module symbols and headers for the Nvidia driver. The instructions for how to compile these symbols are given below.
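A quick way to sanity-check the software prerequisites listed above (standard commands on most Linux distributions; output formats vary):

$ uname -r              # kernel version (5.x recommended)
$ gcc --version         # GCC 5.4.0 or newer
$ cmake --version       # CMake 3.10 or newer
$ nvcc --version        # CUDA toolkit version
$ nvidia-smi            # Nvidia driver version and visible GPUs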

Disable IOMMU in Linux

If you are using CUDA or implementing support for your own custom devices, you need to explicitly disable the IOMMU, as IOMMU support for peer-to-peer on Linux is a bit flaky at the moment. If you are not relying on peer-to-peer, we would in fact recommend leaving the IOMMU on to protect memory from rogue writes.

To check if the IOMMU is on, you can do the following:

$ cat /proc/cmdline | grep iommu

If either iommu=on or intel_iommu=on is found by grep, the IOMMU is enabled.

You can disable it by removing iommu=on and intel_iommu=on from the CMDLINE variable in /etc/default/grub and then reconfiguring GRUB. The next time you reboot, the IOMMU will be disabled.
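As a concrete sketch of that workflow (the variable name and regeneration command shown are the typical ones for GRUB2 on Ubuntu/Debian; adjust for your distribution):

$ sudo nano /etc/default/grub
# remove iommu=on / intel_iommu=on from the GRUB_CMDLINE_LINUX_DEFAULT line, e.g.
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
$ sudo update-grub                  # or: sudo grub2-mkconfig -o /boot/grub2/grub.cfg
$ sudo reboot
$ cat /proc/cmdline | grep iommu    # after the reboot, this should print nothing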

Compiling Nvidia Driver Kernel Symbols

Typically the Nvidia driver kernel sources are installed in the /usr/src/ directory. So if the Nvidia driver version is 470.141.03, they will be in the /usr/src/nvidia-470.141.03 directory. Assuming the driver version is 470.141.03, run the following commands as the root user to build the kernel symbols:

$ cd /usr/src/nvidia-470.141.03/
$ sudo make
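If the build succeeds, the driver source tree should now contain a Module.symvers file, which the BaM kernel module later uses to resolve the nvidia_p2p_* symbols (the path below assumes driver 470.141.03 as above):

$ ls /usr/src/nvidia-470.141.03/Module.symvers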

FOR ASPLOS AOE: ALL OF THESE CONFIGURATIONS ARE ALREADY SET APPROPRIATELY ON THE PROVIDED MACHINE!

Building the Project

From the project root directory, do the following:

$ git submodule update --init --recursive
$ mkdir -p build; cd build
$ cmake ..
$ make libnvm                         # builds library
$ make benchmarks                     # builds benchmark program

The CMake configuration is supposed to autodetect the locations of CUDA, the Nvidia driver, and the project library. CUDA is located by the FindCUDA package for CMake, while the location of the Nvidia driver can be set manually by overriding the NVIDIA define for CMake (cmake .. -DNVIDIA=/usr/src/nvidia-470.141.03/).

After this, you should also compile the custom libnvm kernel module for NVMe devices. Assuming you are in the root project directory, run the following:

$ cd build/module
$ make

Loading/Unloading the Kernel Module

In order to use the custom kernel module for the NVMe device, we first need to unbind the NVMe device from the default Linux NVMe driver. To do this, we need to find the PCI ID of the NVMe device, which we can get from the kernel log. For example, if the required NVMe device is mapped to the /dev/nvme0 block device, we can do the following to find the PCI ID.

$ dmesg | grep nvme0
[  126.497670] nvme nvme0: pci function 0000:65:00.0
[  126.715023] nvme nvme0: 40/0/0 default/read/poll queues
[  126.720783]  nvme0n1: p1
[  190.369341] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null)

The first line gives the PCI ID for the /dev/nvme0 device as 0000:65:00.0.
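Alternatively, the PCI ID of an NVMe controller can usually be read straight from sysfs (shown here for nvme0; the output matches the example above):

$ basename "$(readlink -f /sys/class/nvme/nvme0/device)"
0000:65:00.0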

To unbind the NVMe driver for this device we need to do the following as the root user:

# echo -n "0000:65:00.0" > /sys/bus/pci/devices/0000\:65\:00.0/driver/unbind

Please do this for each NVMe device you want to use with this system.
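If several SSDs are to be handed over to BaM, the same unbind can be scripted over their PCI IDs (the IDs below are placeholders for your own devices; run as root):

# for id in 0000:65:00.0 0000:66:00.0; do
>     echo -n "$id" > "/sys/bus/pci/devices/$id/driver/unbind"
> done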

Now we can load the custom kernel module from the root project directory with the following:

$ cd build/module
$ sudo make load

This should create a /dev/libnvm* device file for each controller that isn't bound to the NVMe driver.

The module can be unloaded from the project root directory with the following:

$ cd build/module
$ sudo make unload

The module can be reloaded (unloaded and then loaded) from the project root directory with the following:

$ cd build/module
$ sudo make reload

Running the Example Benchmark

The fio-like benchmark application is compiled as the ./bin/nvm-block-bench binary in the build directory. It assigns NVMe block IDs (randomly or sequentially) to each GPU thread and then launches a GPU kernel in which the GPU threads make the appropriate IO requests. When multiple NVMe devices are available, the threads (in groups of 32) self-assign an SSD in round-robin fashion, so we get a uniform distribution of requests across the NVMe devices; a simplified sketch of this per-thread mapping is shown after the option list below. The application must be run with sudo as it needs direct access to the /dev/libnvm* files. The application arguments are as follows:

$ ./bin/nvm-block-bench --help
OPTION            TYPE            DEFAULT   DESCRIPTION                       
  page_size       count           4096      size of each IO request               
  blk_size        count           64        CUDA thread block size              
  queue_depth     count           16        queue depth per queue               
  num_blks        count           2097152   number of pages in backing array    
  input           path                      Input dataset path used to write to NVMe SSD
  gpu             number          0         specify CUDA device                 
  n_ctrls         number          1         specify number of NVMe controllers  
  reqs            count           1         number of reqs per thread           
  access_type     count           0         type of access to make: 0->read, 1->write, 2->mixed
  pages           count           1024      number of pages in cache            
  num_queues      count           1         number of queues per controller     
  random          bool            true      if true the random access benchmark runs, if false the sequential access benchmark runs
  ratio           count           100       ratio split for % of mixed accesses that are read
  threads         count           1024      number of CUDA threads       
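To make the thread-to-request mapping concrete, the following is a simplified, hypothetical sketch of what each GPU thread does (names such as block_bench_sketch and the assignment array are assumptions; the real kernel in the block benchmark also handles queue selection, timing, and the completion path through BaM's queue-pair API):

#include <cstdint>

// Hypothetical sketch of the per-thread work in nvm-block-bench.
__global__ void block_bench_sketch(const uint64_t* assignment, // random or sequential block IDs
                                    uint64_t num_blks, uint64_t n_ctrls)
{
    uint64_t tid  = blockIdx.x * (uint64_t)blockDim.x + threadIdx.x;
    uint64_t blk  = assignment[tid] % num_blks;   // NVMe block this thread targets
    uint64_t ctrl = (tid / 32) % n_ctrls;         // warps pick an SSD round-robin

    // In the real benchmark the thread now enqueues an NVMe read/write for
    // `blk` on one of controller `ctrl`'s GPU-resident submission queues and
    // polls the matching completion queue.
    (void)blk; (void)ctrl;
}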

The application prints many things during initialization to help with debugging; however, near the end it prints some statistics of the GPU kernel, as shown below:

Elapsed Time: 169567	Number of Ops: 262144	Data Size (bytes): 134217728
Ops/sec: 1.54596e+06	Effective Bandwidth(GB/S): 0.73717
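For reference, these printed numbers are consistent with the Elapsed Time being reported in microseconds and the bandwidth in GiB/s: 262144 ops / 0.169567 s ≈ 1.546×10^6 ops/s, and 134217728 bytes / 0.169567 s ≈ 7.9×10^8 bytes/s ≈ 0.737 GiB/s, matching the output above.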

If you want to run a large GPU kernel on GPU 0 with many threads (262144 threads grouped into a GPU block size of 64), each making 1 random request to the first 2097152 NVMe blocks, with an NVMe IO read size of 512 bytes (page_size) and 128 NVMe queues each 1024 entries deep, you would run the following command:

sudo ./bin/nvm-block-bench --threads=262144 --blk_size=64 --reqs=1 --pages=262144 --queue_depth=1024  --page_size=512 --num_blks=2097152 --gpu=0 --n_ctrls=1 --num_queues=128 --random=true

If you want to run the same benchmark but now with each thread accessing the array sequentially, you would run the following command:

sudo ./bin/nvm-block-bench --threads=262144 --blk_size=64 --reqs=1 --pages=262144 --queue_depth=1024  --page_size=512 --num_blks=2097152 --gpu=0 --n_ctrls=1 --num_queues=128 --random=false

Disclaimer: The NVMe SSD we used supports 128 queues, each with a depth of 1024. However, even if your SSD supports fewer queues and/or a smaller depth, the system will automatically fall back to the limits reported by your device if you specify larger numbers.

Example Applications

BaM is evaluated on several applications and datasets. A limited set of them is released to the public in the benchmarks folder. Other applications, e.g. data analytics, are proprietary.

Microbenchmarks Several microbenchmarks are provided in the benchmarks folder. The array benchmark evaluates the performance of the array abstraction with the BaM cache and access to SSDs on misses; the block benchmark is akin to fio and evaluates the performance of the BaM I/O stack; the cache and pattern benchmarks evaluate the performance of the BaM cache across different access patterns; and the readwrite benchmark evaluates reading and writing very large datasets to a single SSD (multi-SSD writes are not enabled in the application despite BaM supporting them).

VectorAdd, Scan and Reduction These benchmarks test the performance of extremely large array operations for the given compute primitives. The dataset is randomly generated as these applications are not data-dependent.

Graph Benchmarks The initial implementation of the graph benchmarks is taken from EMOGI (https://github.com/illinois-impact/EMOGI). We use the same datasets used in EMOGI and write them to SSDs using the benchmark/readwrite application. BFS, CC, SSSP and PageRank benchmarks are implemented. The BFS and CC workloads are extensively studied, while the SSSP and PageRank studies are in progress.

Dataset If you need access to these preprocessed datasets, please reach out to us. These are gigantic datasets and we can figure out how to share the preprocessed ones.

Your application One can use any of these example benchmarks and implement their own application using the bam::array abstraction. Feel free to reach out over GitHub issues in case you run into issues or require additional insights on how to best use BaM in your codebase.
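As a rough illustration of the programming model (the type and parameter names below are assumptions modeled on benchmarks/vectoradd; consult the headers in include/ and the benchmark sources for the real construction and launch code), a kernel written against the array abstraction looks roughly like this:

// Minimal, hypothetical sketch of a kernel using the BaM array abstraction.
// Assumes the BaM headers in include/ provide the array_d_t<T> device type.
template <typename T>
__global__ void vector_add_sketch(array_d_t<T>* a, array_d_t<T>* b,
                                  array_d_t<T>* c, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        // operator[] goes through the BaM software cache; a miss triggers a
        // GPU-initiated NVMe read before the element is returned.
        T val = (*a)[i] + (*b)[i];
        (*c)[i] = val;   // writes also go through the cache
    }
}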

FOR ASPLOS AOE: PLEASE REFER TO THE asplosaoe DIRECTORY FOR HOW TO RUN THE APPLICATIONS AND THE EXPECTED OUTPUTS!

Citations

If you use BaM, its concepts, or a derivative codebase of BaM in your work, please cite the following articles.

@inproceedings{bamasplos,
    author = {Qureshi, Zaid and Mailthody, Vikram Sharma and Gelado, Isaac and Min, Seung Won and Masood, Amna and Park, Jeongmin and Xiong, Jinjun and Newburn, CJ and Vainbrand, Dmitri and Chung, I-Hsin and Garland, Michael and Dally, William and Hwu, Wen-mei},
     title = {GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture},
     year = {2023},
     booktitle = {Proceedings of the Twenty-Eighth International Conference on Architectural Support for Programming Languages and Operating Systems},
     series = {ASPLOS '23}
}

@phdthesis{phdthesis1,
  title={Infrastructure to Enable and Exploit GPU Orchestrated High-Throughput Storage Access on GPUs},
  author={Qureshi, Zaid},
  year={2022},
  school={University of Illinois Urbana-Champaign}
}

@phdthesis{phdthesis2,
  title={Application Support And Adaptation For High-throughput Accelerator Orchestrated Fine-grain Storage Access},
  author={Mailthody, Vikram Sharma},
  year={2022},
  school={University of Illinois Urbana-Champaign}
}

Acknowledgement

The codebase builds on top of an open-source codebase by Jonas Markussen available here. We took his codebase and made it more robust by adding more error checking and fixing memory-alignment issues, along with increasing performance when a large number of requests are in flight.

We added functionality that allows any GPU thread to independently access any location on the NVMe device. To facilitate this, we developed high-throughput concurrent queues.

Furthermore, we added support for an application to use multiple NVMe SSDs.

Finally, to lessen the programmer's burden, we developed abstractions, like an array abstraction and a data caching layer, so that programmers can write their GPU code the way they are used to, while the library automatically checks whether accesses hit in the cache and, on a miss, fetches the needed data from the NVMe device. All of these features are developed into a header-only library in the include directory. These headers can be used in CUDA C/C++ application code.

Contributions

Please check the Contribution.md file for more details.

bam's Issues

The bandwidth does not scale linearly when increasing the number of SSDs

Describe the bug
Hi,
I am trying to run nvm-block-bench on an ASUS ESC8000-E11 that has a V100 GPU and 6x Intel NVMe SSDs. The GPU is configured with PCIe5 x16.

To Reproduce
In fact, I'd like to reproduce this test: #17 .

Expected behavior
In my expectation, I should get the same linear scaling of bandwidth as he did.

However, my bandwidth seems to be capped at ~10GB/s.
Here are some results:

  1. I could get ~5GB/s when reading only one SSD:
    run:
    ./bin/nvm-block-bench --threads=$((1024*1024*4)) --blk_size=64 --reqs=1 --pages=$((1024*1024*4)) --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=4 --n_ctrls=1 --num_queues=128 --random=true --access_type=0

result:

Elapsed Time: 3.10162e+06 Number of Ops: 4194304 Data Size (bytes): 17179869184
Ops/sec: 1.3523e+06 Effective Bandwidth(GB/S): 5.1586

2. When we increase to 2 SSDs, we only get ~8GB/s:
run:
./bin/nvm-block-bench --threads=$((1024*1024*4)) --blk_size=64 --reqs=1 --pages=$((1024*1024*4)) --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=4 --n_ctrls=2 --num_queues=128 --random=true --access_type=0

result:

Elapsed Time: 1.97415e+06 Number of Ops: 4194304 Data Size (bytes): 17179869184
Ops/sec: 2.12461e+06 Effective Bandwidth(GB/S): 8.10473

3. Next, increasing to 4 SSDs gives ~10GB/s:
run:
./bin/nvm-block-bench --threads=$((1024*1024*4)) --blk_size=64 --reqs=1 --pages=$((1024*1024*4)) --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=4 --n_ctrls=4 --num_queues=128 --random=true --access_type=0

result:

Elapsed Time: 1.54689e+06 Number of Ops: 4194304 Data Size (bytes): 17179869184
Ops/sec: 2.71144e+06 Effective Bandwidth(GB/S): 10.3433

4. Unfortunately, increasing the number of SSDs further didn't help; the bandwidth seemed to be capped.
run:
./bin/nvm-block-bench --threads=$((1024*1024*4)) --blk_size=64 --reqs=1 --pages=$((1024*1024*4)) --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=4 --n_ctrls=6 --num_queues=128 --random=true --access_type=0

result:

Elapsed Time: 1.54337e+06 Number of Ops: 4194304 Data Size (bytes): 17179869184
Ops/sec: 2.71762e+06 Effective Bandwidth(GB/S): 10.3669

I tried to change page_size, req, threads, etc., but the bandwidth was only ~10GB/s.

I tried to troubleshoot the problem from the SSD perspective, using the fio tool to read data from multiple SSDs to the CPU at the same time, and the bandwidth was up to ~30GB/s:
(attached screenshot: CPU_bw)

Do you have any ideas or solutions for this result? Thanks.

Machine Setup (please complete the following information):
OS: Ubuntu 20.04.6, Kernel 5.4.0-99-generic
NVIDIA Driver: 545.23.08, CUDA Versions: 12.3, GPU name: NVIDIA V100-PCIE-32GB
SSD used: Intel SSD D7-P5520 SERIES

Failed to map device memory: Invalid argument

Describe the bug
I'm attempting to run the examples but when I run ./bin/nvm-array-bench it gives me:

SQs: 63 CQs: 63 n_qps: 1
[ioctl_map] Page mapping kernel request failed (ptr=0x7fca51410000, n_pages=1): Invalid argument
Unexpected error: Failed to map device memory: Invalid argument

Machine Setup (please complete the following information):

  • OS: ubuntu 20.04, 5.15.0-88-generic
  • Driver Version: 470.223.02, CUDA Version: 11.4
  • A800 GPU
  • SSD: SAMSUNG NVMe ssd

nvm-block-bench freezes on Kioxia drives when running with n_ctrls greater than or equal to two

Describe the bug
I am trying to run nvm-block-bench on a Dell PowerEdge XE8545 that has 4x A100 GPUs and 2x NVMe SSDs. It works fine when I only use n_ctrls=1 with gpu0, but with any more than that the program freezes indefinitely. I have attached parts of the dmesg output, which shows the following potential errors:

[52272.105362] NVRM: Xid (PCI:0000:41:00): 119, pid=12600, name=nvm-block-bench, Timeout waiting for RPC from GSP1! Expected function 76 (GSP_RM_CONTROL) (0x20803019 0x40).

[52272.105380] CPU: 80 PID: 12600 Comm: nvm-block-bench Tainted: P OE 5.8.0-63-generic #71~20.04.1-Ubuntu

[52272.105909] WARNING: kernel stack frame pointer at 000000008f1c4968 in nvm-block-bench:12600 has bad value 00000000c9cd6af6

To Reproduce
Run the following command:
sudo ./bin/nvm-block-bench --threads=262144 --blk_size=64 --reqs=1 --pages=262144 --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=0 --num_queues=128 --random=true -S 1 --n_ctrls=2

Expected behavior
Expected nvm-block-bench to not freeze with n_ctrl >= 2.

Machine Setup (please complete the following information):

  • OS: Ubuntu 20.04.6, Kernel 5.8.0-63-generic
  • NVIDIA Driver: 535.54.03, CUDA Versions: 12.2, GPU name: NVIDIA A100-SXM4-80GB
  • SSD used: KIOXIA CM7

Additional context
I have disabled the other 3 GPUs with the following commands, as it seems that systems with 4 or more GPUs tend to make the nvidia-smi tool freeze or take a long time. However, when the other GPUs are disabled, nvidia-smi becomes snappy and responsive.

nvidia-smi -i <pcie_address_gpu> -pm 0
nvidia-smi drain -p <pcie_address_gpu> -m 1

dmesg.txt
nvidia-smi.txt

Page size of page cache

I noticed the code in include/page_cache.h:

if ((pdt.page_size > (ctrl.ns.lba_data_size * uints_per_page)) || (np == 0) || (pdt.page_size < ctrl.ns.lba_data_size))
            throw error(string("page_cache_t: Can't have such page size or number of pages"));

Wondering if it should be:

if ((pdt.page_size > (ctrl.ctrl->page_size* uints_per_page)) || (np == 0) || (pdt.page_size < ctrl.ns.lba_data_size))
            throw error(string("page_cache_t: Can't have such page size or number of pages"));

change ctrl.ns.lba_data_size to ctrl.ctrl->page_size so that page size can be up to 2MB? Thanks.

How to get the file offset

When processing files on an SSD in BaM, the GPU needs to obtain the File Offset of the file on the SSD. The File Offset may not be the same on each disk. How can I obtain the File Offset?

DIRTY flag in bam_ptr?

Hi, I found that no DIRTY flag is set in the following code:

include/page_cache.h: 477

T& operator[](const size_t i) {
    if ((i < start) || (i >= end)) {
        update_page(i);
    }
    return addr[i-start];
}

from benchmark/vectoradd/main.cu: 176

for (size_t j = 0; j < n_elems_per_page; j += WARP_SIZE) {
    idx = start_idx + j; 
    if(idx < n_elems){
        val  = Aptr[idx] + Bptr[idx];
        Cptr[idx] = val; 

Should the code be changed, or am I missing something here? Thanks.

update_page(i);
page->state.fetch_or(DIRTY, simt::memory_order_relaxed);

Failed to open descriptor

Hi, I'm attempting to run the examples but when I run ./bin/nvm-array-bench it gives me:

Unexpected error: Failed to open descriptor: No such file or directory

I run the example on a node with

Machine Setup (please complete the following information):

  • Linux gn15 5.4.0-47-generic #51~18.04.1-Ubuntu
  • Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
  • NVIDIA A100 GPU
  • Driver version: nvidia-515.65.01, CUDA version: 12.1
  • ZHITAI NVMe SSD

Besides, when I run make load, it does not create a /dev/libnvm* device.

I wonder what is the problem and how to solve it?

Thanks!

What type of SSD should I choose to test the performance of BaM

Hello, I would like to ask: if I want to reproduce the experiments in the BaM paper, is there any restriction on the SSD model used in the experiments?

  • What type of SSD is recommended?
  • I hope the SSD can be more affordable than the Intel Optane P5800X.
  • Can I choose a Samsung 980 Pro SSD to reproduce the experiment shown in the figure from the paper?
    Thank you!

where to get the dataset ?

Hi guys,

I'm new to BaM and currently I can run the array benchmark on my machine successfully.
So I'm going to run the BFS benchmark, but I'm wondering how to get the dataset (the raw graph dataset) first?

I would appreciate it if anyone could share the dataset with me :-)

Thanks.

The program (vectorAdd) hangs when the page cache is oversubscribed

I did a test of page cache oversubscription with a very simple setting:

Environment:

Linux 5.4.0-131-generic #147-Ubuntu SMP
CPU: Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
GPU: Nvidia A100 @ 1410 MHz
Driver version: 515.65.01
CUDA version: CUDA 11.7

benchmark/vectoradd/settings.h:

numThreads = 256;
maxPageCacheSize = 2097152;
n_elems = 131072;

The program would hang because

uint32_t cnt = expected_state & CNT_MASK;

cnt never decreased to zero so the page could not be reused.

I also encountered this issue sometimes when I used the large size (like heisenbug...):

benchmark/vectoradd/settings.h:

numThreads = 256;
maxPageCacheSize = 12884901888;
n_elems = 1073741824;

readwrite_stripe benchmark is not able to write a large file due to the limit of loffset parameter [in window buffer branch]

Describe the bug
If the IGB heterogeneous large dataset is written to the drive using build/bin/nvm-readwrite_stripe-bench, an error message is displayed: "Option loffset must lower than 4235344"

nvm-readwrite_stripe-bench is one of the benchmark programs and can be compiled from the window buffer branch.

To Reproduce

  1. compile the benchmarks.
  2. download the IGB heterogeneous large dataset, and place the .npy files in the proper paths used by the commands in the next step.
  3. run a series of commands to read the node features from the .npy files and write them into the SSD. Since the number of nodes for each type in this dataset is as below:

num_nodes={'author': 116959896, 'fos': 649707, 'institute': 26524, 'paper': 100000000}

the shell command to write the node features should be like this:

build/bin/nvm-readwrite_stripe-bench --input /mnt/rs4-lwn/large/processed/paper/node_feat.npy --queue_depth 1024 --access_type 1 --num_queues 128 --threads 102400 --n_ctrls 1

build/bin/nvm-readwrite_stripe-bench --input /mnt/rs4-lwn/large/processed/author/node_feat.npy --queue_depth 1024 --access_type 1 --num_queues 128 --threads 102400 --n_ctrls 1 --loffset $((100000000*4096))

build/bin/nvm-readwrite_stripe-bench --input /mnt/rs4-lwn/large/processed/fos/node_feat.npy --queue_depth 1024 --access_type 1 --num_queues 128 --threads 102400 --n_ctrls 1 --loffset $((216959896*4096))

build/bin/nvm-readwrite_stripe-bench --input /mnt/rs4-lwn/large/processed/institute/node_feat.npy --queue_depth 1024 --access_type 1 --num_queues 128 --threads 102400 --n_ctrls 1 --loffset $((217609603*4096))

  4. In order to quickly reproduce this issue, one can jump to executing the last command only, because the problem is that the nvm-readwrite_stripe-bench program cannot accept a large value for the --loffset parameter.

Expected behavior
The node features should be written to the SSD successfully.

Screenshots
An error message will be displayed: "Option loffset must lower than 4235344".

Machine Setup (please complete the following information):

  • OS centos 8 kernel 5.8.0
  • NVIDIA Driver 545.23.08
  • CUDA Versions 12.3
  • GPU name RTX A4000
  • SSD used DapuStor H5100 7.68TB

Additional context
The issue can be fixed by changing "(uint64_t)std::numeric_limits<uint64_t>::max" to "(uint64_t)std::numeric_limits<uint64_t>::max()" in benchmarks/readwrite_stripe/settings.h, i.e., adding the parentheses at the end. I may submit this modification later. Please leave your comments if you have other opinions.

cq_dequeue and move_head_cq

I ran into an issue: threads enqueued commands, and then the threads with a larger pos (the second parameter of cq_dequeue) entered cq_dequeue and called move_head_cq earlier, which could not mark cq->head_mark UNLOCKED because the threads with a smaller pos had not yet marked cq->head_mark LOCKED (they had not been scheduled). The return value of move_head_cq (head_move_count) was therefore always 0. Is this a known issue? Thank you!

uint32_t move_head_cq(nvm_queue_t* q, uint32_t cur_head, nvm_queue_t* sq) {
    uint32_t count = 0;
    (void) sq;

    bool pass = true;
    //uint32_t old_head;
    while (pass) {
        uint32_t loc = (cur_head+count++)&q->qs_minus_1;
        pass = (q->head_mark[loc].val.exchange(UNLOCKED, simt::memory_order_relaxed)) == LOCKED;

Utility to Track IO Statistics (IOPS, Throughput, Latency, Utilization, etc.)

Is your feature request related to a problem? Please describe.

Unable to view NVMe disk level metrics for throughput, IOPS, latency, and other metrics when running benchmark tests.

Describe the solution you'd like

I was able to successfully compile and run the initial benchmark tests with nvm-block-bench. With nvidia-smi, I can see the memory utilization and gpu utilization when running the benchmark.

However, I want to be able to see the NVMe disk metrics as I am doing the runs. Something similar to iostat, or what other SPDK libraries offer that essentially replicate what iostat provides. Or, if there is a way to extract this information from /sys/ then I would like to know what directory the files are in so I can do the calculations on my own.

Describe alternatives you've considered

No alternatives, I do not see any tool like an iostat for spdk to reveal these metrics. Normally when the device is attached to the normal block level driver, you can access this data through tools like iostat.

Additional context
N/A

Error when loading the custom kernel module

Hi, I'm attempting to build the project but when I run sudo make load to load the custom kernel module, it gives me

insmod libnvm.ko max_num_ctrls=64
insmod: ERROR: could not insert module libnvm.ko: Unknown symbol in module
make: *** [Makefile:24: load] Error 1

And I check the dmesg and it reports:

[64532.536098] libnvm: module using GPL-only symbols uses symbols from proprietary module nvidia.
[64532.543921] libnvm: Unknown symbol nvidia_p2p_dma_unmap_pages (err -2)
[64532.543950] libnvm: module using GPL-only symbols uses symbols from proprietary module nvidia.
[64532.551627] libnvm: Unknown symbol nvidia_p2p_get_pages (err -2)
[64532.551636] libnvm: module using GPL-only symbols uses symbols from proprietary module nvidia.
[64532.559403] libnvm: Unknown symbol nvidia_p2p_put_pages (err -2)
[64532.559411] libnvm: module using GPL-only symbols uses symbols from proprietary module nvidia.
[64532.567112] libnvm: Unknown symbol nvidia_p2p_dma_map_pages (err -2)
[64532.567119] libnvm: module using GPL-only symbols uses symbols from proprietary module nvidia.
[64532.574778] libnvm: Unknown symbol nvidia_p2p_free_page_table (err -2)

I wonder what is the problem and how to solve it?

I run the build on an aws instance with

  • Linux version 5.15.0-1038-aws #43~20.04.1-Ubuntu
  • Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
  • Nvidia V100 GPU
  • Driver version: 525.85.12, CUDA version: 11.8

Thanks

Multi-GPU control

Currently, each SSD can only be controlled by one GPU. Is it possible for multiple GPUs to control and read from one SSD?

Add ROCm Support

Creating a thread to keep track of potential enhancements that might need to be added in the future. We do not have access to AMD GPUs to enable this support at the moment.

[ioctl_map] Page mapping kernel request failed: Invalid argument

Describe the bug
Hi ! I'm attempting to run the examples but when I run ./bin/nvm-array-bench it gives me:

SQs: 64 CQs: 64 n_qps: 1
[ioctl_map] Page mapping kernel request failed (ptr=0x7fcf34210000, n_pages=1): Invalid argument
Unexpected error: Failed to map device memory: Invalid argument
Machine Setup (please complete the following information):

Machine Setup

  • OS: ubuntu 22.04, linux-5.10.213
  • Driver Version: 535.161.07, CUDA Version: 12.2
  • GPU: Geforce RTX 3060
  • SSD: PC SN730 NVMe SSD

I wonder what is the problem and how to solve it ?
Thanks!

nvm-block-bench causes Segmentation fault

Hi, I'm able to successfully build the project, but when I run the block benchmark as instructed:

sudo ./bin/nvm-block-bench --threads=262144 --blk_size=64 --reqs=1 --pages=262144 --queue_depth=1024  --page_size=512 --num_blks=2097152 --gpu=0 --n_ctrls=1 --num_queues=128 --random=true

However, it simply gives me a Segmentation fault without any additional information. I'm not sure where the problem is.

I build the project on an AWS instance with

  • Linux version 5.4.0-1089-aws #97~18.04.1-Ubuntu
  • Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
  • Nvidia V100 GPU
  • Driver version: 515.65.01, CUDA version: 11.7

Thanks

BaM bandwidth stops increasing when the number of NVMe SSDs is more than 7

Hi there,

I'm doing benchmark testing on my machine, which is configured with some H800 GPUs and 8 NVMe SSDs dedicated to BaM.

The GPU is configured with PCIe5 x16 and each NVMe SSD with PCIe4 x4, which means in theory the max bandwidth of the GPU is around 60 GBps and the max bandwidth of a single NVMe SSD is around 7.5 GBps.

But according to my testing using "nvm-block-bench", the result is not as expected. I summarize the result here: https://raw.githubusercontent.com/LiangZhou9527/some_stuff/8b48038465858846f864e43cef6d0e6df787a2c2/BaM%20bandwidth%20and%20the%20number%20of%20NVMe.png

In the picture we can see that the bandwidth with 6 and 7 NVMe SSDs is almost the same, but when the number of NVMe SSDs reaches 8, the bandwidth drops a lot.

Any thoughts about what happens here?

BTW, I didn't enable the IOMMU on my machine, and the benchmark command line is as below (I executed the command 8 times, each time with a different --n_ctrls value: 1, 2, ..., 8):

./bin/nvm-block-bench --threads=262144 --blk_size=64 --reqs=1 --pages=262144 --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=0 --num_queues=128 --random=true -S 1 --n_ctrls=1

What does `ctrl.ctrl->page_size * uints_per_page` mean?

Hi, I'm looking at the following line of code in include/page_cache.h, and I'm trying to understand the meaning of ctrl.ctrl->page_size * uints_per_page in this snippet:

        const uint32_t uints_per_page = ctrl.ctrl->page_size / sizeof(uint64_t);
        if ((pdt.page_size > (ctrl.ctrl->page_size * uints_per_page)) || (np == 0) || (pdt.page_size < ctrl.ns.lba_data_size))
            throw error(string("page_cache_t: Can't have such page size or number of pages"));

Could you please explain what the expression (ctrl.ctrl->page_size * uints_per_page) calculates? Thank you!

The program (BFS) error

Hi, I ran benchmarks/bfs and in some cases the following error is reported:

GPUassert: an illegal memory access was encountered /home/xx/bam/benchmarks/bfs/main.cu 1732
Iter 1 Time: fa ms
GPUassert: an illegal memory access was encountered /home/xx/bam/benchmarks/bfs/main.cu 1742
GPUassert: an illegal memory access was encountered /home/xx/bam/benchmarks/bfs/main.cu 1743
GPUassert: an illegal memory access was encountered /home/xx/bam/benchmarks/bfs/main.cu 1744

When modifying the number of threads or running repeatedly, the error may not occur, but the program may hang, leaving the SSD busy. The error message is as follows:
[nvm_raw_ctrl_reset] Timeout exceeded while waiting for controller reset

What I want to know is whether these problems are caused by the environment configuration or by the code. My environment is:

  • Linux 5.15.81-generic
  • CPU: Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz
  • GPU: Nvidia A100
  • Driver Version: 470.161.03 CUDA Version: 11.4

page index in acquire_page

Hi, I think the following code may cause invalid accesses:

include/page_cache.h:1477

coalesce_page(lane, mask, r, page, gaddr, false, eq_mask, master, count, base_master);
page_ = &r_->pages[base_master];

The returned base_master could exceed the number of pages in the current range if the size of the page cache is larger than the size of the range, which causes an invalid access of r_->pages. Could you help check whether this is an issue, or have I misused bam_ptr here? Thanks!

Error when making benchmarks

Hi, I'm attempting to build the project but when I run sudo make benchmarks, it gives me

ptxas /tmp/tmpxft_00001971_00000000-8_main.compute_70.ptx, line 10944; error : Unknown modifier '.mmio'
ptxas /tmp/tmpxft_00001971_00000000-8_main.compute_70.ptx, line 11276; error : Unknown modifier '.mmio'
ptxas /tmp/tmpxft_00001971_00000000-8_main.compute_70.ptx, line 11608; error : Unknown modifier '.mmio'
ptxas fatal : Ptx assembly aborted due to errors
CMake Error at iodepth-block-benchmark-module_generated_main.cu.o.cmake:279 (message):
Error generating file
/nvme/bam/build/benchmarks/iodepth-block/CMakeFiles/iodepth-block-benchmark-module.dir//./iodepth-block-benchmark-module_generated_main.cu.o

I wonder what is the problem and how to solve it?

I run the build on a node with

Machine Setup (please complete the following information):

  • Linux gn15 5.4.0-47-generic #51~18.04.1-Ubuntu
  • Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz
  • NVIDIA A100 GPU
  • Driver version: nvidia-515.65.01, CUDA version: 12.1
  • NVMe SSD

Thanks!
