
Artifact Submission: cross-corr

This is a replication package containing the code and experimental results for the JPDC paper Efficient GPU-accelerated Parallel Cross-correlation.

Overview

The artifact comprises the following directories:

  • benchmark -- benchmarking scripts
  • plots -- scripts for generating results plots
  • repo -- the CUDA implementation of cross-correlation
    • cross-corr -- main codebase containing all the kernel source files
    • one-to-one-s -- a copy of the codebase with modified one-to-one grouped-overlap kernels that fully saturate the GPU in specific micro-benchmarks
  • presented-results -- plots (including those omitted from the paper due to the page limit) and the CSV measurement data from our GPU cluster that were used to generate them

Detailed artifact contents

The repo/cross-corr/src directory contains the source files of the CUDA kernels. The kernel names in this directory are in their original form; in the paper, we renamed them for better readability and easier understanding. Here is the translation table for the kernel names:

| Optimization name in the paper | Kernel name in the source files |
|---|---|
| overlap-wise | cross_corr_naive_original |
| warp-per-overlap | ccn_warp_per_shift |
| split-row | ccn_shuffle_work_distribution |
| grouped-overlap | ccn_shuffle_multirow_both |
| multi-matrix-right | ccn_shuffle_multimat_right |
| multi-matrix-both | ccn_shuffle_multimat_both |

Note that the table omits the combinations of the aforementioned optimizations, which have their own separate kernel names (enumerated below). The source files also include kernels that are not described in the paper, because they provided no advantage over the presented optimizations. For the full list, see the source README.

The compilation of the cross-correlation binary is governed by the cross-corr/CMakeLists.txt CMake file. The compilation time varies with the parameters passed to CMake during configuration (also documented in the source README). These parameters specify the maximum value of an argument of a specific kernel (e.g., the maximum number of right matrices per thread for the multi-matrix-right algorithm). They therefore determine how many different specializations of the same kernel are built and, consequently, how long the compilation takes. The full build required for performing all experiments takes around 6 hours on our hardware.
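
To illustrate why these limits matter, here is a minimal sketch (hypothetical code, not taken from the artifact; MAX here stands in for one of the CMake parameters) of the usual pattern that turns a runtime argument into a constexpr kernel parameter. Each value up to the limit produces a separate template instantiation, so higher limits directly translate into more kernels to compile:

```cuda
// Sketch: one kernel instantiation per allowed value of RIGHT_MATS.
template <int RIGHT_MATS>
__global__ void kernel_spec(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * RIGHT_MATS;  // placeholder body
}

// Recursively instantiates kernel_spec<1> .. kernel_spec<MAX> and launches
// the one that matches the runtime value.
template <int MAX>
void launch_spec(const float* in, float* out, int n, int right_mats) {
    if (right_mats == MAX)
        kernel_spec<MAX><<<(n + 255) / 256, 256>>>(in, out, n);
    if constexpr (MAX > 1)
        launch_spec<MAX - 1>(in, out, n, right_mats);
}
```

With, e.g., launch_spec<8>, nvcc emits eight specializations of kernel_spec; the real kernels are much larger, which is why the full build takes hours.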

Kernels description

For the purposes of documentation and reproducibility, the GPU kernel source code is comprehensively commented. Let us give a brief overview of the most important kernels, with links to their documented source code.

Kernel ccn_shuffle is the starting point for the optimizations of the warp-shuffle family (all presented optimizations apart from warp-per-overlap); it is located in the naive_shuffle.cu file on line 171. The responsibility of ccn_shuffle is to compute thread-specific arguments and pass them to the "business" kernel warp_shuffle_impl, which performs the computation (located on line 87). As described in the publication, each kernel thread computes a single overlap while using a smart caching mechanism of register buffers. The warp-wide buffer spans the x-axis, so the first loop on line 93 iterates over the y-axis. Inside the loop, the buffer is initialized on line 103. Then the loop over the x-axis on line 109 follows. Inside it, the buffer is gradually filled on line 132, and the threads cooperatively move the buffer values using shuffle instructions on lines 136 (the right-matrix broadcast) and 147 (the left-matrix shuffle-down). On line 152, the computed value is stored in the output matrix.
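
The following is a heavily simplified 1D sketch of this mechanism (hypothetical code; warp_shuffle_impl itself is 2D and more general). Each lane of a single warp keeps one element of the left input in a register, the right-hand value is broadcast from lane 0, and the warp-wide buffer is rotated with a shuffle-down, so that lane l accumulates the overlap at shift l without re-reading global memory:

```cuda
// Sketch: one warp computes the first 32 non-negative shifts of a 1D
// cross-correlation. Launch with a single 32-thread block.
__global__ void shuffle_buffer_sketch(const float* left, const float* right,
                                      float* out, int n) {
    int lane = threadIdx.x % 32;
    float buf = (lane < n) ? left[lane] : 0.f;  // warp-wide register buffer
    float acc = 0.f;
    for (int i = 0; i < n; ++i) {
        float r = (lane == 0) ? right[i] : 0.f;
        r = __shfl_sync(0xffffffff, r, 0);           // right-value broadcast
        acc += buf * r;                              // lane l holds left[l + i]
        buf = __shfl_down_sync(0xffffffff, buf, 1);  // rotate buffer down
        if (lane == 31)                              // refill the vacated slot
            buf = (32 + i < n) ? left[32 + i] : 0.f;
    }
    if (lane < n) out[lane] = acc;  // acc = overlap at shift `lane`
}
```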

The rest of the kernels from this family build upon warp_shuffle_impl by adding a specific optimization. The ccn_shuffle_work_distribution kernel provides finer granularity by splitting one overlap by rows across multiple threads (the split-row optimization); it is located in the naive_shuffle.cu file on line 307. As with ccn_shuffle, this kernel only computes the thread-specific arguments, which are then passed to the same kernel as before, warp_shuffle_impl. That kernel was designed so that it can compute a restricted y-axis range of the overlap; therefore, multiple threads can compute a single overlap simultaneously. The only difference from the previous use of warp_shuffle_impl is the synchronization: stores to the output matrix are performed using atomic instructions.
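
As a minimal sketch of this idea (hypothetical code, far simpler than the real kernel, computing only the zero-shift overlap), several threads each reduce a slice of the rows and merge their partial sums atomically:

```cuda
// Sketch: split one overlap by rows across several threads and combine the
// per-thread partial sums with atomicAdd. *out must be zero-initialized.
__global__ void split_row_sketch(const float* left, const float* right,
                                 float* out, int rows, int cols,
                                 int rows_per_thread) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int slices = (rows + rows_per_thread - 1) / rows_per_thread;
    if (tid >= slices) return;                    // one thread per row slice
    int begin = tid * rows_per_thread;
    int end = min(begin + rows_per_thread, rows);
    float partial = 0.f;
    for (int r = begin; r < end; ++r)             // this thread's rows only
        for (int c = 0; c < cols; ++c)
            partial += left[r * cols + c] * right[r * cols + c];
    atomicAdd(out, partial);                      // synchronization point
}
```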

Kernel ccn_shuffle_multirow_both, presented as grouped-overlap, favors more extensive data sharing and is located in naive_shuffle_multirow_both.cu on line 397. The arguments computed in the wrapper kernel are passed to the shuffle_multirow_both_impl_dispatch helper, which calls the "business" kernel shuffle_multirow_both_impl (located on line 295) with the proper constexpr parameters. As described in the publication, the kernel computes the overlap in three phases: the init phase (startup, line 304), the main phase (compute_row_group, lines 316 and 333), and the final phase (wind_down, line 342). All phases eventually call compute_row_group, just with different sets of parameters. This kernel extends ccn_shuffle, so the two share many parts. For more details on how a single thread computes multiple neighboring overlaps using multiple warp-wide buffers, see the comments in compute_row_group.
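
The data-sharing idea itself can be sketched in 1D (hypothetical code; the real kernel operates on 2D row groups and adds the three phases above): when one thread computes several neighboring overlaps, every value it loads is reused once per overlap, cutting memory traffic roughly by the group size:

```cuda
// Sketch: each thread computes SHIFTS_PER_THREAD neighboring shifts of a 1D
// cross-correlation, reusing each loaded right-hand value for all of them.
template <int SHIFTS_PER_THREAD>
__global__ void grouped_overlap_sketch(const float* left, const float* right,
                                       float* out, int n, int num_shifts) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * SHIFTS_PER_THREAD;
    float acc[SHIFTS_PER_THREAD] = {};        // one accumulator per shift
    for (int i = 0; i < n; ++i) {
        float r = right[i];                   // loaded once ...
        #pragma unroll
        for (int s = 0; s < SHIFTS_PER_THREAD; ++s) {
            int j = i + base + s;
            if (j < n) acc[s] += left[j] * r; // ... reused for every shift
        }
    }
    #pragma unroll
    for (int s = 0; s < SHIFTS_PER_THREAD; ++s)
        if (base + s < num_shifts) out[base + s] = acc[s];
}
```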

The multi-matrix-right and multi-matrix-both optimizations are both adapted to work with the two aforementioned optimizations. Therefore, some kernel names are formed as a composition of the optimizations they combine (such as ccn_shuffle_one_to_many_multirow_both_multimat_right). Let us start with the bare multi-matrix-right optimization, located in repo/cross-corr/src/naive_shuffle_multimat_right.cu on line 234. Again, the wrapper kernel calls a helper, warp_shuffle_impl_right_mats_dispatch, which calls the important warp_shuffle_impl kernel on line 117. This kernel is almost identical to warp_shuffle_impl from naive_shuffle.cu; the difference is that each thread holds buffers for multiple right matrices instead of a single one. multi-matrix-both works similarly, keeping multiple warp-wide buffers.
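
Again as a hypothetical 1D sketch: the essence of multi-matrix-right is that one thread computes the same shift against several right inputs, so every left value it obtains (via a plain load here, via the shuffle buffers in the real kernels) is reused once per right matrix:

```cuda
// Sketch: each thread computes one shift against RIGHT_MATS right-hand
// inputs, stored back to back in `rights` (RIGHT_MATS * n elements).
template <int RIGHT_MATS>
__global__ void multimat_right_sketch(const float* left, const float* rights,
                                      float* out, int n, int num_shifts) {
    int shift = blockIdx.x * blockDim.x + threadIdx.x;
    if (shift >= num_shifts) return;
    float acc[RIGHT_MATS] = {};               // one accumulator per matrix
    for (int i = 0; i + shift < n; ++i) {
        float l = left[i + shift];            // obtained once ...
        #pragma unroll
        for (int m = 0; m < RIGHT_MATS; ++m)  // ... reused for every matrix
            acc[m] += l * rights[m * n + i];
    }
    for (int m = 0; m < RIGHT_MATS; ++m)
        out[m * num_shifts + shift] = acc[m];
}
```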

Here we enumerate the implemented combinations of these optimizations together with their code locations. The source code of these kernels is mainly a composition of the kernels described above, plus some corner-case handling. Read the source code comments for further guidance.

| Optimization combination | Kernel name | Location |
|---|---|---|
| multi-matrix-right + grouped-overlap | ccn_shuffle_one_to_many_multirow_both_multimat_right | here |
| multi-matrix-right + split-row | ccn_shuffle_multimat_right_work_distribution | here |
| multi-matrix-both + grouped-overlap | ccn_n_to_m_shuffle_multirow_both_multimat_both | here |
| multi-matrix-both + split-row | ccn_shuffle_n_to_m_multimat_both_work_distribution | here |

Lastly, the warp-per-overlap optimization does not reuse any parts of the code mentioned above, since it computes each overlap with a whole warp instead of a single thread. The kernel name is ccn_warp_per_shift and it is located in naive_warp_per_shift.cu on line 22. The most important part is the for loop on line 68, where a warp cooperatively computes a single overlap. Below the loop, the intermediate thread data is reduced into a final result, which is stored into the output matrix.
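
A minimal 1D sketch of this scheme (hypothetical code, not the artifact's kernel): the lanes of a warp stride over the elements of one overlap, and the per-lane partial sums are combined with a standard shuffle-based tree reduction before lane 0 writes the result:

```cuda
// Sketch: one warp per shift; lanes cooperate on a single 1D overlap.
__global__ void warp_per_shift_sketch(const float* left, const float* right,
                                      float* out, int n, int num_shifts) {
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x % 32;
    if (warp_id >= num_shifts) return;       // uniform for the whole warp
    float sum = 0.f;
    for (int i = lane; i + warp_id < n; i += 32)   // cooperative strided loop
        sum += left[i + warp_id] * right[i];
    for (int off = 16; off > 0; off >>= 1)         // warp tree reduction
        sum += __shfl_down_sync(0xffffffff, sum, off);
    if (lane == 0) out[warp_id] = sum;             // final result
}
```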

Requirements for running the experiments

Hardware requirements:

  • A CUDA-compatible GPU

Software requirements:

  • CUDA toolkit 12.2 or later and appropriate driver
  • cmake 3.18 or later
  • the boost and nlohmann-json libraries
  • Python 3 for the benchmarking scripts
  • R software for plotting the graphs (see details below)

For your convenience, here are a few commands that will allow you to set up the environment quickly:

Installing all dependencies (except CUDA and GPU driver) on Debian/Ubuntu:

sudo apt-get update && sudo apt-get install -y g++ cmake r-base python3 python3-pip libboost-all-dev nlohmann-json3-dev

Installing all dependencies (except CUDA and GPU driver) on RHEL-like distribution:

sudo dnf install -y cmake gcc-c++ R python3 python3-pip boost-devel json-devel

Installing the Python packages required by our benchmarking suite:

sudo pip3 install numpy pandas scipy ruamel.yaml

R packages necessary for generating the plots:

sudo R -e "install.packages(c('ggplot2', 'cowplot', 'sitools', 'viridis', 'dplyr'), repos='https://cloud.r-project.org')"

Note: If the cmake version is 3.24 or newer, cmake is able to discover the native CUDA Compute Capability (CC) of the GPU, for which the kernels will then be compiled. With older cmake versions, the CUDA compilation defaults to generating CC 5.0 PTX code. For the best results, it is advised to set this attribute manually according to the hardware at hand. To do so, modify line 119 in repo/cross-corr/CMakeLists.txt and repo/one-to-one-s/CMakeLists.txt.
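
Alternatively (an untested suggestion; this assumes the project's CMakeLists.txt does not unconditionally overwrite the value), the standard CMAKE_CUDA_ARCHITECTURES variable can be set at configure time, e.g. for an Ampere GPU with CC 8.0:

cmake -S repo/cross-corr -B build -DCMAKE_CUDA_ARCHITECTURES=80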

Running the experiments

Our experiments are designed to provide a comprehensive analysis of the aforementioned algorithms, running various combinations of parameters over different input sizes. The overall duration of the experiments is therefore quite long (currently about 2 to 3 days on our GPU cluster).

To provide a swift way to check the reproducibility of our experiments, we prepared a special script that runs only a subset of benchmarks.

Kick the tires:

Just to see whether the code is working, run the following from the root directory:

./kick-the-tires.sh

The script should take about 10 minutes to finish. First, it builds the binary, passing a specific combination of macro parameters to ensure that only the kernels for one-to-one inputs are built. The script then runs a subset of the one-to-one experiments.

After the script finishes, it will have generated results in CSV format in the results directory. The directory should contain 5 CSV files, one for each measured algorithm; each CSV file contains self-documenting headers. Finally, the plotting script is executed, generating a single plot in the plots/one-to-one-fast directory. The benchmark/benchmark.yml file specifies which combinations of algorithms, parameters, and input sizes are benchmarked (note the one_to_one_fast section starting on line 482). More details on how the CSV result rows are processed into plots can be found in the plots/plot_fast.R script.

The generated plot file will be named one-to-one.pdf; it shows a comparison of the best one-to-one cross-correlation implementations with the baseline (the naive overlap-wise approach) and an FFT implementation.

Complete set of measurements:

To run the complete benchmark, execute

./run-all.sh

Measured results

The results measured by ./run-all.sh on our GPU cluster are stored in the presented-results directory. The presented-results/data directory is divided into three subdirectories, each containing results from a different GPU architecture:

  • ada -- NVIDIA Tesla L40 PCIe 48 GB
  • ampere -- NVIDIA Tesla A100 PCIe 80 GB
  • volta -- NVIDIA Tesla V100 SXM2 32 GB

The presented-results/plots directory contains only the figures generated for the ampere architecture. To generate the plots for the other architectures, run the plot scripts (in the plots directory) on the corresponding data. Here we describe each figure in the presented-results/plots directory:

| Plot | Description | Figure number in paper |
|---|---|---|
| one-to-one/grouped-overlap.pdf | one-to-one grouped-overlap optimization | Figure 15 |
| one-to-one/grouped-overlap-*.pdf | one-to-one grouped-overlap optimization with different combinations of cached left rows, cached shifts, and block size | not included |
| one-to-one/split-row.pdf | one-to-one split-row optimization | Figure 16 |
| one-to-one/one-to-one.pdf | comparison of all one-to-one algorithms and optimizations | Figure 17 |
| one-to-one/warp-per-overlap.pdf | one-to-one warp-per-overlap optimization | not included |
| one-to-one/warp-per-overlap-shared-memory.pdf | one-to-one warp-per-overlap with shared-memory optimization | not included |
| one-to-one/one-to-one-warp-per-overlap-and-split-row.pdf | comparison of the one-to-one warp-per-overlap optimizations and the split-row optimization | not included |
| one-to-many/multimat-right-grouped-overlap.pdf | one-to-many multi-matrix-right grouped-overlap optimization | Figure 18 |
| one-to-many/multimat-right-grouped-overlap-*.pdf | one-to-many multi-matrix-right grouped-overlap optimization with different combinations of cached left rows, cached shifts, right matrices per thread, and inputs | not included |
| one-to-many/multimat-right-split-row.pdf | one-to-many multi-matrix-right split-row optimization | Figure 19 |
| one-to-many/multimat-right-split-row-*.pdf | one-to-many multi-matrix-right split-row optimization with different combinations of overlap rows, right matrices per thread, and inputs | not included |
| one-to-many/multimat-right.pdf | one-to-many multi-matrix-right optimization | not included |
| one-to-many/one-to-many.pdf | comparison of all one-to-many algorithms and optimizations | Figure 20 |
| n-to-mn/multimat-right-grouped-overlap.pdf | n-to-mn multi-matrix-right grouped-overlap optimization | not included |
| n-to-mn/multimat-right-grouped-overlap-*.pdf | n-to-mn multi-matrix-right grouped-overlap optimization with different combinations of cached left rows, cached shifts, right matrices per thread, streams, and inputs | not included |
| n-to-mn/multimat-right-split-row.pdf | n-to-mn multi-matrix-right split-row optimization | not included |
| n-to-mn/multimat-right-split-row-*.pdf | n-to-mn multi-matrix-right split-row optimization with different combinations of overlap rows, right matrices per thread, and inputs | not included |
| n-to-mn/multimat-right.pdf | n-to-mn multi-matrix-right optimization | not included |
| n-to-mn/n-to-mn.pdf | comparison of all n-to-mn algorithms and optimizations | not included |
| n-to-mn/n-to-mn-one-to-many-streams.pdf | comparison of the grouped-overlap and split-row optimizations on n-to-mn input, running serial one-to-many kernels versus parallel n-to-mn kernels using streams | not included |
| n-to-m/multimat-both-grouped-overlap.pdf | n-to-m multi-matrix-both grouped-overlap optimization | Figure 21 |
| n-to-m/multimat-both-grouped-overlap-*.pdf | n-to-m multi-matrix-both grouped-overlap optimization with different combinations of cached left rows, cached shifts, and right/left matrices per thread | not included |
| n-to-m/multimat-both-split-row.pdf | n-to-m multi-matrix-both split-row optimization | Figure 22 |
| n-to-m/multimat-both-split-row-*.pdf | n-to-m multi-matrix-both split-row optimization with different combinations of overlap rows, left/right matrices per thread, and inputs | not included |
| n-to-m/multimat-both.pdf | n-to-m multi-matrix-both optimization | not included |
| n-to-m/n-to-m.pdf | comparison of all n-to-m algorithms and optimizations | Figure 23 |
