
Chai

v1.0-alpha

Overview

Chai is a benchmark suite of Collaborative Heterogeneous Applications for Integrated-architectures. The Chai benchmarks are designed to use the latest features of heterogeneous architectures such as shared virtual memory and system-wide atomics to achieve efficient simultaneous collaboration between host and accelerator devices.

Each benchmark has multiple implementations. This release includes the OpenCL-D, OpenCL-U, CUDA-D, CUDA-U, CUDA-D-Sim, and CUDA-U-Sim implementations. The C++ AMP implementations are underway. If you would like early access to preliminary versions, please contact us.

Instructions

Clone the repository:

git clone https://github.com/chai-benchmarks/chai.git
cd chai

Export environment variables:

export CHAI_OCL_LIB=<path/to/OpenCL/lib>
export CHAI_OCL_INC=<path/to/OpenCL/include>

Select desired implementation:

cd OpenCL-U

Select desired benchmark:

cd BFS

Compile:

make

Execute:

./bfs

For help instructions:

./bfs -h

Citation

Please cite the following paper if you find our benchmark suite useful:

  • J. Gómez-Luna, I. El Hajj, L.-W. Chang, V. Garcia-Flores, S. Garcia de Gonzalo, T. Jablin, A. J. Peña, and W.-M. Hwu. Chai: Collaborative Heterogeneous Applications for Integrated-architectures. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2017.

Chai Benchmarks for CPU-FPGA Systems

The FPGA-synthesizable version of the Chai benchmarks can be found in the chai-fpga repository.

chai's People

Contributors

cwpearson, el1goluj, hst10, ielhajj, jlgreathouse, luisremis


chai's Issues

TRNS CPU GPU

Hi,
The Chai paper mentions that SC, PAD, and TRNS support only GPU execution in the OpenCL-D benchmarks. I have not tried the OpenCL-D ones, but the CUDA-D versions of SC and PAD execute on the CPU and GPU together. I was wondering whether it is also possible to do this for the CUDA-D version of TRNS, or is it not possible at all? Thanks

Best,
Trinayan

Larger input sets

Hi,

Apart from the benchmarks where the user can specify the data size, are there any larger input sets available for the benchmarks where the user passes a file? Thank you

Best,
Trinayan

Timer problem in CEDD

Hello everybody.
I'm executing CEDD with the default configuration and the output seems really strange.
Randomly, one of the timers reports an enormous number.

CEDD output

Spectre -p 0 -d 0 -i 16 -a 0.20 -t 4
Allocation Time (ms): 0.071000
Initialization Time (ms): 2532.493000
Total Proxies Time (ms): 436.893000
CPU Proxy:
CPU Proxy: Kernel Time (ms): 432.369000
GPU Proxy:
GPU Proxy: Copy To Device Time (ms): 1501703139477.527100
GPU Proxy: Kernel Time (ms): 58.478000
GPU Proxy: Copy Back Time (ms): 9.976000
Deallocation Time (ms): 8.021000
Test Passed

Spectre -p 0 -d 0 -i 16 -a 0.20 -t 4
Allocation Time (ms): 0.073000
Initialization Time (ms): 2583.880000
Total Proxies Time (ms): 427.340000
CPU Proxy:
CPU Proxy: Kernel Time (ms): 1501703148589.305908
GPU Proxy:
GPU Proxy: Copy To Device Time (ms): 11.677000
GPU Proxy: Kernel Time (ms): 54.310000
GPU Proxy: Copy Back Time (ms): 9.947000
Deallocation Time (ms): 8.555000
Test Passed

I also executed BS, and this never happened there.
Do you have any idea what could be causing this?

Gabriel

Error : Operation not supported

Hi,

I am getting an error when I try to run the CUDA-U benchmarks on my Jetson TX2. It says "Operation not supported", which is returned from the CUDA_ERR() function after the code calls cudaMallocManaged. But cudaMallocManaged definitely works on the TX2, because it runs CUDA 8 on Pascal, and there are NVIDIA samples that use it.

What could be the issue here? Thanks again.

Best,
Trinayan

GPU+nCPU

Does GPU+nCPU mean running the data-partitioned benchmark with n CPU threads, all with the default partitioning from your paper?

memory placement consideration for PAD

For some reason the CUDA-U implementation of PAD is taking a long time in the kernel, likely waiting for data. I was able to make the following change to bring the kernel time in line with the CUDA-D version.

#ifdef CUDA_8_0
    T * h_in_out;
    cudaStatus = cudaMallocManaged(&h_in_out, in_size * sizeof(T));
    // Advise the driver to keep the managed buffer resident on device 0,
    // so the kernel does not page-fault the data over on first touch.
    cudaMemAdvise(h_in_out, in_size * sizeof(T), cudaMemAdviseSetPreferredLocation, 0);

I realize that this change increases the Allocation phase of the program, but it seems more reasonable for that cost to show up there. It's unclear to me how this setting may impact other results with CUDA Chai.
I'm opening this for discussion and consideration.

Using matrix market format for bfs

I would like to feed the road_usa graph to CUDA-D/BFS.
The input is in Matrix Market format, which, to my understanding, is a valid format for BFS in this suite.

$ head -n 10 input/road_usa.mtx
23947347 23947347 28854312
2 1
710 1
1049674 1
2097155 2
10 3
2097153 3
2097157 3
1904 4
1048579 4

However, I get a segmentation fault like this:

$ ./bfs -f input/road_usa.mtx
TITAN V Segmentation fault (core dumped)

I would like to know whether this is a valid input for Chai or not. The two bundled input graphs are too small for my purposes.

Cuda version

Hi,

Which CUDA version is required to run these benchmarks? Also, since there is a CUDA-U version, I believe it is possible to run it on the Jetson boards?

Best,
Trinayan

Update for CEDT

Since CEDT always works on the GPU first, before the CPU, I recommend inserting the following to streamline the initial allocation. I'm finding that the initial copy-out for the CUDA-U implementation is otherwise buried in the initialization. The kernel time thus performs equivalently to the CUDA-D implementation.

diff --git a/CUDA-U/CEDT/main.cpp b/CUDA-U/CEDT/main.cpp
index 5cf954c..2ff06dd 100644
--- a/CUDA-U/CEDT/main.cpp
+++ b/CUDA-U/CEDT/main.cpp
@@ -162,6 +162,7 @@ int main(int argc, char **argv) {
     cudaStatus = cudaMallocManaged(&h_in_out, (p.n_warmup + p.n_reps)*sizeof(unsigned char *));
     for(int i = 0; i < p.n_warmup + p.n_reps; i++) {
         cudaStatus = cudaMallocManaged(&h_in_out[i], in_size);
+        cudaMemAdvise(&h_in_out[i], in_size, cudaMemAdviseSetPreferredLocation, p.device);
     }
     unsigned char * h_interm = (unsigned char *)malloc(in_size);
     ALLOC_ERR(h_interm);

CUDA versions release

Hi,

Are there any plans to release the CUDA and CUDA-Sim versions of Chai soon?

Best,
Trinayan

CEDD issue

Hi,

On my NVIDIA device, the CEDD benchmark sometimes exits with an "invalid resource handle" error. The failure is unpredictable and happens only occasionally; I don't know why. It might be a thread race, since it is completely nondeterministic. Have you faced similar issues?

result mismatch in BFS

Running the BFS example in OpenCL-U/D shows:

Computed node 2 cost (-2147483647 != 45) does not match the expected value

Thanks

CUDA-U Synchronization Deadlock for BFS

I have reached a state where the GPU kernel is waiting for the CPU threads to finish, but the CPU threads have already exited according to the debugger. All runtime parameters are at their defaults. The GPU appears to be done, with threads_end[0] = 2048; threads_run[0] returns 44, which looks to be the share of the jobs the CPUs took, and is less than iter, which is at 45 (with no CPU thread now active). So the program hangs at line 140 of kernel.cu.

This is also happening to me on the SSSP CUDA-U implementation.

Can you explain how the synchronization behavior is supposed to work, so I can root-cause this? If a private thread is better: robers97 at gmail.com

[New Thread 0x7ffff0b8f190 (LWP 47429)]
Tesla V100-SXM2-16GB [New Thread 0x7fffdffff190 (LWP 47430)]
Allocation Time (ms): 52.675583
Number of nodes = 264346 Number of edges = 733846 Initialization Time (ms): 243.190781
[New Thread 0x7fffdf6af190 (LWP 47433)]
[New Thread 0x7fffdee9f190 (LWP 47434)]
[New Thread 0x7fffde68f190 (LWP 47435)]
[Thread 0x7fffde68f190 (LWP 47435) exited]
[Thread 0x7fffdee9f190 (LWP 47434) exited]
[Thread 0x7fffdf6af190 (LWP 47433) exited]
^C
Thread 1 "bfs" received signal SIGINT, Interrupt.
0x00000000107e0bc0 in _INTERNAL_42_tmpxft_0000db2d_00000000_10_kernel_cpp1_ii_a8c07a88::atomicAdd_system (address=,
val=) at /usr/local/cuda/include/sm_60_atomic_functions.hpp:92
92 return __iAtomicAdd_system(address, val);
(cuda-gdb) where
#0 0x00000000107e0bc0 in _INTERNAL_42_tmpxft_0000db2d_00000000_10_kernel_cpp1_ii_a8c07a88::atomicAdd_system (address=,
val=) at /usr/local/cuda/include/sm_60_atomic_functions.hpp:92
#1 0x00000000107e5320 in BFS_gpu<<<(8,1,1),(256,1,1)>>> (graph_nodes_av=0x7fff80000000, graph_edges_av=0x7fff80400000,
cost=0x7fff80a00000, color=0x7fff80204600, q1=0x7fff80c00000, q2=0x7fff80e00000, n_t=0x7fff81000800, head=0x7fff81000000,
tail=0x7fff81000200, threads_end=0x7fff81000400, threads_run=0x7fff81000600, overflow=0x7fff81000a00, LIMIT=128, CPU=1)
at kernel.cu:140

(cuda-gdb) print iter
No symbol "iter" in current context.
(cuda-gdb) print CPU
No symbol "CPU" in current context.
(cuda-gdb) up
#1 0x00000000107e5320 in BFS_gpu<<<(8,1,1),(256,1,1)>>> (graph_nodes_av=0x7fff80000000, graph_edges_av=0x7fff80400000,
cost=0x7fff80a00000, color=0x7fff80204600, q1=0x7fff80c00000, q2=0x7fff80e00000, n_t=0x7fff81000800, head=0x7fff81000000,
tail=0x7fff81000200, threads_end=0x7fff81000400, threads_run=0x7fff81000600, overflow=0x7fff81000a00, LIMIT=128, CPU=1)
at kernel.cu:140
140 while(atomicAdd_system(&threads_run[0], 0) < iter) {
(cuda-gdb) print iter
$1 = 45
(cuda-gdb) print &threads_run[0]
$2 = (@Managed @Generic int *) 0x7fff81000600
(cuda-gdb) print threads_run[0]
$3 = 44

CUDA Unified memory enhancements

Hi,
Do you have any plans supporting CUDA Unified memory enhancements?
More specifically, I'm interested in

  1. Taking advantage of the cudaMemAdvise API
  2. Taking advantage of cache coherence support via NVLink 2.0 in POWER9 systems (starting from CUDA 9)

Thanks, Igor.
