chai-benchmarks / chai
Hi,
Apart from the benchmarks where the user can specify the size of the data, are there any larger input sources available for the other benchmarks, where the user passes a file? Thank you.
Best,
Trinayan
Hi,
The Chai paper mentions that SC, PAD, and TRNS support only GPU execution in the OpenCL-D benchmarks. I did not try the OpenCL-D ones, but the CUDA-D versions of SC and PAD execute on the CPU and GPU together. I was wondering whether the same is possible for the CUDA-D version of TRNS, or is it not possible at all? Thanks
Best,
Trinayan
Since CEDT always works on the GPU first, before the CPU, I recommend inserting the following to streamline the initial allocation. I'm finding that the initial copy-out for the CUDA-U implementation is somehow buried in the initialization, so the kernel time performs equivalently to the CUDA-D implementation.
diff --git a/CUDA-U/CEDT/main.cpp b/CUDA-U/CEDT/main.cpp
index 5cf954c..2ff06dd 100644
--- a/CUDA-U/CEDT/main.cpp
+++ b/CUDA-U/CEDT/main.cpp
@@ -162,6 +162,7 @@ int main(int argc, char **argv) {
cudaStatus = cudaMallocManaged(&h_in_out, (p.n_warmup + p.n_reps)*sizeof(unsigned char *));
for(int i = 0; i < p.n_warmup + p.n_reps; i++) {
cudaStatus = cudaMallocManaged(&h_in_out[i], in_size);
For some reason the CUDA-U implementation of PAD is taking a long time in the kernel, likely waiting for data. I was able to make the following change to bring the kernel time in line with the CUDA-D version.
#ifdef CUDA_8_0
    T * h_in_out;
    cudaStatus = cudaMallocManaged(&h_in_out, in_size * sizeof(T));
    // Advise the unified-memory driver to keep these pages resident on device 0.
    cudaMemAdvise(h_in_out, in_size * sizeof(T), cudaMemAdviseSetPreferredLocation, 0);
I realize that this change increases the Allocation phase of the program, but it seems more reasonable for the cost to show up there. It's unclear to me how this setting may impact other results with CUDA Chai.
I'm opening this for discussion and consideration.
I have reached a state where the GPU kernel is waiting for the CPU threads to finish, but the CPU threads have already exited according to the debugger. All run-time parameters are default. The GPU appears to be done, with threads_end[0] = 2048; threads_run[0] returns 44, which looks to be the share of the jobs the CPUs took, and is less than iter, which is 45 (with no CPU thread now active)! So the program hangs at line 140 of kernel.cu.
This is also happening to me on the SSSP CUDA-U implementation.
Can you explain how the synchronization is supposed to work, so I can root-cause this? If a private thread is better: robers97 at gmail.com
[New Thread 0x7ffff0b8f190 (LWP 47429)]
Tesla V100-SXM2-16GB [New Thread 0x7fffdffff190 (LWP 47430)]
Allocation Time (ms): 52.675583
Number of nodes = 264346 Number of edges = 733846 Initialization Time (ms): 243.190781
[New Thread 0x7fffdf6af190 (LWP 47433)]
[New Thread 0x7fffdee9f190 (LWP 47434)]
[New Thread 0x7fffde68f190 (LWP 47435)]
[Thread 0x7fffde68f190 (LWP 47435) exited]
[Thread 0x7fffdee9f190 (LWP 47434) exited]
[Thread 0x7fffdf6af190 (LWP 47433) exited]
^C
Thread 1 "bfs" received signal SIGINT, Interrupt.
0x00000000107e0bc0 in _INTERNAL_42_tmpxft_0000db2d_00000000_10_kernel_cpp1_ii_a8c07a88::atomicAdd_system (address=,
val=) at /usr/local/cuda/include/sm_60_atomic_functions.hpp:92
92 return __iAtomicAdd_system(address, val);
(cuda-gdb) where
#0 0x00000000107e0bc0 in _INTERNAL_42_tmpxft_0000db2d_00000000_10_kernel_cpp1_ii_a8c07a88::atomicAdd_system (address=,
val=) at /usr/local/cuda/include/sm_60_atomic_functions.hpp:92
#1 0x00000000107e5320 in BFS_gpu<<<(8,1,1),(256,1,1)>>> (graph_nodes_av=0x7fff80000000, graph_edges_av=0x7fff80400000,
cost=0x7fff80a00000, color=0x7fff80204600, q1=0x7fff80c00000, q2=0x7fff80e00000, n_t=0x7fff81000800, head=0x7fff81000000,
tail=0x7fff81000200, threads_end=0x7fff81000400, threads_run=0x7fff81000600, overflow=0x7fff81000a00, LIMIT=128, CPU=1)
at kernel.cu:140
(cuda-gdb) where
#0 0x00000000107e0bc0 in _INTERNAL_42_tmpxft_0000db2d_00000000_10_kernel_cpp1_ii_a8c07a88::atomicAdd_system (address=,
val=) at /usr/local/cuda/include/sm_60_atomic_functions.hpp:92
#1 0x00000000107e5320 in BFS_gpu<<<(8,1,1),(256,1,1)>>> (graph_nodes_av=0x7fff80000000, graph_edges_av=0x7fff80400000,
cost=0x7fff80a00000, color=0x7fff80204600, q1=0x7fff80c00000, q2=0x7fff80e00000, n_t=0x7fff81000800, head=0x7fff81000000,
tail=0x7fff81000200, threads_end=0x7fff81000400, threads_run=0x7fff81000600, overflow=0x7fff81000a00, LIMIT=128, CPU=1)
at kernel.cu:140
(cuda-gdb) print iter
No symbol "iter" in current context.
(cuda-gdb) print CPU
No symbol "CPU" in current context.
(cuda-gdb) up
#1 0x00000000107e5320 in BFS_gpu<<<(8,1,1),(256,1,1)>>> (graph_nodes_av=0x7fff80000000, graph_edges_av=0x7fff80400000,
cost=0x7fff80a00000, color=0x7fff80204600, q1=0x7fff80c00000, q2=0x7fff80e00000, n_t=0x7fff81000800, head=0x7fff81000000,
tail=0x7fff81000200, threads_end=0x7fff81000400, threads_run=0x7fff81000600, overflow=0x7fff81000a00, LIMIT=128, CPU=1)
at kernel.cu:140
140 while(atomicAdd_system(&threads_run[0], 0) < iter) {
(cuda-gdb) print iter
$1 = 45
(cuda-gdb) print &threads_run[0]
$2 = (@Managed @Generic int *) 0x7fff81000600
(cuda-gdb) print threads_run[0]
$3 = 44
I would like to feed the road_usa graph to CUDA-D/BFS.
The input is in Matrix Market format, which to my understanding is a valid format for BFS in this suite.
$ head -n 10 input/road_usa.mtx
23947347 23947347 28854312
2 1
710 1
1049674 1
2097155 2
10 3
2097153 3
2097157 3
1904 4
1048579 4
However, I get a segmentation fault like this:
$ ./bfs -f input/road_usa.mtx
TITAN V Segmentation fault (core dumped)
I would like to know whether this is a valid input for Chai or not. The two bundled input graphs are too small for my purposes.
Hi,
Do you have any plans to support CUDA Unified Memory enhancements?
More specifically, I'm interested in
Hi,
Which CUDA version is required to run these benchmarks? Also, since there is a CUDA-U version, I believe it should be possible to run those on the Jetson boards?
Best,
Trinayan
Hello everybody.
I'm executing CEDD with the default configuration and the output seems really strange.
Randomly, one of the timings gets an enormous number.
CEDD output:
Spectre -p 0 -d 0 -i 16 -a 0.20 -t 4
Allocation Time (ms): 0.071000
Initialization Time (ms): 2532.493000
Total Proxies Time (ms): 436.893000
CPU Proxy:
CPU Proxy: Kernel Time (ms): 432.369000
GPU Proxy:
GPU Proxy: Copy To Device Time (ms): 1501703139477.527100
GPU Proxy: Kernel Time (ms): 58.478000
GPU Proxy: Copy Back Time (ms): 9.976000
Deallocation Time (ms): 8.021000
Test Passed
Spectre -p 0 -d 0 -i 16 -a 0.20 -t 4
Allocation Time (ms): 0.073000
Initialization Time (ms): 2583.880000
Total Proxies Time (ms): 427.340000
CPU Proxy:
CPU Proxy: Kernel Time (ms): 1501703148589.305908
GPU Proxy:
GPU Proxy: Copy To Device Time (ms): 11.677000
GPU Proxy: Kernel Time (ms): 54.310000
GPU Proxy: Copy Back Time (ms): 9.947000
Deallocation Time (ms): 8.555000
Test Passed
I also executed BS and this never happened.
Do you have an idea of the reason for this?
Gabriel
Running the BFS example in OpenCL-U/D shows:
Computed node 2 cost (-2147483647 != 45) does not match the expected value
Thanks
Where are the datasets used in the Chai paper available from?
In your paper, does GPU+nCPU mean running the data-partitioning benchmarks with n CPU threads, all with the default partitioning?
Hi,
I am getting an error when I try to run the CUDA-U benchmarks on my Jetson TX2. It says "operation not supported", returned from the CUDA_ERR() check after the code calls cudaMallocManaged. But cudaMallocManaged definitely works on the TX2, because it runs CUDA 8 on Pascal, and there are NVIDIA samples that use it.
What could be the issue here? Thanks again.
Best,
Trinayan
Hi,
On my NVIDIA device the CEDD benchmark sometimes exits with an error: invalid resource handle. The error is unpredictable and happens only occasionally; I don't know why. It might be a thread race, since it is completely nondeterministic. Have you faced similar issues?
Hi,
Are there any plans to release the CUDA and CUDA-Sim versions of Chai soon?
Best,
Trinayan