chai-benchmarks / chai
Hi,
Apart from the benchmarks where the user can specify the size of the data, are there any larger input sources available for the other benchmarks, where the user passes a file? Thank you.
Best,
Trinayan
Hi,
The Chai paper mentions that SC, PAD, and TRNS support only GPU execution in the OpenCL-D benchmarks. I did not try the OpenCL-D ones, but the CUDA-D versions of SC and PAD execute on the CPU and GPU together. I was wondering whether the same is possible for the CUDA-D version of TRNS, or is it not possible at all? Thanks
Best,
Trinayan
Since CEDT always works on the GPU first, before the CPU, I recommend inserting the following to streamline the initial allocation. I'm finding that the initial copy-out for the CUDA-U implementation is somehow buried in the initialization, so the kernel time performs equivalently to the CUDA-D implementation.
diff --git a/CUDA-U/CEDT/main.cpp b/CUDA-U/CEDT/main.cpp
index 5cf954c..2ff06dd 100644
--- a/CUDA-U/CEDT/main.cpp
+++ b/CUDA-U/CEDT/main.cpp
@@ -162,6 +162,7 @@ int main(int argc, char **argv) {
cudaStatus = cudaMallocManaged(&h_in_out, (p.n_warmup + p.n_reps)*sizeof(unsigned char *));
for(int i = 0; i < p.n_warmup + p.n_reps; i++) {
cudaStatus = cudaMallocManaged(&h_in_out[i], in_size);
For some reason the CUDA-U implementation of PAD is taking a long time in the kernel, likely waiting for data. I was able to make the following change to bring the kernel time in line with the CUDA-D version.
#ifdef CUDA_8_0
    T * h_in_out;
    cudaStatus = cudaMallocManaged(&h_in_out, in_size * sizeof(T));
    // Advise the unified-memory driver to keep these pages resident on device 0.
    cudaMemAdvise(h_in_out, in_size * sizeof(T), cudaMemAdviseSetPreferredLocation, 0);
I realize that this change increases the Allocation phase of the program, but it seems more reasonable for the cost to show up there. It's unclear to me how this setting may impact other results with CUDA Chai.
I'm opening this for discussion and consideration.
I have reached a state where the GPU kernel is waiting for the CPU threads to finish, but the CPU threads have already exited according to the debugger. All run-time parameters are default. The GPU appears to be done, with threads_end[0] = 2048; threads_run[0] returns 44, which looks to be the share of the jobs the CPUs took, and is less than iter, which is 45 (with no CPU thread now active)! So the program hangs at line 140 of kernel.cu.
This is also happening to me on the SSSP CUDA-U implementation.
Can you explain how the synchronization is supposed to work, so I can root-cause this? If a private thread is better: robers97 at gmail.com
[New Thread 0x7ffff0b8f190 (LWP 47429)]
Tesla V100-SXM2-16GB [New Thread 0x7fffdffff190 (LWP 47430)]
Allocation Time (ms): 52.675583
Number of nodes = 264346 Number of edges = 733846 Initialization Time (ms): 243.190781
[New Thread 0x7fffdf6af190 (LWP 47433)]
[New Thread 0x7fffdee9f190 (LWP 47434)]
[New Thread 0x7fffde68f190 (LWP 47435)]
[Thread 0x7fffde68f190 (LWP 47435) exited]
[Thread 0x7fffdee9f190 (LWP 47434) exited]
[Thread 0x7fffdf6af190 (LWP 47433) exited]
^C
Thread 1 "bfs" received signal SIGINT, Interrupt.
0x00000000107e0bc0 in _INTERNAL_42_tmpxft_0000db2d_00000000_10_kernel_cpp1_ii_a8c07a88::atomicAdd_system (address=,
val=) at /usr/local/cuda/include/sm_60_atomic_functions.hpp:92
92 return __iAtomicAdd_system(address, val);
(cuda-gdb) where
#0 0x00000000107e0bc0 in _INTERNAL_42_tmpxft_0000db2d_00000000_10_kernel_cpp1_ii_a8c07a88::atomicAdd_system (address=,
val=) at /usr/local/cuda/include/sm_60_atomic_functions.hpp:92
#1 0x00000000107e5320 in BFS_gpu<<<(8,1,1),(256,1,1)>>> (graph_nodes_av=0x7fff80000000, graph_edges_av=0x7fff80400000,
cost=0x7fff80a00000, color=0x7fff80204600, q1=0x7fff80c00000, q2=0x7fff80e00000, n_t=0x7fff81000800, head=0x7fff81000000,
tail=0x7fff81000200, threads_end=0x7fff81000400, threads_run=0x7fff81000600, overflow=0x7fff81000a00, LIMIT=128, CPU=1)
at kernel.cu:140
(cuda-gdb) where
#0 0x00000000107e0bc0 in _INTERNAL_42_tmpxft_0000db2d_00000000_10_kernel_cpp1_ii_a8c07a88::atomicAdd_system (address=,
val=) at /usr/local/cuda/include/sm_60_atomic_functions.hpp:92
#1 0x00000000107e5320 in BFS_gpu<<<(8,1,1),(256,1,1)>>> (graph_nodes_av=0x7fff80000000, graph_edges_av=0x7fff80400000,
cost=0x7fff80a00000, color=0x7fff80204600, q1=0x7fff80c00000, q2=0x7fff80e00000, n_t=0x7fff81000800, head=0x7fff81000000,
tail=0x7fff81000200, threads_end=0x7fff81000400, threads_run=0x7fff81000600, overflow=0x7fff81000a00, LIMIT=128, CPU=1)
at kernel.cu:140
(cuda-gdb) print iter
No symbol "iter" in current context.
(cuda-gdb) print CPU
No symbol "CPU" in current context.
(cuda-gdb) up
#1 0x00000000107e5320 in BFS_gpu<<<(8,1,1),(256,1,1)>>> (graph_nodes_av=0x7fff80000000, graph_edges_av=0x7fff80400000,
cost=0x7fff80a00000, color=0x7fff80204600, q1=0x7fff80c00000, q2=0x7fff80e00000, n_t=0x7fff81000800, head=0x7fff81000000,
tail=0x7fff81000200, threads_end=0x7fff81000400, threads_run=0x7fff81000600, overflow=0x7fff81000a00, LIMIT=128, CPU=1)
at kernel.cu:140
140 while(atomicAdd_system(&threads_run[0], 0) < iter) {
(cuda-gdb) print iter
$1 = 45
(cuda-gdb) print &threads_run[0]
$2 = (@Managed @Generic int *) 0x7fff81000600
(cuda-gdb) print threads_run[0]
$3 = 44
I would like to feed the road_usa graph to CUDA-D/BFS.
The input is in Matrix Market format, which to my understanding is a valid format for BFS in this suite.
$ head -n 10 input/road_usa.mtx
23947347 23947347 28854312
2 1
710 1
1049674 1
2097155 2
10 3
2097153 3
2097157 3
1904 4
1048579 4
However, I get a segmentation fault like this:
$ ./bfs -f input/road_usa.mtx
TITAN V Segmentation fault (core dumped)
I would like to know whether this is a valid input for Chai or not. The two bundled input graphs are too small for my purposes.
Hi,
Do you have any plans to support CUDA Unified Memory enhancements?
More specifically, I'm interested in
Hi,
Which CUDA version is required to run these benchmarks? Also, since there is a CUDA-U version, I believe it should be possible to run those on the Jetson boards?
Best,
Trinayan
Hello everybody.
I'm executing CEDD with the default configuration and the output seems really strange.
Randomly, one of the timings gets an enormous number.
CEDD output:
Spectre -p 0 -d 0 -i 16 -a 0.20 -t 4
Allocation Time (ms): 0.071000
Initialization Time (ms): 2532.493000
Total Proxies Time (ms): 436.893000
CPU Proxy:
CPU Proxy: Kernel Time (ms): 432.369000
GPU Proxy:
GPU Proxy: Copy To Device Time (ms): 1501703139477.527100
GPU Proxy: Kernel Time (ms): 58.478000
GPU Proxy: Copy Back Time (ms): 9.976000
Deallocation Time (ms): 8.021000
Test Passed
Spectre -p 0 -d 0 -i 16 -a 0.20 -t 4
Allocation Time (ms): 0.073000
Initialization Time (ms): 2583.880000
Total Proxies Time (ms): 427.340000
CPU Proxy:
CPU Proxy: Kernel Time (ms): 1501703148589.305908
GPU Proxy:
GPU Proxy: Copy To Device Time (ms): 11.677000
GPU Proxy: Kernel Time (ms): 54.310000
GPU Proxy: Copy Back Time (ms): 9.947000
Deallocation Time (ms): 8.555000
Test Passed
I also executed BS and this never happened.
Do you have an idea of the reason for this?
Gabriel
Running the BFS example in OpenCL-U/D shows:
Computed node 2 cost (-2147483647 != 45) does not match the expected value
Thanks
Where are the datasets used in the Chai paper available from?
In your paper, does GPU+nCPU mean running the data-partitioning benchmarks with n CPU threads, all with the default partitioning?
Hi,
I am getting an error when I try to run the CUDA-U benchmarks on my Jetson TX2. It says "operation not supported", returned from the CUDA_ERR() check after the code calls cudaMallocManaged. But cudaMallocManaged definitely works on the TX2, because it runs CUDA 8 on Pascal, and there are NVIDIA samples that use it.
What could be the issue here? Thanks again.
Best,
Trinayan
Hi,
On my NVIDIA device the CEDD benchmark sometimes exits with an error: invalid resource handle. The error is unpredictable and happens only occasionally; I don't know why. It might be a thread race, since it is completely nondeterministic. Have you faced similar issues?
Hi,
Are there any plans to release the CUDA and CUDA-Sim versions of Chai soon?
Best,
Trinayan