I have reached a state where the GPU kernel is waiting for the CPU threads to finish, but according to the debugger the CPU threads have already exited. All runtime parameters are at their defaults. The GPU side appears to be done, with threads_end[0] = 2048; threads_run[0] returns 44, which looks like the CPU threads' share of the jobs, and is less than iter, which is 45 (with no CPU thread still active)! So the program hangs at line 140 of kernel.cu.
This is also happening to me on the SSSP CUDA-U implementation.
Can you explain how the synchronization is supposed to work so I can root-cause this? If a private thread is better: robers97 at gmail.com
[New Thread 0x7ffff0b8f190 (LWP 47429)]
Tesla V100-SXM2-16GB [New Thread 0x7fffdffff190 (LWP 47430)]
Allocation Time (ms): 52.675583
Number of nodes = 264346 Number of edges = 733846 Initialization Time (ms): 243.190781
[New Thread 0x7fffdf6af190 (LWP 47433)]
[New Thread 0x7fffdee9f190 (LWP 47434)]
[New Thread 0x7fffde68f190 (LWP 47435)]
[Thread 0x7fffde68f190 (LWP 47435) exited]
[Thread 0x7fffdee9f190 (LWP 47434) exited]
[Thread 0x7fffdf6af190 (LWP 47433) exited]
^C
Thread 1 "bfs" received signal SIGINT, Interrupt.
0x00000000107e0bc0 in _INTERNAL_42_tmpxft_0000db2d_00000000_10_kernel_cpp1_ii_a8c07a88::atomicAdd_system (address=,
val=) at /usr/local/cuda/include/sm_60_atomic_functions.hpp:92
92 return __iAtomicAdd_system(address, val);
(cuda-gdb) where
#0 0x00000000107e0bc0 in _INTERNAL_42_tmpxft_0000db2d_00000000_10_kernel_cpp1_ii_a8c07a88::atomicAdd_system (address=,
val=) at /usr/local/cuda/include/sm_60_atomic_functions.hpp:92
#1 0x00000000107e5320 in BFS_gpu<<<(8,1,1),(256,1,1)>>> (graph_nodes_av=0x7fff80000000, graph_edges_av=0x7fff80400000,
cost=0x7fff80a00000, color=0x7fff80204600, q1=0x7fff80c00000, q2=0x7fff80e00000, n_t=0x7fff81000800, head=0x7fff81000000,
tail=0x7fff81000200, threads_end=0x7fff81000400, threads_run=0x7fff81000600, overflow=0x7fff81000a00, LIMIT=128, CPU=1)
at kernel.cu:140
(cuda-gdb) where
#0 0x00000000107e0bc0 in _INTERNAL_42_tmpxft_0000db2d_00000000_10_kernel_cpp1_ii_a8c07a88::atomicAdd_system (address=,
val=) at /usr/local/cuda/include/sm_60_atomic_functions.hpp:92
#1 0x00000000107e5320 in BFS_gpu<<<(8,1,1),(256,1,1)>>> (graph_nodes_av=0x7fff80000000, graph_edges_av=0x7fff80400000,
cost=0x7fff80a00000, color=0x7fff80204600, q1=0x7fff80c00000, q2=0x7fff80e00000, n_t=0x7fff81000800, head=0x7fff81000000,
tail=0x7fff81000200, threads_end=0x7fff81000400, threads_run=0x7fff81000600, overflow=0x7fff81000a00, LIMIT=128, CPU=1)
at kernel.cu:140
(cuda-gdb) print iter
No symbol "iter" in current context.
(cuda-gdb) print CPU
No symbol "CPU" in current context.
(cuda-gdb) up
#1 0x00000000107e5320 in BFS_gpu<<<(8,1,1),(256,1,1)>>> (graph_nodes_av=0x7fff80000000, graph_edges_av=0x7fff80400000,
cost=0x7fff80a00000, color=0x7fff80204600, q1=0x7fff80c00000, q2=0x7fff80e00000, n_t=0x7fff81000800, head=0x7fff81000000,
tail=0x7fff81000200, threads_end=0x7fff81000400, threads_run=0x7fff81000600, overflow=0x7fff81000a00, LIMIT=128, CPU=1)
at kernel.cu:140
140 while(atomicAdd_system(&threads_run[0], 0) < iter) {
(cuda-gdb) print iter
$1 = 45
(cuda-gdb) print &threads_run[0]
$2 = (@Managed @Generic int *) 0x7fff81000600
(cuda-gdb) print threads_run[0]
$3 = 44