Making it easy to add and analyze clockings in CUDA
First run pip install tabulate, then run ./run.sh and enjoy the pretty readout!
The only include is clocker.cuh
cuda-clocking currently consists of 5 kernel macros and one host struct. First the macros:
- Add TIMINGDETAIL_ARGS() to the parameters of the kernel you want to profile.
- Run INITTHREADTIMER(); at the start of the kernel you want to profile. If you need more than 64 breakpoints, pass in the number you need, like INITTHREADTIMER(200);
- Run CLOCKRESET(); whenever you want to restart the running timer.
- Run CLOCKPOINT(ID, LABEL); whenever you want to record elapsed time and restart the timer. You can put it in a loop, and it will count both calls and elapsed time. ID should be unique, >=0, and less than the maximum number of breakpoints (default 64). The LABEL field is completely ignored by the compiler but is used by the analysis script. A valid example use would be CLOCKPOINT(3, "global memory write");
- Run FINISHTHREADTIMER(); at the end of the kernel to write results back to global memory.
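The macros above could be combined in a kernel roughly as follows. This is a minimal sketch, not taken from the project: the kernel name scale and its parameters are hypothetical, and it assumes that TIMINGDETAIL_ARGS() expands to a parameter declaration that can follow the regular parameters after a comma.

```cuda
#include "clocker.cuh"

// Hypothetical kernel: scales an array in place, with three breakpoints.
// Assumption: TIMINGDETAIL_ARGS() can be placed last in the parameter list.
__global__ void scale(float *x, int n, float a, TIMINGDETAIL_ARGS()) {
    // Default capacity of 64 breakpoints; use INITTHREADTIMER(200) for more.
    INITTHREADTIMER();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Records the time since INITTHREADTIMER() and restarts the timer.
    CLOCKPOINT(0, "index setup");

    // Grid-stride loop: each CLOCKPOINT inside accumulates calls and time.
    for (int j = i; j < n; j += stride) {
        float v = x[j];
        CLOCKPOINT(1, "global memory read");

        v *= a;
        // Restart the timer so the multiply is not attributed to ID 2.
        CLOCKRESET();

        x[j] = v;
        CLOCKPOINT(2, "global memory write");
    }

    // Write this thread's results back to global memory.
    FINISHTHREADTIMER();
}
```

The sketch uses a grid-stride loop instead of an early return for out-of-range threads, so that every thread reaches FINISHTHREADTIMER().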
The host struct TimingData should be initialized with the grid and block dimensions of the kernel, and also the requested number of breakpoints if more than 64 (otherwise that parameter can be omitted). Pass in timingdata_struct_name.data as the argument corresponding to TIMINGDETAIL_ARGS() when running the kernel. Finally, call timingdata_struct_name.write("path_to_cuda_file_containing_kernel"); to write an output profile.
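On the host side, the flow might look like the sketch below. Several details here are assumptions rather than documented behavior: that the TimingData constructor accepts dim3 grid and block values in that order, that TIMINGDETAIL_ARGS() can be the last kernel parameter, and that the file is saved as example.cu (a hypothetical name). The kernel touch is also hypothetical.

```cuda
#include <cuda_runtime.h>
#include "clocker.cuh"

// Hypothetical kernel with a single breakpoint.
__global__ void touch(float *x, int n, TIMINGDETAIL_ARGS()) {
    INITTHREADTIMER();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
    CLOCKPOINT(0, "increment");
    FINISHTHREADTIMER();
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Assumption: constructor takes (grid, block); a third argument would
    // give the number of breakpoints when more than 64 are needed.
    TimingData timing(grid, block);

    // timing.data fills the slot declared by TIMINGDETAIL_ARGS().
    touch<<<grid, block>>>(x, n, timing.data);
    cudaDeviceSynchronize();

    // Path to the CUDA source file containing the kernel
    // (assumed here to be this file, saved as example.cu).
    timing.write("example.cu");

    cudaFree(x);
    return 0;
}
```

The explicit cudaDeviceSynchronize() before write() is a precaution in this sketch; whether write() synchronizes on its own is not stated above.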
You should be able to compile and run your code normally. The timing code uses a relatively small number of registers, so it is unlikely to interfere with most kernels.
Finally, run python analysis.py to see the results!