Reproduce the problem:
/usr/local/cuda-11.5/bin/nvcc test_nvcc.cu -O3 -arch=sm_80
/usr/local/cuda-11.5/bin/ncu --section regex:'^(?!Nvlink)' -f ./a.out
CopyKernel1 gets these warnings:
WRN Uncoalesced global access, expected 2097152 sectors, got 4194304 (2.00x) at PC 0x7fb694faf560
----- --------------------------------------------------------------------------------------------------------------
WRN Uncoalesced global access, expected 2097152 sectors, got 4194304 (2.00x) at PC 0x7fb694faf570
----- --------------------------------------------------------------------------------------------------------------
With CUDA 11.2, everything is OK.
With CUDA 11.3 or newer:
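CopyKernel0 works normally: it has a 64-bit load and a 64-bit store. In CopyKernel1, however, the float2 64-bit store is split into two 32-bit stores, which causes the uncoalesced global access warning in Nsight Compute. Presumably each 32-bit store then writes at an 8-byte stride across the warp, so it touches twice as many 32-byte sectors as it needs, which would match the 2.00x in the warning. The only difference between CopyKernel0 and CopyKernel1 is the row for loop: int row = blockIdx.x versus int64_t row = blockIdx.x.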
the CopyKernel0:
the CopyKernel1:
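
For anyone reading without the listings, here is a minimal sketch of the kind of kernel pair described above. It is a hypothetical reconstruction, not the original test_nvcc.cu: the names SketchCopyInt and SketchCopyInt64, the parameters, the problem size and the launch configuration are all assumptions, and I have not confirmed that this exact code shows the split store. The only intended difference between the two kernels is the type of the row loop index.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant with a 32-bit row index (assumed shape, not the original code).
__global__ void SketchCopyInt(const float2* __restrict__ src,
                              float2* __restrict__ dst,
                              int64_t rows, int64_t cols) {
  for (int row = blockIdx.x; row < rows; row += gridDim.x) {
    for (int col = threadIdx.x; col < cols; col += blockDim.x) {
      const int64_t offset = static_cast<int64_t>(row) * cols + col;
      dst[offset] = src[offset];  // ideally one 64-bit load and one 64-bit store
    }
  }
}

// Hypothetical variant with a 64-bit row index; everything else is identical.
__global__ void SketchCopyInt64(const float2* __restrict__ src,
                                float2* __restrict__ dst,
                                int64_t rows, int64_t cols) {
  for (int64_t row = blockIdx.x; row < rows; row += gridDim.x) {
    for (int col = threadIdx.x; col < cols; col += blockDim.x) {
      const int64_t offset = row * cols + col;
      dst[offset] = src[offset];  // ideally one 64-bit load and one 64-bit store
    }
  }
}

int main() {
  // Assumed sizes, chosen only so the sketch runs quickly.
  const int64_t rows = 1024, cols = 1024;
  const size_t bytes = static_cast<size_t>(rows) * cols * sizeof(float2);

  float2* src = nullptr;
  float2* dst = nullptr;
  if (cudaMalloc(&src, bytes) != cudaSuccess ||
      cudaMalloc(&dst, bytes) != cudaSuccess) {
    std::fprintf(stderr, "cudaMalloc failed\n");
    return 1;
  }
  cudaMemset(src, 0, bytes);

  // Launch both variants so they can be compared side by side in the profiler.
  SketchCopyInt<<<256, 256>>>(src, dst, rows, cols);
  SketchCopyInt64<<<256, 256>>>(src, dst, rows, cols);

  const cudaError_t err = cudaDeviceSynchronize();
  std::printf("status: %s\n", cudaGetErrorString(err));

  cudaFree(src);
  cudaFree(dst);
  return err == cudaSuccess ? 0 : 1;
}
```

Building this with the nvcc command above and dumping the SASS with cuobjdump -sass a.out should show whether each kernel keeps a single 64-bit global store or ends up with two 32-bit stores.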