Comments (3)
To provide better context, I also tried the same code with the legacy autograd profiler.
Code to reproduce
import torch

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for _ in range(100):
        y = torch.randn(1).cuda() + torch.randn(1).cuda()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
Code Result
STAGE:2024-01-21 21:44:48 2136:2136 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-21 21:44:48 2136:2136 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
------------------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::to 6.92% 2.056ms 44.29% 13.158ms 65.790us 2.962ms 9.35% 13.700ms 68.500us 200
aten::add 2.04% 605.000us 39.60% 11.764ms 117.640us 12.221ms 38.59% 12.221ms 122.210us 100
aten::_to_copy 13.34% 3.962ms 37.37% 11.102ms 55.510us 4.156ms 13.12% 10.738ms 53.690us 200
aten::randn 13.30% 3.952ms 14.89% 4.425ms 22.125us 3.134ms 9.90% 5.750ms 28.750us 200
aten::copy_ 2.40% 714.000us 21.46% 6.376ms 31.880us 4.853ms 15.32% 4.853ms 24.265us 200
aten::empty_strided 1.67% 496.000us 2.46% 730.000us 3.650us 1.729ms 5.46% 1.729ms 8.645us 200
aten::normal_ 1.35% 400.000us 1.35% 400.000us 2.000us 1.601ms 5.06% 1.601ms 8.005us 200
aten::empty 0.25% 73.000us 0.25% 73.000us 0.365us 1.015ms 3.20% 1.015ms 5.075us 200
cudaDeviceGetStreamPriorityRange 1.19% 354.000us 1.19% 354.000us 354.000us 0.000us 0.00% 0.000us 0.000us 1
cudaGetDeviceCount 0.00% 0.000us 0.00% 0.000us 0.000us 0.000us 0.00% 0.000us 0.000us 2
------------------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 29.709ms
Self CUDA time total: 31.671ms
from kineto.
I have the same problem.
CUDA: 12.3
PyTorch: 2.1.2
The ops ProfilerStep*, aten::empty, aten::to, aten::add, etc. are launched on the CPU, so the profiler is working as expected when ProfilerActivity.CPU is not added. The output of the profiler is the expected behavior and not a bug.
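A minimal sketch of what that means in practice with the newer torch.profiler API: include ProfilerActivity.CPU in the activity list so the CPU-side launches (aten::to, aten::add, ...) show up alongside any GPU kernels. The CPU-only fallback when no GPU is present is an assumption added here so the snippet runs anywhere; the loop mirrors the reproduction code above.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Always record CPU activity; ops like aten::add are launched on the CPU,
# so omitting ProfilerActivity.CPU hides them from the table.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

device = "cuda" if torch.cuda.is_available() else "cpu"
with profile(activities=activities) as prof:
    for _ in range(100):
        y = torch.randn(1, device=device) + torch.randn(1, device=device)

# CPU-side ops now appear in the summary table.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```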
Related Issues (20)
- TB_Plugin_CI failing with AttributeError: module 'mpmath' has no attribute 'rational'
- [Plugin-Bug]The Operators of baseline-run and exp-run are showed in a misaligned order HOT 1
- How to add customized metadata with on demand profiling ? HOT 7
- [RFC] Support XPU Backend With PTI-sdk in Kineto HOT 3
- [Discussion] Which clock should we be using for timestamps? HOT 2
- GPU traces fail when using PyTorch lightning due to square braces in traceName HOT 2
- Support memory profiling feature from on-demand path
- Roctracer crashes when number of samples too high
- TypeError: bad operand type for unary -: 'NoneType' HOT 4
- [Synchronization events] Missing StreamWait event in cases
- KeyError: <torch_tb_profiler.profiler.node.OperatorNode object at 0x7f4a45dc3e80> HOT 1
- Module View cannot show device time HOT 5
- CUDA time difference between print function and Profiler TensorBoard
- [Feature Request] Add Process Status Check Before Profiling to Handle Non-Running Training Tasks
- Upgrade to CUDA 12.4 is causing segfaults in 4 Range Profiler Tests
- [BUG] Number of communication kernels don't match between workers in run
- Train process is blocked when kineto is processing traceEvents HOT 1
- Update libfmt in kineto
- IN Build CUPTI_RUNTIME_TRACE_CBID_cudaLaunchKernelExC_v11060, not found HOT 2
- [RFC][XPU profiler] Introduce XPU profiler by following kineto plugin design HOT 3