
kineto's Introduction

Kineto

Kineto is part of the PyTorch Profiler.

The Kineto project enables:

  • performance observability and diagnostics across common ML bottleneck components
  • actionable recommendations for common issues
  • integration of external system-level profiling tools
  • integration with popular visualization platforms and analysis pipelines

A central component is Libkineto, a profiling library with special focus on low-overhead GPU timeline tracing.

Libkineto

Libkineto is an in-process profiling library integrated with the PyTorch Profiler. Please refer to the README file in the libkineto folder as well as documentation on the new PyTorch Profiler API.

Holistic Trace Analysis

Holistic Trace Analysis (HTA) is an open source performance debugging library aimed at distributed workloads. HTA takes as input PyTorch Profiler traces and elevates the performance bottlenecks to enable faster debugging. Here's a partial list of features in HTA:

  1. Temporal Breakdown: Breakdown of GPU time in terms of time spent in computation, communication, memory events, and idle time on a single node and across all ranks.
  2. Idle Time Breakdown: Breakdown of GPU idle time into waiting for the host, waiting for another kernel or attributed to an unknown cause.
  3. Kernel Breakdown: Find kernels with the longest duration on each rank.
  4. Kernel Duration Distribution: Distribution of average time taken by longest kernels across different ranks.
  5. Communication Computation Overlap: Calculate the percentage of time when communication overlaps computation.

For a complete list see here.
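As a quick orientation, here is a minimal usage sketch. It assumes the HTA Python package (published as HolisticTraceAnalysis) is installed and exposes the TraceAnalysis entry point; the trace directory path is a placeholder for a folder containing one PyTorch Profiler trace per rank.

from hta.trace_analysis import TraceAnalysis

# Point HTA at a directory with one trace file per rank (placeholder path).
analyzer = TraceAnalysis(trace_dir="/path/to/trace/folder")

# Feature 1: temporal breakdown of GPU time per rank.
temporal_df = analyzer.get_temporal_breakdown()

# Feature 5: communication/computation overlap percentage per rank.
overlap_df = analyzer.get_comm_comp_overlap()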

PyTorch TensorBoard Profiler (Deprecated)

The goal of the PyTorch TensorBoard Profiler is to provide a seamless and intuitive end-to-end profiling experience, including straightforward collection from PyTorch and insightful visualizations and recommendations in the TensorBoard UI. Please refer to the README file in the tb_plugin folder.

Future Development Direction

Some areas we're currently working on:

  • Support for tracing distributed workloads
  • Trace processing, analysis and recommendation engine
  • System-level activities, multiple tracing sources
  • Profiling and monitoring daemon for larger scale deployments

Releases and Contributing

We will follow the PyTorch release schedule, which happens roughly every three months.

We appreciate all contributions. If you are planning to contribute back bug-fixes, please do so without any further discussion.

If you plan to contribute new features, please first open an issue and discuss the feature with us. Sending a PR without discussion might end up resulting in a rejected PR because we might be taking the infrastructure in a different direction than you might be aware of. We expect the architecture to keep evolving.

License

Kineto has a BSD-style license, as found in the LICENSE file.

kineto's People

Contributors

aaronenyeshi, astachowiczhabana, bertmaher, briancoutinho, chaekit, cloudhan, davidberard98, dzhulgakov, fenypatel99, fwenguang, gaoteng-git, gdankel, guotuofeng, guyang3532, ilia-cher, kimishpatel, malfet, mantaionut, mwootton, openrichardfb, r-barnes, shinytang6, slgong-fb, sraikund16, valentinandrei, wizzniu, wraymo, xw285cornell, yoyoyocmu, yuguo68


kineto's Issues

Kernels overlap in the same CUDA stream

Sometimes (this rarely happens), when opening this JSON file in chrome://tracing, the "stream 7" track shows two lines, with a small triangle in front of it.
I found it is caused by a very thin kernel with "dur" 0. This kernel overlaps with another event, so two lines are shown for stream 7.
[screenshot]
It's worth analyzing further why two kernels overlap in the same stream.

[Task] Add support for saving chrome trace files to S3 URLs and Azure Blob

Add support for saving the generated chrome trace files to S3 URLs. This works for the TensorBoard SummaryWriter but is not supported for Profiler traces.

Current Behavior
on_trace_ready=torch.profiler.tensorboard_trace_handler('s3://tb-demo/pytorch/')

Creates local files like:

(base) ubuntu@ip-172-31-22-142:~$ ls 
s3\:/tb-demo/pytorch/ip-172-31-22-142_14545.1618682202565.pt.trace.json 
s3:/tb-demo/pytorch/ip-172-31-22-142_14545.1618682202565.pt.trace.json

Expected Behavior
The chrome trace files should be saved in the S3 bucket
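Until this is supported natively, one possible workaround is to export the trace locally and upload it afterwards. The sketch below is not part of the Profiler API: it assumes boto3 is installed with AWS credentials configured, and the bucket and prefix names are placeholders.

import os
import socket
import time

import boto3


def s3_trace_handler(bucket, prefix, local_dir="/tmp"):
    s3 = boto3.client("s3")

    def handler(prof):
        # Mirror the tensorboard_trace_handler naming convention.
        file_name = "{}_{}.{}.pt.trace.json".format(
            socket.gethostname(), os.getpid(), int(time.time() * 1000))
        local_path = os.path.join(local_dir, file_name)
        prof.export_chrome_trace(local_path)  # write the trace locally first
        s3.upload_file(local_path, bucket, "{}/{}".format(prefix, file_name))

    return handler


# Usage with placeholder names: on_trace_ready=s3_trace_handler("tb-demo", "pytorch")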

Add ProfilerAction.PAUSE

A ProfilerAction.PAUSE could be used to skip some batches when switching from train_dataloader to validation_dataloader
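For reference, a rough approximation that is possible today (a sketch, not a real PAUSE): arrange the profiler schedule so that the validation batches fall into the "wait" phase and are not recorded.

import torch

# Illustrative numbers only: if roughly 10 validation batches run between profiled
# training phases, the schedule's wait phase can absorb them, so those steps are
# not recorded; prof.step() is still called once per batch, including validation.
prof = torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=10, warmup=1, active=3, repeat=2),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log"),
)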

TensorBoard hangs in VS Code

VS Code opens TensorBoard, but it hangs forever (displaying a spinner). The same happens in the browser, but there it is resolved by refreshing the page. The cause appears to be a race between TensorBoard loading/finding the trace files and the web page opening. Since VS Code loads the frame at the same time as it starts TensorBoard, and the frame cannot be refreshed in VS Code, there is no easy way for the user to get the page to load.

[screenshot]

security warning in tensorboard

A warning from "security_validator.py" appears repeatedly in the TensorBoard logs:

W0303 12:02:32.939834 139855689737984 security_validator.py:51] In 3.0, this warning will become an error
X-Content-Type-Options is required to be "nosniff"

[Document] Profiler API Documentation

Update API Documentation https://pytorch.org/docs/stable/profiler.html

Feedback from Users:

  • Document the difference between the "wait" and "warmup" parameters of the profiler schedule. It would be great if their meaning could be clarified in the documentation.

  • Explicitly state in the PyTorch docs that an additional plugin needs to be installed to use TensorBoard. The GitHub link states it clearly, but in my opinion it should be more explicit, so I suggest putting that clarification in the main PyTorch documentation.

  • Document the profiler.step function; its documentation could be improved.

  • Document record_function. Is there any way to mark regions of the code to be grouped together in profiling? (A short sketch follows this list.)

  • Document the trace event lifecycle. Our current code has an infinite loop in a function, from which we return in the middle when some "exit condition" is satisfied (either a number of steps or validation loss no longer improving). While we can certainly refactor that loop to exit "normally" and thus trigger the "exit" call of the profiler context manager, it would be good to have a helper (or at least a documentation example) showing how best to handle this training-loop structure. TensorBoard events still seem to be logged properly, but if I try to summarize p.key_averages() inside the "with torch.profiler" block (before the "return"), I get "RuntimeError: can't export a trace that didn't finish running". This is not a big deal, since we can refactor the code, but it would be nice if such a structure were generally supported (maybe it already is, in which case examples in the docs would help).
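Regarding the record_function question above, here is a short sketch of how regions can already be marked with the existing torch.profiler.record_function API (the region names are arbitrary):

import torch

with torch.profiler.profile() as prof:
    with torch.profiler.record_function("data_preprocessing"):
        x = torch.randn(128, 1024)
    with torch.profiler.record_function("forward"):
        y = torch.nn.Linear(1024, 10)(x)

# Named regions appear as entries that can be grouped and sorted like operators.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))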

Why CUDA_SOURCE_DIR instead of CUDA_HOME?

This is sort of nitpicking.

According to the libkineto README and CMakeLists.txt,

if CUDA_SOURCE_DIR is not set, libkineto will fail to build:

if (NOT CUDA_SOURCE_DIR)
set(CUDA_SOURCE_DIR "$ENV{CUDA_SOURCE_DIR}")
message(INFO " CUDA_SOURCE_DIR = ${CUDA_SOURCE_DIR}")
endif()

However, the build script of PyTorch uses CUDA_HOME.
Why did you choose CUDA_SOURCE_DIR instead of CUDA_HOME? Is there any reason for this?

warnings in profiler console output

A warning from "security_validator.py" appears repeatedly in the TensorBoard logs:
W0303 12:02:32.939834 139855689737984 security_validator.py:51] In 3.0, this warning will become an error
X-Content-Type-Options is required to be "nosniff"

Separate traces for training and validation steps

Our code is structured in such a way that inside the main training loop we will sometimes run validation as well. Ideally, it should be logged separately from training. Is there any functionality to support this?
E.g. similar to what #86 is asking for, but ideally we would be able to spawn a second profiler instance (or provide some metadata saying that the current steps are validation rather than training). A workaround sketch follows.
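A workaround sketch under current APIs (train_loader, val_loader, train_step and validation_step are placeholders): run the validation phase under a second, sequential profiler context that writes to its own log directory.

import torch

train_handler = torch.profiler.tensorboard_trace_handler("./log/train")
val_handler = torch.profiler.tensorboard_trace_handler("./log/val")

with torch.profiler.profile(on_trace_ready=train_handler) as p:
    for batch in train_loader:      # placeholder training loop
        train_step(batch)
        p.step()

with torch.profiler.profile(on_trace_ready=val_handler) as p:
    for batch in val_loader:        # placeholder validation loop
        validation_step(batch)
        p.step()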

Build failure on Ubuntu 16.04

Log

/usr/bin/c++ -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DMAGMA_V2 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTH_BLAS_MKL -DUSE_EXTERNAL_MZCRC -D_FILE_OFFSET_BITS=64 -I../cmake/../third_party/benchmark/include -Icaffe2/contrib/aten -I../third_party/onnx -Ithird_party/onnx -I../third_party/foxi -Ithird_party/foxi -I../third_party/kineto/libkineto/include -I../third_party/kineto/libkineto/src -I../third_party/fmt/include -I/usr/local/cuda/extras/CUPTI/include -I/usr/local/cuda/include -isystem third_party/gloo -isystem ../cmake/../third_party/gloo -isystem ../cmake/../third_party/googletest/googlemock/include -isystem ../cmake/../third_party/googletest/googletest/include -isystem ../third_party/protobuf/src -isystem /home/chester/miniconda3/envs/pytorch-build-py37/include -isystem ../third_party/gemmlowp -isystem ../third_party/neon2sse -isystem ../third_party/XNNPACK/include -isystem ../third_party -isystem ../cmake/../third_party/eigen -isystem /home/chester/miniconda3/envs/pytorch-build-py37/include/python3.7m -isystem /home/chester/miniconda3/envs/pytorch-build-py37/lib/python3.7/site-packages/numpy/core/include -isystem ../cmake/../third_party/pybind11/include -isystem /usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent -isystem /usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent/include -isystem /usr/lib/openmpi/include -isystem /usr/lib/openmpi/include/openmpi -isystem ../cmake/../third_party/cub -isystem ../third_party/ideep/mkl-dnn/include -isystem ../third_party/ideep/include -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -O3 -DNDEBUG -DNDEBUG -fPIC -fvisibility=hidden -DCAFFE2_USE_GLOO -DCUDA_HAS_FP16=1 -DHAVE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD -DKINETO_NAMESPACE=libkineto -std=gnu++14 -DHAS_CUPTI -std=c++14 -MD -MT third_party/kineto/libkineto/CMakeFiles/kineto_base.dir/src/cupti_strings.cpp.o -MF third_party/kineto/libkineto/CMakeFiles/kineto_base.dir/src/cupti_strings.cpp.o.d -o third_party/kineto/libkineto/CMakeFiles/kineto_base.dir/src/cupti_strings.cpp.o -c ../third_party/kineto/libkineto/src/cupti_strings.cpp
../third_party/kineto/libkineto/src/cupti_strings.cpp: In function ‘const char* libkineto::runtimeCbidName(CUpti_CallbackId)’:
../third_party/kineto/libkineto/src/cupti_strings.cpp:478:105: error: expected ‘,’ before ‘)’ token
   static_assert(CUPTI_RUNTIME_TRACE_CBID_SIZE < (sizeof(runtimeCbidNames) / sizeof(runtimeCbidNames[0])));
                                                                                                         ^
../third_party/kineto/libkineto/src/cupti_strings.cpp:478:105: error: expected string-literal before ‘)’ token

Environment

Collecting environment information...
PyTorch version: 1.9.0a0+gitebfa927
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 16.04.7 LTS (x86_64)
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Clang version: Could not collect
CMake version: version 3.18.2

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080
GPU 1: GeForce GTX 1080

Nvidia driver version: 440.33.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.1.3
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Loader doesn't correctly parse trace filename when repeat>1

https://github.com/pytorch/pytorch/blob/7d4e9bdba144e162882fb854324430c4b92fb267/torch/profiler/profiler.py#L75
Here's how tensorboard_trace_handler decides the filename

file_name = "{}.{}.pt.trace.json".format(worker_name, int(time.time() * 1000))

And here's how the loader parses it

worker = path[:-len(pattern)]

When repeat > 1, there will be multiple trace files from each worker with different timestamps, yet the loader will identify them as different workers. A more robust parse is sketched below.
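A sketch of a more robust parse on the loader side, assuming the "{worker}.{timestamp}.pt.trace.json" naming shown above (variable names are illustrative):

# Example file name produced by tensorboard_trace_handler.
path = "ip-172-31-22-142_14545.1618682202565.pt.trace.json"

suffix = ".pt.trace.json"
stem = path[:-len(suffix)]                   # "ip-172-31-22-142_14545.1618682202565"
worker, _, timestamp = stem.rpartition(".")  # split off the trailing timestamp
# worker == "ip-172-31-22-142_14545", identical across repeats of the same worker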

Chrome tracing file can't distinguish async tasks from sync tasks

While reviewing the code of the PyTorch profiler and Kineto, I found that for async tasks there is an inconsistency in how thread IDs are assigned to FunctionEvent and ClientTraceActivity.

  1. When running with @torch.jit.script, torch.jit.fork and torch.jit.wait (such as the first example in Dynamic Parallelism in TorchScript), the "example" function is an async task, so its start_thread_id and end_thread_id in the profiling result differ.
    In torch/autograd/profiler.py, this kind of event is handled carefully by checking "is_async": no children are attached to it and no self time is calculated for it.
    However, in the dumped chrome tracing file (with Kineto enabled), "example" appears only as a normal "Operator" with "ph" set to "X". Nothing indicates that it is an async task; the end thread's ID is simply assigned to "tid".
    My concern: is there a risk in treating async tasks as if they were sync tasks? If we do, could they overlap (cross, rather than contain) each other in the chrome tracing UI?
    And if we can't distinguish async tasks from sync tasks, our tb_plugin may not produce the same results as the PyTorch autograd profiler; for example, we would compute self time for an event without knowing that it is actually async.

Two other minor issues:

  1. There are three thread IDs:
    a. In profiler_kineto.cpp, the thread ID is obtained from at::RecordFunction::currentThreadId() (start, end).
    b. In the chrome tracing file, "tid" is the real pthread ID.
    c. In the chrome tracing file, the ID in "thread_name" is obtained from ChromeTraceLogger::renameThreadID.
    Maybe (a) and (c) could be made more consistent?

  2. Line 56 in profiler_kineto.cpp seems redundant; keeping only line 73 should be enough.

Is kineto usable with pytorch 1.8?

Hi, I have noticed the use_kineto argument in torch.autograd.profiler.profile's signature (PyTorch 1.8), but the docs you point to in the main readme (i.e. https://pytorch.org/docs/master/profiler.html) make no mention of Kineto [anymore]. Hence the question in the title: is the Kineto project still alive and usable with PyTorch 1.8?

If so, are there any additional installs required beyond PyTorch itself for Kineto to work?
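For reference, a hedged sketch of how the Kineto-backed profiler is typically invoked in PyTorch 1.8.1 and later; it assumes a CUDA build of PyTorch with Kineto/CUPTI support, and no extra install is needed for tracing itself (only the optional torch-tb-profiler plugin for TensorBoard visualization).

import torch

with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU,
                    torch.profiler.ProfilerActivity.CUDA]) as prof:
    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")
    (a @ b).sum().item()  # force the GPU work to complete inside the profiled region

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))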

"memory bandwidth" with invalid value inf in dumped chrome tracing file

A colleague got a "Memcpy" event with duration 0 when profiling with Kineto. This makes the "memory bandwidth (GB/s)" value inf, so the file can't be opened by chrome://tracing and our TensorBoard plugin's json.load fails on it.

Because inf is neither a string nor a number, the file cannot be parsed as valid JSON.
A solution is to check whether "dur" is 0 and, if so, write the bandwidth as the string "inf" instead.

The bug case:

{
  "schemaVersion": 1,
  "traceEvents": [
  
  {
    "ph": "X", "cat": "Memcpy", 
    "name": "Memcpy HtoD (Pageable -> Device)", "pid": 0, "tid": "stream 7",
    "ts": 1614800519473220, "dur": 0,
    "args": {
      "device": 0, "context": 1,
      "stream": 7, "correlation": 20002, "external id": 3981,
      "bytes": 200, "memory bandwidth (GB/s)": inf
    }
  }
]}
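Until the logger is fixed, a naive consumer-side workaround sketch (assuming the only bare inf tokens in the file come from this field) is to rewrite them into strings before parsing:

import json

with open("trace.pt.trace.json") as f:
    text = f.read()

# `inf` is not a valid JSON token, so patch it into a string before parsing.
data = json.loads(text.replace(": inf", ': "inf"'))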

Add support for visualizing chrome trace files in tb_plugin from S3 URLs

For chrome trace files in S3 bucket, the tb_plugin UI gets stuck in spinning wheel to load the files and nothing gets displayed. Add support for visualizing the traces from S3 URLs for the Profiler plugin, similar to how one can view the main tensorboard runs in the rest of the tensorboard UI.

'"ts": 0' event is found in kineto's chrome tracing file

When I train a resnet50 model with a large batch size such as 128 or 256, the profiler dumps events like the following in the chrome tracing file:

  {
    "ph": "X", "cat": "Kernel", 
    "name": "volta_sgemm_64x64_nt", "pid": 0, "tid": "stream 7",
    **"ts": 0, "dur": 0,**
    "args": {
      "queued": 0, "device": 0, "context": 1,
      "stream": 7, "correlation": 106123, "external id": 34735,
      "registers per thread": 126,
      "shared memory": 8192,
      "warps per SM": 14.4,
      "grid": [4, 4, 36],
      "block": [64, 1, 1]
    }
  },
  {
    "ph": "f", "id": 106123, "pid": 0, "tid": "stream 7", "ts": 0,
    "cat": "async", "name": "launch", "bp": "e"
  },
  {
    "ph": "X", "cat": "Runtime", 
    "name": "cudaLaunchKernel", "pid": 29463, "tid": "3239786240",
    "ts": 1615207264505000, "dur": 6,
    "args": {
      "cbid": 211, "correlation": 106123,
      "external id": 34735, "external ts": 1615207264504915
    }
  },
  {
    "ph": "s", "id": 106123, "pid": 29463, "tid": 3239786240, "ts": 1615207264505000,
    "cat": "async", "name": "launch"
  },

You can see "ts" is 0 and "dur" is 0 in above kernel. And there are more than 1 thousand these events in the file.

The model code I used is the plugin's resnet50 example, with "batch_size" changed from 32 to 128.
The GPU: NVIDIA V100.
The torch whl: https://download.pytorch.org/whl/nightly/cu111/torch-1.9.0.dev20210305%2Bcu111-cp38-cp38-linux_x86_64.whl
The torchvision whl: https://download.pytorch.org/whl/nightly/cu111/torchvision-0.9.0.dev20210305%2Bcu111-cp38-cp38-linux_x86_64.whl

Because these zero-ts kernels are launched by runtime calls near the end of profiling, I guess they are executed after "profiler stop". This is my snapshot:
[screenshot]
You can see that the kernels which should execute after "profiler stop" are not painted here (they are painted at time 0), and the last operators have no related kernels, even though they did launch kernels according to the chrome tracing file.

It is also a bad experience for the user, who will be confused by the following in this case (ts=0 makes these events start at time 0, while the other events then appear to start many years later):
[screenshot]

By the way, the expected behavior is to correctly dump the kernels launched between the profiler's start and stop, not to remove them from the file, because the operators' kernel time needs to be correct.
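As a viewing-only workaround (the real fix, as noted above, is to record these kernels with correct timestamps), the zero-timestamp events can be filtered out before opening the file in chrome://tracing; a rough sketch:

import json

with open("trace.pt.trace.json") as f:
    trace = json.load(f)

# Drop complete/flow events whose timestamp is 0 so the remaining events
# render on a sane time axis; metadata events without "ts" are kept.
trace["traceEvents"] = [
    e for e in trace["traceEvents"]
    if not (e.get("ph") in ("X", "f", "s") and e.get("ts") == 0)
]

with open("trace.filtered.json", "w") as f:
    json.dump(trace, f)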

Intended usage

What is the intended usage of this library? It doesn't look like there are any examples or a blog post up yet. Do I need to build a C++ program around this library? How do I then invoke it while running a Python PyTorch job?
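To sketch an answer (based on the Libkineto README and the PyTorch Profiler docs; the model below is just an example): no separate C++ program is needed for typical use, since libkineto is linked into PyTorch and driven from Python through torch.profiler.

import torch
import torchvision.models as models

model = models.resnet18().cuda()
x = torch.randn(8, 3, 224, 224, device="cuda")

with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU,
                    torch.profiler.ProfilerActivity.CUDA]) as prof:
    model(x)

# The GPU timeline captured by libkineto, viewable in chrome://tracing.
prof.export_chrome_trace("resnet18_trace.json")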

Page scroll up

Scroll the page down:
[screenshot]

Input a string into "Search by Name" and press enter:
[screenshot]

The page automatically scrolls up. This is a poor user experience and makes the user lose focus.

The "Group By" still behave like this.
The "Group By" and "Search by Name" in "Kernel View" also behave like this.

Why the time range of events are strange?

I'm sorry, I probably shouldn't ask questions here, but I have tried asking on Stack Overflow and the PyTorch forums and no one responded.

My problem is that the timestamps of the events reported by the PyTorch profiler look strange.

# %%
import torch
import torchvision.models as models
import time
print(torch.__version__)

model = models.resnet18().cuda()
inputs = torch.randn(5, 3, 224, 224).cuda()
with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA]
) as p:
    outputs = model(inputs)

# %%
events = p.events()
print('begin     ', min(events, key=lambda e: e.time_range.start).time_range.start / 1000000)
print('end       ', max(events, key=lambda e: e.time_range.end).time_range.end / 1000000)
print('realtime  ', time.time())
print('monotic   ', time.perf_counter())

The output is as follows:

1.8.1
begin      1618284730.230899
end        1618284733.091974
realtime   1618299047.4422061
monotic    14317.211549314

My question is why the event times are so different from both clocks. The result I hope for is that all timestamps are consistent with the monotonic clock.

Can I achieve this by modifying the source code, and if so, what should I do?
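One workaround sketch that avoids modifying the source: capture a wall-clock reference immediately around the profiled region and shift the event timestamps by the measured offset. This is only approximate, and it assumes the profiler clock does not drift over the profiled interval.

import time
import torch

wall_before = time.time()
with torch.profiler.profile() as p:
    torch.randn(2048, 2048) @ torch.randn(2048, 2048)
wall_after = time.time()

events = p.events()
prof_start_s = min(e.time_range.start for e in events) / 1e6  # profiler clock, seconds
offset = wall_before - prof_start_s                            # approximate clock offset

for e in events[:3]:
    print("approx wall-clock start:", e.time_range.start / 1e6 + offset)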

Show scopes for operations

Hi,
A key feature we are looking for, but which is currently missing, is showing scopes for operations (like the TensorFlow profiler does).

Example:

empty_

vs.

SequentialModel/layers[2]/Attention/empty_

We want the 2nd option.
(With the 2nd option, combined statistics for all occurrences of empty_ could still be obtained by grouping.)

According to the PyTorch API, using with_stack=True only records file and line numbers.
Unfortunately, file and line numbers are still ambiguous, since we can have both

  • SequentialModel/layers[2]/Attention/empty_
  • SequentialModel/layers[3]/SomeModelWithAttention/Attention/empty_

which would point to the same place in code.
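This is not an existing profiler feature, but a rough sketch of how module-qualified scopes could be approximated today with forward hooks and record_function (not re-entrant safe; names come from nn.Module.named_modules rather than the TensorFlow-style scope shown above):

import torch
from torch.profiler import record_function


def annotate_module_scopes(model):
    # Wrap every submodule's forward in a record_function named after the module.
    handles = []
    for name, module in model.named_modules():
        if not name:
            continue  # skip the unnamed root module
        state = {}

        def pre_hook(mod, inputs, name=name, state=state):
            state["rf"] = record_function(name)
            state["rf"].__enter__()

        def post_hook(mod, inputs, output, state=state):
            state["rf"].__exit__(None, None, None)

        handles.append(module.register_forward_pre_hook(pre_hook))
        handles.append(module.register_forward_hook(post_hook))
    return handles  # call .remove() on these handles to undo the annotation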

Negative number in input textbox

In the Operator view, the user can currently enter a negative number into "Top kernels to show". This should be forbidden.
Steps to reproduce:

  1. Move the mouse to the start of the number and click.
  2. Type the minus character "-" and press Enter.

[screenshot]

[screenshot]

Profiler gets stuck on trace handling for a toy training loop

Basic setup info

device: Tesla V100-SXM2-32GB
ram: 64GB
pytorch: 1.8
cudatoolkit: 10.2
python: 3.7.8
environment: conda 4.7.5
os: CentOS Linux release 7.9.200

Description
Hi, I tried to export a profiler trace for one epoch of training on a tutorial toy problem, following the examples for your new profiler API. Unfortunately, whenever training finishes and trace handling is called, either via the profiler or manually, it gets stuck indefinitely and never outputs anything. I observed that from the moment the trace handler is entered, host RAM consumption increases from 10 to 25 GB over a couple of minutes and stays there. Mindful of legacy profiler issues, I checked the impact of setting the DataLoader's num_workers to 0, but it didn't seem to play a role. Any help appreciated.

Conda's environment.yml

name: torch18
channels:
  - conda-forge
  - defaults
  - pytorch
dependencies:
  - matplotlib
  - cudatoolkit=10.2
  - pytorch=1.8.0
  - torchvision=0.9.0
  - request
  - request-oauthlib
  - pip:
    - tensorboard==1.15.0
    - tensorboard-plugin-wit==1.8.0
    - torch-tb-profiler
    - pynvml

Minimal Example

import sys

import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms

device = torch.device("cuda:0")

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True,
                                        transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True,
                                          num_workers=2)


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()
net.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

tb_handler = torch.profiler.tensorboard_trace_handler('./log')

with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA]) as p:
    for epoch in range(1):  # loop over the dataset multiple times
        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data[0].to(device), data[1].to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print('[%d, %5d] loss: %.3f' %
                      (epoch + 1, i + 1, running_loss / 2000))
                running_loss = 0.0
        p.step()
    print("epochs done")
    tb_handler(p)
    print("export done") # never gets here

print('Finished Training')
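A workaround sketch, reusing the definitions from the minimal example above: step the profiler every batch and use a schedule so only a handful of steps are recorded, which keeps the trace the handler must serialize small (the schedule numbers are illustrative).

with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA],
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=5),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log')) as p:
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data[0].to(device), data[1].to(device)
        optimizer.zero_grad()
        loss = criterion(net(inputs), labels)
        loss.backward()
        optimizer.step()
        p.step()  # step the profiler once per batch, not once per epoch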

Chrome tracing json file fail to load because of unexpected string encoding

When using the PyTorch profiler with Kineto enabled to profile a model such as torchvision.alexnet, the dumped chrome tracing file can't be loaded by chrome://tracing.
It is caused by a string with unexpected encoding in the file:
[screenshot]

The running environment:
OS version: Ubuntu 18.04; Python: 3.8.5; CUDA: 11.1
PyTorch install: https://download.pytorch.org/whl/test/cu111/torch-1.8.0%2Bcu111-cp38-cp38-linux_x86_64.whl
Torchvision install: https://download.pytorch.org/whl/test/cu111/torchvision-0.9.0%2Bcu111-cp38-cp38-linux_x86_64.whl

The corrupted string causes the dumped chrome tracing file to fail to load, which in turn means our TensorBoard plugin cannot show it.
It also makes the PyTorch profiler CLI output confusing to the user: the corrupted event is shown as an empty string.
[screenshot]

Profiler issue for our Cloud Advocates notebook related to multiple file workers.

The Cloud Advocates team is working on a demo for Build and wants to include a profiler section in it.
I walked them through the setup, API, etc., but we see some odd behaviour. Here are the notebook and training loop:
tlaloc/explore.ipynb at main · sethjuarez/tlaloc (github.com) (cell 21)

When running, we see that the profiler generates two trace files instead of one. See the logs directory: tlaloc/notebooks/logs at main · sethjuarez/tlaloc

  • Two trace files are generated for the run.
  • The TB plugin shows 4 workers in the configuration pane. Why, when there is only one GPU worker?

[screenshot]

[screenshot]

  • In the traces we also see lots of GPU streams. Is this expected?

[screenshot]
