
kineto's Introduction

Kineto

Kineto is part of the PyTorch Profiler.

The Kineto project enables:

  • performance observability and diagnostics across common ML bottleneck components
  • actionable recommendations for common issues
  • integration of external system-level profiling tools
  • integration with popular visualization platforms and analysis pipelines

A central component is Libkineto, a profiling library with special focus on low-overhead GPU timeline tracing.

Libkineto

Libkineto is an in-process profiling library integrated with the PyTorch Profiler. Please refer to the README file in the libkineto folder as well as documentation on the new PyTorch Profiler API.

Holistic Trace Analysis

Holistic Trace Analysis (HTA) is an open source performance debugging library aimed at distributed workloads. HTA takes as input PyTorch Profiler traces and elevates the performance bottlenecks to enable faster debugging. Here's a partial list of features in HTA:

  1. Temporal Breakdown: Breakdown of GPU time in terms of time spent in computation, communication, memory events, and idle time on a single node and across all ranks.
  2. Idle Time Breakdown: Breakdown of GPU idle time into waiting for the host, waiting for another kernel or attributed to an unknown cause.
  3. Kernel Breakdown: Find kernels with the longest duration on each rank.
  4. Kernel Duration Distribution: Distribution of average time taken by longest kernels across different ranks.
  5. Communication Computation Overlap: Calculate the percentage of time when communication overlaps computation.

For a complete list see here.
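As a quick orientation, here is a minimal usage sketch. It assumes the HTA Python package (published as HolisticTraceAnalysis) is installed and exposes the TraceAnalysis entry point; the trace directory path is a placeholder for a folder containing one PyTorch Profiler trace per rank.

from hta.trace_analysis import TraceAnalysis

# Point HTA at a directory with one trace file per rank (placeholder path).
analyzer = TraceAnalysis(trace_dir="/path/to/trace/folder")

# Feature 1: temporal breakdown of GPU time per rank.
temporal_df = analyzer.get_temporal_breakdown()

# Feature 5: communication/computation overlap percentage per rank.
overlap_df = analyzer.get_comm_comp_overlap()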

PyTorch TensorBoard Profiler (Deprecated)

The goal of the PyTorch TensorBoard Profiler is to provide a seamless and intuitive end-to-end profiling experience, including straightforward collection from PyTorch and insightful visualizations and recommendations in the TensorBoard UI. Please refer to the README file in the tb_plugin folder.

Future Development Direction

Some areas we're currently working on:

  • Support for tracing distributed workloads
  • Trace processing, analysis and recommendation engine
  • System-level activities, multiple tracing sources
  • Profiling and monitoring daemon for larger scale deployments

Releases and Contributing

We will follow the PyTorch release schedule, which happens roughly every three months.

We appreciate all contributions. If you are planning to contribute back bug-fixes, please do so without any further discussion.

If you plan to contribute new features, please first open an issue and discuss the feature with us. Sending a PR without discussion might end up resulting in a rejected PR because we might be taking the infrastructure in a different direction than you might be aware of. We expect the architecture to keep evolving.

License

Kineto has a BSD-style license, as found in the LICENSE file.

kineto's People

Contributors

aaronenyeshi, astachowiczhabana, bertmaher, briancoutinho, chaekit, cloudhan, davidberard98, dzhulgakov, fenypatel99, fwenguang, gaoteng-git, gdankel, guotuofeng, guyang3532, ilia-cher, kimishpatel, malfet, mantaionut, mwootton, openrichardfb, r-barnes, shinytang6, slgong-fb, sraikund16, valentinandrei, wizzniu, wraymo, xw285cornell, yoyoyocmu, yuguo68


kineto's Issues

Kernels overlap in the same CUDA stream

Sometimes (this rarely happens), when opening this JSON file in chrome://tracing, the "stream 7" track shows two lines, with a small triangle in front of it.
I found it is caused by a very thin kernel with "dur" 0. This kernel overlaps with another event, so two lines are shown for stream 7.
[screenshot]
It's worth analyzing further why two kernels overlap in the same stream.

[Task] Add support for saving chrome trace files to S3 URLs and Azure Blob

Add support for saving the generated chrome trace files to S3 URLs. This works for the TensorBoard SummaryWriter but is not supported for Profiler traces.

Current Behavior
on_trace_ready=torch.profiler.tensorboard_trace_handler('s3://tb-demo/pytorch/')

Creates local files like:

(base) ubuntu@ip-172-31-22-142:~$ ls 
s3\:/tb-demo/pytorch/ip-172-31-22-142_14545.1618682202565.pt.trace.json 
s3:/tb-demo/pytorch/ip-172-31-22-142_14545.1618682202565.pt.trace.json

Expected Behavior
The chrome trace files should be saved in the S3 bucket
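Until this is supported natively, one possible workaround is to export the trace locally and upload it afterwards. The sketch below is not part of the Profiler API: it assumes boto3 is installed with AWS credentials configured, and the bucket and prefix names are placeholders.

import os
import socket
import time

import boto3


def s3_trace_handler(bucket, prefix, local_dir="/tmp"):
    s3 = boto3.client("s3")

    def handler(prof):
        # Mirror the tensorboard_trace_handler naming convention.
        file_name = "{}_{}.{}.pt.trace.json".format(
            socket.gethostname(), os.getpid(), int(time.time() * 1000))
        local_path = os.path.join(local_dir, file_name)
        prof.export_chrome_trace(local_path)  # write the trace locally first
        s3.upload_file(local_path, bucket, "{}/{}".format(prefix, file_name))

    return handler


# Usage with placeholder names: on_trace_ready=s3_trace_handler("tb-demo", "pytorch")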

Add ProfilerAction.PAUSE

A ProfilerAction.PAUSE could be used to skip some batches when switching from train_dataloader to validation_dataloader
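For reference, a rough approximation that is possible today (a sketch, not a real PAUSE): arrange the profiler schedule so that the validation batches fall into the "wait" phase and are not recorded.

import torch

# Illustrative numbers only: if roughly 10 validation batches run between profiled
# training phases, the schedule's wait phase can absorb them, so those steps are
# not recorded; prof.step() is still called once per batch, including validation.
prof = torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=10, warmup=1, active=3, repeat=2),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log"),
)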

TensorBoard hangs in VS Code

VS Code opens TensorBoard, but it hangs forever (displaying a spinner). The same happens in the browser, but there it is resolved by refreshing the page. The cause appears to be a race between TensorBoard loading/finding the trace files and the web page opening. Since VS Code loads the frame at the same time as it starts TensorBoard, and the frame cannot be refreshed in VS Code, there is no easy way for the user to get the page to load.

[screenshot]

security warning in tensorboard

A warning from "security_validator.py" appears repeatedly in the TensorBoard logs:

W0303 12:02:32.939834 139855689737984 security_validator.py:51] In 3.0, this warning will become an error
X-Content-Type-Options is required to be "nosniff"

[Document] Profiler API Documentation

Update API Documentation https://pytorch.org/docs/stable/profiler.html

Feedback from Users:

  • Document the difference between the "wait" and "warmup" parameters of the profiler schedule. It would be great if their meaning could be clarified in the documentation.

  • Explicitly state in the PyTorch docs that an additional plugin needs to be installed to use TensorBoard. The GitHub link states it clearly, but in my opinion it should be more explicit, so I suggest putting that clarification in the main PyTorch documentation.

  • Document the profiler.step function; its documentation could be improved.

  • Document record_function. Is there any way to mark regions of the code to be grouped together in profiling? (A short sketch follows this list.)

  • Document the trace event lifecycle. Our current code has an infinite loop in a function, from which we return in the middle when some "exit condition" is satisfied (either a number of steps or validation loss no longer improving). While we can certainly refactor that loop to exit "normally" and thus trigger the "exit" call of the profiler context manager, it would be good to have a helper (or at least a documentation example) showing how best to handle this training-loop structure. TensorBoard events still seem to be logged properly, but if I try to summarize p.key_averages() inside the "with torch.profiler" block (before the "return"), I get "RuntimeError: can't export a trace that didn't finish running". This is not a big deal, since we can refactor the code, but it would be nice if such a structure were generally supported (maybe it already is, in which case examples in the docs would help).
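Regarding the record_function question above, here is a short sketch of how regions can already be marked with the existing torch.profiler.record_function API (the region names are arbitrary):

import torch

with torch.profiler.profile() as prof:
    with torch.profiler.record_function("data_preprocessing"):
        x = torch.randn(128, 1024)
    with torch.profiler.record_function("forward"):
        y = torch.nn.Linear(1024, 10)(x)

# Named regions appear as entries that can be grouped and sorted like operators.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))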

Why CUDA_SOURCE_DIR instead of CUDA_HOME?

This is sort of nitpicking.

According to the libkineto README and CMakeLists.txt,

if CUDA_SOURCE_DIR is not set, libkineto will fail to build:

if (NOT CUDA_SOURCE_DIR)
set(CUDA_SOURCE_DIR "$ENV{CUDA_SOURCE_DIR}")
message(INFO " CUDA_SOURCE_DIR = ${CUDA_SOURCE_DIR}")
endif()

However, the build script of PyTorch uses CUDA_HOME.
Why did you choose CUDA_SOURCE_DIR instead of CUDA_HOME? Is there any reason for this?

warnings in profiler console output

A warning from "security_validator.py" appears repeatedly in the TensorBoard logs:
W0303 12:02:32.939834 139855689737984 security_validator.py:51] In 3.0, this warning will become an error
X-Content-Type-Options is required to be "nosniff"

Separate traces for training and validation steps

Our code is structured in such a way that inside the main training loop we will sometimes run validation as well. Ideally, it should be logged separately from training. Is there any functionality to support this?
E.g. similar to what #86 is asking for, but ideally we would be able to spawn a second profiler instance (or provide some metadata saying that the current steps are validation rather than training). A workaround sketch follows.
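A workaround sketch under current APIs (train_loader, val_loader, train_step and validation_step are placeholders): run the validation phase under a second, sequential profiler context that writes to its own log directory.

import torch

train_handler = torch.profiler.tensorboard_trace_handler("./log/train")
val_handler = torch.profiler.tensorboard_trace_handler("./log/val")

with torch.profiler.profile(on_trace_ready=train_handler) as p:
    for batch in train_loader:      # placeholder training loop
        train_step(batch)
        p.step()

with torch.profiler.profile(on_trace_ready=val_handler) as p:
    for batch in val_loader:        # placeholder validation loop
        validation_step(batch)
        p.step()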

Build failure on Ubuntu 16.04

Log

/usr/bin/c++ -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DMAGMA_V2 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTH_BLAS_MKL -DUSE_EXTERNAL_MZCRC -D_FILE_OFFSET_BITS=64 -I../cmake/../third_party/benchmark/include -Icaffe2/contrib/aten -I../third_party/onnx -Ithird_party/onnx -I../third_party/foxi -Ithird_party/foxi -I../third_party/kineto/libkineto/include -I../third_party/kineto/libkineto/src -I../third_party/fmt/include -I/usr/local/cuda/extras/CUPTI/include -I/usr/local/cuda/include -isystem third_party/gloo -isystem ../cmake/../third_party/gloo -isystem ../cmake/../third_party/googletest/googlemock/include -isystem ../cmake/../third_party/googletest/googletest/include -isystem ../third_party/protobuf/src -isystem /home/chester/miniconda3/envs/pytorch-build-py37/include -isystem ../third_party/gemmlowp -isystem ../third_party/neon2sse -isystem ../third_party/XNNPACK/include -isystem ../third_party -isystem ../cmake/../third_party/eigen -isystem /home/chester/miniconda3/envs/pytorch-build-py37/include/python3.7m -isystem /home/chester/miniconda3/envs/pytorch-build-py37/lib/python3.7/site-packages/numpy/core/include -isystem ../cmake/../third_party/pybind11/include -isystem /usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent -isystem /usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent/include -isystem /usr/lib/openmpi/include -isystem /usr/lib/openmpi/include/openmpi -isystem ../cmake/../third_party/cub -isystem ../third_party/ideep/mkl-dnn/include -isystem ../third_party/ideep/include -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -O3 -DNDEBUG -DNDEBUG -fPIC -fvisibility=hidden -DCAFFE2_USE_GLOO -DCUDA_HAS_FP16=1 -DHAVE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD -DKINETO_NAMESPACE=libkineto -std=gnu++14 -DHAS_CUPTI -std=c++14 -MD -MT third_party/kineto/libkineto/CMakeFiles/kineto_base.dir/src/cupti_strings.cpp.o -MF third_party/kineto/libkineto/CMakeFiles/kineto_base.dir/src/cupti_strings.cpp.o.d -o third_party/kineto/libkineto/CMakeFiles/kineto_base.dir/src/cupti_strings.cpp.o -c ../third_party/kineto/libkineto/src/cupti_strings.cpp
../third_party/kineto/libkineto/src/cupti_strings.cpp: In function ‘const char* libkineto::runtimeCbidName(CUpti_CallbackId)’:
../third_party/kineto/libkineto/src/cupti_strings.cpp:478:105: error: expected ‘,’ before ‘)’ token
   static_assert(CUPTI_RUNTIME_TRACE_CBID_SIZE < (sizeof(runtimeCbidNames) / sizeof(runtimeCbidNames[0])));
                                                                                                         ^
../third_party/kineto/libkineto/src/cupti_strings.cpp:478:105: error: expected string-literal before ‘)’ token

Environment

Collecting environment information...
PyTorch version: 1.9.0a0+gitebfa927
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 16.04.7 LTS (x86_64)
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Clang version: Could not collect
CMake version: version 3.18.2

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080
GPU 1: GeForce GTX 1080

Nvidia driver version: 440.33.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.1.3
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Loader doesn't correctly parse trace filename when repeat>1

https://github.com/pytorch/pytorch/blob/7d4e9bdba144e162882fb854324430c4b92fb267/torch/profiler/profiler.py#L75
Here's how tensorboard_trace_handler decides the filename

file_name = "{}.{}.pt.trace.json".format(worker_name, int(time.time() * 1000))

And here's how the loader parses it

worker = path[:-len(pattern)]

When repeat > 1, there will be multiple trace files from each worker with different timestamps, yet the loader will identify them as different workers. A more robust parse is sketched below.
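A sketch of a more robust parse on the loader side, assuming the "{worker}.{timestamp}.pt.trace.json" naming shown above (variable names are illustrative):

# Example file name produced by tensorboard_trace_handler.
path = "ip-172-31-22-142_14545.1618682202565.pt.trace.json"

suffix = ".pt.trace.json"
stem = path[:-len(suffix)]                   # "ip-172-31-22-142_14545.1618682202565"
worker, _, timestamp = stem.rpartition(".")  # split off the trailing timestamp
# worker == "ip-172-31-22-142_14545", identical across repeats of the same worker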

Chrome tracing file can't distinguish async tasks from sync tasks

While reviewing the code of the PyTorch profiler and Kineto, I found that for async tasks there is an inconsistency in how thread IDs are assigned to FunctionEvent and ClientTraceActivity.

  1. When running with @torch.jit.script, torch.jit.fork and torch.jit.wait (such as the first example in Dynamic Parallelism in TorchScript), the "example" function is an async task, so its start_thread_id and end_thread_id in the profiling result differ.
    In torch/autograd/profiler.py, this kind of event is handled carefully by checking "is_async": no children are attached to it and no self time is calculated for it.
    However, in the dumped chrome tracing file (with Kineto enabled), "example" appears only as a normal "Operator" with "ph" set to "X". Nothing indicates that it is an async task; the end thread's ID is simply assigned to "tid".
    My concern: is there a risk in treating async tasks as if they were sync tasks? If we do, could they overlap (cross, rather than contain) each other in the chrome tracing UI?
    And if we can't distinguish async tasks from sync tasks, our tb_plugin may not produce the same results as the PyTorch autograd profiler; for example, we would compute self time for an event without knowing that it is actually async.

Two other minor issues:

  1. There are three thread IDs:
    a. In profiler_kineto.cpp, the thread ID is obtained from at::RecordFunction::currentThreadId() (start, end).
    b. In the chrome tracing file, "tid" is the real pthread ID.
    c. In the chrome tracing file, the ID in "thread_name" is obtained from ChromeTraceLogger::renameThreadID.
    Maybe (a) and (c) could be made more consistent?

  2. Line 56 in profiler_kineto.cpp seems redundant; keeping only line 73 should be enough.

Is kineto usable with pytorch 1.8?

Hi, I have noticed the use_kineto argument in torch.autograd.profiler.profile's signature (PyTorch 1.8), but the docs you point to in the main readme (i.e. https://pytorch.org/docs/master/profiler.html) make no mention of Kineto [anymore]. Hence the question in the title: is the Kineto project still alive and usable with PyTorch 1.8?

If so, are there any additional installs required beyond PyTorch itself for Kineto to work?
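For reference, a hedged sketch of how the Kineto-backed profiler is typically invoked in PyTorch 1.8.1 and later; it assumes a CUDA build of PyTorch with Kineto/CUPTI support, and no extra install is needed for tracing itself (only the optional torch-tb-profiler plugin for TensorBoard visualization).

import torch

with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU,
                    torch.profiler.ProfilerActivity.CUDA]) as prof:
    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")
    (a @ b).sum().item()  # force the GPU work to complete inside the profiled region

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))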

"memory bandwidth" with invalid value inf in dumped chrome tracing file

A colleague got a "Memcpy" event with duration 0 when profiling with Kineto. This makes the "memory bandwidth (GB/s)" value inf, so the file can't be opened by chrome://tracing and our TensorBoard plugin's json.load fails on it.

Because inf is neither a string nor a number, the file cannot be parsed as valid JSON.
A solution is to check whether "dur" is 0 and, if so, write the bandwidth as the string "inf" instead.

The bug case:

{
  "schemaVersion": 1,
  "traceEvents": [
  
  {
    "ph": "X", "cat": "Memcpy", 
    "name": "Memcpy HtoD (Pageable -> Device)", "pid": 0, "tid": "stream 7",
    "ts": 1614800519473220, "dur": 0,
    "args": {
      "device": 0, "context": 1,
      "stream": 7, "correlation": 20002, "external id": 3981,
      "bytes": 200, "memory bandwidth (GB/s)": inf
    }
  }
]}
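Until the logger is fixed, a naive consumer-side workaround sketch (assuming the only bare inf tokens in the file come from this field) is to rewrite them into strings before parsing:

import json

with open("trace.pt.trace.json") as f:
    text = f.read()

# `inf` is not a valid JSON token, so patch it into a string before parsing.
data = json.loads(text.replace(": inf", ': "inf"'))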

Add support for visualizing chrome trace files in tb_plugin from S3 URLs

For chrome trace files in S3 bucket, the tb_plugin UI gets stuck in spinning wheel to load the files and nothing gets displayed. Add support for visualizing the traces from S3 URLs for the Profiler plugin, similar to how one can view the main tensorboard runs in the rest of the tensorboard UI.

'"ts": 0' event is found in kineto's chrome tracing file

When I train a resnet50 model with a large batch size such as 128 or 256, the profiler dumps events like the following in the chrome tracing file:

  {
    "ph": "X", "cat": "Kernel", 
    "name": "volta_sgemm_64x64_nt", "pid": 0, "tid": "stream 7",
    **"ts": 0, "dur": 0,**
    "args": {
      "queued": 0, "device": 0, "context": 1,
      "stream": 7, "correlation": 106123, "external id": 34735,
      "registers per thread": 126,
      "shared memory": 8192,
      "warps per SM": 14.4,
      "grid": [4, 4, 36],
      "block": [64, 1, 1]
    }
  },
  {
    "ph": "f", "id": 106123, "pid": 0, "tid": "stream 7", "ts": 0,
    "cat": "async", "name": "launch", "bp": "e"
  },
  {
    "ph": "X", "cat": "Runtime", 
    "name": "cudaLaunchKernel", "pid": 29463, "tid": "3239786240",
    "ts": 1615207264505000, "dur": 6,
    "args": {
      "cbid": 211, "correlation": 106123,
      "external id": 34735, "external ts": 1615207264504915
    }
  },
  {
    "ph": "s", "id": 106123, "pid": 29463, "tid": 3239786240, "ts": 1615207264505000,
    "cat": "async", "name": "launch"
  },

You can see "ts" is 0 and "dur" is 0 in above kernel. And there are more than 1 thousand these events in the file.

The model code I used is the plugin's resnet50 example, with "batch_size" changed from 32 to 128.
The GPU: NVIDIA V100.
The torch whl: https://download.pytorch.org/whl/nightly/cu111/torch-1.9.0.dev20210305%2Bcu111-cp38-cp38-linux_x86_64.whl
The torchvision whl: https://download.pytorch.org/whl/nightly/cu111/torchvision-0.9.0.dev20210305%2Bcu111-cp38-cp38-linux_x86_64.whl

Because these zero-ts kernels are launched by runtime calls near the end of profiling, I guess they are executed after "profiler stop". This is my snapshot:
[screenshot]
You can see that the kernels which should execute after "profiler stop" are not painted here (they are painted at time 0), and the last operators have no related kernels, even though they did launch kernels according to the chrome tracing file.

It is also a bad experience for the user, who will be confused by the following in this case (ts=0 makes these events start at time 0, while the other events then appear to start many years later):
[screenshot]

By the way, the expected behavior is to correctly dump the kernels launched between the profiler's start and stop, not to remove them from the file, because the operators' kernel time needs to be correct.
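As a viewing-only workaround (the real fix, as noted above, is to record these kernels with correct timestamps), the zero-timestamp events can be filtered out before opening the file in chrome://tracing; a rough sketch:

import json

with open("trace.pt.trace.json") as f:
    trace = json.load(f)

# Drop complete/flow events whose timestamp is 0 so the remaining events
# render on a sane time axis; metadata events without "ts" are kept.
trace["traceEvents"] = [
    e for e in trace["traceEvents"]
    if not (e.get("ph") in ("X", "f", "s") and e.get("ts") == 0)
]

with open("trace.filtered.json", "w") as f:
    json.dump(trace, f)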

Intended usage

What is the intended usage of this library? It doesn't look like there are any examples or a blog post up yet. Do I need to build a C++ program around this library? How do I then invoke it while running a Python PyTorch job?
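To sketch an answer (based on the Libkineto README and the PyTorch Profiler docs; the model below is just an example): no separate C++ program is needed for typical use, since libkineto is linked into PyTorch and driven from Python through torch.profiler.

import torch
import torchvision.models as models

model = models.resnet18().cuda()
x = torch.randn(8, 3, 224, 224, device="cuda")

with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU,
                    torch.profiler.ProfilerActivity.CUDA]) as prof:
    model(x)

# The GPU timeline captured by libkineto, viewable in chrome://tracing.
prof.export_chrome_trace("resnet18_trace.json")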

Page scroll up

Scroll the page down:
[screenshot]

Input a string into "Search by Name" and press enter:
[screenshot]

The page automatically scrolls up. This is a poor user experience and makes the user lose focus.

The "Group By" still behave like this.
The "Group By" and "Search by Name" in "Kernel View" also behave like this.

Why the time range of events are strange?

I'm sorry, I probably shouldn't ask questions here, but I have tried asking on Stack Overflow and the PyTorch forums and no one responded.

My problem is that the timestamps of the events reported by the PyTorch profiler look strange.

# %%
import torch
import torchvision.models as models
import time
print(torch.__version__)

model = models.resnet18().cuda()
inputs = torch.randn(5, 3, 224, 224).cuda()
with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA]
) as p:
    outputs = model(inputs)

# %%
events = p.events()
print('begin     ', min(events, key=lambda e: e.time_range.start).time_range.start / 1000000)
print('end       ', max(events, key=lambda e: e.time_range.end).time_range.end / 1000000)
print('realtime  ', time.time())
print('monotic   ', time.perf_counter())

The output is as follows:

1.8.1
begin      1618284730.230899
end        1618284733.091974
realtime   1618299047.4422061
monotic    14317.211549314

My question is why the event times are so different from both clocks. The result I hope for is that all timestamps are consistent with the monotonic clock.

Can I achieve this by modifying the source code, and if so, what should I do?
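One workaround sketch that avoids modifying the source: capture a wall-clock reference immediately around the profiled region and shift the event timestamps by the measured offset. This is only approximate, and it assumes the profiler clock does not drift over the profiled interval.

import time
import torch

wall_before = time.time()
with torch.profiler.profile() as p:
    torch.randn(2048, 2048) @ torch.randn(2048, 2048)
wall_after = time.time()

events = p.events()
prof_start_s = min(e.time_range.start for e in events) / 1e6  # profiler clock, seconds
offset = wall_before - prof_start_s                            # approximate clock offset

for e in events[:3]:
    print("approx wall-clock start:", e.time_range.start / 1e6 + offset)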

Show scopes for operations

Hi,
A key feature we are looking for, but which is currently missing, is showing scopes for operations (like the TensorFlow profiler does).

Example:

empty_

vs.

SequentialModel/layers[2]/Attention/empty_

We want the 2nd option.
(With the 2nd option, combined statistics for all occurrences of empty_ could still be obtained by grouping.)

According to the PyTorch API, using with_stack=True only records file and line numbers.
Unfortunately, file and line numbers are still ambiguous, since we can have both

  • SequentialModel/layers[2]/Attention/empty_
  • SequentialModel/layers[3]/SomeModelWithAttention/Attention/empty_

which would point to the same place in code.
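This is not an existing profiler feature, but a rough sketch of how module-qualified scopes could be approximated today with forward hooks and record_function (not re-entrant safe; names come from nn.Module.named_modules rather than the TensorFlow-style scope shown above):

import torch
from torch.profiler import record_function


def annotate_module_scopes(model):
    # Wrap every submodule's forward in a record_function named after the module.
    handles = []
    for name, module in model.named_modules():
        if not name:
            continue  # skip the unnamed root module
        state = {}

        def pre_hook(mod, inputs, name=name, state=state):
            state["rf"] = record_function(name)
            state["rf"].__enter__()

        def post_hook(mod, inputs, output, state=state):
            state["rf"].__exit__(None, None, None)

        handles.append(module.register_forward_pre_hook(pre_hook))
        handles.append(module.register_forward_hook(post_hook))
    return handles  # call .remove() on these handles to undo the annotation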

Negative number in input textbox

In the Operator view, the user can currently enter a negative number into "Top kernels to show". This should be forbidden.
Steps to reproduce:

  1. Move the mouse to the start of the number and click.
  2. Type the minus character "-" and press Enter.

[screenshot]

[screenshot]

Profiler gets stuck on trace handling for a toy training loop

Basic setup info

device: Tesla V100-SXM2-32GB
ram: 64GB
pytorch: 1.8
cudatoolkit: 10.2
python: 3.7.8
environment: conda 4.7.5
os: CentOS Linux release 7.9.200

Description
Hi, I tried to export a profiler trace for one epoch of training on a tutorial toy problem, following the examples for your new profiler API. Unfortunately, whenever training finishes and trace handling is called, either via the profiler or manually, it gets stuck indefinitely and never outputs anything. I observed that from the moment the trace handler is entered, host RAM consumption increases from 10 to 25 GB over a couple of minutes and stays there. Mindful of legacy profiler issues, I checked the impact of setting the DataLoader's num_workers to 0, but it didn't seem to play a role. Any help appreciated.

Conda's environment.yml

name: torch18
channels:
  - conda-forge
  - defaults
  - pytorch
dependencies:
  - matplotlib
  - cudatoolkit=10.2
  - pytorch=1.8.0
  - torchvision=0.9.0
  - request
  - request-oauthlib
  - pip:
    - tensorboard==1.15.0
    - tensorboard-plugin-wit==1.8.0
    - torch-tb-profiler
    - pynvml

Minimal Example

import sys

import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms

device = torch.device("cuda:0")

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True,
                                        transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True,
                                          num_workers=2)


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()
net.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

tb_handler = torch.profiler.tensorboard_trace_handler('./log')

with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA]) as p:
    for epoch in range(1):  # loop over the dataset multiple times
        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data[0].to(device), data[1].to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print('[%d, %5d] loss: %.3f' %
                      (epoch + 1, i + 1, running_loss / 2000))
                running_loss = 0.0
        p.step()
    print("epochs done")
    tb_handler(p)
    print("export done") # never gets here

print('Finished Training')
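A workaround sketch, reusing the definitions from the minimal example above: step the profiler every batch and use a schedule so only a handful of steps are recorded, which keeps the trace the handler must serialize small (the schedule numbers are illustrative).

with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA],
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=5),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log')) as p:
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data[0].to(device), data[1].to(device)
        optimizer.zero_grad()
        loss = criterion(net(inputs), labels)
        loss.backward()
        optimizer.step()
        p.step()  # step the profiler once per batch, not once per epoch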

Chrome tracing json file fail to load because of unexpected string encoding

When using the PyTorch profiler with Kineto enabled to profile a model such as torchvision.alexnet, the dumped chrome tracing file can't be loaded by chrome://tracing.
It is caused by a string with unexpected encoding in the file:
[screenshot]

The running environment:
OS version: Ubuntu 18.04; Python: 3.8.5; CUDA: 11.1
PyTorch install: https://download.pytorch.org/whl/test/cu111/torch-1.8.0%2Bcu111-cp38-cp38-linux_x86_64.whl
Torchvision install: https://download.pytorch.org/whl/test/cu111/torchvision-0.9.0%2Bcu111-cp38-cp38-linux_x86_64.whl

The corrupted string causes the dumped chrome tracing file to fail to load, which in turn means our TensorBoard plugin cannot show it.
It also makes the PyTorch profiler CLI output confusing to the user: the corrupted event is shown as an empty string.
[screenshot]

Profiler issue for our Cloud Advocates notebook related to multiple file workers.

The Cloud Advocates team is working on a demo for Build and wants to include a profiler section in it.
I walked them through the setup, API, etc., but we see some odd behaviour. Here are the notebook and training loop:
tlaloc/explore.ipynb at main · sethjuarez/tlaloc (github.com) (cell 21)

When running, we see that the profiler generates two trace files instead of one. See the logs directory: tlaloc/notebooks/logs at main · sethjuarez/tlaloc

  • Two trace files are generated for the run.
  • The TB plugin shows 4 workers in the configuration pane. Why, when there is only one GPU worker?

[screenshot]

[screenshot]

  • In the traces we also see lots of GPU streams. Is this expected?

[screenshot]
