pytorch / kineto
A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters.
License: Other
Update API Documentation https://pytorch.org/docs/stable/profiler.html
Feedback from Users:
Document the difference between the "wait" and "warmup" parameters in the profiler schedule. It would be great if their meaning were clarified in the documentation.
Explicitly state in the PyTorch docs that an additional plugin needs to be installed to leverage TensorBoard. There is a link to GitHub, where it's clearly stated, but in my opinion it should be more explicit, so I suggest putting that clarification in the main PyTorch documentation.
Document the profiler.step function; its documentation could be improved.
Document record_function. Is there any way to mark regions of the code to be grouped together in profiling?
Document the trace event lifecycle. Our current code has an infinite loop in a function, from the middle of which we return when some "exit condition" is satisfied (either a number of steps, or validation loss no longer improving). While we can certainly refactor that loop to exit "normally" and thus trigger the "exit" call of the profiler context manager, it would be good to have some helper functionality (or at least an example in the documentation; see the sketch below) on how best to handle such a training-loop structure. I think the tb events are still properly logged, but if I try to summarize p.key_averages() inside the "with torch.profiler" block (before the "return" statement), I get a "RuntimeError: can't export a trace that didn't finish running". Again, this is not a big deal, since we can refactor the code, but it would be nice if such a structure were generally supported (maybe it already is; then it would be good to have examples in the docs).
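A minimal sketch of one workable structure, assuming a hypothetical loader and exit condition (this is not official guidance): keep the early exit, but call key_averages() only after the with block has closed, so the trace has finished running.

import torch

def train(model, loader, max_steps):
    # A return from inside the with block would also run __exit__, but
    # summarizing must wait until after the context manager has closed.
    with torch.profiler.profile(
            activities=[torch.profiler.ProfilerActivity.CPU]) as p:
        for step, batch in enumerate(loader):
            ...  # forward / backward / optimizer.step()
            p.step()
            if step >= max_steps:  # the "exit condition"
                break
    # Safe here: the trace finished when the with block exited.
    print(p.key_averages().table(sort_by="cpu_time_total"))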
Hi, I have noticed the use_kineto argument in torch.autograd.profiler.profile's signature (PyTorch 1.8), but not in the docs you point to in the main readme (i.e., https://pytorch.org/docs/master/profiler.html), which make no mention of kineto [anymore]. Hence the question from the title: is the kineto project still alive and usable with PyTorch 1.8?
If so, are there any additional installs required beyond PyTorch itself for kineto to work?
I'm sorry, I probably shouldn't ask questions here, but I have tried asking on Stack Overflow and the PyTorch forums, and no one responded.
My problem is that the timestamps of the events reported by the PyTorch profiler are strange.
# %%
import torch
import torchvision.models as models
import time

print(torch.__version__)

model = models.resnet18().cuda()
inputs = torch.randn(5, 3, 224, 224).cuda()

with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA]
) as p:
    outputs = model(inputs)

# %%
events = p.events()
print('begin ', min(events, key=lambda e: e.time_range.start).time_range.start / 1000000)
print('end ', max(events, key=lambda e: e.time_range.end).time_range.end / 1000000)
print('realtime ', time.time())
print('monotonic ', time.perf_counter())
The output is as follows:
1.8.1
begin 1618284730.230899
end 1618284733.091974
realtime 1618299047.4422061
monotonic 14317.211549314
My question is why the event times are so different from both clocks. What I would hope for is that all the timestamps are consistent with the monotonic clock.
Can I achieve the desired effect by modifying the source code, and if so, what should I do?
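Short of patching the source, one hedged workaround sketch is to estimate the offset between the profiler's event clock and the wall clock by bracketing a profiled run with time.time() samples (the workload below is illustrative):

import time
import torch

wall_before = time.time()
with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU]) as p:
    torch.randn(1000, 1000) @ torch.randn(1000, 1000)  # any small workload
wall_after = time.time()

events = p.events()
begin = min(e.time_range.start for e in events) / 1000000  # event clock, in seconds
offset = wall_before - begin  # approximate event-clock -> wall-clock offset
print('offset (s)     ', offset)
print('error bound (s)', wall_after - wall_before)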
What is the intended usage of this library? It doesn't look like there are any examples or blog posts up yet. Do I need to build a C++ program around this library? How do I then invoke it while running a Python PyTorch job?
For Chrome trace files in an S3 bucket, the tb_plugin UI gets stuck on a spinning wheel while loading the files and nothing gets displayed. Add support for visualizing traces from S3 URLs in the Profiler plugin, similar to how one can view the main TensorBoard runs in the rest of the TensorBoard UI.
The .json files are being saved by the profiler in the logdir, but when I open TensorBoard from JupyterHub, the pytorch_profiler tab doesn't seem to be rendered automatically. All the dependencies mentioned in the README are present in the hub environment.
When TensorBoard with the Profiler plugin is opened in VS Code, keyboard navigation with the arrow keys does not work, while it works in a regular browser.
https://anaconda.org/search?q=torch-tb-profiler doesn't return anything
https://github.com/pytorch/pytorch/blob/7d4e9bdba144e162882fb854324430c4b92fb267/torch/profiler/profiler.py#L75
Here's how tensorboard_trace_handler decides the filename:
file_name = "{}.{}.pt.trace.json".format(worker_name, int(time.time() * 1000))
And here's how the loader parses it:
When repeat > 1, there will be multiple trace files from each worker under different timestamps, and yet they will be identified as different workers by the loader.
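A hedged sketch of the grouping the loader could do instead: strip the trailing ".<timestamp>.pt.trace.json" suffix so every timestamped file maps back to its worker (the file name below is illustrative).

import re

def worker_of(file_name):
    # Matches "<worker_name>.<millisecond timestamp>.pt.trace.json".
    m = re.match(r'(?P<worker>.+)\.\d+\.pt\.trace\.json$', file_name)
    return m.group('worker') if m else file_name

print(worker_of('worker0.1618682202565.pt.trace.json'))  # -> worker0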
When using the PyTorch profiler with kineto enabled to profile a model such as torchvision.alexnet, the dumped Chrome tracing file can't be loaded by chrome://tracing.
It is caused by a string with unexpected encoding in the file:
The running environment:
OS version: Ubuntu 18.04; Python: 3.8.5; CUDA: 11.1
PyTorch install: https://download.pytorch.org/whl/test/cu111/torch-1.8.0%2Bcu111-cp38-cp38-linux_x86_64.whl
Torchvision install: https://download.pytorch.org/whl/test/cu111/torchvision-0.9.0%2Bcu111-cp38-cp38-linux_x86_64.whl
The corrupted string causes the dumped Chrome tracing file to fail to load, which in turn makes our TensorBoard plugin fail to show it.
It also makes the PyTorch profiler CLI output confusing to the user: the corrupted event is shown as an empty string.
Support exporting the Chrome tracing file in gzip format to reduce file size.
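Until that lands, a hedged workaround sketch: compress the exported trace after the fact ("trace.json" is a placeholder path; chrome://tracing can open gzipped traces).

import gzip
import shutil

with open('trace.json', 'rb') as src, gzip.open('trace.json.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)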
Through reviewing the code of the PyTorch profiler and kineto, I found that in the case of async tasks there is an inconsistency in assigning thread ids to FunctionEvent and ClientTraceActivity.
Two other tiny issues:
There are 3 thread ids:
2.1 In profiler_kineto.cpp, the thread id is obtained from at::RecordFunction::currentThreadId(start, end).
2.2 In the chrome tracing file, the "tid" is the real pthread id.
2.3 In the chrome tracing file, the id in "thread_name" is obtained from ChromeTraceLogger::renameThreadID.
Maybe 2.1 and 2.3 could be made more consistent?
Line 56 in profiler_kineto.cpp seems redundant; keeping only line 73 should be enough.
CUDA traces are not showing up for the ResNet sample in an AWS DLAMI with CUDA 10.2. The generated trace is attached:
trace.zip
The current tutorials point to the autograd profiler and need to be updated for the new profiler:
https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
https://pytorch.org/tutorials/beginner/profiler.html
Similar Item: pytorch/tutorials#1451
Basic setup info
device: Tesla V100-SXM2-32GB
ram: 64GB
pytorch: 1.8
cudatoolkit: 10.2
python: 3.7.8
environment: conda 4.7.5
os: CentOS Linux release 7.9.200
Description
Hi, I tried to export a profiler trace for 1 epoch of training on a tutorial toy problem using examples of your new profiler API. Unfortunately, whenever training finishes and trace handling is called, either via the profiler or manually, it gets stuck indefinitely and never outputs anything. I observed that from the moment the trace handler is entered, RAM consumption of the host increases from 10 to 25 GB over a couple of minutes and stays there. Mindful of legacy profiler issues, I checked the impact of setting the DataLoader's num_workers to 0, but it didn't seem to play a role. Any help appreciated.
Conda's environment.yml
name: torch18
channels:
  - conda-forge
  - defaults
  - pytorch
dependencies:
  - matplotlib
  - cudatoolkit=10.2
  - pytorch=1.8.0
  - torchvision=0.9.0
  - requests
  - requests-oauthlib
  - pip:
      - tensorboard==1.15.0
      - tensorboard-plugin-wit==1.8.0
      - torch-tb-profiler
      - pynvml
Minimal Example
import sys
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms

device = torch.device("cuda:0")

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True,
                                        transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True,
                                          num_workers=2)

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = Net()
net.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
tb_handler = torch.profiler.tensorboard_trace_handler('./log')

with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA]) as p:
    for epoch in range(1):  # loop over the dataset multiple times
        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data[0].to(device), data[1].to(device)
            # zero the parameter gradients
            optimizer.zero_grad()
            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print('[%d, %5d] loss: %.3f' %
                      (epoch + 1, i + 1, running_loss / 2000))
                running_loss = 0.0
            p.step()
    print("epochs done")
    tb_handler(p)
    print("export done")  # never gets here
print('Finished Training')
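For comparison, a hedged sketch of the incremental pattern from the profiler docs, using schedule and on_trace_ready so each active window is exported as it completes rather than handling the whole epoch's trace at the end (untested against this particular hang):

import torch

# Assumes `trainloader` from the example above.
with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU,
                    torch.profiler.ProfilerActivity.CUDA],
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log')) as p:
    for i, data in enumerate(trainloader, 0):
        ...  # one training step, as in the example above
        p.step()  # the handler fires at the end of each active window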
Should be changed to "with_stack": https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/
Running BERT training:
python /home/azureuser/pyprofiler/bert_for_sequence_classification.py --train-steps 64 --epochs 2 --pytorch-only
https://github.com/lenisha/pyprofiler/blob/main/bert_for_sequence_classification.py
and producing the following trace:
https://github.com/lenisha/pyprofiler/blob/main/trace/bert_record_nt/worker0.pt.trace.json
The plugin does not load the trace; it just shows a spinner.
The cloud advocates team is working on the demo for Build and is looking to include a profiler part in the demo.
We walked them through the setup, API, etc., but we see weird behaviour. Here are the notebook and training loop:
tlaloc/explore.ipynb at main · sethjuarez/tlaloc (github.com) (cell 21)
When running, we see the profiler generates two trace files instead of one. See the logs directory: tlaloc/notebooks/logs at main · sethjuarez/tlaloc
A ProfilerAction.PAUSE could be used to skip some batches when switching from the train_dataloader to the validation_dataloader; a sketch of a present-day approximation follows.
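Until a dedicated PAUSE action exists, a hedged sketch: torch.profiler.profile accepts any callable mapping a step number to a ProfilerAction, so ProfilerAction.NONE can approximate pausing around the dataloader switch (the step numbers below are illustrative).

import torch
from torch.profiler import ProfilerAction

def pausing_schedule(step):
    # Skip (do not record) a few batches around the switch to the
    # validation_dataloader; record everything else.
    if 100 <= step < 103:
        return ProfilerAction.NONE
    return ProfilerAction.RECORD

# usage: torch.profiler.profile(schedule=pausing_schedule, ...)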
There is a warning from "security_validator.py" repeatedly appearing in the TensorBoard logs:
W0303 12:02:32.939834 139855689737984 security_validator.py:51] In 3.0, this warning will become an error
X-Content-Type-Options is required to be "nosniff"
Sometimes (this rarely happens), when opening this json file in chrome://tracing, we can see that "stream 7" has 2 lines and there is a small triangle in front of it.
I found it is caused by an extremely thin kernel with "dur" 0. This kernel overlaps with another event, so 2 lines are shown in stream 7.
It's worth further analysis of why 2 kernels overlap in the same stream.
Hi,
A key feature we are looking for, but which is currently missing, is showing scopes for operations (like the TensorFlow profiler). Example:
empty_
vs.
SequentialModel/layers[2]/Attention/empty_
We want the 2nd option. (With the 2nd option, getting combined statistics for all occurrences of empty_ could be done by grouping.)
According to the PyTorch API, using with_stack=True only records file and line numbers. Unfortunately, file and line numbers are still confusing, since we can have both
SequentialModel/layers[2]/Attention/empty_
SequentialModel/layers[3]/SomeModelWithAttention/Attention/empty_
which would point to the same place in the code.
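There is no built-in scope support we know of, but as a hedged sketch (the helper name is ours), forward hooks can push a record_function range per submodule so operator occurrences group under a module path:

import torch
import torch.nn as nn
from torch.autograd.profiler import record_function

def add_module_scopes(model):
    # Enter a record_function range named after each submodule on forward
    # entry, and exit it afterwards, emulating TF-style scopes.
    for name, module in model.named_modules():
        if not name:
            continue  # skip the root module
        def pre_hook(mod, inputs, _name=name):
            mod._scope = record_function(_name)
            mod._scope.__enter__()
        def post_hook(mod, inputs, output):
            mod._scope.__exit__(None, None, None)
        module.register_forward_pre_hook(pre_hook)
        module.register_forward_hook(post_hook)

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU())
add_module_scopes(model)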
Our code is structured in such a way that inside the main training loop we will sometimes run validation as well. Ideally, it should be logged separately from training. Is there any functionality to support this?
E.g., similar to what this issue is asking for: #86,
but ideally we should be able to spawn a second instance of the profiler (or provide some metadata saying that the current steps are not for training but for validation).
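We are not aware of second-instance support, but a hedged workaround sketch: label the validation phase with record_function inside the single profiler run, then filter or group on that name afterwards (the loop structure below is illustrative).

import torch
from torch.autograd.profiler import record_function

with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU]) as p:
    for step in range(10):
        ...  # training step
        if step % 5 == 4:
            with record_function("validation"):
                ...  # validation pass
        p.step()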
Add support for saving the generated Chrome trace files to S3 URLs. This works for the TensorBoard SummaryWriter but is not supported for the Profiler traces.
Current Behavior
on_trace_ready=torch.profiler.tensorboard_trace_handler('s3://tb-demo/pytorch/')
Creates local files like:
(base) ubuntu@ip-172-31-22-142:~$ ls
s3\:/tb-demo/pytorch/ip-172-31-22-142_14545.1618682202565.pt.trace.json
s3:/tb-demo/pytorch/ip-172-31-22-142_14545.1618682202565.pt.trace.json
Expected Behavior
The chrome trace files should be saved in the S3 bucket
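Until native support exists, a hedged workaround sketch (the helper is ours; it assumes boto3 is installed and AWS credentials are configured): export the trace locally, then upload it.

import os
import socket
import time

import boto3

def s3_trace_handler(bucket, prefix):
    s3 = boto3.client('s3')
    def handler(prof):
        # Mirror tensorboard_trace_handler's naming, but upload to S3.
        file_name = '{}_{}.{}.pt.trace.json'.format(
            socket.gethostname(), os.getpid(), int(time.time() * 1000))
        local_path = os.path.join('/tmp', file_name)
        prof.export_chrome_trace(local_path)
        s3.upload_file(local_path, bucket, '{}/{}'.format(prefix, file_name))
    return handler

# usage: on_trace_ready=s3_trace_handler('tb-demo', 'pytorch')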
This is sort of nitpicking. According to the README of libkineto and its CMakeLists.txt, if CUDA_SOURCE_DIR is not set, libkineto will fail to build.
kineto/libkineto/CMakeLists.txt, lines 80 to 83 at 3c77248
However, the build script of PyTorch uses CUDA_HOME.
Why did you choose CUDA_SOURCE_DIR instead of CUDA_HOME? Is there any reason for this?
A colleague of ours got a "Memcpy" duration of 0 when profiling with kineto. That makes the "memory bandwidth (GB/s)" value inf, and then the file can't be opened by chrome://tracing, and our TensorBoard plugin's json.load fails to load it.
Since inf is neither a string nor a number in JSON, the file can't be parsed as valid JSON.
A solution is to add a check for whether "dur" is 0 and, if so, write the string "inf" as the value instead.
The bug case:
{
  "schemaVersion": 1,
  "traceEvents": [
    {
      "ph": "X", "cat": "Memcpy",
      "name": "Memcpy HtoD (Pageable -> Device)", "pid": 0, "tid": "stream 7",
      "ts": 1614800519473220, "dur": 0,
      "args": {
        "device": 0, "context": 1,
        "stream": 7, "correlation": 20002, "external id": 3981,
        "bytes": 200, "memory bandwidth (GB/s)": inf
      }
    }
  ]
}
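Until the writer is fixed, a hedged consumer-side sketch: quote the bare inf tokens so the file parses ("trace.json" is a placeholder path).

import json
import re

with open('trace.json') as f:
    text = f.read()
# Bare `inf` is invalid JSON; replace it with the string "inf" before parsing.
events = json.loads(re.sub(r':\s*inf\b', ': "inf"', text))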
VS Code opens TensorBoard but it hangs forever (displaying a spinner). The same happens in the browser but is resolved by refreshing the page. The cause appears to be a race between TensorBoard loading/finding the trace files and the web page opening. Since VS Code loads the frame at the same time as starting TensorBoard, and there is no refresh for the frame in VS Code, there's no easy way for the user to get the page to load.
When I train a resnet50 model with a big batch size such as 128 or 256, the profiler dumps the following events into the chrome tracing file:
{
  "ph": "X", "cat": "Kernel",
  "name": "volta_sgemm_64x64_nt", "pid": 0, "tid": "stream 7",
  "ts": 0, "dur": 0,
  "args": {
    "queued": 0, "device": 0, "context": 1,
    "stream": 7, "correlation": 106123, "external id": 34735,
    "registers per thread": 126,
    "shared memory": 8192,
    "warps per SM": 14.4,
    "grid": [4, 4, 36],
    "block": [64, 1, 1]
  }
},
{
  "ph": "f", "id": 106123, "pid": 0, "tid": "stream 7", "ts": 0,
  "cat": "async", "name": "launch", "bp": "e"
},
{
  "ph": "X", "cat": "Runtime",
  "name": "cudaLaunchKernel", "pid": 29463, "tid": "3239786240",
  "ts": 1615207264505000, "dur": 6,
  "args": {
    "cbid": 211, "correlation": 106123,
    "external id": 34735, "external ts": 1615207264504915
  }
},
{
  "ph": "s", "id": 106123, "pid": 29463, "tid": 3239786240, "ts": 1615207264505000,
  "cat": "async", "name": "launch"
},
You can see that "ts" is 0 and "dur" is 0 in the above kernel, and there are more than a thousand of these events in the file.
The model code I used is the plugin's resnet50 example, with its "batch_size" changed from 32 to 128.
The GPU: NVIDIA V100.
The torch whl: https://download.pytorch.org/whl/nightly/cu111/torch-1.9.0.dev20210305%2Bcu111-cp38-cp38-linux_x86_64.whl
The torchvision whl: https://download.pytorch.org/whl/nightly/cu111/torchvision-0.9.0.dev20210305%2Bcu111-cp38-cp38-linux_x86_64.whl
Because these kernels with a ts of 0 are launched by runtime calls near the end of profiling, I guess it is because these kernels execute later than "profiler stop". This is my snapshot:
You can see that the kernels which should execute after "profiler stop" are not painted here (they are painted at time "0"). And the last operators don't have related kernels, even though they really did launch kernels according to the chrome tracing file.
It is also a bad experience: the user will be confused to see the following in this case (ts=0 makes these events start at time 0, so all other events are shown many years later).
BTW, the expected behavior is to correctly dump the kernels launched between the profiler's start and stop, rather than removing these kernels from the file, because we need the operators' kernel times to be correct.
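As a display-only stopgap (per the above, the real fix is to dump these kernels correctly, not drop them), a hedged sketch that filters the zero-timestamp events out of a copy of the trace ("trace.json" is a placeholder path):

import json

with open('trace.json') as f:
    trace = json.load(f)
trace['traceEvents'] = [
    e for e in trace['traceEvents']
    if e.get('ph') == 'M' or e.get('ts', 0) != 0  # keep metadata events
]
with open('trace.filtered.json', 'w') as f:
    json.dump(trace, f)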
/usr/bin/c++ -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DMAGMA_V2 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTH_BLAS_MKL -DUSE_EXTERNAL_MZCRC -D_FILE_OFFSET_BITS=64 -I../cmake/../third_party/benchmark/include -Icaffe2/contrib/aten -I../third_party/onnx -Ithird_party/onnx -I../third_party/foxi -Ithird_party/foxi -I../third_party/kineto/libkineto/include -I../third_party/kineto/libkineto/src -I../third_party/fmt/include -I/usr/local/cuda/extras/CUPTI/include -I/usr/local/cuda/include -isystem third_party/gloo -isystem ../cmake/../third_party/gloo -isystem ../cmake/../third_party/googletest/googlemock/include -isystem ../cmake/../third_party/googletest/googletest/include -isystem ../third_party/protobuf/src -isystem /home/chester/miniconda3/envs/pytorch-build-py37/include -isystem ../third_party/gemmlowp -isystem ../third_party/neon2sse -isystem ../third_party/XNNPACK/include -isystem ../third_party -isystem ../cmake/../third_party/eigen -isystem /home/chester/miniconda3/envs/pytorch-build-py37/include/python3.7m -isystem /home/chester/miniconda3/envs/pytorch-build-py37/lib/python3.7/site-packages/numpy/core/include -isystem ../cmake/../third_party/pybind11/include -isystem /usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent -isystem /usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent/include -isystem /usr/lib/openmpi/include -isystem /usr/lib/openmpi/include/openmpi -isystem ../cmake/../third_party/cub -isystem ../third_party/ideep/mkl-dnn/include -isystem ../third_party/ideep/include -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -O3 -DNDEBUG -DNDEBUG -fPIC -fvisibility=hidden -DCAFFE2_USE_GLOO -DCUDA_HAS_FP16=1 -DHAVE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD -DKINETO_NAMESPACE=libkineto -std=gnu++14 -DHAS_CUPTI -std=c++14 -MD -MT third_party/kineto/libkineto/CMakeFiles/kineto_base.dir/src/cupti_strings.cpp.o -MF third_party/kineto/libkineto/CMakeFiles/kineto_base.dir/src/cupti_strings.cpp.o.d -o third_party/kineto/libkineto/CMakeFiles/kineto_base.dir/src/cupti_strings.cpp.o -c ../third_party/kineto/libkineto/src/cupti_strings.cpp
../third_party/kineto/libkineto/src/cupti_strings.cpp: In function ‘const char* libkineto::runtimeCbidName(CUpti_CallbackId)’:
../third_party/kineto/libkineto/src/cupti_strings.cpp:478:105: error: expected ‘,’ before ‘)’ token
static_assert(CUPTI_RUNTIME_TRACE_CBID_SIZE < (sizeof(runtimeCbidNames) / sizeof(runtimeCbidNames[0])));
^
../third_party/kineto/libkineto/src/cupti_strings.cpp:478:105: error: expected string-literal before ‘)’ token
(The single-argument form of static_assert is a C++17 feature; this file is compiled with -std=c++14, as the command line above shows, so a message argument is required here.)
Collecting environment information...
PyTorch version: 1.9.0a0+gitebfa927
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 16.04.7 LTS (x86_64)
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Clang version: Could not collect
CMake version: version 3.18.2
Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080
GPU 1: GeForce GTX 1080
Nvidia driver version: 440.33.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.1.3
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A