
NVBit's Introduction

Announcement

We are anticipating a new major release of NVBit in the near future; current bugs will be deferred until after that release.


NVBit is released as an artifact via GitHub and can be downloaded at: https://github.com/NVlabs/NVBit/releases

A paper describing NVBit was published at MICRO 2019 and can be found at: https://github.com/NVlabs/NVBit/releases/download/v1.0/MICRO_19_NVBit.pdf

For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing

NVBit (NVidia Binary Instrumentation Tool)

NVIDIA Corporation

NVBit is covered by the same End User License Agreement as that of the NVIDIA CUDA Toolkit. By using NVBit you agree to the End User License Agreement described in the EULA.txt file.

NVBit is not part of the official CUDA toolkit, but instead is a research prototype from the Architecture Research Group at NVIDIA and as such is provided as-is with no guarantee of support.

Introduction

NVBit (NVidia Binary Instrumentation Tool) is a research prototype of a dynamic binary instrumentation library for NVIDIA GPUs.

NVBit provides a set of simple APIs that enable writing a variety of instrumentation tools. Examples of instrumentation tools include dynamic instruction counters, instruction tracers, memory reference tracers, profiling tools, etc.

NVBit allows writing instrumentation tools (which we call NVBit tools) that can inspect and modify the assembly code (SASS) of a GPU application without requiring recompilation; the instrumentation is therefore dynamic. NVBit lets instrumentation tools inspect the SASS instructions of each function (__global__ or __device__) as it is loaded into the GPU for the first time. During this phase it is possible to inject one or more instrumentation calls to arbitrary device functions before (or after) a SASS instruction. It is also possible to remove SASS instructions, although in this case NVBit does not guarantee that the application will continue to work correctly.

NVBit tries to keep overhead as low as possible, although any injected instrumentation function has an associated cost due to saving and restoring application state before and after jumping to/from it.

Because NVBit does not require application source code, any pre-compiled GPU application should work regardless of which compiler (or version) was used (e.g. nvcc, pgicc, etc.).
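Because the injection happens at load time, using a prebuilt NVBit tool is just a matter of preloading it into an unmodified binary. A sketch of the workflow (paths and the SM version are illustrative; see the tool Makefiles in the release for the actual variable names):

```shell
# Build one of the example tools for your GPU's compute capability
# (ARCH is the Makefile variable used by the NVBit releases; adjust as needed).
cd nvbit_release/tools/instr_count
make ARCH=70

# Preload the tool into any pre-compiled CUDA application.
LD_PRELOAD=./instr_count.so /path/to/your_app
```

No recompilation of the target application is needed; the tool intercepts kernel loads and rewrites the SASS on the fly.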

Requirements

  • SM compute capability: >= 3.5 && <= 8.6
  • Host CPU: x86_64, ppc64le, arm64
  • OS: Linux
  • GCC version: >= 5.3.0
  • CUDA version: >= 8.0 && <= 11.x
  • CUDA driver version: <= 495.xx
  • nvcc version for tool compilation: >= 10.2

The ARM64 version is tested on Jetson TX2 and Jetson Nano with JetPack 4.4.

NVBit's People

Contributors

dnellans, ovilla, x-y-z


NVBit's Issues

Can constant value be read from nvbit?

Hi,

I see that NVBit can read and write registers, but I wonder whether we can also read global or constant memory inside the injected function?

Thx,
Lei

Segmentation fault

Hi,

Running the instr_count tool with
LD_PRELOAD=<path to nvbit tool>,/usr/lib/x86_64-linux-gnu/libstdc++.so.6 <your app> gives a segmentation fault.

Instruction 'vote' without '.sync' is not supported

I have changed the arch number to 70 in tools/*/Makefile and here is the error I see:

ptxas warning : For profile sm_70 adjusting per thread register count of 16 to lower bound of 24
ptxas /tmp/tmpxft_00001c5e_00000000-5_opcode_hist.ptx, line 6401; error   : Instruction 'vote' without '.sync' is not supported on .target sm_70 and higher from PTX ISA version 6.4
ptxas /tmp/tmpxft_00001c5e_00000000-5_opcode_hist.ptx, line 6408; error   : Instruction 'vote' without '.sync' is not supported on .target sm_70 and higher from PTX ISA version 6.4

How to get the addresses of kernel arguments?

It seems that NVBit is able to get the addresses of launched kernel arguments.


However, I couldn't find a way to know the number of kernel arguments in order to access void** kernelParams;

For example, when I access kernelParams from kernelParams[0] to kernelParams[N - 1], what is N (where to retrieve N)?

While information regarding kernel parameters can be found in the CUDA driver API documentation (screenshot omitted):

The description just says it can be directly retrieved from the kernel's image (what is the meaning of the kernel's image here?)

Instrumenting a specific kernel

Thanks for making this really useful tool public. I want to know if it is possible to instrument a specific kernel of interest rather than a whole application, or whether there is some mechanism by which metrics can be collected per kernel when an application has multiple kernels. Thank you.

Record instruction operands

Hi,

I am using the record_reg_vals example to trace the values of the operands of each instruction. However, I see that in some cases not all operands are recorded, as in the following example where only R2 and R4 are recorded:

IMAD R2, R4, c[0x0][0x17c], R2

How can I modify the record_reg_val function to store all of an instruction's operands, including constant memory operands as in the example above?

Thanks,
Fernando

(Core Dumped) 'std::out_of_range' issue

Hello,
I'm trying to run the opcode_hist tool using the "hybrid-sort" app from Rodinia 3.1, and this error shows up after analyzing a single kernel:

        terminate called after throwing an instance of 'std::out_of_range'
        what():  _Map_base::at

I'm running a V100, CUDA 11.1, NVBit 1.5.1, cuDNN 8.0.5 (though this program doesn't use any fancy libraries), GCC 7.5.0.
I'm using these arguments:

LD_PRELOAD=../../../../tracer_nvbit/nvbit_release/tools/opcode_hist/opcode_hist.so hybridsort-rodinia-3.1 r

It finishes fine using instr_count. Several other applications from the Parboil suite show the same issue.

How to read unified register number?

Hello,

When I took a look at the operand_t union, I did not see a struct for the unified register number. I see "struct reg", but I think this is used for regular (vector) registers. Is it also used for the unified register number?

   union {
        struct {
            uint64_t value;
        } imm_uint64;

        struct {
            double value;
        } imm_double;

        struct {
            int num;
            /* register properties .XXX */
            char prop[MAX_CHARS];
        } reg;

        struct {
            int num;
        } pred;

        struct {
            int id;
            bool has_imm_offset;
            int imm_offset;
            bool has_reg_offset;
            int reg_offset;
        } cbank;

        struct {
            bool has_ra;
            int ra_num;
            regModifierType ra_mod;
            bool has_ur;
            int ur_num;
            bool has_imm;
            int imm;
        } mref;

        struct {
            char array[MAX_CHARS];
        } generic;

    } u;
} operand_t;

trt_ampere_h1688cudnn_128x128_ldg8_relu_exp_small_nhwc_linkable_tn_v1 trace error

Hi,

When I use NVBit to trace a program containing a TensorRT kernel, it reports an illegal memory access for sample tools like instr_count or instr_count_bb.

The kernel name is trt_ampere_h1688cudnn_128x128_ldg8_relu_exp_small_nhwc_linkable_tn_v1, the TensorRT version is 7.2.1, and NVBit is the latest version.

Thx,
Lei

Issues with memory divergence example

Hi,

I was trying to implement the code for the memory divergence example shown in Listing 8 of the paper. I encountered two issues.

  1. First, the "__match_any_sync" function used in "int cnt = __popc(__match_any_sync(mask, cache_addr))" does not seem to be a valid function in the NVBit library. I am not sure how to resolve this or what to use in its place.

  2. Second, "line 29" in the example reads "atomicAdd(&uniq_lines, 1.0f / cnt);".
    I feel it should be atomicAdd(&uniq_lines, cnt) instead, based on my understanding of memory divergence. I am not sure if I am correct.

Thanks for the help.

ASSERT FAIL: sass_lib.h:1064:void SassInstr::decode(): FAIL !(opcode_end != std::string::npos)

Hello,
I'm trying to run the DeepBench benchmarks with the opcode_hist tool included in NVBit, and I've encountered this assert error:
ASSERT FAIL: sass_lib.h:1064:void SassInstr::decode(): FAIL !(opcode_end != std::string::npos)
I'm using CUDA 11.0, the latest NVBit release (1.5), and I'm using a V100.
These are the utilized application parameters: ./conv_bench train half 7 7 832 16 128 5 5 2 2 1 1
Nsight-compute reveals the offending kernel is called "volta_hcudnn_128x128_stridedB_splitK_small_nn_v1".
The program has no issues finishing without NVBit, or, weirdly enough, when using the instr_count tool.

Turing SM_75 instrumentation

Hello,

Is there any plan to support the Turing architecture with SM_75 capability?
Right now, when I try to instrument a Turing GPU, I get this error:
NVBit ERROR: SM 7.5 name GeForce RTX 2060 not supported

Thanks!

nvdisasm not found on PATH

I get the following error message when I try to instrument a PyTorch program with mem_printf.so.

~$ LD_PRELOAD=/home/username/nvbit_release/tools/mem_printf/mem_printf.so python3 two_layer_net_tensor.py 
------------- NVBit (NVidia Binary Instrumentation Tool v1.4) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
            NVDISASM = nvdisasm - override default nvdisasm found in PATH
            NOBANNER = 0 - if set, does not print this banner
---------------------------------------------------------------------------------
         INSTR_BEGIN = 0 - Beginning of the instruction interval where to apply instrumentation
           INSTR_END = 4294967295 - End of the instruction interval where to apply instrumentation
        TOOL_VERBOSE = 0 - Enable verbosity inside the tool
----------------------------------------------------------------------------------------------------
ERROR: nvdisasm not found on PATH!!!

But it is actually on PATH.

~$ which nvdisasm 
/usr/local/cuda-10.2/bin/nvdisasm

Using CUDA 9.2 with NVBit 1.1 also leads to a similar error message.

Thanks!

Instructions after Exit termination

Hello,

As far as I understand, the SASS instruction "EXIT" means the thread has finished the kernel execution.
In the Accel-Sim tracer, we rely on this fact: when we reach an EXIT instruction with a full active predicate mask, we terminate the execution and assume the thread has finished. This also matches the fact that no instructions are traced after it.
So, for example:

PC Active&Predicate Inst
02b0 00000000 EXIT => This means, do not exit yet, as the active mask is zeros.
4250 ffffffff EXIT => This means, do exit, as the active mask is full. And, we notice no instructions are coming after this.

We have traced hundreds of workloads, our assumption looked correct, and we did not face any issues. However, when we traced the NVIDIA cuDF library for data analytics, we found a scenario that does not match our assumption.
The trace output is:

4230 ffffffff 0 ISETP.GE.AND 2 R5 R4 0 
4240 ffffffff 0 ISETP.GE.AND 2 R7 R2 0 
4250 ffffffff 0 EXIT 0 0 
4260 ffffffff 1 R20 IMAD.MOV.U32 2 R255 R255 0 
4270 ffffffff 0 ISETP.NE.AND 2 R31 R255 0 
4280 ffffffff 0 BSSY 0 0 
......

As you can see, the EXIT at 0x4250 has a full active mask and predicate, yet instructions follow it and the warp has not finished.

The CUDA kernel that is traced can be found here:
https://github.com/rapidsai/cudf/blob/c69b6f82adaa821c5201055ce3bd1672978b5704/cpp/src/io/parquet/page_data.cu#L1650

The Nvbit-based Accel-sim tracer takes into account the predicate mask as shown here:
https://github.com/accel-sim/accel-sim-framework/blob/4c2bf09a79d6b57bb10fe1898700930a5dd5531f/util/tracer_nvbit/tracer_tool/tracer_tool.cu#L529

Could you help with this, please? Is our assumption about the EXIT instruction correct?

Thanks!

undefined symbol

I got this error when running NVBit according to the instructions in the README.

./test-apps/vectoradd/vectoradd: symbol lookup error: ./tools/instr_count/instr_count.so: undefined symbol: _ZTVNSt7__cxx1118basic_stringstreamIcSt11char_traitsIcESaIcEEE

Any suggestions?

Cannot find the instrumentation function

I'm writing a simple program to test the functionality of NVBit.

Here is a simplified version of my code :

extern "C" __device__ __noinline__ void my_function() {
    // do something here
}

int main() {
    // load a cubin file which contains the function that need to be instrumented.
    // ...
    CUfunction myFunc;
    cuModuleGetFunction(&myFunc, myModule, myFuncName);
    auto instrs = nvbit_get_instrs(cuContext, myFunc);
    nvbit_insert_call(instrs[18], "my_function", IPOINT_BEFORE);
    nvbit_enable_instrumented(cuContext, myFunc, true);
    // ...
}

The error is:

ASSERT FAIL: function.cpp:764:void Function::gen_new_code(std::unordered_map<std::__cxx11::basic_string<char>, Function*>&): FAIL !(instr_func_map.find(c.instr_func_name) != instr_func_map.end()) MSG: instrumentation function merged_kernel0 not found in binary!

I don't understand how nvbit_insert_call looks for the instrumentation function.

mem_printf

The output of mem_printf is something like this

0x0000000000000740 - opcode_id 5

I checked the source code. I would like to know whether the address is virtual or physical, and whether it is in the global space or somewhere else.
Also, what is the corresponding mnemonic of opcode_id?
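Regarding opcode_id: in the example tools it is not a fixed SASS encoding but a tool-local index, assigned in the order opcodes are first encountered and kept in a map that the tool can print to recover the mnemonics. A minimal model of that pattern (names hypothetical):

```cpp
#include <map>
#include <string>

// Tool-local opcode numbering, modeled on the NVBit example tools:
// the first time an opcode string is seen it receives the next free id.
// Decoding an id back to a mnemonic requires dumping this map.
static std::map<std::string, int> opcode_to_id;

int get_opcode_id(const std::string& opcode) {
    auto it = opcode_to_id.find(opcode);
    if (it == opcode_to_id.end())
        it = opcode_to_id.emplace(opcode, (int)opcode_to_id.size()).first;
    return it->second;
}
```

So an id like "5" is only meaningful relative to the map built during that particular run of the tool.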

NVBIT execution error when compiled CUDA 11.1.105

I tried to compile and run NVBit 1.5.2/1.5.1 with CUDA 11.1. The compilation is successful, but when I try to launch any program (for example vectoradd in test-apps) with the instr_count.so tool, I receive the following error:

Cuda error in function '(vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n))' file 'vectoradd.cu' in line 81 : initialization error.

I am using Quadro P5000.

Problem with LDC instruction

Hello,

I cannot get the source register number for constant load instruction: LDC.64 R18, c[0x2][R12];

Below are the info about operand 1:
op[1].type = CBANK
op[1].is_neg/abs = 0/0
op[1].value[0] = 2 (2.000000e+00)
op[1].value[1] = 0 (0.000000e+00)

op[1].value[1] should give me 12 but actually gives me 0.

Am I using the tool incorrectly, or might it be a problem in NVBit?

Thanks!

Decoded instruction for async-copy

Hi
With the printDecoded() function, I see the following information for an async-copy instruction:

Instr 418 @ 0x1a20 (6688) - LDGSTS.E.LTC128B.128.ZFILL [R217], [R22.64], P0 ;
  has_guard_pred = 0
  opcode = LDGSTS.E.LTC128B.128.ZFILL/LDGSTS
  memop = GLOBAL_TO_SHARED
  load/store = 0/0

Shouldn't it be load/store = 1/1? The operation is both a load and a store.

nvbit_at_context_init_hook(): Assertion `cudaGetLastError() == cudaSuccess' failed

Hello, I am trying to run the vectoradd example in the README, but ran into the following error:

=> ./test-apps/vectoradd/vectoradd
Final sum = 100000.000000; sum/n = 1.000000 (should be ~1)

=> LD_PRELOAD=./tools/instr_count/instr_count.so ./test-apps/vectoradd/vectoradd
------------- NVBit (NVidia Binary Instrumentation Tool v1.3.1) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
            NVDISASM = nvdisasm - override default nvdisasm found in PATH
            NOBANNER = 0 - if set, does not print this banner
---------------------------------------------------------------------------------
         INSTR_BEGIN = 0 - Beginning of the instruction interval where to apply instrumentation
           INSTR_END = 4294967295 - End of the instruction interval where to apply instrumentation
      START_GRID_NUM = 0 - Beginning of the kernel gird launch interval where to apply instrumentation
        END_GRID_NUM = 4294967295 - End of the kernel launch interval where to apply instrumentation
    COUNT_WARP_LEVEL = 1 - Count warp level or thread level instructions
    EXCLUDE_PRED_OFF = 0 - Exclude predicated off instruction from count
   ACTIVE_FROM_START = 1 - Start instruction counting from start or wait for cuProfilerStart and cuProfilerStop
       MANGLED_NAMES = 1 - Print kernel names mangled or not
        TOOL_VERBOSE = 0 - Enable verbosity inside the tool
----------------------------------------------------------------------------------------------------
vectoradd: ../../core/nvbit_tool.h:82: void nvbit_at_context_init_hook(): Assertion `cudaGetLastError() == cudaSuccess' failed.
Aborted (core dumped)

Here is the system config:
CentOS Linux release 7.8.2003
cuda-10.2
cuda-9.2

=> g++ --version
g++ (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

=> which g++
/opt/rh/devtoolset-7/root/usr/bin/g++

=> ldd ./test-apps/vectoradd/vectoradd
	linux-vdso.so.1 =>  (0x00007ffde6bb1000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f16dc044000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f16dbe28000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f16dbc24000)
	libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f16db91d000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f16db61b000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f16db405000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f16db037000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f16dc24c000)

=> ldd ./tools/instr_count/instr_count.so
	linux-vdso.so.1 =>  (0x00007fff4257d000)
	libcuda.so.1 => /usr/lib64/nvidia/libcuda.so.1 (0x00007fd451819000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fd451611000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fd4513f5000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007fd4511f1000)
	libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fd450eea000)
	libm.so.6 => /lib64/libm.so.6 (0x00007fd450be8000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fd4509d2000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fd450604000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fd452ac2000)
	libnvidia-fatbinaryloader.so.396.69 => /usr/lib64/nvidia/libnvidia-fatbinaryloader.so.396.69 (0x00007fd4503b8000)

=> nvidia-smi
Sat May 30 21:47:15 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.69                 Driver Version: 396.69                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1C:00.0 Off |                    0 |
| N/A   39C    P0    70W / 300W |   4665MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
...

In nvbit_release/tools:
PATH=/usr/local/cuda-10.2/bin:$PATH make
In nvbit_release/test-apps:
PATH=/usr/local/cuda-9.2/bin:$PATH make

Can't use cuda-10.2 for vectoradd

=> ./vectoradd/vectoradd
Cuda error in function '(vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n))' file 'vectoradd.cu' in line 81 : CUDA driver version is insufficient for CUDA runtime version.

GDB

=> LD_PRELOAD=./tools/instr_count/instr_count.so gdb ./test-apps/vectoradd/vectoradd
------------- NVBit (NVidia Binary Instrumentation Tool v1.3.1) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
            NVDISASM = nvdisasm - override default nvdisasm found in PATH
            NOBANNER = 0 - if set, does not print this banner
---------------------------------------------------------------------------------
         INSTR_BEGIN = 0 - Beginning of the instruction interval where to apply instrumentation
           INSTR_END = 4294967295 - End of the instruction interval where to apply instrumentation
      START_GRID_NUM = 0 - Beginning of the kernel gird launch interval where to apply instrumentation
        END_GRID_NUM = 4294967295 - End of the kernel launch interval where to apply instrumentation
    COUNT_WARP_LEVEL = 1 - Count warp level or thread level instructions
    EXCLUDE_PRED_OFF = 0 - Exclude predicated off instruction from count
   ACTIVE_FROM_START = 1 - Start instruction counting from start or wait for cuProfilerStart and cuProfilerStop
       MANGLED_NAMES = 1 - Print kernel names mangled or not
        TOOL_VERBOSE = 0 - Enable verbosity inside the tool
----------------------------------------------------------------------------------------------------
GNU gdb (GDB) Red Hat Enterprise Linux 8.1.90.20180727-44.el7
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./test-apps/vectoradd/vectoradd...(no debugging symbols found)...done.
(gdb) set environment LD_PRELOAD ./tools/instr_count/instr_count.so
(gdb) start
Temporary breakpoint 1 at 0x403070
Starting program: nvbit_release/test-apps/vectoradd/vectoradd
------------- NVBit (NVidia Binary Instrumentation Tool v1.3.1) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
            NVDISASM = nvdisasm - override default nvdisasm found in PATH
            NOBANNER = 0 - if set, does not print this banner
---------------------------------------------------------------------------------
         INSTR_BEGIN = 0 - Beginning of the instruction interval where to apply instrumentation
           INSTR_END = 4294967295 - End of the instruction interval where to apply instrumentation
      START_GRID_NUM = 0 - Beginning of the kernel gird launch interval where to apply instrumentation
        END_GRID_NUM = 4294967295 - End of the kernel launch interval where to apply instrumentation
    COUNT_WARP_LEVEL = 1 - Count warp level or thread level instructions
    EXCLUDE_PRED_OFF = 0 - Exclude predicated off instruction from count
   ACTIVE_FROM_START = 1 - Start instruction counting from start or wait for cuProfilerStart and cuProfilerStop
       MANGLED_NAMES = 1 - Print kernel names mangled or not
        TOOL_VERBOSE = 0 - Enable verbosity inside the tool
----------------------------------------------------------------------------------------------------
warning: Unable to open "librpm.so.3" (librpm.so.3: cannot open shared object file: No such file or directory), missing debuginfos notifications will not be displayed
Missing separate debuginfo for /lib64/ld-linux-x86-64.so.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/27/ffd1fbc69569c776e666474eed723395e6d727.debug
Missing separate debuginfo for /lib64/librt.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/cc/d4be566dd5a8fc7fa62b224c14b698f51b0d0d.debug
Missing separate debuginfo for /lib64/libpthread.so.0
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/2b/482b3bae79def4e5bc9791bc6bbdae0e93e359.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /lib64/libdl.so.2
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/f2/c36986e11a291a0d4bcb3a81632b24ae2359ea.debug
Missing separate debuginfo for /lib64/libstdc++.so.6
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/e2/fb6e9c483d89e8e96d73c7ccf3e3a91e91bb81.debug
Missing separate debuginfo for /lib64/libm.so.6
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/08/5d924f5d23b9f15a8ad28b7231ee93c09e13f1.debug
Missing separate debuginfo for /lib64/libgcc_s.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/da/c0179f4555aefec9e97476201802fd20c03ec5.debug
Missing separate debuginfo for /lib64/libc.so.6
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/d7/8066a9c36f5fd63e2f6ac851ae3515c4c9792a.debug
------------- NVBit (NVidia Binary Instrumentation Tool v1.3.1) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
            NVDISASM = nvdisasm - override default nvdisasm found in PATH
            NOBANNER = 0 - if set, does not print this banner
---------------------------------------------------------------------------------
         INSTR_BEGIN = 0 - Beginning of the instruction interval where to apply instrumentation
           INSTR_END = 4294967295 - End of the instruction interval where to apply instrumentation
      START_GRID_NUM = 0 - Beginning of the kernel gird launch interval where to apply instrumentation
        END_GRID_NUM = 4294967295 - End of the kernel launch interval where to apply instrumentation
    COUNT_WARP_LEVEL = 1 - Count warp level or thread level instructions
    EXCLUDE_PRED_OFF = 0 - Exclude predicated off instruction from count
   ACTIVE_FROM_START = 1 - Start instruction counting from start or wait for cuProfilerStart and cuProfilerStop
       MANGLED_NAMES = 1 - Print kernel names mangled or not
        TOOL_VERBOSE = 0 - Enable verbosity inside the tool
----------------------------------------------------------------------------------------------------

Temporary breakpoint 1, 0x0000000000403070 in main ()
(gdb) c
Continuing.
[New Thread 0x7fffef5a2700 (LWP 455460)]
[New Thread 0x7fffeeda1700 (LWP 455465)]
vectoradd: ../../core/nvbit_tool.h:82: void nvbit_at_context_init_hook(): Assertion `cudaGetLastError() == cudaSuccess' failed.

Thread 1 "vectoradd" received signal SIGABRT, Aborted.
0x00007ffff68d2387 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff68d2387 in raise () from /lib64/libc.so.6
#1  0x00007ffff68d3a78 in abort () from /lib64/libc.so.6
#2  0x00007ffff68cb1a6 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007ffff68cb252 in __assert_fail () from /lib64/libc.so.6
#4  0x00007ffff7ada768 in nvbit_at_context_init_hook () from ./tools/instr_count/instr_count.so
#5  0x00007ffff7ae4a25 in Nvbit::create_ctx(CUctx_st*) () from ./tools/instr_count/instr_count.so
#6  0x00007ffff7ae9b4c in nvbitToolsCallbackFunc(void*, CUtools_cb_domain_enum, unsigned int, void const*) ()
   from ./tools/instr_count/instr_count.so
#7  0x00007ffff5c55138 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#8  0x00007ffff5ad0c0f in ?? () from /usr/lib64/nvidia/libcuda.so.1
#9  0x00007ffff5ad225f in ?? () from /usr/lib64/nvidia/libcuda.so.1
#10 0x00007ffff5a0b03c in ?? () from /usr/lib64/nvidia/libcuda.so.1
#11 0x00007ffff5b44ea6 in cuDevicePrimaryCtxRetain () from /usr/lib64/nvidia/libcuda.so.1
#12 0x000000000042e590 in cudart::contextStateManager::initPrimaryContext(cudart::device*) ()
#13 0x000000000042edfd in cudart::contextStateManager::initDriverContext() ()
#14 0x000000000042feec in cudart::contextStateManager::getRuntimeContextState(cudart::contextState**, bool) ()
#15 0x000000000042379c in cudart::doLazyInitContextState() ()
#16 0x0000000000407ca8 in cudart::cudaApiMalloc(void**, unsigned long) ()
#17 0x000000000044291c in cudaMalloc ()
#18 0x00000000004030c8 in main ()

NVBit can't test the cuda+c++ program?

Hi dear author,
It's truly an honor to write to you. This article [1] says "DynamoRio are supported on CPUs, NVBit can support GPU architectures", but NVBit can't instrument the CUDA+C++ program I am building [2]. Does this mean NVBit can't work on such GPU programs?

[1] https://research.nvidia.com/publication/2020-09_NVBit%3A-A-Dynamic
[2] https://github.com/wanlin405/Computer-Graphics/tree/master/task14-particle/task-particle1

thank you
best regards to you
William

Instrumentation function register usage limitation

Hi,

I met this error when I tried to inject a big device function.

ASSERT FAIL: function.cpp:774:void Function::gen_new_code(std::unordered_map<std::__cxx11::basic_string<char>, Function*>&): FAIL !(nregs <= 24) MSG: instrumentation function should not use more than 24 registers!

What is the concern behind the register usage limit? The only reason I can imagine is that the instrumentation function might affect the occupancy of the original kernel.

Can we relax this restriction? Maybe replace it with a warning?

Assertion on instrument_function_if_needed() on 3080

Hi,
With nvbit-1.5.3 on 3080 (CUDA-11.2), I get an assertion error with the tracer tool.

Writing results to ./traces/kernel-100.trace
python3: tracer_tool.cu:185: void instrument_function_if_needed(CUcontext, CUfunction): Assertion `mem_oper_idx == -1' failed.
Aborted (core dumped)

The program uses Python 3 and cuDNN. Is there any way to narrow down the problem or report a bug?

Problem to use mem_trace for Optix SDK examples

Hello,

I tried to use mem_trace on OptiX SDK examples (optixHello, optixRaycasting); however, the program hangs after the optixLaunch() API, inside CUDA_SYNC_CHECK (which calls cudaDeviceSynchronize).

Then I tried changing mem_trace and found that if I comment out "channel_dev.push(&ma, sizeof(mem_access_t));" inside instrument_mem(), the program no longer hangs.

So does the NVBit ChannelDev not work for OptiX apps?

Thanks!

Is there a SASS inst with more than one memory ref?

Hello,

When I took a look at the mem_trace tool example that comes with v1.4, it seems you are handling more than one memory reference per instruction, as shown below. My question: can an instruction ever have more than one memory reference?
A second question: do you handle TEX memory references?

            if (op->type == Instr::operandType::MREF) {
                /* insert call to the instrumentation function with its
                 * arguments */
                nvbit_insert_call(instr, "instrument_mem", IPOINT_BEFORE);
                /* predicate value */
                nvbit_add_call_arg_pred_val(instr);
                /* opcode id */
                nvbit_add_call_arg_const_val32(instr, opcode_id);
                /* memory reference 64 bit address */
                nvbit_add_call_arg_mref_addr64(instr, mref_idx);
                /* add pointer to channel_dev*/
                nvbit_add_call_arg_const_val64(instr,
                                               (uint64_t)&channel_dev);
                mref_idx++;
            }

cudaErrorPeerAccessUnsupported Error with NVBit

I am facing an issue running applications with NVBit that use cudaDeviceEnablePeerAccess. NVBit tools such as instr_count give a cudaErrorPeerAccessUnsupported error. One such example is simpleP2P in the CUDA samples (cuda/samples/0_Simple/simpleP2P).

simpleP2P runs successfully without NVBit.

I am running nvbit on nvidia DGX-2 with cuda driver version 418.116.00
[screenshot: driver version]

Getting lineinfo with device code

If I add -lineinfo during nvcc compilation and disable all optimizations, would it be possible for NVBit to extract the line info for each instruction?

More details:
I am trying to write a basic code-coverage tool with NVBit.
I can generate a cubin file with nvcc and use nvdisasm --print-line-info vectorAdd.cubin to get the instruction <=> line-info correspondence, but the issue is that the instructions printed from the cubin and the instructions printed by NVBit do not always match.
If NVBit could parse the line info directly, that would be awesome.

Thank you.
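For reference, the nvdisasm route can be scripted. Below is a plain C++ sketch, independent of NVBit, that pairs each SASS line with the most recent source location taken from nvdisasm's line-info comments. It assumes those comments have the form `//## File "<path>", line <N>`, which may vary across CUDA toolkit versions; the helper name `parse_line_info` is made up for illustration.

```cpp
#include <cassert>
#include <istream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// One SASS line tagged with the most recently seen source location.
struct SassLine {
    std::string text;
    std::string file;  // empty until the first line-info comment is seen
    int line = 0;
};

// Parse `nvdisasm --print-line-info` output (hypothetical helper). Assumes
// line-info comments of the form: //## File "<path>", line <N>
std::vector<SassLine> parse_line_info(std::istream& in) {
    static const std::regex info_re(R"rx(//## File "([^"]+)", line (\d+))rx");
    std::vector<SassLine> result;
    std::string cur_file;
    int cur_line = 0;
    std::string line;
    while (std::getline(in, line)) {
        std::smatch m;
        if (std::regex_search(line, m, info_re)) {
            // Remember this location; it applies to the SASS lines below it.
            cur_file = m[1].str();
            cur_line = std::stoi(m[2].str());
        } else if (line.find_first_not_of(" \t") != std::string::npos) {
            result.push_back({line, cur_file, cur_line});
        }
    }
    return result;
}
```

Each Instr printed by NVBit could then be matched against this table by SASS text or offset, with the caveat above about the two disassemblies not always agreeing.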

full instrumentation or sampling instrumentation?

For the mem_trace tool, I found that the instrumentation overhead is large for the first execution of a kernel but negligible for subsequent executions. Does this mean that memory tracing instruments each kernel only once? (i.e., is it sampling instrumentation?)

CUBLAS abort

I'm trying to run opcode_hist on 0_Simple/matrixMulCUBLAS from the NVIDIA samples on a Tesla K40c. This sample uses sgemm from CUBLAS to perform matrix multiplication; however, I'm having trouble with it. When I try to execute the following line:

eval LD_PRELOAD=.../nvbit_release/tools/opcode_hist/opcode_hist.so ./matrixMulCUBLAS

The output is as follows:

------------- NVBit (NVidia Binary Instrumentation Tool v1.4) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
            NVDISASM = nvdisasm - override default nvdisasm found in PATH
            NOBANNER = 0 - if set, does not print this banner
---------------------------------------------------------------------------------
         INSTR_BEGIN = 0 - Beginning of the instruction interval where to apply instrumentation
           INSTR_END = 4294967295 - End of the instruction interval where to apply instrumentation
        KERNEL_BEGIN = 0 - Beginning of the kernel launch interval where to apply instrumentation
          KERNEL_END = 4294967295 - End of the kernel launch interval where to apply instrumentation
        TOOL_VERBOSE = 0 - Enable verbosity inside the tool
    COUNT_WARP_LEVEL = 1 - Count warp level or thread level instructions
    EXCLUDE_PRED_OFF = 0 - Exclude predicated off instruction from count
----------------------------------------------------------------------------------------------------
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "Tesla K40c" with compute capability 3.5

GPU Device 0: "Tesla K40c" with compute capability 3.5

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...matrixMulCUBLAS: arch/gk11x_hal.cpp:173: void set_imm_relative_control_flow(uint64_t*, int64_t): Assertion `!((((imm)&0xFF000000) != 0) && (((imm)&0xFF000000) != 0xFF000000))' failed.
Aborted (core dumped)

I'm using nvcc 10.1. Without the instrumentation, the code runs as expected.

Segmentation Fault in Nvbit::compute_max_stack_size

I'm trying to use NVBit to profile an application. I get a segmentation fault after the first call to cudaMemcpyToSymbol. Both nvbit_at_init() and nvbit_at_cuda_event() appear to be called. I also tried CUDA_INJECTION64_PATH instead of LD_PRELOAD.

LD_PRELOAD=./mem_trace.so ./pbrt --gpu --pixel 1,1 ~/Downloads/pbrt-v4-scenes/smoke-plume/plume.pbrt 
pbrt version 4 (built Oct  3 2020 at 16:49:56)
Copyright (c)1998-2020 Matt Pharr, Wenzel Jakob, and Greg Humphreys.
The source code to pbrt (but *not* the book contents) is covered by the BSD License.
See the file LICENSE.txt for the conditions of the license.
------------- NVBit (NVidia Binary Instrumentation Tool v1.4) Loaded --------------
NVBit core environment variables (mostly for nvbit-devs):
            NVDISASM = nvdisasm - override default nvdisasm found in PATH
            NOBANNER = 0 - if set, does not print this banner
---------------------------------------------------------------------------------
         INSTR_BEGIN = 0 - Beginning of the instruction interval where to apply instrumentation
           INSTR_END = 4294967295 - End of the instruction interval where to apply instrumentation
        TOOL_VERBOSE = 0 - Enable verbosity inside the tool
----------------------------------------------------------------------------------------------------
Segmentation fault (core dumped)

Here's a stack trace from cuda-gdb. It seems there's a recursive loop of sorts in compute_max_stack_size? Any ideas why this might be?

#0  0x00007ffff7ed1059 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
#1  0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
#2  0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
#3  0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
#4  0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
#5  0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
#6  0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
#7  0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
#8  0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
#9  0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*)
    () from ./mem_printf.so
[snip]
#174500 0x00007ffff7ed1083 in Nvbit::compute_max_stack_size(Function*) () from ./mem_printf.so
#174501 0x00007ffff7edd81d in Nvbit::module_loaded(CUctx_st*, void const*, unsigned long, CUmod_st*) () from ./mem_printf.so
#174502 0x00007ffff7eddf1a in nvbitToolsCallbackFunc(void*, CUtools_cb_domain_enum, unsigned int, void const*) () from ./mem_printf.so
#174503 0x00007ffff6d1aef3 in ?? () from /usr/lib/libcuda.so.1
#174504 0x00007ffff6b61bcd in ?? () from /usr/lib/libcuda.so.1
#174505 0x00007ffff6a96698 in ?? () from /usr/lib/libcuda.so.1
#174506 0x00007ffff6a96dfc in ?? () from /usr/lib/libcuda.so.1
#174507 0x0000555555b46c04 in cudart::contextState::loadCubin(bool*, cudart::globalModule*) ()
#174508 0x0000555555b3c34e in cudart::globalModule::loadIntoContext(cudart::contextState*) ()
#174509 0x0000555555b4d324 in cudart::contextState::applyChanges() ()
#174510 0x0000555555b51aea in cudart::contextStateManager::initRuntimeContextState_nonreentrant(cudart::contextState**) ()
#174511 0x0000555555b51d84 in cudart::contextStateManager::getRuntimeContextState(cudart::contextState**, bool) ()
#174512 0x0000555555b32140 in cudart::cudaApiMemcpyToSymbol(void const*, void const*, unsigned long, unsigned long, cudaMemcpyKind) ()
#174513 0x0000555555b6dfa0 in cudaMemcpyToSymbol ()
#174514 0x00005555558babe0 in pbrt::InitLogging(pbrt::LogConfig, bool) ()
#174515 0x00005555557efb57 in pbrt::InitPBRT(pbrt::PBRTOptions const&) ()
#174516 0x000055555566ffe3 in main ()

System Configuration:
nvcc: release 11.0, V11.0.194
Driver Version: 450.57
CUDA Version: 11.0

I confirmed everything works properly with the vectoradd example, so I don't think it's an issue with my system configuration. Does anyone have any insight into what's going on here?

Get line numbers

Is it possible to obtain the line number of the CUDA code that corresponds to the SASS code if the binary was compiled using -lineinfo?

Address value changing

For the below code snippet, when instrumented using the mem_printf example, I notice some odd behaviour.

__device__ int dummy;
__global__ void test(int *tests)
{
    dummy = tests[0];
    tests[0] = 5;
}

int main()
{
    int *tests;
    cudaMalloc((void **)&tests, sizeof(int));
    test<<<1,1>>>(tests);
    cudaDeviceSynchronize();
}

The output I get is:
OPCODE MOV MAPS TO ID 0
OPCODE LDG.E.SYS MAPS TO ID 1
OPCODE STG.E.SYS MAPS TO ID 2
0x00007f3d02e00000 - opcode_id 1
0x00007f3d06600300 - opcode_id 2
0x00007f3d22000200 - opcode_id 2

The order of operations is as expected: first we load tests[0], then we store into dummy, then we store into tests[0].

However, notice the addresses these operations are mapped to.
The first and third operation should have the same address, the address of tests[0], but they differ (0x00007f3d02e00000 and 0x00007f3d22000200).

I tested this for other programs and observed the same behavior; the first operation's address differs from subsequent operations for the same memory location.

Disable Instrumentation for Turing GPUs

We have an instrumentation tool and would like to run instrumentation only on non-Turing GPUs. We added this function to our tool:

bool is_gpu_turing() {
    int device_id;
    cudaDeviceProp prop;
    cudaGetDevice(&device_id);

    cudaGetDeviceProperties(&prop, device_id);
    int compute_capability = prop.major * 10 + prop.minor;
    if (compute_capability >= 75) {
        return true;
    }
    return false;
}

We added this function at the beginning of nvbit_at_init(), nvbit_at_function_first_load(), nvbit_at_cuda_event(), and nvbit_at_term() to skip instrumentation if the GPU is Turing.

However, we still got this failure:

NVBit ERROR: SM 7.5 name GeForce RTX 2080 Ti not supported

Is there a way to make this work inside the tool? Or should we try a different approach (e.g., disabling LD_PRELOAD in the shell script when a Turing GPU is detected)?
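For what it's worth, since the error appears to come from the NVBit core itself (it fires even with the check in every callback), deciding before the process starts, i.e. the LD_PRELOAD-in-a-script idea, seems the more promising route. A minimal sketch, assuming the launcher can obtain the compute capability as a "major.minor" string (e.g. from `nvidia-smi --query-gpu=compute_cap` on newer drivers); both helper names are made up:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: parse a "major.minor" compute-capability string into
// 10*major + minor (e.g. "7.5" -> 75). Returns -1 on malformed input.
int parse_compute_cap(const std::string& cap) {
    auto dot = cap.find('.');
    if (dot == std::string::npos) return -1;
    try {
        int major = std::stoi(cap.substr(0, dot));
        int minor = std::stoi(cap.substr(dot + 1));
        return major * 10 + minor;
    } catch (...) {
        return -1;
    }
}

// Turing is SM 7.5.
bool is_turing(const std::string& cap) {
    return parse_compute_cap(cap) == 75;
}
```

A wrapper script could then run the application plain when is_turing() is true and with LD_PRELOAD=./tool.so otherwise (sketch, untested).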

Assertion

Hi,
I want to debug my own tool, so I added "-G" in my makefile and ran it with vectoradd,
but I get this assertion:

vectoradd: nvbit_imp.cpp:702: void Nvbit::func_loading(CUcontext, CUfunction): Assertion `sizeof(_text_nvbit_nvbit_write_reg75) <= function->nbytes' failed.

I want to know the meaning of this assertion and how to fix the problem.

Thanks.

Occupancy calculation using NVBit

Is there any way to call cudaOccupancyMaxActiveBlocksPerMultiprocessor within an NVBit tool on kernels being launched? I believe I can get the block size and dynamic shmem size, but I also need to get the kernel function to pass to the call. Is it possible to get that?

resource limitation for the injection function

Thanks for open-sourcing such great work!

I have one question regarding resource usage in the injection function.
As the inst_count sample shows, it is suggested that we use managed variables in the injection function, not shared or constant memory.
I see the reason for not using shared/constant memory, but I don't understand why using managed memory cannot hurt the target's execution. What if the target also uses managed memory and its virtual addresses happen to collide with the ones the injection function uses?

Graphics Workloads Instrumentation

Hello,

Is it possible to instrument graphics/gaming workloads with NVBit?
I have not tested this, but if I tried, would it work? Will the LD_PRELOAD trick work with graphics workloads, or do they have a different execution path than GPGPU workloads?

Thanks!

Issue with getSize() function member in Instr class

The member function getSize() of the Instr class, which returns the number of bytes loaded/stored by a memory instruction, seems to return incorrect results.
For example, for an instruction like this:

LDG.E.64.STRONG.CTA R12, [R2+0x4000];

I expect getSize() = 8 since it is a 64-bit (8-byte) load; however, getSize() returns 4?!
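Until that's resolved, a possible stopgap (hypothetical, not NVBit API) is to infer the width from the opcode string itself: SASS load/store opcodes typically carry the access width as a `.64`/`.128` modifier, with no width modifier generally meaning a 32-bit access. The function below encodes that assumption:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical workaround: infer the access size in bytes from a SASS
// load/store opcode string. Assumes the width appears as a ".64"/".128"
// modifier and that the default width is 32 bits; other modifiers such as
// ".E" (extended address) or ".STRONG.CTA" are ignored.
int size_from_opcode(const std::string& opcode) {
    std::istringstream ss(opcode);
    std::string tok;
    while (std::getline(ss, tok, '.')) {
        if (tok == "128") return 16;
        if (tok == "64") return 8;
        if (tok == "32") return 4;
        if (tok == "16" || tok == "U16" || tok == "S16") return 2;
        if (tok == "8" || tok == "U8" || tok == "S8") return 1;
    }
    return 4;  // no width modifier: assume a 32-bit access
}
```

One could call this on Instr::getOpcode() as a cross-check against getSize().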

Can NVBit work with nvprof ?

Hello,

I am trying to profile the execution time of a kernel instrumented by NVBit using nvprof. However, it seems the NVBit callback is not invoked when running under nvprof.
Specifically, I use the instr_count tool to instrument the kernel and vectoradd for testing, then execute the following commands:

export LD_PRELOAD=/home/ice/nvbit_release/tools/ice_play/ice_play.so
./vectoradd                 # NVBit callback is invoked 
nvprof ./vectoradd          # NVBit callback is not invoked

Yours,
IceCY

Can't instrument relocatable device code

I have a program in main.cu with a kernel that calls device functions declared in header.cuh; the definitions are in header.cu.

When I compile this with the -rdc flag and try instrumenting the code, none of the instrumented device code gets executed.

When I use verbose mode during instrumentation, the instructions inside header.cu show up, meaning NVBit is aware that they exist. However, the instrumented device function is never executed for code inside header.cu.

Multiple opcodes concatenated by .

Is there any documentation in the CUDA references about instructions with modifiers concatenated by "."? What is the difference between IMAD and IMAD.WIDE? Specifically, is the latter a PTX or a SASS instruction? I don't see any instruction named WIDE in the instruction reference.
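Both IMAD and IMAD.WIDE are SASS opcodes (the tokens after each `.` are modifiers selecting a variant); the closest PTX analogues are mad.lo and mad.wide. As I understand it (hedged: NVIDIA's Binary Utilities docs describe SASS opcodes only briefly, so the exact semantics below are an assumption), plain IMAD computes a 32-bit d = a*b + c, while IMAD.WIDE widens the 32x32-bit product to 64 bits before adding a 64-bit addend. A plain C++ model of the two:

```cpp
#include <cassert>
#include <cstdint>

// Model of a plain 32-bit IMAD: d = a * b + c, everything 32-bit (wraps).
uint32_t imad(uint32_t a, uint32_t b, uint32_t c) {
    return a * b + c;
}

// Model of IMAD.WIDE.U32 as commonly understood: the 32x32-bit product is
// widened to 64 bits and added to a 64-bit addend. (Assumption: this mirrors
// the actual SASS semantics, which NVIDIA does not document in detail.)
uint64_t imad_wide_u32(uint32_t a, uint32_t b, uint64_t c) {
    return static_cast<uint64_t>(a) * b + c;
}
```

The widening form is what compilers typically emit for 64-bit address arithmetic from 32-bit indices, which would explain IMAD.WIDE.U32 showing up next to memory operations.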

How to use shared memory in the inject device function?

I want to use shared memory in the injected device function, but it does not compile.

Device Code:

extern "C" __device__ __noinline__ void inject_kernel() {
    extern __shared__ int x[];
    printf("%d\n", x[0]);
}

Compiler Error:

ptxas error   : Allocating additional shared memory is not allowed when command line option '--compile-as-tools-patch' is specified  

Is there any way to use shared memory inside the injected function?

comparing instruction mix of nvprof and nvbit

I see that the instruction mix reported by nvprof is different from NVBit's. Some categories described here are not present in nvprof, and for the others I see big differences.

The program has one kernel, which is invoked once. The opcode counts are:

  ATOMS.ADD = 508920645
  BAR.SYNC = 123816
  BRA = 3817166581
  BSSY = 1037738104
  BSYNC = 1881710254
  EXIT = 1809
  FFMA = 1037672176
  FMUL = 518836088
  FSETP.GE.AND = 2593803797
  FSETP.GEU.AND = 178011288
  FSETP.GEU.OR = 340824800
  FSETP.LTU.OR = 518836088
  IADD3 = 3114835742
  IMAD = 1027773014
  IMAD.IADD = 3114829917
  IMAD.MOV.U32 = 1552036486
  IMAD.SHL.U32 = 3631492254
  IMAD.WIDE.U32 = 6322263
  IMAD.X = 201
  ISETP.GE.U32.AND = 525168200
  ISETP.GE.U32.OR = 11256
  ISETP.GT.U32.AND = 2593823093
  ISETP.GT.U32.OR = 45024
  ISETP.LE.U32.OR = 178011288
  ISETP.LT.U32.AND = 518836088
  ISETP.NE.AND = 12864
  LDC = 3631475973
  LDG.E.SYS = 6317439
  LDS.U = 1556621025
  LEA = 508922454
  LEA.HI.X = 201
  LOP3.LUT = 33768
  NOP = 123816
  S2R = 3216
  SEL = 5187673522
  SHF.R.U32.HI = 2593818269
  SHFL.IDX = 3114747909
  STG.E.64.SYS = 201
  STS = 268335

According to the reference, only

BAR.SYNC = 123816
NOP = 123816
S2R = 3216

are considered MISC. However, in the picture below, nvprof reports more than 158M MISC instructions.

[screenshot: nvprof instruction-mix chart]

I also see big differences in the other categories. Is that normal? Is there any reason for it?
