Giter VIP home page Giter VIP logo

ptxprofiler's Introduction

PTXprofiler

A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.

How to compile?

  • on Windows: compile with Visual Studio Community
  • on Linux: run chmod +x make.sh and ./make.sh path/to/kernel.ptx

How to use?

  1. Generate a .ptx file from your application; this works only with an Nvidia GPU. With the OpenCL-Wrapper, you can simply uncomment #define PTX in src/opencl.hpp and compile and run. A file kernel.ptx is created, containing the PTX assembly code.
  2. Run bin/PTXprofiler.exe path/to/kernel.ptx. For FluidX3D for example, this table is generated:
kernel name                     |flops  (float int    bit  )|copy  |branch|cache  (load  store)|memory (load  cached store)
--------------------------------|---------------------------|------|------|--------------------|---------------------------
initialize                      |   283    129     61     93|    33|     6|     0      0      0|   135     35      0    100
stream_collide                  |   363    261     35     67|    23|     2|     0      0      0|   153     77      0     76
update_fields                   |   160     56     37     67|    21|     2|     0      0      0|    93     77      0     16
voxelize_mesh                   |   170     91     34     45|    40|    11|    84     48     36|    37     36      0      1
transfer_extract_fi             |   460      0    221    239|   122|    63|     0      0      0|   180     80     20     80
transfer__insert_fi             |   483      0    247    236|   115|    47|     0      0      0|   180     80     20     80
transfer_extract_rho_u_flags    |    47      0     39      8|    23|     1|     0      0      0|    68     34      0     34
transfer__insert_rho_u_flags    |    47      0     39      8|    23|     1|     0      0      0|    68     34      0     34
  1. For each OpenCL/CUDA kernel, instructions are counted and listed:
    • GPUs compute floating-point, integer and bit manipulation operations on the same ALUs, so they are counted combined as flops, but also listed separately as float, int and bit.
    • Data movement operations are listed under copy.
    • Branches are listed under branch.
    • Total shared/local memory (L1 cache) accesses in Byte are listed under cache, with separate counters for load and store.
    • Total global memory (VRAM) accesses in Byte are listed under memory, with separate counters for load, cached (load from VRAM or L2 cache) and store.
  2. You can use the counted flops and memory accesses, together with the measured execution time of the kernel, to place it in a roofline model diagram.

Limitations

  • Matrix/tensor operations are not yet supported.
  • Non-unrolled loops are only counted for one iteration, but may be executed multiple times, duplicating the number of actually executed instructions inside the loop.

ptxprofiler's People

Contributors

projectphysx avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.