cloudcores / cuassembler Goto Github PK

View Code? Open in Web Editor NEW

359.0 10.0 66.0 2.98 MB

An unofficial cuda assembler, for all generations of SASS, hopefully ：）

License: MIT License

Python 79.34% Cuda 3.04% Makefile 0.90% C++ 15.28% Batchfile 0.03% C 1.40%

cuda assembler sass nvidia

cuassembler's Introduction

CuAssembler: An unofficial CUDA assembler

What is CuAssembler

CuAssembler is an unofficial assembler for nvidia CUDA. It's an assembler, reads assemblies(sass) and writes machine codes(in cubin). It's not another compiler just like officially provided by nvidia such as nvcc for CUDA C, and ptxas for ptx.

The aim of CuAssembler is to bridge the gap between ptx(the lowest level officially supported and documented by nvidia) and the machine code. Some similar packages include asfermi and maxas, which can only handle some of CUDA instruction sets. CuAssembler currently supports Pascal/Volta/Turing/Ampere instruction set(SM60/61/70/75/80/86/...), but the mechanism could be easily extended to older and possibly future CUDA instruction sets, since most of the instruction sets could be probed automatically.

NOTE: This library is still in its infancy, there are still a lot of works to be done. Interfaces and architectures are subject to change, use it at your own risk.

When and how should CuAssembler be used

Many CUDA users will inspect the generated sass code by cuobjdump after doing optimization of CUDA c code. The easiest way to tune the sass code is to modify CUDA c code itself, and then recheck the generated sass code again. For many cases, this will lead you to good enough codes (If you are really good at this :) ). However, for those ninja programers that want to optimize the codes down to every instruction, it would be rather upset when they cannot command the compiler to generate the code they want. An alternative tuning method is to modify the intermediate ptx code, which is full of vague variables particularly tedious and difficult to follow, and the generated machine codes are still not always satisfying. CuAssembler allows the user to tune the generated sass code directly.

It should be emphasized that, for most CUDA programmers, CUDA C (sometimes ptx) is always the first choice. It is full featured with great deal of compiling optimizations, officially supported and well documented by nvidia. They know best of their hardware, hence the compiler is also capable of doing some architecture specific optimizations. When the generated sass code is far from expected, you are very likely to have plenty of space for high level languages to play with. There are also large amount of communities and forums which could turn to for help. Playing with assemblies is pretty pain-staking comparing with high level languages, you need to worry about everything that could be done automatically by the compiler. It becomes an eligible option only when you are already quite familiar with CUDA c and ptx, and have played all the tricks you know to optimize, but still find the generated codes are not satisfying. Even in this case, it's still much more convenient to start with CUDA c or ptx, and then do some minor amendments based on the generated sass. This is the main usage CuAssembler is designed for: providing an option for minor adjustment of the generated machine codes, which is not possible by official tools.

Another important usage of CuAssembler is for micro-benchmarking, i.e., probing some specific details of micro-architecture by some specifically designed small programs. Good code optimization usually needs quite deep understanding of the hardware, especially performance related figures, such as the latency and throughput of different instructions, the cache hierarchy, the latency and throughput of every level of caches, cache replacement policies, etc. Many micro-benchmarking could be done using CUDA c, but its more straightforward and flexible when using assemblies, since you can not only arrange the instructions in any order you want, but also set the control codes directly, which is no way to be done in CUDA c or ptx.

As an assembler, CuAssembler simply translates the assemblies to machine codes literally, and then embeds them to cubin so it can be loaded and executed. It's programers' responsibility to guarantee the correctness of the code semantically, such as explicit register allocation, proper arrangement of instructions, and correct usage of registers (e.g., register pair for 64bit variables always starts from even). So you should get familiar with those conventions first, otherwise it's not possible to write legal assembly codes, and this kind of error will be far from conspicuous to catch. Nevertheless, legal assemblies does not imply legal program. There are many kinds of resources involved in CUDA program, such as general purpose registers, predicates, shared memories, and many others. They should match with the hardware configurations, and should be eligible for launching the specified dimension of blocks. Checking rigorous correctness of the whole program needs comprehensive understanding of the launch model and instruction set, both grammatically and semantically, far from possible without official support. So, it is left to the user to guarantee the correctness of the program, with very limited help from the assembler.

A short HOWTO

CuAssembler is not designed for creating CUDA program from scratch, it has to work with other CUDA toolkits. A good start of cubin is needed, maybe generated by nvcc from CUDA C using option -cubin or --keep, or ptxas from hand-written or tuned PTX codes. Currently nvcc doesn't support resuming the linking with modified cubin directly(not likely even in the future, due to its vulnerability). Thus the generated cubin usually need to be loaded in driver api. However, nvcc has a --dryrun option that can list all the commands that really builds up the compiling steps, we may hack this script(actually, just the ptxas step for generating cubin from ptx). Then we can run this program just using runtime api, which is much simpler. However, this also implies a limitation of our approach, all the sections, symbols, global variables in cubin should kept as is, otherwise the hacking may not work properly.

Remember to keep other optimization works done before coding with CuAssembler, since any modification of the input cubin may invalidate the modification done in CuAssembler, then you may need to redo all the work again.

See the User Guide and Tutorial for basic tutorial and introduction of input formats.

Prerequisites

CUDA toolkit 10+: Version 10+ is needed to support sm_75 (turing instruction sets), and version 11+ for ampere. Actually only the stand-alone program nvdisasm will be used by CuAssembler when saving cubin as cuasm, and cuobjdump may be used to dump sass. If you start from cuasm, no CUDA toolkit will be required. NOTE: it is known that some instructions or modifiers may not show up in the disassembly text, at least in some versions. Thus you may need to check some new version, if they have been fixed. Since nvdisasm and cuobjdump is stand-alone, you don't need to fetch full toolkit, just these two programs will do the job.
Python 3.8+: Previous python versions may also supported, but not tested yet.
Sympy 1.4+: Integeral (or rational) matrices with arbitrary precision needed by solving the LAE, and carrying out the null space of V. NOTE: before 1.4, sympy seems to cache all big integers, which may work like memory leak when you've assembled many instructions.
pyelftools: elf toolkit for handling cubin files.

sympy and pyelftools can be obtained with pip install sympy pyelftools.

Settings and Simple Usage

PATH and PYTHONPATH: you may need to add CuAssembler bin path (CuAssembler/bin) to system PATH for scripts to work, and including the root dir in PYTHONPATH is required for import CuAsm. Thus you may add these lines to your .bashrc(change the path accordingly):

  export PATH=${PATH}:~/works/CuAssembler/bin
  export PYTHONPATH=${PYTHOPATH}:~/works/CuAssembler/

In the dir bin, CuAssembler offers several python scripts(cuasm/hnvcc/hcubin/dsass/...) to accelerate the development procedure. Running with python cuasm.py or simply cuasm.py is not simple enough, thus a simbol link can be created:

ln -s cuasm.py cuasm
chmod a+x cuasm

You may just put this symbol link under your current PATH instead of adding CuAssembler/bin to your system PATH.

NOTE: most scripts(except hnvcc) also work under windows, the *.bat files under bin is the command-line wrapper.

cuasm

usage: cuasm [-h] [-o OUTFILE] [-f LOGFILE] [-v | -q] [--bin2asm | --asm2bin] infile [infile ...]

    Convert cubin from/to cuasm files.

    NOTE 1: if the output file already exist, the original file will be renamed to "outfile~".
    NOTE 2: if the logfile already exist, original logs will be rolled to logname.1, logname.2, until logname.3.

positional arguments:
  infile                Input filename. If not with extension .cubin/.bin/.cuasm/.asm, direction option --bin2asm or --asm2bin should be specified.

options:
  -h, --help            show this help message and exit
  -o OUTFILE, --output OUTFILE
                        Output filename, inferred from input filename if not given.
  -f LOGFILE, --logfile LOGFILE
                        File name for saving the log, default to none.
  -v, --verbose         Verbose mode, showing almost every log.
  -q, --quiet           Quiet mode, no log unless errores found.
  --bin2asm             Convert from cubin to cuasm.
  --asm2bin             Convert from cuasm to cubin.

Examples:
    $ cuasm a.cubin
        disassemble a.cubin => a.cuasm, text mostly inherited from nvdisasm. If output file name is not given,
        the default name is replacing the ext to .cuasm

    $ cuasm a.cuasm
        assemble a.cuasm => a.cubin. If output file name is not given, default to replace the ext to .cubin

    $ cuasm a.cubin -o x.cuasm
        disassemble a.cubin => x.cuasm, specify the output file explicitly

    $ cuasm a.cubin x.cuasm
        same as `cuasm a.cubin -o x.cuasm`

    $ cuasm a.o --bin2asm
        disassemble a.o => a.cuasm, file type with extension ".o" is not recognized.
        Thus conversion direction should be specified explicitly by "--bin2asm/--asm2bin".

    $ cuasm a.cubin -f abc -v
        disassemble a.cubin => a.cuasm, save log to abc.log, and verbose mode

dsass

usage: dsass [-h] [-o OUTFILE] [-k] [-n] [-f LOGFILE] [-v | -q] infile [infile ...]

    Format sass with control codes from input sass/cubin/exe/...

    The original dumped sass by `cuobjdump -sass *.exe` will not show scoreboard control codes,
    which make it obscure to inspect the dependencies of instructions.
    This script will extract the scoreboard info and show them with original disassembly.

    CAUTION: the sass input should with exactly same format of `cuobjdump -sass`, otherwise
             the parser may not work correctly.

    NOTE 1: For cubins of sm8x, the cache-policy desc bit of some instruction will be set to 1
            to show desc[UR#] explicitly, other type of inputs(sass/exe/...) won't do the hack,
            which means some instructions may not be assembled normally as in cuasm files.
            This also implies for desc hacked sass, code of instructions may be not consistent either.

    NOTE 2: if the output file already exist, the original file will be renamed to "outfile~".
    NOTE 3: if the logfile already exist, original logs will be rolled to log.1, log.2, until log.3.

positional arguments:
  infile                Input filename, can be dumped sass, cubin, or binary contains cubin.

options:
  -h, --help            show this help message and exit
  -o OUTFILE, --output OUTFILE
                        Output filename, infered from input filename if not given.
  -k, --keepcode        Keep code-only lines in input sass, default to strip.
  -n, --nodeschack      Do not hack desc bit, no matter SM version it is.
  -f LOGFILE, --logfile LOGFILE
                        File name for saving the logs, default to none.
  -v, --verbose         Verbose mode, showing almost every log.
  -q, --quiet           Quiet mode, no log unless errores found.

Examples:
    $ dsass a.cubin
        dump sass from a.cubin, and write the result with control code to a.dsass

    $ dsass a.exe -o a.txt
        dump sass from a.cubin, and write the result with control code to a.txt

    $ dsass a.sass
        translate the cuobjdumped sass into a.dsass

    $ dsass a.cubin -f abc -v
        convert a.cubin => a.dsass, save log to abc.log, and verbose mode

    $ dsass a.cubin -k
        usually lines with only codes in source sass will be ignored for compact output.
        use option -k/--keepcode to keep those lines.

hnvcc

NOTE: hnvcc only works under linux.

Usage: hnvcc nvcc_args...

hnvcc is the hacked wrapper of nvcc.
The operation depends on the environment variable 'HNVCC_OP':
    Not-set or 'none' : call original nvcc
    'dump' : dump cubins to hack.fname.sm_#.cubin, backup existing files.
    'hack' : hack cubins with hack.fname.sm_#.cubin, skip if not exist 
    Others : error

CAUTION:
    hnvcc hack/dump need to append options "-keep"/"-keep-dir" to nvcc.
    If these options are already in option list, hnvcc may not work right.

Examples:
    $ hnvcc test.cu -arch=sm_75 -o test               
        call original nvcc

    $ HNVCC_OP=dump test.cu -arch=sm_75 -o test       
        dump test.sm_#.cubin to hack.test.sm_#.cubin

    $ HNVCC_OP=hack test.cu -arch=sm_75 -o test       
        hack test.sm_#.cubin with hack.test.sm_#.cubin

hcubin

usage: hcubin [-h] [-o OUTFILE] [-f LOGFILE] [-v | -q] infile [infile ...]

    Hack the sm8x cubin with valid cache-policy desc bit set.

    Currently the disassembly of nvdisasm will not show default cache-policy UR:

    /*00b0*/                   LDG.E R8, [R2.64] ;                      /* 0x0000000402087981 */
                                                                        /* 0x000ea8000c1e1900 */
    /*00c0*/                   LDG.E R9, desc[UR6][R2.64+0x400] ;       /* 0x0004000602097981 */
                                                                        /* 0x000ea8200c1e1900 */

    The first disassembly line should be `LDG.E R8, desc[UR4][R2.64] ;`,
    in which UR[4:5] is the default cache-policy UR and not showed, which may cause assembly confusion.

    But if the 102th bit(the "2" in last line 0x000ea8200c1e1900) is set,
    all cache-policy UR will be showed, that will complete the assembly input for the encoding.

    This script will set that bit for every instruction that needs desc shown.

positional arguments:
  infile                Input filename, should be a valid cubin file.

options:
  -h, --help            show this help message and exit
  -o OUTFILE, --output OUTFILE
                        Output filename, infered from input filename if not given.
  -f LOGFILE, --logfile LOGFILE
                        File name for saving the logs, default to none.
  -v, --verbose         Verbose mode, showing almost every log.
  -q, --quiet           Quiet mode, no log unless errores found.

Examples:
    $ hcubin a.cubin
        hack a.cubin into a.hcubin, default output name is replacing the ext to .hcubin

    $ hcubin a.cubin -o x.bin
        hack a.cubin into x.bin

    $ hcubin a.cubin x.bin
        same as `hcubin a.cubin -o x.bin`

Classes

CuAsmLogger: A logger class utilizing python logging module. Note all the logging is done by a private logger, thus other loggers are not likely to be affected, if they use their own logger.
CuAsmParser: A parser class that can parse the user modified .cuasm text file, and save the result as .cubin.
CubinFile: It can read in a .cubin file, rewrite it into an editable .cuasm text file.
CuInsAssembler: The class that handles the value matrix V and solution of w for a special instruction key, such as FFMA_R_R_R_R.
CuInsAssemblerRepos: Repository of CuInsAssembler for all known keys. Constructing a workable repos from scratch is very time consuming, and it requires quite wide range of inputs that cover all frequently used instructions. Thus a pre-gathered repos is available in DefaultInsAsmRepos.${arch}.txt. Note: the repository may be incomplete, but user can easily update it.
CuInsParser: The class to parse the instruction string to keys, values and modifiers.
CuInsFeeder: A simple instruction feeder reading instructions from sass dumped by cuobjdump.
CuKernelAssembler: Assembler for a kernel, which should handle all kernel wide parameters, mostly nvinfo attributes.
CuNVInfo: A simple class that handles NVInfo section of cubin. This class is far from complete and robust, thus some NVInfo attributes have very limited support in CuAssembler.
CuSMVersion：A class that provides a uniform interface of all SM versions. All other classes are not recommended to contain architecture dependent treatments (well, at least hopefully...). Thus for future architectures, most of the work should be in this class.

Future plan

Likely to support:

Better coverage of intructions, bugfixes for officially unsupported instructions.
Extend to more compute capabilities, sm_60/61/70/75/80/86 will be mostly concerned.
More robust correctness check with aid of nvdisasm.
Automatically set control codes.
Alias and variable support, for easier programming, may be achieved by preprocessing?

Less likely to support, but still on the plan:

Register counting, and possibly register allocation
More robust parsing and user friendly error reporting.
Control flow support? May also be achieved by preprocessing in python?
And others...

cuassembler's People

Contributors

Stargazers

Watchers

cuassembler's Issues

Assembling failed (NewModi): Unknown modifiers: ({'2_R.64'})

mhi, big god
my CuAssembler raise exception when I test the "TestData"
my nvcc is 11.3, and my arch is sm_86
it throw following exception when I executed "make hack":

`2023-09-16 23:27:38,221 - ERROR - Assertion failed in:

File hack.cudatest.sm_86.cuasm:2876 :

    [B------:R-:W2:-:S01]         /*0050*/                   LDG.E R5, [R22.64] ;

Error when assembling instruction "[B------:R-:W2:-:S01] LDG.E R5, [R22.64] ;":

    Assembling failed (NewModi): Unknown modifiers: ({'2_R.64'})

Known Records:

    LDG.E R6, [R4] ;

    LDG.E R0, [R4] ;

    LDG.E R25, [R18] ;

    @P1 LDG.E R58, [R58] ;

    @P2 LDG.E.128 R36, [R48] ;

    @P2 LDG.E.128 R68, [R76+0x80] ;

    @P1 LDG.E.64 R96, [R96] ;

    @P0 LDG.E.U16 R38, [R38] ;

    @P0 LDG.E.LTC128B R42, [R88] ;

    @!P0 LDG.E.STRONG.GPU R40, [R34] ;

    LDG.U16.CONSTANT R17, [R17] ;

    LDG.U16.CONSTANT R9, [R9+-0x40] ;

    LDG.U8.CONSTANT R17, [R17] ;

    LDG.U16 R13, [R6] ;

    @P2 LDG.E.EL.LTC128B.STRONG.GPU R190, [R188] ;

    @P2 LDG.E.EL.LTC128B.STRONG.GPU R190, [R188] ;`

and the codes around line2876 in "hack.cudatest.sm_86.cuasm" as following:

`2871 [B------:R-:W-:Y:S02] /0000/ IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;

  2872      [B------:R-:W-:-:S01]         /*0010*/                   IMAD.MOV.U32 R22, RZ, RZ, c[0x0][0x160] ;

  2873      [B------:R-:W-:-:S01]         /*0020*/                   ULDC.64 UR36, c[0x0][0x118] ;

  2874      [B------:R-:W-:-:S01]         /*0030*/                   IMAD.MOV.U32 R23, RZ, RZ, c[0x0][0x164] ;

  2875      [B------:R-:W-:Y:S04]         /*0040*/                   IADD3 R1, R1, -0x28, RZ ;

  2876      [B------:R-:W2:-:S01]         /*0050*/                   LDG.E R5, [R22.64] ;

  2877      [B------:R-:W-:-:S02]         /*0060*/                   MOV R2, 32@lo(flist) ;

  2878      [B------:R-:W-:-:S01]         /*0070*/                   MOV R3, 32@hi(flist) ;

  2879      [B------:R-:W0:-:S04]         /*0080*/                   S2R R17, SR_CTAID.X ;

  2880      [B------:R-:W0:-:S01]         /*0090*/                   S2R R0, SR_TID.X ;

  2881      [B------:R-:W-:-:S02]         /*00a0*/                   IMAD.MOV.U32 R18, RZ, RZ, 0x4 ;`

and line2876 in "hack.cudatest.sm_86.cuasm" as following:
2876 [B------:R-:W2:-:S01] /*0050*/ LDG.E R5, [R22.64] ;

it will be ok when I change arch to sm_60/sm_75。
it will throw above exceptions if I change arch to sm_80/86
but, my real hardware is sm_86, so, I cannot pass these exceptions.
I don't know how to fix this trouble, would you like to help me to fix this trouble?
or, would you like to tell me the reason?

Compile CUASM error

About Ada (fp8 tensor core) and tensor core mma (and mma.sp) support..

Hi,
Just want to know status on tensor core support..
Thanks..

how to understand "L2Bank"

Hi, big god.
I found the "L2Bank" in your microbenchmark.
I don't know the principle and theory of your code. would you like to teach me?

I found some reference document about "L2Bank" from internet, but, the "L2Bank" is about Ampere.
and, I noticed that the ARCH is sm_50 and sm_75 in your makefile
So, I'm very confused, what is this "L2Bank"?

I run your microbenchmark on my SM_86, I cannot figure out what it want to demostrate,
So, how to understand it ?

I cannot run CubinFile

我的python代码如下:
binname = 'D:\WorkSpace\cuda_workspace\instruction_test\simpletest\x64\Release\kernel.sm_61.cubin'
cf = CubinFile(binname)
asmname = binname.replace('.cubin', '.cuasm')
cf.saveAsCuAsm(asmname)

各种软件版本如下:
python is 3.7.4
cuda tool kit is 10.2
Sympy is 1.4
pyelftools is the latest

总是提示“Exception: The segment range (0x2a4, 0x4fc) doesnot align with sections!”
即便把sm_61.cubin换成相应的sm_50或者sm_75都是类似的错误提示，不知道大神能不能帮忙看看什么原因？
相应的cu文件和cubin文件在附件中
simpletest.zip

Can we eliminate all negative numbers in the weight matrixs？

I found that there are many negative numbers in the weight matrixs. From what I see， this is because a certain modifier is omitted for some instruction . Such as the F2I_R_R instruction of SM_70, it has many data type modifers , for example "U64, S64, U32, S16, U8, S8", but the default "S32" type is omitted. If we add the "S32" modifier back，then after resolving the weight， we will get a matrix without negative number. Is there a reliable way to achieve this function?

Insufficient basis, try CuAsming more instructions!

2023-01-19 00:42:36,214 -          - Running CuAsmParser.parse...
2023-01-19 00:42:36,215 -    ENTRY -     Parsing file BuildSteps.sm_75.cuasm
2023-01-19 00:42:36,224 -          -     Running CuAsmParser.__preScan...
2023-01-19 00:42:37,001 -  WARNING - Line 19808: Weak symbol found! The implementation is not complete, please be cautious...
2023-01-19 00:42:37,005 -          -     Running CuAsmParser.__gatherTextSectionSizeLabel...
2023-01-19 00:42:37,006 -          -     Running CuAsmParser.__buildInternalTables...
2023-01-19 00:42:37,007 -          -     Running CuAsmParser.__evalFixups...
2023-01-19 00:42:37,009 -          -     Running CuAsmParser.__parseKernels...
2023-01-19 00:42:37,013 -     PROC -         Parsing kernel text of ".text._Z10DoEatStepsILi3EEvi"...
2023-01-19 00:42:37,057 -    ERROR - Assertion failed in:
    File BuildSteps.sm_75.cuasm:18452 :
        [B------:R-:W-:-:S02]         /*0d80*/              @!P0 IADD3.X R2, P1, R2, R15, RZ, P1, !PT ;
    Error when assembling instruction "[B------:R-:W-:-:S02] @!P0 IADD3.X R2, P1, R2, R15, RZ, P1, !PT ;":
        Assembling failed (NewVals): Insufficient basis, try CuAsming more instructions!
    Known Records:
        IADD3.X R4, P1, R17, R4, RZ, P1, !PT ;
        IADD3.X R16, P0, R5, R4, RZ, P0, !PT ;
        IADD3.X R2, P0, RZ, R0, RZ, P0, !PT ;
        IADD3.X R4, P0, RZ, R4, RZ, P0, !PT ;
        IADD3.X RZ, P0, R4, R4, RZ, P0, !PT ;
        IADD3.X R2, P0, R3.reuse, R3, RZ, P0, !PT ;
        IADD3.X R13, P1, RZ, R4, RZ, P0, !PT ;
        IADD3.X R44, P5, R5, ~R28, RZ, P5, !PT ;
        IADD3.X R24, P3, ~R0, R37, RZ, P0, !PT ;

How do I "CuAsm more instructions?"

BuildSteps.sm_75.cuasm is the unmodified output from cuasm.cmd BuildSteps.sm_75.cubin.

FYI: The 'weak' symbol is .weak $_Z11DoInitFirstj$__cuda_sm20_rem_u64.

Some confusions about stall count

I am trying to interleave the instructions of LD and FFMA for maximum throughput. For LDG instruction, I notice the stall count is always 4, such as:
[----:B------:R-:W2:-:S04] /0360/ LDG.E.SYS R48, [R2] ;
[----:B------:R-:W2:-:S04] /0370/ LDG.E.SYS R50, [R4] ;
[----:B------:R-:W3:-:S04] /0380/ LDG.E.SYS R52, [R2+0x4] ;
[----:B------:R-:W4:-:S04] /0390/ LDG.E.SYS R54, [R2+0x8] ;
[----:B------:R-:W5:-:S04] /03a0/ LDG.E.SYS R56, [R2+0xc] ;
[----:B------:R-:W5:-:S04] /03b0/ LDG.E.SYS R58, [R2+0x10] ;
However, the gloabl load instruction actually holds a variable latency. The instruction using the loaded oprand by LDG has to be synchronized with wrtie barrier instead of stall count. So why does the stall count of LDG is set as 4.

Another confusion is about the stall count for fix latency instruction, such as:
[-R--:B------:R-:W-:-:S02] /0540/ FFMA R22, R54, R66.reuse, R22 ;
[-R--:B------:R-:W-:-:S02] /0550/ FFMA R20, R56, R66.reuse, R20 ;
[-R--:B------:R-:W-:-:S02] /0560/ FFMA R18, R58, R66.reuse, R18 ;
[-R--:B------:R-:W-:-:S02] /0570/ FFMA R16, R60, R66.reuse, R16 ;
[-R--:B------:R-:W-:-:S02] /0580/ FFMA R14, R62, R66.reuse, R14 ;
As we can see the stall count is set as 2. The author of Maxas declares the pipline of a arithmetic instruction is 6 cycles in maxwell arch. However, the author implementing dual issue as following:
--:-:-:-:0 FFMA cx02y00, j0Ax02, j0By00, cx02y00;
--:-:-:-:1 LDS.U.128 j1Ax00, [readAs + 4x<1128 + 00>];
--:-:-:-:1 FFMA cx02y01, j0Ax02, j0By01, cx02y01;
--:-:-:-:0 FFMA cx00y01, j0Ax00, j0By01, cx00y01;
--:-:-:-:1 LDS.U.128 j1By00, [readBs + 4x<1128 + 00>];
--:-:-:-:1 FFMA cx00y00, j0Ax00, j0By00, cx00y00;
--:-:-:-:0 FFMA cx03y00, j0Ax03, j0By00, cx03y00;
--:-:-:-:1 LDS.U.128 j1Ax64, [readAs + 4x<1128 + 64>];
--:-:-:-:1 FFMA cx03y01, j0Ax03, j0By01, cx03y01;
--:-:-:-:0 FFMA cx01y01, j0Ax01, j0By01, cx01y01;
--:2:1:-:2 LDS.U.128 j1By64, [readBs + 4x<1128 + 64>]; // Set Dep 1,2 Stall 2

02:-:-:-:1 IADD readAs, readAs, 16; // Wait Dep 2
--:-:-:-:1 IADD readBs, readBs, 16;

01:-:-:-:1 FFMA cx02y00, j1Ax02, j1By00, cx02y00; // Wait Dep 1
cited from Maxas
I understand the first 0 stall count is for dual issue. After a dual issue, there is a FFMA instruction(3rd instruction) followed by LDS instruction, and its stall count is 1, following by this FFMA another dual issue is triggered. My understanding is this 1 stall count is for issue this instruction and all the instruction shown in this code are stored in instruction buffer without wating them to finish. The doubtion is, in my assembly code, I don't use such strategy, so why the stall count is set as 2? Is this implies the latency of FFMA is 2 clock cycles. If this is not the case, how does a program ensure the FFMA computation is finish and safe to use in the following inscturion?
In some code, like:
[----:B------:R-:W-:-:S02] /0470/ IADD3 R44, R44, 0x8, RZ ;
[----:B------:R-:W-:-:S02] /0480/ ISETP.GE.AND P0, PT, R86, 0x80, PT ;
[----:B------:R-:W-:-:S01] /0490/ IADD3 R46, R46, 0x8, RZ ;
[----:B--2---:R-:W-:-:S02] /04a0/ FFMA R42, R48, R50, R42 ;
You can see for a fix latency instruction, such as IADD3 in this code, the stall count of the instruction is also varied in different scenario. For stall count as 1, I would assume this different is to interleave the instruction issue. But agian, how should a program make sure the oprand is ready for following instruction.

Error when assembling instruction "[B------:R3:W4:-:S06] @P3 LDG.E.EL.128 R12, desc[UR8][R16.64] ;":

Hi,

Many thanks for releasing this assembler! I was trying to disassemble pytorch's kernel, but and then I saw this error:

Error when assembling instruction "[B------:R3:W4:-:S06] @p3 LDG.E.EL.128 R12, desc[UR8][R16.64] ;":
Assembling failed (NewModi): Unknown modifiers: ({'0_EL'})

May I know is there any idea how to resolve this error?

Best,

about the "L2::128B"

Hi, big god
I meet a new trouble about "LDG"，the assembler raise exception like this:
` File hack.main.sm_86.cuasm:795 :

    [B------:R-:W2:-:S04]         /*0080*/                   LDG.E.LTC128B.CONSTANT R4, desc[UR4][R2.64] ;

Error when assembling instruction "[B------:R-:W2:-:S04] LDG.E.LTC128B.CONSTANT R4, desc[UR4][R2.64] ;":

    Assembling failed (NewModi): Unknown modifiers: ({'0_LTC128B'})`

my LDG instruction in ptx like this:
` asm volatile (

    "ld.global.nc.L2::128B.v4.b32 {%0,  %1,  %2,  %3},  [%4];\n"

    : "=r"(reg0.x), "=r"(reg0.y), "=r"(reg0.z), "=r"(reg0.w)

    : "l"(src_ptr)

);`

this exception will be removed if I remove the "L2::128B"

so, would you like to teach me how to solve this trouble?

Cool project

This is a really cool and ambitious project!
I just wanted to share that I found an instruction description for all SASS instructions recently, see DocumentSASS. Perhaps the process of creating an assembler can be automated from these files.

Best,
Sebastian

Registers remapping in matrix multiplication

Hi,
Im researching the paper "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking". I follow the step of the first part of the article which is registers remapping. In the paper, the registers' indices used for holding the tiles of A and B matrix are interleved. As shown in following picture.

The regs mapping is generated by NVCC 9.0 as described by author.

As I use cuAssembler with NVCC 10.2, the generated regs mapping is as follow.

You can see all the regs used for tile A and B are with even indices, which means all the regs used for tile C storage should be with odd indices (for avoiding reg bank conflict as FFMA op). But, as detailed in the second pic, the regs for C storage are mixed with odd and even indices, indicating the odd regs are not enough to prevent bank conflict. Thus, I try two potential methods to solve such issue. I firstly try changing some odd regs for tile A and B storage, insted of tile C, however, this results in cuda instruction error (CUDA error at matrixMul.cu:144 code=715(cudaErrorIllegalInstruction)) as I run the app. Then, I try to modify some even reg indices to to odd, for instance, R42, R40, R38 to R95, R97, R99. Unfortunately, this modification give me same error msg.
So I have three questions:

Is it possible to change regs' attribute once the compiler has defined them. Such as:
[----:B------:R-:W-:-:S01] /0170/ CS2R R42, SRZ ;
[----:B------:R-:W-:-:S02] /02f0/ MOV R93, RZ ;
[----:B------:R-:W2:-:S04] /0360/ LDG.E.SYS R48, [R2] ;
[----:B------:R-:W2:-:S04] /0370/ LDG.E.SYS R50, [R4];
[----:B--2---:R-:W-:-:S02] /04a0/ FFMA R42, R48, R50, R42 ;
to
[----:B------:R-:W-:-:S01] /0170/ CS2R R42, SRZ ;
[----:B------:R-:W-:-:S02] /02f0/ MOV R48, RZ ;
[----:B------:R-:W2:-:S04] /0360/ LDG.E.SYS R93, [R2] ;
[----:B------:R-:W2:-:S04] /0370/ LDG.E.SYS R50, [R4];
[----:B--2---:R-:W-:-:S02] /04a0/ FFMA R42, R93, R50, R42;
Is it possible to use more reg beyond the reg num delcared by compiler. For instance, the maximum reg index in my case is R93, but I would like to use R95.
If 1 and 2 are not possible, is there any other method can be used to optimize bank conflict in sunch case.

Many thanks!

Errors on CUDA 11.6

Hi, I tried to use CuAssembler on the sample CuTest code. It works well on the build-in cubin files. However, when I recompiled it using NVCC11.6 and the same arguments, the new cubin file can not be disassembled correctly and gots the following error:

 2022-05-12 23:33:29,857 -          - Running CubinFile.loadCubin...
2022-05-12 23:33:29,857 -    ENTRY -     Loading cubin file cudatest.sm_75.cubin...
2022-05-12 23:33:29,858 -    ERROR - Abnormal elf layout detected! No program header found!
Traceback (most recent call last):
  File "a.py", line 4, in <module>
    cf = CubinFile(binname)
  File "/home//sync_test/CuAssembler/CuAsm/CubinFile.py", line 28, in __init__
    self.loadCubin(cubinname)
  File "/home//sync_test/CuAssembler/CuAsm/CuAsmLogger.py", line 143, in wrapper
    ret = func(*args, **kwargs)
  File "/home//sync_test/CuAssembler/CuAsm/CubinFile.py", line 72, in loadCubin
    raise Exception(msg)
Exception: Abnormal elf layout detected! No program header found!

Any plan to support new compilers?

make dump issue

Hello,
When I try cuassember followed your suggestion , I run 'make dump' but I met next issue , could you kindly help ?
HNVCC_OP=dump hnvcc test_war.cu -arch=sm_60 -o test_war
HNVCC_OP is %s HNVCC_OP

Getting command list...

Creating keep dir hnvcc_keep_dir

Error when calling run(args) with args=['cicc', '--c++14', '--gnu_version=70500', '--orig_src_file_name', 'test_war.cu', '--allow_managed', '-arch', 'compute_60', '-m64', '-ftz=0', '-prec_div=1', '-prec_sqrt=1', '-fmad=1', '--include_file_name', 'test_war.fatbin.c', '-tused', '-nvvmir-library', '/usr/local/cuda-11.0/bin/../nvvm/libdevice/libdevice.10.bc', '--gen_module_id_file', '--module_id_file_name', 'hnvcc_keep_dir/test_war.module_id', '--gen_c_file_name', 'hnvcc_keep_dir/test_war.cudafe1.c', '--stub_file_name', 'hnvcc_keep_dir/test_war.cudafe1.stub.c', '--gen_device_file_name', 'hnvcc_keep_dir/test_war.cudafe1.gpu', 'hnvcc_keep_dir/test_war.cpp1.ii', '-o', 'hnvcc_keep_dir/test_war.ptx']!
[Errno 2] No such file or directory: 'cicc'

Exception: ('Duplicate symbol @%#x with name %s!', 24, '')

libcublasLt_static.1028.sm_75.zip
libcublasLt_static.1028.sm_75_cubin.zip
I successfully create the assembly code from libcublasLt_static.1028.sm_75.cubin with cuasm libcublasLt_static.1028.sm_75.cubin.
When I try to regenerate the cubin with cuasm libcublasLt_static.1028.sm_75.cuasm -o new_libcublasLt_static.1028.sm_75.cubin it shows me the following error:
2023-01-27 16:55:59,995 - - Running CuAsmParser.parse...
2023-01-27 16:55:59,996 - ENTRY - Parsing file libcublasLt_static.1028.sm_75.cuasm
2023-01-27 16:56:00,025 - - Running CuAsmParser.__preScan...
2023-01-27 16:56:03,810 - - Running CuAsmParser.__gatherTextSectionSizeLabel...
2023-01-27 16:56:03,814 - - Running CuAsmParser.__buildInternalTables...
Traceback (most recent call last):
File "/home1/public/manospavl/binary_modification/CuAssembler/bin/cuasm", line 153, in
doProcess(infile, outfile, direction)
File "/home1/public/manospavl/binary_modification/CuAssembler/bin/cuasm", line 94, in doProcess
cuasm2cubin(src, dst)
File "/home1/public/manospavl/binary_modification/CuAssembler/bin/cuasm", line 72, in cuasm2cubin
cap.parse(asmname)
File "/home1/public/manospavl/binary_modification/CuAssembler/CuAsm/CuAsmLogger.py", line 210, in wrapper
ret = func(*args, **kwargs)
File "/home1/public/manospavl/binary_modification/CuAssembler/CuAsm/CuAsmParser.py", line 766, in parse
self.__buildInternalTables()
File "/home1/public/manospavl/binary_modification/CuAssembler/CuAsm/CuAsmLogger.py", line 241, in wrapper
ret = func(*args, **kwargs)
File "/home1/public/manospavl/binary_modification/CuAssembler/CuAsm/CuAsmParser.py", line 929, in __buildInternalTables
self.__mSectionDict['.symtab'].getData())
File "/home1/public/manospavl/binary_modification/CuAssembler/CuAsm/CuAsmParser.py", line 147, in buildSymbolDict
raise Exception('Duplicate symbol @%#x with name %s!', p, name)
Exception: ('Duplicate symbol @%#x with name %s!', 24, '')

An error occurred when parsing the cuasm file into a cubin file

Hello, I am using CuAssembler's example.

__global__ void vectorAdd(const float* a, const float* b, float* c)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    c[idx] = a[idx] + b[idx];
}

The previous steps are normal. But when I tried to convert the cuasm file to cubin file, I get an error, as follows:

2022-01-28 05:55:42,506 -          - Running CuAsmParser.parse...
2022-01-28 05:55:42,506 -    ENTRY -     Parsing file vecadd.sm_70.cuasm
2022-01-28 05:55:42,507 -          -     Running CuAsmParser.__preScan...
Traceback (most recent call last):
  File "asm2bin.py", line 6, in <module>
    cap.parse(asmname)
  File "/CuAssembler/CuAsm/CuAsmLogger.py", line 143, in wrapper
    ret = func(*args, **kwargs)
  File "CuAssembler/CuAsm/CuAsmParser.py", line 734, in parse
    self.__preScan()
  File "CuAssembler/CuAsm/CuAsmLogger.py", line 174, in wrapper
    ret = func(*args, **kwargs)
  File "CuAssembler/CuAsm/CuAsmParser.py", line 842, in __preScan
    self.__dirDict[cmd](args)
  File "CuAssembler/CuAsm/CuAsmParser.py", line 652, in <lambda>
    '.__elf_flags'            : (lambda args: self.__dir_elfheader('flags'           , args)),
  File "CuAssembler/CuAsm/CuAsmParser.py", line 1406, in __dir_elfheader
    self.__mCuInsAsmRepos.setToDefaultInsAsmDict()
  File "/CuAssembler/CuAsm/CuInsAssemblerRepos.py", line 38, in setToDefaultInsAsmDict
    fname = Config.getDefaultInsAsmReposFile(self.__mSMVersion.getVersionNumber())
  File "/CuAssembler/CuAsm/config.py", line 52, in getDefaultInsAsmReposFile
    raise IOError('File %s not found!' % repos_name)
OSError: File DefaultInsAsmRepos.sm_70.txt not found!

The software versions are as follows:
python 3.7
CUDA 11.2

GPU used is V100

How should this situation be resolved?

No effect after adjusting cuasm

I used vectorAdd from cuda samples for experiment. Follow the steps in User Guide, I got my cuasm. Here is the critical part of my cuasm
[----:B------:R-:W-:-:S06] /0008/ MOV R1, c[0x0][0x20] ;
[----:B------:R-:W0:-:S01] /0010/ S2R R0, SR_CTAID.X ;
[----:B------:R-:W1:-:S15] /0018/ S2R R2, SR_TID.X ;
[R---:B0-----:R-:W-:-:S01] /0028/ XMAD.MRG R3, R0.reuse, c[0x0] [0x8].H1, RZ ;
[R---:B-1----:R-:W-:-:S06] /0030/ XMAD R2, R0.reuse, c[0x0] [0x8], R2 ;
[----:B------:R-:W-:-:S06] /0038/ XMAD.PSL.CBCC R0, R0.H1, R3.H1, R2 ;
[----:B------:R-:W-:Y:S13] /0048/ ISETP.GE.AND P0, PT, R0, c[0x0][0x158], PT ;
[----:B------:R-:W-:Y:S10] /0050/ NOP ;
[----:B------:R-:W-:-:S13] /0058/ @p0 EXIT ;
[R---:B------:R-:W-:-:S01] /0068/ SHL R6, R0.reuse, 0x2 ;
[----:B------:R-:W-:-:S05] /0070/ SHR R0, R0, 0x1e ;
[R---:B------:R-:W-:-:S06] /0078/ IADD R4.CC, R6.reuse, c[0x0][0x140] ;
[R---:B------:R-:W-:-:S02] /0088/ IADD.X R5, R0.reuse, c[0x0][0x144] ;
[----:B------:R-:W-:-:S00] /0090/ { IADD R2.CC, R6, c[0x0][0x148] ;
[----:B------:R-:W-:-:S06] /0098/ LDG.E R4, [R4] }
[----:B------:R-:W-:Y:S02] /00a8/ IADD.X R3, R0, c[0x0][0x14c] ;
[----:B------:R-:W5:-:S01] /00b0/ LDG.E R2, [R2] ;
[----:B------:R-:W-:-:S06] /00b8/ IADD R6.CC, R6, c[0x0][0x150] ;
[----:B------:R-:W-:-:S07] /00c8/ IADD.X R7, R0, c[0x0][0x154] ;
[----:B-----5:R-:W-:-:S06] /00d0/ FADD R0, R2, R4 ;
[----:B------:R-:W-:-:S02] /00d8/ FADD R0, R0, 1 ;
[----:B------:R-:W-:-:S01] /00e8/ STG.E [R6], R0 ;
[----:B------:R-:W-:Y:S04] /00f0/ NOP ;
[----:B------:R-:W-:-:S15] /00f8/ EXIT ;

I simply modified [----:B------:R-:W-:-:S02] /*00d8*/ FADD R0, R0, 1 ; to [----:B------:R-:W-:-:S02] /*00d8*/ FADD R0, R0, 2 ;, but the results didn't change. Is there any possible reason?

Here is my script for compile
fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " -no-asm "--image3=kind=elf,sm=50,file=new_cudatest.sm_50.cubin" "--image3=kind=ptx,sm=50,file=cudatest.ptx" --embedded-fatbin="cudatest.fatbin.c"

gcc -E -x c++ -D__CUDACC__ -D__NVCC__ -I"/usr/local/cuda-10.2/samples/common/inc" "-I/usr/local/cuda-10.2/bin/../targets/aarch64-linux/include" -D__CUDACC_VER_MAJOR__=10 -D__CUDACC_VER_MINOR__=2 -D__CUDACC_VER_BUILD__=89 -include "cuda_runtime.h" "cudatest.cu" -o "cudatest.cpp4.ii"

cudafe++ --c++14 --gnu_version=70500 --allow_managed --unsigned_chars --m64 --parse_templates --gen_c_file_name "cudatest.cudafe1.cpp" --stub_file_name "cudatest.cudafe1.stub.c" --module_id_file_name "cudatest.module_id" "cudatest.cpp4.ii"

gcc -D__CUDA_ARCH__=500 -c -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -I"/usr/local/cuda-10.2/samples/common/inc" "-I/usr/local/cuda-10.2/bin/../targets/aarch64-linux/include" "cudatest.cudafe1.cpp" -o "cudatest.o"
nvlink --arch=sm_50 --register-link-binaries="cudatest_dlink.reg.c" -m64 "-L/usr/local/cuda-10.2/bin/../targets/aarch64-linux/lib/stubs" "-L/usr/local/cuda-10.2/bin/../targets/aarch64-linux/lib" -cpu-arch=AARCH64 "cudatest.o" -lcudadevrt -o "cudatest_dlink.sm_50.cubin"

fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " -no-asm -link "--image3=kind=elf,sm=50,file=cudatest_dlink.sm_50.cubin" --embedded-fatbin="cudatest_dlink.fatbin.c"
gcc -c -x c++ -DFATBINFILE=""cudatest_dlink.fatbin.c"" -DREGISTERLINKBINARYFILE=""cudatest_dlink.reg.c"" -I. -D__NV_EXTRA_INITIALIZATION= -D__NV_EXTRA_FINALIZATION= -D__CUDA_INCLUDE_COMPILER_INTERNAL_HEADERS__ -I"/usr/local/cuda-10.2/samples/common/inc" "-I/usr/local/cuda-10.2/bin/../targets/aarch64-linux/include" -D__CUDACC_VER_MAJOR__=10 -D__CUDACC_VER_MINOR__=2 -D__CUDACC_VER_BUILD__=89 "/usr/local/cuda-10.2/bin/crt/link.stub" -o "cudatest_dlink.o"

g++ -Wl,--start-group "cudatest_dlink.o" "cudatest.o" "-L/usr/local/cuda-10.2/bin/../targets/aarch64-linux/lib/stubs" "-L/usr/local/cuda-10.2/bin/../targets/aarch64-linux/lib" -lcudadevrt -lcudart_static -lrt -lpthread -ldl -Wl,--end-group -o "cudatest"

raise Exception('Unknown implicit section %s !'%secname)

C:\Python39>python "G:\CuAssembler\bin\cuasm.py" --bin2asm "G:\elf3.dat"
,023 - - Running CubinFile.loadCubin...
,023 - ENTRY - Loading cubin file G:\elf3.dat...
,044 - - Running CubinFile.disassembleCubin...
,044 - PROC - Disassembling G:\elf3.dat...
,779 - - Func CubinFile.disassembleCubin completed! Time= 12.7356 secs.
,044 - - Func CubinFile.loadCubin completed! Time= 13.0207 secs.
,044 - - Running CubinFile.saveAsCuAsm...
,044 - ENTRY - Saving to cuasm file G:\elf3.cuasm...
Traceback (most recent call last):
File "G:\CuAssembler\bin\cuasm.py", line 153, in
doProcess(infile, outfile, direction)
File "G:\CuAssembler\bin\cuasm.py", line 92, in doProcess
cubin2cuasm(src, dst)
File "G:\CuAssembler\bin\cuasm.py", line 68, in cubin2cuasm
cf.saveAsCuAsm(asmname)
File "G:\CuAssembler\CuAsm\CuAsmLogger.py", line 210, in wrapper
ret = func(*args, **kwargs)
File "G:\CuAssembler\CuAsm\CubinFile.py", line 526, in saveAsCuAsm
self.__writeImplicitSectionAsm(fout, secname)
File "G:\CuAssembler\CuAsm\CubinFile.py", line 445, in __writeImplicitSectionAsm
raise Exception('Unknown implicit section %s !'%secname)
Exception: Unknown implicit section .nv.info._ZNK45_GLOBAL__N__22_octane_kernels_cpp1_ii_kernel011Wavelengths33convertFilterSpectrumToLinearSrgbE6float4$1021 !

Parsing of cuasm file failed

Hi,

I ran into the following issue when trying to use CuAssembler:

Error when assembling instruction "[----:B------:R-:W-:Y:S04] SHF.L.W.U32.HI R11, R4, 0x13, R4 ;":("Assembling failed (NewModi): Unknown modifiers: ({'0_W'})",)

I am performing the following steps (without modifying the cuasm file):

disassemble my .cubin file into a .cuasm file using CuAsm.CubinFile
assemble .cuasm to .cubin using CuAsm.CuAsmParser

I checked the repo file and it should contain 0_W in the InsModiSet for SHF.
In addition, I also tried to merge the repos file generated from my cuda binary with the default repos file for the sm_75 architecture and then use the updated repository in the CuAsmParser. But that doesn't seem to help.

Could you give me a hint on how to proceed with this error? Is this a limitation of the project in its current state?

Additional information on my setup:

- arch: sm_75
- OS: Ubuntu 20.04
- nvcc version: 10.1

An error occurred when analyzing kernel function calls transitioning from cuasm to cubin using Nsight Compute

When I convert the kernel to sass assembly, I use this tool to convert cuasm to cubin, and call the cubin in the c++program. The direct execution result is correct, but an error occurs when using Nsight Compute to analyze:
==Error==LaunchFailed
==Error==LaunchFailed
==PROF==Trying to shutdown target application
==Error==The application returned an error code (9)
==Error==An error occurred while trying to profile
I encountered the same error on both the 2022 and 2024 versions of Nsight Compute, however, the cubin file generated directly using NVCC can be analyzed normally using Nsight Compute