
nanobench's Introduction

nanoBench

nanoBench is a Linux-based tool for running small microbenchmarks on recent Intel and AMD x86 CPUs. The microbenchmarks are evaluated using hardware performance counters. The reading of the performance counters is implemented in a way that incurs only minimal overhead.

There are two variants of the tool: a user-space implementation and a kernel module. The kernel module makes it possible to benchmark privileged instructions and to use uncore performance counters, and it can provide more accurate measurement results because it disables interrupts and preemption during measurements. The disadvantage of the kernel module compared to the user-space variant is that allowing arbitrary code to be executed in kernel space is quite risky. Therefore, the kernel module should not be used on a production system.

nanoBench is used to run the microbenchmarks that produce the latency, throughput, and port usage data available on uops.info.

More information about nanoBench can be found in the paper nanoBench: A Low-Overhead Tool for Running Microbenchmarks on x86 Systems.

Installation

User-space Version

sudo apt install msr-tools
git clone https://github.com/andreas-abel/nanoBench.git
cd nanoBench
make user

nanoBench might not work if Secure Boot is enabled. Click here for instructions on how to disable Secure Boot.

Kernel Module

Note: The following is not necessary if you would just like to use the user-space version.

sudo apt install python3 python3-pip
pip3 install plotly
git clone https://github.com/andreas-abel/nanoBench.git
cd nanoBench
make kernel

To load the kernel module, run:

sudo insmod kernel/nb.ko # this is necessary after every reboot

Usage Examples

The recommended way to use nanoBench is via the wrapper scripts nanoBench.sh (for the user-space variant) and kernel-nanoBench.sh (for the kernel module). The following examples work with both scripts. For the kernel module, we also provide a Python wrapper: kernelNanoBench.py.

For obtaining repeatable results, it can help to disable hyper-threading. This can be done with the disable-HT.sh script.

Example 1: The ADD Instruction

The following command will benchmark the assembler code sequence "ADD RAX, RBX; ADD RBX, RAX" on a Skylake-based system.

sudo ./nanoBench.sh -asm "ADD RAX, RBX; ADD RBX, RAX" -config configs/cfg_Skylake_common.txt

It will produce an output similar to the following.

CORE_CYCLES: 2.00
INST_RETIRED: 2.00
UOPS_ISSUED: 2.00
UOPS_EXECUTED: 2.00
UOPS_DISPATCHED_PORT.PORT_0: 0.49
UOPS_DISPATCHED_PORT.PORT_1: 0.50
UOPS_DISPATCHED_PORT.PORT_2: 0.00
UOPS_DISPATCHED_PORT.PORT_3: 0.00
UOPS_DISPATCHED_PORT.PORT_4: 0.00
UOPS_DISPATCHED_PORT.PORT_5: 0.50
UOPS_DISPATCHED_PORT.PORT_6: 0.51
UOPS_DISPATCHED_PORT.PORT_7: 0.00
...

The tool unrolls the assembler code, i.e., it creates multiple copies of it. The reported results are averages per copy of the assembler code, taken over multiple runs of the entire generated code sequence.

The config file contains the information required for configuring the programmable performance counters with the desired events. We provide example configuration files for recent Intel and AMD microarchitectures in the configs folder.

The assembler code sequence may use and modify any general-purpose or vector registers (unless the -loop_count or -no_mem options are used), including the stack pointer. There is no need to restore the registers to their original values at the end.

R14, RDI, RSI, RSP, and RBP are initialized with addresses in the middle of dedicated memory areas (of 1 MB each) that can be freely modified by the assembler code. When using the kernel module, the size of the memory area that R14 points to can be increased using the set-R14-size.sh script; more details on this can be found here.

All other registers initially have undefined values. They can, however, be initialized as shown in the following example.

Example 2: Load Latency

sudo ./nanoBench.sh -asm_init "MOV RAX, R14; SUB RAX, 8; MOV [RAX], RAX" -asm "MOV RAX, [RAX]" -config configs/cfg_Skylake_common.txt

The -asm_init code is executed once at the beginning. It first sets RAX to R14-8 (thus, RAX now contains a valid memory address), and then sets the memory at address RAX to its own address. Then, the -asm code is executed repeatedly. This code loads the value at the address in RAX into RAX. Thus, the execution time of this instruction corresponds to the L1 data cache latency.

We will get an output similar to the following.

CORE_CYCLES: 4.00
INST_RETIRED: 1.00
UOPS_ISSUED.ANY: 1.00
UOPS_EXECUTED.THREAD: 1.00
UOPS_DISPATCHED_PORT.PORT_0: 0.00
UOPS_DISPATCHED_PORT.PORT_1: 0.00
UOPS_DISPATCHED_PORT.PORT_2: 0.50
UOPS_DISPATCHED_PORT.PORT_3: 0.50
...
MEM_LOAD_RETIRED.L1_HIT: 1.00
MEM_LOAD_RETIRED.L1_MISS: 0.00
...

Generated Code

We will now take a look behind the scenes at the code that nanoBench generates for evaluating a microbenchmark.

int run(code, code_init, local_unroll_count):
    int measurements[n_measurements]

    for i=-warm_up_count to n_measurements
        save_regs
        code_init
        m1 = read_perf_ctrs // stores results in memory, does not modify registers
        code_late_init
        for j=0 to loop_count // this line is omitted if loop_count=0
            code // (copy #1)
            code // (copy #2)
             ⋮
            code // (copy #local_unroll_count)
        m2 = read_perf_ctrs
        restore_regs
        if i >= 0: // ignore warm-up runs
            measurements[i] = m2 - m1

    return agg(measurements) // apply selected aggregate function

run(...) is executed twice: the first time with local_unroll_count = unroll_count, and the second time with local_unroll_count = 2 * unroll_count. If the -basic_mode option is used, the first execution instead has no instructions between m1 = read_perf_ctrs and m2 = read_perf_ctrs, and the second uses local_unroll_count = unroll_count.

The result that is finally reported by nanoBench is the difference between these two executions divided by max(loop_count * unroll_count, unroll_count).
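
The normalization described above can be sketched in Python (a minimal illustration of the description, not nanoBench's actual implementation):

```python
def reported_result(meas_first, meas_second, loop_count, unroll_count):
    # First execution of run(...): local_unroll_count = unroll_count;
    # second execution: local_unroll_count = 2 * unroll_count.
    # Taking the difference cancels the constant measurement overhead,
    # leaving the cost of unroll_count additional copies of the code.
    divisor = max(loop_count * unroll_count, unroll_count)
    return (meas_second - meas_first) / divisor

# E.g., for the two-ADD benchmark from Example 1 with the default
# unroll_count=1000 and loop_count=0, hypothetical measurements of
# 2000 and 4000 core cycles would be reported as 2.00.
```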

Before the first execution of run(...), the performance counters are configured according to the event specifications in the -config file. If this file contains more events than there are programmable performance counters available, run(...) is executed multiple times with different performance counter configurations.

Command-line Options

Both nanoBench.sh and kernel-nanoBench.sh support the following command-line parameters. All parameters are optional. Parameter names may be abbreviated if the abbreviation is unique (e.g., -l may be used instead of -loop_count).

Option Description
-asm <code> Assembler code sequence (in Intel syntax [1]) containing the code to be benchmarked.
-asm_init <code> Assembler code sequence (in Intel syntax [1]) that is executed once at the beginning of every benchmark run.
-asm_late_init <code> Assembler code sequence (in Intel syntax [1]) that is executed once immediately before the code to be benchmarked.
-asm_one_time_init <code> Assembler code sequence (in Intel syntax [1]) that is executed once before the first benchmark run.
-code <filename> A binary file containing the code to be benchmarked as raw x86 machine code. This option cannot be used together with -asm.
-code_init <filename> A binary file containing code to be executed once in the beginning of every benchmark run. This option cannot be used together with -asm_init.
-code_late_init <filename> A binary file containing code to be executed once immediately before the code to be benchmarked. This option cannot be used together with -asm_late_init.
-code_one_time_init <filename> A binary file containing code to be executed once before the first benchmark run. This option cannot be used together with -asm_one_time_init.
-config <file> File with performance counter event specifications. Details are described below.
-fixed_counters Reads the fixed-function performance counters.
-n_measurements <n> Number of times the measurements are repeated. [Default: n=10]
-unroll_count <n> Number of copies of the benchmark code inside the inner loop. [Default: n=1000]
-loop_count <n> Number of iterations of the inner loop. If n>0, the code to be benchmarked must not modify R15, as this register contains the loop counter. If n=0, the instructions for the loop are omitted; the loop body is then executed once. [Default: n=0]
-warm_up_count <n> Number of runs of the generated benchmark code sequence (in each invocation of run(...)) before the first measurement result is recorded. This can, for example, be useful for excluding outliers due to cold caches. [Default: n=5]
-initial_warm_up_count <n> Number of runs of the benchmark code sequence before the first invocation of run(...). This can be useful for benchmarking instructions that require a warm-up period before they can execute at full speed, like AVX2 instructions on some microarchitectures. [Default: n=0]
-alignment_offset <n> By default, the code to be benchmarked is aligned to 64 bytes. This parameter can be used to specify an additional offset. [Default: n=0]
-avg Selects the arithmetic mean (excluding the top and bottom 20% of the values) as the aggregate function. [This is the default]
-median Selects the median as the aggregate function.
-min Selects the minimum as the aggregate function.
-max Selects the maximum as the aggregate function.
-range Outputs the range of the measured values (i.e., the minimum and the maximum).
-basic_mode The effect of this option is described in the Generated Code section.
-no_mem If this option is enabled, the code for read_perf_ctrs does not make any memory accesses and stores all performance counter values in registers. This can, for example, be useful for benchmarks that require that the state of the data caches does not change after the execution of code_init. If this option is used, the code to be benchmarked must not modify registers R8-R11 (Intel) and R8-R13 (AMD). Furthermore, read_perf_ctrs will modify RAX, RCX, and RDX.
-no_normalization If this option is enabled, the measurement results are not divided by the number of repetitions.
-remove_empty_events If this option is enabled, the output does not contain events that did not occur.
-df If this option is enabled, the front-end buffers are drained after code_init, after code_late_init, and after the last instance of code by executing an lfence, followed by a long sequence of 1-byte NOP instructions, followed by a long sequence of 15-byte NOP instructions.
-cpu <n> Pins the measurement thread to CPU n. [Default: Pin the thread to the CPU it is currently running on.]
-verbose Outputs the results of all performance counter readings. In the user-space version, the results are printed to stdout. The output of the kernel module can be accessed using dmesg.

[1] As an extension, the tool also supports statements of the form |n (with 1≤n≤15) that are translated to n-byte NOPs, and statements of the form n*|x| that unroll x n times (nesting is not supported).
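
As an illustration of the n*|x| shorthand, the following hypothetical helper shows the expansion it describes (this is not nanoBench's actual preprocessor):

```python
import re

def expand_unroll(code):
    # Replaces each statement of the form n*|x| with n copies of x,
    # joined by semicolons. Nesting is not supported, matching the
    # behavior described in the footnote above.
    return re.sub(r'(\d+)\*\|([^|]*)\|',
                  lambda m: '; '.join([m.group(2)] * int(m.group(1))),
                  code)

# expand_unroll("3*|ADD RAX, RBX|")
# -> "ADD RAX, RBX; ADD RAX, RBX; ADD RAX, RBX"
```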

The following parameters are only supported by nanoBench.sh.

Option Description
-usr <n> If n=1, performance events are counted when the processor is operating at a privilege level greater than 0. [Default: n=1]
-os <n> If n=1, performance events are counted when the processor is operating at privilege level 0. [Default: n=0]
-debug Enables the debug mode (see below).

The following parameter is only supported by kernel-nanoBench.sh.

Option Description
-msr_config <file> File with performance counter event specifications for counters that can only be read with the RDMSR instruction, such as uncore counters. Details are described below.

Cycle-by-Cycle Measurements

The cycleByCycle.py script provides the option to perform cycle-by-cycle measurements on recent Intel CPUs. This is achieved by enabling the Freeze_Perfmon_On_PMI feature, by setting the value of the core cycles counter to N cycles below overflow, and by repeating the measurements multiple times with different values for N. This approach is based on Brandon Falk's Sushi Roll technique.
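
The post-processing behind this technique can be sketched as follows (a simplified illustration using hypothetical counter snapshots, not the script's actual code):

```python
def per_cycle_increments(frozen_values):
    # frozen_values[N] is the (hypothetical) value of an event counter
    # at the point where the core-cycles counter overflowed, i.e., N
    # cycles into the benchmark. Differencing consecutive snapshots
    # yields the number of events that occurred in each individual cycle.
    return [b - a for a, b in zip(frozen_values, frozen_values[1:])]

# E.g., UOPS_ISSUED snapshots [0, 4, 8, 8, 8] -> [4, 4, 0, 0]:
# 4 uops issued in each of the first two cycles, none afterwards.
```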

As an example, the script can be used as follows.

sudo ./cycleByCycle.py -asm "MOVQ XMM0, RAX; MOVQ RAX, XMM0" -config configs/cfg_Skylake_common.txt -unroll 10

cycleByCycle.py supports mostly the same options as kernel-nanoBench.sh, with the following exceptions. The -fixed_counters and -msr_config options are not available. The -basic_mode, -df, and -no_normalization options are used by default. The default for the -unroll_count parameter is 1, and the default aggregate function is the median.

cycleByCycle.py supports the following additional parameters.

Option Description
-html <filename> Generates an HTML file with a graphical representation of the measurement data. The filename is optional. [Default: graph.html]
-csv <filename> Generates a CSV file that contains the measurement data. The filename is optional. [Default: stdout]
-end_to_end By default, cycleByCycle.py tries to remove the overhead that comes from the instructions that enable/disable the performance counters, and from the instructions that drain the front end before/after the code of the benchmark is executed. However, this does not always work properly. In such cases, the -end_to_end option can be used; with this option, the output includes all of the overhead.

Performance Counter Config Files

We provide performance counter configuration files (for counters that can be read with the RDPMC instruction) for most recent Intel and AMD CPUs in the configs folder. These files can be adapted/reduced to the events you are interested in.

The format of the entries in the configuration files is

EvtSel.UMASK(.CMSK=...)(.AnyT)(.EDG)(.INV)(.TakenAlone)(.CTR=...)(.MSR_3F6H=...)(.MSR_PF=...)(.MSR_RSP0=...)(.MSR_RSP1=...) Name

You can find details on the meanings of the different parts of the entries in chapter 18 of Intel's System Programming Guide and at https://download.01.org/perfmon/readme.txt.
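
For illustration, a minimal configuration file could contain entries like the following (these use Intel's architectural events; event/umask values for non-architectural events differ between microarchitectures):

```text
3C.00 CPU_CLK_UNHALTED.THREAD_P
C0.00 INST_RETIRED.ANY_P
2E.4F LONGEST_LAT_CACHE.REFERENCE
2E.41 LONGEST_LAT_CACHE.MISS
```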

MSR Performance Counter Config Files

Some performance counters, such as the uncore counters or the RAPL counters on Intel CPUs, cannot be read with the RDPMC instruction, but only with the RDMSR instruction. The entries in the corresponding configuration files have the following format:

msr_...=...(.msr_...=...)* msr_... Name

For example, the line

msr_0xE01=0x20000000.msr_700=0x408F34 msr_706 LLC_LOOKUP_CBO_0

can be used to count the number of last-level cache lookups in C-Box 0 on a Skylake system. Details on this can be found in Intel's uncore performance monitoring reference manuals, e.g., here.

Pausing Performance Counting

If the -no_mem option is used, nanoBench provides a feature to temporarily pause performance counting (however, this feature is not available for cycle-by-cycle measurements). This is enabled by including the magic byte sequences 0xF0B513B1C2813F04 (for stopping the counters), and 0xE0B513B1C2813F04 (for restarting them) in the code of the microbenchmark.
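
The following hypothetical Python helper sketches one way to embed the markers in a raw code file for the -code option; it assumes that nanoBench matches the sequences as little-endian 64-bit words in the instruction stream (as an assembler `.quad` directive would emit them):

```python
import struct

PAUSE_MARKER = 0xF0B513B1C2813F04   # stops the performance counters
RESUME_MARKER = 0xE0B513B1C2813F04  # restarts them

def build_code_file(filename, before, paused, after):
    # Hypothetical helper: writes a raw machine-code file in which
    # performance counting is paused around the `paused` bytes.
    with open(filename, "wb") as f:
        f.write(before)
        f.write(struct.pack("<Q", PAUSE_MARKER))
        f.write(paused)
        f.write(struct.pack("<Q", RESUME_MARKER))
        f.write(after)

# Example: count only the first ADD; the second ADD executes while the
# counters are paused (machine code 48 01 D8 = ADD RAX, RBX).
build_code_file("bench.bin", b"\x48\x01\xd8", b"\x48\x01\xd8", b"")
```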

Using this feature incurs a certain timing overhead that is included in the measurement results. It is therefore mainly useful for microbenchmarks that do not measure time but, e.g., cache hits or misses, such as the microbenchmarks generated by the tools in tools/CacheAnalyzer.

Debug Mode

If the debug mode is enabled, the generated code contains a breakpoint right before the line m2 = read_perf_ctrs, and nanoBench is run using gdb. This makes it possible to analyze the effect of the code to be benchmarked on registers and on the memory. The command info all-registers can, for example, be used to display the current values of all registers.

Supported Platforms

nanoBench should work with all Intel processors supporting architectural performance monitoring version ≥ 2, as well as with AMD Family 17h processors. Cycle-by-cycle measurements are only available on Intel CPUs with at least four programmable performance counters.

The code was developed and tested using Ubuntu 18.04 and 20.04.

nanobench's People

Contributors

0xhilbert, andreas-abel, anthonywharton, bjzhjing, caizixian, d-we, eigenform, oleksiioleksenko, tfc, wolfpld


nanobench's Issues

pinsrw latency overestimated(?) because dep chain competes for the same port

https://uops.info/html-lat/SKX/PINSRW_XMM_R32_I8-Measurements.html#lat1-%3E1 experiments only use pinsrw xmm, r32, imm alone, or pinsrw with an XMM->XMM dep chain created by shufpd or pshufd.

But pinsrw itself is 2 uops for port 5 on Intel. Presumably a movd-equivalent uop to feed a 2-input shuffle. One would expect that the GP->XMM (movd) uop could run early if there was a free port, leaving the critical path latency from 1->1 being only 1 cycle.

But resource conflicts with the dep chain prevent this from being demonstrated. Perhaps pand xmm0,xmm0 would be a better choice for at least one of the experiments, or orps xmm0, xmm0. (I guess shufpd and pshufd are looking for bypass latency between integer and FP shuffles?)

Complex instruction throughput often underestimated in Haswell

If you look at instructions with 3 fused-domain uops (no memory operands) on Haswell, many have 1.0/2.0 for expected/measured throughput. Most of these are 1.0/1.0 on Skylake.

Did the instruction throughput actually improve so much in Skylake for these instructions? I don't think so!

The effect comes from a combination of factors. One is that nearly all of these tests run out of the MITE (legacy) decoder (as reported by the perf counters). This is mostly because the uops are "dense" enough that they exceed the uop cache's limit of 18 uops per 32-byte region.

Then, decoding limitations on Haswell kick in. Haswell can't decode in a 3-1 pattern (but Skylake can), so the tests that interleave dependency breaking instructions with the payload instruction, like:

  0:	48 31 c0             	xor    rax,rax
  3:	41 f7 e0             	mul    r8d

end up taking 2 cycles to decode the two instructions. That's why throughputs of 2.0 appear all over the place in the Haswell results. Most of the other test variants can't crack 2.0 cycles because of dependency chains.

One approach to getting close to the true throughput would be to avoid falling out of the uop cache, e.g., by using an occasional large NOP to space out the instructions. For cases where you have unrolled 4 uops with 4 dependency-breaking instructions, you could group the dependency-breaking (1-uop) and payload (complex) instructions for better decoding. E.g.:

xor eax, eax
xor ebx, ebx
xor ecx, ecx
xor edx, edx
xadd eax, eax
xadd ebx, ebx
xadd ecx, ecx
xadd edx, edx

will decode more efficiently (5 cycles) than with full interleaving (8 cycles).

Definition of latency

What is the definition of latency that you want to use exactly?

In particular, consider a hypothetical operation foo arg1, arg2, arg3 which is 3p0. This op will have a throughput of 3 due to p0 pressure. Can this op have any latency less than 3? I think yes.

For example, the op might only have a 1-cycle delay from arg2->arg1, because the first two uops use only arg3, and only the last uop uses arg2 and arg3.

However, testing back-to-back foo ops will never show it because of the throughput limit. I think you are probably well aware of this, since I notice lots of filler uops in tests, like:

   0:	c4 42 38 f2 ca       	andn   r9d,r8d,r10d
   5:	4d 63 c1             	movsxd r8,r9d
   8:	4d 63 c8             	movsxd r9,r8d
   b:	4d 63 c1             	movsxd r8,r9d

All the movsxd instructions provide enough breathing room to avoid lots of problems of this type.

However, consider gathers. For 1->1 latency testing this is used:

vpgatherdd ymm0,DWORD PTR [r14+ymm14*1],ymm1

No breathing room, so all these results just end up reporting the throughput number (5 in this case).

The following test:

vpgatherdd ymm0,DWORD PTR [r14+ymm14*1],ymm1
vpor ymm0,ymm0,ymm0
vpor ymm0,ymm0,ymm0
vpor ymm0,ymm0,ymm0
vpor ymm0,ymm0,ymm0

also runs in 5 cycles, so we see the true 1->1 latency is 1 cycle.

vpternlogd latencies on Zen4

On Zen 4, summary of vpternlogd latency experiments is given as

Latency operand 1 → 1: 1
Latency operand 2 → 1: 2
Latency operand 3 → 1: 1

https://uops.info/html-lat/ZEN4/VPTERNLOGD_ZMM_ZMM_ZMM_I8-Measurements.html

but I don't see a substantial difference in 3 → 1 vs. 2 → 1 experiments, or a difference w.r.t its vpternlogq sibling, where all latencies are listed as 1. Shouldn't both dword and qword variants be listed with latency 2 for operands 2 and 3? What am I missing?

If I'm reading Agner's testing harness right, his latency experiment times

vpternlogd zmm0, zmm1, zmm2
vpternlogd zmm2, zmm1, zmm0

repeated 50 times. He lists latency of ternlog on Zen 4 as 1 cycle in all cases (but if latency from second operand is indeed 2, his experiment wouldn't uncover that).

(unfortunately I do not have access to a Zen 4 machine to run more experiments)

Read MSR_PKG_ENERGY_STATUS fails

Hi, I am trying to read MSR_PKG_ENERGY_STATUS using the user-space variant of nanoBench (as declared in configs/msr_RAPL.txt).

However, it fails with the error 'invalid configuration: msr_611'.

I inspected the code and I suspect that it comes from the following line:

char* evt_num = strsep(&tok, ".");

where it doesn't find any '.' in the configuration line.

Was an architectural decision made to support only MSR_3F6H, MSR_PF, MSR_RSP0, and MSR_RSP1 in user space? Is MSR_PKG_ENERGY_STATUS supported in kernel space?

Thanks a lot!

CacheAnalyzer process killed, kernel module issues

Hi, I'm trying to use the CacheAnalyzer tool. However, the process is getting killed due to errors in the kernel module, and the PC usually slowly dies and needs a restart. Here is a segment of the dmesg output after running 'sudo ./cacheSeq.py -level 2 -sets 10-14,20,35 -seq "A B C D A? C! B?"':

[  122.924677] nb: module verification failed: signature and/or required key missing - tainting kernel
[  122.925359] Initializing nanoBench kernel module...
[  123.037080] Vendor ID: GenuineIntel
[  123.037089] Brand: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
[  123.037092] DisplayFamily_DisplayModel: 06_9EH
[  123.037095] Stepping ID: 9
[  123.037097] Performance monitoring version: 4
[  123.037099] Number of fixed-function performance counters: 3
[  123.037101] Number of general-purpose performance counters: 4
[  123.037102] Bit widths of fixed-function performance counters: 48
[  123.037104] Bit widths of general-purpose performance counters: 48
[  133.965640] No physically contiguous memory area of the requested size found.
[  133.965644] Try rebooting your computer.
[  246.783643] msr_str: 0xe01
[  246.783646] msr_str: 0x700
[  246.783648] msr_str: 0xe01
[  246.783649] msr_str: 0x710
[  246.783650] msr_str: 0xe01
[  246.783651] msr_str: 0x720
[  246.783652] msr_str: 0xe01
[  246.783653] msr_str: 0x730
[  246.941670] BUG: unable to handle page fault for address: ffffafa028927e71
[  246.941674] #PF: supervisor instruction fetch in kernel mode
[  246.941676] #PF: error_code(0x0010) - not-present page
[  246.941677] PGD 100000067 P4D 100000067 PUD 0 
[  246.941680] Oops: 0010 [#1] SMP PTI
[  246.941682] CPU: 4 PID: 2321 Comm: python3 Tainted: G           OE     5.15.0-53-generic #59~20.04.1-Ubuntu
[  246.941685] Hardware name: Dell Inc. OptiPlex 7050/0NW6H5, BIOS 1.8.3 03/23/2018
[  246.941686] RIP: 0010:0xffffafa028927e71
[  246.941689] Code: Unable to access opcode bytes at RIP 0xffffafa028927e47.

I've attempted this with an Intel i7-9750H, i9-12900K, and now an i7-7700. Using the i7-7700, I'm testing on a fresh install of Ubuntu 20.04, kernel version 5.15. The set-R14-size.sh script almost always fails (even after a reboot) when using 'sudo ./set-R14-size.sh 1G'. However, if I request more memory, the allocation sometimes succeeds. Before the dmesg output above, I tried 1G, then around 1200M. This seems a bit strange; could it be the issue? Or is there anything else that I'm obviously missing? Here is an example command sequence that I'm using after boot:

cd nanoBench
make kernel
sudo insmod kernel/nb.ko
sudo ./set-R14-size.sh 1200M
cd tools/CacheAnalyzer
sudo ./cacheSeq.py -level 2 -sets 10-14,20,35 -seq "A B C D A? C! B?"

CPU 0 cannot read MSR 0x00000396

When I run nanoBench/tools/CacheAnalyzer/cacheInfo.py, the error 'rdmsr: CPU 0 cannot read MSR 0x00000396' appears. How can I solve it? Thanks

Consider submitting to the Linux kernel

Please consider submitting the module to the Linux kernel.

It can be a lot of work to clean up initially and can make iterating on it harder later on, but on the other hand it won't be broken due to changes in the internal API, it will be seen/used by more people and other people will likely maintain it if you decide not to at some point.

Thank you!

Missing latency entry for gathers

You measure many latency stats for gathers which is awesome (and a very important formalization of the way we think about latency), but I think you are missing the most important one.

That is, the 2 -> 1 (address) latency, but through the vector index register, not the base register. That's probably the most common latency chain you'll have in practice because it generalizes the notion of pointer chasing. That is, a loop like:

vpgatherdd ymm0,DWORD PTR [r14+ymm14*1],ymm1
vpor ymm14,ymm0,ymm0

On my SKL machine I measure the same latency (22) for this: same as for the 3->1 latency.

Can't build on kernel 4.4

The kernel module fails to build on Ubuntu 16.04 with kernel 4.4:

cd kernel; make
make[1]: Entering directory '/home/travis/dev/nanoBench/kernel'
make -C /lib/modules/4.4.0-170-generic/build M=/home/travis/dev/nanoBench/kernel modules
make[2]: Entering directory '/usr/src/linux-headers-4.4.0-170-generic'
  CC [M]  /home/travis/dev/nanoBench/kernel/nb_km.o
/home/travis/dev/nanoBench/kernel/nb_km.c:18:10: fatal error: linux/set_memory.h: No such file or directory
 #include <linux/set_memory.h>
          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.

I think the set_memory functions were in asm/cacheflush.h on earlier kernels (I don't know exactly when the cutover happened).

Maybe there is a way to use the correct header, e.g., depending on which header is available?

Getting Started - process getting killed

I am trying to setup nanoBench and use the kernel space implementation on a Haswell machine.

I built the code for the driver and installed it. However, when I run the example the process gets killed.

sudo ./kernel-nanoBench.sh -asm "ADD RAX, RBX; add RBX, RAX" -config configs/cfg_Haswell_common.txt
./kernel-nanoBench.sh: line 122: 21457 Killed    $taskset cat /proc/nanoBench

Integration with any source code

Hi

I was wondering if there is any pipeline support for handling high-level source code (preferably with hooks, like IACA's, that can be inserted in the source) to get the stats of a region and/or functions.

If yes any help with the same will be really helpful!

Thanks!

Performance counters are not correctly measured in AMD ZEN series

Hi.

I have tried to measure the performance counters related to the decoder (i.e., uops dispatched from the legacy x86 decoder <DeDisUopsFromDecoder.DecoderDispatched> or from the micro-op cache <DeDisUopsFromDecoder.OpCacheDispatched>).
I have tested with a simple code snippet consisting of 8 multi-byte NOPs (each multi-byte NOP is 4 bytes) without unrolling. I thought this code snippet would result in a series of micro-op cache hits; however, the results show all uops are dispatched from the legacy x86 decoder, not the micro-op cache.

command

sudo ./kernel-nanoBench.sh -basic_mode -unroll_count 1 -loop_count 100000 -cpu 1 -asm "nop ax; nop ax; nop ax; nop ax; nop ax; nop ax; nop ax; nop ax" -config configs/cfg_Zen_all.txt | grep -i "dedisuops"

results (I slightly modified the source code to dump the absolute measured counter values)

DeDisUopsFromDecoder.DecoderDispatched: 10.00 (1000019)
DeDisUopsFromDecoder.OpCacheDispatched: 0.00 (0)

I cannot understand why every instruction is decoded by the legacy x86 decoder.

I also checked with a simple test program consisting of the same code pattern (see below).
test.s build command: <nasm -f elf64 test.s -o test.o; ld test.o -o test>

global _start

_start:
        mov rdi, 100000
        call test_uop_cache_hit
    mov rax, 60
    mov rdi, 0
    syscall

test_uop_cache_hit:
    nop ax
    nop ax
    nop ax
    nop ax
    nop ax
    nop ax
    nop ax
    nop ax

    dec rdi
    jnz test_uop_cache_hit
    ret

Then, I checked the performance counters with the perf tool.

$perf stat -e cycles,instructions,r01AA,r02AA,r03AA ./test

 Performance counter stats for './test':

            298349      cycles                                                      
           1037949      instructions              #    3.48  insn per cycle                                            
             86233      r01AA                                                       
            999280      r02AA                                                       
           1085721      r03AA                                                       

       0.000433346 seconds time elapsed

The results show that most uops are delivered by the micro-op cache (r01AA => dispatched by the legacy x86 decoder // r02AA => dispatched from the micro-op cache // r03AA => all uops).

Why do nanoBench and perf show different results?

Sincerely.
Joonsung Kim.

mov r32,same on Alder Lake, Zen

This is a report regarding the uops.info table, specifically latency figures for in-place zero extension.

There are separate experiments for mov r32, <other> (latency 0) and mov r32, <same> (latency 1) on Intel CPUs starting from Ivy Bridge, but excluding Alder Lake. It appears that on Alder Lake the behavior is unchanged: in-place zero extension is not move-eliminated.

https://uops.info/table.html?search=mov_%20(r32%2C%20r32)&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_HSW=on&cb_ADLP=on&cb_ZEN2=on&cb_measurements=on&cb_doc=on&cb_base=on

https://uops.info/html-instr/MOV_8B_R32_R32.html
https://uops.info/html-instr/MOV_89_R32_R32.html

My experiments indicate that AMD Zen 2 successfully eliminates in-place zero extension; for example, the following runs at one cycle per iteration:

.loop:
        mov     eax, eax
        inc     rax
        dec     ecx
        jnz     .loop

Many thanks for making and maintaining this compendium.

Gather and uops/port stats

I'm not sure if this is the place to file this issue: I didn't find a GitHub page for uops.info specifically, but if there's a better place, let me know.

There is something weird with port reporting for gather ops.

Consider VPGATHERDD, for example. It is reported as 1*p0+3*p23+1*p5, but this page and other pages clearly show it sends 8 uops total to p23.

Also, this:

With blocking instructions for ports {2, 3}:

    Code:

       0:	c4 c1 7a 6f 56 40    	vmovdqu xmm2,XMMWORD PTR [r14+0x40]
       6:	c4 c1 7a 6f 5e 40    	vmovdqu xmm3,XMMWORD PTR [r14+0x40]
       c:	c4 c1 7a 6f 66 40    	vmovdqu xmm4,XMMWORD PTR [r14+0x40]
      12:	c4 c1 7a 6f 6e 40    	vmovdqu xmm5,XMMWORD PTR [r14+0x40]
      18:	c4 c1 7a 6f 76 40    	vmovdqu xmm6,XMMWORD PTR [r14+0x40]
      1e:	c4 c1 7a 6f 7e 40    	vmovdqu xmm7,XMMWORD PTR [r14+0x40]
      24:	c4 41 7a 6f 46 40    	vmovdqu xmm8,XMMWORD PTR [r14+0x40]
      2a:	c4 41 7a 6f 4e 40    	vmovdqu xmm9,XMMWORD PTR [r14+0x40]
      30:	c4 41 7a 6f 56 40    	vmovdqu xmm10,XMMWORD PTR [r14+0x40]
      36:	c4 41 7a 6f 5e 40    	vmovdqu xmm11,XMMWORD PTR [r14+0x40]
      3c:	c4 c1 7a 6f 56 40    	vmovdqu xmm2,XMMWORD PTR [r14+0x40]
      42:	c4 c1 7a 6f 5e 40    	vmovdqu xmm3,XMMWORD PTR [r14+0x40]
      48:	c4 c1 7a 6f 66 40    	vmovdqu xmm4,XMMWORD PTR [r14+0x40]
      4e:	c4 c1 7a 6f 6e 40    	vmovdqu xmm5,XMMWORD PTR [r14+0x40]
      54:	c4 c1 7a 6f 76 40    	vmovdqu xmm6,XMMWORD PTR [r14+0x40]
      5a:	c4 c1 7a 6f 7e 40    	vmovdqu xmm7,XMMWORD PTR [r14+0x40]
      60:	c4 41 7a 6f 46 40    	vmovdqu xmm8,XMMWORD PTR [r14+0x40]
      66:	c4 41 7a 6f 4e 40    	vmovdqu xmm9,XMMWORD PTR [r14+0x40]
      6c:	c4 41 7a 6f 56 40    	vmovdqu xmm10,XMMWORD PTR [r14+0x40]
      72:	c4 41 7a 6f 5e 40    	vmovdqu xmm11,XMMWORD PTR [r14+0x40]
      78:	c4 82 75 90 04 36    	vpgatherdd ymm0,DWORD PTR [r14+ymm14*1],ymm1

    Init:

    VZEROALL;
    VPGATHERDD YMM0, [R14+YMM14], YMM1;
    VXORPS YMM14, YMM14, YMM14;
    VPGATHERDD YMM1, [R14+YMM14], YMM0

    warm_up_count: 100
    Show nanoBench command
    Results:
        Instructions retired: 21.00
        Core cycles: 15.00
        Reference cycles: 13.68
        UOPS_PORT2: 14.00
        UOPS_PORT3: 14.00

⇨ 3 μops that can only use ports {2, 3}

I don't understand the conclusion. I think the idea is that you have 20 instructions which send 1 uop each to p23, which would nominally execute in 10 cycles (20/2), and then you see how much adding the instruction under test increases the runtime, assuming the bottleneck is port pressure. Here, you get to 15 cycles, a difference of 5 cycles. How does that equal 3 uops?

Thanks again for uops.info, it is great :).

{rd,wr}{fs,gs}base

Could you add timings for rdfsbase, rdgsbase, wrfsbase, and wrgsbase to the uops.info tables?

I've seen someone use these instructions to repurpose FS and GS as additional address registers in complicated numerical code, so they might be relevant for performance.
