dr-noob / peakperf
Achieve peak performance on x86 CPUs and NVIDIA GPUs
License: GNU General Public License v2.0
Hi,
I got the following error:
Unknown microarchitecture detected: M=0x00000007 EM=0x0000000B F=0x00000006 EF=0x00000000 S=0x00000001
The CPU is a 13th-gen Intel Core i7-13700K.
If you need anything else let me know.
Dennis
Microarch: Knights Landing
Benchmark: Zen (AVX2)
Hey Dr-noob, 2 years ago you posted something about hacking pacybits. I need help with it and I'm wondering if you could help me. Can I contact you somewhere?
I know your post is old, but it would be really appreciated if you could help.
[ERROR]: Found invalid uarch: 'Zen 3'
[ERROR]: peakperf is unable to automatically select the benchmark for your CPU. Please, select the benchmark manually (see peakperf -h) and/or post this error message in https://github.com/Dr-Noob/peakperf/issues
I've tried running FLOPS on Windows:
First, one has to change some int and long variables to stdint types (int32_t and int64_t). After that, I tried running it and the performance was horrible. Then, looking at the assembly, I figured out that the loop was compiled poorly; I noticed I was using a 32-bit compiler (mingw, which is 32 bits). Lastly, I tried compiling it with a 64-bit compiler (mingw-w64), using both the Windows build (MingW-W64-builds) and the Arch Linux one (mingw-w64-gcc-bin). Both gave me the same result: segmentation fault. I found somewhere that running fine in 32 bits but segfaulting in 64 bits could be caused by stack issues, which should be solved using -fno-stack-protector. This does not solve the segfault. I love you, Windows ❤️
Benchmarks that can run on the current CPU could be highlighted in green, while unsupported ones could be highlighted in red.
Hello!
I noticed the following during build:
./build.sh
...
-- The CXX compiler identification is GNU 13.2.1
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for a CUDA compiler
-- Looking for a CUDA compiler - /opt/cuda/bin/nvcc
-- The CUDA compiler identification is NVIDIA 12.2.140
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /opt/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- ----------------------
-- peakperf build report:
-- CPU mode: ON
-- GPU mode: ON
-- ----------------------
-- Configuring done (7.2s)
-- Generating done (0.0s)
-- Build files have been written to: /home/william/src/peakperf/build
[ 5%] Building CXX object CMakeFiles/cpu_device.dir/src/cpu/cpufetch/cpufetch.cpp.o
[ 5%] Building CXX object CMakeFiles/512_8.dir/src/cpu/arch/512_8.cpp.o
[ 11%] Building CXX object CMakeFiles/cpu_device.dir/src/cpu/cpufetch/cpuid.cpp.o
[ 11%] Building CXX object CMakeFiles/512_12.dir/src/cpu/arch/512_12.cpp.o
[ 14%] Building CUDA object CMakeFiles/gpu_device.dir/src/gpu/arch.cu.o
[ 20%] Building CXX object CMakeFiles/cpu_device.dir/src/cpu/cpufetch/uarch.cpp.o
[ 23%] Building CXX object CMakeFiles/256_5.dir/src/cpu/arch/256_5.cpp.o
[ 29%] Building CXX object CMakeFiles/cpu_device.dir/src/cpu/arch/arch.cpp.o
[ 29%] Building CXX object CMakeFiles/256_6.dir/src/cpu/arch/256_6.cpp.o
[ 35%] Building CXX object CMakeFiles/256_6_nofma.dir/src/cpu/arch/256_6_nofma.cpp.o
[ 35%] Building CUDA object CMakeFiles/gpu_device.dir/src/gpu/kernel.cu.o
[ 35%] Building CXX object CMakeFiles/cpu_device.dir/src/cpu/arch/arch_sse.cpp.o
[ 38%] Building CXX object CMakeFiles/128_6.dir/src/cpu/arch/128_6.cpp.o
[ 44%] Building CXX object CMakeFiles/cpu_device.dir/src/cpu/arch/arch_avx512.cpp.o
[ 47%] Building CXX object CMakeFiles/cpu_device.dir/src/cpu/arch/arch_avx.cpp.o
[ 47%] Building CXX object CMakeFiles/256_8.dir/src/cpu/arch/256_8.cpp.o
[ 52%] Building CXX object CMakeFiles/128_8.dir/src/cpu/arch/128_8.cpp.o
[ 52%] Building CXX object CMakeFiles/256_10.dir/src/cpu/arch/256_10.cpp.o
nvcc fatal : Unsupported gpu architecture 'compute_35'
make[2]: *** [CMakeFiles/gpu_device.dir/build.make:76: CMakeFiles/gpu_device.dir/src/gpu/arch.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
nvcc fatal : Unsupported gpu architecture 'compute_35'
make[2]: *** [CMakeFiles/gpu_device.dir/build.make:90: CMakeFiles/gpu_device.dir/src/gpu/kernel.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:401: CMakeFiles/gpu_device.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 55%] Linking CXX static library lib512_12.a
[ 58%] Linking CXX static library lib512_8.a
[ 58%] Built target 512_12
[ 58%] Built target 512_8
[ 61%] Linking CXX static library lib256_5.a
[ 64%] Linking CXX static library lib256_8.a
[ 67%] Linking CXX static library lib256_10.a
[ 70%] Linking CXX static library lib256_6.a
[ 73%] Linking CXX static library lib128_8.a
[ 76%] Linking CXX static library lib128_6.a
[ 76%] Built target 256_5
[ 79%] Built target 256_8
[ 79%] Built target 256_10
[ 79%] Linking CXX static library lib256_6_nofma.a
[ 79%] Built target 256_6
[ 79%] Built target 128_8
[ 82%] Linking CXX static library libcpu_device.a
[ 82%] Built target 128_6
[ 82%] Built target 256_6_nofma
[ 82%] Built target cpu_device
make: *** [Makefile:136: all] Error 2
The relevant part being nvcc fatal : Unsupported gpu architecture 'compute_35'.
I looked at the code briefly, but couldn't see anything obvious that would cause it to target compute_35.
The getGencode script is just a validation tool I cobbled together for another project: https://github.com/wallentx/alpha-report/blob/90ee2e7c006dcfd75dd76fe31ffc0a866179d819/get-gencode#L1-L33
If the compiler is old, it may not support newer -march flags, like znver2.
Run peakperf on the CPU and GPU at the same time:
device == DEVICE_TYPE_HYBRID
Nº  Time(s)   TFLOP/s (CPU + GPU)
1   2.50984   4.300 (500 + 3800)
2   2.50898   4.310 (500 + 3810)
Same as tensor cores, but with RT cores. Not sure whether these RT cores will provide more performance than the tensor cores, though.
The table in https://github.com/Dr-Noob/peakperf#62-gpu also needs to be updated with proper information.
Because even though a CPU may support an instruction set, it may not support instructions generated by certain -march flags.
When I run the benchmark with >8 threads, my average performance is consistently lower than the expected GFLOPS. Running peakperf with no arguments yields an expected peak of 2048 GFLOP/s but an average performance under 1200 (I have a decent undervolt enabled). Specifying 8 threads changes the printed value to 1024 GFLOP/s, and I'm seeing an average performance > 1100 (but less than the results for t = 16). I'm not sure if this is an issue with my CPU configuration or the test expectations.
Something that makes the user understand that computing is taking place.
For example, a progress bar, like:
|████████████████                | 50%
|████████████████████████████████| 100%
or a blinking cursor like this function does:
spin() {
  # Cycle through the four marks forever, redrawing in place with \r
  local -a marks=( '/' '-' '\' '|' )
  local i=0
  while true; do
    printf '%s\r' "${marks[i++ % ${#marks[@]}]}"
    sleep 0.1
  done
}
This would require adding an additional thread in main.
There are many uarchs (e.g., Kaby Lake) where the majority of CPUs support AVX but not all of them do (e.g., Celeron), yet peakperf currently assumes that they all support AVX.
Commit c763c17 separated files by uarch, instead of the old system of separating backends by latencies, number of ALUs, and width. I think I introduced that change because I thought I had to use specific -march flags for each architecture, but that is not needed. Going back to the old system would reduce the lines of code and the code complexity.
Detect CPU microarchitecture (using cpufetch code) and select the right benchmark automatically