intel / x86-simd-sort
C++ template library for high performance SIMD based sorting algorithms
License: BSD 3-Clause "New" or "Revised" License
When I try to compile the benchmark executable, I get the following error from running make meson:
...
FAILED: benchexe
c++ -o benchexe 'benchexe@exe/benchmarks_main.cpp.o' -Wl,--as-needed -Wl,--no-undefined -Wl,--whole-archive -Wl,--start-group benchmarks/libbench_qsort.a utils/libcpuinfo.a -Wl,--no-whole-archive /usr/local/lib/libbenchmark.a -Wl,--end-group
/usr/bin/ld: /usr/local/lib/libbenchmark.a(benchmark_runner.cc.o): in function `benchmark::internal::BenchmarkRunner::DoNIterations()':
benchmark_runner.cc:(.text+0x1418): undefined reference to `pthread_create'
collect2: error: ld returned 1 exit status
...
This is with v1.8.0 of Google Benchmark and v1.13.0 of Google Test. My system is running Ubuntu 20.04, and I have tried both g++-10 and g++-9. Further, the Makefile spits out many errors because my version of GCC can't compile the fp16 files.
I eventually pinned the meson issue down to the fact that my system is running glibc 2.31, where the linker still needs to be explicitly given -pthread or -lpthread to use threading. This flag is present by default for gtest and gtest_main, but Google Benchmark's pkg-config will only expose it when you indicate that it's being linked statically.
The default installation method for both Google Benchmark and Google Test produces a static library, so we should probably be explicitly configuring it as such in the build scripts. As for the Makefile, we could adapt this solution from Stack Overflow to detect when GCC is not new enough for the -march=sapphirerapids flag and exclude the fp16 files.
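A minimal meson sketch of that configuration (target and file names here are illustrative, not necessarily the project's actual ones): declaring the threads dependency makes meson pass -pthread at link time, which the static libbenchmark.a needs on glibc 2.31.

```meson
# Illustrative meson.build fragment; names are assumptions, not the
# project's actual targets.
thread_dep = dependency('threads')
benchmark_dep = dependency('benchmark', static : true)

benchexe = executable('benchexe',
    'benchmarks/benchmarks_main.cpp',
    dependencies : [benchmark_dep, thread_dep])
```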
It seems the partial_qsort test occasionally fails in the smallrange test case; the failure never seems to occur outside of it. Attached is a patch file that replicates the failure reliably.
Since we are not allowed to modify the array, it's hard to handle NaNs correctly in the avx512_argsort routines for float and double.
On AVX512 builds there are multiple definitions of some functions, e.g. avx512_qsort; see numpy/numpy#25274.
The reason is that the explicit specialization of the function template is not declared inline. See cppreference:
Whether an explicit specialization of a function [...] template is inline/constexpr(since C++11) [...] is determined by the explicit specialization itself, regardless of whether the primary template is declared with that specifier
And e.g. template <> void avx512_qsort(_Float16 *arr, arrsize_t arrsize, bool hasnan) has no inline and hence can be defined by multiple translation units, see the source:
x86-simd-sort/src/avx512fp16-16bit-qsort.hpp, line 180 (commit d9c9737)
The issue seems to apply to pretty much every explicit function specialization, so it looks like a major fixup is required: searching for "template <>" and marking all those functions defined in headers inline.
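For illustration, a minimal header-style sketch of the rule (function names here are made up, not the project's): the primary template may appear in many translation units, but an explicit specialization must carry inline itself.

```cpp
#include <string>

// Primary template: usable from many translation units without issue.
template <typename T>
std::string which_impl(T *) { return "generic"; }

// An explicit specialization is NOT implicitly inline. Without the `inline`
// keyword below, defining this in a header included by two .cpp files
// produces multiple-definition link errors, exactly as with avx512_qsort.
template <>
inline std::string which_impl<float>(float *) { return "float specialization"; }
```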
Add CI tests to run for every PR.
Use AVX512FP16 ISA.
Hi,
Would there be any blocker to implementing an option to do partial sorting (sometimes called 'topk')?
https://en.cppreference.com/w/cpp/algorithm/partial_sort
I know it's usually done with heap sort and this project is more into quicksort, but still.
Kind regards,
WT
ref:
https://arrayfire.org/docs/group__stat__func__topk.htm#gsc.tab=0
https://pytorch.org/docs/stable/generated/torch.topk.html
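For reference, the requested semantics can be sketched with std::partial_sort from the standard library (the helper name topk_smallest is made up for this example):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// "topk" for the smallest k elements: only the first k slots end up sorted;
// the remainder of the array is left in unspecified order.
std::vector<int> topk_smallest(std::vector<int> v, std::size_t k) {
    std::partial_sort(v.begin(), v.begin() + k, v.end());
    v.resize(k);  // keep just the sorted prefix
    return v;
}
```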
We attempted to merge avx512-sort into oceanbase; compared with std::sort, avx512-sort has an obvious performance improvement. But there is a problem when merging the new x86_simd_sort into oceanbase. We found that you added some GCC optimization pragmas in the latest version of x86-simd-sort, such as "#pragma GCC unroll 8". The default GCC version for oceanbase is gcc 5.2.0, and it seems that this gcc doesn't support these pragmas. The error is like this:
Please help fix this.
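One possible workaround (the macro name below is hypothetical, not from x86-simd-sort) is to gate the pragma on the compiler version, since #pragma GCC unroll was only added in GCC 8:

```cpp
// Hypothetical guard macro: `#pragma GCC unroll` appeared in GCC 8, so
// expand to nothing on older compilers like gcc 5.2.0.
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 8
#define XSS_UNROLL8 _Pragma("GCC unroll 8")
#else
#define XSS_UNROLL8
#endif

// The pragma (when active) only hints unrolling; results are identical.
int sum_first8() {
    int sum = 0;
    XSS_UNROLL8
    for (int i = 0; i < 8; ++i)
        sum += i;
    return sum;
}
```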
Hi,
If I understand correctly, this is a software C++ implementation of quicksort using hardware AVX vector instructions.
A lot of CPUs now have instructions for cryptography, hashing, etc., and Intel's do.
Is there a roadmap at Intel, even for the next 10 years, to have a basic quicksort performed in hardware the same way a hash is computed? I know the hardware implementation is complex.
We need the hardware to move closer to the service level that standard software libraries provide, and a hardware sort would be a big game changer (I think of data processing, machine learning, binning, quantization, etc.).
Any plans to implement on AVX/AVX2 as well, in order to improve availability?
This lib looks awesome!
But I have a requirement to output the original indices of the sorted data. For example, if the input array is {300, 200, 400, 99, 150, 50, 230, 250}, then the sorted result should be {50, 99, 150, 200, 230, 250, 300, 400}, and I want to get the original indices as output: {5, 3, 4, 1, 6, 7, 0, 2}.
Currently I use std::sort with a lambda to implement this requirement, but the performance is not good.
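For context, the pattern the poster describes can be sketched with std::sort over an index vector (the helper name argsort_scalar is made up; the library's own entry point for this is avx512_argsort):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Scalar argsort: returns the permutation of indices that sorts `data`.
std::vector<std::size_t> argsort_scalar(const std::vector<int> &data) {
    std::vector<std::size_t> idx(data.size());
    std::iota(idx.begin(), idx.end(), 0);  // fill with 0, 1, ..., n-1
    // Sort the indices by comparing the values they point to.
    std::sort(idx.begin(), idx.end(),
              [&](std::size_t a, std::size_t b) { return data[a] < data[b]; });
    return idx;
}
```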
Hi!
Many thanks for your contribution to speeding up sorting in numpy. Wanted to ask if there are any plans to speed up merge/tim sort with AVX2/AVX512?
Consider, and perhaps move to, Google Benchmark instead of reinventing the wheel.
Have you tried comparing to:
https://github.com/damageboy/vxsort-cpp
Two ideas we could try:
Hi!
In the numpy/numpy#24634 PR, numpy.core is being renamed to numpy._core, which affects the .github/workflows/build-numpy.yml file here.
Would renaming the core path to _core work here? Or do you need to run the CI against other numpy versions?
32-bit argsort uses ymm registers: we can switch to zmm registers (use 2x i64gather instructions) and add new bitonic networks.
The additions in #61 fail to compile with MSVC. Is MSVC supported?
# using VS2022 (17.7.34031.279) ...
deps\simdsort\src\xss-network-qsort.hpp(80,11): error C2466: cannot allocate an array of constant size 0
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,0,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
with
[
vtype=vtype
]
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,1,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
with
[
vtype=vtype
]
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,2,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
with
[
vtype=vtype
]
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,4,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
with
[
vtype=vtype
]
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,8,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
with
[
vtype=vtype
]
simdsort\src\xss-network-qsort.hpp(134,5): message : see reference to function template instantiation 'void sort_n_vec<vtype,16,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
with
[
vtype=vtype
]
xxx.h(199,5): message : see reference to function template instantiation 'void sort_n<vtype,256>(zmm_vector<float>::type_t *,int)' being compiled
simdsort\src\xss-network-qsort.hpp(80,11): error C2133: 'vecs': unknown size
simdsort\src\xss-network-qsort.hpp(83,30): error C2466: cannot allocate an array of constant size 0
simdsort\src\xss-network-qsort.hpp(83,30): error C2133: 'ioMasks': unknown size
The changes also cause a few GCC -Wunused-but-set-parameter warnings for the regs parameter in the same file.
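A reduced, hypothetical model of the C2466 failure (names are made up, not the project's code): GCC and Clang accept zero-length arrays as an extension, but MSVC rejects them, so a recursive template that halves numVecs needs its base case guarded with C++17 if constexpr so the zero-size array declaration is never instantiated.

```cpp
// Recursive template that halves numVecs each step. Declaring
// `int halves[numVecs / 2]` when numVecs == 1 would be a zero-size array,
// which MSVC rejects (error C2466). `if constexpr` discards the
// array-declaring branch entirely at the base case.
template <int numVecs>
int sum_halves(int seed) {
    if constexpr (numVecs <= 1) {
        return seed;  // base case: no array declared on this path
    } else {
        int halves[numVecs / 2] = {};  // size is always >= 1 on this path
        return sum_halves<numVecs / 2>(seed)
               + (int)(sizeof(halves) / sizeof(halves[0]));
    }
}
```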
We only have argsort whose indices are (unsigned) int64_t. For 32-bit types like int/float, I think an argsort with int32_t as the index type could also be provided, which may perform better with AVX512's i32gather_ps/epi32 instructions. Also, this function could be easily implemented on AVX2 platforms.
Instructions in the README for meson fail because there is no meson.build in the utils directory.
Hello,
I tried the benchmark on a 7950X CPU; performance in some tests is up to 2.3x faster, but in other tests much slower (like 0.3x) compared to classical sorting. Is AMD's implementation of AVX512 not so powerful (and your code not suitable for Zen 4), or is it something else?
Thanks,
Jan
Don't rebuild benchexe when running branchcompare.
Hello, I just want to make sure I'm not doing something wrong. I stumbled upon this project when I found that IPP's radix sort had buffer size calculations that were overflowing well before the limit of INT32_MAX, and all of the "Index" variants also only used 32-bit types. To be clear, a 32-bit index variant would still be beneficial, and I see there are plans to eventually implement that; it's just that I need a conditional code path for when there are too many values (I have a generated dataset with length in the billions). I noticed that when the code makes use of the AVX512 instructions for argsort, it severely underperforms against the radix sort (which, as far as I can tell from the assembly, is completely scalar except for an iota-like scan to generate the sequential index).
The benchmarks distributed with this project do show it winning by some margin against the scalar variants here, but not by the typical orders of magnitude that were touted when this project was first introduced. Is this maybe a result of the gather data sampling mitigations in microcode? I know this code heavily uses compressed store and gather, and those instructions were severely limited by the mitigation.
Here's my benchmark on my hardware:
[astylinski@fedoravm builddir]$ ./benchexe --benchmark_filter=/int32_t
2023-12-03T11:30:05-05:00
Running ./benchexe
Run on (6 X 3312 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 4096 KiB (x6)
L3 Unified 16384 KiB (x1)
Load Average: 1.02, 0.68, 0.32
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------
simdargsort/random_128/int32_t 959 ns 956 ns 731536
simdargsort/random_256/int32_t 2045 ns 2040 ns 343034
simdargsort/random_512/int32_t 3866 ns 3857 ns 181511
simdargsort/random_1k/int32_t 9289 ns 9267 ns 75493
simdargsort/random_5k/int32_t 50900 ns 50768 ns 13789
simdargsort/random_100k/int32_t 1764351 ns 1757047 ns 400
simdargsort/random_1m/int32_t 29256670 ns 29111150 ns 24
simdargsort/random_10m/int32_t 959237861 ns 954095660 ns 1
simdargsort/smallrange_128/int32_t 722 ns 720 ns 975799
simdargsort/smallrange_256/int32_t 2049 ns 2044 ns 342384
simdargsort/smallrange_512/int32_t 3655 ns 3646 ns 191985
simdargsort/smallrange_1k/int32_t 9169 ns 9147 ns 76497
simdargsort/smallrange_5k/int32_t 21375 ns 21320 ns 32747
simdargsort/smallrange_100k/int32_t 467703 ns 465853 ns 1502
simdargsort/smallrange_1m/int32_t 9097255 ns 9047326 ns 75
simdargsort/smallrange_10m/int32_t 183596592 ns 182645622 ns 4
simdargsort/sorted_10k/int32_t 104574 ns 104314 ns 6668
simdargsort/constant_10k/int32_t 8648 ns 8628 ns 81134
simdargsort/reverse_10k/int32_t 105977 ns 105722 ns 6558
scalarargsort/random_128/int32_t 797 ns 795 ns 882026
scalarargsort/random_256/int32_t 1736 ns 1732 ns 403901
scalarargsort/random_512/int32_t 4000 ns 3990 ns 176210
scalarargsort/random_1k/int32_t 21879 ns 21822 ns 31937
scalarargsort/random_5k/int32_t 262249 ns 261537 ns 2676
scalarargsort/random_100k/int32_t 7551632 ns 7525002 ns 93
scalarargsort/random_1m/int32_t 103352847 ns 102895746 ns 7
scalarargsort/random_10m/int32_t 1974523438 ns 1964054017 ns 1
scalarargsort/smallrange_128/int32_t 667 ns 665 ns 1051997
scalarargsort/smallrange_256/int32_t 1498 ns 1494 ns 469065
scalarargsort/smallrange_512/int32_t 3152 ns 3144 ns 223444
scalarargsort/smallrange_1k/int32_t 13730 ns 13697 ns 50943
scalarargsort/smallrange_5k/int32_t 127606 ns 127285 ns 5499
scalarargsort/smallrange_100k/int32_t 3031465 ns 3019854 ns 232
scalarargsort/smallrange_1m/int32_t 38177852 ns 37995219 ns 18
scalarargsort/smallrange_10m/int32_t 695700305 ns 692244521 ns 1
scalarargsort/sorted_10k/int32_t 92423 ns 92202 ns 7578
scalarargsort/constant_10k/int32_t 83139 ns 82931 ns 8439
scalarargsort/reverse_10k/int32_t 74214 ns 74031 ns 9359
I suspect this function:
x86-simd-sort/src/avx512-16bit-qsort.hpp, line 65 (commit 7d7591c)
Hi,
A build failure on some assembler instructions:
The Meson build system
Version: 0.55.1
Source dir: repos/x86-simd-sort
Build dir: repos/x86-simd-sort/builddir
Build type: native build
Project name: x86-simd-sort
Project version: 4.0.0
C++ compiler for the host machine: ccache c++ (gcc 11.3.0 "c++ (GCC) 11.3.0")
C++ linker for the host machine: c++ ld.bfd 2.27-44
Host machine cpu family: x86_64
Host machine cpu: x86_64
Compiler for C++ supports arguments -march=haswell: YES
Compiler for C++ supports arguments -march=skylake-avx512: YES
Compiler for C++ supports arguments -march=icelake-client: YES
Build targets in project: 4
x86-simd-sort 4.0.0
Configuration
Can compile AVX-512 FP16 ISA: NO
Build test content: NO
Build benchmarks: NO
Found runner: /usr/bin/ninja
ninja: Entering directory `.'
[2/8] Compiling C++ object lib/liblibicl.a.p/x86simdsort-icl.cpp.o
FAILED: lib/liblibicl.a.p/x86simdsort-icl.cpp.o
ccache c++ -Ilib/liblibicl.a.p -Ilib -I../lib -I../src -fvisibility=hidden -fvisibility-inlines-hidden -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wnon-virtual-dtor -std=c++17 -O3 -fPIC -march=icelake-client -MD -MQ lib/liblibicl.a.p/x86simdsort-icl.cpp.o -MF lib/liblibicl.a.p/x86simdsort-icl.cpp.o.d -o lib/liblibicl.a.p/x86simdsort-icl.cpp.o -c ../lib/x86simdsort-icl.cpp
{standard input}: Assembler messages:
{standard input}:184: Error: no such instruction: `vpcompressw %zmm4,(%rbx,%r12,2){%k2}'
{standard input}:185: Error: no such instruction: `vpcompressw %zmm4,64(%rbx,%rax,2){%k1}'
{standard input}:197: Error: no such instruction: `vpcompressw %zmm4,(%rbx,%r12,2){%k2}'
{standard input}:198: Error: no such instruction: `vpcompressw %zmm4,(%rbx,%rax,2){%k1}'
...
Thanks
Hi,
Wanted to ask if there are any plans for argsort implementation with AVX2?
Amazing work! I was wondering if it would be possible to adapt the qsort kernels for an argsort, i.e. also returning the indices that would make the original array sorted. (I believe you can do this with std::sort by sorting pair<value, size_t> where size_t is the original index.) Would love to help out if it's feasible!