intel / x86-simd-sort
C++ template library for high performance SIMD based sorting algorithms
License: BSD 3-Clause "New" or "Revised" License
When I try to compile the benchmark executable, I get the following error from running make meson:
...
FAILED: benchexe
c++ -o benchexe 'benchexe@exe/benchmarks_main.cpp.o' -Wl,--as-needed -Wl,--no-undefined -Wl,--whole-archive -Wl,--start-group benchmarks/libbench_qsort.a utils/libcpuinfo.a -Wl,--no-whole-archive /usr/local/lib/libbenchmark.a -Wl,--end-group
/usr/bin/ld: /usr/local/lib/libbenchmark.a(benchmark_runner.cc.o): in function `benchmark::internal::BenchmarkRunner::DoNIterations()':
benchmark_runner.cc:(.text+0x1418): undefined reference to `pthread_create'
collect2: error: ld returned 1 exit status
...
This is with v1.8.0 of Google Benchmark and v1.13.0 of Google Test. My system is running Ubuntu 20.04, and I have tried both g++-10 and g++-9. Further, the Makefile spits out many errors because my version of GCC can't compile the fp16 files.
I eventually pinned the meson issue down to the fact that my system is running glibc 2.31, where the linker still needs to be explicitly given -pthread or -lpthread to use threading. This flag is present by default for gtest and gtest_main, but Google Benchmark's pkg-config will only expose it when you indicate that it's being linked statically.
The default installation method for both Google Benchmark and Google Test produces a static library, so we should probably be explicitly configuring it as such in the build scripts. As for the Makefile, we could adapt this solution from Stack Overflow to detect when GCC is not new enough for the -march=sapphirerapids flag and exclude the fp16 files.
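A minimal meson sketch of that configuration (target and file names here are illustrative, not necessarily the project's actual ones): declaring the threads dependency makes meson pass -pthread at link time, which the static libbenchmark.a needs on glibc 2.31.

```meson
# Illustrative meson.build fragment; names are assumptions, not the
# project's actual targets.
thread_dep = dependency('threads')
benchmark_dep = dependency('benchmark', static : true)

benchexe = executable('benchexe',
    'benchmarks/benchmarks_main.cpp',
    dependencies : [benchmark_dep, thread_dep])
```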
It seems the partial_qsort test occasionally fails in the smallrange test case; the failure never seems to occur outside of it. Attached is a patch file that replicates the failure reliably.
Since we are not allowed to modify the array, it's hard to handle NaNs correctly in the avx512_argsort routines for float and double.
On AVX512 builds there are multiple definitions of some functions, e.g. avx512_qsort; see numpy/numpy#25274.
The reason is that the explicit specialization of the function template is not declared inline. See cppreference:
Whether an explicit specialization of a function [...] template is inline/constexpr(since C++11) [...] is determined by the explicit specialization itself, regardless of whether the primary template is declared with that specifier
And e.g. template <> void avx512_qsort(_Float16 *arr, arrsize_t arrsize, bool hasnan) has no inline and hence can be defined by multiple translation units, see the source:
x86-simd-sort/src/avx512fp16-16bit-qsort.hpp, line 180 (commit d9c9737)
The issue seems to apply to pretty much every explicit function specialization, so it looks like a major fixup is required: searching for "template <>" and marking all those functions defined in headers inline.
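For illustration, a minimal header-style sketch of the rule (function names here are made up, not the project's): the primary template may appear in many translation units, but an explicit specialization must carry inline itself.

```cpp
#include <string>

// Primary template: usable from many translation units without issue.
template <typename T>
std::string which_impl(T *) { return "generic"; }

// An explicit specialization is NOT implicitly inline. Without the `inline`
// keyword below, defining this in a header included by two .cpp files
// produces multiple-definition link errors, exactly as with avx512_qsort.
template <>
inline std::string which_impl<float>(float *) { return "float specialization"; }
```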
Add CI tests to run for every PR.
Use AVX512FP16 ISA.
Hi,
Would there be any blocker to implementing an option to do partial sorting (sometimes called 'topk')?
https://en.cppreference.com/w/cpp/algorithm/partial_sort
I know it's usually done with heap sort and this project is more into quicksort, but still.
Kind regards,
WT
ref:
https://arrayfire.org/docs/group__stat__func__topk.htm#gsc.tab=0
https://pytorch.org/docs/stable/generated/torch.topk.html
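For reference, the requested semantics can be sketched with std::partial_sort from the standard library (the helper name topk_smallest is made up for this example):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// "topk" for the smallest k elements: only the first k slots end up sorted;
// the remainder of the array is left in unspecified order.
std::vector<int> topk_smallest(std::vector<int> v, std::size_t k) {
    std::partial_sort(v.begin(), v.begin() + k, v.end());
    v.resize(k);  // keep just the sorted prefix
    return v;
}
```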
We attempted to merge avx512-sort into oceanbase; compared with std::sort, avx512-sort has an obvious performance improvement. But there is a problem when merging the new x86_simd_sort into oceanbase. We found that you added some GCC optimization pragmas in the latest version of x86-simd-sort, such as "#pragma GCC unroll 8". The default GCC version for oceanbase is gcc 5.2.0, and it seems that this gcc doesn't support these pragmas. The error is like this:
Please help fix this.
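One possible workaround (the macro name below is hypothetical, not from x86-simd-sort) is to gate the pragma on the compiler version, since #pragma GCC unroll was only added in GCC 8:

```cpp
// Hypothetical guard macro: `#pragma GCC unroll` appeared in GCC 8, so
// expand to nothing on older compilers like gcc 5.2.0.
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 8
#define XSS_UNROLL8 _Pragma("GCC unroll 8")
#else
#define XSS_UNROLL8
#endif

// The pragma (when active) only hints unrolling; results are identical.
int sum_first8() {
    int sum = 0;
    XSS_UNROLL8
    for (int i = 0; i < 8; ++i)
        sum += i;
    return sum;
}
```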
Hi,
If I understand correctly, this is a software C++ implementation of quicksort using hardware AVX vector instructions.
A lot of CPUs now have instructions for cryptography, hashing, etc., and Intel's do.
Is there a roadmap at Intel, even for the next 10 years, to have a basic quicksort performed in hardware the same way a hash is computed? I know the hardware implementation is complex.
We need the hardware to move closer to the service level that standard software libraries provide, and a hardware sort would be a big game changer (I think of data processing, machine learning, binning, quantization, etc.).
Any plans to implement on AVX/AVX2 as well, in order to improve availability?
This lib looks awesome!
But I have a requirement to output the original indices of the sorted data. For example, if the input array is {300, 200, 400, 99, 150, 50, 230, 250}, then the sorted result should be {50, 99, 150, 200, 230, 250, 300, 400}, and I want to get the original indices as output: {5, 3, 4, 1, 6, 7, 0, 2}.
Currently I use std::sort with a lambda to implement this requirement, but the performance is not good.
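For context, the pattern the poster describes can be sketched with std::sort over an index vector (the helper name argsort_scalar is made up; the library's own entry point for this is avx512_argsort):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Scalar argsort: returns the permutation of indices that sorts `data`.
std::vector<std::size_t> argsort_scalar(const std::vector<int> &data) {
    std::vector<std::size_t> idx(data.size());
    std::iota(idx.begin(), idx.end(), 0);  // fill with 0, 1, ..., n-1
    // Sort the indices by comparing the values they point to.
    std::sort(idx.begin(), idx.end(),
              [&](std::size_t a, std::size_t b) { return data[a] < data[b]; });
    return idx;
}
```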
Hi!
Many thanks for your contribution to speeding up sorting in numpy. Wanted to ask if there are any plans to speed up merge/tim sort with AVX2/AVX512?
Consider, and perhaps move to, Google Benchmark instead of reinventing the wheel.
Have you tried comparing to:
https://github.com/damageboy/vxsort-cpp
Two ideas we could try:
Hi!
In the numpy/numpy#24634 PR, numpy.core is being renamed to numpy._core, which affects the .github/workflows/build-numpy.yml file here.
Would renaming the core path to _core work here? Or do you need to run the CI against other numpy versions?
32-bit argsort uses ymm registers: we can switch to zmm registers (use 2x i64gather instructions) and add new bitonic networks.
The additions in #61 fail to compile with MSVC. Is MSVC supported?
# using VS2022 (17.7.34031.279) ...
deps\simdsort\src\xss-network-qsort.hpp(80,11): error C2466: cannot allocate an array of constant size 0
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,0,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
with
[
vtype=vtype
]
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,1,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
with
[
vtype=vtype
]
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,2,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
with
[
vtype=vtype
]
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,4,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
with
[
vtype=vtype
]
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,8,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
with
[
vtype=vtype
]
simdsort\src\xss-network-qsort.hpp(134,5): message : see reference to function template instantiation 'void sort_n_vec<vtype,16,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
with
[
vtype=vtype
]
xxx.h(199,5): message : see reference to function template instantiation 'void sort_n<vtype,256>(zmm_vector<float>::type_t *,int)' being compiled
simdsort\src\xss-network-qsort.hpp(80,11): error C2133: 'vecs': unknown size
simdsort\src\xss-network-qsort.hpp(83,30): error C2466: cannot allocate an array of constant size 0
simdsort\src\xss-network-qsort.hpp(83,30): error C2133: 'ioMasks': unknown size
The changes also cause a few GCC -Wunused-but-set-parameter warnings for the regs parameter in the same file.
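A reduced, hypothetical model of the C2466 failure (names are made up, not the project's code): GCC and Clang accept zero-length arrays as an extension, but MSVC rejects them, so a recursive template that halves numVecs needs its base case guarded with C++17 if constexpr so the zero-size array declaration is never instantiated.

```cpp
// Recursive template that halves numVecs each step. Declaring
// `int halves[numVecs / 2]` when numVecs == 1 would be a zero-size array,
// which MSVC rejects (error C2466). `if constexpr` discards the
// array-declaring branch entirely at the base case.
template <int numVecs>
int sum_halves(int seed) {
    if constexpr (numVecs <= 1) {
        return seed;  // base case: no array declared on this path
    } else {
        int halves[numVecs / 2] = {};  // size is always >= 1 on this path
        return sum_halves<numVecs / 2>(seed)
               + (int)(sizeof(halves) / sizeof(halves[0]));
    }
}
```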
We only have argsort whose indices are (unsigned) int64_t. For 32-bit types like int/float, I think an argsort with int32_t as the index type could also be provided, which may perform better with AVX512's i32gather_ps/epi32 instructions. Also, this function could be easily implemented on AVX2 platforms.
Instructions in the README for meson fail because there is no meson.build in the utils directory.
Hello,
I tried the benchmark on a 7950X CPU; performance in some tests is up to 2.3x faster, but in other tests much slower (like 0.3x) compared to classical sorting. Is AMD's implementation of AVX512 not so powerful (and your code not suitable for Zen 4), or is it something else?
Thanks,
Jan
Don't rebuild benchexe when running branchcompare.
Hello, I just want to make sure I'm not doing something wrong. I stumbled upon this project when I found that IPP's radix sort had buffer size calculations that were overflowing well before the limit of INT32_MAX, and all of the "Index" variants also only used 32-bit types. To be clear, a 32-bit index variant would still be beneficial, and I see there are plans to eventually implement that; it's just that I need a conditional code path for when there are too many values (I have a generated dataset with length in the billions). I noticed that when the code makes use of the AVX512 instructions for argsort, it severely underperforms against the radix sort (which, as far as I can tell from the assembly, is completely scalar except for an iota-like scan to generate the sequential index).
The benchmarks distributed with this project do show it winning by some margin against the scalar variants here, but not by the typical orders of magnitude that were touted when this project was first introduced. Is this maybe a result of the gather data sampling mitigations in microcode? I know this code heavily uses compressed store and gather, and those instructions were severely limited by the mitigation.
Here's my benchmark on my hardware:
[astylinski@fedoravm builddir]$ ./benchexe --benchmark_filter=/int32_t
2023-12-03T11:30:05-05:00
Running ./benchexe
Run on (6 X 3312 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 4096 KiB (x6)
L3 Unified 16384 KiB (x1)
Load Average: 1.02, 0.68, 0.32
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------
simdargsort/random_128/int32_t 959 ns 956 ns 731536
simdargsort/random_256/int32_t 2045 ns 2040 ns 343034
simdargsort/random_512/int32_t 3866 ns 3857 ns 181511
simdargsort/random_1k/int32_t 9289 ns 9267 ns 75493
simdargsort/random_5k/int32_t 50900 ns 50768 ns 13789
simdargsort/random_100k/int32_t 1764351 ns 1757047 ns 400
simdargsort/random_1m/int32_t 29256670 ns 29111150 ns 24
simdargsort/random_10m/int32_t 959237861 ns 954095660 ns 1
simdargsort/smallrange_128/int32_t 722 ns 720 ns 975799
simdargsort/smallrange_256/int32_t 2049 ns 2044 ns 342384
simdargsort/smallrange_512/int32_t 3655 ns 3646 ns 191985
simdargsort/smallrange_1k/int32_t 9169 ns 9147 ns 76497
simdargsort/smallrange_5k/int32_t 21375 ns 21320 ns 32747
simdargsort/smallrange_100k/int32_t 467703 ns 465853 ns 1502
simdargsort/smallrange_1m/int32_t 9097255 ns 9047326 ns 75
simdargsort/smallrange_10m/int32_t 183596592 ns 182645622 ns 4
simdargsort/sorted_10k/int32_t 104574 ns 104314 ns 6668
simdargsort/constant_10k/int32_t 8648 ns 8628 ns 81134
simdargsort/reverse_10k/int32_t 105977 ns 105722 ns 6558
scalarargsort/random_128/int32_t 797 ns 795 ns 882026
scalarargsort/random_256/int32_t 1736 ns 1732 ns 403901
scalarargsort/random_512/int32_t 4000 ns 3990 ns 176210
scalarargsort/random_1k/int32_t 21879 ns 21822 ns 31937
scalarargsort/random_5k/int32_t 262249 ns 261537 ns 2676
scalarargsort/random_100k/int32_t 7551632 ns 7525002 ns 93
scalarargsort/random_1m/int32_t 103352847 ns 102895746 ns 7
scalarargsort/random_10m/int32_t 1974523438 ns 1964054017 ns 1
scalarargsort/smallrange_128/int32_t 667 ns 665 ns 1051997
scalarargsort/smallrange_256/int32_t 1498 ns 1494 ns 469065
scalarargsort/smallrange_512/int32_t 3152 ns 3144 ns 223444
scalarargsort/smallrange_1k/int32_t 13730 ns 13697 ns 50943
scalarargsort/smallrange_5k/int32_t 127606 ns 127285 ns 5499
scalarargsort/smallrange_100k/int32_t 3031465 ns 3019854 ns 232
scalarargsort/smallrange_1m/int32_t 38177852 ns 37995219 ns 18
scalarargsort/smallrange_10m/int32_t 695700305 ns 692244521 ns 1
scalarargsort/sorted_10k/int32_t 92423 ns 92202 ns 7578
scalarargsort/constant_10k/int32_t 83139 ns 82931 ns 8439
scalarargsort/reverse_10k/int32_t 74214 ns 74031 ns 9359
I suspect this function:
x86-simd-sort/src/avx512-16bit-qsort.hpp, line 65 (commit 7d7591c)
Hi,
A build failure on some assembler instructions:
The Meson build system
Version: 0.55.1
Source dir: repos/x86-simd-sort
Build dir: repos/x86-simd-sort/builddir
Build type: native build
Project name: x86-simd-sort
Project version: 4.0.0
C++ compiler for the host machine: ccache c++ (gcc 11.3.0 "c++ (GCC) 11.3.0")
C++ linker for the host machine: c++ ld.bfd 2.27-44
Host machine cpu family: x86_64
Host machine cpu: x86_64
Compiler for C++ supports arguments -march=haswell: YES
Compiler for C++ supports arguments -march=skylake-avx512: YES
Compiler for C++ supports arguments -march=icelake-client: YES
Build targets in project: 4
x86-simd-sort 4.0.0
Configuration
Can compile AVX-512 FP16 ISA: NO
Build test content: NO
Build benchmarks: NO
Found runner: /usr/bin/ninja
ninja: Entering directory `.'
[2/8] Compiling C++ object lib/liblibicl.a.p/x86simdsort-icl.cpp.o
FAILED: lib/liblibicl.a.p/x86simdsort-icl.cpp.o
ccache c++ -Ilib/liblibicl.a.p -Ilib -I../lib -I../src -fvisibility=hidden -fvisibility-inlines-hidden -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wnon-virtual-dtor -std=c++17 -O3 -fPIC -march=icelake-client -MD -MQ lib/liblibicl.a.p/x86simdsort-icl.cpp.o -MF lib/liblibicl.a.p/x86simdsort-icl.cpp.o.d -o lib/liblibicl.a.p/x86simdsort-icl.cpp.o -c ../lib/x86simdsort-icl.cpp
{standard input}: Assembler messages:
{standard input}:184: Error: no such instruction: `vpcompressw %zmm4,(%rbx,%r12,2){%k2}'
{standard input}:185: Error: no such instruction: `vpcompressw %zmm4,64(%rbx,%rax,2){%k1}'
{standard input}:197: Error: no such instruction: `vpcompressw %zmm4,(%rbx,%r12,2){%k2}'
{standard input}:198: Error: no such instruction: `vpcompressw %zmm4,(%rbx,%rax,2){%k1}'
...
Thanks
Hi,
Wanted to ask if there are any plans for argsort implementation with AVX2?
Amazing work! I was wondering if it would be possible to adapt the qsort kernels for an argsort, i.e. also returning the indices that would make the original array sorted. (I believe you can do this with std::sort by sorting pair<value, size_t> where size_t is the original index.) Would love to help out if it's feasible!