
x86-simd-sort's Issues

Benchmarks can't be compiled on older platforms

When I try to compile the benchmark executable with meson, I get the following error:

...
FAILED: benchexe
c++  -o benchexe 'benchexe@exe/benchmarks_main.cpp.o' -Wl,--as-needed -Wl,--no-undefined -Wl,--whole-archive -Wl,--start-group benchmarks/libbench_qsort.a utils/libcpuinfo.a -Wl,--no-whole-archive /usr/local/lib/libbenchmark.a -Wl,--end-group
/usr/bin/ld: /usr/local/lib/libbenchmark.a(benchmark_runner.cc.o): in function `benchmark::internal::BenchmarkRunner::DoNIterations()':
benchmark_runner.cc:(.text+0x1418): undefined reference to `pthread_create'
collect2: error: ld returned 1 exit status
...

This is with v1.8.0 of Google Benchmark and v1.13.0 of GoogleTest. My system is running Ubuntu 20.04 and I have tried both g++-10 and g++-9. Additionally, the Makefile build spits out many errors because my version of GCC can't compile the fp16 files.

I eventually pinned the meson issue down to the fact that my system is running glibc 2.31, where the linker still needs to be given -pthread or -lpthread explicitly to use threading. This flag is present by default for gtest and gtest_main, but Google Benchmark's pkg-config will only expose it when you indicate that it is being linked statically.

The default installation method for both Google Benchmark and GoogleTest produces a static library, so we should probably configure the dependency as static explicitly in the build scripts. As for the Makefile, we could adapt this solution from Stack Overflow to detect when GCC is too old for the -march=sapphirerapids flag and exclude the fp16 files.
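A minimal sketch of what that configuration might look like in the meson build, assuming the benchmark dependency is resolved through pkg-config (the actual dependency names in this repo's build files may differ):

```meson
# Hypothetical fragment: request the static flavour of Google Benchmark,
# so its pkg-config file exposes -pthread on glibc 2.31 systems.
benchmark_dep = dependency('benchmark', static: true)
# Explicitly depending on threads also guarantees -pthread is passed.
thread_dep = dependency('threads')
```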

Multiple definition issue due to explicit function template specialization

On AVX512 builds there are multiple definitions of some functions, e.g. avx512_qsort, see numpy/numpy#25274

The reason is that the explicit specialization of the function template is not declared inline. See cppreference:

Whether an explicit specialization of a function [...] template is inline/constexpr(since C++11) [...] is determined by the explicit specialization itself, regardless of whether the primary template is declared with that specifier

And e.g. template <> void avx512_qsort(_Float16 *arr, arrsize_t arrsize, bool hasnan) has no inline and hence could be defined in multiple translation units; see the source:

void avx512_qsort(_Float16 *arr, arrsize_t arrsize, bool hasnan)

The issue seems to apply to pretty much every explicit function specialization, so a major fixup looks necessary: search for "template <>" and mark all such functions that are defined in headers as inline.

CI: Add basic CI

Add CI tests to run for every PR.

  • Build testexe and run to check pass/fail
  • Build x86-simd-sort with main branch of NumPy and ensure it builds correctly.

Older GCC versions don't support optimization pragmas

We attempted to merge avx512-sort into oceanbase; compared with std::sort, avx512-sort shows an obvious performance improvement. But there is a problem when merging the new x86_simd_sort into oceanbase. We found that some GCC optimization pragmas, such as "#pragma GCC unroll 8", were added in the latest version of x86-simd-sort. The default GCC version for oceanbase is gcc 5.2.0, and it seems that this GCC doesn't support these pragmas. The error looks like this:
[screenshot of the compiler error]

Please help to fix.

A question about hardware implementation of quick-sort in future Intel CPUs

Hi,

If I understand correctly, this is a software C++ implementation of quicksort using hardware AVX vector instructions.

A lot of CPUs now have dedicated instructions for cryptography, hashing, etc., and Intel's do.

Is there a roadmap at Intel, even for the next 10 years, to have a basic quicksort performed in hardware the same way a hash is computed? I know the hardware implementation is complex.

We need the hardware to move closer to the service level that standard software libraries provide, and a hardware sort would be a big game changer (I am thinking of data processing, machine learning, binning, quantization, etc.).

Can I use this lib to output the original indices of the sorted data?

This lib looks awesome!
But I have a requirement to output the original indices of the sorted data. For example, if the input array is {300, 200, 400, 99, 150, 50, 230, 250}, the sorted result should be {50, 99, 150, 200, 230, 250, 300, 400}, and I want the original indices {5, 3, 4, 1, 6, 7, 0, 2} as output.
Currently I use std::sort with a lambda to implement this requirement, but the performance is not good.

Merge sort and AVX512

Hi!

Many thanks for your contribution to speeding up sorting in NumPy. I wanted to ask: are there any plans to speed up merge/tim sort with AVX2/AVX512?

Improve perf of qsort

Two ideas we could try:

  • Use larger bitonic sorting networks (256 and 512 elements)
  • Improve pivot selection by picking a set of random indices

Improve argsort for 32-bit

32-bit argsort uses ymm registers: we can switch to zmm registers (use 2x i64gather instructions) and add new bitonic networks.

Recent changes don't compile with MSVC, cause warnings in GCC

The additions in #61 fail to compile with MSVC. Is MSVC supported?

# using VS2022 (17.7.34031.279) ...
deps\simdsort\src\xss-network-qsort.hpp(80,11): error C2466: cannot allocate an array of constant size 0
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,0,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
          with
          [
              vtype=vtype
          ]
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,1,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
          with
          [
              vtype=vtype
          ]
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,2,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
          with
          [
              vtype=vtype
          ]
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,4,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
          with
          [
              vtype=vtype
          ]
simdsort\src\xss-network-qsort.hpp(76,9): message : see reference to function template instantiation 'void sort_n_vec<vtype,8,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
          with
          [
              vtype=vtype
          ]
simdsort\src\xss-network-qsort.hpp(134,5): message : see reference to function template instantiation 'void sort_n_vec<vtype,16,zmm_vector<float>::reg_t>(zmm_vector<float>::type_t *,int32_t)' being compiled
          with
          [
              vtype=vtype
          ]
xxx.h(199,5): message : see reference to function template instantiation 'void sort_n<vtype,256>(zmm_vector<float>::type_t *,int)' being compiled
simdsort\src\xss-network-qsort.hpp(80,11): error C2133: 'vecs': unknown size
simdsort\src\xss-network-qsort.hpp(83,30): error C2466: cannot allocate an array of constant size 0
simdsort\src\xss-network-qsort.hpp(83,30): error C2133: 'ioMasks': unknown size

The changes also cause a few GCC -Wunused-but-set-parameter warnings for the regs parameter in the same file.

Need Argsort For 32Bit Index For 32Bit Types

We only have argsort whose indices are (unsigned) int64_t. For 32-bit types like int/float, an argsort with int32_t as the index type could also be provided, which may perform better with AVX512's i32gather_ps/epi32 instructions. Also, this function could be easily implemented on AVX2 platforms.

performance on amd 7950x ...

Hello,
I tried the benchmark on a 7950X CPU; in some tests performance is up to 2.3x faster, but in other tests it is much slower (like 0.3x) compared to classical sorting. Is AMD's implementation of AVX512 not so powerful (and your code not suitable for Zen 4), or is it something else?

Thanks,
Jan

Argsort on int32_t severely underperforms

Hello, I just want to make sure I'm not doing something wrong. I stumbled upon this project when I found that IPP's radix sort had buffer size calculations that were overflowing well before the limits of INT32_MAX, and that all of the "Index" variants also only used 32-bit types. To be clear, a 32-bit index variant would still be beneficial, and I see there are plans to eventually implement that; it's just that I need a conditional code path for when there are too many values (I have a generated dataset with length in the billions). I noticed that when the code uses the AVX512 instructions for argsort, it severely underperforms against the radix sort (which, as far as I can tell from the assembly, is completely scalar except for an iota-like scan to generate the sequential index).

The benchmarks distributed with this project do show it winning by some margin against the scalar variants here, but not by the order of magnitude that was touted when this project was first introduced. Is this maybe a result of the Gather Data Sampling mitigations in microcode? I know this code heavily uses compressed stores and gathers, and those instructions were severely limited by the mitigation.

Here's my benchmark on my hardware:

[astylinski@fedoravm builddir]$ ./benchexe --benchmark_filter=/int32_t
2023-12-03T11:30:05-05:00
Running ./benchexe
Run on (6 X 3312 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 4096 KiB (x6)
  L3 Unified 16384 KiB (x1)
Load Average: 1.02, 0.68, 0.32
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
simdargsort/random_128/int32_t               959 ns          956 ns       731536
simdargsort/random_256/int32_t              2045 ns         2040 ns       343034
simdargsort/random_512/int32_t              3866 ns         3857 ns       181511
simdargsort/random_1k/int32_t               9289 ns         9267 ns        75493
simdargsort/random_5k/int32_t              50900 ns        50768 ns        13789
simdargsort/random_100k/int32_t          1764351 ns      1757047 ns          400
simdargsort/random_1m/int32_t           29256670 ns     29111150 ns           24
simdargsort/random_10m/int32_t         959237861 ns    954095660 ns            1
simdargsort/smallrange_128/int32_t           722 ns          720 ns       975799
simdargsort/smallrange_256/int32_t          2049 ns         2044 ns       342384
simdargsort/smallrange_512/int32_t          3655 ns         3646 ns       191985
simdargsort/smallrange_1k/int32_t           9169 ns         9147 ns        76497
simdargsort/smallrange_5k/int32_t          21375 ns        21320 ns        32747
simdargsort/smallrange_100k/int32_t       467703 ns       465853 ns         1502
simdargsort/smallrange_1m/int32_t        9097255 ns      9047326 ns           75
simdargsort/smallrange_10m/int32_t     183596592 ns    182645622 ns            4
simdargsort/sorted_10k/int32_t            104574 ns       104314 ns         6668
simdargsort/constant_10k/int32_t            8648 ns         8628 ns        81134
simdargsort/reverse_10k/int32_t           105977 ns       105722 ns         6558
scalarargsort/random_128/int32_t             797 ns          795 ns       882026
scalarargsort/random_256/int32_t            1736 ns         1732 ns       403901
scalarargsort/random_512/int32_t            4000 ns         3990 ns       176210
scalarargsort/random_1k/int32_t            21879 ns        21822 ns        31937
scalarargsort/random_5k/int32_t           262249 ns       261537 ns         2676
scalarargsort/random_100k/int32_t        7551632 ns      7525002 ns           93
scalarargsort/random_1m/int32_t        103352847 ns    102895746 ns            7
scalarargsort/random_10m/int32_t      1974523438 ns   1964054017 ns            1
scalarargsort/smallrange_128/int32_t         667 ns          665 ns      1051997
scalarargsort/smallrange_256/int32_t        1498 ns         1494 ns       469065
scalarargsort/smallrange_512/int32_t        3152 ns         3144 ns       223444
scalarargsort/smallrange_1k/int32_t        13730 ns        13697 ns        50943
scalarargsort/smallrange_5k/int32_t       127606 ns       127285 ns         5499
scalarargsort/smallrange_100k/int32_t    3031465 ns      3019854 ns          232
scalarargsort/smallrange_1m/int32_t     38177852 ns     37995219 ns           18
scalarargsort/smallrange_10m/int32_t   695700305 ns    692244521 ns            1
scalarargsort/sorted_10k/int32_t           92423 ns        92202 ns         7578
scalarargsort/constant_10k/int32_t         83139 ns        82931 ns         8439
scalarargsort/reverse_10k/int32_t          74214 ns        74031 ns         9359

Compilation failure on vpcompressw

Hi,
A build failure on some assembler instructions:

The Meson build system
Version: 0.55.1
Source dir: repos/x86-simd-sort
Build dir: repos/x86-simd-sort/builddir
Build type: native build
Project name: x86-simd-sort
Project version: 4.0.0
C++ compiler for the host machine: ccache c++ (gcc 11.3.0 "c++ (GCC) 11.3.0")
C++ linker for the host machine: c++ ld.bfd 2.27-44
Host machine cpu family: x86_64
Host machine cpu: x86_64
Compiler for C++ supports arguments -march=haswell: YES 
Compiler for C++ supports arguments -march=skylake-avx512: YES 
Compiler for C++ supports arguments -march=icelake-client: YES 
Build targets in project: 4

x86-simd-sort 4.0.0

  Configuration
    Can compile AVX-512 FP16 ISA: NO
              Build test content: NO
                Build benchmarks: NO

Found runner: /usr/bin/ninja
ninja: Entering directory `.'
[2/8] Compiling C++ object lib/liblibicl.a.p/x86simdsort-icl.cpp.o
FAILED: lib/liblibicl.a.p/x86simdsort-icl.cpp.o 
ccache c++ -Ilib/liblibicl.a.p -Ilib -I../lib -I../src -fvisibility=hidden -fvisibility-inlines-hidden -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wnon-virtual-dtor -std=c++17 -O3 -fPIC -march=icelake-client -MD -MQ lib/liblibicl.a.p/x86simdsort-icl.cpp.o -MF lib/liblibicl.a.p/x86simdsort-icl.cpp.o.d -o lib/liblibicl.a.p/x86simdsort-icl.cpp.o -c ../lib/x86simdsort-icl.cpp
{standard input}: Assembler messages:
{standard input}:184: Error: no such instruction: `vpcompressw %zmm4,(%rbx,%r12,2){%k2}'
{standard input}:185: Error: no such instruction: `vpcompressw %zmm4,64(%rbx,%rax,2){%k1}'
{standard input}:197: Error: no such instruction: `vpcompressw %zmm4,(%rbx,%r12,2){%k2}'
{standard input}:198: Error: no such instruction: `vpcompressw %zmm4,(%rbx,%rax,2){%k1}'
...

Tks

argsort and AVX2

Hi,
I wanted to ask if there are any plans for an argsort implementation with AVX2?

Argsort Adaptation

Amazing work! I was wondering if it would be possible to adapt the qsort kernels for an argsort, i.e. also returning the indices that would make the original array sorted. (I believe you can do this with std::sort by sorting pair<value, size_t>, where size_t is the original index.) Would love to help out if it's feasible!
