xtensor-stack / xsimd

C++ wrappers for SIMD intrinsics and parallelized, optimized mathematical functions (SSE, AVX, AVX512, NEON, SVE)

Home Page: https://xsimd.readthedocs.io/

License: BSD 3-Clause "New" or "Revised" License

C++ 97.47% CMake 1.29% Shell 1.02% HTML 0.01% Python 0.21%

simd-intrinsics vectorization simd cpp avx neon sse avx512 simd-instructions mathematical-functions

xsimd's Issues

Make xaligned_malloc default

I have benchmarked posix_memalign and _mm_malloc, and they are quite a bit slower for 32 bit alignments than the xaligned_malloc/free.

Or we could enable/disable them by some compile time flags?
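
For readers unfamiliar with how a hand-rolled aligned allocation can work, here is a minimal sketch of the classic over-allocation technique. The names `sketch_aligned_malloc` and `sketch_aligned_free` are hypothetical, and this is not claimed to be how xaligned_malloc is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helpers for illustration only.
// Allocates `size` bytes aligned to `alignment`, which must be a power
// of two and at least sizeof(void*).
inline void* sketch_aligned_malloc(std::size_t size, std::size_t alignment)
{
    assert((alignment & (alignment - 1)) == 0 && alignment >= sizeof(void*));
    // Reserve room for the alignment slack plus one pointer that
    // remembers where the raw block starts.
    void* raw = std::malloc(size + alignment + sizeof(void*));
    if (raw == nullptr)
    {
        return nullptr;
    }
    const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    const std::uintptr_t aligned = (start + mask) & ~mask;
    // Store the raw pointer just below the aligned address.
    reinterpret_cast<void**>(aligned)[-1] = raw;
    return reinterpret_cast<void*>(aligned);
}

// Releases a block obtained from sketch_aligned_malloc.
inline void sketch_aligned_free(void* ptr)
{
    if (ptr != nullptr)
    {
        std::free(reinterpret_cast<void**>(ptr)[-1]);
    }
}
```

A compile-time switch between this path and posix_memalign / _mm_malloc could then be a plain preprocessor macro around the two helpers.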

Random access violation on windows x64

exp / log / pow for complex<float>

pow of complex<float> has some accuracy issues with AVX512.

Can not build with Intel Compiler on MSVC because cmake fails

Currently the build fails for me with the latest Intel Compiler on MSVC (and Visual Studio 2015). The build aborts in cmake configuration with error

-- Performing Test HAS_CPP11_FLAG
-- Performing Test HAS_CPP11_FLAG - Failed
CMake Error at test/CMakeLists.txt:53 (message):
  Unsupported compiler -- xsimd requires C++11 support!

-- Configuring incomplete, errors occurred!

I assume the detection is not perfect, because the Intel Compiler has support for C++11. Is there anything I can do to help?

Implement missing operator%

We need the modulo operator in xsimd for operations in xtensor.

Aligned Pool Allocator

An aligned pool allocator that keeps memory around until explicit flush/cleanup would be a nice additional feature.
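
A rough sketch of what such a pool might look like, assuming C++17 aligned `operator new` is available. The class name `aligned_pool_sketch` and its interface are hypothetical; nothing here is part of xsimd.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical class for illustration only.
// Hands out fixed-size aligned blocks and keeps released blocks in a
// free list until flush() is called explicitly (or the pool dies).
class aligned_pool_sketch
{
public:
    aligned_pool_sketch(std::size_t block_size, std::size_t alignment)
        : m_block_size(block_size), m_alignment(alignment)
    {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    }

    aligned_pool_sketch(const aligned_pool_sketch&) = delete;
    aligned_pool_sketch& operator=(const aligned_pool_sketch&) = delete;

    ~aligned_pool_sketch() { flush(); }

    // Reuses a cached block when one is available.
    void* allocate()
    {
        if (!m_free.empty())
        {
            void* p = m_free.back();
            m_free.pop_back();
            return p;
        }
        return ::operator new(m_block_size, std::align_val_t(m_alignment));
    }

    // Does not return memory to the system; the block is only cached.
    void deallocate(void* p) { m_free.push_back(p); }

    // Releases every cached block.
    void flush()
    {
        for (void* p : m_free)
        {
            ::operator delete(p, std::align_val_t(m_alignment));
        }
        m_free.clear();
    }

    std::size_t cached_blocks() const { return m_free.size(); }

private:
    std::size_t m_block_size;
    std::size_t m_alignment;
    std::vector<void*> m_free;
};
```

Blocks that are never passed back to `deallocate` are not tracked by this sketch, so the caller remains responsible for returning them before the pool is destroyed.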

Implement wrappers for unsigned integers

This is a requirement for logarithm functions.

batch<float, 8>::store_unaligned() segfaults

In xsimd_avx_float.hpp, the function batch<float, 8>::store_unaligned() calls the aligned intrinsic _mm256_store_ps() instead of the unaligned _mm256_storeu_ps(), leading to a segfault when the address is actually unaligned. I didn't check if the same problem occurs for other types.

OS X Support

load_[un]aligned and store_[un]aligned free functions only work on the largest vector width

I am not sure if this is a bug or feature, but there is no way to do a load/store of a vector which is smaller than the native vector width (e.g. a 128-bit batch<float, 4> on AVX) using the free function interface from xsimd/xsimd.hpp.

The way the current interface works, one has to use the constructor/method syntax for this kind of loads and stores.

Implement missing basic functions

Basic functions which are missing:

fmod
remainder
fdim
clip
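
As a scalar reference for the semantics, the four functions can be written without branches, which is roughly the shape a lane-wise SIMD version would take. The `_ref` names are hypothetical, and the `fmod_ref` / `remainder_ref` formulas are naive: they lose accuracy for large quotients, so they only illustrate the definition.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical scalar reference sketches, not xsimd code.

// Positive difference: a - b when a > b, otherwise 0.
inline double fdim_ref(double a, double b)
{
    return std::max(a - b, 0.0);
}

// Clamps x to the closed interval [lo, hi].
inline double clip_ref(double x, double lo, double hi)
{
    assert(lo <= hi);
    return std::min(std::max(x, lo), hi);
}

// fmod uses the quotient truncated toward zero.
inline double fmod_ref(double a, double b)
{
    return a - std::trunc(a / b) * b;
}

// remainder uses the quotient rounded to the nearest integer
// (ties to even under the default rounding mode).
inline double remainder_ref(double a, double b)
{
    return a - std::nearbyint(a / b) * b;
}
```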

Implement classification functions

Classification functions:

isfinite
isinf
isnan
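
These predicates map onto plain comparisons, which is what makes them easy to vectorize. Below is a scalar sketch of the comparison tricks with hypothetical `_ref` names; it assumes IEEE 754 semantics, so it would not be reliable under flags such as -ffast-math.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical scalar reference sketches, not xsimd code.

// NaN is the only value that compares unequal to itself.
inline bool isnan_ref(double x)
{
    return x != x;
}

// Infinite values have a magnitude equal to +infinity.
inline bool isinf_ref(double x)
{
    return std::fabs(x) == std::numeric_limits<double>::infinity();
}

// x - x is 0 for finite x and NaN for infinities and NaN.
inline bool isfinite_ref(double x)
{
    return (x - x) == 0.0;
}
```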

CMakeLists.txt compliant with Debian packaging

Following changes are required:

download gtest instead of using local installation
test installed package instead of local one
remove CMAKE_SIZEOF_VOID_P from xsimdConfigVersion.cmake

Implement logarithm functions

Logarithm functions:

log
log10
log2
log1p

Compiling Simple Test Program

Hello!

I tried compiling the simple test program you include in your documentation

#include <iostream>
#include "xsimd/xsimd.hpp"

namespace xs = xsimd;

int main(int argc, char* argv[])
{
    xs::batch<double, 4> a(1.5, 2.5, 3.5, 4.5);
    xs::batch<double, 4> b(2.5, 3.5, 4.5, 5.5);
    auto mean = (a + b) / 2;
    std::cout << mean << std::endl;
    return 0;
}

I compile using
g++ -std=c++14 -o out -I xsimd/include/ main.cpp
where gcc is version 5.4.1.

I get the following error

In file included from xsimd/types/xsimd_types_include.hpp:15:0,
                 from xsimd/types/xsimd_traits.hpp:12,
                 from xsimd/xsimd.hpp:14,
                 from main.cpp:2:
xsimd/types/xsimd_sse_int32.hpp: In function ‘xsimd::batch<int, 4ul> xsimd::select(const xsimd::batch_bool<int, 4ul>&, const xsimd::batch<int, 4ul>&, const xsimd::batch<int, 4ul>&)’:
xsimd/types/xsimd_sse_int32.hpp:441:70: error: ‘s’ was not declared in this scope
         return _mm_or_si128(_mm_and_si128(cond, a), _mm_andnot_si128(s, b));
                                                                      ^
In file included from xsimd/types/xsimd_types_include.hpp:16:0,
                 from xsimd/types/xsimd_traits.hpp:12,
                 from xsimd/xsimd.hpp:14,
                 from main.cpp:2:
xsimd/types/xsimd_sse_int64.hpp: In function ‘xsimd::batch<long int, 2ul> xsimd::select(const xsimd::batch_bool<long int, 2ul>&, const xsimd::batch<long int, 2ul>&, const xsimd::batch<long int, 2ul>&)’:
xsimd/types/xsimd_sse_int64.hpp:460:70: error: ‘s’ was not declared in this scope
         return _mm_or_si128(_mm_and_si128(cond, a), _mm_andnot_si128(s, b));
                                                                      ^

I'm sure it's something simple I'm missing.

Thanks!
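
Judging from the compiler output alone, the undeclared `s` looks like a leftover name for the mask argument, which is presumably meant to be `cond`. The scalar sketch below models the and / andnot / or blend idiom on a single 32-bit lane; `select_bits` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the SSE blend idiom on one 32-bit lane.
// `cond` is expected to be all ones (pick a) or all zeros (pick b).
inline std::uint32_t select_bits(std::uint32_t cond, std::uint32_t a, std::uint32_t b)
{
    // _mm_andnot_si128(x, y) computes (~x) & y, so its first operand
    // has to be the same mask that is passed to _mm_and_si128.
    return (cond & a) | (~cond & b);
}
```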

Wrong simd_return_type with complex

The issue is fully described here

operator<< and operator>> implementation

The current implementations of these operators rely on operator[] and operate on each element of a batch separately.
They should be refactored to use the specific intrinsics for better performance.

Test failures for old Intel arch

There are some small failures with loading 64bit integers from char for -arch=nocona

[ RUN      ] xsimd.api_load
1: lhs = 67305985 - rhs = 1
2: lhs = 134678021 - rhs = 2
load uchar   -> int64  : BAD
Nb diff  : 2 (100%)
1: lhs = 67305985 - rhs = 1
2: lhs = 134678021 - rhs = 2
loadu uchar  -> int64  : BAD
Nb diff  : 2 (100%)
/home/wolfv/Programs/xsimd/test/xsimd_api_test.cpp:70: Failure
Value of: res
  Actual: false
Expected: true
[  FAILED  ] xsimd.api_load (0 ms)

and

[ RUN      ] xsimd.sse_load
1: lhs = 67305985 - rhs = 1
2: lhs = 134678021 - rhs = 2
load uchar  -> int64  : BAD
Nb diff  : 2 (100%)

Use _mm256_cvttps_epi32 instead of _mm256_cvtps_epi32 everywhere

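cvttps truncates toward zero, whereas cvtps rounds according to the current rounding mode.

With truncation, converting 3/5 (0.6) to an integer yields 0, which matches the default float-to-integer conversion in C++.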
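
The difference between the two conversions can be reproduced in scalar C++. In this sketch `static_cast<int>` stands in for the truncating cvttps behavior and `std::lrint` for cvtps under the default round-to-nearest mode; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Truncates toward zero, like the cvttps family.
inline int to_int_truncate(float x)
{
    return static_cast<int>(x);
}

// Rounds using the current rounding mode (round-to-nearest-even by
// default), like the cvtps family.
inline int to_int_nearest(float x)
{
    return static_cast<int>(std::lrint(x));
}
```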

AVX512 plans?

Hello xsimd developers,

I was wondering if you have any timeline on AVX512 support?

I recently gained access to a Skylake-AVX512 Xeon server, and wanted to test out how the new instructions would fare on my xsimd-enhanced libraries. I went ahead and hacked together some very preliminary support for AVX512 in xsimd which is nevertheless sufficient for some initial experiments (single-precision floats, basic maths). Posting the link here, in case it is of any use:

master...bluescarni:avx512

(note that I am pretty much a newbie when it comes to SIMD programming, so probably there are inefficiencies and mistakes)

Thanks for the great library!

Remove need for -flax-vector-types on ARM/GCC

char loading to other batches currently makes use of lax conversion enabled by default on clang (apparently).
We should remove the need to enable this compiler flag on GCC so that it compiles without hassle (by using static casts where appropriate).

Implement transpose

We should implement a transpose operation to transpose NxN matrix blocks (where N is batch width).
The interface should probably look like the one for haddp, e.g. taking a pointer of rows.

template <class T, std::size_t N>
void transpose(batch<T, N>* rows)
{
    // ... in-place transpose ...
}
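
To pin down the intended semantics, here is a scalar reference sketch that uses `std::array<T, N>` as a stand-in for `batch<T, N>`. The name `transpose_ref` is hypothetical, and a real implementation would use shuffle/unpack intrinsics rather than element swaps.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Scalar reference: `rows` points to N rows of N elements each.
// std::array<T, N> stands in for batch<T, N>.
template <class T, std::size_t N>
void transpose_ref(std::array<T, N>* rows)
{
    for (std::size_t i = 0; i < N; ++i)
    {
        for (std::size_t j = i + 1; j < N; ++j)
        {
            // Swap the element above the diagonal with its mirror.
            std::swap(rows[i][j], rows[j][i]);
        }
    }
}
```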

Benchmark prints a "neon" result (presumably SSE) on an x86 machine

~~I think the title says it all :)~~

When running "make/ninja xbenchmark" on my Haswell-based machine, a pair of "neon" rows is present in the table of timings, even though this CPU certainly has no neon instructions available.

Add tests for haddp

Currently haddp is untested (and unused).

batch<uint32_t, ...> and batch <uint64_t, ...>

There is support for a batch of int32_t, but nothing for uint32_t. Same applies to uint64_t.

Implement trigonometric functions

Trigonometric functions:

Add unsigned char / signed char batches

@iamthebot finally ran into our missing support for char loading.

So we need to add support for this ASAP.

Could not find batch reducers on ReadTheDocs

Maybe that's because the search box is currently broken. But I did not find documentation for hadd and other batch reducers on ReadTheDocs, even though I expected it to be generated from this Doxygen.

NEON intrinsics for complex<float>

power, trigonometric and hyperbolic tests are failing for NEON instructions because of potential truncation.

Update documentation to include support for neon instruction set

mm_malloc.h missing on Raspberry Pi V2/B

We probably need to detect ARM or 32bit better.

scatter and gather instructions

It could be interesting to support Intel's scatter and gather instructions (however, scatter seems to be available only since AVX512).

is_nan vs isnan

In the STL it's std::isnan and it's also xsimd::isinf.
So is there a reason for xsimd::is_nan or would it make sense to rename it to xsimd::isnan to adhere to STL and unify the xsimd interface?

Add arange

SIMDified arange can give a ~4x improvement over std::iota and for loops.
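
One way such an arange could be structured, sketched here in scalar C++ with a `std::array` modelling a 4-lane float batch: the lanes hold four consecutive values and are advanced by a constant stride after each store. `arange_blocked` is a hypothetical name and the 4-lane width is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar model of a blocked arange; the inner loops over `k` are the
// parts a single SIMD store and a single SIMD add would replace.
inline std::vector<float> arange_blocked(float start, float step, std::size_t count)
{
    constexpr std::size_t width = 4; // assumed batch width
    std::vector<float> out(count);
    std::array<float, width> lanes;
    for (std::size_t k = 0; k < width; ++k)
    {
        lanes[k] = start + step * static_cast<float>(k);
    }
    const float stride = step * static_cast<float>(width);
    std::size_t i = 0;
    for (; i + width <= count; i += width)
    {
        for (std::size_t k = 0; k < width; ++k)
        {
            out[i + k] = lanes[k]; // one store per block in SIMD
            lanes[k] += stride;    // one add per block in SIMD
        }
    }
    // Remaining elements that do not fill a whole block.
    for (std::size_t k = 0; i < count; ++i, ++k)
    {
        out[i] = lanes[k];
    }
    return out;
}
```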

Unit tests should exercise constructors

So, I tried to integrate #100 in my project to see if it solves the problem it's intended to solve, and I discovered a bunch of issues related to constructors of batch and batch_bool. This means that we don't yet have unit tests for them.

Current batch_bool<float/double> equality operator on SSE/AVX wrong

We're using a __m256d floating point type to store the batch bool for float/double in SSE and AVX and do equality comparison using variants of mm_cmp_pd/ps(...)

The problem is that these functions treat NaN specially, because NaNs are unordered: a comparison predicate has to be chosen to get the desired result (e.g. an ordered comparison yields false whenever either operand is NaN).

The other fact is that a true value is represented by setting all bits to 1 – including the NaN indicating bit.

So currently, if you compare a batch_bool<double, 4> a(true) with itself (a == a), the result is filled with false, because two NaN values are being compared with each other.

Solution a) switch to an integer type inside of batch_bool. This also reduces the number of implementations (as we can share them for int/float and int64/double).
Just need to add a cast to or constructor from __m256d (as that's still the result type of the comparison of float/double batches).

Solution b) cast to int before comparison and cast back for storage.
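
The effect is easy to reproduce in scalar code: a double whose bits are all set is a NaN, so a floating-point comparison of that value with itself is false, while a comparison of the raw bits is true. The helper names in this sketch are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Builds the double whose 64 bits are all 1, i.e. the lane value that
// represents `true` in a floating-point mask register.
inline double all_bits_set_as_double()
{
    const std::uint64_t bits = ~std::uint64_t(0);
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// Floating-point equality: false whenever an operand is NaN.
inline bool equal_as_float(double a, double b)
{
    return a == b;
}

// Integer equality on the underlying bit patterns (solutions a and b).
inline bool equal_as_bits(double a, double b)
{
    std::uint64_t x;
    std::uint64_t y;
    std::memcpy(&x, &a, sizeof x);
    std::memcpy(&y, &b, sizeof y);
    return x == y;
}
```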

add a sincos method

Some of the mathematical functions need both the sine and the cosine of their argument. Since these functions share some steps in their computation (among them, the range reduction), a sincos method computing both sine and cosine in a single pass may improve performance.
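
A scalar sketch of the idea, assuming a simple reduction to the nearest multiple of pi/2: the reduction is performed once and both results are derived from it. `sincos_ref` is a hypothetical name, `std::sin` and `std::cos` on the reduced argument stand in for the polynomial kernels, and the naive reduction is only accurate for moderately sized arguments.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Returns {sin(x), cos(x)} using a single shared range reduction.
inline std::pair<double, double> sincos_ref(double x)
{
    const double half_pi = 1.57079632679489661923;
    // Shared step: write x = r + n * pi/2 with r in about [-pi/4, pi/4].
    const long long n = std::llround(x / half_pi);
    const double r = x - static_cast<double>(n) * half_pi;
    // Stand-ins for the sine and cosine kernels on the reduced range.
    const double s = std::sin(r);
    const double c = std::cos(r);
    // The quadrant decides how the two kernels are combined.
    switch (((n % 4) + 4) % 4)
    {
    case 0:
        return {s, c};
    case 1:
        return {c, -s};
    case 2:
        return {-s, -c};
    default:
        return {-c, s};
    }
}
```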

A version of select() with a compile-time mask would be nice

At least on x86, the fastest intrinsics for shuffling the contents of a vector or blending data from two vectors take an immediate operand, which must be a compile-time constant. So there would be a use case for a compile-time version of xsimd::select(), as it could use these faster instructions.
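
A possible shape for such an interface, sketched with `std::array` in place of a batch: the mask is a pack of `bool` template arguments, so an implementation could turn it into an immediate operand for a blend instruction. `select_const` is a hypothetical name and does not reflect an existing xsimd or bSIMD signature.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Picks a[i] where Mask[i] is true and b[i] otherwise. The mask is
// known at compile time, so a SIMD version could encode it as an
// immediate instead of loading a runtime mask register.
template <class T, bool... Mask>
std::array<T, sizeof...(Mask)> select_const(const std::array<T, sizeof...(Mask)>& a,
                                            const std::array<T, sizeof...(Mask)>& b)
{
    static_assert(sizeof...(Mask) > 0, "mask must not be empty");
    constexpr bool mask[] = { Mask... };
    std::array<T, sizeof...(Mask)> out{};
    for (std::size_t i = 0; i < sizeof...(Mask); ++i)
    {
        out[i] = mask[i] ? a[i] : b[i];
    }
    return out;
}
```

A call would then read `select_const<int, true, false, true, false>(a, b)` for two four-element inputs.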

An example of prior art for this is the shuffle() instruction family in bSIMD:

Implement hyperbolic functions

Hyperbolic functions:

Implement cumulative sum/prod ... using SIMD

We have a prototype for a cumulative sum using xsimd here: https://gist.github.com/jpivarski/b2a04778124e7dc790d87fcdfd399e1e/

We should work to integrate it into the library.
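
For orientation, here is a scalar model of one common way to vectorize a cumulative sum: an in-register inclusive scan built from shift-and-add steps, plus a carry between blocks. This is not the code from the gist; the 4-lane width and all names are assumptions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using lanes4 = std::array<float, 4>; // models a 4-lane float batch

// Shifts lanes toward higher indices by n, filling with zeros.
inline lanes4 shift_up(const lanes4& v, std::size_t n)
{
    lanes4 out{};
    for (std::size_t i = n; i < v.size(); ++i)
    {
        out[i] = v[i - n];
    }
    return out;
}

// Lane-wise addition.
inline lanes4 add(const lanes4& a, const lanes4& b)
{
    lanes4 out{};
    for (std::size_t i = 0; i < out.size(); ++i)
    {
        out[i] = a[i] + b[i];
    }
    return out;
}

// Inclusive prefix sum of one block in log2(4) = 2 shift-and-add steps.
inline lanes4 scan4(lanes4 v)
{
    v = add(v, shift_up(v, 1));
    v = add(v, shift_up(v, 2));
    return v;
}

// Cumulative sum over a whole vector, carrying the running total
// from one block to the next.
inline std::vector<float> cumsum_blocked(const std::vector<float>& in)
{
    std::vector<float> out(in.size());
    float carry = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= in.size(); i += 4)
    {
        const lanes4 v = scan4({{in[i], in[i + 1], in[i + 2], in[i + 3]}});
        for (std::size_t k = 0; k < 4; ++k)
        {
            out[i + k] = v[k] + carry;
        }
        carry = out[i + 3];
    }
    // Scalar tail for the elements that do not fill a block.
    for (; i < in.size(); ++i)
    {
        carry += in[i];
        out[i] = carry;
    }
    return out;
}
```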

Implement error and gamma functions

Error and gamma functions:

erf
erfc
tgamma
lgamma

Add batch_traits for batch_bool

Implement power functions

Power functions:

pow
cbrt
hypot

sqrt has its specific intrinsics.

Mathematical functions should accept integral type

Mathematical functions of xsimd should accept integral types when their standard counterparts do.

Implement exponential functions

Exponential functions:

exp
exp2
exp10
expm1

Testing fails on Debian

Using version 3.0.1 + fix for select, here is the output from the tests:

[==========] Running 22 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 22 tests from xsimd
[ RUN      ] xsimd.sse_float_trigonometric
[       OK ] xsimd.sse_float_trigonometric (9 ms)
[ RUN      ] xsimd.sse_double_trigonometric
[       OK ] xsimd.sse_double_trigonometric (9 ms)
[ RUN      ] xsimd.sse_float_rounding
/<<PKGBUILDDIR>>/test/xsimd_rounding_test.cpp:34: Failure
Value of: res
  Actual: false
Expected: true
[  FAILED  ] xsimd.sse_float_rounding (0 ms)
[ RUN      ] xsimd.sse_double_rounding
/<<PKGBUILDDIR>>/test/xsimd_rounding_test.cpp:41: Failure
Value of: res
  Actual: false
Expected: true
[  FAILED  ] xsimd.sse_double_rounding (0 ms)
[ RUN      ] xsimd.sse_float_power
[       OK ] xsimd.sse_float_power (6 ms)
[ RUN      ] xsimd.sse_double_power
[       OK ] xsimd.sse_double_power (4 ms)
[ RUN      ] xsimd.sse_float_hyperbolic
[       OK ] xsimd.sse_float_hyperbolic (8 ms)
[ RUN      ] xsimd.sse_double_hyperbolic
[       OK ] xsimd.sse_double_hyperbolic (5 ms)
[ RUN      ] xsimd.sse_float_fp_manipulation
[       OK ] xsimd.sse_float_fp_manipulation (0 ms)
[ RUN      ] xsimd.sse_double_fp_manipulation
[       OK ] xsimd.sse_double_fp_manipulation (0 ms)
[ RUN      ] xsimd.sse_float_exponential
/<<PKGBUILDDIR>>/test/xsimd_exponential_test.cpp:35: Failure
Value of: res
  Actual: false
Expected: true
[  FAILED  ] xsimd.sse_float_exponential (10 ms)
[ RUN      ] xsimd.sse_double_exponential
/<<PKGBUILDDIR>>/test/xsimd_exponential_test.cpp:42: Failure
Value of: res
  Actual: false
Expected: true
[  FAILED  ] xsimd.sse_double_exponential (7 ms)
[ RUN      ] xsimd.sse_float_error_gamma
/<<PKGBUILDDIR>>/test/xsimd_error_gamma_test.cpp:35: Failure
Value of: res
  Actual: false
Expected: true
[  FAILED  ] xsimd.sse_float_error_gamma (23 ms)
[ RUN      ] xsimd.sse_double_error_gamma
/<<PKGBUILDDIR>>/test/xsimd_error_gamma_test.cpp:42: Failure
Value of: res
  Actual: false
Expected: true
[  FAILED  ] xsimd.sse_double_error_gamma (13 ms)
[ RUN      ] xsimd.sse_float_basic_math
[       OK ] xsimd.sse_float_basic_math (0 ms)
[ RUN      ] xsimd.sse_double_basic_math
[       OK ] xsimd.sse_double_basic_math (0 ms)
[ RUN      ] xsimd.sse_float_basic
[       OK ] xsimd.sse_float_basic (0 ms)
[ RUN      ] xsimd.sse_double_basic
[       OK ] xsimd.sse_double_basic (0 ms)
[ RUN      ] xsimd.sse_int32_basic
[       OK ] xsimd.sse_int32_basic (0 ms)
[ RUN      ] xsimd.sse_int64_basic
[       OK ] xsimd.sse_int64_basic (0 ms)
[ RUN      ] xsimd.sse_conversion
[       OK ] xsimd.sse_conversion (0 ms)
[ RUN      ] xsimd.sse_cast
[       OK ] xsimd.sse_cast (0 ms)
[----------] 22 tests from xsimd (94 ms total)

[----------] Global test environment tear-down
[==========] 22 tests from 1 test case ran. (94 ms total)
[  PASSED  ] 16 tests.
[  FAILED  ] 6 tests, listed below:
[  FAILED  ] xsimd.sse_float_rounding
[  FAILED  ] xsimd.sse_double_rounding
[  FAILED  ] xsimd.sse_float_exponential
[  FAILED  ] xsimd.sse_double_exponential
[  FAILED  ] xsimd.sse_float_error_gamma
[  FAILED  ] xsimd.sse_double_error_gamma

 6 FAILED TESTS

Add constructors from initializer lists and helpers

A batch constructor taking an initializer list should be available. Helper functions that build batches from initializer lists without knowing their concrete type should be implemented too.

xsimd includes cause failure for specific arch

I included xsimd.hpp from cloning the repo, then going into includes, and making a main.cpp at the same path:

#include "xsimd.hpp"
#include <iostream>
int main(){
	std::cout << "Hello world.";
}

and it fails with the following error when I try to compile:

root@dc0238c74c63:/.../native/third_party/xsimd/include/xsimd# g++ main.cpp -o main
In file included from types/xsimd_types_include.hpp:22,
                 from types/xsimd_traits.hpp:14,
                 from xsimd.hpp:14,
                 from main.cpp:1:
types/xsimd_sse_int8.hpp: In static member function 'static xsimd::detail::sse_int8_batch_kernel<signed char>::batch_type xsimd::detail::batch_kernel<signed char, 16>::abs(const batch_type&)':
types/xsimd_sse_int8.hpp:590:32: error: '_mm_srai_epi8' was not declared in this scope
                 __m128i sign = _mm_srai_epi8(rhs, 31);
                                ^~~~~~~~~~~~~
types/xsimd_sse_int8.hpp:590:32: note: suggested alternative: '_mm_srai_epi32'
                 __m128i sign = _mm_srai_epi8(rhs, 31);
                                ^~~~~~~~~~~~~
                                _mm_srai_epi32

Here is my system information:

root@dc0238c74c63:/# cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 63
model name	: Intel(R) Xeon(R) CPU @ 2.30GHz
stepping	: 0
microcode	: 0x1
cpu MHz		: 2300.000
cache size	: 46080 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt
bugs		:
bogomips	: 4600.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 63
model name	: Intel(R) Xeon(R) CPU @ 2.30GHz
stepping	: 0
microcode	: 0x1
cpu MHz		: 2300.000
cache size	: 46080 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt
bugs		:
bogomips	: 4600.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

root@dc0238c74c63:/# gcc --version
gcc (Ubuntu 8.1.0-1ubuntu1) 8.1.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

root@dc0238c74c63:/# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.1 LTS
Release:	18.04
Codename:	bionic
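
As the error message indicates, there is no `_mm_srai_epi8` intrinsic. One possible workaround, sketched below as a scalar model of a single lane, is to take the unsigned minimum of the value and its wrapped negation, which corresponds to `_mm_sub_epi8` followed by `_mm_min_epu8` (both SSE2); with SSSE3, `_mm_abs_epi8` is available directly. The name `abs_i8_bits` is hypothetical and this is not necessarily the fix adopted in xsimd.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Scalar model of abs on one signed 8-bit lane, returning the result
// bit pattern as an unsigned byte.
inline std::uint8_t abs_i8_bits(std::int8_t x)
{
    const std::uint8_t u = static_cast<std::uint8_t>(x);
    // Wrapping negation, like _mm_sub_epi8(zero, x).
    const std::uint8_t neg = static_cast<std::uint8_t>(0u - u);
    // Unsigned minimum, like _mm_min_epu8(x, neg): the non-negative
    // one of x and -x always has the smaller unsigned value.
    return std::min(u, neg);
}
```

For the input -128 the result is the bit pattern 0x80, which wraps back to -128 when read as a signed byte.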

std::nextafter

For API completeness, it would be great to have support for nextafter.

Documentation search bar is broken

For some reason, the ReadTheDocs documentation does not have a working search index yet. That is sad, as it is a very important feature for finding functions in the docs when one doesn't know how the docs are organized.

P.S. This issue likely also affects other QuantStack projects using the same doc generation toolchain.