mikaelslevinsky / fasttransforms Goto Github PK

View Code? Open in Web Editor NEW

56.0 6.0 8.0 1.26 MB

:bullettrain_front: Fast orthogonal polynomial transforms :surfer:

License: MIT License

Makefile 0.52% C 98.94% BitBake 0.54%

legendre chebyshev jacobi laguerre ultraspherical spherical zernike proriol harmonics spin

fasttransforms's People

Contributors

Stargazers

Watchers

Forkers

dawese dlfivefifty sciencewiki ikaroruan brockklippenstein rethd eschnett mikaem

fasttransforms's Issues

Vectorized transforms with non-power-of-two degrees

This has been done for

spherical harmonics
triangular harmonics
disk harmonics
spin-weighted spherical harmonics

Add link to github page in docs

Add documentation

There is almost no documentation or even code comments to explain what the code does. For example, I'm trying to use c_execute_tri_lo2hi but there is no explanation what basis the lo should be in.

Fast trivial `cheb2jac`

julia> x = randn(100); @time cheb2jac(x, -1/2,-1/2);
  0.000522 seconds (18 allocations: 1.234 KiB)

julia> x = randn(1000); @time cheb2jac(x, -1/2,-1/2);
  0.025063 seconds (18 allocations: 8.297 KiB)

Change the vectorization paradigm

Single step decrements in Proriol/Zernike/spherical harmonic polynomial orders are currently done with a sweep of Givens rotations. Nominally, they come from the QR factorization of W = I±X or I-X^2. Multi-step decrements can be done with a Householder QR factorization of W^k for some integer k (c.f. https://arxiv.org/pdf/2302.08448.pdf with @dlfivefifty and @TSGut). It's reasonable then that k would be chosen such that the Householder vectors fit exactly into SIMD registers or small multiples of these to maximize throughput.

Spherical and Disk harmonic transforms segfault with an even number of columns

Reimplement code coverage

This used to work on Travis, but now that the free trial has been exhausted and the CI has mostly been migrated to GitHub Actions, the coverage flags make the tests extremely slow. If this can somehow be fixed, it may be reimplemented by the two environment variables COV=gcov and FT_COVERAGE=1, and post-success make coverage && bash <(curl -s https://codecov.io/bash).

Use runtime checks to dispatch on SIMD

A cross-compiler might be more advanced than the user's personal computer. Thus, cross-compiling should be done at the highest level of SIMD provided that runtime checks on cpuid can be implemented and have the least overhead.

Fast jac2jac with integer steps

julia> n = 1000; x = randn(n); @time FastTransforms.lib_jac2jac(x, 0, 0, 1, 1);
  0.021386 seconds (6 allocations: 8.109 KiB)

julia> n = 10_000; x = randn(n); @time FastTransforms.lib_jac2jac(x, 0, 0, 1, 1);
  0.676580 seconds (7 allocations: 78.344 KiB)

tests freeze on julia 1.4

tests freeze here

...
  [44cfe95a] Pkg 
  [de0858da] Printf 
  [3fa0cd96] REPL 
  [9a3f8284] Random 
  [ea8e919c] SHA 
  [9e88b42a] Serialization 
  [6462fe0b] Sockets 
  [2f01184e] SparseArrays 
  [10745b16] Statistics 
  [8dfed614] Test 
  [cf7118a7] UUIDs 
  [4ec0a83e] Unicode 
Test Summary:     | Pass  Total
Special functions |   19     19
Test Summary:       | Pass  Total
Chebyshev transform |  250    250
Test Summary:                         | Pass  Total
Fejér and Clenshaw--Curtis quadrature |    9      9

Unify execution model of triangular and disk harmonics

Right now, one has conditionals based on parameters, the other always executes the 1D column transforms.

Make not working on Mojave

[solver-mbook:~/Projects/FastTransforms] solver% export CC=gcc-8 && make
make lib
gcc-8 -Ofast  -march=native -mtune=native -I./src -I/usr/local/opt/openblas/include -I/usr/local/opt/fftw/include -shared -lm -fPIC -fopenmp src/transforms.c src/rotations.c src/permute.c src/drivers.c src/fftw.c -L/usr/local/opt/openblas/lib -L/usr/local/opt/fftw/lib -lm -lblas -lfftw3 -lfftw3_threads -o libfasttransforms.dylib
In file included from src/transforms.c:3:
src/ftinternal.h:4:10: fatal error: stdlib.h: No such file or directory
 #include <stdlib.h>
          ^~~~~~~~~~
compilation terminated.
In file included from /usr/local/opt/openblas/include/openblas_config.h:104,
                 from /usr/local/opt/openblas/include/cblas.h:5,
                 from src/fasttransforms.h:4,
                 from src/rotations.c:3:
/usr/local/Cellar/gcc/8.2.0/lib/gcc/8/gcc/x86_64-apple-darwin17.7.0/8.2.0/include-fixed/stdio.h:78:10: fatal error: _stdio.h: No such file or directory
 #include <_stdio.h>
          ^~~~~~~~~~
compilation terminated.
In file included from src/permute.c:3:
src/ftinternal.h:4:10: fatal error: stdlib.h: No such file or directory
 #include <stdlib.h>
          ^~~~~~~~~~
compilation terminated.
In file included from /usr/local/opt/openblas/include/openblas_config.h:104,
                 from /usr/local/opt/openblas/include/cblas.h:5,
                 from src/fasttransforms.h:4,
                 from src/drivers.c:3:
/usr/local/Cellar/gcc/8.2.0/lib/gcc/8/gcc/x86_64-apple-darwin17.7.0/8.2.0/include-fixed/stdio.h:78:10: fatal error: _stdio.h: No such file or directory
 #include <_stdio.h>
          ^~~~~~~~~~
compilation terminated.
In file included from /usr/local/opt/openblas/include/openblas_config.h:104,
                 from /usr/local/opt/openblas/include/cblas.h:5,
                 from src/fasttransforms.h:4,
                 from src/fftw.c:3:
/usr/local/Cellar/gcc/8.2.0/lib/gcc/8/gcc/x86_64-apple-darwin17.7.0/8.2.0/include-fixed/stdio.h:78:10: fatal error: _stdio.h: No such file or directory
 #include <_stdio.h>
          ^~~~~~~~~~
compilation terminated.
make[1]: *** [lib] Error 1
make: *** [all] Error 2

Unify ft_harmonic_plan and ft_tetrahedral_harmonic_plan

Add Travis Continuous Integration

CC @dawese.

Two parameter rectangularized disk polynomials

Dunkl-Xu polynomials with the weight (1-x^2-y^2)^β are written in terms of equal-parameter Jacobi polynomials. These can be generalized in two ways: with inequal-parameter Jacobi polynomials for a weight like (1-sqrt(x^2+y^2))^α(1+sqrt(x^2+y^2))^β; or with equal-parameter Jacobi polynomials with the quadratic transformations for a weight like (x^2+y^2)^α(1-x^2-y^2)^β. The latter, which I prefer due to their correspondence with generalized Zernike, are like 2D Konoplev OPs with a symmetric weight on [-1,1] with interior algebraic factor. @dlfivefifty

Use FT_NUM_THREADS

Directly using OpenMP's functions may lead to issues when linking with other libraries.

Remove |m| in disk2cfs documentation

m == |m| as m is non-negative. This could be confusing on this page:

https://mikaelslevinsky.github.io/FastTransforms/transforms.html

Inconsistency with FastTransforms.jl

n = 600
    S = KoornwinderTriangle(0.0,-0.5,-0.5)
    ff = (xy) -> P(18,8,0.,-0.5,-0.5,xy...)
    p = points(S,n)
    v = ff.(p)
    PP = plan_transform(S,v)
    v̂ = PP.duffyplan*v
    F = tridevec_trans(v̂)
    F̌ = FastTransforms.cheb2tri(F,0.,-0.5,-0.5)
    F̌₂ = PP.tri2cheb \ F
    norm(F̌ - F̌₂)
size(F)
F̃ = copy(F)
c_cheb2tri(c_plan_tri2cheb(size(F,1),0.0,-0.5,-0.5),F̃)

Support ARM architectures

Support vector fields on the sphere

Is there a performance improvement to vectorized kernels when aligning memory?

Add tests for blockbanded_source.c

Do not declare backups if Intel instruction sets are not available

There should be a macro that determines the user's API for the "most vectorized" driver routines.

Disk harmonic plan segfaults when `n=1`

I think I narrowed down the bug:

julia> using MultivariateOrthogonalPolynomials

julia> for _=1:10000
                  MultivariateOrthogonalPolynomials.CDisk2CxfPlan(2)
              end

julia> for _=1:10000
                  MultivariateOrthogonalPolynomials.CDisk2CxfPlan(3)
              end

julia> for _=1:10000
                  MultivariateOrthogonalPolynomials.CDisk2CxfPlan(3)
              end

julia> for _=1:10000
                  MultivariateOrthogonalPolynomials.CDisk2CxfPlan(3)
              end

julia> for _=1:10000
                  MultivariateOrthogonalPolynomials.CDisk2CxfPlan(3)
              end

julia> for _=1:10000
                  MultivariateOrthogonalPolynomials.CDisk2CxfPlan(4)
              end

julia> for _=1:10000
                  MultivariateOrthogonalPolynomials.CDisk2CxfPlan(4)
              end

julia> for _=1:10000
                  MultivariateOrthogonalPolynomials.CDisk2CxfPlan(4)
              end

julia> for _=1:10000
                  MultivariateOrthogonalPolynomials.CDisk2CxfPlan(4)
              end

julia> for _=1:10000
                  MultivariateOrthogonalPolynomials.CDisk2CxfPlan(5)
              end

julia> for _=1:10000
                  MultivariateOrthogonalPolynomials.CDisk2CxfPlan(5)
              end

julia> for _=1:10000
                  MultivariateOrthogonalPolynomials.CDisk2CxfPlan(5)
              end

julia> for _=1:10000
                  MultivariateOrthogonalPolynomials.CDisk2CxfPlan(1)
              end

julia> for _=1:10000
                  MultivariateOrthogonalPolynomials.CDisk2CxfPlan(1)
              end

julia> for _=1:10000
                  MultivariateOrthogonalPolynomials.CDisk2CxfPlan(1)
              end

signal (11): Segmentation fault: 11
in expression starting at no file:0
_ZN4llvm20AAResultsWrapperPass13runOnFunctionERNS_8FunctionE at /Users/solver/Projects/julia-1.1/usr/lib/libLLVM.dylib (unknown line)
_ZN4llvm13FPPassManager13runOnFunctionERNS_8FunctionE at /Users/solver/Projects/julia-1.1/usr/lib/libLLVM.dylib (unknown line)
_ZN4llvm13FPPassManager11runOnModuleERNS_6ModuleE at /Users/solver/Projects/julia-1.1/usr/lib/libLLVM.dylib (unknown line)
_ZN4llvm6legacy15PassManagerImpl3runERNS_6ModuleE at /Users/solver/Projects/julia-1.1/usr/lib/libLLVM.dylib (unknown line)

Add SIMD to `rectdisk2cheb`

Triangle transform broken for a,b,c = 0,0.5,-0.5

P = MultivariateOrthogonalPolynomials.c_plan_tri2cheb(4, 0.0, 0.5, -0.5)
F = zeros(4,4); F[1]=1;
MultivariateOrthogonalPolynomials.c_cheb2tri(P, F) # Fills F with NaN

Note that a,b,c = (0,-0.5,-0.5) works fine.

Use C99 Complex types if already defined

2D cheb2leg (or more generally, support "region")

FFTW allows specifying "regions", for multidimensional FFTs:

julia> X = randn(5,5,5);

julia> F = fft(X,(1,3));

julia> F[:,1,:] ≈ fft(X[:,1,:]) # 2D FFT acting on 1st and 3rd dimension
true

It turns out I need this feature for Legendre transforms.... at the moment it just does 1D transforms:

julia> X = randn(5,5);

julia> cheb2leg(X)[:,1] == cheb2leg(X[:,1])
true

I can add it to FastTransforms.jl, e.g., if I need a 2D Legendre I can do

cheb2leg(cheb2leg(X')')

but I'm curious if this could be SIMD-optimised (or multithreaded) in C to make it faster?

Add methods & tests for banded LU factorization solves

Add installation instructions

brew install gcc
brew install openblas
brew install fftw
export CC=gcc-8 && make

export OMP_NUM_THREADS=4
export VECLIB_MAXIMUM_THREADS=4

Combine DCTs (DSTs) and connection coefficients for faster (alternative) synthesis and analysis

Using an O(n² log n) instead of O(n²) precomputation, the colatitudinal DCTs and DSTs can be applied to the Legendre to Chebyshev connection coefficients to reduce the overall spherical harmonic synthesis and analysis to execute_sph_hi2lo followed by a sequence of cblas_dgemm and one longitudinal FFTW_HC2R.

Contrast this with the current four-step approach of execute_sph_hi2lo followed by a sequence of cblas_dtrmm and a sequence of FFTW_REDFT01 and FFTW_RODFT01, and finally the longitudinal FFTW_HC2R.

Support cone orthogonal polynomials

Sometimes columns are not stored contiguously.

I think assuming stride(A,1) == 1 is fine (that is, ColumnMajor())

Disk transforms on higher order tensors?

In preparation for Cylinder solves, in FastTransforms.jl I'm planning to add support for disk2cxf(A::Array{T,3}, α, β, 1) in analogy to fft(A::Matrix, 1), etc. I can do this all in Julia via views (I assume strided array views are compatible with FastTransforms) but perhaps doing this in C there's potential for better parallelisation?

Note the interface needs some thought. Way back, when I wrote the previous paragraph, I was thinking disk2cxf(A::Array{T,3}, α, β, 1) would act on the first two dimensions of the array but what if we want to act on the first and third? Or if we have tensor products of two disks do we want to write the transform that transforms both as disk2cxf(A::Array{T,4}, α, β, 1:2)? Perhaps a better interface is

disk2cxf(A::Array{T,3}, α, β, (1,2)) # single disk transform on first two dimensions
disk2cxf(A::Array{T,3}, α, β, (1,3)) # single disk transform on first and last dimension
disk2cxf(A::Array{T,4}, α, β, ((1,2), (3,4)) # tensor disk transform on first and second followed by 3rd and 4th dimension
disk2cxf(A::Array{T,4}, α, β) # same as above
disk2cxf(A::Array{T,3}, α, β, ((1,2), (2,3)) # Get lost!

You might wonder why not leave it to the "user" of FastTransforms.jl. There won't likely be any speedup and multithreading is a pipe-dream. I guess the argument is (1) consistent interface (2) consistent allocation-free behaviour in higher dimensions (3) preparation for MPIFastTransforms.jl or similar for distributed memory parallelisation (which we pretty much need for Cylinders or Cubes to do anything fun)

@ioannisPApapadopoulos

Submit to J. Open Source Software

This would make it easy to cite the package.

Add FFTW-based synthesis and analysis

Hi Michael,

It took a little while to get build on linux going. Here's suggestions for tweaks to library (not worth a pull request). I only tested on ubuntu 16.04.

Append to Make.inc:

ifeq ($(UNAME), Linux)
LDLIBS += -lm
CFLAGS += -I/usr/lib/openblas-base
LDFLAGS += -L/usr/lib/openblas-base
endif

You'll also need to add in the ifeq windows block:
UNAME := Windows

Also, you'll want to add instructions to type in the shell:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:.

Otherwise the executable can't find the .so

My tests now all run, although I don't understand the output.

PS It's not apparent from your docs how to do a simple spherical harmonic eval or projection onto a tensor product grid. Am I supposed to do sph to Fourier then do FFT2 on grid? I would have to guess the normalization, etc. An example of that would be very helpful.

PPS Good to see you at ICOSAHOM!