
Comments (90)

seanlaw commented on June 12, 2024

It appears that maybe we should consider implementing the six-step or eight-step FFT algorithm next, as it should have much better memory locality and is therefore "optimized". I'd expect this to be faster than our current sliding dot product. I'm not sure how/if any of the radix variants (shown at the same link above) will help.

seanlaw commented on June 12, 2024

@NimaSarajpoor In case I forget, OTFFT uses the MIT License, which is compatible with STUMPY's BSD3 License. We must keep the original MIT license even if our work derives from it. This answer gives us information on how we'll need to go about it. I just don't want to forget!

I think we'll simply have an fft.py file that contains all of our code and then we'll update our LICENSE file in a similar way to pandoc's LICENSE file to call out the exceptional fft.py file.

seanlaw commented on June 12, 2024

No, just keep it as-is. I figured that's what it was but wanted to make sure there wasn't something else that you were measuring that coincidentally fell onto the same line. It's surprising that we are that much worse from 2^2 to 2^12.

I was wondering if there is any opportunity to leverage vectorization. Code like:

for k in prange(n):
        kn = k * n
        for p in range(k + 1, n):
            i = k + p * n
            j = p + kn
            x[i], x[j] = x[j], x[i]

almost "look like" it might perform better if it were pure numpy functions (whether it's the whole thing or just the inner loop). I don't know that for a fact but it might be worth trying/thinking about. Then again, we should profile it

NimaSarajpoor commented on June 12, 2024

In addition to that, I just discovered that the [average] performance of MATLAB code gets better as the number of iterations increases (generally speaking).

So, if MATLAB uses FFTW then what might be happening is that MATLAB is creating FFTW "plans" in the background that allows it to improve the computation speed: https://www.fftw.org/fftw3_doc/Using-Plans.html

I see! yeah... that might be the reason. So, I am now thinking of looking at MATLAB-vs-Python comparison from two perspectives:

(1) I can use median instead of average running time, and I think that should give me a somewhat stable value for MATLAB code's performance. I will try it out.

(2) I can just consider the running time of the first run. My understanding is that a plan for fft(T) cannot be reused to compute the fft of a new time series (to boost its performance). Since we often care about computing fft(T) only once, it should be okay to just consider the running time of the first run.
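For concreteness, here is a minimal sketch of measuring both quantities, using scipy.fft.fft as a stand-in (the array size is arbitrary, just for illustration):

import time
import numpy as np
import scipy.fft

T = np.random.rand(2 ** 20)

# (2) the first, "cold" call (no cached plan / warm-up can help here)
tic = time.time()
scipy.fft.fft(T)
first_run = time.time() - tic

# (1) median over repeated "warm" calls
times = []
for _ in range(21):
    tic = time.time()
    scipy.fft.fft(T)
    times.append(time.time() - tic)

print(f"first run: {first_run:.6f} s, median warm run: {np.median(times):.6f} s")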

[WIP]


In the meantime, let's review the OTFFT 6-step FFT algorithm, which is for a time series T with length $N = 2^{2k}$. Setting n to $2^{k}$ (note that $n = \sqrt{N}$), the time series T can be seen as n chunks, each with length n.

6-step FFT
Step 1: Transpose blocks, i.e. x[:] = x.reshape(n,n).T.reshape(-1,)
Step 2: Apply the function _fft0 on each block
Step 3: Multiply elements by the twiddle factor, $W^{pk}_{N}$, which is $e^{-j\theta}$ with $\theta = \frac{2\pi}{N} \times (p \times k)$, where p and k are the row / column indices of the element when T is viewed as an n-by-n matrix (meaning, the p and k for the twiddle factor of x[idx] can be computed as follows: p = idx // n and k = idx % n)
Step 4: Transpose blocks, i.e. x[:] = x.reshape(n,n).T.reshape(-1,)
Step 5: Apply the function _fft0 on each block
Step 6: Transpose blocks, i.e. x[:] = x.reshape(n,n).T.reshape(-1,)
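Here is a pure-numpy sketch of the six steps, just to make the indexing concrete (verified against np.fft.fft; this is not the optimized in-place version discussed below):

import numpy as np

def six_step_fft(x):
    N = len(x)
    n = int(np.sqrt(N))  # assumes N == n * n with n a power of two
    x = x.reshape(n, n).T.copy()             # Step 1: transpose blocks
    x = np.fft.fft(x, axis=1)                # Step 2: FFT on each block
    p, k = np.mgrid[0:n, 0:n]
    x = x * np.exp(-2j * np.pi * p * k / N)  # Step 3: twiddle factors W_N^{pk}
    x = x.T.copy()                           # Step 4: transpose blocks
    x = np.fft.fft(x, axis=1)                # Step 5: FFT on each block
    return x.T.reshape(-1)                   # Step 6: transpose blocks

T = np.random.rand(2 ** 6).astype(complex)   # N = 2^{2k} with k = 3
np.testing.assert_almost_equal(six_step_fft(T), np.fft.fft(T))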

Note:
When T has length $2^{2k+1}$ $(= 2 \times 2^{2k})$, we will use the 8-step FFT. It does a preprocessing step followed by calling the 6-step FFT twice, once for the first half and once for the second half (each half has length $2^{2k}$).

Reminder:
When T has real-valued elements, RFFT first halves the length, then it calls 6-step FFT or 8-step FFT, depending on log2(len(T) / 2) being even or odd.
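As a concrete illustration of that halving step (the packing only; the post-processing that unpacks the result appears in the _rfft code shared later in this thread):

import numpy as np

T = np.random.rand(2 ** 10)  # len(T) = n, real-valued
x = T[::2] + 1j * T[1::2]    # even samples -> real part, odd -> imag; len(x) = n // 2
X = np.fft.fft(x)            # half-length complex FFT; here log2(n / 2) = 9 is odd,
                             # so the OTFFT version would dispatch to the 8-step algo
# a short post-processing pass then unpacks X into the n // 2 + 1 outputs of rfft(T)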


Regarding the transpose step:

In our implementation, we do:

@njit
def transpose_v1(x):
    n = int(np.sqrt(len(x)))
    for k in range(n):
        for p in range(k + 1, n):
            i = k + p * n
            j = p + k * n
            x[i], x[j] = x[j], x[i]

instead of

@njit
def transpose_v2(x):
    n = int(np.sqrt(len(x)))
    x[:] = x.reshape(n, n).T.copy().reshape(-1,)
    # note: `.copy()` is required. see: https://github.com/numba/numba/issues/5433

[figure: running time of transpose_v1 vs. transpose_v2]

So, our for-loop implementation should be okay. I am stopping here regarding the transpose steps. We can get back to the transpose step later if needed.

seanlaw commented on June 12, 2024

@NimaSarajpoor I just noticed that scipy.fft.rfft has a parameter called workers, which can perform FFT in parallel. I wonder if you could try that and see if it makes any difference?

seanlaw commented on June 12, 2024

@NimaSarajpoor I came across a more recent FFT implementation called OTFFT that claims to be faster than FFTW and has a more generous MIT license. However, I tried to implement the basic fft function in Python but haven't been able to get the same answer as scipy.fft.fft. Here's what I did (List-7: Final version of the Stockham Algorithm):

import math
import cmath
import numpy as np

def fft0(n, s, eo, x, y):
    if not math.log2(n).is_integer():  # Check if n is power of 2
        pass
    
    m = n // 2
    theta0 = 2 * math.pi / n
    if n == 1:
        if eo:
            for q in range(s):
                y[q] = x[q]
    else:
        for p in range(m):
            wp = complex(math.cos(p*theta0), -math.sin(p*theta0))
            for q in range(s):
                a = complex(x[q + s*(p + 0)])
                b = complex(x[q + s*(p + m)])
                y[q + s*(2*p + 0)] = a + b
                y[q + s*(2*p + 1)] = (a - b) * wp
        fft0(n//2, 2*s, not eo, y, x)


def fft(n, x):
    y = np.empty(n, dtype=complex)
    fft0(n, 1, False, x, y)
    for k in range(n):
        x[k] /= n

Would you mind taking a look? Maybe I messed up somewhere but I've been staring at it for too long and I'm not able to spot anything. Thanks in advance!

NimaSarajpoor commented on June 12, 2024

@seanlaw

I came across a more recent FFT implementation called OTFFT that claims to be faster than FFTW and has a more generous MIT license

Cool!

Would you mind taking a look?

Sure! Will take a look.


Also:
I have been trying scipy.fft.rfft / scipy.fft.fft. Also, as you mentioned before, I am using different numbers of workers, 1 vs os.cpu_count(). I haven't seen any improvement yet compared to stumpy.core.sliding_dot_product.

According to the scipy doc:

The workers argument specifies the maximum number of parallel jobs to split the FFT computation into. This will execute independent 1-D FFTs within x. So, x must be at least 2-D and the non-transformed axes must be large enough to split into chunks. If x is too small, fewer jobs may be used than requested.

I will test again and share the result and code for our future reference.

NimaSarajpoor commented on June 12, 2024

Also: I have been trying scipy.fft.rfft / scipy.fft.fft. Also, as you mentioned before, I am using different numbers of workers, 1 vs os.cpu_count(). I haven't seen any improvement yet compared to stumpy.core.sliding_dot_product.

According to the scipy doc:

The workers argument specifies the maximum number of parallel jobs to split the FFT computation into. This will execute independent 1-D FFTs within x. So, x must be at least 2-D and the non-transformed axes must be large enough to split into chunks. If x is too small, fewer jobs may be used than requested.

I will test again and share the result and code for our future reference.

Currently, core.sliding_dot_product uses the scipy.signal.convolve function. In what follows, the performance of core.sliding_dot_product is compared with some alternatives.

import numpy as np
import scipy.fft
from stumpy import core

sliding_dot_product_v0 = core.sliding_dot_product

def sliding_dot_product_v1(Q, T):  
    n = len(T)
    X = scipy.fft.rfft(T) * scipy.fft.rfft(np.flipud(Q), n=n)
    out = scipy.fft.irfft(X, n=n)
    
    return out[len(Q) - 1 :]


def sliding_dot_product_v2(Q, T):
    n = len(T)
    X = scipy.fft.rfft(T, workers=8) * scipy.fft.rfft(np.flipud(Q), n=n, workers=8)
    out = scipy.fft.irfft(X, n=n, workers=8)
    
    return out[len(Q) - 1 :]


def sliding_dot_product_v3(Q, T):
    n = len(T)
    X = np.fft.rfft(T) * np.fft.rfft(np.flipud(Q), n=n)
    out = np.fft.irfft(X, n=n)

    return out[len(Q) - 1 :]

And, this is the code for tracking the running time for different window sizes

import time

import numpy as np
from scipy.io import loadmat

n = 1_000_000
data = np.array(loadmat('./DAMP_data/mit_long_term_ecg14046.mat')['mit_long_term_ecg_14046'][0]).astype(np.float64)
T = data[:n]

t = []
for m in range(3, 5000):
    Q = T[:m]

    t1 = time.time()
    # `sliding_dot_product_function` stands for each of the versions above in turn
    comp = sliding_dot_product_function(Q, T)
    t2 = time.time()

    t.append(t2 - t1)

[figure: running time vs. query length m for v0–v3]

As observed:

  • The functions sliding_dot_product_v1 and sliding_dot_product_v2 both use scipy rfft and are the same except for the number of workers. As expected, their performances are close to each other; the number of workers only affects performance when the input is 2-D.

  • The performance of sliding_dot_product_v3 (using numpy rfft) is close to the existing version sliding_dot_product_v0.
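For reference, a minimal sketch of the 2-D case where workers can actually help (the shape here is arbitrary, just for illustration):

import os

import numpy as np
import scipy.fft

X = np.random.rand(64, 2 ** 16)  # 64 independent rows to split across workers
Y = scipy.fft.rfft(X, axis=1, workers=os.cpu_count())
assert Y.shape == (64, 2 ** 16 // 2 + 1)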

seanlaw commented on June 12, 2024

Thanks @NimaSarajpoor. In case it matters (and if you're not already doing this), it would make sense to test window sizes and/or time series lengths in powers of 2 rather than increments of 1.

NimaSarajpoor commented on June 12, 2024

Thanks @NimaSarajpoor. In case it matters (and if you're not already doing this), it would make sense to test window sizes and/or time series lengths in powers of 2 rather than increments of 1.

[figure: running time for powers-of-two lengths]

[Note]
According to the source code of scipy.fft.rfft, we can use the parameter n to pad the input Q with zeros to make its length the same as the length of T. So, if the length of T is a power of two, I think the length of the query does not need to be a power of two. I am going to provide the performance of the sliding dot product functions for queries with lengths in range(4, 1025):

[figure: running time vs. query length in range(4, 1025)]

[Update] Correction: the x-axis label in the bottom figure should read "the length of query".
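A minimal illustration of that padding behavior:

import numpy as np
import scipy.fft

T = np.random.rand(2 ** 10)
Q = T[:37]                       # query length is not a power of two
X = scipy.fft.rfft(Q, n=len(T))  # Q is zero-padded to len(T) internally
assert X.shape == (len(T) // 2 + 1,)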

NimaSarajpoor commented on June 12, 2024

@seanlaw

@NimaSarajpoor I came across a more recent FFT implementation called OTFFT that claims to be faster than FFTW and has a more generous MIT license. However, I tried to implement the basic fft function in Python but haven't been able to get the same answer as scipy.fft.fft. Here's what I did (List-7: Final version of the Stockham Algorithm):

import math
import cmath
import numpy as np

def fft0(n, s, eo, x, y):
    if not math.log2(n).is_integer():  # Check if n is power of 2
        pass
    
    m = n // 2
    theta0 = 2 * math.pi / n
    if n == 1:
        if eo:
            for q in range(s):
                y[q] = x[q]
    else:
        for p in range(m):
            wp = complex(math.cos(p*theta0), -math.sin(p*theta0))
            for q in range(s):
                a = complex(x[q + s*(p + 0)])
                b = complex(x[q + s*(p + m)])
                y[q + s*(2*p + 0)] = a + b
                y[q + s*(2*p + 1)] = (a - b) * wp
        fft0(n//2, 2*s, not eo, y, x)


def fft(n, x):
    y = np.empty(n, dtype=complex)
    fft0(n, 1, False, x, y)
    for k in range(n):
        x[k] /= n

Would you mind taking a look? Maybe I messed up somewhere but I've been staring at it for too long and I'm not able to spot anything. Thanks in advance!

It turns out that x holds the correct output if we just avoid dividing it by n.

def fft0(n, s, eo, x, y):
    if not math.log2(n).is_integer():  # Check if n is power of 2
        pass
    
    m = n // 2
    theta0 = 2 * math.pi / n
    if n == 1:
        if eo:
            for q in range(s):
                y[q] = x[q]
    else:
        for p in range(m):
            wp = complex(math.cos(p*theta0), -math.sin(p*theta0))
            for q in range(s):
                a = complex(x[q + s*(p + 0)])
                b = complex(x[q + s*(p + m)])
                y[q + s*(2*p + 0)] = a + b
                y[q + s*(2*p + 1)] = (a - b) * wp
        fft0(n//2, 2*s, not eo, y, x)


# I swapped the params of the `fft` function to make its signature similar to `scipy.fft.fft`
def fft(x, n):  
    y = np.empty(n, dtype=complex)
    fft0(n, 1, False, x, y)
    
    return x

And, to test it:

import numpy as np
import scipy.fft

for power in range(1, 10):
    n = 2 ** power
    x = np.random.rand(n).astype(complex)
   
    ref = scipy.fft.fft(x)
    np.testing.assert_almost_equal(ref, fft(x, n))

seanlaw commented on June 12, 2024

It turns out that x holds the correct output if we just avoid dividing it by n.

Hmmm, I wonder why they performed the division?! Thanks for figuring it out. I just ported it over blindly without trying to understand it 🤣.

How about the ifft?

def ifft(n, x):
    for p in range(n):
        x[p] = x[p].conjugate()
    y = np.empty(n, dtype=complex)
    fft0(n, 1, False, x, y)
    # for k in range(n):
    #     x[k] = x[k].conjugate()

This doesn't seem to match scipy.fft.ifft either.

Would you mind doing a performance comparison if you are able to crack this?

NimaSarajpoor commented on June 12, 2024

@seanlaw

It turns out that x holds the correct output if we just avoid dividing it by n.

Hmmm, I wonder why they performed the division?! Thanks for figuring it out. I just ported it over blindly without trying to understand it 🤣.

How about the ifft?

def ifft(n, x):
    for p in range(n):
        x[p] = x[p].conjugate()
    y = np.empty(n, dtype=complex)
    fft0(n, 1, False, x, y)
    # for k in range(n):
    #     x[k] = x[k].conjugate()

This doesn't seem to match scipy.fft.ifft either.

Would you mind doing a performance comparison if you are able to crack this?

This should work:

def _ifft(x):
    n = len(x)   # assuming `n` is power of two 
    x[:] = np.conjugate(x)
    y = np.empty(n, dtype=np.complex128)
    fft0(n, 1, False, x, y)
    
    return np.conjugate(x / n)
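And a quick sanity check mirroring the fft test above (assuming fft0 from the earlier comment is in scope):

import numpy as np
import scipy.fft

for power in range(1, 10):
    n = 2 ** power
    x = np.random.rand(n).astype(complex)

    ref = scipy.fft.ifft(x)
    np.testing.assert_almost_equal(ref, _ifft(x))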

I am working on some minor changes to boost the performance. I will share the performance of both the original and the enhanced versions and compare them with core.sliding_dot_product.

NimaSarajpoor commented on June 12, 2024

For now, I made some enhancements to the new fft / ifft functions suggested in #938 (comment).


Part (I):
I show the performance of five versions against the performance of our reference version, i.e. core.sliding_dot_product. The output of each version is tested to make sure that the function works correctly. The length of the time series T is $2^{15}$, and the length of the query is in range(10, 1000 + 10, 10).

These are the descriptions of the five versions:

v0 --> use the new functions fft and ifft 

v1 --> v0 + Reused the already-allocated memory `y`

v2 --> v1 +  Converted the inner for-loop of `fft` function to a numpy vectorized operation 

v3 -->  v2 + Added njit decorator with `fastmath=True`

v4 --> v3 + Parallelized the outer for-loop of `fft` function
[figure: v0–v4 vs. the reference core.sliding_dot_product]

For a time series with length $2^{15}$, it seems that there is not much difference between v0 and v1. The changes in v2 and v3 seem to be very effective. How about v4? To better demonstrate its impact, the figure below shows the performance of v3, v4, and the reference only.

[figure: v3, v4, and the reference]

And, we can zoom in further by removing v3 from the figure. Then, we will see:

[figure: v4 vs. the reference]

Part (II):
We now show how v4 performs against the ref (i.e. core.sliding_dot_product) for different lengths of the time series T.

[figure: v4 vs. the reference for different time series lengths]

As observed, the performance gap becomes bigger as the length of the time series increases.


The code is available in the notebook pushed to this PR #939.

Next steps:
(1) Make the code cleaner.
(2) Profile the function to see where that increase in the performance gap comes from.
(3) Optimize accordingly.

@seanlaw
What do you think?
Also: If I need to add/revise a step, please let me know.

NimaSarajpoor commented on June 12, 2024

If I understand correctly, the Stockham algorithm is NOT faster than scipy.signal.convolve (or they are about the same after some optimizations). Is that correct? And it also means that the Stockham algo is much slower than FFTW?

Yes. After some initial optimizations, I can see that the Stockham algorithm (from OTFFT) is slower. So, as you mentioned:

  • The Python Stockham algo (from OTFFT) is slower than scipy.signal.convolve.
  • scipy.signal.convolve is slower than MATLAB FFTW.

I wonder if there might be some clues in the OTFFT source code in terms of how they might have parallelized it using OpenMP. I'd be pretty happy if we could get within 2x (slower) than FFTW.

I will try to go through it to get some ideas. I need to run the tests on the MATLAB online server if we want to consider MATLAB FFTW as our benchmark.

I'm confused as to why using prange wouldn't give you the necessary speedup but that also depends on the hardware that you are using.

I think it gave us some boost. See the figure below...
v4 is the same as v3 except that it uses prange.

[figure: v3 vs. v4 running time]

numba.get_num_threads() is 8 on my macOS system.

NimaSarajpoor commented on June 12, 2024

we should consider implementing the six step or eight step FFT algorithm next as it should have much better memory locality and is therefore "optimized"

I have implemented the six-step FFT. I will work on the eight-step FFT algorithm.

[Note to self]
Before I forget, here are a couple of notes that I may need to revisit later:

  • numpy vectorized operation seems to increase the running time (?!)
  • Turning off parallelization may speed up the computation (?!)

seanlaw commented on June 12, 2024

I have implemented six-step-FFT. I will work on eight-step FFT algorithm.

In case it matters, section 6.3 might be relevant as it discusses Stockham and the 6-step algo. More importantly, it describes how/why cache memory is important.

seanlaw commented on June 12, 2024

should we expect the 6-step and 8-step implementation (without multithreading) be faster than the scipy.fft.fft ?

Based on these performance results I expect the njit version of the 6/8 step algo to be faster than scipy.fft.fft and, possibly, as fast as FFTW.

NimaSarajpoor commented on June 12, 2024

[Update]
I implemented the 8-step algo as well. I ran it on the MATLAB online server and got this:

[figure: 8-step algo performance on the MATLAB online server]

Based on these performance results I expect the njit version of the 6/8 step algo to be faster than scipy.fft.fft and, possibly, as fast as FFTW.

I will go through section 6.3 and check the algo again to better understand it, and see if I am doing something wrong in my code. I may start with a notebook, and implement each function in one cell to go through them together with you, @seanlaw.

Also, I need to take a look at the OTFFT source code.

NimaSarajpoor commented on June 12, 2024

This is 8-step with njit and parallel=True?

Yes, and yes

Any idea if the 8-step is faster than scipy?

I think scipy.fft is slightly faster than my current implementation of the 6-step / 8-step algo on my Mac.

Note that the 8-step is for $2^{odd}$ and the 6-step is for $2^{even}$ (btw, the former calls the latter twice). Instead of the sliding dot product, I am going to show the performance of the fft function itself. I tested the performance for time series T with lengths that are powers of two, from $2^{1}$ to $2^{24}$.

See figure below. ref is scipy.fft.fft, and comp is our fft function, with njit and parallel=True, which calls 6-step algo when the power is even, and calls 8-step, when the power is odd.

[figure: comp (our fft) vs. ref (scipy.fft.fft)]

[Update-1]
After reducing the number of redundant arithmetic operations... I got this:

[figure: comp vs. ref after reducing redundant arithmetic operations]

[Update-2]
I made a few more minor enhancements. Also, I am now checking the performance of the fft of T with length up to $2^{27}$. Furthermore, I now run each case 21 times and take the median as the running time.

[figure: comp vs. ref, median of 21 runs, lengths up to 2^27]

NimaSarajpoor commented on June 12, 2024

It's surprising that it jumps so dramatically from 2^24 to 2^27. I wonder if FFTW suffers from this too or if it still scales linearly. Technically, this algo should be n*logn in computational complexity.

I do not know what is happening there. I wanted to try 2^28 but I noticed it takes too much time, so I just stopped at 2^27. I will check the behaviour of MATLAB FFTW for large-size inputs.


As suggested by @seanlaw in #939 (comment), we are going to take advantage of rfft since the input is real-valued data. Based on the algo provided in 2.6.2 and summarized in #939 (comment), I implemented an initial version of fft that takes advantage of rfft.

In the following figure, ref is scipy.fft.fft, and comp is a njit with parallel=True function that takes advantage of rfft

[figure: comp (njit, parallel, rfft-based fft) vs. ref (scipy.fft.fft)]

I will work on finding opportunities to improve the performance.


According to #938 (comment), we might be able to just use rfft for computing the sliding dot product (see the function below). So, we do not need to construct the full fft. We just need to implement the equivalent of np.fft.irfft:

def sliding_dot_product_v3(Q, T):
    n = len(T)
    X = np.fft.rfft(T) * np.fft.rfft(np.flipud(Q), n=n)
    out = np.fft.irfft(X, n=n)

    return out[len(Q) - 1 :]

NimaSarajpoor commented on June 12, 2024

In case it matters, I also checked the performance of scipy.fft.fft using MATLAB online for the same input data used in my previous comment.

[figure: scipy.fft.fft performance on the MATLAB online server]

seanlaw commented on June 12, 2024

Also, the following figures confirm that six-step / eight-step FFT outperform FFTW in MATLAB

Amazing! Can you confirm whether MATLAB and 6/8-step Python are single threaded or multithreaded?

I created and pushed a new notebook to PR #939 , I am now thinking of adding functions from the link above to it, one by one. What do you think?

Please give me some time to review and respond

NimaSarajpoor commented on June 12, 2024

Can you confirm whether MATLAB and 6/8-step Python are single threaded or multithreaded?

Before I provide some answers, I should mention that I got slightly different results!

Short answer: They were multithreaded.

Long answer: They were multithreaded. I think it is important to set the number of threads. (I am not sure if I made a mistake before or if the online server behaved differently, but I think we should/can trust the following results this time, as I explicitly set the number of threads in both the MATLAB code and the Python code.)

To make sure I am not making any accidental mistake, I got the performance for three cases: N_THREADS in {1, 4, 8}

To manage threads:

N = 8  # NUMBER OF THREADS I WOULD LIKE TO USE

# in MATLAB
LASTN = maxNumCompThreads(N);

# in Python
import numba
numba.set_num_threads(N)

I first show the performance of FFT in MATLAB for the three cases:

[figure: MATLAB FFT with 1 / 4 / 8 threads]

And, now I show the performance of FFT in Python for the same three cases:

[figure: Python 6/8-step FFT with 1 / 4 / 8 threads]

So, we can see that the number of threads affects the performance.


Okay, let's compare MATLAB vs Python. To show this, I subtract the Python running time from its corresponding MATLAB running time. So a positive y-value means that the Python code is faster than its MATLAB counterpart.

[figure: MATLAB running time minus Python running time, 1 / 4 / 8 threads]

It seems that as the number of threads increases, the six-step / eight-step FFT shows better performance!

Btw, let's zoom in for range(5, 21)

[figure: zoomed in for range(5, 21)]

And MATLAB-vs-Python when we have 8 threads:

[figure: MATLAB vs. Python with 8 threads]

NimaSarajpoor commented on June 12, 2024

Okay... I tried to check the performance again. I ran each case 50 times and then took the average. The following cases are considered:

  • MATLAB fft
  • Python six-step / eight-step fft
  • Python six-step / eight-step rfft
  • Scipy fft
  • Scipy rfft

And, for each of the cases above, I considered different scenarios: 1 thread / 4 threads / 8 threads.

The results are provided below. I decided to not zoom in for each figure so that I can keep this comment easy to follow. In the following figures, 6fft refers to 6-step / 8-step fft


MATLAB

[figure: MATLAB fft with 1 / 4 / 8 threads]

Python vs MATLAB [with 1 thread]

[figure: Python vs. MATLAB with 1 thread]

When we have one thread only, scipy fft / rfft wins.

Python vs MATLAB [with 8 thread]

Note that scipy is not parallelized. But, just to be consistent, I added it to the figure for the sake of comparison.

[figure: Python vs. MATLAB with 8 threads]

SHOW the diff: Python vs MATLAB [with 8 thread]

In the following figure, the red plot shows the MATLAB running time minus the fft running time, and the blue plot shows the MATLAB running time minus the 6fft running time.

[figure: MATLAB minus fft (red) and MATLAB minus 6fft (blue), 8 threads]

[Update]
And if I zoom into the figure above to better see the comparison for time series with length <= 2^20, I get this:
[figure: zoomed-in comparison for lengths <= 2^20]

which shows that MATLAB FFT performs better when the length is <= 2^20

NimaSarajpoor commented on June 12, 2024

However, I think the majority of use cases will likely be in the 2^20 range (what do you think?)

Right! So, it is important to have good performance in that range.

Can you please provide the raw timing data in a table?

TL;DR: It does not look good 😄

Sure. The table is provided below. The numbers are rounded to seven decimal places. I also added columns to show the ratios. If it is too much, you may just take a look at the figure provided at the bottom of this comment. (The figure shows the 6fft / MATLAB ratio as well as the 6rfft / MATLAB ratio.)

| Log2 of len(T) | Python fft / MATLAB | Python rfft / MATLAB | MATLAB (s) | Python fft (s) | Python rfft (s) |
| --- | --- | --- | --- | --- | --- |
| 2 | 0.0327306 | 0.0265907 | 0.0004038 | 1.32E-05 | 1.07E-05 |
| 3 | 13.3032982 | 13.4683572 | 2.1E-06 | 2.77E-05 | 2.8E-05 |
| 4 | 31.7814838 | 31.6292407 | 1.7E-06 | 5.28E-05 | 5.25E-05 |
| 5 | 8.3517041 | 8.3672224 | 3.4E-06 | 2.82E-05 | 2.83E-05 |
| 6 | 5.0265435 | 5.2300492 | 1.04E-05 | 5.22E-05 | 5.43E-05 |
| 7 | 5.1188469 | 5.2070618 | 6E-06 | 3.07E-05 | 3.12E-05 |
| 8 | 4.7836901 | 4.7770128 | 1.21E-05 | 5.81E-05 | 5.8E-05 |
| 9 | 2.555058 | 2.4614294 | 1.43E-05 | 3.64E-05 | 3.51E-05 |
| 10 | 2.4825403 | 2.6721459 | 2.62E-05 | 6.5E-05 | 7E-05 |
| 11 | 1.49305 | 1.3197951 | 3.09E-05 | 4.61E-05 | 4.08E-05 |
| 12 | 2.3069466 | 2.1782147 | 4.06E-05 | 9.36E-05 | 8.83E-05 |
| 13 | 1.2975097 | 1.3393951 | 6.1E-05 | 7.92E-05 | 8.17E-05 |
| 14 | 1.8206379 | 1.7744369 | 9.97E-05 | 0.0001815 | 0.0001769 |
| 15 | 1.3474092 | 1.2914043 | 0.0001739 | 0.0002343 | 0.0002245 |
| 16 | 1.4363382 | 1.4223928 | 0.0003683 | 0.0005289 | 0.0005238 |
| 17 | 1.4568064 | 1.4602421 | 0.0006648 | 0.0009685 | 0.0009708 |
| 18 | 2.3318331 | 2.2741222 | 0.000938 | 0.0021872 | 0.002133 |
| 19 | 1.9355066 | 1.8724722 | 0.001927 | 0.0037296 | 0.0036082 |
| 20 | 1.7575101 | 2.1766148 | 0.0038541 | 0.0067736 | 0.0083889 |
| 21 | 0.3776821 | 0.5315373 | 0.0301298 | 0.0113795 | 0.0160151 |
| 22 | 0.5337859 | 0.6022327 | 0.0620284 | 0.0331099 | 0.0373555 |
| 23 | 0.4275656 | 0.5058924 | 0.160916 | 0.0688022 | 0.0814062 |
| 24 | 0.9009623 | 0.8916813 | 0.2904172 | 0.261655 | 0.2589596 |
| 25 | 0.7019051 | 0.5632563 | 0.4811759 | 0.3377398 | 0.2710254 |
| 26 | 1.0531359 | 0.9573159 | 1.1281815 | 1.1881284 | 1.0800261 |
| 27 | 0.6763958 | 0.6486571 | 2.3729286 | 1.6050389 | 1.5392169 |

[figure: 6fft / MATLAB and 6rfft / MATLAB running-time ratios]

Maybe I am not taking advantage of parallelization in the proper way. I am thinking of profiling the blocks of code in this function to see if that gives me some clue. Do you think that is a good approach?


[Update]
I think I need to dig into section 7 and then check the source code. I have not checked it yet.

NimaSarajpoor commented on June 12, 2024

Yes, the time can add up if we need to do millions+/billions+ of FFT calculations

If, at some point, our goal is to use this for DAMP, then we may actually do millions of FFT calculations.

For instance, for a time series with 1M data points, and say m = 50, we consider the following chunk sizes in one single backward process:

2^6  = 64
2^7  = 128
2^8  = 256
2^9  = 512
2^10 = 1024
2^11 = 2048
...

Even if I prune half of the data points (in the forward process), I still have 0.5M data points; and for each, I may need to run the backward process for several chunks. So, let's say I am very lucky and can do early-abandoning just in the first chunk, with size 64. If I take a look at the table above, I can see that the running time of the Python fft is about 4E-5 seconds more than the MATLAB one. Therefore,

$(0.5 \times 10^{6}) \times (4 \times 10^{-5}) = 20$

So, just the fft part can add 20 seconds to our running time? I might be wrong though. I can start using it in DAMP and then try tracking the performance of DAMP, but I think it is better to not mix things in this issue (?)

I have a question. Why is Python rfft not faster than Python fft?

Right!! I will check out the function and then run the test just for the Python fft and Python rfft.

If you can do some profiling (for various input lengths) without njit (i.e., purely in Python), then I think that is a good start. Then, ask the numba devs.

I think you should post it in the numba discussion (or maybe posting it as a numba GitHub issue may be better) to see if they could help us improve the performance from the LLVM level (as they can see how things get JIT compiled) and possibly at the prange level

Sure! I will start to do some profiling.

seanlaw commented on June 12, 2024

but I think it is better to not mix things in this issue (?)

Yes, let's not mix things yet

NimaSarajpoor commented on June 12, 2024

I wouldn't mix the two (i.e., I wouldn't modify those functions and, instead, create new functions that would be more direct)

Makes sense indeed!

if I understand correctly, you just need to split/divide n up into n_threads and so this feels like overkill). Would something simpler like this work?

I will try it. It is possible that we may get imbalanced parallelization this way. Note that g = idx + 1. So, as idx increases, the variable g increases, and the second for-loop, i.e. for k in range(0, min(n, n - g)), will have fewer iterations (computations). So, chunking into equally-sized ranges may not work well. I will need to test its performance.

Again, I think we should post our code to the numba Github issues to see if there are any obvious (big) things that can be changed in the previous code to speed it up

I will post this to the numba GitHub issues. I will provide them with the performance result as well.

I think it might be a good idea to allow the number of threads to be specified as a function parameter (with the default being n_threads = -1 (i.e., use ALL available threads and, otherwise, add a check to make sure we don't specify MORE than the available number of threads) OR we might have some heuristic approach that uses only a single thread for smaller lengths)).

Sure. The second part, i.e.

we might have some heuristic approach that uses only a single thread for smaller lengths

is a cool idea! Let's see if the Numba community can resolve the issue as you suggested. (If not, I will try this approach)

seanlaw commented on June 12, 2024

So, chunking into equally-sized ranges may not work well. I will need to test its performance.

Ahh, I see. I missed that when I was looking at it.

NimaSarajpoor commented on June 12, 2024

Again, I think we should post our code to the numba Github issues to see if there are any obvious (big) things that can be changed in the previous code to speed it up

I will post this to the numba GitHub issues. I will provide them with the performance result as well.

I created the post on Numba Discourse:
https://numba.discourse.group/t/where-to-optimize/2357

seanlaw commented on June 12, 2024

With FFT performance being so finicky (even on different hardware and OS platforms), I was thinking MAYBE that (if we do add OTFFT), we should give the user an easy way to switch back to the original FFT when needed. However, I'm not sure how to do this without having to check some config parameter every time we call FFT (i.e., I don't want to check an if/else statement for config.STUMPY_FFT_BACKEND every time we call FFT as this checking would add unnecessary time to, say, DAMP). Again, this is just a thought and maybe I'm over complicating things.
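One possible way to sidestep the per-call check might be to resolve the backend once at import time; a rough sketch (all names here are hypothetical, not STUMPY's actual config or API):

import scipy.fft

_FFT_BACKEND = "scipy"  # hypothetical config value, read once at import time

def _scipy_rfft(T):
    return scipy.fft.rfft(T)

def _otfft_rfft(T):
    # placeholder standing in for the OTFFT-based implementation
    return scipy.fft.rfft(T)

# one-time dispatch; hot loops call `rfft` with no per-call if/else
rfft = _otfft_rfft if _FFT_BACKEND == "otfft" else _scipy_rfft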

seanlaw commented on June 12, 2024

Thanks for pulling all of this together @NimaSarajpoor! I really appreciate the effort that you are going through. This is very exciting!

I'm wondering if there is a way to implement an efficient version for GPUs?

NimaSarajpoor commented on June 12, 2024

... but the "complex" part of the full fft is useless for STUMPY
... than MATLAB fft (even though it outputs the complex part).

Just to make sure we are on the same page, I would like to mention that the output returned by the real FFT (rfft) contains complex values. This rfft output is just the first half of full_fft(T). The second half is computed by "conjugating" the first half. Do we need the second half? See below.

from a STUMPY standpoint, I don't actually care about full_fft_with_rfft, right?

To compute the sliding dot product, one needs to compute something like ifft(F), where F = fft(A) * fft(B) (A and B are each 1-D arrays of real values). We can compute rF = rfft(A) * rfft(B) instead (which is basically the rfft version of F).

Note that we then need to expand rF to the full-length F first, and THEN pass it to the function ifft. Otherwise, we get the wrong result. In other words, ifft(rF) is NOT equivalent to ifft(F). Alternatively, we can do scipy.fft.irfft(rF, n=len(A)). So, if we can write a Numba-friendly function that has the same functionality as scipy.fft.irfft, then the answer to your question is: yes, we do not care. I need to think about how I can revise the algo provided in OTFFT to achieve irfft.
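A small sketch of that equivalence (circular convolution of two real arrays):

import numpy as np
import scipy.fft

rng = np.random.default_rng(0)
A = rng.random(16)
B = rng.random(16)

F = scipy.fft.fft(A) * scipy.fft.fft(B)      # full-length spectra
rF = scipy.fft.rfft(A) * scipy.fft.rfft(B)   # half-length spectra

np.testing.assert_almost_equal(
    scipy.fft.ifft(F).real,            # ifft of the full-length product
    scipy.fft.irfft(rF, n=len(A)),     # irfft of the half-length product
)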

This is interesting. It means that if one actually cared about the "complex" part then it would still be "faster" to simply compute the rfft and then expand to full_fft essentially "for free".

Exactly! I do not know how other implementations of fft work. But I would not be surprised if they check the type of the values in the array first and, if all values are real, leverage the rfft trick.

If possible (I haven't looked at the code yet), maybe we can add a fft function that adds the additional steps as some sort of wrapper around rfft?

Right! I have tried something similar for now but I need to make it more modular.

I agree! There are some parts (around 2^17 to 2^21) where, for some reason, OTFFT and SciPy are around 2x slower than MATLAB. Any idea why that might be?

That strange behaviour caught my attention too! I couldn't figure out why. Will think about it.

Does your current OTFFT implementation create new arrays every time? Or is there a way to implement it so that one or more arrays can be reused? I'm just thinking about DAMP where rfft would be called many times and we might be able to save some array-creation overhead time. For a single or few calls to rfft, it certainly doesn't matter. But for thousands or millions of calls, it can add up (maybe this is a numba optimization/compilation question?)

The current OTFFT implementation creates new arrays every time. If we want to consider your proposal in DAMP, then we need to create initial arrays of size 4, 8, 16, 32, ..., 2 ** pmax, or just create one large array of length 2 ** pmax and reuse its slices with different sizes (?!) I need to think more about it.

Thanks for pulling all of this together @NimaSarajpoor! I really appreciate the effort that you are going through. This is very exciting!

I know! Right? 😄 Thanks for the support! Appreciated!

I'm wondering if there is a way to implement an efficient version for GPUs?

Will do some search and think about it! 👍

NimaSarajpoor commented on June 12, 2024

Thus, our overall "complex" X array is half the length of the "complex" array used in full fft and then, when we do fft(X), there is less work to be done than if we do fft(T). Is that correct?

Yes, Exactly!

Btw, is this (along with post_process) what you are referring to as the "rfft trick"?

Maybe I used it with a different meaning before. But, in my recent comments like this one when I said:

"So far, we have seen that computing fft with rfft trick is faster than computing fft directly (i.e. applying OTFFT 6-step algo on full T directly). So, in what follows, we do NOT consider full_fft in our comparison. Since MATLAB compute full fft (i.e. the output has the same length as input T), we compare its performance against our full_fft_with_rfft case (and NOT just rfft in which the output has length len(T) // 2 + 1). Fortunately, as observed, there is a very little cost in computing full fft using rfft trick."

By rfft trick, I meant the following:

rfft + trick, where the rfft part is the function which has ITS OWN post_process part (see the rfft function provided in my earlier comment above), and the trick part is ANOTHER post-processing step.

So, if F = fft(T), the rfft function (which has its own post_processing part) gives me F[:(n//2 + 1)], and the trick (ANOTHER post_processing) takes F[:(n//2 + 1)] as input, and returns the full fft F.

The trick to convert array F[:(n//2 + 1)] to (full-length fft) F can be something like this:

# T is a real-valued array with length n

# part 1: rfft
y = rfft(T)
# the function `rfft` above has its own post_processing which is used inside
# the function. The output `y` has length `n // 2 + 1`

# part 2: trick (to convert it to the full-length fft)
out = np.empty(n, dtype=complex)   # full-length fft
out[: len(y)] = y
k_range = np.arange(len(y), n)
out[k_range] = np.conjugate(y[n - k_range])  # conjugate symmetry of a real input

# Done!

I vaguely remember seeing QT.real[m - 1 : n]

The sliding dot product QT is the "real" part of the output array returned by the inverse fft (not fft). Therefore, I say the following statement is correct:

BOTH the real and imaginary outputs from fft(X) are necessary

But you may feel a gap regarding the connection between FFT, IFFT, and that "real" part I mentioned above. The following note should help fill that gap:

QT is the output array returned by the Inverse FFT (IFFT). The IFFT algo proposed in OTFFT calls the FFT function!

So, the Inverse FFT takes an array (with dtype complex) as input and passes it to an FFT function. So, we know that we need to compute the FFT of a complex array at some point. In the FFT algo, we multiply the elements of the array by complex values. So, let's say an element of the array is 1 + 2j and a complex value is 3 + 4j. Then:

(1 + 2j) * (3 + 4j)  = (1 * 3 - 2 * 4) + j(1 * 4 + 2 * 3) 

So, the complex part affects the final result of the IFFT. For example, part of the "real" value comes from multiplying the imaginary values 2j * 4j, as shown above. So, QT.real is affected by BOTH the real and imaginary values in the FFT.

Perhaps, what I should really be asking is what the real and imaginary parts of fft(X) represent

I will need to dig deeper into the math behind FFT to see what the real / imaginary parts represent. But, as I said earlier, I don't think we can ignore the imaginary part of fft(T). I will provide another update once I better understand FFT.

To some extent, it would've been super helpful to have a scipy.fft.rconvolve (real-convolve), right?

Right! I think you meant scipy.signal.fftconvolve, which is used in stumpy/stumpy/core.py (lines 619 to 655 in 251da45):

def sliding_dot_product(Q, T):
    """
    Use FFT convolution to calculate the sliding window dot product.

    Parameters
    ----------
    Q : numpy.ndarray
        Query array or subsequence

    T : numpy.ndarray
        Time series or sequence

    Returns
    -------
    output : numpy.ndarray
        Sliding dot product between `Q` and `T`.

    Notes
    -----
    Calculate the sliding dot product

    `DOI: 10.1109/ICDM.2016.0179 \
    <https://www.cs.ucr.edu/~eamonn/PID4481997_extend_Matrix%20Profile_I.pdf>`__

    See Table I, Figure 4

    Following the inverse FFT, Fig. 4 states that only cells [m-1:n]
    contain valid dot products

    Padding is done automatically in fftconvolve step
    """
    n = T.shape[0]
    m = Q.shape[0]
    Qr = np.flipud(Q)  # Reverse/flip Q
    QT = convolve(Qr, T)

    return QT.real[m - 1 : n]

which is equivalent to ifft(fft(T) * fft(Qr)). So, we can wrap a function around it and say it is equivalent to scipy.signal.fftconvolve. (I am not sure though... maybe there is a different logic behind convolve. Need to check that.)

NimaSarajpoor commented on June 12, 2024

The IRFFT algo provided in the previous comment is based on this MATLAB code.

I then made some minor changes. For instance, I avoided doing np.conjugate in the following function

def _ifft(x):
    x = x.copy()
    n = len(x)
    x[:] = np.conjugate(x)
    x[:] = scipy.fft.fft(x)
    x[:] = np.conjugate(x / n)
    
    return x

and instead mixed/merged it into the fft function without increasing the number of operations. So, the function now looks like this:

def _ifft(x):
   fft_with_conjugate_effect_inplace(x)   

Also, from STUMPY's standpoint, we do not need to worry about overwriting the input x.

The Numba-friendly IRFFT code is provided here:
https://gist.github.com/NimaSarajpoor/dc2bfa6f231dce65e3969fb009a89168

This code is for the parallel (multi-threading) mode. For single-threaded mode, we just change prange to range (and remove parallel=True from the njit decorator).

I used the following code to get the performance:

% MATLAB
rng(1,"twister");
T = rand(1, 2^pmax);
save('T_input.mat', 'T')

# Python
import time

import numpy as np
import scipy.fft
from scipy.io import loadmat

running_time = []
T = np.array(loadmat('./T_input.mat')['T'][0]).astype(np.float64)
p_list = range(2, 28)
for p in p_list:
    idx = 2 ** p
    data = T[:idx]

    R = scipy.fft.rfft(data)
    R_origin = R.copy()

    _irfft(R)  # warm-up call (JIT compilation)
    total_time = 0
    for _ in range(20):
        R = R_origin.copy()
        tic = time.time()
        _irfft(R)
        toc = time.time()
        total_time += toc - tic
    running_time.append(total_time)

running_time_python_arr = np.array(running_time)
np.save('Feb4_numba_irfft_parallel.npy', running_time_python_arr)

I got the performance for the following cases:

  • MATLAB ifft (baseline)
  • Scipy irfft
  • Numba-Friendly irfft parallel mode
  • Numba-Friendly irfft single mode
[figure: MATLAB ifft vs. scipy irfft vs. Numba-friendly irfft (parallel / single)]

The Numba-friendly IRFFT (parallel + single) shows good performance except when the length of the time series is between 2^7 and 2^13.

@seanlaw
There is one more thing that I need to try to see how much it affects the performance. As you suggested before, we can allow the RFFT function to reuse an already-created array. We can use this technique here: RFFT creates an auxiliary array, and IRFFT creates an auxiliary array too. Instead of creating the auxiliary array twice, we can create it once and then pass it to both RFFT and IRFFT. From STUMPY's standpoint, we are going to use RFFT and IRFFT together to compute the sliding dot product, so this technique should reduce the running time of computing the sliding dot product.
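A minimal sketch of the idea, using the pure-Python _ifft from earlier (and assuming fft0 from the earlier comment is in scope): the auxiliary array y is created once by the caller and passed in, so RFFT and IRFFT could share the same buffer across many calls.

import numpy as np
import scipy.fft

def _ifft_ws(x, y):
    # same as `_ifft` above, but the scratch array `y` is caller-supplied
    n = len(x)
    x[:] = np.conjugate(x)
    fft0(n, 1, False, x, y)
    return np.conjugate(x / n)

n = 2 ** 10
y = np.empty(n, dtype=np.complex128)  # allocated once, reused by every call
x = np.random.rand(n) + 1j * np.random.rand(n)
np.testing.assert_almost_equal(scipy.fft.ifft(x), _ifft_ws(x.copy(), y))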

NimaSarajpoor commented on June 12, 2024

To improve the performance of the OTFFT rfft, I tried to use profila to detect hot spots in the code. It is Linux-only, and currently only single-threaded Numba can be profiled; parallel functions are not yet supported.

Also:
An example is provided by Itamar here


How does it work? A quick guide

We can utilize profila to determine how much different parts of a code contribute to its overall running time.

For instance, let's save the following code in the use_vector.py file:

import time

import numpy as np
import numba
from numba import njit

@njit
def func1(T):
    x = T[::2] + 1j * T[1::2]
    return x


seed = 0
np.random.seed(seed)
p = 25
T = np.random.rand(2 ** p)

func1(T)
start = time.time()
while time.time() - start < 1 * 60:
    func1(T)

And, let's load this file onto a Google Colab, and run the following:

# Google Colab
!sudo apt-get install -y gdb
!python3 -m pip install profila
!python -m profila annotate -- ./use_vector.py 

And we get this:

**Total samples:** 8515 (31.7% non-Numba samples, 9.6% bad samples)

/content/use_vector.py (lines 1 to 9):

  0.3% | import time
  0.2% | 
       | import numpy as np
       | import numba
       | from numba import njit
       | 
       | @njit
       | def func1(T):
 58.3% |     x = T[::2] + 1j * T[1::2]

From the top line, we can see that 31.7 + 9.6 (about 41%) of the time was spent running non-Numba code. Other than that, we can see that the line x = T[::2] + 1j * T[1::2] is associated with a high percentage, meaning that the program spends a lot of time running this part of the code.

Note 1:
It seems that we cannot use profila to COMPARE two different versions of a code, as the reported percentages may differ between runs and break an apples-to-apples comparison (so, I think we can just use it to detect computationally-heavy parts of the code). For instance, let's use a for-loop instead of a vectorized operation. Let's save the following code in the use_forloop.py file, followed by running it in Colab.

import time

import numpy as np
import numba
from numba import njit


@njit
def func1(T):
    n = len(T) // 2
    x = np.empty(n, dtype=np.complex_)
    for i in range(n):
        x[i] = T[2 * i] + 1j * T[2 * i + 1]
    return x


seed = 0
np.random.seed(seed)
p = 25
T = np.random.rand(2 ** p)

func1(T)
start = time.time()
while time.time() - start < 1 * 60:
    func1(T)

And, running it with profila will result in:

**Total samples:** 8135 (30.1% non-Numba samples, 11.9% bad samples)

/content/use_forloop.py (lines 12 to 13):

       |     for i in range(n):
 58.0% |         x[i] = T[2 * i] + 1j * T[2 * i + 1]

It is not clear why, in contrast to the previous case, we do not see the 0.3% and 0.2% lines anymore. It turns out that the for-loop version is in fact faster than the one with the vectorized operation. I ran the following script:

import time
import numpy as np

from use_vector import func1 as func1_vector
from use_forloop import func1 as func1_forloop


seed = 0
np.random.seed(seed)
n_iter = 10000
for p in range(2, 26):
    print('p =', p)
    T = np.random.rand(2 ** p)
    
    func1_vector(T)
    start = time.time()
    for _ in range(n_iter):
        func1_vector(T)
    t_vector = time.time() - start
    print('Vector:', t_vector)

    func1_forloop(T)
    start = time.time()
    for _ in range(n_iter):
        func1_forloop(T)
    t_forloop = time.time() - start
    print('For loop:', t_forloop)

    print('Speedup (%):', (t_vector - t_forloop) / t_vector * 100)
    print('----------------------------------------')

print('Done!')

and got the following output:

p = 2
Vector: 0.0038788318634033203
For loop: 0.0037953853607177734
Speedup (%): 2.1513307517364315
----------------------------------------
p = 3
Vector: 0.003984928131103516
For loop: 0.004025936126708984
Speedup (%): -1.0290774201268398
----------------------------------------
p = 4
Vector: 0.0039730072021484375
For loop: 0.003986358642578125
Speedup (%): -0.3360537686029765
----------------------------------------
p = 5
Vector: 0.003939151763916016
For loop: 0.004065990447998047
Speedup (%): -3.2199491586974944
----------------------------------------
p = 6
Vector: 0.004212379455566406
For loop: 0.0041120052337646484
Speedup (%): 2.382839031016527
----------------------------------------
p = 7
Vector: 0.003946065902709961
For loop: 0.003972768783569336
Speedup (%): -0.6766962721285723
----------------------------------------
p = 8
Vector: 0.004380941390991211
For loop: 0.003954887390136719
Speedup (%): 9.72517006802721
----------------------------------------
p = 9
Vector: 0.005097866058349609
For loop: 0.004839897155761719
Speedup (%): 5.060331119633337
----------------------------------------
p = 10
Vector: 0.006334066390991211
For loop: 0.0052988529205322266
Speedup (%): 16.34358414574472
----------------------------------------
p = 11
Vector: 0.008842229843139648
For loop: 0.007023811340332031
Speedup (%): 20.565157602394372
----------------------------------------
p = 12
Vector: 0.013345003128051758
For loop: 0.01079106330871582
Speedup (%): 19.137798581458917
----------------------------------------
p = 13
Vector: 0.022167205810546875
For loop: 0.017853260040283203
Speedup (%): 19.460936155567026
----------------------------------------
p = 14
Vector: 0.04245400428771973
For loop: 0.03231096267700195
Speedup (%): 23.891837250442254
----------------------------------------
p = 15
Vector: 0.07767105102539062
For loop: 0.06183910369873047
Speedup (%): 20.383330877658267
----------------------------------------
p = 16
Vector: 0.15003013610839844
For loop: 0.11719012260437012
Speedup (%): 21.888944685287125
----------------------------------------
p = 17
Vector: 0.29427504539489746
For loop: 0.22797513008117676
Speedup (%): 22.529914225227845
----------------------------------------
p = 18
Vector: 1.4237439632415771
For loop: 1.2148120403289795
Speedup (%): 14.674824147236551
----------------------------------------
p = 19
Vector: 2.7721738815307617
For loop: 2.4250600337982178
Speedup (%): 12.521359141471738
----------------------------------------
p = 20
Vector: 6.615902900695801
For loop: 6.305397987365723
Speedup (%): 4.69331122283282
----------------------------------------
p = 21
Vector: 14.823853015899658
For loop: 14.31436276435852
Speedup (%): 3.436962380797168
----------------------------------------
p = 22
Vector: 31.09589385986328
For loop: 29.736020803451538
Speedup (%): 4.373159564218175
----------------------------------------
p = 23
Vector: 63.45819973945618
For loop: 60.71899080276489
Speedup (%): 4.316556328319752
----------------------------------------
p = 24
Vector: 143.89739298820496
For loop: 138.86121082305908
Speedup (%): 3.499842534019141
----------------------------------------
p = 25
Vector: 288.8068096637726
For loop: 276.7986350059509
Speedup (%): 4.157857175113533
----------------------------------------
Done!

We can see that the for-loop approach is faster for p >= 10. In fact, in some cases, one can notice a 20% improvement in the running time.

Also, if you are interested in reading more about vectorized operations vs for-loops (in the presence of Numba), see this stackoverflow post


Now, let's use this to find hot spots in our OTFFT RFFT code as shared previously here. Note that this code uses a parallel function, which is NOT supported in profila. So, we will use single-threaded Numba. Note that the performance of both the single- and multi-threaded Numba code for our RFFT code is shown in this comment. As shown, we should focus on the bump in the performance when p is in range(17, 21), and that part is coming from the multi-threaded code. Still, improving the single-threaded Numba code might help us improve the performance of the multi-threaded code!

First, let's replace prange with range, and save it in offt_single_v1.py. We also add the following block of code at the end of the script:

seed = 0
np.random.seed(seed)
p = 20
T = np.random.rand(2 ** p)

_rfft(T)
start = time.time()
while time.time() - start < 5 * 60:
    _rfft(T)

Let's run it via profila and check out the output...

**Total samples:** 34233 (19.7% non-Numba samples, 8.8% bad samples)

/content/offt_single_v1.py (lines 1 to 171):

       | import math
       | import time
       | 
       | import numba
       | import scipy
       | import numpy as np
       | from numba import njit
       | 
       | 
       | @njit(fastmath=True)
       | def _fft0(n, s, eo, x, y):
       |     """
       |     A recursive function that is used as part of fft algorithm
       | 
       |     n : int
       |     s : int
       |     eo: bool
       |     x : numpy.array 1D
       |     y : numpy.array 1D
       |     """
       |     if n == 2:
       |         if eo:
       |             z = y
       |         else:
       |             z = x
       | 
  0.1% |         for i in range(s):
       |             j = i + s
  0.1% |             a = x[i]
  0.4% |             b = x[j]
  0.5% |             z[i] = a + b
  0.4% |             z[j] = a - b
       | 
       |     elif n >= 4:
       |         m = n // 2
       |         sm = s * m
       | 
  0.1% |         theta = math.pi / m
  0.6% |         c = math.cos(theta) - 1j * math.sin(theta)
       | 
       |         twiddle_factor = 1.0
  0.8% |         for p in range(m):
       |             sp = s * p
       |             two_sp = 2 * sp
  1.1% |             for q in range(s):
       |                 i = sp + q
       |                 j = i + sm
       | 
  0.9% |                 k = two_sp + q
 10.7% |                 y[k] = x[i] + x[j]
 12.0% |                 y[k + s] = (x[i] - x[j]) * twiddle_factor
       | 
  0.6% |             twiddle_factor = twiddle_factor * c
       | 
  0.2% |         _fft0(m, 2*s, not eo, y, x)
       | 
       |     else:
       |         pass
       | 
       | 
       | @njit(fastmath=True)
       | def _sixstep_fft(logtwo_n, x, y):
       |     N = 2 ** logtwo_n
       |     n = 2 ** int(logtwo_n // 2)  # basically, np.sqrt(N)
       | 
       |     for k in range(n):
  0.2% |         for p in range(k + 1, n):
       |             i = k + p * n
       |             j = p + k * n
  4.0% |             x[i], x[j] = x[j], x[i]
       | 
       |     for p in range(n):
       |         start = p * n
       |         _fft0(n, 1, False, x[start:], y[start:])
       | 
       |     theta_init = 2 * math.pi / N
       |     n_plus_1 = n + 1
       |     for p in range(n):
       |         theta0 = theta_init * p
       |         ppn = p * n_plus_1
       | 
       |         c = math.cos(theta0) - 1j * math.sin(theta0)
       |         w = math.cos(theta0 * p) - 1j * math.sin(theta0 * p)
  0.1% |         for alpha in range(0, n - p):
       |             i = ppn + alpha
       | 
       |             if alpha == 0:
       |                 x[i] = x[i] * w
       |             else:
       |                 j = ppn + alpha * n
  4.8% |                 x[j], x[i] = x[i] * w, x[j] * w
       | 
  0.1% |             w = w * c
       | 
       |     for k in range(n):
       |         start = k * n
       |         _fft0(n, 1, False, x[start:], y[start:])
       | 
       |     for k in range(n):
       |         kn = k * n
  0.1% |         for p in range(k + 1, n):
       |             i = k + p * n
       |             j = p + kn
  3.7% |             x[i], x[j] = x[j], x[i]
       | 
       | 
       | @njit(fastmath=True)
       | def _eightstep_fft(logtwo_n, x, y):
       |     n = 2 ** logtwo_n
       |     m = int (n // 2)
       | 
       |     theta0 = 2 * math.pi / n
  0.1% |     for i in range(m):
       |         theta = i * theta0
  5.8% |         wp = math.cos(theta) - 1j * math.sin(theta)
       | 
       |         j = i + m
  4.9% |         y[i] = x[i] + x[j]
  2.9% |         y[j] = (x[i] - x[j]) * wp
       | 
       |     _sixstep_fft(logtwo_n - 1, y, x)
       |     _sixstep_fft(logtwo_n - 1, y[m:], x[m:])
       | 
  0.1% |     for p in range(m):
       |         two_p = 2 * p
  0.6% |         x[two_p] = y[p]
  0.8% |         x[two_p + 1] = y[p + m]
       | 
       |     return
       | 
       | 
       | @njit(fastmath=True)
       | def _compute_fft(x, y):
       |     # only when len(x) is a power of two, and it is >= 2.
       |     n = len(x)
       |     logtwo_n = int(np.log2(n))
       | 
       |     if logtwo_n == 1:
       |         _fft0(n, 1, False, x, y)
       |     elif logtwo_n % 2 == 0:
       |         _sixstep_fft(logtwo_n, x, y)
       |     else:
       |         _eightstep_fft(logtwo_n, x, y)
       | 
       |     return
       | 
       | 
       | @njit(fastmath=True)
       | def _rfft(T):
       |     n = len(T)
       |     half_n = int(n // 2)
       |     n_rfft = half_n + 1
       |     y = np.empty(n_rfft, dtype=np.complex_)
       | 
  6.1% |     x = T[::2] + 1j * T[1::2]
       |     _compute_fft(x, y[:half_n])
       | 
       |     y[0] = x[0].real + x[0].imag
       |     y[n // 4] = x[n // 4].conjugate()
       |     y[half_n] = x[0].real - x[0].imag
       | 
       |     theta0 = math.pi / half_n
       | 
  0.1% |     for k in range(1, n // 4):
  0.1% |         c = x[half_n - k].conjugate()
       |         theta = theta0 * k
  7.6% |         val = 0.5 * (x[k] - c) * (1 + math.sin(theta) + 1j * math.cos(theta))
  0.4% |         y[k] = x[k] - val
  0.2% |         y[half_n - k] = x[half_n - k] + val.conjugate()
       | 
       |     return y

Now we can see which parts contribute the most to the total running time. I am noticing the following parts:

# in function _fft0
 10.7% |                 y[k] = x[i] + x[j]
 12.0% |                 y[k + s] = (x[i] - x[j]) * twiddle_factor


# in function _rfft
 7.6% |         val = 0.5 * (x[k] - c) * (1 + math.sin(theta) + 1j * math.cos(theta))


# in function _rfft
  6.1% |     x = T[::2] + 1j * T[1::2]


# in function _eightstep_fft
  5.8% |         wp = math.cos(theta) - 1j * math.sin(theta)

As explained here, the compiled code is different from the source code; in other words, the compiler can combine several statements. So, a percentage (of running time) shown next to a line does not necessarily reflect the contribution of that single line alone; it can reflect the contribution of several (surrounding) lines together. Still, it gives us a rough idea of where to focus our attention.

Let's see what we can do....

Item 1:

# in function _fft0
 10.7% |                 y[k] = x[i] + x[j]
 12.0% |                 y[k + s] = (x[i] - x[j]) * twiddle_factor

Going to skip this one as it is not clear what can be done to reduce the computing time here.

Item 2:

# in function _rfft
 7.6% |         val = 0.5 * (x[k] - c) * (1 + math.sin(theta) + 1j * math.cos(theta))

Here, theta = theta0 * k, where k changes in the for-loop for k in range(1, n // 4). Let's start by computing an initial value for math.sin(theta) + 1j * math.cos(theta) and then updating it in each iteration.

Note that $sin(\theta) = cos(\pi / 2 - \theta)$ and $cos(\theta) = sin(\pi / 2 - \theta)$. We also know that $e^{j\theta} = cos(\theta) + j sin(\theta)$. Therefore, the expression math.sin(theta) + 1j * math.cos(theta) is equivalent to $e^{j(\frac{\pi}{2} - \theta)}$. Hence, we can write it as $e^{j\frac{\pi}{2}} \times e^{-j\theta}$, which is equal to $1j \cdot e^{-j\theta}$. Therefore, in each iteration, we basically multiply the previously-computed value by $e^{-j\theta_0}$, which lets us update the value incrementally.
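To make the recurrence concrete, here is a minimal, self-contained sketch (the helper name and the final assertion are mine, not part of the linked versions): we seed w with sin(0) + 1j * cos(0) == 1j and multiply by e^{-j*theta0} in each iteration instead of calling math.sin / math.cos.

import math

import numpy as np


def _twiddle_update_demo(n, theta0):
    direct = np.empty(n, dtype=np.complex128)
    incremental = np.empty(n, dtype=np.complex128)

    step = math.cos(theta0) - 1j * math.sin(theta0)  # e^{-j * theta0}
    w = 1j  # sin(0) + 1j * cos(0)
    for k in range(n):
        theta = theta0 * k
        direct[k] = math.sin(theta) + 1j * math.cos(theta)  # original form
        incremental[k] = w  # updated form
        w = w * step

    return direct, incremental


direct, incremental = _twiddle_update_demo(16, math.pi / 32)
np.testing.assert_allclose(direct, incremental)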

Item 3

  6.1% |     x = T[::2] + 1j * T[1::2]

We can replace it with a for-loop.
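A minimal sketch of the loop form (the function name is mine; under Numba, the explicit loop avoids the temporary arrays created by the strided expression):

import numpy as np
from numba import njit


@njit(fastmath=True)
def _pack_complex(T):
    half = len(T) // 2
    x = np.empty(half, dtype=np.complex128)
    for i in range(half):
        x[i] = T[2 * i] + 1j * T[2 * i + 1]  # i.e., T[::2] + 1j * T[1::2]

    return x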

Item 4

# in function _eightstep_fft
  5.8% |         wp = math.cos(theta) - 1j * math.sin(theta)

Note that the expression math.cos(theta) - 1j * math.sin(theta) is equivalent to $e^{-j\theta}$. We can use the same trick we used in Item 2 above.

We call this v2. In addition, I noticed c = math.cos(theta) - 1j * math.sin(theta) in the function _fft0. So, on top of v2, I changed the code to update c in each call of _fft0 rather than computing it each time; we call this v3. You can get the three versions here:

I then ran the following code to check their performance:

import time
import numpy as np

from offt_single_v1 import rfft as rfft1_v1
from offt_single_v2 import rfft as rfft1_v2
from offt_single_v3 import rfft as rfft1_v3
import scipy

seed = 0
np.random.seed(seed)
pmax = 20
data = np.random.rand(2 ** pmax)

n_iter = 10000
for p in range(2, pmax + 1):
    print('-----------------------------------')
    n = 2 ** p
    T = data[:n]
    ref = scipy.fft.rfft(T)
    
    ##########################################
    comp1 = rfft1_v1(T)
    np.testing.assert_almost_equal(ref, comp1)

    start = time.time()
    for _ in range(n_iter):
        rfft1_v1(T)
    end = time.time()   
    t1 = end - start
    print(f"p = {p}, time = {t1:.6f} s")
    ##########################################
    comp2 = rfft1_v2(T)
    np.testing.assert_almost_equal(ref, comp2)

    start = time.time()
    for _ in range(n_iter):
        rfft1_v2(T)
    end = time.time()
    t2 = end - start
    print(f"p = {p}, time = {t2:.6f} s")
    ##########################################
    comp3 = rfft1_v3(T)
    np.testing.assert_almost_equal(ref, comp3)

    start = time.time()
    for _ in range(n_iter):
        rfft1_v3(T)
    end = time.time()
    t3 = end - start
    print(f"p = {p}, time = {t3:.6f} s")

    print(f"v2: speedup (%) = {(t1 - t2) / t1 * 100:.2f}")
    print(f"v3: speedup (%) = {(t1 - t3) / t1 * 100:.2f}")

And I got the following output:

-----------------------------------
p = 2, time = 0.004732 s
p = 2, time = 0.004681 s
p = 2, time = 0.004869 s
v2: speedup (%) = 1.08
v3: speedup (%) = -2.89
-----------------------------------
p = 3, time = 0.005761 s
p = 3, time = 0.005616 s
p = 3, time = 0.005477 s
v2: speedup (%) = 2.52
v3: speedup (%) = 4.93
-----------------------------------
p = 4, time = 0.007164 s
p = 4, time = 0.006756 s
p = 4, time = 0.006991 s
v2: speedup (%) = 5.70
v3: speedup (%) = 2.42
-----------------------------------
p = 5, time = 0.008670 s
p = 5, time = 0.008037 s
p = 5, time = 0.007620 s
v2: speedup (%) = 7.30
v3: speedup (%) = 12.11
-----------------------------------
p = 6, time = 0.013308 s
p = 6, time = 0.011624 s
p = 6, time = 0.011230 s
v2: speedup (%) = 12.66
v3: speedup (%) = 15.61
-----------------------------------
p = 7, time = 0.015976 s
p = 7, time = 0.014699 s
p = 7, time = 0.012609 s
v2: speedup (%) = 7.99
v3: speedup (%) = 21.08
-----------------------------------
p = 8, time = 0.031250 s
p = 8, time = 0.026192 s
p = 8, time = 0.022152 s
v2: speedup (%) = 16.18
v3: speedup (%) = 29.11
-----------------------------------
p = 9, time = 0.046472 s
p = 9, time = 0.042073 s
p = 9, time = 0.034709 s
v2: speedup (%) = 9.47
v3: speedup (%) = 25.31
-----------------------------------
p = 10, time = 0.105313 s
p = 10, time = 0.086622 s
p = 10, time = 0.072196 s
v2: speedup (%) = 17.75
v3: speedup (%) = 31.45
-----------------------------------
p = 11, time = 0.174026 s
p = 11, time = 0.155572 s
p = 11, time = 0.136306 s
v2: speedup (%) = 10.60
v3: speedup (%) = 21.67
-----------------------------------
p = 12, time = 0.414733 s
p = 12, time = 0.340516 s
p = 12, time = 0.302217 s
v2: speedup (%) = 17.90
v3: speedup (%) = 27.13
-----------------------------------
p = 13, time = 0.718532 s
p = 13, time = 0.643369 s
p = 13, time = 0.598863 s
v2: speedup (%) = 10.46
v3: speedup (%) = 16.65
-----------------------------------
p = 14, time = 1.706717 s
p = 14, time = 1.418888 s
p = 14, time = 1.327434 s
v2: speedup (%) = 16.86
v3: speedup (%) = 22.22
-----------------------------------
p = 15, time = 3.279660 s
p = 15, time = 2.994144 s
p = 15, time = 2.917823 s
v2: speedup (%) = 8.71
v3: speedup (%) = 11.03
-----------------------------------
p = 16, time = 7.699075 s
p = 16, time = 6.543399 s
p = 16, time = 6.387876 s
v2: speedup (%) = 15.01
v3: speedup (%) = 17.03
-----------------------------------
p = 17, time = 16.511282 s
p = 17, time = 15.427652 s
p = 17, time = 15.296576 s
v2: speedup (%) = 6.56
v3: speedup (%) = 7.36
-----------------------------------
p = 18, time = 39.800181 s
p = 18, time = 35.164740 s
p = 18, time = 34.789528 s
v2: speedup (%) = 11.65
v3: speedup (%) = 12.59
-----------------------------------
p = 19, time = 81.914679 s
p = 19, time = 77.355520 s
p = 19, time = 77.089576 s
v2: speedup (%) = 5.57
v3: speedup (%) = 5.89
-----------------------------------
p = 20, time = 189.952086 s
p = 20, time = 164.420564 s
p = 20, time = 170.043715 s
v2: speedup (%) = 13.44
v3: speedup (%) = 10.48

I can see some good improvements. Now let's also check the performance of the IRFFT with the new single-threaded changes applied, and compare it with the previously reported result shown in this comment.

We can run the following code to check and compare the performances:

import time
import numpy as np

from offt_ifft_single_v1 import _irfft as irfft1_v1
from offt_ifft_single_v2 import _irfft as irfft1_v2
from offt_ifft_single_v3 import _irfft as irfft1_v3

import scipy

seed = 0
np.random.seed(seed)
pmax = 21
data = np.random.rand(2 ** pmax)

n_iter = 10000
for p in range(2, pmax + 1):
    n = 2 ** p
    T = data[:n]
    R = scipy.fft.rfft(T)
    R_original = R.copy()

    R = R_original.copy()
    ref = scipy.fft.irfft(R)
    
    ##########################################
    R = R_original.copy()
    comp1 = irfft1_v1(R)
    np.testing.assert_almost_equal(ref, comp1)

    t1 = 0 
    for _ in range(n_iter):
        R = R_original.copy()
        start = time.time()
        irfft1_v1(R)
        end = time.time()  
        t1 += end - start 

    print(f"p = {p}, time = {t1:.6f} s")
    ##########################################
    R = R_original.copy()
    comp2 = irfft1_v2(R)
    np.testing.assert_almost_equal(ref, comp2)

    t2 = 0
    for _ in range(n_iter):
        R = R_original.copy()
        start = time.time()
        irfft1_v2(R)
        end = time.time()
        t2 += end - start

    print(f"p = {p}, time = {t2:.6f} s")
    ##########################################
    R = R_original.copy()
    comp3 = irfft1_v3(R)
    np.testing.assert_almost_equal(ref, comp3)

    t3 = 0
    for _ in range(n_iter):
        R = R_original.copy()
        start = time.time()
        irfft1_v3(R)
        end = time.time()
        t3 += end - start

    print(f"p = {p}, time = {t3:.6f} s")
    ##########################################

    print(f"v2: speedup (%) = {(t1 - t2) / t1 * 100:.2f}")
    print(f"v3: speedup (%) = {(t1 - t3) / t1 * 100:.2f}")
    print('-----------------------------------')

print('Done!')

and the result is:

-----------------------------------
p = 2, time = 0.005450 s
p = 2, time = 0.005486 s
p = 2, time = 0.005732 s
v2: speedup (%) = -0.66
v3: speedup (%) = -5.17
-----------------------------------
p = 3, time = 0.006788 s
p = 3, time = 0.006510 s
p = 3, time = 0.006630 s
v2: speedup (%) = 4.08
v3: speedup (%) = 2.32
-----------------------------------
p = 4, time = 0.008130 s
p = 4, time = 0.007696 s
p = 4, time = 0.007997 s
v2: speedup (%) = 5.33
v3: speedup (%) = 1.63
-----------------------------------
p = 5, time = 0.010062 s
p = 5, time = 0.009580 s
p = 5, time = 0.008821 s
v2: speedup (%) = 4.79
v3: speedup (%) = 12.34
-----------------------------------
p = 6, time = 0.014651 s
p = 6, time = 0.013156 s
p = 6, time = 0.012161 s
v2: speedup (%) = 10.20
v3: speedup (%) = 16.99
-----------------------------------
p = 7, time = 0.017105 s
p = 7, time = 0.015558 s
p = 7, time = 0.013750 s
v2: speedup (%) = 9.05
v3: speedup (%) = 19.62
-----------------------------------
p = 8, time = 0.033675 s
p = 8, time = 0.027852 s
p = 8, time = 0.024281 s
v2: speedup (%) = 17.29
v3: speedup (%) = 27.90
-----------------------------------
p = 9, time = 0.049581 s
p = 9, time = 0.044677 s
p = 9, time = 0.036468 s
v2: speedup (%) = 9.89
v3: speedup (%) = 26.45
-----------------------------------
p = 10, time = 0.110884 s
p = 10, time = 0.091791 s
p = 10, time = 0.076581 s
v2: speedup (%) = 17.22
v3: speedup (%) = 30.94
-----------------------------------
p = 11, time = 0.183751 s
p = 11, time = 0.165282 s
p = 11, time = 0.143961 s
v2: speedup (%) = 10.05
v3: speedup (%) = 21.65
-----------------------------------
p = 12, time = 0.437696 s
p = 12, time = 0.359679 s
p = 12, time = 0.317152 s
v2: speedup (%) = 17.82
v3: speedup (%) = 27.54
-----------------------------------
p = 13, time = 0.749402 s
p = 13, time = 0.678841 s
p = 13, time = 0.627844 s
v2: speedup (%) = 9.42
v3: speedup (%) = 16.22
-----------------------------------
p = 14, time = 1.785713 s
p = 14, time = 1.504538 s
p = 14, time = 1.398988 s
v2: speedup (%) = 15.75
v3: speedup (%) = 21.66
-----------------------------------
p = 15, time = 3.433555 s
p = 15, time = 3.145418 s
p = 15, time = 3.029525 s
v2: speedup (%) = 8.39
v3: speedup (%) = 11.77
-----------------------------------
p = 16, time = 8.023407 s
p = 16, time = 6.885627 s
p = 16, time = 6.676141 s
v2: speedup (%) = 14.18
v3: speedup (%) = 16.79
-----------------------------------
p = 17, time = 16.932221 s
p = 17, time = 15.830801 s
p = 17, time = 15.960933 s
v2: speedup (%) = 6.50
v3: speedup (%) = 5.74
-----------------------------------
p = 18, time = 41.322939 s
p = 18, time = 36.691810 s
p = 18, time = 36.909695 s
v2: speedup (%) = 11.21
v3: speedup (%) = 10.68
-----------------------------------
p = 19, time = 86.039012 s
p = 19, time = 79.891138 s
p = 19, time = 84.358842 s
v2: speedup (%) = 7.15
v3: speedup (%) = 1.95
-----------------------------------
p = 20, time = 189.571693 s
p = 20, time = 170.513088 s
p = 20, time = 179.009645 s
v2: speedup (%) = 10.05
v3: speedup (%) = 5.57
-----------------------------------

What else can we do to improve the performance? We can avoid doing x[:] = x / len(x) (in the function _ifft) and instead perform this division in the for-loop at the end of the function _irfft, as follows:

    # fragment of `_irfft`: `v` is the packed complex buffer produced by the
    # inverse FFT and `m_minus_1` (presumably n // 2) is its length, so the
    # 1 / m_minus_1 scaling is folded into this unpacking loop
    out = np.empty(n, dtype=np.float64)
    scaler = 1.0 / m_minus_1
    for i in range(n // 2):
        out[2 * i] = v[i].real * scaler
        out[2 * i + 1] = v[i].imag * scaler

Let's call this version 4 and check the performance again.

p --> 2
p = 2, time = 0.005530 s
p = 2, time = 0.005503 s
p = 2, time = 0.005810 s
p = 2, time = 0.005445 s
v2: speedup (%) = 0.49
v3: speedup (%) = -5.06
v4: speedup (%) = 1.55
-----------------------------------
p --> 3
p = 3, time = 0.006380 s
p = 3, time = 0.006365 s
p = 3, time = 0.006671 s
p = 3, time = 0.006160 s
v2: speedup (%) = 0.24
v3: speedup (%) = -4.56
v4: speedup (%) = 3.45
-----------------------------------
p --> 4
p = 4, time = 0.007977 s
p = 4, time = 0.007702 s
p = 4, time = 0.007978 s
p = 4, time = 0.007645 s
v2: speedup (%) = 3.45
v3: speedup (%) = -0.01
v4: speedup (%) = 4.16
-----------------------------------
p --> 5
p = 5, time = 0.009601 s
p = 5, time = 0.009346 s
p = 5, time = 0.009416 s
p = 5, time = 0.008089 s
v2: speedup (%) = 2.66
v3: speedup (%) = 1.92
v4: speedup (%) = 15.75
-----------------------------------
p --> 6
p = 6, time = 0.014200 s
p = 6, time = 0.013405 s
p = 6, time = 0.012694 s
p = 6, time = 0.011361 s
v2: speedup (%) = 5.60
v3: speedup (%) = 10.60
v4: speedup (%) = 19.99
-----------------------------------
p --> 7
p = 7, time = 0.017294 s
p = 7, time = 0.015869 s
p = 7, time = 0.013665 s
p = 7, time = 0.012828 s
v2: speedup (%) = 8.24
v3: speedup (%) = 20.98
v4: speedup (%) = 25.82
-----------------------------------
p --> 8
p = 8, time = 0.033520 s
p = 8, time = 0.028279 s
p = 8, time = 0.023579 s
p = 8, time = 0.022665 s
v2: speedup (%) = 15.64
v3: speedup (%) = 29.66
v4: speedup (%) = 32.38
-----------------------------------
p --> 9
p = 9, time = 0.049492 s
p = 9, time = 0.044685 s
p = 9, time = 0.036992 s
p = 9, time = 0.034859 s
v2: speedup (%) = 9.71
v3: speedup (%) = 25.26
v4: speedup (%) = 29.57
-----------------------------------
p --> 10
p = 10, time = 0.110744 s
p = 10, time = 0.092250 s
p = 10, time = 0.076980 s
p = 10, time = 0.073542 s
v2: speedup (%) = 16.70
v3: speedup (%) = 30.49
v4: speedup (%) = 33.59
-----------------------------------
p --> 11
p = 11, time = 0.185736 s
p = 11, time = 0.166577 s
p = 11, time = 0.144147 s
p = 11, time = 0.137034 s
v2: speedup (%) = 10.31
v3: speedup (%) = 22.39
v4: speedup (%) = 26.22
-----------------------------------
p --> 12
p = 12, time = 0.435531 s
p = 12, time = 0.363399 s
p = 12, time = 0.321387 s
p = 12, time = 0.305873 s
v2: speedup (%) = 16.56
v3: speedup (%) = 26.21
v4: speedup (%) = 29.77
-----------------------------------
p --> 13
p = 13, time = 0.760939 s
p = 13, time = 0.689246 s
p = 13, time = 0.636102 s
p = 13, time = 0.609297 s
v2: speedup (%) = 9.42
v3: speedup (%) = 16.41
v4: speedup (%) = 19.93
-----------------------------------
p --> 14
p = 14, time = 1.823544 s
p = 14, time = 1.528142 s
p = 14, time = 1.425907 s
p = 14, time = 1.359158 s
v2: speedup (%) = 16.20
v3: speedup (%) = 21.81
v4: speedup (%) = 25.47
-----------------------------------
p --> 15
p = 15, time = 3.498502 s
p = 15, time = 3.205326 s
p = 15, time = 3.092848 s
p = 15, time = 2.973426 s
v2: speedup (%) = 8.38
v3: speedup (%) = 11.60
v4: speedup (%) = 15.01
-----------------------------------
p --> 16
p = 16, time = 8.139666 s
p = 16, time = 6.976795 s
p = 16, time = 6.743277 s
p = 16, time = 6.500640 s
v2: speedup (%) = 14.29
v3: speedup (%) = 17.16
v4: speedup (%) = 20.14
-----------------------------------
p --> 17
p = 17, time = 17.187242 s
p = 17, time = 15.995858 s
p = 17, time = 16.000840 s
p = 17, time = 15.510158 s
v2: speedup (%) = 6.93
v3: speedup (%) = 6.90
v4: speedup (%) = 9.76
-----------------------------------
p --> 18
p = 18, time = 41.324159 s
p = 18, time = 36.661479 s
p = 18, time = 36.919138 s
p = 18, time = 35.217285 s
v2: speedup (%) = 11.28
v3: speedup (%) = 10.66
v4: speedup (%) = 14.78
-----------------------------------
p --> 19
p = 19, time = 84.381538 s
p = 19, time = 79.629904 s
p = 19, time = 83.814942 s
p = 19, time = 80.459350 s
v2: speedup (%) = 5.63
v3: speedup (%) = 0.67
v4: speedup (%) = 4.65
-----------------------------------
p --> 20
p = 20, time = 189.045427 s
p = 20, time = 170.765312 s
p = 20, time = 178.709926 s
p = 20, time = 170.873909 s
v2: speedup (%) = 9.67
v3: speedup (%) = 5.47
v4: speedup (%) = 9.61
-----------------------------------

Let's run the single-threaded Numba code of our OFFT-based IRFFT on the MATLAB Online server and compare it with MATLAB's IFFT. We also ran SciPy's IRFFT on MATLAB Online as well.

image

As observed, the new version of our IRFFT is faster than its previous version. (Note that, from STUMPY's standpoint, it is okay to overwrite the input array of the inverse FFT, and that is what I assumed when testing our IRFFT code. So it may not be an apples-to-apples comparison, but we should be fine.)

And if we go for Multithreaded Numba Code (with 8 threads), we will see:
image

And, if we combine the single- and multi-threaded OFFT IRFFT (by taking their minimum), we will see this (the red curve):

image

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024 1

The following Colab notebook should allow you to run the OFFT-based RFFT single-threaded Numba code via profila and get a rough idea of how different parts contribute to the total running time.

https://colab.research.google.com/drive/1JVBzBxlE0VJa7BPh-C4mtc6zMBr-8wXK?usp=sharing

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024 1

Let's evaluate the performance of our OFFT-based Python code:

  • sliding-dot-product (the new OFFT-based python code vs Scipy / core.sliding_dot_product / core._sliding_dot_product)
  • sliding-dot-product (the new OFFT-based python code vs MATLAB code)
  • MASS (the new OFFT-based python code vs MATLAB code)

Part I: sliding dot product (Python OFFT-based code vs STUMPY's sliding-dot-product)

I now compare the OFFT-based sliding dot product (with NJIT) with:

stumpy.core._sliding_dot_product (np.dot approach, with Numba)
and stumpy.core.sliding_dot_product (fft approach, without Numba).

image

And, to obtain the performances, I used the following script:

import time

import numpy as np
import numpy.testing as npt

from stumpy import core

# `_offt_sliding_dot_product` is our OFFT-based sliding dot product (defined above)

seed = 0
np.random.seed(seed)

p = 11
T = np.random.uniform(-1, 1, size=2 ** p)
T_original = T.copy()

n_iter = 10000

lst_dot_numba = []
lst_fft_No_numba = []
lst_offt_fft_numba = []

for m in range(10, 1000 + 10, 10):
    Q = T[:m].copy()
    Q_original = Q.copy()
    
    ref = core._sliding_dot_product(Q, T)  # np.dot
    s = time.time()
    for _ in range(n_iter):
        core._sliding_dot_product(Q, T)
    e = time.time()
    lst_dot_numba.append(e - s)
    print(f"core._sliding_dot_product (np.dot numba) --> m = {m}, time = {e - s:.6f} s")

    npt.assert_almost_equal(T, T_original)
    npt.assert_almost_equal(Q, Q_original)
    ######################################

    comp = core.sliding_dot_product(Q, T)
    np.testing.assert_allclose(ref, comp)

    s = time.time()
    for _ in range(n_iter):
        core.sliding_dot_product(Q, T)
    e = time.time()
    lst_fft_No_numba.append(e - s)
    print(f"core.sliding_dot_product (fft no numba) --> m = {m}, time = {e - s:.6f} s")

    npt.assert_almost_equal(T, T_original)
    npt.assert_almost_equal(Q, Q_original)


    ######################################

    comp = _offt_sliding_dot_product(Q, T)  # dummy run
    np.testing.assert_allclose(ref, comp)

    s = time.time()
    for _ in range(n_iter):
        _offt_sliding_dot_product(Q, T)
    e = time.time()
    lst_offt_fft_numba.append(e - s)
    print(f"OFFT sliding dot product (fft numba) --> m = {m}, time = {e - s:.6f} s")

    npt.assert_almost_equal(T, T_original)
    npt.assert_almost_equal(Q, Q_original)
    
    print('------------------------------')


np.save('lst_dot_numba.npy', np.array(lst_dot_numba))
np.save('lst_fft_No_numba.npy', np.array(lst_fft_No_numba))
np.save('lst_offt_fft_numba.npy', np.array(lst_offt_fft_numba))

Part II: sliding dot product: python OFFT-based code vs MATLAB (via fft)

python OFFT-based sliding dot product
MATLAB sliding dot product

image

Part III: distance profile: python OFFT-based code vs MATLAB (MASS_V2)

python mass (taking advantage of fft and Numba)
MATLAB mass (taking advantage of fft)

image

A few notes:

Note 1: The correctness of the OFFT-based Python MASS functions is tested by asserting their outputs against the reference stumpy core._mass method.

Note 2: In the sliding-dot-product part (i.e., Part I and Part II), I am allowing our function _irfft to take a reusable array to store the output data.

Note 3: In our new OFFT-based Python MASS function, I compute the distance and store it in that reusable array.

Note 4: The MATLAB MASS function (in DAMP) re-computes the sliding mean and sliding standard deviation. In our function, though, I am computing those values in advance and then passing them to the function (like the API of core._mass; see the sketch after these notes). So, again, this may not be an apples-to-apples comparison, but we should be good from STUMPY's standpoint.

Note 5: These results are for T with length 2^11 and Q = T[:m] (with different values for m). The OFFT-based Python code used for the comparisons conducted here (in this comment) is single-threaded Numba code. The FFT in the MATLAB code, however, had access to 8 threads in all of these comparisons.
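To illustrate Note 4, here is a minimal sketch of precomputing the sliding statistics once and reusing them (assuming STUMPY's core.compute_mean_std and core.mass helpers accept these arguments; the sizes are arbitrary):

import numpy as np

from stumpy import core

T = np.random.rand(2 ** 11)
m = 64
M_T, Sigma_T = core.compute_mean_std(T, m)  # sliding mean / std of all windows

Q = T[:m]
D = core.mass(Q, T, M_T, Sigma_T)  # distance profile without recomputing the stats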

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024 1

[Warning!... Please also see the next comment]

[Update]
In what follows, I am going to show the performance of our multithreaded FFT-based sliding dot product in Python w.r.t. the FFT-based sliding dot product in MATLAB.

A few notes before I show the results:

  • Note 1: I ran both on MATLAB ONLINE SERVER.
  • Note 2: I checked the performance considering different lengths for T, ranging from 2^3 to 2^25. In each case, where T has a length of 2^p, I considered different lengths for query Q, with log2(len(Q)) ranging from 2 to p (exclusive). I ran each case 20 times.
  • Note 3: In the python code, I had a dummy run to avoid adding the compilation time to the total running time.
  • Note 4: I used 8 threads in both cases.
  • Note 5: Although I set the number of threads to 8 in both MATLAB and Python (Numba), I do not know why the maximum in MATLAB is 512, and the maximum in Python is 16 (?). See below:

For instance, if I run this:

% MATLAB
Nth = maxNumCompThreads;
disp(strcat("MAX N threads right now: ",num2str(Nth)));

num_of_thread = 1024;
LASTN = maxNumCompThreads(num_of_thread);
Nth = maxNumCompThreads;
disp(strcat("LASTN: ",num2str(Nth)));

I will get an error. But, if I set it to 512, I will get no error. In Numba, though, I cannot set the number of threads to 512. The following code

import numba
numba.set_num_threads(512)
n_threads = numba.get_num_threads()
print('n_threads" \n', n_threads)

raises an error, saying the number of threads should be between 1 and 16.


I am assuming that both MATLAB and Python use the same number of threads when I set it to 8 in both cases. I got the following result for the sliding dot product of Q and T. A value smaller than 1 means the OFFT-based code is faster than the corresponding MATLAB code (the smaller the value, the faster the OFFT-based code). In each row, the time series T has a fixed length, and the length of the query changes from 2^2 to 2^(p-1), where p = log2(len(T)) is shown in the first cell of the row.

(OFFT/MATLAB) Sliding Dot Product (Q, T)
rows: log2(len(T)); columns: log2(len(Q)) = 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
3 1.15
4 1.8 9.17
5 0.62 1.65 2.01
6 1.22 3.28 3.24 3.35
7 0.7 1.77 1.89 1.8 1.7
8 0.97 1.85 1.81 1.83 1.81 1.79
9 0.53 0.9 0.9 0.72 0.7 0.69 0.69
10 0.91 1.53 1.53 1.49 1.48 1.51 1.49 1.49
11 0.52 0.75 0.71 0.76 0.75 0.76 0.76 0.79 0.76
12 0.65 0.84 0.82 0.85 0.82 0.82 0.93 0.96 0.84 0.84
13 0.34 0.41 0.41 0.4 0.4 0.4 0.44 0.4 0.41 0.41 0.41
14 0.48 0.53 0.52 0.55 0.59 0.58 0.58 0.51 0.56 0.58 0.56 0.58
15 0.38 0.39 0.37 0.39 0.41 0.41 0.41 0.4 0.38 0.38 0.4 0.32 0.35
16 0.47 0.47 0.46 0.46 0.47 0.47 0.46 0.42 0.39 0.46 0.42 0.47 0.47 0.46
17 0.4 0.37 0.37 0.38 0.37 0.38 0.37 0.38 0.39 0.37 0.36 0.47 0.48 0.35 0.42
18 0.63 0.78 0.68 0.65 0.64 0.67 0.61 0.58 0.58 0.57 0.59 0.57 0.54 0.58 0.58 0.62
19 0.56 0.44 0.41 0.41 0.4 0.41 0.41 0.39 0.39 0.41 0.43 0.45 0.42 0.43 0.47 0.5 0.43
20 0.64 0.63 0.63 0.64 0.64 0.66 0.62 0.63 0.63 0.62 0.63 0.62 0.65 0.67 0.64 0.63 0.63 0.64
21 0.48 0.46 0.52 0.53 0.53 0.54 0.54 0.53 0.56 0.56 0.53 0.53 0.54 0.5 0.53 0.51 0.51 0.51 0.53
22 0.59 0.53 0.54 0.53 0.49 0.54 0.48 0.47 0.45 0.48 0.48 0.48 0.51 0.57 0.55 0.5 0.51 0.49 0.53 0.53
23 0.39 0.45 0.42 0.43 0.46 0.43 0.42 0.42 0.43 0.39 0.41 0.4 0.43 0.42 0.4 0.48 0.47 0.5 0.46 0.42 0.49
24 0.52 0.57 0.56 0.51 0.53 0.55 0.57 0.53 0.54 0.5 0.62 0.65 0.65 0.65 0.65 0.62 0.62 0.63 0.63 0.64 0.68 0.68
25 1.07 1.1 1.11 1.12 1.13 1.13 1.12 1.13 1.11 1.1 1.09 1.11 1.11 1.1 1.06 1.09 1.1 1.1 1.12 1.11 1.09 1.02 1.05

As observed, the OFFT-based sliding dot product shows good performance when len(T) > 2^10.

To get the results above, I used the following code:

% MATLAB
% This is to create and save one time series with length 2^25

rng(1,"twister");
T = rand(1, 2^25);
save('T_input.mat', 'T')

Now that we've saved our data, we are going to load and use it...

% MATLAB: Get performance for sliding-dot-product, when T has length 2^p

num_of_thread = 8;
LASTN = maxNumCompThreads(num_of_thread);
disp(strcat("LASTN: ",num2str(LASTN)));

Nth = maxNumCompThreads;
disp(strcat("MAX N threads: ",num2str(Nth)));


load("T_input.mat");


pmax = p;  % the log2 of len(T), set by user
disp(strcat("pmax: ",num2str(pmax)));
data = T(1, 1:2^pmax);

m_power_max = pmax - 1;
m_power_range = 2:1:m_power_max;

iter_max = 20;
iter_range = 1:1:iter_max;

running_time = [];
for m_power = m_power_range
    disp(m_power);

    m = 2^m_power;
    Q = data(1, 1:m);
    
    SLIDING_DOT_PRODUCT_MASS_V2(data, Q);  % I think no dummy run is needed in MATLAB :)
    
    total_t = 0;
    for i = iter_range
        t = tic;
        SLIDING_DOT_PRODUCT_MASS_V2(data, Q);
        total_t = total_t + toc(t);
    end
    running_time(end+1) = total_t;

   
end


fname = "%d_MATLAB.mat";
fname = sprintf(fname, pmax);
disp(fname);
save(fname, 'running_time')

function [z] = SLIDING_DOT_PRODUCT_MASS_V2(x, y)
    % This function is based on the code used in the DAMP MATLAB code.
    
    % x is the data, y is the query
    m = length(y);
    n = length(x);

    y = y(end:-1:1); % reverse the query
    y(m+1:n) = 0; % append zeros

    %The main trick of getting dot products in O(n log n) time
    X = fft(x);
    Y = fft(y);
    Z = X.*Y;
    z = ifft(Z);
    z = z(m:n);

end

And, I used the following code to get the performance of Python:

import time

import numba
import numpy as np
from scipy.io import loadmat

numba.set_num_threads(8)
n_threads = numba.get_num_threads()
print('n_threads:\n', n_threads)


running_time = []
T = np.array(loadmat('./T_input.mat')['T'][0]).astype(np.float64)

print(f'log2 len of T --> {int(np.log2(len(T)))}')

pmax = p  # the log2 of len(T), set by user
print(f'pmax --> {pmax}')

T = T[:2 ** pmax].copy()
m_power_range = range(2, pmax)
for m_power in m_power_range:
    m = 2 ** m_power
    Q = T[:m].copy()
    
    _offt_sliding_dot_product(Q, T)
    
    print(f'm_power --> {m_power}')
    total_time = 0
    for _ in range(20):
        tic = time.time()
        _offt_sliding_dot_product(Q, T)
        toc = time.time()
        total_time += toc - tic
    running_time.append(total_time) 
    

running_time_python_arr = np.array(running_time)
np.save(f'{pmax}_OFFT.npy', running_time_python_arr)

I also checked the performance of the single-threaded FFT-based sliding dot product in Python and compared it against the FFT-based sliding dot product in MATLAB. The following result shows the Python / MATLAB running-time ratio. A value smaller than 1 means the OFFT-based code is faster than the corresponding MATLAB code (the smaller the value, the faster the OFFT-based code). In each row, the time series T has a fixed length, and the length of the query changes from 2^2 to 2^(p-1), where p = log2(len(T)) is shown in the first cell of the row.

(OFFT_SingleThreaded/MATLAB) Sliding Dot Product (Q, T)
rows: log2(len(T)); columns: log2(len(Q)) = 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
3 0.04
4 0.04 0.24
5 0.03 0.06 0.08
6 0.05 0.11 0.1 0.11
7 0.06 0.14 0.14 0.14 0.16
8 0.07 0.13 0.14 0.13 0.13 0.14
9 0.11 0.18 0.18 0.16 0.14 0.14 0.14
10 0.19 0.31 0.31 0.31 0.3 0.32 0.31 0.31
11 0.28 0.41 0.38 0.41 0.41 0.41 0.42 0.41 0.42
12 0.39 0.51 0.5 0.51 0.5 0.5 0.56 0.57 0.51 0.58
13 0.58 0.72 0.67 0.66 0.67 0.67 0.68 0.67 0.68 0.68 0.68
14 0.63 0.66 0.66 0.69 0.74 0.74 0.74 0.65 0.71 0.74 0.72 0.74
15 0.9 0.92 0.86 0.91 0.94 0.95 0.94 0.92 0.86 0.92 0.93 1.16 0.92
16 0.95 1 1.01 1.01 1.02 1.03 1 0.89 0.8 0.97 0.91 0.96 0.99 1.06
17 1.26 1.27 1.31 1.29 1.28 1.27 1.24 1.24 1.28 1.28 1.26 1.27 1.29 1.24 1.28
18 1.41 1.46 1.44 1.42 1.46 1.5 1.48 1.48 1.48 1.45 1.5 1.45 1.37 1.5 1.49 1.49
19 1.5 1.54 1.51 1.51 1.5 1.5 1.5 1.49 1.52 1.51 1.53 1.51 1.54 1.51 1.5 1.55 1.2
20 1.57 1.56 1.62 1.64 1.61 1.61 1.56 1.59 1.59 1.57 1.62 1.57 1.63 1.58 1.58 1.61 1.59 1.61
21 1.63 1.64 1.97 1.94 1.85 1.86 1.82 1.75 1.81 1.83 1.84 1.84 1.87 1.82 1.73 1.73 1.7 1.68 1.73
22 1.76 1.75 1.76 1.81 1.79 1.77 1.78 1.75 1.82 1.77 1.76 1.77 1.96 1.95 1.95 1.95 1.89 1.9 1.91 2.04
23 1.51 1.43 1.43 1.41 1.42 1.42 1.42 1.41 1.41 1.42 1.41 1.41 1.42 1.41 1.45 1.46 1.44 1.46 1.45 1.48 1.46
24 1.9 1.87 1.89 2.06 2.02 2 1.85 1.83 1.84 1.81 1.86 1.82 1.81 1.85 1.85 1.75 1.73 1.78 1.78 1.79 1.81 1.87
25 3.16 3.22 3.27 3.24 3.25 3.24 3.25 3.3 3.23 3.26 3.17 3.19 3.23 3.16 3.13 3.16 3.21 3.19 3.25 3.22 3.19 2.95 3.03

The OFFT-based sliding dot product shows decent performance when len(T) < 2^16.

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024 1

Profiling 6-step FFT

Now that we have a (somewhat) clear definition of the six-step FFT (see the previous comment), we can wrap an njit-decorated function around each step and start profiling the six-step FFT algo at the "step" level.

import math
import time

import numpy as np
from numba import njit


@njit(fastmath=True)
def _transpose(x, n):
    """
    n is sqrt(len(x))
    """
    for k in range(n):
        for p in range(k + 1, n):
            i = k + p * n
            j = p + k * n
            x[i], x[j] = x[j], x[i]
    
    return


@njit(fastmath=True)  
def _fft0(n, s, eo, x, y, c): 
    """
    A recursive function that is used as part of fft algorithm
    
    n : int
    s : int
    eo: bool
    x : numpy.array 1D
    y : numpy.array 1D
    """
    if n == 2:
        if eo:
            z = y
        else:
            z = x
        
        for i in range(s):
            j = i + s
            a = x[i]
            b = x[j]
            z[i] = a + b
            z[j] = a - b
            
    elif n >= 4:
        m = n // 2
        sm = s * m
    
        twiddle_factor = 1.0
        for p in range(m):
            sp = s * p
            two_sp = 2 * sp
            
            for q in range(s):
                i = sp + q
                j = i + sm
                
                k = two_sp + q
                y[k] = x[i] + x[j]
                y[k + s] = (x[i] - x[j]) * twiddle_factor
        
            twiddle_factor = twiddle_factor * c

        _fft0(m, 2*s, not eo, y, x, c * c)
        
    else:
        pass
    
@njit(fastmath=True)
def _apply_fft0(x, y, n, c_theta):
    for p in range(n):
        start = p * n
        _fft0(n, 1, False, x[start:], y[start:], c_theta)

    return


@njit(fastmath=True)
def _transpose_and_twiddle_factor(x, n, theta_init):
    n_plus_1 = n + 1
    for p in range(n):
        theta0 = theta_init * p
        ppn = p * n_plus_1
        
        c = math.cos(theta0) - 1j * math.sin(theta0)
        w = math.cos(theta0 * p) - 1j * math.sin(theta0 * p)
        for alpha in range(0, n - p):
            i = ppn + alpha
    
            if alpha == 0:
                x[i] = x[i] * w
            else:
                j = ppn + alpha * n
                x[j], x[i] = x[i] * w, x[j] * w
                
            w = w * c

    return


def _sixstep_fft(x, y):
    """
    x and y are of the same length, and the log2 of their length is even.

    Not njit-decorated on purpose, so that each step can be timed with
    time.time()
    """
    N = len(x)
    n = int(np.sqrt(N))

    theta = 2 * math.pi / n
    c_theta = math.cos(theta) - 1j * math.sin(theta)
    theta_init = 2 * math.pi / N

    t1 = time.time()
    _transpose(x, n)  # step 1
    t2 = time.time()
    _apply_fft0(x, y, n, c_theta)  # step 2
    t3 = time.time()
    _transpose_and_twiddle_factor(x, n, theta_init)  # steps 3 & 4
    t4 = time.time()
    _apply_fft0(x, y, n, c_theta)  # step 5
    t5 = time.time()
    _transpose(x, n)  # step 6
    t6 = time.time()
    
    out = np.array([t1, t2, t3, t4, t5, t6])

    return out

As provided above, our _sixstep_fft function returns an array of six timestamps from which one can calculate the running time of each step (note that steps 3 & 4 are merged together).

Note that the log2 of the length of the input to _sixstep_fft(x, y) must be even. Therefore, I profiled inputs of different lengths 2^p, where p is in range(2, 24 + 1, 2). I used the following code:

pmax = 24
power_values = np.arange(2, pmax + 1, 2)

seed = 0
np.random.seed(seed)
T = np.random.rand(2 ** pmax).astype(np.complex_)

n_iter = 1000
running_time_collect = np.empty((len(power_values), 5), dtype=np.longdouble)

for i, p in enumerate(power_values):
    print(f'p --> {p}')
    n = 2 ** p
    x_main = T.copy()[:n]
    y_main = np.empty_like(x_main)

    x = x_main.copy()
    y = y_main.copy()
    _sixstep_fft(x, y)
    
    running_time = np.empty((n_iter, 5), dtype=np.longdouble)
    for j in range(n_iter):
        x = x_main.copy()
        y = y_main.copy()
        out = _sixstep_fft(x, y)
        running_time[j] = out[1:] - out[:-1]

    running_time_collect[i] = np.median(running_time, axis=0)

Each row of running_time_collect contains 5 values corresponding to the running times of steps 1, 2, 3 & 4, 5, and 6.

image

Note 1: I did not plot the running time for inputs with length 2^p, p < 10, as the calculated running time for almost all steps was zero.
Note 2: Steps 1 and 6 are both "Transpose", and steps 2 and 5 are both "_apply_fft0". Therefore, to simplify the plot above, we can sum the running times of the steps that call the same function. Hence:

image

Note that the transpose part (steps 1 & 6 together) takes around 30% of the running time for p >= 16.


Comparing the performances of different functions for "Transpose" step

import numpy as np
from numba import njit, prange


def func0(x, n):
  x[:] = np.transpose(x.reshape(n, n)).reshape(-1,)
  return


def func1(x, n):
  x[:] = x.reshape(n,n, order='F').ravel()
  return


def func2(x, n):
    x[:] = np.transpose(x.reshape(n, n)).ravel()
    return


@njit
def func3(x, n):
    x[:] = np.transpose(x.reshape(n, n)).copy().reshape(-1,)
    return


@njit
def func4(x, n):
    x[:] = np.transpose(x.reshape(n, n)).ravel()
    return


@njit(fastmath=True)
def func5(x, n):
  for k in range(n):
    for p in range(k + 1, n):
      i = k + p * n
      j = p + k * n

      x[i], x[j] = x[j], x[i]

  return


@njit(fastmath=True)
def func6(x, n):
  for k in range(n):
    kn = k * n
    for p in range(k + 1, n):
        i = k + p * n
        j = p + kn

        x[i], x[j] = x[j], x[i]

  return


@njit(fastmath=True)
def func7(x, n, x_transpose):
    """
    we already have a helper variable in 6-step fft algorithm, whose length is the same as x's. 
    We can use that helper variable variable and pass it as the third argument of this function
    to store the result. We can then use x as the helper variable in the rest of code.
    """
    idx = 0
    for _, val in np.ndenumerate(np.transpose(x.reshape(n, n))):
        x_transpose[idx] = val
        idx += 1
    return


@njit(fastmath=True, parallel=True)
def func8(x, n):
  for k in prange(n):
    kn = k * n
    for p in range(k + 1, n):
        i = k + p * n
        j = p + kn

        x[i], x[j] = x[j], x[i]

  return

Note: Numba does not support the second argument, order, for np.reshape

I used the following code to get the performance of each function for inputs with different lengths, 2^p, where p is in range(2, 24 + 1, 2).

import pickle
import time

import numpy as np

seed = 0
np.random.seed(seed)

pmin = 2
pmax = 24
T = np.random.rand(2 ** pmax)

p_range_values = range(pmin, pmax + 1, 2)

n_iter = 1000

funcs = {
    'func0' : func0,
    'func1' : func1,
    'func2' : func2,
    'func3' : func3,
    'func4' : func4,
    'func5' : func5,
    'func6' : func6,
    'func7' : func7,
    'func8' : func8,
}

funcs_with_extra_input_x_transpose = ['func7']

performance = {}
for func_name, func_object in funcs.items():
    print(f'=================== {func_name} ===================')
    running_time = []

    for p in p_range_values:
        N = 2 ** p
        n = int(np.sqrt(N))

        x = T[:N].copy()
        
        # [note] we can use functools.partial to avoid the following if-else block
        if func_name in funcs_with_extra_input_x_transpose:
            x_transpose = np.empty_like(x)
            func_object(x, n, x_transpose)
        else:
            func_object(x, n)

        lst = []
        for _ in range(n_iter):
            x = T[:N].copy()
            
            if func_name in funcs_with_extra_input_x_transpose:
                x_transpose = np.empty_like(x)
                ts = time.time()
                func_object(x, n, x_transpose)
                te = time.time()
            else:
                ts = time.time()
                func_object(x, n)
                te = time.time()

            lst.append(te - ts)

        running_time.append(
            np.median(lst)
        )

    performance[func_name] = running_time

# save
with open('performance_transpose_funcs.pickle', 'wb') as handle:
    pickle.dump(performance, handle, protocol=pickle.HIGHEST_PROTOCOL)

We use the performance of func0 as our baseline (reference). The performances of the other functions w.r.t. func0 are shown below:

performance ratio (on macOS)

A value below the horizontal line y=1 means better performance than the performance of func0 for the same input size.

image

Note: func8 (the multithreaded function) has poor performance for 2^p, with p <= 12. To better reflect the performance of other functions, I am going to show the performance of func8 for p > 12 only.

image

It is interesting to see that the function func5 (and func6) ....

@njit(fastmath=True)
def func5(x, n):
  for k in range(n):
    for p in range(k + 1, n):
        i = k + p * n
        j = p + k * n

        x[i], x[j] = x[j], x[i]

  return

performs poorly for len(T) between 2^16 and 2^20 (inclusive). It is interesting to note that we saw a similar bump in the performance of FFT before. Can this be the reason? Also, why do we have a bump in the performance here?


Interestingly, func7 ....

@njit(fastmath=True)
def func7(x, n, x_transpose):
    """
    we already have a helper variable in 6-step fft algorithm, whose length is the same as x's. 
    We can use that helper variable and pass it as the third argument of this function
    to store the result. We can then use x as the helper variable in the rest of code.
    """
    idx = 0
    for _, val in np.ndenumerate(np.transpose(x.reshape(n, n))):
        x_transpose[idx] = val
        idx += 1
    return

shows a good performance overall. However, it is considerably outperformed by func8 (the multithreaded code) when input size is >= 2^20.

I also ran the code on Linux (using Google Colab).

performance of transpose functions (in Linux OS via Google Colab, with two threads)

image

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024 1

This cache-oblivious transpose algorithm (and matrix multiplication algorithm) might be useful for future reference as it relates to blocking/tiling

(TLDR) Tiling seems to be effective for matrix transpose! (Check out func6 below)


Let's set our baseline first. Of the following three functions, we choose func_C (in fact, the performances of all three are very close to each other):

def func_A(x, n):
  x[:] = np.transpose(x.reshape(n, n)).reshape(-1,)
  return

def func_B(x, n):
  x[:] = x.reshape(n, n, order='F').ravel()
  return

def func_C(x, n):
    x[:] = np.transpose(x.reshape(n, n)).ravel()
    return

Alternative, more performant functions:

@njit
def func1(x, n):
    # the njit-decorated version of `func_C`
    x[:] = np.transpose(x.reshape(n, n)).ravel()
    return

@njit(fastmath=True)
def func2(x, n):
  # SingleThreaded function that is currently used in our six-step FFT implementation
  for k in range(n):
    for p in range(k + 1, n):
      i = k + p * n
      j = p + k * n
      x[i], x[j] = x[j], x[i]

  return

@njit(fastmath=True, parallel=True)
def func3(x, n):
  # MultiThreaded function that is currently used in our six-step FFT implementation
  for k in prange(n):
    for p in range(k + 1, n):
      i = k + p * n
      j = p + k * n
      x[i], x[j] = x[j], x[i]

  return


@njit(fastmath=True)
def func4(x, n, x_transpose):
    # a function that was recently proposed in this issue page...
    # for transposing matrix and store it in `x_transpose`
    idx = 0
    for _, val in np.ndenumerate(np.transpose(x.reshape(n, n))):
        x_transpose[idx] = val
        idx += 1
    return


@njit(fastmath=True)
def func5(x, n, x_transpose):
    # tiling, based on this stack overflow: 
    # https://stackoverflow.com/questions/5200338/a-cache-efficient-matrix-transpose-program
    blocksize = 32
    blocksize = min(blocksize, n)    
    for i in range(0, n, blocksize):
        for j in range(0, n, blocksize):
            for k in range(i, i + blocksize):
                for l in range(j, j + blocksize):
                    x_transpose[l + k * n] = x[k + l * n]
    return


@njit(fastmath=True)
def func6(x, n, x_transpose):
    # Similar to func5. I just use numpy operation instead of two for-loops.
    blocksize = 32
    blocksize = min(blocksize, n)

    x = x.reshape(n, n)
    x_transpose = x_transpose.reshape(n, n)
    for i in range(0, n, blocksize):
        for j in range(0, n, blocksize):
            x_transpose[i:i + blocksize, j:j + blocksize] = x[j:j + blocksize, i:i + blocksize].transpose()
            
    return

In addition to the functions above, we are going to consider one more function based on the algorithm proposed in Cache-Oblivious Algorithms.

@njit(fastmath=True)
def _func7(a, n, N):
    if n <= 32:
        for i in range(n):
            for j in range(i):
                a[i * N + j], a[j * N + i] = a[j * N + i], a[i * N + j]

      
    else:
        k = n // 2

        _func7(a, k, N)
        _func7(a[k:], k, N)
        _func7(a[k * N:], k, N)
        _func7(a[k * N + k:], k, N)

        for i in range(k):
            for j in range(k):
                a[i * N + (j + k)], a[(i + k) * N + j] = a[(i + k) * N + j], a[i * N + (j + k)]

        if n % 2:
            for i in range(n - 1):
                a[i * N + n - 1], a[(n - 1) * N + i] = a[(n - 1) * N + i], a[i * N + n - 1]

@njit
def func7(a, n):
     # based on the algo provided in the following link: 
     # https://en.algorithmica.org/hpc/external-memory/oblivious/
    _func7(a, n, n)

The code to check the performance is provided in this Colab notebook.

The following result was obtained by running the code on my macOS machine (not Google Colab). I collected 500 running-time samples per case and used their median. Each data point in the plot below is the running-time ratio w.r.t. the running time of the baseline function.

image

Running the code in Colab gives a (roughly) similar result (in Colab, I collected 100 samples per case).

image

I think the winner is func6. Numba, tiling, and NumPy operations seem to work well together.

Previously, we noticed that a for-loop can be faster than a NumPy operation when we use Numba. However, it now seems that the NumPy operation is faster if we use Numba AND take the cache into account.


@seanlaw
func6 is basically func5 but with a NumPy operation in each tile. The tiling approach is based on the answer provided in this Stack Overflow post. I noticed that this approach is aligned with the approach provided in this paper (see Section 3). When the size of the matrix is an exact power of two, the divide-and-conquer approach provided in the paper is the same as two for-loops that traverse the matrix tile by tile, where each tile is a blocksize-by-blocksize matrix.
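As a quick sanity check of that tile-by-tile traversal (my own snippet; the sizes and block size are arbitrary), the tiled copy reproduces NumPy's transpose:

import numpy as np

n, blocksize = 128, 32
x = np.random.rand(n * n)
x_transpose = np.empty_like(x)

a = x.reshape(n, n)
b = x_transpose.reshape(n, n)  # a view, so writes land in x_transpose
for i in range(0, n, blocksize):
    for j in range(0, n, blocksize):
        b[i:i + blocksize, j:j + blocksize] = a[j:j + blocksize, i:i + blocksize].T

np.testing.assert_array_equal(b, a.T)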

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024 1

Also, another thing I have been trying out is to use pre-computed factors instead of computing each factor in the FFT algorithm. (This is based on what I read somewhere (?); it is one of the ways to further improve an FFT algorithm.) The values of these factors depend ONLY on the length of the input.

You might be able to use functools.cache (rather than lru_cache) to memoize the factors:

import math
from functools import cache

@cache
def fft_cosine_factor(i, theta):
    return math.cos(i * theta)

@cache
def fft_sine_factor(i, theta):
    return math.sin(i * theta)

So, when you call either of these functions, Python will check whether the given i and theta combination has been used before. If not, it will perform the computation and keep the result in memory. If the inputs have been seen before then it will simply pull the result from memory and avoid re-computing the result. Note that this will NEVER work if the input is a numpy array as it is not possible to ensure that the elements of the array have not been altered. However, in this case, since i and theta are scalar values, then we're good!

theta = math.pi / m
for k in range(1, m // 2):
    # factor = math.cos(k * theta) - 1j * math.sin(k * theta)  # The next line is equivalent to this
    factor = fft_cosine_factor(k, theta) - 1j * fft_sine_factor(k, theta)

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024 1

@NimaSarajpoor I discovered something really interesting while looking into cache-oblivious FFT algorithms. The inventor of cache-oblivious algorithms, Matteo Frigo, had provided a cache-oblivious version of FFT in his original paper (see Section 3) and it turns out that Frigo is also the creator of FFTW! Which, you guessed it, uses a cache-oblivious algorithm!

I am starting to think that (in a separate PR) it might be helpful to create a notebook to help explain cache-oblivious algorithms (with diagrams) starting with the basics of how cache sits in between the CPU and main memory (RAM), to understanding blocks, a simple for-loop of how to divide up a matrix and traverse blocks, and how it can be used in a matrix transpose, and finally to FFT. Ultimately, I'm hoping that this can be generalized to help us think about using blocks/tiles for the stump function as cache-oblivious algorithms (let's call it COA from now on as I'm getting tired of writing it out 😆).

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024 1

So, when you call either of these functions, Python will check whether the given i and theta combination has been used before. If not, it will perform the computation and keep the result in memory. If the inputs have been seen before then it will simply pull the result from memory and avoid re-computing the result.

Precomputing the array has not shown promising results. I am going with the cache approach. Thanks for the very informative explanation!!!

Frigo is also the creator of FFTW! Which, you guessed it, uses a cache-oblivious algorithm!

Interesting! I skimmed the paper before, and our current cache-oblivious matrix transpose basically follows the approach proposed in the paper. I didn't notice, though, that he is one of the authors of FFTW! It might be too soon to say this (?) but it seems we are on the right track!

I am starting to think that (in a separate PR) it might be helpful to create a notebook to help explain cache-oblivious algorithms (with diagrams) starting with the basics of how cache sits in between the CPU and main memory (RAM), to understanding blocks,

Agreed! I think that will be fun!

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024 1

[Update]

I am going with the cache approach.

Apparently the decorators @cache and @njit can work together for a function unless that function itself is being called by another njit-decorated function! (see this GitHub issue)

To see the impact of cache on performance:

from numba import njit
from functools import cache
import math

@cache
@njit
def math_cosine_cache(i, theta):
  N = 10000000
  total = 0
  for _ in range(i, i + N):
    total = total + math.cos(i * theta)
  return total


@njit
def math_cosine(i, theta):
  N = 10000000
  total = 0
  for _ in range(i, i + N):
    total = total + math.cos(i * theta)  # same body as above, but without @cache
  return total

And the problem appears here:

@njit
def wrapper(i, theta):
  return math_cosine_cache(i, theta)

wrapper(0, math.pi)

In fact, cache can decorate an njit-decorated function. However, njit cannot decorate a cache-decorated function.

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024 1

Surprising indeed! Will do! Also need to share one to show the impact of different block sizes on performance.

I am trying to see if I can replace the recursive FFT function with a for-loop. Can that affect the performance? I read that it might be faster, as a recursive function pushes each call onto the stack. I will continue working on it for the next few days and provide an update.
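For reference, a minimal sketch of the unrolling I have in mind (my own draft, not the final notebook version): each recursive call of _fft0 halves n, doubles s, flips eo, and squares c, so the recursion can be replaced by a while-loop that swaps the roles of the two buffers at each level.

import numpy as np
from numba import njit


@njit(fastmath=True)
def _fft0_iterative(n, x, y, c):
    # x and y are complex128 buffers of length n (a power of two)
    s = 1
    eo = False
    while n >= 4:
        m = n // 2
        twiddle_factor = 1.0 + 0.0j
        for p in range(m):
            sp = s * p
            two_sp = 2 * sp
            for q in range(s):
                i = sp + q
                j = i + s * m
                k = two_sp + q
                y[k] = x[i] + x[j]
                y[k + s] = (x[i] - x[j]) * twiddle_factor

            twiddle_factor = twiddle_factor * c

        n, s, eo, c = m, 2 * s, not eo, c * c
        x, y = y, x  # the recursive call swaps the two buffers

    if n == 2:
        z = y if eo else x
        for q in range(s):
            a = x[q]
            b = x[q + s]
            z[q] = a + b
            z[q + s] = a - b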

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024 1

Can you share a small, self contained, reproducible example with only the sine/cosine parts?

I created three versions and put them all in this notebook. The first version is the original. The second version takes advantage of the identity $e^{j\theta} = cos(\theta) + j sin(\theta)$. In the third version, we pass an array filled with precomputed values as an argument. No performance gain is observed in the third version. I think we are dealing with memory-access time here.

[Update]
Running the code on my macOS machine gives a different result: it shows an improvement when using the precomputed array.

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024 1

Also need to share one to show the impact of different block sizes on performance.

I checked the impact of the block size in the cache-oblivious matrix-transpose algorithm (see this notebook). I do not see any (significant) improvement from changing the block size.

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024 1

I am trying to see if I can replace the fft recursive function with for-loop. Can that affect the performance?

I wrote the for-loop version of the recursive function we use in the 6-step FFT (see this notebook). I ran it on macOS and noticed no performance gain.

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024 1
  1. It looks like fill_c_array isn't even called by func_version3 so is there even a need to use njit for fill_c_array rather than using straight numpy? Otherwise, why not consider prange for filling the contents of the array?

You are right (re: isn't even called by func_version3)!

I initially wrote this function to be used in func_version3. Later, I decided not to include it in the running time, as the content of the array depends only on its length: it can be computed once and used across several calls of func_version3 (more on this below).

So is there even a need to use njit for fill_c_array rather than using straight numpy? Otherwise, why not consider prange for filling the contents of the array?

I think it is worth it to use njit to speed it up if our goal is to compute it when DAMP is initialized.

You've split fill_c_array into two functions (a public function and a private function) and will incur the cost of invoking the private function.

The private function is actually a recursive function. I can convert it to a for-loop, and then keep everything in one function. I will work on it.

However, it looks like the c_array is only being computed once, the values are used only once, and that same c_array is not being reused for other lengths of n.

Two notes:
(1) I simplified the code in the provided notebook. The recursive function func_version3(n, s, x, c_arr) is actually called 2 * n times by the 6-step FFT algorithm (n is $\sqrt{len(x)}$). For instance, for an array x with size 2^22, the function is going to be called 2 * (2^11) times. So, we are going to use the precomputed array 2^12 times in 6-step function.

(2) In DAMP, for example, with each new batch of data points, we need to use FFT. Suppose that the window size m is set to 2^8. When the new batch arrives, we are going to apply FFT on a slice of length 2^9, then a slice of length 2^10, and so on until we exhaust all historical data points or abandon the search early. Then we do the same process for the next batch of data. So, in DAMP, for 1M new data points, we may re-use the precomputed array (of length 2^10) 1M times (and recall that the function func_version3(2^5, s, x, c_arr) is going to be called 2 * (2^5) times).
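As a rough sketch of the access pattern described in (2) (a deliberately simplified, hypothetical backward search; stumpy.mass is used here as a stand-in for our OFFT-based MASS):

import numpy as np
import stumpy


def _backward_search_sketch(T, i, m, threshold):
    # For the query starting at index i, scan backward over historical slices
    # whose lengths double until the history is exhausted or the search is
    # abandoned early
    best = np.inf
    size = 2 * m
    while size <= i:
        D = stumpy.mass(T[i : i + m], T[i - size : i])
        best = min(best, np.min(D))
        if best < threshold:
            break  # early abandon
        size *= 2

    return best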

Is the for loop in _fill_c_array the same as np.power(c_thata, np.arange(m))?

Right! Will test its performance as it is definitely cleaner!
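A quick sanity check (my own snippet; c_theta plays the role of c_thata in the question above) that filling the array by repeated multiplication matches np.power:

import numpy as np

m = 8
theta = 2 * np.pi / m
c_theta = np.cos(theta) - 1j * np.sin(theta)

c_arr = np.empty(m, dtype=np.complex128)
w = 1.0 + 0.0j
for i in range(m):
    c_arr[i] = w  # w == c_theta ** i
    w = w * c_theta

np.testing.assert_allclose(c_arr, np.power(c_theta, np.arange(m)))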

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024 1

I initially wrote this function to be used in func_version3. Later, I decided to not consider it in the running time (as the content of array just depends on its length. It can be computed once and be used in several calls of func_version3 (More on this below)

(1) I simplified the code in the provided notebook. The recursive function func_version3(n, s, x, c_arr) is actually called 2 * n times by the 6-step FFT algorithm (n is ). For instance, for an array x with size 2^22, the function is going to be called 2 * (2^11) times. So, we are going to use the precomputed array 2^12 times in 6-step function.

To properly show the impact of using the precomputed array, I can write a function that calls func_version3(n, ...) n times (like what we have in the 6-step FFT). This should give us a better / more accurate idea of the impact of this approach on the running time.

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024 1

[WIP]

I initially wrote this function to be used in func_version3. Later, I decided to not consider it in the running time (as the content of array just depends on its length. It can be computed once and be used in several calls of func_version3 (More on this below)

(1) I simplified the code in the provided notebook. The recursive function func_version3(n, s, x, c_arr) is actually called 2 * n times by the 6-step FFT algorithm (n is ). For instance, for an array x with size 2^22, the function is going to be called 2 * (2^11) times. So, we are going to use the precomputed array 2^12 times in 6-step function.

To properly show the impact of using precomputed array, I can write a function to call func_version3(n, ...) n times (Like what we have in 6-step FFT). This should help us get better / more accurate idea about the impact of this approach on the running time.

The Colab notebook is now modified to address the point above. I also ran the code on my macOS machine and got the following result (code):

[image: running-time comparison of func_version3 and func_version4]

In BOTH functions func_version3 and func_version4, we use a precomputed array that contains the factors. In func_version4, however, we also take into account the time required for computing that array.


Now let's consider the full steps 2 & 5 (of our 6-step algorithm) rather than just their cos/sin part (code). I ran it on my MacOS machine and got the following result:

[image: running-time comparison of fft_v1, fft_v2, and fft_v3]

Note that fft_v1, fft_v2, and fft_v3 are exactly the same except for the "factor" part. In fft_v2, we use a precomputed array (the approach we took in func_version4 above). In fft_v3, we pass a factor and, based on that, we compute the other factors (the approach we took in func_version5 above).

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

If I understand correctly, the Stockham algorithm is NOT faster than scipy.signal.convolve (or they are about the same after some optimizations). Is that correct? And it also means that the Stockham algo is much slower than FFTW?

I wonder if there might be some clues in the OTFFT source code in terms of how they might have parallelized it using OpenMP. I'd be pretty happy if we could get within 2x (slower) than FFTW. I'm also guessing that the sawtooth shape observed in scipy.signal.convolve likely comes from switching between two FFT algorithms. It's reassuring to see that OTFFT is pretty stable in performance across different distances. I'm confused as to why using prange wouldn't give you the necessary speedup but that also depends on the hardware that you are using. Maybe I can find some time to test it out on my Mac with many threads.

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024

According to this Mathworks webpage, MATLAB is equipped with built-in multithreading.

The built-in multithreading feature in MATLAB automatically parallelizes computations using underlying libraries at runtime, resulting in significant performance improvements for linear algebra and numerical functions such as fft, eig, mldivide, svd, and sort. Since MATLAB does this implicitly, you do not need to enable it manually. If you would like to control the number of computational threads, you can use the maxNumCompThreads function.

A few notes:
(1) As suggested, we can use maxNumCompThreads to set the number of threads to 1, and then compare the MATLAB code (for the sliding dot product) with stumpy.core.sliding_dot_product. We use the MATLAB online server.

(2) We can implement the 6-step / 8-step fft, parallelize it, and then compare it with the MATLAB code. We use the MATLAB online server.

(3) Should we expect the 6-step and 8-step implementations (without multithreading) to be faster than scipy.fft.fft? I am asking because scipy.fft.fft is written in C++, and I wonder whether it makes sense to expect it to be outperformed by a 6-step / 8-step fft written in Python. If not, then we can just stick to (2) and see how it performs compared to the MATLAB code.

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

I implemented the 8-step algo as well. I ran it on MATLAB online server, and got this:

This is 8-step with njit and parallel=True? Any idea if the 8-step is faster than scipy?

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

It's surprising that it jumps so dramatically from 2^24 to 2^27. I wonder if FFTW suffers from this too or if it still scales linearly. Technically, this algo should be n*log(n) in computational complexity.

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024

Will check the behaviour of MATLAB FFTW for large-size inputs.

It seems that MATLAB FFTW shows similar behaviour! (see below)


Also, the following figures confirm that six-step / eight-step FFT outperform FFTW in MATLAB

[image: six-step / eight-step FFT vs. MATLAB FFTW, full range]

Let's zoom in for log2(len(T)) in range(2, 16):
[image: zoomed view, log2(len(T)) in range(2, 16)]

How about p in range(2, 21)?
[image: zoomed view, p in range(2, 21)]

MATLAB

% MATLAB
fft_running_time = [];
pmax = 27;

rng(1,"twister");
T = rand(1, 2^pmax);
save('T_input.mat', 'T');  % we will use this for the performance test of our FFT

p_list = 2:1:pmax;
for p = p_list
    idx = 2^p;
    data = T(1, 1:idx);
 
    t = tic; 
    X = fft(data);
    fft_running_time(end+1) = toc(t);
end

Python

# Python
import time

import numpy as np
from scipy.io import loadmat

fft(np.random.rand(4), is_rfft=False, y=None)  # dummy call to trigger JIT compilation

fft_running_time_python = []
T = np.array(loadmat('./T_input.mat')['T'][0]).astype(np.float64)
p_list = range(2, 28)
for p in p_list:
    idx = 2 ** p
    data = T[:idx]
 
    tic = time.time()
    X = fft(data, is_rfft=False, y=None)
    toc = time.time()
    fft_running_time_python.append(toc - tic)

The fft function used here takes advantage of rfft but returns the full-length fft at the end. The Python fft code used for this exact performance comparison is provided here:

https://gist.github.com/NimaSarajpoor/5cf2d4f0b0aad5e0898651dddff35c17

@seanlaw
I created and pushed a new notebook to PR #939. I am now thinking of adding functions from the link above to it, one by one. What do you think?

Maybe we should just ignore the recent pushes to the new notebook, and I can start over by adding one function at a time.

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

Thanks @NimaSarajpoor. Can you please provide the raw timing data in a table? I can see that we are doing better at longer time series lengths and with 8 threads. This is great. However, I think the majority of use cases will likely be in the 2^20 range (what do you think?) and so I'd like to see how close we are to MATLAB-FFTW. If we are several orders of magnitude off (in the <= 2^20 range) in timing then we still have work to do. If we are within 2x then I am okay with that.

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

If it is too much, you may just take a look at the figure provided at the bottom of this comment

This is great! Thank you

TL; DR: It does not look good

Hmm, maybe I am misunderstanding the data but I'm not sure I agree. For the most part, it looks like we are within around 2x of MATLAB for lengths longer than 2^9. This is great and I do not expect to beat MATLAB/FFTW without a ton of fine tuning. Sure, for lengths that are shorter than 2^9, MATLAB is running at 10^-6 and Python is running at 10^-5 but this is basically instantaneous. Yes, the time can add up if we need to do millions+/billions+ of FFT calculations but I am very optimistic as we are actually much, much better than I had expected (thanks to your work)!

I have a question. Why is Python rfft not faster than Python fft?

I think that I need to dig into section 7 and then check the source code. I have not checked it yet.

For lengths shorter than 2^20, my suspicion is that the majority of our Python time is spent in non-FFT overhead (e.g., creating an array). I will take a look

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

Maybe I am not taking advantage of parallelization in the proper way. I am thinking of profiling the blocks of code in this function to see if that gives me some clue. Do you think that is a good approach?

I started looking at the function but I deleted my comments because I don't think my (non-parallel-related) suggestions would help in a significant way to reduce the time given that the timings are so fast. Overall, I think you should post it in the numba discussion (or maybe posting it as a numba GitHub issue would be better) to see if they could help us improve the performance at the LLVM level (as they can see how things get JIT compiled) and possibly at the prange level.

If you can do some profiling (for various input lengths) without njit (i.e., purely in Python) then I think that is a good start. Then, ask the numba devs.

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024

As suggested in 7.2.2, I started to consider "loop collapsing" for some steps of the 6-step fft algorithm. Although I did not go through listing 7.4 in detail, I ran an experiment: for now, I applied "loop collapsing" to two steps by using the "load balancing" approach from stumpy.stump (i.e., balancing the load across the threads by considering the number of elements in the diagonals).

For instance, I replaced this part

for k in prange(n):
    kn = k * n
    for p in range(k + 1, n):
        i = k + p * n
        j = p + kn
        x[i], x[j] = x[j], x[i]

with this:

import numba
import numpy as np

from stumpy import core

n_threads = numba.config.NUMBA_NUM_THREADS

diags = np.arange(0, n - 1)
ndist_counts = core._count_diagonal_ndist(diags, 1, n - 1, n - 1)
diags_ranges = core._get_array_ranges(ndist_counts, n_threads, False)

for thread_idx in prange(n_threads):
    start = diags_ranges[thread_idx, 0]
    stop = diags_ranges[thread_idx, 1]
    for diag_idx in range(start, stop):
        g = 1 + diags[diag_idx]  
        for k in range(0, min(n, n - g)):
            p = k + g

            i = k + p * n
            j = p + k * n
            x[i], x[j] = x[j], x[i]
[image: running-time results]

A few notes:
(1) I will try this again to make sure I did not make any mistake
(2) There is another step that can have loop collapsing. I have not changed it yet. For now, I just tried to balance the load for two steps.
(3) While the running time gets better for very large time series, I noticed the running time became slightly worse when the time series is short.
(4) Both core._count_diagonal_ndist and core._get_array_ranges create arrays, so that can increase the running time. We might be able to modify those functions for our fft use case.
(5) For a time series with length 2^p, the rfft trick halves the length, so the computation is for length 2^(p-1). And n in diags = np.arange(0, n - 1) will be 2^r, where r = int((p - 1) / 2).

So, for a time series with length 2^4 = 16, the rfft will apply fft on length 2^3 = 8, and n becomes 2. In such a case, the overhead of prange might become more than the computation itself (?!). The following two posts might have some clues:

[As an aside, I have been doing some profiling. Will provide an update]

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

(4) Both core._count_diagonal_ndist and core._get_array_ranges create arrays. So, that can increase the running time. We might be able to modify those functions for our fft use case.

It seems strange to call the core._count_diagonal_ndist function as FFT has no concept of "diagonals". I wouldn't mix the two (i.e., I wouldn't modify those functions and, instead, would create new functions that are more direct - though, if I understand correctly, you just need to split/divide n up into n_threads chunks, so this feels like overkill).

Would something simpler like this work?

import numba

n_threads = numba.config.NUMBA_NUM_THREADS
chunk_size = n // n_threads

for thread_idx in prange(n_threads):
    start = thread_idx * chunk_size
    stop = min(start + chunk_size, n)
    for idx in range(start, stop):
        g = 1 + idx
        for k in range(0, min(n, n - g)):
            p = k + g

            i = k + p * n
            j = p + k * n
            x[i], x[j] = x[j], x[i]

I have no idea what idx is representing... Although, even this seems overly complicated (as evidenced by the triple-nested for-loop)! Again, I think we should post our code to the numba GitHub issues to see if there are any obvious (big) things that can be changed in the previous code to speed it up. Otherwise, we might waste a lot of time adding small gains when there might be something more obvious at the LLVM level.

So, for time series with length 2^4 = 16, the rfft will apply fft on length 2^3=8, and n becomes 2. In such case, the overhead of prange might become more than the computation itself (?!). The following two posts might have some clues:

I think it might be a good idea to allow the number of threads to be specified as a function parameter (with the default being n_threads = -1, i.e., use ALL available threads; otherwise, add a check to make sure we don't specify MORE than the available number of threads). OR, we might have some heuristic approach that uses only a single thread for smaller lengths.
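Something along these lines, perhaps (a minimal sketch; the parameter handling and the 2^12 cutoff, taken from the Single/Parallel crossover reported later in this thread, are assumptions):

import numba

def fft(T, n_threads=-1):
    # clamp the requested thread count to what is actually available
    max_threads = numba.config.NUMBA_NUM_THREADS
    if n_threads == -1 or n_threads > max_threads:
        n_threads = max_threads
    # heuristic: run small inputs single-threaded to avoid prange overhead
    if len(T) <= 2**12:
        n_threads = 1
    numba.set_num_threads(n_threads)
    # ... call the njit-compiled FFT ...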

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024

Let's define a few terms:

  • T : an input array filled with real-valued elements
  • Parallel: Means using all threads
  • Single: Means using 1 thread AND NOT using prange
  • full_fft: Means that the output has length len(T), computed by applying fft on T
  • rfft: Means that the output has length len(T) // 2 + 1, computed by applying fft on X, where X = T[::2] + 1j * T[1::2] (and some post-processing)
  • full_fft_with_rfft: Means that the output has length len(T), computed by using the rfft trick (i.e. expanding the output of rfft to obtain an output that is equivalent to full_fft)

Our fft is based on the OTFFT 6-step algorithm.


Intuitively, I think we should see the following relationship regarding the running time:

rfft < full_fft_with_rfft < full_fft

Let's first check this in both Single and Parallel mode. The following results are obtained by running code on MATLAB online server.

Single

[image: running time, Single mode]

Note that the performance of rfft and full_fft_with_rfft are very close. This is because almost all of the computational load is on the shoulders of rfft. Let's consider rfft as the base, and let's compute the following ratios for the running time:

  • full_fft_with_rfft / rfft
  • full_fft / rfft
[image: running-time ratios relative to rfft]

Note that the performance of full_fft seems to be 50% slower than rfft.

Parallel

[images: running time and running-time ratios, Parallel mode]

Python OTFFT-6step with Single+Parallel, versus MATLAB

So far, we have seen that computing the fft with the rfft trick is faster than computing the fft directly (i.e., applying the OTFFT 6-step algo on the full T directly). So, in what follows, we do NOT consider full_fft in our comparison. Since MATLAB computes the full fft (i.e., the output has the same length as the input T), we compare its performance against our full_fft_with_rfft case (and NOT just rfft, in which the output has length len(T) // 2 + 1). Fortunately, as observed, there is very little cost in computing the full fft using the rfft trick.

[image: full_fft_with_rfft, Single vs. Parallel, vs. MATLAB]

The intersection of Single (blue-curve) and Parallel (red-curve) is ~ 2^12 in our settings.


Now, let's combine blue and red curves and keep the smallest value for each length of input. And, let's add scipy.fft.fft to the plot...

[image: best of Single/Parallel vs. scipy.fft.fft]

And, even if I use scipy.fft.rfft, we will see:

[image: best of Single/Parallel vs. scipy.fft.rfft]

Note: In this latter case, the output of scipy.fft.rfft is equivalent to the rfft. In other words, the little cost of converting the rfft output to the full fft has not been considered yet for the scipy.fft.rfft case.

My conclusion: OTFFT-6step Single+Parallel combo seems to have a decent performance.


Code

The code can be found here: https://gist.github.com/NimaSarajpoor/d08b8a574a9cae92146fae03e8142e30

For the Single mode, just replace prange with range and modify the njit decorators by removing parallel=True

Code to check performance

MATLAB

LASTN = maxNumCompThreads(8);
disp(strcat("LASTN: ",num2str(LASTN)));

Nth = maxNumCompThreads;
disp(strcat("MAX N threads: ",num2str(Nth)));

fft_running_time = [];
pmax = 27;

rng(1,"twister");
T = rand(1, 2^pmax);
save('T_input.mat', 'T')

p_list = 2:1:27;
i_range = 1:1:20;
fft(T(1, 1:8));  % warm-up
for p = p_list
    idx = 2^p;
    data = T(1, 1:idx);
    
    t = tic; 
    for i = i_range
        X=fft(data);
    end
    fft_running_time(end+1) = toc(t);
  
end

Python

import time

import numba
import numpy as np
from scipy.io import loadmat

numba.set_num_threads(8)
n_threads = numba.get_num_threads()

FFT_FUNCTION(np.random.rand(4))  # dummy call to trigger JIT compilation

running_time = []
T = np.array(loadmat('./T_input.mat')['T'][0]).astype(np.float64)

p_list = range(2, 28)
for p in p_list:
    idx = 2 ** p
    data = T[:idx]

    tic = time.time()
    for _ in range(20):
        X = FFT_FUNCTION(data)
    toc = time.time()
    running_time.append(toc - tic) 
    

Also, in the code above, FFT_FUNCTION can be any of the following functions:

  • rfft
  • fft_full_with_rfft
  • fft_full
  • scipy.fft.fft
  • scipy.fft.rfft

Btw, I wrapped everything in an njit-decorated function. So, in contrast to my previous implementations, this time the arrays were created inside an njit-decorated function. (Not sure how much that affects the performance, though! I haven't done any profiling yet regarding that matter.)

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

full_fft_with_rfft: Means that the output has length len(T), computed by using the rfft trick (i.e. expanding the output of rfft to obtain an output that is equivalent to full_fft)

Okay, correct me if I'm wrong but, from a STUMPY standpoint, I don't actually care about full_fft_with_rfft, right? Since my input is ALWAYS real valued, I only care about rfft, right? I mean, I understand that it might be "interesting" to use a trick to "expand the output of rfft to obtain an output that is equivalent to full_fft" but the "complex" part of the full fft is useless for STUMPY. I just want to make sure I'm not missing anything (i.e., full_fft_with_rfft doesn't add any extra value)?

This is because almost all of the computational load is on the shoulders of rfft.

Right. Either way, it is rfft underneath the hood and full_fft_with_rfft is (I'm oversimplifying it) rfft + "few additional steps that don't require a lot of computation". Is that correct? Is that how I should be thinking about it?

Note that the performance of full_fft seems to be 50% slower than rfft.

This is interesting. It means that if one actually cared about the "complex" part then it would still be "faster" to simply compute the rfft and then expand to full_fft essentially "for free". Of course, we don't need the complex part for STUMPY.

Since MATLAB computes the full fft (i.e., the output has the same length as the input T), we compare its performance against our full_fft_with_rfft case (and NOT just rfft, in which the output has length len(T) // 2 + 1).

This is to say that rfft would, in theory, be slightly faster than full_fft_with_rfft anyways? I mean, I understand that you want to compare apples-to-apples in terms of the output in order to be "fair" but, again, I should really be caring that rfft is faster than MATLAB fft (even though it outputs the complex part).

Fortunately, as observed, there is very little cost in computing the full fft using the rfft trick.

Right. Minimally, we should add rfft. If possible (I haven't looked at the code yet), maybe we can add a fft function that adds the additional steps as some sort of wrapper around rfft?

My conclusion: OTFFT-6step Single+Parallel combo seems to have a decent performance.

I agree! There are some parts (around 2^17 to 2^21) where, for some reason, OTFFT and SciPy are around 2x slower than MATLAB. Any idea why that might be?

Does your current OTFFT implementation create new arrays every time? Or is there a way to implement it so that one or more arrays can be reused? I'm just thinking about DAMP where rfft would be called many times and we might be able to save some array-creation overhead time. For a single or few calls to rfft, it certainly doesn't matter. But for thousands or millions of calls, it can add up (maybe this is a numba optimization/compilation question?).
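As a toy illustration of the reuse pattern (rfft_into is a hypothetical helper; numpy's rfft still allocates internally, but a numba implementation could write directly into the caller-owned buffer and avoid even that):

import numpy as np

def rfft_into(T, y):
    # write the result into a caller-owned buffer so that repeated calls
    # on equal-length inputs create no new output arrays
    y[:] = np.fft.rfft(T)
    return y

n = 2**9
y = np.empty(n // 2 + 1, dtype=np.complex128)
for _ in range(1000):
    T = np.random.rand(n)
    rfft_into(T, y)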

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

Just to make sure we are on the same page, I would like to mention that the output returned by the real fft (rfft) contains a complex part. This rfft output is just the first half of full_fft(T). The second half is computed by "conjugating" the first half. Do we need the second half? See below.

I think I have a conceptual gap/misunderstanding. I didn't realize that rfft still produces an output that has BOTH a real component and a complex component. Naively, I assumed that rfft produces real output only. I think I'm missing the point on what/how we benefit from rfft (as compared to full_fft). Would you mind educating me please?

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024

I had the same misunderstanding. I thought that the term real in real FFT (rfft) means the output is purely real numbers. I later realized that the values are of type complex.

For a given input T, its full-length fft (let's call it F) has length n = len(T), and its elements are complex. So, one needs to compute n values in F. However, if T is filled with real-valued numbers, then there exists a nice relationship that says the second half of F can be computed by conjugating (and mirroring) its first half. So, in that case, we just need to compute the elements in F[: n // 2 + 1]. In other words, we avoid doing redundant computation for the second half of F.
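For example, a minimal numpy check of this conjugate symmetry:

import numpy as np

n = 8
T = np.random.rand(n)  # real-valued input
F = np.fft.fft(T)      # full-length fft

# the second half of F is the mirrored conjugate of F[1 : n // 2]
np.testing.assert_allclose(F[n // 2 + 1 :], np.conj(F[1 : n // 2][::-1]))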

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

rfft: Means that the output has length len(T) // 2 + 1, computed by applying fft on X, where X = T[::2] + 1j * T[1::2] (and some post-processing)

I'm kind of confused here because it looks like the input for rfft is complex? However, above, you said:

However, if T is filled with real-valued numbers, then there exists this nice relationship that says the second half of F can be computed by conjugating (and mirroring) its first half. So, in such case, we just need to compute the elements in F[: n // 2 + 1]. In other words, we avoid doing redundant computation for the second half of F.

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024

rfft: Means that the output has length len(T) // 2 + 1, computed by applying fft on X, where X = T[::2] + 1j * T[1::2] (and some post-processing)

I'm kind of confused here because it looks like input for rfft is complex?

Note that the input for the function rfft is T, and this function calls fft to compute the fft of X. See below:

def rfft(T):
    """
    input T: a real-valued 1D array, with length n
    out: is equivalent to F[:n//2 + 1], where F = fft(T)
    """
    X = T[::2] + 1j * T[1::2]
    out = fft(X)  # a full-length fft, but on an array of length n // 2
    post_process(out)

    return out

Please let me know if further clarification is needed.

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024

@seanlaw

I think I'm missing the point on what/how we benefit from rfft (as compared to full_fft).

I previously said:

In other words, we avoid doing redundant computation for the second half of F.

But, I think we should look at it from the perspective that, in rfft, we need to compute fft(X) instead of fft(T) (see the function rfft provided in my previous comment), and since len(X) is half of len(T), rfft is faster.

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

Please let me know if further clarification is needed.

So, from what I can tell in:

X = T[::2] + 1j * T[1::2]

Here, we create a new "complex" array where the even numbered elements of T are stored in the real part of the array and the odd numbered elements of T are stored in the imaginary space. Thus, our overall "complex" X array is half the length of the "complex" array used in full fft and then, when we do fft(X), there is less work to be done than if we do fft(T). Is that correct?

Btw, is this (along with post_process) what you are referring to as the "rfft trick"?

So, what is unclear to me is how does the output of post_process(fft(X)) differ from fft(T)? I know that they are different due to the length of the complex input arrays but this also implies that there are parts of fft(T) that are useless to us. Are there parts of fft(X) that are useless or are BOTH the real and imaginary outputs from fft(X) necessary for obtaining (via post_process) the "real" part that is equivalently produced in fft(T) (i.e., (post_process(fft(X))).real == fft(T).real)? Or does fft(T).real contain redundant information that needs to get truncated (I vaguely remember seeing QT.real[m - 1 : n] in our sliding_dot_product function).

Perhaps, what I should really be asking is: what do the real and imaginary parts of fft(X) represent, AND what is happening in the post_process step (i.e., what real+imaginary output is it producing)?

So, if we can write a Numba-friendly function that has the same functionality as scipy.fft.irfft, then the answer to your question is: yes, we do not care. I need to think about how I can revise the algo provided in OTFFT to achieve irfft.

Yes! I completely forgot that we need the inverse function as well in order to have a full convolution. It would be great if we could also figure out irfft. To some extent, it would've been super helpful to have a scipy.fft.rconvolve (real-convolve), right?

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024

@seanlaw
I have some "rough" idea about the computation of ifft from STUMPY's standpoint. I am going to share my thoughts as that may result in creating some sparks in our minds in the future.

The sliding dot product, QT, can be computed like this:

F = fft(A) * fft(B) 
QT = ifft(F).real[slice]

where A and B are real-valued arrays based on T and Q, and slice is just a specific range of indices that are of interest to us. F is the full-length fft.

We already showed that we can reduce the computing time by taking advantage of rfft when input array has real-valued elements:

R = rfft(A) * rfft(B)
F = CONVERT_RFFT_TO_FULL_FFT(R)  

QT = ifft(F).real[slice]
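A minimal sketch of what CONVERT_RFFT_TO_FULL_FFT could look like (assuming an even n, and verified against numpy here), using the same conjugate symmetry discussed earlier:

import numpy as np

def convert_rfft_to_full_fft(R, n):
    # expand the length n // 2 + 1 rfft output into the full length n fft
    F = np.empty(n, dtype=np.complex128)
    F[: n // 2 + 1] = R
    F[n // 2 + 1 :] = np.conj(R[1 : n // 2][::-1])
    return F

A = np.random.rand(16)
np.testing.assert_allclose(
    convert_rfft_to_full_fft(np.fft.rfft(A), 16), np.fft.fft(A)
)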

Okay, now that we took care of the fft part, let's see what we can do regarding the ifft part. The ifft function is:

def ifft(arr):
    arr[:] = np.conjugate(arr)  # step 1
    arr[:] = fft(arr)  # step 2
    arr[:] = np.conjugate(arr)  # step 3
    arr[:] = arr / len(arr)  # step 4
    return arr

The heavy load of computation is on fft(arr). Note that arr is complex, so we cannot take advantage of rfft here. But, I am thinking of breaking arr down into its real and imaginary parts.

So, let's say: $arr = R + jX$, then we will have the following equations that correspond to the steps provided in the ifft function.

step 1: conjugating arr
$arr = (R + j X)^{*}$
$arr = R - j X$

step 2: computing fft of arr
$fft(arr) = fft(R) - j fft(X)$

Let's use the following naming: F=fft(R) and G=fft(X). Therefore:

$fft(arr) = (F.real + jF.imag) - j (G.real + jG.imag)$

step 3: apply conjugate
$fft(arr)^{*} = {(F.real + jF.imag)}^{*} - {j}^{*} \times {(G.real + jG.imag)}^{*}$
$fft(arr)^{*} = (F.real - jF.imag) - (-j) \times (G.real - jG.imag)$
$fft(arr)^{*} = (F.real - jF.imag) + j \times (G.real - jG.imag)$
$fft(arr)^{*} = F.real - jF.imag + j G.real + G.imag$
$fft(arr)^{*} = F.real + G.imag - j (F.imag - G.real)$

Note that at the end, to compute QT, we only care about the real part of it. So, we just care about: F.real and G.imag

Note that F=fft(R) and we can use rfft-trick here (to reduce the computing time) because R itself is real.
Note that G=fft(X) and we can use rfft-trick here (to reduce the computing time) because X itself is real.

So, we are halving the length of the arrays (when we use rfft), but we need to apply the rfft trick twice, versus doing fft only once on arr = R + jX, i.e., computing fft(arr). We may test this approach on very long time series data to check its performance.
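A quick numpy check of the derivation above (only F.real and G.imag are needed to recover the real part of ifft(arr)):

import numpy as np

arr = np.random.rand(16) + 1j * np.random.rand(16)
R, X = arr.real, arr.imag

F = np.fft.fft(R)  # R is real, so this could use the rfft trick
G = np.fft.fft(X)  # X is real, so this could use the rfft trick

# the real part of ifft(arr), recovered from F.real and G.imag alone
comp = (F.real + G.imag) / len(arr)
np.testing.assert_allclose(comp, np.fft.ifft(arr).real)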

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024

I checked the performance of both scipy.fft.ifft and scipy.fft.irfft. I realized the latter is faster (on my system) when the length is > 2^10.

This old MATLAB code shows the implementation of "irfft". Finding other sources can also be helpful in understanding the algorithm of irfft.

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024

IRFFT

import numpy as np
import scipy.fft

# Note: scipy.fft.fft cannot be called from inside an njit function, so the
# @njit(fastmath=True) decorators are omitted here; in the numba version,
# this scipy call would be replaced with our own compiled fft
def _ifft(x):
    x = x.copy()
    n = len(x)
    x[:] = np.conjugate(x)
    x[:] = scipy.fft.fft(x)
    x[:] = np.conjugate(x / n)

    return x

def ifft(x):
    return _ifft(x)

def _irfft(y):
    m = len(y)
    if m == 1:
        return np.real(y)

    n = 2 * m - 2  # default output length
    v = y.copy()
    v[m - 1] = np.real(v[m - 1])  # force the Nyquist element to be real
    t = -0.5j * np.exp((2j * np.pi / n) * np.arange(m))
    v[:] = (t + 0.5) * (np.conj(np.flipud(v)) - v) + v
    v[:-1] = _ifft(v[:-1])

    out = np.zeros(n)
    out[0:n:2] = np.real(v[:-1])
    out[1:n:2] = np.imag(v[:-1])

    return out

def irfft(y):
    return _irfft(y)

And to test it:

for p in range(3, 11):
    print(f'p --> {p}')
    n = 2 ** p
    t = np.random.rand(n)
    print('t: \n', t)
    
    R = scipy.fft.rfft(t)
    ref = scipy.fft.irfft(R)
    comp = irfft(R)
    
    np.testing.assert_almost_equal(ref, comp)

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

Why is there a dashed black line? What does it represent?

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024

Why is there a dashed black line? What does it represent?

Nothing :) I just took the minimum of the Parallel and Single modes. Then, I plotted each part. Then, I realized one ended at x=11 and the other started at x=12, and there was a gap :) So, just to show it as a continuous plot line, I used that dashed black line.

I now think a "scatter plot" would have been a better choice. Do you think I should just remove that dashed line, or should I use a scatter plot?

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024

[WIP]

[Warning]
Previously, I mentioned that it is not clear how one can set the number of threads in MATLAB. In addition to that, I just discovered that the [average] performance of MATLAB code gets better as the number of iterations increases (generally speaking).

Examples:

[images: MATLAB fft running time vs. number of iterations]

(I found this when I was trying to re-run the code and used a higher number of iterations. I noticed that the performance ratio changed.)

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

In addition to that, I just discovered that the [average] performance of MATLAB code gets better as the number of iterations increases (generally speaking).

So, if MATLAB uses FFTW then what might be happening is that MATLAB is creating FFTW "plans" in the background that allows it to improve the computation speed: https://www.fftw.org/fftw3_doc/Using-Plans.html

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

This cache-oblivious transpose algorithm (and matrix multiplication algorithm) might be useful for future reference as it relates to blocking/tiling
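For reference, a minimal numba sketch of the blocking/tiling idea (not the cache-oblivious recursion itself; the blocksize of 32 matches the value discussed later in this thread):

import numpy as np
from numba import njit

@njit
def tiled_transpose(x, n, blocksize=32):
    # x is an n-by-n matrix stored in a flat, row-major 1D buffer; swapping
    # within square tiles keeps both read and write accesses inside a
    # cache-friendly window
    for ib in range(0, n, blocksize):
        for jb in range(ib, n, blocksize):
            for i in range(ib, min(ib + blocksize, n)):
                for j in range(max(jb, i + 1), min(jb + blocksize, n)):
                    x[i * n + j], x[j * n + i] = x[j * n + i], x[i * n + j]

n = 64
x = np.arange(n * n).astype(np.float64)
tiled_transpose(x, n)
np.testing.assert_allclose(x, np.arange(n * n).reshape(n, n).T.ravel())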

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

@NimaSarajpoor That's awesome! Great work. It's nice to see that tiling works so well to help reduce cache misses! Also, there's no prange/multithreading either which is good. It's really great and rewarding to be able to learn and leverage published academic research like this. I really wonder if we can rethink tiling for the _stump function based on this new knowledge.

Should we really turn on fastmath? If so, we will need to remember to correct the final output if the original inputs are not finite, right? So a _fft function assumes finite inputs (and uses fastmath), but any function that calls _fft will need to handle/correct non-finite values first.

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024

@seanlaw

It's nice to see that tiling works so well to help reduce cache misses

Yes! This is the first time I have seen the benefit of considering the cache with my own eyes 😅 Thank you for the guidance! That was a very good lesson for me! I still need to study more about it, but I think that was a very good experience!!!

I really wonder if we can rethink tiling for the _stump function based on this new knowledge.

Right! We can revisit it.

Should we really turn on fastmath? If so, we will need to remember to correct the final output if the original inputs are not finite right?

Good question! I am going to check a few things:

(1) what will be the performance after removing fastmath? (or just considering certain flags?)
(2) what should be the output of FFT when input contains inf? Going to check the output of scipy fft & numpy fft.

Btw, STUMPY cares about the sliding dot product. So, the private _fft that assumes the input data is already preprocessed and has no inf/nan should be enough. What do you think? Should we still consider adding an fft function that handles inf / nan data?

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024

Good question! I am going to check a few things:

(1) what will be the performance after removing fastmath? (or just considering certain flags?) (2) what should be the output of FFT when input contains inf? Going to check the output of scipy fft & numpy fft.

Btw, STUMPY cares about sliding-dot-product. So, the private _fft that assumes input data is already preprocessed and has no inf/nan should be enough. What do you think? Should we still consider adding fft function that handles inf / nan data ?

(1) Removing fastmath does not affect the performance of the transpose function func6 provided in this comment. This is good!

(2) Let F denote the RFFT of a 1D array; then F.real becomes all inf (nan) if there is at least one inf (nan) in the input array. And, the Inverse RFFT (i.e., IRFFT) of F becomes all nan / inf. We can try to create an fft function that handles this case (i.e., when one or more elements are nan or inf) but this is not going to be useful from STUMPY's standpoint. What do you think, @seanlaw?
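For instance, the nan case is easy to demonstrate with numpy:

import numpy as np

T = np.random.rand(8)
T[3] = np.nan
# a single nan propagates to every element of the rfft output
print(np.fft.rfft(T).real)  # all nan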

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

We can try to create a fft function that handles this case (i.e. when one/more element is nan or inf) but this is not going to be useful from STUMPY's standpoint. What do you think, @seanlaw ?

I understand. If we only do _fft then we need to make sure that this private function clearly states the assumptions and then reminds the user that:

F.real becomes all inf (nan) if there is at least one inf (nan) in the input array. And, the Inverse RFFT (i.e., IRFFT) of F becomes all nan / inf.

In other words, the _fft docstring tells you exactly what you need if you want a public function that can handle any inputs. Of course, it probably takes a lot less effort just to write the fft public function than to add additional documentation to _fft

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

I also forgot to ask if you've played around with blocksize = 32, or if it is possible that this choice is hardware dependent? If so, could/should we make this a STUMPY config variable, so that we could possibly allow the user to run an optimization to find the best blocksize for their hardware?

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024

Of course, it probably takes a lot less effort just to write the fft public function than to add additional documentation to _fft

Makes sense! Got the point. I will address it after wrapping up the private function.


I also forgot to ask if you've played around with blocksize = 32 or if it is possible that this choice is hardware dependent? If so, could/should we make this a STUMPY config variable which we could possibly allow the user to run an optimization to find the best blocksize for their hardware?

I recently thought about this, as I know you usually prefer to use a config variable, if possible / needed, rather than having a constant value in the code. Regarding the transpose, I tried blocksize values in {16, 32, 64, 128} on my MacOS machine and on Google Colab. I noticed no significant improvement. Will check again, and share the notebook.


[WIP] Notebooks for profiling different components of our FFT implementation

Also, I have been profiling EACH block of code, to figure out what works best per block.

For instance, regarding x[:] = T[::2] + 1j * T[1::2] (let's call it "compress T"), I noticed that we can simply use a for-loop with fastmath=True, and we can avoid using prange for T up to length 2^24 (see this notebook). I also tried it on my MacOS machine with 8 threads and reached a similar conclusion. (We might get a different conclusion, though, for a different number of threads; something I need to check, and see whether it needs to be considered in our implementation or not.)
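A minimal sketch of that for-loop variant (the function name _compress_T is just for illustration):

import numpy as np
from numba import njit

@njit(fastmath=True)
def _compress_T(T, x):
    # pack the even elements of T into the real part and the odd elements
    # into the imaginary part, i.e., x = T[::2] + 1j * T[1::2]
    for i in range(len(x)):
        x[i] = T[2 * i] + 1j * T[2 * i + 1]
    return x

T = np.random.rand(2**10)
x = np.empty(len(T) // 2, dtype=np.complex128)
np.testing.assert_allclose(_compress_T(T, x), T[::2] + 1j * T[1::2])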

Compress T
Transpose
Apply fft on n blocks, each with size n
Transpose_with_Twiddle_Factor
8step-function-preprocessing-part
8step-function-postprocessing-part

(To be continued)


Also, another thing I have been trying out is to use precomputed factors instead of computing each factor in the fft algorithm. (This is based on something I read somewhere before(?); it is one of the ways to further improve an FFT implementation.) The values of these factors depend ONLY on the length of the input.

For instance, let's take a look at the following block of code that is used in the last stage of the rfft algorithm.

import math

# n = len(T)
m = n // 2

theta = math.pi / m
for k in range(1, m // 2):
    factor = math.cos(k * theta) - 1j * math.sin(k * theta)

We can see a similar part in the 8-step fft algorithm, which is called when T has length $2^{even}$.

theta = math.pi / (m // 2)   # m is len(T) // 2
for i in range(m // 2):
   factor = math.cos(i * theta) - 1j * math.sin(i * theta)

We can save the factors in an array of length m // 2. The factor in the second block is just the square of the corresponding factor in the first block (this particular part may not necessarily be faster than the math.cos + 1j * math.sin operation!). I am not sure if it makes sense to precompute the factors, save them in a 1D array, and reuse them when fft is called. I am trying to see if I can use the same trick in the recursive function that is used in the 6-step fft algorithm.
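A minimal sketch of the precomputation idea (precompute_factors is a hypothetical helper):

import numpy as np

def precompute_factors(m):
    # e^{-j k theta} for k = 0, ..., m // 2 - 1, with theta = pi / m;
    # these values depend only on m, so they can be computed once and reused
    k = np.arange(m // 2)
    theta = np.pi / m
    return np.cos(k * theta) - 1j * np.sin(k * theta)

factors = precompute_factors(8)
# the factors in the second block above are just the squares of these
factors_squared = factors * factors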

Note: It is cleaner for sure if we can reach a reasonable performance without this!

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

Precomputing the array has not shown promising results.

Hmm, this seems surprising. Can you share a small, self-contained, reproducible example with only the sine/cosine parts?

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

I created three versions and put them all in this notebook. The first version is the original version. The second version takes advantage of the identity $e^{j\theta} = \cos(\theta) + j\sin(\theta)$. In the third version, we pass an array argument filled with precomputed values. No performance gain is observed in the third version. I think we are dealing with memory access time here.

Here a few observations for you to consider:

  1. It looks like fill_c_array isn't even called by func_version3 so is there even a need to use njit for fill_c_array rather than using straight numpy? Otherwise, why not consider prange for filling the contents of the array?
  2. You've split fill_c_array into two functions (a public function and a private function) and will incur the cost of invoking the private function. This may slow things down ever so slightly (test this with and without njit) but probably not significant enough to matter. Of course, now you are compiling two functions instead of one.
  3. So, when I was talking about result caching (i.e., "memoization" with lru_cache), I was really referring to the point that values were constantly being recomputed. However, it looks like the c_array is only being computed once, the values are used only once, and that same c_array is not being reused for other lengths of n. This is inefficient. Zooming out a bit and thinking about the bigger picture (DAMP or MADRID), we should remember that we'll be calling FFT many, many times but for different lengths of n (that are strictly a power of 2), which means that all of our possible c_arrays would/should have n in this same range too. It makes no sense to precompute the array once and then throw it away and never use it again. Otherwise, we should test the performance of re-using the c_array. Maybe I misunderstood something?
  4. Is the for loop in _fill_c_array the same as np.power(c_thata, np.arange(m))?

from stumpy.

seanlaw avatar seanlaw commented on June 12, 2024

Btw is c_thata a typo and should be c_theta?

from stumpy.

NimaSarajpoor avatar NimaSarajpoor commented on June 12, 2024

Right!😅 Thanks for catching it! Will fix it.

from stumpy.
