clmathlibraries / clfft

a software library containing FFT functions written in OpenCL

License: Apache License 2.0

C++ 94.05% C 3.33% Python 1.47% CMake 1.12% Objective-C 0.03%

clfft's Introduction

clFFT

clFFT is a software library containing FFT functions written in OpenCL. In addition to GPU devices, the library also supports running on CPU devices to facilitate debugging and heterogeneous programming.

Pre-built binaries are available here.

What's New

Support for powers of 11&13 size transforms
Support for 1D large size transforms with no extra memory allocation requirement with environment flag CLFFT_REQUEST_LIB_NOMEMALLOC=1 for complex FFTs of powers of 2,3,5,10 sizes

Note

clFFT requires platform/runtime that supports OpenCL 1.2

Introduction to clFFT

The FFT is an implementation of the Discrete Fourier Transform (DFT) that makes use of symmetries in the DFT definition to reduce the computational cost from O(N^2) to O(N log2(N)) when the sequence length N is the product of small prime factors. Currently, there is no standard API for FFT routines. Hardware vendors usually provide a set of high-performance FFTs optimized for their systems, and no two vendors employ the same interfaces for their FFT routines. clFFT provides a set of FFT routines that are optimized for AMD graphics processors but are also functional on CPUs and other compute devices.

The clFFT library is an open source OpenCL library implementation of discrete Fast Fourier Transforms. The library:

provides a fast and accurate platform for calculating discrete FFTs.
works on CPU or GPU backends.
supports in-place or out-of-place transforms.
supports 1D, 2D, and 3D transforms with a batch size that can be greater than 1.
supports planar (real and imaginary components in separate arrays) and interleaved (real and imaginary components as a pair contiguous in memory) formats.
supports dimension lengths that can be any combination of powers of 2, 3, 5, 7, 11 and 13.
supports single and double precision floating point formats.

clFFT library user documentation

Library and API documentation for developers is available online as a GitHub Pages website

Google Groups

Two mailing lists exist for the clMath projects:

[email protected] - group whose focus is to answer questions on using the library or reporting issues
[email protected] - group whose focus is for developers interested in contributing to the library code

API semantic versioning

Good software is typically the result of a loop of feedback and iteration; software interfaces no less so. clFFT follows the semantic versioning guidelines. The version number used is of the form MAJOR.MINOR.PATCH.

clFFT Wiki

The project wiki contains helpful documentation, including a build primer

Contributing code

Please refer to and read the Contributing document for guidelines on how to contribute code to this open source project. The code in the /master branch is considered to be stable, and all pull-requests must be made against the /develop branch.

License

The source for clFFT is licensed under the Apache License, Version 2.0

Example

The following example shows how to use clFFT to compute a simple 1D forward transform. For brevity, the return codes stored in err are not checked.

#include <stdlib.h>

/* No need to explicitly include the OpenCL headers */
#include <clFFT.h>

int main( void )
{
    cl_int err;
    cl_platform_id platform = 0;
    cl_device_id device = 0;
    cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
    cl_context ctx = 0;
    cl_command_queue queue = 0;
    cl_mem bufX;
	float *X;
    cl_event event = NULL;
    int ret = 0;
	size_t N = 16;

	/* FFT library related declarations */
	clfftPlanHandle planHandle;
	clfftDim dim = CLFFT_1D;
	size_t clLengths[1] = {N};

    /* Setup OpenCL environment. */
    err = clGetPlatformIDs( 1, &platform, NULL );
    err = clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL );

    props[1] = (cl_context_properties)platform;
    ctx = clCreateContext( props, 1, &device, NULL, NULL, &err );
    queue = clCreateCommandQueue( ctx, device, 0, &err );

    /* Setup clFFT. */
	clfftSetupData fftSetup;
	err = clfftInitSetupData(&fftSetup);
	err = clfftSetup(&fftSetup);

	/* Allocate host & initialize data. */
	/* Only allocation shown for simplicity. */
	X = (float *)malloc(N * 2 * sizeof(*X));

    /* Prepare OpenCL memory objects and place data inside them. */
    bufX = clCreateBuffer( ctx, CL_MEM_READ_WRITE, N * 2 * sizeof(*X), NULL, &err );

    err = clEnqueueWriteBuffer( queue, bufX, CL_TRUE, 0,
	N * 2 * sizeof( *X ), X, 0, NULL, NULL );

	/* Create a default plan for a complex FFT. */
	err = clfftCreateDefaultPlan(&planHandle, ctx, dim, clLengths);

	/* Set plan parameters. */
	err = clfftSetPlanPrecision(planHandle, CLFFT_SINGLE);
	err = clfftSetLayout(planHandle, CLFFT_COMPLEX_INTERLEAVED, CLFFT_COMPLEX_INTERLEAVED);
	err = clfftSetResultLocation(planHandle, CLFFT_INPLACE);

    /* Bake the plan. */
	err = clfftBakePlan(planHandle, 1, &queue, NULL, NULL);

	/* Execute the plan. */
	err = clfftEnqueueTransform(planHandle, CLFFT_FORWARD, 1, &queue, 0, NULL, NULL, &bufX, NULL, NULL);

	/* Wait for calculations to be finished. */
	err = clFinish(queue);

	/* Fetch results of calculations. */
	err = clEnqueueReadBuffer( queue, bufX, CL_TRUE, 0, N * 2 * sizeof( *X ), X, 0, NULL, NULL );

    /* Release OpenCL memory objects. */
    clReleaseMemObject( bufX );

	free(X);

	/* Release the plan. */
	err = clfftDestroyPlan( &planHandle );

    /* Release clFFT library. */
    clfftTeardown( );

    /* Release OpenCL working objects. */
    clReleaseCommandQueue( queue );
    clReleaseContext( ctx );

    return ret;
}

Build dependencies

Library for Windows

To develop the clFFT library code on a Windows operating system, install the following packages on your system:

Windows® 7/8.1
Visual Studio 2012 or later
Latest CMake
An OpenCL SDK, such as APP SDK 3.0

Library for Linux

To develop the clFFT library code on a Linux operating system, install the following packages on your system:

GCC 4.6 and onwards
Latest CMake
An OpenCL SDK, such as APP SDK 3.0

Library for Mac OSX

To develop the clFFT library code on Mac OS X, it is recommended to generate Unix makefiles with CMake.

Test infrastructure

To test the developed clFFT library code, install the following packages on your system:

Googletest v1.6
Latest FFTW
Latest Boost

Performance infrastructure

To measure the performance of the clFFT library code, ensure that Python is installed on your system.

clfft's Issues

More flexible handling of GTest in test suite

Right now, clFFT requires libgtest to be detectable via its corresponding CMake find module. The test suite is then compiled and linked with the available pre-compiled version of libgtest.

This approach contradicts the official recommendation in [1], which advises building and using GTest on a per-project basis. As a consequence, Debian and other downstream distributions only provide the source for GTest instead of a pre-compiled binary, in accordance with what upstream recommends.

It would be nice if the build system could be somewhat adapted to use this workflow by:

enabling the location of the GTest source directory to be passed as an argument to the build system,
performing the build of libgtest with the suitable build flags for the project,
linking the existing test suite with the project-wise build of libgtest.

I believe ArrayFire [2] does that properly, although they use git submodules to provide the source for GTest.

Many thanks,

[1] https://code.google.com/p/googletest/wiki/V1_7_FAQ#Why_is_it_not_recommended_to_install_a_pre-compiled_copy_of_Goog
[2] https://github.com/arrayfire/arrayfire/blob/devel/test/CMakeLists.txt

Examples for 2D and 3D FFT

Is there any example / can some examples be provided for computing 2D and 3D FFT?

clFFT displays (possibly) unnecessary log messages even in release mode

This is in relation to arrayfire/arrayfire#171

The output of the fft seems to be correct. However the following message is displayed only on AMD hardware.

OPENCL_V< CLFFT_INVALID_PLAN > (3083): fftRepo.getPlan failed

This is for a real to complex transform. Input of size 8 elements.

OSX clFFT on CPU with max workgroup size of [1,1,1]

I get this from attempting to enqueue the kernel on the CPU:
"[CL_INVALID_WORK_GROUP_SIZE] : OpenCL Error : clEnqueueNDRangeKernel failed: total work group size (243) is greater than the device can support (1)"

http://stackoverflow.com/questions/10065681/opencl-kernel-work-group-size-restriction
"Apple OpenCL doesn't support work-groups larger than [1, 1, 1] on the CPU. I have no idea why, but that's how it's been at least up to OSX 10.9.2. Larger work-groups are fine on the GPU, though."

Any way to make this work, although painfully slow, for CPU based debugging?

Add Radix-N support

Our kernel generator can only handle FFT vector sizes that have 2, 3, and 5 as their only prime factors. This enhancement is to add support for generating kernels that can handle arbitrary vector sizes that do not factor into our supported primes.

Installation suffixes problematic on Linux

Depending on the build architecture, you append either 32 or 64 to the installation dirs (for example bin32 or lib64). The FHS does not specify such suffixes for binary directories, hence your binaries won't be found on most distributions because something like /usr/bin32 is not in $PATH.

Moreover, appending a 64 to the lib directory should be left for the distributions to decide and should be made configurable and/or overridable. See for example this approach posted on the CMake mailing lists.

Latex formulae in documentation not rendering correctly

I see that the documentation generated by doxygen (version 1.8.6 on Ubuntu 14.04) using /doc/clFFT.doxy does not render the math formulae correctly. I managed to get this right by changing

USE_MATHJAX            = NO

in line 1412 of clFFT.doxy.
It would be great if you could update the http://clmathlibraries.github.io/clFFT/ pages so that the formulae are displayed properly. Many thanks in advance.

Can't measure the performance of 2D or 3D FFT

When using those scripts in the perf directory to measure 1D FFT, everything is OK.

But when it comes to 2D or 3D, it does not work.

When I run measurePerformance.py -x 192 -y 128 -b max, a window pops up saying that Client.exe has stopped working,

and the CMD output is:

xxx\staging\Debug\measurePerformance.py -x 192 -y 128 -b max
=========================MEASURE PERFORMANCE START===========================
Process id of Measure Performance:36796
Executing measure performance for label: None
Executing for label: None
table header---->lengthx,lengthy,lengthz,batch,device,inlay,outlay,place,precision,label,GFLOPS
Total combinations =  1

preparing command: 1
Executing Command: ['Client.exe', '--gpu', '-x', '192', '-y', '128', '-z', '1', '--batchSize', '42', '--inLayout', '1', '--outLayout', '1', '', '', '-p', '10']
stdout:


========================StdDev ( 2 )========================


stderr:

Execution Successfull---------------

ERROR: Exception occurs in GFLOP parsing
=========================MEASURE PERFORMANCE ENDS===========================

But when I run Client.exe -x 192 -y 128, everything is OK,

xxx\staging\Debug\Client.exe -x 192 -y 128


        Internal Client Test *****PASS*****

My GPU is an NVIDIA GeForce 605. It is not a powerful GPU, but since the Client.exe 2D FFT passed, measurePerformance.py should work as well.

OpenCL platform [ 0 ]:
    CL_PLATFORM_PROFILE:     FULL_PROFILE
    CL_PLATFORM_VERSION:     OpenCL 1.1 CUDA 4.2.1
    CL_PLATFORM_NAME:        NVIDIA CUDA
    CL_PLATFORM_VENDOR:      NVIDIA Corporation
    CL_PLATFORM_EXTENSIONS:  cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll 

OpenCL devices [ 0 ]:
    CL_DEVICE_NAME:                      GeForce 605
    CL_DEVICE_VERSION:                   OpenCL 1.1 CUDA
    CL_DRIVER_VERSION:                   327.23
    CL_DEVICE_TYPE:                      GPU
    CL_DEVICE_MAX_CLOCK_FREQUENCY:       1046
    CL_DEVICE_ADDRESS_BITS:              32
    CL_DEVICE_AVAILABLE:                 TRUE
    CL_DEVICE_COMPILER_AVAILABLE:        TRUE
    CL_DEVICE_OPENCL_C_VERSION:          OpenCL C 1.1 
    CL_DEVICE_MAX_WORK_GROUP_SIZE:       1024
    CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:  3
                         Dimension[ 0 ]  1024
                         Dimension[ 1 ]  1024
                         Dimension[ 2 ]  64
    CL_DEVICE_HOST_UNIFIED_MEMORY:       FALSE
    CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:  65536 ( 64 KB )
    CL_DEVICE_LOCAL_MEM_SIZE:            49152 ( 48 KB )
    CL_DEVICE_GLOBAL_MEM_SIZE:           536870912 ( 512 MB )
    CL_DEVICE_MAX_MEM_ALLOC_SIZE:        134217728 ( 128 MB )
    CL_DEVICE_EXTENSIONS:                cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll  cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64

Fails to build: missing boost libraries

Hi!
While clBLAS builds on Debian Jessie (testing), clFFT doesn't. Commenting out the following lines in /src/CMakeLists.txt fixes this. What were they used for previously (clBLAS does not have them)?

# Default Boost_NO_SYSTEM_PATHS to TRUE if the user does not specify themselves
if( NOT DEFINED Boost_NO_SYSTEM_PATHS AND UNIX )
    set( Boost_NO_SYSTEM_PATHS ON )
endif( )

Moreover I had to add -pthread here:

    if( BUILD64 )
-       set( CMAKE_CXX_FLAGS "-m64 ${CMAKE_CXX_FLAGS}" )
+       set( CMAKE_CXX_FLAGS "-m64 -pthread ${CMAKE_CXX_FLAGS}" )
        set( CMAKE_C_FLAGS "-m64 -pthread ${CMAKE_C_FLAGS}" )
    else( )

Enhance Twiddle factors constant table.

Please adopt the twiddle-factor code from FFTW3.

clFFT-2.0/src/library/generator.stockham.cpp:
#define K2PI 6.2831853071795864769252867665590057683943388
#define by2pi(m, n) ((K2PI * (m)) / (n))
/*
 * Improve accuracy by reducing x to range [0..1/8]
 * before multiplication by 2 * PI.
 */
static void real_cexp(int m, int n, double * si, double * co)
{
     double theta, c, s, t;
     unsigned octant = 0;
     int quarter_n = n;
     n += n; n += n;
     m += m; m += m;
     if (m < 0) m += n;
     if (m > n - m) { m = n - m; octant |= 4; }
     if (m - quarter_n > 0) { m = m - quarter_n; octant |= 2; }
     if (m > quarter_n - m) { m = quarter_n - m; octant |= 1; }
     theta = by2pi(m, n);
     c = cos(theta); s = sin(theta);
     if (octant & 1) { t = c; c = s; s = t; }
     if (octant & 2) { t = c; c = -s; s = t; }
     if (octant & 4) { s = -s; }
     *co = c;
     *si = s;
}
....
// Twiddle factors
for(size_t k=0; k<(L/radix); k++)
{
        double theta = TWO_PI * ((double)k)/((double)L);
        for(size_t j=1; j<radix; j++)
        {
                //double c = cos(((double)j) * theta);
                //double s = sin(((double)j) * theta);

                double c,s;
                real_cexp(k*j,L,&s,&c);
                s = -s;
                wc[nt]   = c;
                ws[nt++] = s;
        }
}

Unable to build under Debian7 ...

Dear developers,

I am unable to build clFFT under Linux. I am sorry, I have limited knowledge of cmake.
jerome@patagonia:~/workspace/clFFT/build$ BOOST_ROOT=/usr GTEST_LIBRARY=/tmp/gtest-1.7.0~svn20130629/obj-x86_64-linux-gnu cmake ../src
-- UNICODE feature disabled on linux
-- 64bit build - FIND_LIBRARY_USE_LIB64_PATHS TRUE
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:566 ] _boost_TEST_VERSIONS = 1.46.1;1.46;1.44.0;1.44;1.56.0;1.56;1.55.0;1.55;1.54.0;1.54;1.53.0;1.53;1.52.0;1.52;1.51.0;1.51;1.50.0;1.50;1.49.0;1.49;1.48.0;1.48;1.47.0;1.47;1.46.1;1.46.0;1.46;1.45.0;1.45;1.44.0;1.44;1.43.0;1.43;1.42.0;1.42;1.41.0;1.41;1.40.0;1.40;1.39.0;1.39;1.38.0;1.38;1.37.0;1.37;1.36.1;1.36.0;1.36;1.35.1;1.35.0;1.35;1.34.1;1.34.0;1.34;1.33.1;1.33.0;1.33
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:568 ] Boost_USE_MULTITHREADED = ON
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:570 ] Boost_USE_STATIC_LIBS = ON
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:572 ] Boost_USE_STATIC_RUNTIME =
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:574 ] Boost_ADDITIONAL_VERSIONS = 1.46.1;1.46;1.44.0;1.44
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:576 ] Boost_NO_SYSTEM_PATHS = ON
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:644 ] Declared as CMake or Environmental Variables:
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:646 ] BOOST_ROOT = /usr
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:648 ] BOOST_INCLUDEDIR =
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:650 ] BOOST_LIBRARYDIR =
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:652 ] _boost_TEST_VERSIONS = 1.46.1;1.46;1.44.0;1.44;1.56.0;1.56;1.55.0;1.55;1.54.0;1.54;1.53.0;1.53;1.52.0;1.52;1.51.0;1.51;1.50.0;1.50;1.49.0;1.49;1.48.0;1.48;1.47.0;1.47;1.46.1;1.46.0;1.46;1.45.0;1.45;1.44.0;1.44;1.43.0;1.43;1.42.0;1.42;1.41.0;1.41;1.40.0;1.40;1.39.0;1.39;1.38.0;1.38;1.37.0;1.37;1.36.1;1.36.0;1.36;1.35.1;1.35.0;1.35;1.34.1;1.34.0;1.34;1.33.1;1.33.0;1.33
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:734 ] location of version.hpp: /usr/include/boost/version.hpp
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:753 ] version.hpp reveals boost 1.49.0
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:785 ] guessed _boost_COMPILER = -gcc47
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:795 ] _boost_MULTITHREADED = -mt
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:838 ] _boost_RELEASE_ABI_TAG = -
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:840 ] _boost_DEBUG_ABI_TAG = -d
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:883 ] _boost_LIBRARY_SEARCH_DIRS = /usr/lib;/usr/stage/lib;/usr/include/lib;/usr/include/../lib;/usr/include/stage/lib
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:961 ] Searching for PROGRAM_OPTIONS_LIBRARY_RELEASE: boost_program_options-gcc47-mt-1_49;boost_program_options-gcc47-mt;boost_program_options-mt-1_49;boost_program_options-mt;boost_program_options
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:993 ] Searching for PROGRAM_OPTIONS_LIBRARY_DEBUG: boost_program_options-gcc47-mt-d-1_49;boost_program_options-gcc47-mt-d;boost_program_options-mt-d-1_49;boost_program_options-mt-d;boost_program_options-mt;boost_program_options
-- [ /usr/share/cmake-2.8/Modules/FindBoost.cmake:1107 ] Boost_FOUND = TRUE
-- Boost version: 1.49.0
-- Found the following Boost libraries:
-- program_options
-- Boost_PROGRAM_OPTIONS_LIBRARY: /usr/lib64/libboost_program_options-mt.a
-- Could NOT find GTest (missing: GTEST_LIBRARY GTEST_MAIN_LIBRARY)
GoogleTest unit testing will NOT be built
-- Detected GNU fortran compiler.
-- CMAKE_CXX_COMPILER flags: -m64
-- CMAKE_CXX_COMPILER debug flags: -g
-- CMAKE_CXX_COMPILER release flags: -O3 -DNDEBUG
-- CMAKE_CXX_COMPILER relwithdebinfo flags: -O2 -g
-- CMAKE_EXE_LINKER link flags:
GoogleTest unit tests will NOT be built
-- Configuring done
-- Generating done
-- Build files have been written to: /home/jerome/workspace/clFFT/build
jerome@patagonia:~/workspace/clFFT/build$ make
Scanning dependencies of target clFFT
[ 6%] Building CXX object library/CMakeFiles/clFFT.dir/transform.cpp.o
[ 12%] Building CXX object library/CMakeFiles/clFFT.dir/accessors.cpp.o
[ 18%] Building CXX object library/CMakeFiles/clFFT.dir/plan.cpp.o
[ 25%] Building CXX object library/CMakeFiles/clFFT.dir/repo.cpp.o
[ 31%] Building CXX object library/CMakeFiles/clFFT.dir/generator.stockham.cpp.o
[ 37%] Building CXX object library/CMakeFiles/clFFT.dir/generator.transpose.cpp.o
[ 43%] Building CXX object library/CMakeFiles/clFFT.dir/generator.copy.cpp.o
[ 50%] Building CXX object library/CMakeFiles/clFFT.dir/lifetime.cpp.o
[ 56%] Building CXX object library/CMakeFiles/clFFT.dir/stdafx.cpp.o
Linking CXX shared library libclFFT.so
[ 56%] Built target clFFT
Scanning dependencies of target Client
[ 62%] Building CXX object client/CMakeFiles/Client.dir/client.cpp.o
[ 68%] Building CXX object client/CMakeFiles/Client.dir/openCL.misc.cpp.o
[ 75%] Building CXX object client/CMakeFiles/Client.dir/stdafx.cpp.o
Linking CXX executable ../staging/Client
../library/libclFFT.so.2.1.0: undefined reference to `pthread_mutexattr_settype'
../library/libclFFT.so.2.1.0: undefined reference to `pthread_mutexattr_destroy'
../library/libclFFT.so.2.1.0: undefined reference to `pthread_mutexattr_init'
collect2: error: ld returned 1 exit status
make[2]: *** [staging/Client-2.1.0] Erreur 1
make[1]: *** [client/CMakeFiles/Client.dir/all] Erreur 2
make: *** [all] Erreur 2

Error -45 after successful init

I am writing a very simple clFFT example using batching and multiple GPUs. I am subdividing my input manually (because internal multi-GPU support is not implemented yet). I am creating a plan for every device separately (as they use different queues, naturally), and all of the plans are baked before I try to execute them. Executing them results in error -45 (CL_INVALID_PROGRAM_EXECUTABLE).

Files:

CMake: http://pastebin.com/XrAkSM9x
Header: http://pastebin.com/Bx17Zdwg
Source: http://pastebin.com/Dy5YLuur

Layout:

CMakeLists.txt
src/Source.cpp
inc/Header.hpp
cmake/Modules/FindclFFT.cmake

Client does not detect AMD APU GPUs as OpenCL devices

The "client" test/profile application does not detect the GPU component of AMD APUs as OpenCL devices. For example, when run on Windows 7 on an A10-6800K, only the CPU is detected as an OCL device when using the "-i" command to show CL platform information.

5x slowdown on AMD W8100 when using ECC

clFFT runs 5x slower in my application when using ECC. I would normally expect 10-15% (definitely not more than 50%).

Are there any existing benchmarks in the clFFT community showing ECC performance?

OSX clFFT AMD FFT results in blank FFT

Hi, firstly - many thanks for maintaining and enhancing this library.

I've got an FFT-based image alignment working on the GPU for astronomy. However, I've noticed that for specific sizes of 2D FFT (sizes that are allowed according to the dimension checks in the plan source), everything appears to work except that no FFT output is produced.

The 2D 3645x450 FFT bakes and enqueues, but it neither returns an error nor logs one to the Console log.

I'm using OSX 10.8.5 on a MBP - using the AMD Radeon HD 6750M 1024 MB GPU and will shortly be upgrading the OSX (which I know has some OpenCL "improvements").

I know that nVidia cards had some issues with texture sizes but I wasn't aware of AMD ones.

The same output debug printing route works for the other FFT plan in use - both inputs and FFTs produced result in FFT images. Input printed immediately before the FFT enqueuing shows data input but a blank output (after executing clFinish()).

Any ideas? It sounds like either the code or driver is failing quietly. Looking at the clFFT sources I think it's probably more a driver issue.

Rename client executable to clFFT-client instead of Client

A generic name like Client is very likely to shadow some other system utility. I suggest using the same naming convention as clBLAS, i.e. client -> clFFT-client.

Here is the patch I am currently applying downstream to the Debian packaging:

Description: fix client executable name
Author: Ghislain Antony Vaillant <[email protected]>
Forwarded: not-needed

---
This patch header follows DEP-3: http://dep.debian.net/deps/dep3/
--- a/src/client/CMakeLists.txt
+++ b/src/client/CMakeLists.txt
@@ -47,8 +47,8 @@

 target_link_libraries( Client clFFT ${Boost_LIBRARIES} ${OPENCL_LIBRARIES} ${DL_LIB} )

-set_target_properties( Client PROPERTIES VERSION ${CLFFT_VERSION} )
-set_target_properties( Client PROPERTIES RUNTIME_OUTPUT_DIRECTORY "${PROJECT_BINARY_DIR}/staging" )
+#set_target_properties( Client PROPERTIES VERSION ${CLFFT_VERSION} )
+set_target_properties( Client PROPERTIES RUNTIME_OUTPUT_DIRECTORY "${PROJECT_BINARY_DIR}/staging" OUTPUT_NAME clFFT-client)

 # CPack configuration; include the executable into the package
 install( TARGETS Client

This patch also removes the versioning of the binary, which looks unnecessary to me. Besides, it is not applied to the clBLAS client, so why apply it in clFFT? Feel free to re-use part or all of the patch; all the additional Debian-specific code is licensed under the same terms as clFFT in order to ease collaboration.

Cheers,
Ghislain

Error on Xeon Phi

I'm running clFFT on a Xeon Phi and I'm getting errors with certain problem sizes. When I perform 1D complex-to-complex FFTs with 4, 8, 16, 32, 64, or 128 points I get the following error:

Stack dump:
0. Running pass 'Intel OpenCL Vectorizer' on module 'Program'.

Running pass 'Intel OpenCL VectorizerCore' on function '@_Vectorized.fft_fwd'
Running pass 'PacketizeFunction' on function '@_Vectorized.fft_fwd'

Running with 2, 256, 512, or 1024 points does not produce this error. I get similar errors for 2D transforms, but I haven't tested the problem sizes as exhaustively.

The relevant output of lspci is
Intel Corporation Xeon Phi coprocessor 5100 series

Thank you kindly,

~Malcolm

Possible NVidia Tesla C1060 Bug

1D FFTs computed under windows 7 on an NVidia C1060 return noisy results. As a sanity check, I've tested the code from http://dournac.org/info/fft_gpu which also yields erroneous results.
This is with latest NVidia drivers and CUDA/OpenCL SDK installed.
Will check on other hardware when I get a chance.

Is this an issue with running on NVidia hardware, or something else?

clFFT examples not building when using the static library

This affects Windows and Linux. The problem does not seem to appear on OSX.

The link error for windows can be seen here:

https://gist.github.com/shehzan10/40126b46e111c4e2db62

The link error for Linux can be seen here:

[ 67%] /usr/bin/ld: ../library/libclFFT.a(lifetime.cpp.o): undefined reference to symbol 'dlclose@@GLIBC_2.2.5'
/usr/lib/libdl.so.2: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
examples/CMakeFiles/example_examples_fft3d.dir/build.make:87: recipe for target 'examples/examples/fft3d' failed
make[2]: *** [examples/examples/fft3d] Error 1
CMakeFiles/Makefile2:400: recipe for target 'examples/CMakeFiles/example_examples_fft3d.dir/all' failed
make[1]: *** [examples/CMakeFiles/example_examples_fft3d.dir/all] Error 2

This needs to be fixed before arrayfire/arrayfire#454 can make it into our repo.

Radix-2 Problem on Windows Intel HD 4400

Hi,

I ran into a problem with clFFT today: I'm doing 1D power-of-2 complex to complex transforms on Windows using an Intel HD 4400 in a Surface Pro 3 i5.

The problem only appears in power-of-2 transforms, whose output seems to be wrong (see images below); all other supported FFT sizes seem fine. I was not able to reproduce the problem on any other platform (CUDA device on a MacBook using OS X and Linux, Intel HD device on a MacBook using OS X).

I have a reproducer in python (using gpyfft) but I see the exact same behaviour when using clFFT from a C++ application so I rule out gpyfft or pyopencl as a source of the problem. I suspect clFFT and its interaction with the OpenCL driver for Intel HD on Windows. Both the master and the develop branch of clFFT have this problem.

Below you see images generated with the same code, only the used device (CPU/GPU) is changed. If I compare the generated kernel files for both CPU and GPU, they are identical.

Update 4

It looks like Intel has confirmed this and is looking into it: https://software.intel.com/en-us/forums/topic/559915#comment-1828156 It is probably not a clFFT problem though. I'll update this issue with information from Intel as this might potentially affect other clFFT users.

Update 3

I homed in on the problem and could reduce the FFT kernel to a write-read on local memory problem. The gist that contains the reproducer can be found here https://gist.github.com/sschaetz/f37e15ec2f059e13777b I also submitted a request for help with Intel here: https://software.intel.com/en-us/forums/topic/559915

Update 2

It looks like if the access pattern to local memory is done the way it is done for powers of 2 FFTs, the memory barrier between the access is ignored. If I mix up the memory access pattern (for example by adding +1 everywhere) it works.

Update

I could gather more information:
If I run the same transform multiple times, the output differs each time on this particular GPU. The Surface 3 (not Pro) GPU does not have this problem. Transforms up to and including 32 do not have this problem. The problem starts to appear with a transform size of 64. I started debugging the OpenCL code. The major difference between 32 samples and 64 samples seems to be FwdRad8B1. I ran this code in a stand-alone fashion and it seems to run correctly on both CPU and GPU. The major differences between the two devices that I can see are:
CPU: max_work_group_size 8192, max_work_item_dimensions 8192 8192 8192
GPU: max_work_group_size 512, max_work_item_dimensions 512 512 512

Has anybody encountered this problem? Any suggestions how this problem could be tracked down? Here is the reproducer and the output:

# coding: utf-8

# In[1]:

get_ipython().magic(u'matplotlib inline')
get_ipython().magic(u"config InlineBackend.figure_format = 'retina'")
import numpy as np
import matplotlib.pyplot as plt
import os
import pyopencl as cl
import pyopencl.array
import pyopencl.array as cla
import scipy.io
import gpyfft
import matplotlib.pylab as pylab
import time

G = gpyfft.GpyFFT(debug=True)

print "clAmdFft Version: %d.%d.%d"%(G.get_version())


# In[11]:

# Dimensions.
d1 = 1024
d2 = 1024

# Context, queue, FFT plan.
## Step #1. Obtain an OpenCL platform.
platform = cl.get_platforms()[0]

## It would be necessary to add some code to check for support of
## the necessary platform extensions with platform.extensions

## Step #2. Obtain a device id for at least one device (accelerator).
device = platform.get_devices()[1]
print device

## Step #3. Create a context for the selected device.
ctx = cl.Context([device])

queue = cl.CommandQueue(ctx, 
                properties=cl.command_queue_properties.PROFILING_ENABLE)

# data
bounds = (d1, d2)
size = d1 * d2

inputData = (np.random.randn(size) + 1.j*np.random.randn(size)).astype(np.complex64).reshape(d1,d2)
outputData = (10000 * np.ones(size) + 1.j*10000 * np.ones(size)).astype(np.complex64).reshape(d1,d2)

# clFFT plan
plan = G.create_plan(ctx, (d1, ))
plan.batch_size = d2
plan.strides_in = (1024, )
plan.strides_out = (1024, )
plan.distances = (1, 1)
plan.inplace = False
plan.bake(queue)

mf = cl.mem_flags
inputDataBuffer = cl.Buffer(ctx, mf.READ_WRITE, inputData.nbytes)
outputDataBuffer = cl.Buffer(ctx, mf.READ_WRITE, outputData.nbytes)
cl.enqueue_copy(queue, inputDataBuffer, inputData)
cl.enqueue_copy(queue, outputDataBuffer, outputData)
plan.enqueue_transform((queue,), (inputDataBuffer,), (outputDataBuffer,))
cl.enqueue_copy(queue, inputData, inputDataBuffer)
cl.enqueue_copy(queue, outputData, outputDataBuffer)
queue.finish()

fig, ax = plt.subplots(figsize=(10,10))

cax = ax.imshow(np.transpose(np.abs(outputData)), cmap='gray')
cbar = fig.colorbar(cax)
fig.show()

The output of this code, if I select the Intel HD 4400 can be seen in this image:

If I select the CPU as a device, the output is correct:

Support on NVIDIA GPUs

Hi,
This is more of a question than an issue. Does clFFT run on NVIDIA GPUs? Does clBLAS?

Thank you!

Kernel code compilation error

clfftBakePlan() gives me this strange error: http://pastebin.com/TLYcDm6e.

It happens on Ubuntu (not on Windows) with releases 2.2 and 2.4.

I was able to fix this problem by exporting a new value for the environment variable LC_NUMERIC ("en_US.UTF-8" instead of the previous "cs_CZ.UTF-8").

My guess is that while the implementation internally generates some code it uses formatting of decimal numbers according to the localization set for the environment, hence the Czech comma instead of the English period.
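The suspected failure mode can be illustrated outside of clFFT. The following is only a minimal sketch, not the library's generator code: `format_float_literal` and `locale_aware` are hypothetical helpers, and the `cs_CZ.UTF-8` demonstration only runs if that locale happens to be installed. It contrasts locale-aware number formatting with a form that never consults LC_NUMERIC.

```python
import locale


def format_float_literal(value):
    """Return a C-style float literal that never depends on LC_NUMERIC.

    Hypothetical helper for illustration; not part of clFFT.
    """
    # repr() of a Python float always uses '.' as the decimal separator.
    return repr(float(value)) + "f"


def locale_aware(value):
    """Format a number using the decimal separator of the active locale."""
    return locale.format_string("%.3f", value)


if __name__ == "__main__":
    previous = locale.setlocale(locale.LC_NUMERIC)  # remember current setting
    try:
        locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
        # Under a Czech locale the separator is a comma, which a C-like
        # compiler would not accept inside a numeric literal.
        print("locale-aware:      ", locale_aware(0.5))
    except locale.Error:
        print("cs_CZ.UTF-8 is not installed; skipping the locale demo")
    finally:
        locale.setlocale(locale.LC_NUMERIC, previous)
    print("locale-independent:", format_float_literal(0.5))
```

If the guess is right, the equivalent fix in a C++ code generator would be to format numbers with the classic "C" locale rather than the process locale.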

Failure selecting second OpenCL device when different than first device

If there are multiple GPUs present in a system, and the user selects the second device to run the FFT computation on, they will receive a CLFFT_INVALID_PROGRAM_EXECUTABLE error when that device is different from the first device.

This is a failure in the library as we attempt to use the kernel compiled for the first device as the kernel for the second. If the devices are the same, this works OK but returns an error code when different.
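One way to picture the fix is a kernel cache that is keyed by device as well as by kernel parameters. The sketch below is an assumption-laden illustration, not clFFT's actual cache: `KernelCache`, the `compile_fn` callback, and the string device names are all hypothetical stand-ins for the real OpenCL objects.

```python
class KernelCache:
    """Cache compiled programs per (device, kernel parameters) pair.

    Hypothetical illustration; compile_fn stands in for the real
    OpenCL program build step.
    """

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._programs = {}

    def get(self, device, params):
        # Including the device in the key prevents a binary built for one
        # device from being handed to a different device.
        key = (device, params)
        if key not in self._programs:
            self._programs[key] = self._compile_fn(device, params)
        return self._programs[key]


if __name__ == "__main__":
    cache = KernelCache(lambda device, params: "binary for %s" % device)
    print(cache.get("gpu0", ("fft", 64)))
    print(cache.get("gpu1", ("fft", 64)))
```

With such a key, two identical devices still get separate entries, which costs an extra compile but never returns an executable built for the wrong device.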

Limitation about the plan size

Hi,

I am also testing clFFT as a competitor to cuFFT, mainly because most of my code base is in OpenCL. I need to do large 3D FFTs, typically 512^3, and larger once more memory is available.
The "hard-coded" limit of 2^24 is documented for clFFT; cuFFT limits the plan to 2^27. I wonder where this limit comes from and how hard it is to overcome?
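For scale, the arithmetic behind these numbers can be checked quickly. The sketch below reads the two limits as 2^24 and 2^27 and assumes they apply to the total element count of a plan, which is an assumption made here rather than something the post confirms. `describe` is a hypothetical helper, and 8 bytes per element corresponds to single-precision complex data.

```python
from math import log2

LIMIT_CLFFT = 2 ** 24  # limit quoted for clFFT in the post
LIMIT_CUFFT = 2 ** 27  # limit quoted for cuFFT in the post


def describe(shape, bytes_per_element=8):
    """Return element count, its log2, and the buffer size for a transform."""
    elements = 1
    for extent in shape:
        elements *= extent
    return {
        "elements": elements,
        "log2": log2(elements),
        "bytes": elements * bytes_per_element,
    }


if __name__ == "__main__":
    info = describe((512, 512, 512))
    print(info)
    print("times over the clFFT limit:", info["elements"] // LIMIT_CLFFT)
    print("fits the cuFFT limit:", info["elements"] <= LIMIT_CUFFT)
```

Under those assumptions a 512^3 transform has exactly 2^27 elements, which needs 1 GiB per single-precision complex buffer.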

I made a Debian package; would you be interested in seeing it integrated into the distribution?

Cheers,

2D FFTs fail on NVIDIA GPUs with width >= 1024.

fft2 fails with OpenCL error -36 on NVIDIA GPUs.
I have narrowed the problem down to the auto-generated transpose kernel. I am still investigating whether the problem is with the kernel or with the NVIDIA runtime.

I have included the relevant code over here. It includes the following:

fft.cpp: The stand alone file that reproduces the error using clFFT
trans.cpp, trans.cl: The stand alone files that reproduce the error using auto-generated kernels.

Contiguous pair

Hi,
The manual says: "In an interleaved format, where the real and imaginary components are stored as contiguous pairs"

If I allocate a buffer of cl_float2, how will the data be stored for the forward transform in the interleaved format? Will it be something like the following:

__kernel void example(__global float2 *data)
{
    // ... (computation of index, real_part and imaginary_part elided)
    data[index].x = real_part;       // real component
    data[index].y = imaginary_part;  // imaginary component
}

An example would be very helpful.
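As a host-side illustration of the two layouts, the sketch below packs complex values into raw float32 bytes. It assumes a little-endian host and single precision; `pack_interleaved`, `pack_planar`, and `unpack_interleaved` are hypothetical helper names. A cl_float2 element holds two consecutive 32-bit floats, with .x first and .y second, so an interleaved buffer is simply re0, im0, re1, im1, and so on.

```python
import struct


def pack_interleaved(values):
    """Pack complex numbers as float32 pairs: re0, im0, re1, im1, ..."""
    flat = []
    for z in values:
        flat.extend((z.real, z.imag))
    return struct.pack("<%df" % len(flat), *flat)


def pack_planar(values):
    """Pack complex numbers as two separate float32 arrays."""
    count = len(values)
    reals = struct.pack("<%df" % count, *(z.real for z in values))
    imags = struct.pack("<%df" % count, *(z.imag for z in values))
    return reals, imags


def unpack_interleaved(buffer):
    """Inverse of pack_interleaved: rebuild the list of complex numbers."""
    flat = struct.unpack("<%df" % (len(buffer) // 4), buffer)
    return [complex(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


if __name__ == "__main__":
    samples = [1 + 2j, 3 - 4j]
    print(struct.unpack("<4f", pack_interleaved(samples)))
    print(unpack_interleaved(pack_interleaved(samples)))
```

The interleaved bytes could be uploaded as-is to a buffer of cl_float2, while the planar form needs two buffers, one per component.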

Thanks

Bake plan hangs when baking plans of certain transform sizes on Intel platform 4.5.0.8

Testing different transform sizes on Intel platform 4.5.0.8 I found that baking the plan hangs for certain transform sizes. Here are some examples for sizes working and not working:

Working:
x = 2, y = 2
x = 2, y = 3
x = 2, y = 5
x = 3, y = 2
x = 3, y = 3
x = 3, y = 5
x = 5, y = 2
x = 5, y = 3
x = 5, y = 5
x = 6, y = 6
x = 10, y = 10

Not working:
x = 2, y = 4
x = 3, y = 4
x = 4, y = 2
x = 4, y = 3
x = 4, y = 4
x = 4, y = 5
x = 5, y = 4
x = 8, y = 8
x = 9, y = 9
x = 12, y = 12
x = 15, y = 15
x = 16, y = 16

Can anyone confirm this issue?

clFFT vs CUFFT

I'm comparing CUFFT on GeForce Titan and clFFT on W9000 (and GeForce Titan). The tests run 500ms each. I'm not benchmarking the first run of each FFT call.

512x512 complex to complex in place 1 batch

Titan + clFFT min 246.000000 max 3132.000000 mean 284.540125 stdev 90.330571 runs 1757
Titan + CUFFT min 75.000000 max 447.000000 mean 84.112308 stdev 13.193472 runs 5939

W9000 + clFFT min 263.000000 max 609.000000 mean 281.706479 stdev 21.922476 runs 1775

Is this performance of clFFT expected/known or am I doing something wrong?
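To put the figures side by side, the sketch below computes the ratio of the mean times and checks them against the stated 500 ms test duration. The post gives no units; microseconds are an inference made here, since mean times runs comes out close to 500,000 for every row. `RESULTS`, `total_time`, and `slowdown` are hypothetical names.

```python
# (mean time per run, number of runs), copied from the figures above.
RESULTS = {
    "Titan + clFFT": (284.540125, 1757),
    "Titan + CUFFT": (84.112308, 5939),
    "W9000 + clFFT": (281.706479, 1775),
}


def total_time(label):
    """Mean time multiplied by the number of runs for one configuration."""
    mean, runs = RESULTS[label]
    return mean * runs


def slowdown(label, baseline):
    """Ratio of mean run times between two configurations."""
    return RESULTS[label][0] / RESULTS[baseline][0]


if __name__ == "__main__":
    for label in RESULTS:
        print("%-14s total %.0f" % (label, total_time(label)))
    ratio = slowdown("Titan + clFFT", "Titan + CUFFT")
    print("clFFT / CUFFT on Titan: %.2fx" % ratio)
```

On the Titan this works out to clFFT taking roughly 3.4 times as long per transform as CUFFT.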

Can't install both clFFT and clBLAS at the same time, version.h is overwritten

double precision crash on Windows

I'm not sure how to describe this, but a program I compiled that uses double precision crashes to the desktop, while setting CLFFT_SINGLE works (but of course gives wrong answers).
When I removed lines 122 to 132 in accessors.cpp, it worked fine with correct answers, but I have no idea why it would crash. The same code works perfectly on Linux.
This is on a Radeon HD 7770.
Thanks

Preserved deprecated interface breaks client code using dlopen

It's appreciated that you've preserved the old clAmdBlas interface, but since it is implemented with simple defines, it unintentionally breaks old client code, e.g. OpenCV. Even recompiling doesn't help, since OpenCV dynamically opens the clAmdBlas (clBLAS) library and looks up the old symbols. Defines avoid any call overhead, but for old code it is more important simply to keep working, even at the cost of an additional wrapper function call.

Enhance the BakePlan function for broader performance profiling

The current implementation of BakePlan hard-codes a table of known good prime factors of supported vector sizes. This gives a consistent performance profile that is known to work well on specific sizes. However, BakePlan could be extended to decompose vector sizes into multiple factors or different permutations, and to time them empirically to determine the best plan for the user's hardware. This would increase the cost of BakePlan considerably, but BakePlan is already documented as potentially slow. This feature is considerably more useful with persistent plans.
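The search space such an extension would have to explore can be sketched as follows. This is not how BakePlan works today; `radix_sequences`, `best_plan`, the radix set, and the `cost_fn` callback (which would time a candidate on real hardware) are all hypothetical.

```python
from math import prod


def radix_sequences(n, radices=(2, 3, 4, 5, 8)):
    """Yield every ordered sequence of radices whose product is n."""
    if n == 1:
        yield ()
        return
    for radix in radices:
        if n % radix == 0:
            for rest in radix_sequences(n // radix, radices):
                yield (radix,) + rest


def best_plan(n, cost_fn, radices=(2, 3, 4, 5, 8)):
    """Return the candidate with the lowest measured cost, or None."""
    return min(radix_sequences(n, radices), key=cost_fn, default=None)


if __name__ == "__main__":
    candidates = list(radix_sequences(64, (2, 4, 8)))
    print(len(candidates), "candidate decompositions of 64")
    assert all(prod(seq) == 64 for seq in candidates)
    # Stand-in cost: fewer passes is better. A real version would time kernels.
    print(best_plan(64, len, (2, 4, 8)))
```

Because the number of candidates grows quickly with the transform size, the result would be worth caching between runs.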

Port clMath to Mac OS?

Presumably AccelerEyes will use clMath in ArrayFire for Mac OS, so can we expect that they will share their port of clMath to Mac OS here on GitHub?

OSX FFTBinaryLookup::populateCache() causes crash with Intel CPU context

Hi,
I'm trying to track down a small issue that occurs when using a CPU context.

During the bake, the compilation of the kernel works (build from source and build program both come back with a zero cl_status in FFTAction::compileKernel); however, populating the cache fails here, through the stockroom generator:

cl_int FFTBinaryLookup::populateCache()
{
// FIXME: support MSB
this->m_header.magic_key[0] = 'C';
this->m_header.magic_key[1] = 'L';
this->m_header.magic_key[2] = 'B';
this->m_header.magic_key[3] = '\0';

std::vector<unsigned char*> data;
cl_int err = getSingleBinaryFromProgram(this->m_program, data); // <-- fails with EXC_BAD_ACCESS

It seems that commenting out the populate cache solves the problem.

instructions on compiling?

hi, I'm a bit of a newbie with C++/C so I'm not too sure on how to get this running. I've tried two ways.

first way: downloading the binary, extracting it, and trying to link it. for example, if I have the extracted files in a folder called clFFT, and my file main.c (the example in the readme) is at the top level, then I would assume that I would run something like:

gcc -I./clFFT/include main.c -o main.out -lOpenCL -L./clFFT/lib64 -lclFFT
./main.out

this didn't work. I also tried compiling the git repo code using:

cmake .
cd src/
make
make install

and then running the same commands (with the directories changed a bit), and I get this error:

./main.out: error while loading shared libraries: libclFFT.so.2: cannot open shared object file: No such file or directory

any suggestions would be greatly appreciated!
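The error means the dynamic loader could not find libclFFT.so.2 at run time; the -L option only affects linking. Common remedies are adding the install directory to LD_LIBRARY_PATH, registering it with ldconfig, or linking with an rpath. The sketch below is a rough diagnostic only: `find_shared_library` is a hypothetical helper that checks a colon-separated search path, whereas the real loader also consults its cache and default directories.

```python
import os
import tempfile


def find_shared_library(soname, search_path):
    """Return the first path in search_path that contains soname, or None."""
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, soname)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    hit = find_shared_library("libclFFT.so.2", current)
    print(hit or "libclFFT.so.2 not found via LD_LIBRARY_PATH")

    # Demonstration with a throwaway directory holding a placeholder file.
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, "libclFFT.so.2"), "wb").close()
        print(find_shared_library("libclFFT.so.2", tmp))
```

If the library is only found after adding its directory to the search path, exporting that directory in LD_LIBRARY_PATH before running ./main.out should be enough to get past this error.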

Visual Studio 2012 support

Hi,

I realize that VS 2012 is currently not one of the supported compilers. What would it take to support this compiler?

Cheers,
Dominic

OSX support?

Has anyone had much luck getting this to work on OS X and the Intel CPUs/integrated GPUs (or in general on CPUs/integrated GPUs on Linux)?

I kept getting this:

OPENCL_V< CLFFT_INVALID_WORK_GROUP_SIZE > (1223): clEnqueueNDRangeKernel failed

Unexpected output with CLFFT_REAL

Hello,

I am having trouble with CLFFT_REAL transforms, depending on the size of the array. I tried with several versions of OpenCL (cuda-6.5 or cuda-7.0) and with the master and develop branches of clFFT.
I am using Ubuntu 14.04 and g++-4.8.2
I wrote a test program (see below) to highlight what's going wrong and here is the output:

CLFFT with type float and with layouts Real to Complex interleaved succeeded 
CLFFT with type double and with layouts Real to Complex interleaved succeeded 
CLFFT with type float and with layouts Real to Hermitian interleaved succeeded 
CLFFT with type double and with layouts Real to Hermitian interleaved failed.

By the way, changing the size of the array will not give exactly the same results - this is for N = 2.

Thank you.

---main.cpp:

#include <stdlib.h>
#include <iostream>
#include <clFFT.h>
#include <cmath>

const char * layoutToString( clfftLayout_ layout )
{
   switch( layout ) {
   case CLFFT_COMPLEX_INTERLEAVED: return "Complex interleaved";
   case CLFFT_COMPLEX_PLANAR: return "Complex planar";
   case CLFFT_HERMITIAN_INTERLEAVED: return "Hermitian interleaved";
   case CLFFT_HERMITIAN_PLANAR: return "Hermitian planar";
   case CLFFT_REAL: return "Real";
   case ENDLAYOUT: return "Unknown";
   }
   return "Not found";
}

template< class T >
void testCLFFT( clfftLayout_ startLayout, clfftLayout_ intermediaryLayout )
{
   bool success = true;
   const unsigned int N = 2;
   cl_int err;
   cl_platform_id platform = 0;
   cl_device_id device = 0;
   cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
   cl_context _ctx = 0;
   cl_command_queue _queue = 0;
   cl_mem _bufferIn;
   T *X;

   clfftPlanHandle _planHandle;
   clfftDim clfftDim = CLFFT_1D;
   size_t clLengths[1] = {N};

   err = clGetPlatformIDs( 1, &platform, NULL );
   err = clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL );
   props[1] = (cl_context_properties)platform;
   _ctx = clCreateContext( props, 1, &device, NULL, NULL, &err );
   _queue = clCreateCommandQueue( _ctx, device, 0, &err );

   clfftSetupData fftSetup;
   err = clfftInitSetupData(&fftSetup);
   err = clfftSetup(&fftSetup);

// Filling array
   //    unsigned int N1 = N/2 + 1;
   X = (T *) malloc( N * 2 * sizeof(T) );
   for( unsigned int i = 0; i < 2 * N; ++i ) X[ i ] = static_cast< T >( 0 );
   for( unsigned int i = 0; i < N; ++i ) X[ i ] = static_cast< T >( i );

// Forward transform
   err = clfftCreateDefaultPlan(&_planHandle, _ctx, clfftDim, clLengths);
   if( sizeof(T) == 4 ) err = clfftSetPlanPrecision(_planHandle, CLFFT_SINGLE);
   else err = clfftSetPlanPrecision(_planHandle, CLFFT_DOUBLE);
   err = clfftSetLayout(_planHandle, startLayout, intermediaryLayout );
   err = clfftSetResultLocation(_planHandle, CLFFT_INPLACE);
   err = clfftBakePlan(_planHandle, 1, &_queue, NULL, NULL);
   _bufferIn = clCreateBuffer( _ctx, CL_MEM_READ_WRITE, N * 2 * sizeof(T), NULL, &err );
   err = clEnqueueWriteBuffer( _queue, _bufferIn, CL_TRUE, 0, N * 2 * sizeof(T), X, 0, NULL, NULL );
   err = clfftEnqueueTransform(_planHandle, CLFFT_FORWARD, 1, &_queue, 0, NULL, NULL, &_bufferIn, NULL, NULL);
   err = clFinish(_queue);
   err = clEnqueueReadBuffer( _queue, _bufferIn, CL_TRUE, 0, N * 2 * sizeof(T), X, 0, NULL, NULL );

// Backward transform
   err = clfftCreateDefaultPlan(&_planHandle, _ctx, clfftDim, clLengths);
   if( sizeof(T) == 4 ) err = clfftSetPlanPrecision(_planHandle, CLFFT_SINGLE);
   else err = clfftSetPlanPrecision(_planHandle, CLFFT_DOUBLE);
   err = clfftSetLayout(_planHandle, intermediaryLayout, startLayout );
   err = clfftSetResultLocation(_planHandle, CLFFT_INPLACE);
   err = clfftBakePlan(_planHandle, 1, &_queue, NULL, NULL);
   _bufferIn = clCreateBuffer( _ctx, CL_MEM_READ_WRITE, N * 2 * sizeof(T), NULL, &err );
   err = clEnqueueWriteBuffer( _queue, _bufferIn, CL_TRUE, 0, N * 2 * sizeof(T), X, 0, NULL, NULL );
   err = clfftEnqueueTransform(_planHandle, CLFFT_BACKWARD, 1, &_queue, 0, NULL, NULL, &_bufferIn, NULL, NULL);
   err = clFinish(_queue);
   err = clEnqueueReadBuffer( _queue, _bufferIn, CL_TRUE, 0, N * 2 * sizeof(T), X, 0, NULL, NULL );

// Checking we find the initial array values
   for(unsigned int i = 0; i < N; ++i) {
      if( X[i] != i ) {
         success =  false;
         break;
      }
   }

// Releasing memory
   clReleaseMemObject( _bufferIn );
   free(X);
   err = clfftDestroyPlan( &_planHandle );
   clfftTeardown( );
   clReleaseCommandQueue( _queue );
   clReleaseContext( _ctx );

// Displaying FFT success status
   std::cout << "CLFFT with type " << ( sizeof(T) == 4 ? "float" : "double") << " and with layouts " <<
         layoutToString( startLayout ) << " to " << layoutToString( intermediaryLayout ) <<
         ( success ? " succeeded " : " failed " ) << std::endl;
}

int main( void )
{
   testCLFFT< float >( CLFFT_REAL, CLFFT_COMPLEX_INTERLEAVED );
   testCLFFT< double >( CLFFT_REAL, CLFFT_COMPLEX_INTERLEAVED );
   testCLFFT< float >( CLFFT_REAL, CLFFT_HERMITIAN_INTERLEAVED );
   testCLFFT< double >( CLFFT_REAL, CLFFT_HERMITIAN_INTERLEAVED );
   return 0;
}

Compile line:

g++ main.cpp -m64 -pipe -O2 -Wno-unused-but-set-parameter -Wall -std=c++11 -I/usr/local/mkt-dev/install/cuda-7.0/include  -I/usr/local/include  -L/usr/local/lib64 -lclFFT -L/usr/local/mkt-dev/install/cuda-7.0/lib64 -lOpenCL  -o fft1D
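For background on the buffer sizes involved, the sketch below uses a naive DFT, not clFFT, to show why a real input of length N needs only N/2 + 1 complex outputs: the remaining outputs are complex conjugates of the stored ones. For an in-place real transform this means the buffer has to hold 2 * (N/2 + 1) real values. `dft`, `hermitian_length`, and `inplace_real_buffer_length` are hypothetical helper names.

```python
import cmath


def dft(x):
    """Naive O(N^2) forward DFT, for illustration only."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def hermitian_length(n):
    """Number of complex outputs stored for a real input of length n."""
    return n // 2 + 1


def inplace_real_buffer_length(n):
    """Number of real values an in-place real transform buffer must hold."""
    return 2 * hermitian_length(n)


if __name__ == "__main__":
    spectrum = dft([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
    n = len(spectrum)
    # Output n-k is the conjugate of output k, up to rounding error.
    for k in range(1, n // 2):
        assert abs(spectrum[n - k] - spectrum[k].conjugate()) < 1e-9
    print(hermitian_length(n), inplace_real_buffer_length(n))
```

For N = 2 this gives 2 complex outputs and 4 reals, which matches the N * 2 elements allocated in the test program. Note also that after a forward and backward transform, results are generally best compared against the input with a small tolerance rather than with exact equality.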

Support for Cygwin and MinGW?

Hi,

I was wondering if you could support Cygwin and MinGW here, I know MSVC is most used under Windows but it would be nice.

Thanks!

Test build failure with clang

On Mac OS X 10.10 with clang XXX, I get this error when building the Test executable:

/usr/local/opt/ccache/libexec/c++   -DGTEST_USE_OWN_TR1_TUPLE -std=c++11 -stdlib=libc++  -O3 -DNDEBUG -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk -I/usr/local/include -Igtest-external-prefix/src/gtest-external/include -F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk/System/Library/Frameworks -Iinclude -I../include -MMD -MT tests/CMakeFiles/Test.dir/accuracy_test_common.cpp.o -MF tests/CMakeFiles/Test.dir/accuracy_test_common.cpp.o.d -o tests/CMakeFiles/Test.dir/accuracy_test_common.cpp.o -c ../tests/accuracy_test_common.cpp
In file included from ../tests/accuracy_test_common.cpp:22:
In file included from ../tests/fftw_transform.h:24:
../tests/buffer.h:248:18: error: comparison between pointer and integer ('const size_t *' (aka 'const unsigned long *') and 'size_t' (aka 'unsigned long'))
  if( strides_in == tightly_packed) {
      ~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
1 error generated.
[12/20] Building CXX object tests/CMakeFiles/Test.dir/buffer.cpp.o
FAILED: /usr/local/opt/ccache/libexec/c++   -DGTEST_USE_OWN_TR1_TUPLE -std=c++11 -stdlib=libc++  -O3 -DNDEBUG -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk -I/usr/local/include -Igtest-external-prefix/src/gtest-external/include -F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk/System/Library/Frameworks -Iinclude -I../include -MMD -MT tests/CMakeFiles/Test.dir/buffer.cpp.o -MF tests/CMakeFiles/Test.dir/buffer.cpp.o.d -o tests/CMakeFiles/Test.dir/buffer.cpp.o -c ../tests/buffer.cpp
In file included from ../tests/buffer.cpp:20:
../tests/buffer.h:248:18: error: comparison between pointer and integer ('const size_t *' (aka 'const unsigned long *') and 'size_t' (aka 'unsigned long'))
  if( strides_in == tightly_packed) {
      ~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
1 error generated.
[12/20] Building CXX object tests/CMakeFiles/Test.dir/accuracy_test_pow2.cpp.o
FAILED: /usr/local/opt/ccache/libexec/c++   -DGTEST_USE_OWN_TR1_TUPLE -std=c++11 -stdlib=libc++  -O3 -DNDEBUG -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk -I/usr/local/include -Igtest-external-prefix/src/gtest-external/include -F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk/System/Library/Frameworks -Iinclude -I../include -MMD -MT tests/CMakeFiles/Test.dir/accuracy_test_pow2.cpp.o -MF tests/CMakeFiles/Test.dir/accuracy_test_pow2.cpp.o.d -o tests/CMakeFiles/Test.dir/accuracy_test_pow2.cpp.o -c ../tests/accuracy_test_pow2.cpp
In file included from ../tests/accuracy_test_pow2.cpp:22:
In file included from ../tests/fftw_transform.h:24:
../tests/buffer.h:248:18: error: comparison between pointer and integer ('const size_t *' (aka 'const unsigned long *') and 'size_t' (aka 'unsigned long'))
  if( strides_in == tightly_packed) {
      ~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
1 error generated.
[12/20] Building CXX object tests/CMakeFiles/Test.dir/accuracy_test_pow3.cpp.o
FAILED: /usr/local/opt/ccache/libexec/c++   -DGTEST_USE_OWN_TR1_TUPLE -std=c++11 -stdlib=libc++  -O3 -DNDEBUG -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk -I/usr/local/include -Igtest-external-prefix/src/gtest-external/include -F/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk/System/Library/Frameworks -Iinclude -I../include -MMD -MT tests/CMakeFiles/Test.dir/accuracy_test_pow3.cpp.o -MF tests/CMakeFiles/Test.dir/accuracy_test_pow3.cpp.o.d -o tests/CMakeFiles/Test.dir/accuracy_test_pow3.cpp.o -c ../tests/accuracy_test_pow3.cpp
In file included from ../tests/accuracy_test_pow3.cpp:22:
In file included from ../tests/fftw_transform.h:24:
../tests/buffer.h:248:18: error: comparison between pointer and integer ('const size_t *' (aka 'const unsigned long *') and 'size_t' (aka 'unsigned long'))
  if( strides_in == tightly_packed) {
      ~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
1 error generated.

This is in this line:

        if( strides_in == tightly_packed) {

that I'm tempted, without knowing much about what it does, to change to

        if( *strides_in == tightly_packed) {

to actually compare two size_t instead of a size_t and a size_t*.

The tests build with that change, but a segfault occurs during the run on this exact same line, so this is probably not the expected fix ;-)

Enhance transpose kernel generator to support additional sizes

The transpose kernel generator can only generate kernels whose sizes in the X and Y dimensions are multiples of 32 (mod-32). Given that our FFT kernel generators support prime factors of 3 and 5, this gives weak coverage of the vector sizes we support. The transpose kernel generator should be extended to handle all sizes that the FFT kernel generator supports, if not arbitrary dimension lengths.
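The coverage gap is easy to see with a few sizes. The check below is a sketch that assumes "mod-32" means both dimensions must be multiples of 32; `can_transpose` and its `tile` parameter are hypothetical names, not part of the library.

```python
def can_transpose(width, height, tile=32):
    """True if both dimensions are multiples of the assumed tile size."""
    return width % tile == 0 and height % tile == 0


if __name__ == "__main__":
    # Powers of 3 and 5 are valid FFT lengths but never multiples of 32.
    for shape in [(1024, 1024), (96, 160), (243, 243), (625, 625), (1000, 1000)]:
        print(shape, can_transpose(*shape))
```

Sizes such as 243 (3^5) or 625 (5^4) are valid FFT lengths but fail this check, while 96 and 160 pass only because they carry a factor of 32.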

OSX clRetainContext

Hi,
On OSX, as part of an OpenGL view integration with a shared OpenCL context, clRetainContext() shouldn't be called - as the application does not own the context. If you create your own OpenCL context separately - this is fine.

I've removed this previously from my own build for the reason that OS X blows up if you attempt this, it may be worth adding something around this.

Nick

Add Radix-7 support

Our kernel generator can only handle FFT vector sizes that contain only 2, 3, and 5 as prime factors. This enhancement is to add support for generating kernels that can handle vector sizes containing 7 as a prime factor.
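Whether a given length is covered by a set of radices can be checked by dividing those primes out. This is a small sketch, not the generator's own logic; `leftover_factor` and `is_supported` are hypothetical helpers.

```python
def leftover_factor(n, radices=(2, 3, 5)):
    """Divide out the given primes and return what remains of n."""
    if n < 1:
        raise ValueError("length must be a positive integer")
    for radix in radices:
        while n % radix == 0:
            n //= radix
    return n


def is_supported(n, radices=(2, 3, 5)):
    """True if n has no prime factors outside the given radices."""
    return leftover_factor(n, radices) == 1


if __name__ == "__main__":
    for length in (360, 1024, 7, 49, 1001):
        print(length, is_supported(length), is_supported(length, (2, 3, 5, 7)))
```

Passing (2, 3, 5, 7) as the radix set shows which lengths the proposed radix-7 support would newly admit.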

OSX 10.10 - Sandbox file access exception

Hi,
Just a note that clFFT causes an exception to be thrown by the Apple Sandbox. From what I can see, Boost creates temporary files outside of the area normally allowed, which represents a security risk. NSTemporaryDirectory offers a user-unique location for temporary files (unix tmpnam will not work). I'm just checking whether there's a CoreFoundation version that will work.

To use Boost/temporary files a user location needs to be added to the sandbox.

Nick

FindclFFT.cmake

To conveniently use clFFT in my CMake-based projects, I put together a (very simple) FindclFFT.cmake file. It works well for me under Linux (Ubuntu 12.04). Maybe it makes sense to include this script with the library; feel free to do that if you so desire. It should probably be extended to work on Windows/Mac OS.

Anyway, here it is:
https://github.com/sschaetz/aura/blob/develop/cmake/FindclFFT.cmake
It is based on LibFindMacros.cmake
https://github.com/sschaetz/aura/blob/develop/cmake/LibFindMacros.cmake

Cheers,
Sebastian

Enhance transpose kernel generator to support sparse vector data

The transpose kernel generator can only generate kernels for tightly packed data. Given that our FFT API supports arbitrary strides and pitches, this gives weak coverage of the input parameters we support. The transpose kernel generator should be extended to handle arbitrary strides and pitches.

Missing linkage with libdl in libclfft

After attempting to make a Debian package for clFFT, the following warnings were thrown by the package build system:

dpkg-shlibdeps: warning: symbol dlclose used by debian/libclfft2/usr/lib/x86_64-linux-gnu/libclFFT.so.2.4.0 found in none of the libraries
dpkg-shlibdeps: warning: symbol dlopen used by debian/libclfft2/usr/lib/x86_64-linux-gnu/libclFFT.so.2.4.0 found in none of the libraries
dpkg-shlibdeps: warning: symbol dlsym used by debian/libclfft2/usr/lib/x86_64-linux-gnu/libclFFT.so.2.4.0 found in none of the libraries

I believe clFFT uses these symbols in the sharedLibrary.h header. As a result, libclfft should be linked with libdl which is appropriate to the platform, similar to what is done in the client's CMakeLists.txt.

clFFT work group size problem on Radeon 4600 GPUs

Hi,

we're testing clFFT 2.2 on our Radeon HD 4670 card and are experiencing some problems: For the generated FFT kernel our OpenCL runtime (Catalyst 13.9, APP SDK 2.8.1) reports a maximum work group size of 32. However, as far as I could see in the sources, a minimum work group size of 64 or higher is coded by the kernel generators.

Unfortunately this totally prevents clFFT from executing on our Radeon 4670, the client test program stops with an insufficient resources error.

So, without knowing much about the internals of the FFT kernels: would it be possible to lower the required work group size to 32 in the kernel generators? In our application we would need two-dimensional FFTs with block sizes ranging from 64x64 to 256x256.

We are aware that the Radeon 4600 family nowadays is a legacy core with limited resources, stemming from the early days of OpenCL support on Radeon GPUs. However, we eventually would use the FFT on a long-running embedded E4690 platform and it would be really great to have the FFT capability on this core.

Thanks,
Philipp

Centralized location for path management

The rationale behind this issue is similar to what is explained in:

arrayfire/arrayfire#721

Feel free to re-use the solution I submitted in the following PR:

arrayfire/arrayfire#722

i.e., a corresponding clFFTInstallDirs module containing all the installation paths (for libraries, runtime, include, CMake modules...), which is then imported in the main CMakeLists.txt and used throughout the rest of the project. That way, all install paths associated with the project can be changed / overridden from one single location.

Add a persistent plan concept

Our library uses the model of an FFT plan; a common idea in the FFT world where an object manipulated by the user represents the 'state' of an FFT operation. The plan also contains references to the OpenCL state that are relevant to the FFT operation, such as the compiled kernel binaries. Some of these operations can be expensive, and it would be nice to allow users to pay the cost once and then serialize the data out to disk, to preserve the work across FFT runs. Further FFT runs could then load the plan state off of disk and continue the operation, hopefully resulting in a performance boost.
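The idea can be sketched as a small on-disk store keyed by the plan parameters. Everything here is hypothetical: `plan_key`, `PlanStore`, the parameter names, and the opaque byte string standing in for compiled kernel binaries are illustrations, not a proposed clFFT API.

```python
import hashlib
import json
import os
import tempfile
from typing import Optional


def plan_key(params: dict) -> str:
    """Derive a stable file name from the plan parameters."""
    blob = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


class PlanStore:
    """Save and load opaque plan data (e.g. kernel binaries) on disk."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, params: dict) -> str:
        return os.path.join(self.directory, plan_key(params) + ".bin")

    def save(self, params: dict, binary: bytes) -> None:
        with open(self._path(params), "wb") as handle:
            handle.write(binary)

    def load(self, params: dict) -> Optional[bytes]:
        try:
            with open(self._path(params), "rb") as handle:
                return handle.read()
        except FileNotFoundError:
            return None  # nothing baked for these parameters yet


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = PlanStore(tmp)
        params = {"device": "example-gpu", "lengths": [1024, 1024],
                  "precision": "single"}
        print(store.load(params))  # first run: nothing cached
        store.save(params, b"\x00compiled-kernel-bytes")
        print(store.load(params))  # later runs: reuse the stored data
```

The key would need to include anything that invalidates a compiled kernel, such as the device and driver version, so that a stale binary is never loaded.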