clmathlibraries / clblas

a software library containing BLAS functions written in OpenCL

License: Apache License 2.0

clblas's Introduction

clBLAS

This repository houses the code for the OpenCL™ BLAS portion of clMath. The complete set of BLAS level 1, 2 & 3 routines is implemented. Please see Netlib BLAS for the list of supported routines. In addition to GPU devices, the library also supports running on CPU devices to facilitate debugging and multicore programming. APPML 1.12 is the most current generally available pre-packaged binary version of the library available for download for both Linux and Windows platforms.

The primary goal of clBLAS is to make it easier for developers to utilize the inherent performance and power efficiency benefits of heterogeneous computing. clBLAS interfaces neither hide nor wrap OpenCL interfaces; instead, they leave OpenCL state management under the control of the user to allow for maximum performance and flexibility. The clBLAS library does generate and enqueue optimized OpenCL kernels, relieving the user of the task of writing, optimizing and maintaining kernel code.

clBLAS update notes 01/2017

v2.12 is a bugfix release as a rollup of all fixes in /develop branch
- Thanks to @pavanky, @iotamudelta, @shahsan10, @psyhtest, @haahh, @hughperkins, @tfauck, @abhiShandy, @IvanVergiliev, @zougloub, @mgates3 for contributions to clBLAS v2.12
Summary of fixes available to read on the releases tab

clBLAS library user documentation

Library and API documentation for developers is available online as a GitHub Pages website

Google Groups

Two mailing lists have been created for the clMath projects:

[email protected] - group whose focus is to answer questions on using the library or reporting issues
[email protected] - group whose focus is for developers interested in contributing to the library code itself

clBLAS Wiki

The project wiki contains helpful documentation, including a build primer

Contributing code

Please refer to and read the Contributing document for guidelines on how to contribute code to this open source project. The code in the /master branch is considered to be stable, and all pull-requests should be made against the /develop branch.

License

The source for clBLAS is licensed under the Apache License, Version 2.0

Example

The simple example below shows how to use clBLAS to compute an OpenCL accelerated SGEMM

    #include <sys/types.h>
    #include <stdio.h>

    /* Include the clBLAS header. It includes the appropriate OpenCL headers */
    #include <clBLAS.h>

    /* This example uses predefined matrices and their characteristics for
     * simplicity.
     */

    #define M  4
    #define N  3
    #define K  5

    static const cl_float alpha = 10;

    static const cl_float A[M*K] = {
    11, 12, 13, 14, 15,
    21, 22, 23, 24, 25,
    31, 32, 33, 34, 35,
    41, 42, 43, 44, 45,
    };
    static const size_t lda = K;        /* i.e. lda = K */

    static const cl_float B[K*N] = {
    11, 12, 13,
    21, 22, 23,
    31, 32, 33,
    41, 42, 43,
    51, 52, 53,
    };
    static const size_t ldb = N;        /* i.e. ldb = N */

    static const cl_float beta = 20;

    static cl_float C[M*N] = {
        11, 12, 13,
        21, 22, 23,
        31, 32, 33,
        41, 42, 43,
    };
    static const size_t ldc = N;        /* i.e. ldc = N */

    static cl_float result[M*N];

    int main( void )
    {
    cl_int err;
    cl_platform_id platform = 0;
    cl_device_id device = 0;
    cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
    cl_context ctx = 0;
    cl_command_queue queue = 0;
    cl_mem bufA, bufB, bufC;
    cl_event event = NULL;
    int ret = 0;

    /* Setup OpenCL environment. */
    err = clGetPlatformIDs( 1, &platform, NULL );
    err = clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL );

    props[1] = (cl_context_properties)platform;
    ctx = clCreateContext( props, 1, &device, NULL, NULL, &err );
    queue = clCreateCommandQueue( ctx, device, 0, &err );

    /* Setup clBLAS */
    err = clblasSetup( );

    /* Prepare OpenCL memory objects and place matrices inside them. */
    bufA = clCreateBuffer( ctx, CL_MEM_READ_ONLY, M * K * sizeof(*A),
                          NULL, &err );
    bufB = clCreateBuffer( ctx, CL_MEM_READ_ONLY, K * N * sizeof(*B),
                          NULL, &err );
    bufC = clCreateBuffer( ctx, CL_MEM_READ_WRITE, M * N * sizeof(*C),
                          NULL, &err );

    err = clEnqueueWriteBuffer( queue, bufA, CL_TRUE, 0,
        M * K * sizeof( *A ), A, 0, NULL, NULL );
    err = clEnqueueWriteBuffer( queue, bufB, CL_TRUE, 0,
        K * N * sizeof( *B ), B, 0, NULL, NULL );
    err = clEnqueueWriteBuffer( queue, bufC, CL_TRUE, 0,
        M * N * sizeof( *C ), C, 0, NULL, NULL );

    /* Call the clBLAS extended function. All offsets are zero, so GEMM is
     * performed on the full matrices. */
    err = clblasSgemm( clblasRowMajor, clblasNoTrans, clblasNoTrans,
                       M, N, K,
                       alpha, bufA, 0, lda,
                       bufB, 0, ldb, beta,
                       bufC, 0, ldc,
                       1, &queue, 0, NULL, &event );

    /* Wait for calculations to be finished. */
    err = clWaitForEvents( 1, &event );

    /* Fetch results of calculations from GPU memory. */
    err = clEnqueueReadBuffer( queue, bufC, CL_TRUE, 0,
                                M * N * sizeof(*result),
                                result, 0, NULL, NULL );

    /* Release OpenCL memory objects. */
    clReleaseMemObject( bufC );
    clReleaseMemObject( bufB );
    clReleaseMemObject( bufA );

    /* Finalize work with clBLAS */
    clblasTeardown( );

    /* Release OpenCL working objects. */
    clReleaseCommandQueue( queue );
    clReleaseContext( ctx );

    return ret;
    }
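
To sanity-check what the example above reads back into `result`, a plain CPU reference can be handy. The following is a minimal sketch, not part of clBLAS: `ref_sgemm_row_major` is a hypothetical helper name, and it only covers the row-major, no-transpose case used in the example.

```c
#include <stddef.h>

/* Hypothetical CPU reference (not a clBLAS function).
 * Row-major, no transposes: C = alpha * A * B + beta * C.
 * A is m x k (leading dimension lda), B is k x n (ldb), C is m x n (ldc). */
static void ref_sgemm_row_major(size_t m, size_t n, size_t k,
                                float alpha, const float *a, size_t lda,
                                const float *b, size_t ldb,
                                float beta, float *c, size_t ldc)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            /* Dot product of row i of A with column j of B. */
            for (size_t p = 0; p < k; p++)
                acc += a[i * lda + p] * b[p * ldb + j];
            c[i * ldc + j] = alpha * acc + beta * c[i * ldc + j];
        }
    }
}
```

Calling it with the same `A`, `B`, `C`, `alpha`, `beta` and leading dimensions as the example, on a copy of `C`, gives values that can be compared element by element against `result`.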

Build dependencies

Library for Windows

Windows® 7/8
Visual Studio 2010 SP1, 2012
An OpenCL SDK, such as APP SDK 2.8
Latest CMake

Library for Linux

GCC 4.6 and onwards
An OpenCL SDK, such as APP SDK 2.9
Latest CMake

Library for Mac OSX

Recommended to generate Unix makefiles with cmake

Test infrastructure

Googletest v1.6
Latest Boost
CPU BLAS

Netlib CBLAS (recommended)
Ubuntu: install with "apt-get install libblas-dev"
Windows: download & install lapack-3.6.0, which comes with CBLAS
or ACML on Windows/Linux; Accelerate on Mac OSX

Performance infrastructure

Python

clblas's Issues

undefined reference to symbol 'pthread_key_delete@@GLIBC_2.2.5'

clBLAS fails to link in master and develop branch. I opened a pull request that was closed. The fix can be found here:

https://github.com/lunochod/clBLAS/blob/master/src/tests/CMakeLists.txt

Sgemm crashes on Intel Platform when transA is false and transB is true

Using Sgemm from clBLAS version 2.3 with transA = false and transB = true results in clblasBuildProgramFailure on my machine when using Intel platform 4.5.0.8. On NVIDIA/AMD platforms I do not get this error. Can anyone confirm this?

Kernel compilation Error when performing transpose(A) * B on intel devices

The code and the error can be seen here:

https://gist.github.com/pavanky/b19b5c7fca58e557510c

This only happens when using clblasTrans for A and clblasNoTrans for B. Everything else seems to work fine.

Failures in ColumnMajor_SmallRange/TRMM.ztrmm test set

Many tests are failing under this test set which can lead to a GPU hang. This has been observed on a variety of AMD hardware. To run tests without testing ztrmm, one can use a googletest filter like so:
./test-short --gtest_filter="-*ztrmm*"

[OSX] test-short: 1713 tests fail (10.9.3, AMD FirePro D300)

System is a MacPro (Late 2013) with two AMD FirePro D300.
More detailed info is below.

% ./staging/test-short --gtest_filter="-*nrm2*"
[----------] Global test environment tear-down
[==========] 9792 tests from 122 test cases ran. (90050 ms total)
[  PASSED  ] 8079 tests.
[  FAILED  ] 1713 tests, listed below:

The full list can be found at the gist here

Most tests seem to fail with error -43 aka CL_INVALID_BUILD_OPTIONS:

Calling clblas xNRM2 routine...
========================================================

AN INTERNAL KERNEL BUILD ERROR OCCURRED!
device name = AMD Radeon HD - FirePro D300 Compute Engine
error = -43

The NRM2 test actually crashes the test program, so I had to exclude it:

[  FAILED  ] SelectedSmall0_NRM2/NRM2.dnrm2/0, where GetParam() = (61, 4, 0, 1, 1) (5 ms)
[ RUN      ] SelectedSmall0_NRM2/NRM2.dnrm2/1
N = 61, offx = 0, incx = -11
offNRM2 = 1
queues = 1
number of command queues : 1

Generating input data... Done
Process 44384 stopped
* thread #1: tid = 0xcd0c5, 0x00007fff98480cc3 libBLAS.dylib`cblas_dnrm2 + 243, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x9038393a8)
    frame #0: 0x00007fff98480cc3 libBLAS.dylib`cblas_dnrm2 + 243
libBLAS.dylib`cblas_dnrm2 + 243:
-> 0x7fff98480cc3:  movsd  (%rsi), %xmm2
   0x7fff98480cc7:  andpd  %xmm1, %xmm2
   0x7fff98480ccb:  ucomisd %xmm4, %xmm2
   0x7fff98480ccf:  ja     0x7fff98480cf0            ; cblas_dnrm2 + 288
(lldb) bt
* thread #1: tid = 0xcd0c5, 0x00007fff98480cc3 libBLAS.dylib`cblas_dnrm2 + 243, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x9038393a8)
  * frame #0: 0x00007fff98480cc3 libBLAS.dylib`cblas_dnrm2 + 243
    frame #1: 0x0000000100007609 test-short`dnrm2(n=61, x=0x0000000103839400, incx=-11) + 41 at blas-lapack.c:848
    frame #2: 0x00000001003984cb test-short`blasDnrm2(N=61, X=0x0000000103839400, offx=0, incx=-11) + 59 at blas.c:4945
    frame #3: 0x000000010039e08b test-short`clMath::blas::nrm2(N=61, X=0x0000000103839400, offx=0, incx=-11) + 43 at blas-wrapper.cpp:2439
    frame #4: 0x00000001002a7381 test-short`void nrm2CorrectnessTest<double, double>(params=0x00007fff5fbfe9f0) + 1745 at corr-nrm2.cpp:124
    frame #5: 0x00000001002a43cb test-short`NRM2_dnrm2_Test::TestBody(this=0x0000000102d05670) + 43 at corr-nrm2.cpp:203
    frame #6: 0x00000001003ea0a3 test-short`void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 131
    frame #7: 0x00000001003d4677 test-short`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 119
    frame #8: 0x00000001003ad9f5 test-short`testing::Test::Run() + 197
    frame #9: 0x00000001003aeccb test-short`testing::TestInfo::Run() + 219
    frame #10: 0x00000001003afbf7 test-short`testing::TestCase::Run() + 231
    frame #11: 0x00000001003bc6e8 test-short`testing::internal::UnitTestImpl::RunAllTests() + 952
    frame #12: 0x00000001003e7033 test-short`bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 131
    frame #13: 0x00000001003d6f07 test-short`bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 119
    frame #14: 0x00000001003bc2a6 test-short`testing::UnitTest::Run() + 422
    frame #15: 0x00000001002f7d51 test-short`RUN_ALL_TESTS() + 17 at gtest.h:2288
    frame #16: 0x00000001002d7ab7 test-short`main(argc=1, argv=0x00007fff5fbff498) + 1015 at test-correctness.cpp:3397

More detailed system info:

% sw_vers
ProductName:    Mac OS X
ProductVersion: 10.9.3
BuildVersion:   13D65

clinfo output can be found here

clblasDaxpy with "incr" input argument set to zero

I'm trying to add a scalar to every element of a matrix using clblasDaxpy.
To do that, I set the increment of the scalar operand to zero, but clblasDaxpy does not accept this value.

The daxpy routines in reference BLAS and cuBLAS allow this, so I think clblasDaxpy should allow it as well.

Would it be possible to add this feature to clBLAS?
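
For illustration, the requested semantics can be sketched in plain C. This is a hypothetical CPU reference for non-negative increments (`ref_daxpy` is not a clBLAS function): with `incx == 0` the loop reads `x[0]` on every step, so `alpha * x[0]` is added to each element of `y`.

```c
#include <stddef.h>

/* Hypothetical reference AXPY for non-negative increments:
 * y[i * incy] += alpha * x[i * incx] for i = 0 .. n-1.
 * With incx == 0, x[0] acts as a broadcast scalar. */
static void ref_daxpy(size_t n, double alpha,
                      const double *x, size_t incx,
                      double *y, size_t incy)
{
    for (size_t i = 0; i < n; i++)
        y[i * incy] += alpha * x[i * incx];
}
```

For example, with `x` pointing at a single value 2.5, `alpha = 1.0`, `incx = 0` and `incy = 1`, every element of `y` is increased by 2.5.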

clBLASgemm: "mixed vector-scalar operation not allowed unless up-convertable" on Hawaii

Related to this Caffe PR: when I run the Caffe tests on an R9-290X, I see a test failure, apparently while compiling a clBLAS kernel for clBLASgemm. This is the error. If I instead run the same tests on the same machine but tell them to use the 7950/Tahiti device, all the tests pass. So is clBLASgemm broken on Hawaii? If I run the tests with the CPU (FX6350) as the OpenCL device, I get the same compile error as with the Hawaii device. This is what clinfo shows on the machine, an all-AMD Ubuntu machine with OpenCL and no CUDA.

To run these tests, I'm building what's in this branch. I expect the same problem would manifest with @lunochod 's branch, but it doesn't have the ability to specify which OpenCL device the tests should be run on, whereas mine does. So I can't verify on his branch, since device defaults to the first GPU with his branch, which is the Tahiti device. Namely, when you run test.testbin you add one command line argument to specify which device. In my case 0 (the default) is the CPU, 1 is the Tahiti device and 2 is the Hawaii card. If someone wants to reproduce this by building and running the tests, you will need to install all the caffe prerequisites first (except CUDA), and build using cmake, not the Makefile that comes in the caffe directory. I can help if you have questions about that.

Add logging mechanism

Incorporate logger functionality that creates a human readable text file with all clBLAS function calls and all parameter values logged. This gives users visibility to which clBLAS functions are called and how many times. The library should allow for multiple levels of logging; one level that only logs API calls, then extra verbosity levels that also log warnings, errors and optimization hints.

sger fails on nvidia

The following code fails with error -36 'invalid queue'. This usually means an out-of-bounds array access occurred.

/* ************************************************************************
 * Copyright 2013 Advanced Micro Devices, Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 * ************************************************************************/
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>   /* malloc, free, abs */
#include <string.h>
#include <math.h>
/* Include CLBLAS header. It automatically includes needed OpenCL header,
 * so we can drop out explicit inclusion of cl.h header.
 */
#include <clBLAS.h>
/* This example uses predefined matrices and their characteristics for
 * simplicity purpose.
 */
static const clblasOrder order = clblasColumnMajor;
static const size_t M = 5;
static const size_t N = 5;
static const cl_float alpha = 1;
static const size_t lda = 5;
static const int incx = 1;
static const int incy = 1;
/*static void*/
/*printResult(void)*/
/*{*/
/*    size_t i, j;*/
/*    printf("\nResult:\n");*/
/*    for (i = 0; i < M; i++) {*/
/*        for(j = 0; j < N; j++)*/
/*            printf("\t%f", A[ i*N + j ]);*/
/*        printf("\n");*/
/*    }*/
/*}*/
int
main(void)
{
    cl_int err;
    cl_platform_id platform = 0;
    cl_device_id device = 0;
    cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
    cl_context ctx = 0;
    cl_command_queue queue = 0;
    cl_mem bufA, bufX, bufY;
    cl_event event = NULL;
    int ret = 0;
    /* Setup OpenCL environment. */
    err = clGetPlatformIDs(1, &platform, NULL);
    if (err != CL_SUCCESS) {
        printf( "clGetPlatformIDs() failed with %d\n", err );
        return 1;
    }
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    if (err != CL_SUCCESS) {
        printf( "clGetDeviceIDs() failed with %d\n", err );
        return 1;
    }
    props[1] = (cl_context_properties)platform;
    ctx = clCreateContext(props, 1, &device, NULL, NULL, &err);
    if (err != CL_SUCCESS) {
        printf( "clCreateContext() failed with %d\n", err );
        return 1;
    }
    queue = clCreateCommandQueue(ctx, device, 0, &err);
    if (err != CL_SUCCESS) {
        printf( "clCreateCommandQueue() failed with %d\n", err );
        clReleaseContext(ctx);
        return 1;
    }
    /* Setup clblas. */
    err = clblasSetup();
    if (err != CL_SUCCESS) {
        printf("clblasSetup() failed with %d\n", err);
        clReleaseCommandQueue(queue);
        clReleaseContext(ctx);
        return 1;
    }
    cl_float *pX = (cl_float *)malloc(M * sizeof(cl_float));
    cl_float *pY = (cl_float *)malloc(N * sizeof(cl_float));
    cl_float *pA = (cl_float *)malloc(M*N* sizeof(cl_float));
    memset(pX, 0, M*sizeof(cl_float));
    memset(pY, 0, N*sizeof(cl_float));
    memset(pA, 0, M*N*sizeof(cl_float));
    /* Prepare OpenCL memory objects and place matrices inside them. */
    bufA = clCreateBuffer(ctx, CL_MEM_READ_WRITE, M * lda * sizeof(cl_float),
                          NULL, &err);
    bufX = clCreateBuffer(ctx, CL_MEM_READ_ONLY, ( 1 + ( M - 1 )*abs( incx ) ) * sizeof(cl_float),
                          NULL, &err);
    bufY = clCreateBuffer(ctx, CL_MEM_READ_ONLY, ( 1 + ( N - 1 )*abs( incy ) ) * sizeof(cl_float),
                          NULL, &err);
    err = clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0,
        M * lda * sizeof(cl_float), pA, 0, NULL, NULL);
    err = clEnqueueWriteBuffer(queue, bufX, CL_TRUE, 0,
        ( 1 + ( M - 1 )*abs( incx ) ) * sizeof(cl_float), pX, 0, NULL, NULL);
    err = clEnqueueWriteBuffer(queue, bufY, CL_TRUE, 0,
        ( 1 + ( N - 1 )*abs( incy ) ) * sizeof(cl_float), pY, 0, NULL, NULL);
    /* Call clblas function. */
    err = clFinish(queue);
    printf("err %i\n", err);
    err = clblasSger(order, M, N, alpha, bufX, 0, incx, bufY, 0, incy,
        bufA, 0, lda, 1, &queue, 0, NULL, &event);
    if (err != CL_SUCCESS) {
        printf("clblasSger() failed with %d\n", err);
        ret = 1;
    }
    else {
        /* Wait for calculations to be finished. */
        err = clWaitForEvents(1, &event);
        /* Fetch results of calculations from GPU memory. */
        err = clEnqueueReadBuffer(queue, bufA, CL_TRUE, 0, (M * lda * sizeof(cl_float)),
                                  pA, 0, NULL, NULL);
        /* At this point you will get the result of SGER placed in A array. */
/*        printResult();*/
    }
    err = clFinish(queue);
    printf("err %i\n", err);
    /* Release OpenCL memory objects. */
    clReleaseMemObject(bufY);
    clReleaseMemObject(bufX);
    clReleaseMemObject(bufA);
    free(pX);
    free(pY);
    free(pA);
    /* Finalize work with clblas. */
    clblasTeardown();
    /* Release OpenCL working objects. */
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return ret;
}

Documentation for the callbacks on the CBLAS-like API

I just got clBLAS compiling on OS X (see #7) and did a very simple DGEMM test, see https://github.com/fommil/netlib-java/

The results are, frankly, a little unbelievable... so I'm going to have to check that the DGEMM is actually being performed.

However, as part of the setup I found it very hard to understand the clblasDgemm API. It looks like you've added offsets to the arrays (which doesn't make much sense in C, since this can be done by just moving the pointer) and also added the following

    cl_uint numCommandQueues,
    cl_command_queue *commandQueues,
    cl_uint numEventsInWaitList,
    const cl_event *eventWaitList,
    cl_event *events

I just set these to NULL or 0 as appropriate: https://github.com/fommil/netlib-java/blob/master/perf/src/main/c/clwrapper.c

What are these for and what are sensible defaults just to get me up and running?

Segmentation fault after linking to lib64/clAmdBlas.lib in MinGW compilation on Windows

I am trying to use clAmdBlas 1.10 package on Windows 7 (64 bit) notebook with Nvidia GTX540M. I am using MinGW (with gcc 4.8.1) to compile the programs. I've noticed that there is a difference (in native 64-bit mode) between linking to clAmdBlas.lib and directly to clAmdBlas.dll (renamed into libclAmdBlas.dll.a).

If I link to clAmdBlas.dll everything is fine. If I link to clAmdBlas.lib, the compilation goes OK, but I get segmentation fault when calling clAmdBlasSetup. I have tried to run those two executables (without recompiling) on another Windows7 computer with AMD7950 GPU - had the same results. This behavior is the same both with my program and with examples from clAmdBlas package (e.g. example_saxpy.c).

Is this intended behavior (e.g. because the .lib is designed for Visual Studio and not for MinGW) or is it a bug? I have not encountered such differences with the 32-bit library or with clAmdFft (both .lib and .dll can be used for linking, resulting in executables which are not binary identical but produce identical results when run).

I am reporting a problem for a rather old version of the package because I do not see an easy way to obtain a more recent DLL for Windows.

Example running on GPUs not CPU

Hi again all,

I've managed to get the README example for SGEMM working through my Ruby FFI into clBLAS but for some reason it will only run on my GPUs, not the CPU. A collaborator of mine is using identical code and it's running just fine on his CPU, so I don't suspect the code.

The exact issue is that I'm getting an INVALID_EVENT error from OpenCL. So it seems that for some reason the kernel isn't getting enqueued on my CPU and the event isn't getting created.

Is there any relevant system information I could give that would be helpful?

My collaborator and I are both using Intel CPUs. I'm on OpenCL 1.1.

I don't really expect anyone to dig into the Ruby FFI for me (though that would be awesome) but perhaps you can give me a direction to bug hunt?

Generalize the tuning code in tools/tune/tune.c

Right now the code works as expected only if the platform is "Advanced Micro Devices, Inc." AND the platform has a GPU.

Generalize this tuning code to work for a given (platform, device) combination.

Delegating libblas to ease user uptake

I wrote this up in my own project as it applies to all the GPU implementations

fommil/netlib-java#50

If we could work out the dynamic library loading bit, this sounds like a pretty easy thing to do (but incredibly monotonous!). Possibly worth a feasibility with DDOT/DGEMM.

Generate makefile without Test dependencies still requires dependencies

I built a makefile with cmake on Ubuntu. Even though the test dependencies (Boost, GTest, and ACML) are turned off, the makefile still builds source files that need those dependencies.

kernel sgemmBlock does not compile

I'm trying to use clBLAS on ARM Mali GPU and I'm getting the error below. However if I amend the code by hand, the code compiles ok:

    uC.f8v = C + (coord.y * ldc + coord.x)/8;

I am not sure whether this is done on purpose or whether the ARM OpenCL compiler is the one being strict here. If you let me know where this code is generated, I can give you a hand fixing this issue.
Wrong code:

AN INTERNAL KERNEL BUILD ERROR OCCURRED!
device name = Mali-T604
error = -11
memory pattern = Cached global memory based block gemm, computing kernel generator
Subproblem dimensions: dims[0].itemY = 8, dims[0].itemX = 64, dims[0].y = 8, dims[0].x = 64, dims[0].bwidth = 4; ; dims[1].itemY = 1, dims[1].itemX = 8, dims[1].y = 1, dims[1].x = 8, dims[1].bwidth = 4; ; 
Parallelism granularity: pgran->wgDim = 2, pgran->wgSize[0] = 8, pgran->wgSize[1] = 8, pgran->wfSize = 64
Kernel extra flags: 2816
Source:

typedef union GPtr {
    __global float *f;
    __global float2 *f2v;
    __global float4 *f4v;
    __global float8 *f8v;
    __global float16 *f16v;
} GPtr;

typedef union LPtr {
    __local float *f;
    __local float2 *f2v;
    __local float4 *f4v;
    __local float8 *f8v;
    __local float16 *f16v;
} LPtr;

typedef union PPtr {
    float *f;
    float2 *f2v;
    float4 *f4v;
    float8 *f8v;
    float16 *f16v;
} PPtr;

__attribute__((reqd_work_group_size(8, 8, 1)))
void __kernel
sgemmBlock(
    uint M,
    uint N,
    uint K,
    const float alpha,
    const float beta,
    const __global float4 *restrict A,
    const __global float8 *restrict B,
    __global float8 *C,
    uint lda,
    uint ldb,
    uint ldc)
{
    float4 a0;
    float8 b0, b1, b2, b3;
    float8 c0;
    uint4 coord = 0u; /* contains coordB, coordA, k */

    lda /= 4;
    ldb /= 8;
    uint kif;
    uint get_group_id_0;
    uint get_global_id_0;
    A += 1u * (uint)get_global_id(1) * lda;
    get_group_id_0 = (get_group_id(0) + get_group_id(1))% get_num_groups(0);
    get_global_id_0 = get_group_id_0 * get_local_size(0) + get_local_id(0);
    kif = (N % 512 != 0);
    get_global_id_0 = (kif*(uint)get_global_id(0)) + ((1-kif)*get_global_id_0);
    B += get_global_id_0;
    coord.y = 1u * (uint)get_global_id(1);
    coord.x = 8u * (uint)get_global_id_0;
    if ((coord.y >= M) || (coord.x >= N)) {
        return;
    }

    c0 = 0;

    for (uint k1 = 0; k1 < K; k1 += 4) {
        /* -- Tiles multiplier -- */
        b0 = B[0];
        b1 = B[ldb];
        b2 = B[(ldb << 1)];
        b3 = B[mad24(3u, ldb, 0u)];

        a0 = A[0];

        c0 += b0 * a0.s0;
        c0 += b1 * a0.s1;
        c0 += b2 * a0.s2;
        c0 += b3 * a0.s3;

        A += 1;
        B += (ldb << 2);
        /* ---------------------- */
    }


    GPtr uC;

    uC.f = C + (coord.y * ldc + coord.x)/8;

    __global float8 *pC = uC.f8v;

    float8 tempC0;

    tempC0 = c0 * alpha + 0;
    pC[0] = tempC0;
}



--------------------------------------------------------

Build log:

<source>:86:10: error: assigning to 'float *' from incompatible type 'float8 *'
    uC.f = C + (coord.y * ldc + coord.x)/8;
         ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

if statement containing a for loop, won't this be quite slow?

Hi,

In sger kernel, we have:

if( row + float4size < numrows ) {
   // use vector maths
} else {
   for( i = row; i < numrows; i++ ) {
      // use scalar maths
   }
}

Per my understanding:

this will execute both branches of the 'if' systematically?
And in the second branch, for some threads, it will loop over the entire matrix, every row in the matrix?
and therefore, since all threads in the warp must be in lock-step, actually every thread in the entire warp will iterate over every row in the entire matrix?
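
One way to reason about the cost, as a rough sketch: the scalar loop is only reached when `row + float4size >= numrows`, so it can run at most `float4size` times for any work-item. A wavefront whose work-items disagree on the condition may still step through both branches, but the extra work is bounded by that short tail rather than by the whole matrix. The helper names below are hypothetical and the vector width of 4 is an assumption.

```c
#include <stddef.h>

enum { FLOAT4_SIZE = 4 };  /* assumed vector width */

/* Scalar-loop iterations performed by a work-item that starts at `row`,
 * following the if/else from the issue. Returns 0 when the vector
 * branch is taken or when `row` is already past the end. */
static size_t scalar_tail_iterations(size_t row, size_t numrows)
{
    if (row + FLOAT4_SIZE < numrows)
        return 0;                          /* vector branch */
    return row < numrows ? numrows - row : 0;
}

/* Worst case over every possible starting row. */
static size_t max_tail_iterations(size_t numrows)
{
    size_t worst = 0;
    for (size_t row = 0; row < numrows; row++) {
        size_t it = scalar_tail_iterations(row, numrows);
        if (it > worst)
            worst = it;
    }
    return worst;
}
```

For a matrix with 1000 rows, only starting rows 996 to 999 reach the scalar loop, and none of them iterates more than four times.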

Centralized location for path management

The rationales behind this issue are similar to what is explained in:

arrayfire/arrayfire#721

Feel free to re-use the solution I submitted in the following PR:

arrayfire/arrayfire#722

i.e., a corresponding clBLASInstallDirs module containing all the installation paths (for libraries, runtime, include, cmake modules...), which is then imported in the main CMakeLists.txt and used throughout the rest of the project. That way, all install paths associated with the project can be changed / overridden from one single location.

Extend the clBLAS tuning executable to read a log file

In combination with issue #1, extend the tuning executable to read and parse the log file that is generated. The log file contains all the clBLAS functions and their parameters that a particular app needed or called into, and this allows the tuning executable to create a kernel database .kdb file that is specifically optimized for a particular application.

build of development version on NVIDIA platform fails

clEnqueueFillBuffer() used by clBLAS requires OpenCL 1.2. The NV driver and OpenCL runtime have recently been updated to support OpenCL 1.2, but CUDA-7.0 still ships with the OpenCL 1.1 header files. To build with NV OpenCL, download OpenCL header files from khronos here:

https://www.khronos.org/registry/cl/api/1.2/

and place in /usr/local/cuda-7.0/include/CL/

Build Error

Hi, I am attempting to run some examples and the tests before I incorporate clBLAS into my project. Pretty much everything I run results in the following error

AN INTERNAL KERNEL BUILD ERROR OCCURRED!
device name = Quadro FX 5800
error = -11
memory pattern = Cached global memory based block gemm, computing kernel generator
Subproblem dimensions: dims[0].itemY = 64, dims[0].itemX = 32, dims[0].y = 64, dims[0].x = 32, dims[0].bwidth = 4; ; dims[1].itemY = 8, dims[1].itemX = 4, dims[1].y = 8, dims[1].x = 4, dims[1].bwidth = 4; ;
Parallelism granularity: pgran->wgDim = 2, pgran->wgSize[0] = 8, pgran->wgSize[1] = 8, pgran->wfSize = 64
Kernel extra flags: 939554576
Source:
source dump cut

Build log:

Error in processing command line: Don't understand command line argument "-g"!

Would you be able to assist me with this problem?
Thanks

Basic gemm application not computing correctly

I am trying to adapt the example_sgemm.c file for a full matrix multiplication (not the 'lower-right submatrix'). However I can not figure out how to get the numbers to come out correctly.

If I take the exact same matrices to another language like R, I get the product of the matrices simply with this:

A <- matrix(c(11, 12, 13, 14, 15,
              21, 22, 23, 24, 25,
              31, 32, 33, 34, 35,
              41, 42, 43, 44, 45 ), nrow=4)

B <- matrix(c( 11, 12, 13,
               21, 22, 23,
               31, 32, 33,
               41, 42, 43,
               51, 52, 53), nrow=5)

A %*% B
     [,1] [,2] [,3]
[1,] 2230 4297 6273
[2,] 2369 4612 6729
[3,] 2513 4932 7225
[4,] 2697 5257 7726

Now, I saw the previously closed issue #15 and tried to adapt the statements to set the offsets to zero. I also set alpha = 1 and beta = 0. But the output returned is:

2115 2180 2245 
3665 3780 3895 
5215 5380 5545 
6765 6980 7195

I have been trying for hours to just do what should be a relatively simple task but have come up short. Could someone please help me sort out what is causing the numbers to be different when the inputs are identical? Clearly I must have made some mistake in my arguments to clblasSgemm. I have provided my current file in a temporary repository here.
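
One possible explanation, consistent with both sets of numbers above: R's `matrix()` fills column by column unless `byrow=TRUE` is given, whereas the C example passes `clblasRowMajor`. The sketch below (plain C, hypothetical helper names, no clBLAS involved) multiplies the same flat arrays under both interpretations.

```c
#include <stddef.h>

/* C = A * B with A (m x k), B (k x n), C (m x n), all stored row-major. */
static void matmul_row_major(size_t m, size_t n, size_t k,
                             const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
}

/* Same product, but the flat arrays are read (and C is written)
 * column-major, which is how R's matrix() fills by default. */
static void matmul_col_major(size_t m, size_t n, size_t k,
                             const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += a[p * m + i] * b[j * k + p];
            c[j * m + i] = acc;
        }
}
```

With the 20 values of A and the 15 values of B from the issue, the row-major version gives 2115 for the top-left element, matching the clBLAS output, while the column-major version gives 2230, matching the R output. Under that reading, passing `byrow=TRUE` to `matrix()` in R should make the two agree.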

Move clBLAS.h and related files into include.

Right now clBLAS.h, clAmdBlas.h and others are located in the src/ directory. These should be in the src/include directory.
Right now src/include contains files that are required for internal purposes only. These should probably move to src/library/include.

[OSX] tests fail on the CPU with error -54 (CL_INVALID_WORK_GROUP_SIZE)

It seems that tests consistently fail with the error -54 when executed on the CPU on OSX. Looking at the OpenCL Framework header, error code -54 stands for CL_INVALID_WORK_GROUP_SIZE (see below).
I did some initial investigation into where this could come from. After some digging around, I checked CL_DEVICE_MAX_WORK_ITEM_SIZES: it is 1024, 1, 1 for the CPU (compared to 1024, 1024, 64 for the GPU).
With the debugger I checked that, e.g. for the xGEMM test, the local_work_size arguments to clEnqueueNDRangeKernel are 8, 8 (see below).

I reproduced this bug with a small test program, and indeed any local_work_size[1] > 1 fails with error -54. (Btw, in my opinion it should be CL_INVALID_WORK_ITEM_SIZE [-55], not CL_INVALID_WORK_GROUP_SIZE [-54], according to http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html)

I guess that the fix would be to make sure all the device constraints (e.g. CL_DEVICE_MAX_WORK_ITEM_SIZES) are fulfilled when calculating the work group sizes (I guess somewhere between getStepGranulation and subgGetDefaultDecomp but I have to dig deeper).

(lldb) print kernDesc->localThreads
(size_t [3]) $34 = {
  [0] = 8
  [1] = 8
  [2] = 808
}
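The fix guessed at above (honouring the device constraints when choosing work-group sizes) could be sketched roughly as follows. This is not clBLAS code: clamp_local_size is a hypothetical helper, and the limits are assumed to have been queried beforehand via CL_DEVICE_MAX_WORK_ITEM_SIZES and CL_DEVICE_MAX_WORK_GROUP_SIZE. It uses plain C only, so it builds without OpenCL headers.

```c
#include <stddef.h>

/*
 * Clamp a requested local work size in place so that
 *   - each dimension respects the per-dimension limit max_item[d], and
 *   - the product of all dimensions does not exceed max_group.
 * Returns the resulting number of work-items per work-group.
 * Hypothetical helper for illustration only.
 */
static size_t clamp_local_size(size_t local[], const size_t max_item[],
                               size_t dims, size_t max_group)
{
    size_t total = 1;

    /* Step 1: per-dimension limits. */
    for (size_t d = 0; d < dims; d++) {
        if (local[d] > max_item[d])
            local[d] = max_item[d];
        if (local[d] == 0)
            local[d] = 1;
        total *= local[d];
    }

    /* Step 2: total limit; halve the last dimensions first. */
    for (size_t d = dims; d-- > 0 && total > max_group; ) {
        while (local[d] > 1 && total > max_group) {
            total /= local[d];
            local[d] = (local[d] + 1) / 2;
            total *= local[d];
        }
    }
    return total;
}
```

With the values reported above, a requested size of 8, 8 becomes 8, 1 under the CPU limits 1024, 1, 1 and stays 8, 8 under the GPU limits. Two caveats apply: in OpenCL 1.x the global work size must remain divisible by the local work size, and a kernel compiled with a fixed reqd_work_group_size attribute would have to be regenerated for the new shape.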

====

Device name: Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz
Device vendor: Intel
Platform (bit): Apple OS X
clblas version: 2.1.0
Driver version: 1.1
Device version: OpenCL 1.2
Global mem size: 16384 MB
---------------------------------------------------------

Note: Google Test filter = ColumnMajor_SmallRange/GEMM.cgemm/31
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ColumnMajor_SmallRange/GEMM
[ RUN      ] ColumnMajor_SmallRange/GEMM.cgemm/31
clblasColumnMajor, clblasTrans, clblasNoTrans
M = 128, N = 128, K = 128
offA = 0, offB = 0, offC = 0
lda = 128, ldb = 128, ldc = 128
seed = 12345
queues = 1
Generating input data... Done
Calling reference xGEMM routine... Done
Calling clblas xGEMM routine... /Users/gicmo/Coding/src/clBLAS/src/tests/correctness/corr-gemm.cpp:180: Failure
Value of: err
  Actual: -54
Expected: 0
::clMath::clblas::GEMM() failed
[  FAILED  ] ColumnMajor_SmallRange/GEMM.cgemm/31, where GetParam() = (1, 1, 0, 128, 128, 128, 48-byte object <00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>, 1) (97 ms)
[----------] 1 test from ColumnMajor_SmallRange/GEMM (97 ms total)

Segmentation fault of sample code in readme.md on mac

When I try to run the sample code from readme.md on my Mac, it crashes with a segmentation fault inside the function clblasSgemm.

clBLAS does not build with cmake < 2.8

When using cmake < 2.8, this is the error we see.

CMake Error at library/CMakeLists.txt:282 (include):
  include could not find load file:

    ExternalProject


CMake Error at library/CMakeLists.txt:283 (ExternalProject_Add):
  Unknown CMake command "ExternalProject_Add".

The build should either check the CMake version, or we should provide an alternative means to build this particular section.

Disabling 'build client' in develop build causes build to fail

to replicate:

clone develop branch, on ubuntu 14.04 64-bit
run ccmake
disable everything except shared library and print malloc errors
press 'c' and 'g' to generate
=> error about trying to install non-existent target Library

Tests fail on NVIDIA

Several GEMM tests fail on NVIDIA OpenCL with a mismatch in m: 0 n: 0

Subsequently, the test routine terminates while calling the reference GEMM with the following error:

"Calling reference xGEMM routine... unknown file: error: SEH exception with code 0xc0000094 thrown in the test body.
[ FAILED ] ColumnMajor_SmallRange/GEMM.dgemm/0, where GetParam() = (1, 0, 0, 63, 63, 63, 48-byte object <00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>, 1) (1 ms)"

Tested on Windows 7 (64-bit) with NVIDIA GTX 680 and CUDA 6.5

Segmentation fault of Zaxpy on 32-bit Windows

I am trying to use clAmdBlas 1.10 package on Windows 7 (64 bit) notebook with Nvidia GTX540M. I am using MinGW (with gcc 4.8.1) to compile the programs both in native (64-bit) and in 32-bit mode. I managed to link my program against clAmdBlas and run it in 64-bit, but had issues in 32-bit mode. After investigating it boiled down to double_complex functions.

To test it, I took the example example_saxpy.c from the package and adapted it to the other three precisions (D, C, and Z). Then I compiled and tested all four (with -m32). Compilation was fine for all four. All ran fine except Zaxpy, which produced a segmentation fault when calling clAmdBlasZaxpy. I attach the corresponding test at the end.

I have also tested the same executables (without recompiling) on another Windows7 computer with AMD7950 GPU with the same conclusions.

As additional information, in my code I called several complex functions: Zaxpy and Zscal failed, while Zdotu and Dznrm2 worked fine. However, I haven't tested them in details as I did for Zaxpy.

I am reporting a problem for a rather old version of the package because I do not see an easy way to obtain a more recent DLL for Windows.

#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>   /* abs() */
#include <string.h>
#include <math.h>

/* Include CLBLAS header. It automatically includes needed OpenCL header,
 * so we can drop out explicit inclusion of cl.h header.
 */
#include <clAmdBlas.h>

/* This example uses predefined matrices and their characteristics for
 * simplicity purpose.
 */
static const size_t N = 2;
static const cl_double2 alpha = {{10,0}};
static cl_double2 X[] = {
    {{11,0}},
    {{21,0}},
};
static const int incx = 1;

static cl_double2 Y[] = {
    {{15,0}},
    {{11,0}},
};
static const int incy = 1;


static void
printResult(void)
{
    size_t i;
    printf("\nResult:\n");

    printf("Y\n");
    for (i = 0; i < N; i++) {
             printf("\t%f + %fi\n",Y[i].s[0],Y[i].s[1]);
    }
}

int
main(void)
{
    cl_int err;
    cl_platform_id platform = 0;
    cl_device_id device = 0;
    cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
    cl_context ctx = 0;
    cl_command_queue queue = 0;
    cl_mem bufX, bufY;
    cl_event event = NULL;
    int ret = 0;
    int lenX = 1 + (N-1)*abs(incx);
    int lenY = 1 + (N-1)*abs(incy);

    /* Setup OpenCL environment. */
    err = clGetPlatformIDs(1, &platform, NULL);
    if (err != CL_SUCCESS) {
        printf( "clGetPlatformIDs() failed with %d\n", err );
        return 1;
    }

    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    if (err != CL_SUCCESS) {
        printf( "clGetDeviceIDs() failed with %d\n", err );
        return 1;
    }

    props[1] = (cl_context_properties)platform;
    ctx = clCreateContext(props, 1, &device, NULL, NULL, &err);
    if (err != CL_SUCCESS) {
        printf( "clCreateContext() failed with %d\n", err );
        return 1;
    }

    queue = clCreateCommandQueue(ctx, device, 0, &err);
    if (err != CL_SUCCESS) {
        printf( "clCreateCommandQueue() failed with %d\n", err );
        clReleaseContext(ctx);
        return 1;
    }

    /* Setup clAmdBlas. */
    err = clAmdBlasSetup();
    if (err != CL_SUCCESS) {
        printf("clAmdBlasSetup() failed with %d\n", err);
        clReleaseCommandQueue(queue);
        clReleaseContext(ctx);
        return 1;
    }

    /* Prepare OpenCL memory objects and place vectors inside them. */
    bufX = clCreateBuffer(ctx, CL_MEM_READ_ONLY, (lenX*sizeof(cl_double2)), NULL, &err);
    bufY = clCreateBuffer(ctx, CL_MEM_READ_WRITE, (lenY*sizeof(cl_double2)), NULL, &err);

    err = clEnqueueWriteBuffer(queue, bufX, CL_TRUE, 0, (lenX*sizeof(cl_double2)), X, 0, NULL, NULL);
    err = clEnqueueWriteBuffer(queue, bufY, CL_TRUE, 0, (lenY*sizeof(cl_double2)), Y, 0, NULL, NULL);

    /* Call clAmdBlas function. */
    err = clAmdBlasZaxpy( N, alpha, bufX, 0, incx, bufY, 0, incy, 1, &queue, 0, NULL, &event); 
    if (err != CL_SUCCESS) {
        printf("clAmdBlasZaxpy() failed with %d\n", err);
        ret = 1;
    }
    else {
        /* Wait for calculations to be finished. */
        err = clWaitForEvents(1, &event);

        /* Fetch results of calculations from GPU memory. */
        err = clEnqueueReadBuffer(queue, bufY, CL_TRUE, 0, (lenY*sizeof(cl_double2)),
                                    Y, 0, NULL, NULL);

        /* At this point you will get the result of ZAXPY placed in vector Y. */
        printResult();
    }

    /* Release OpenCL memory objects. */
    clReleaseMemObject(bufY);
    clReleaseMemObject(bufX);

    /* Finalize work with clAmdBlas. */
    clAmdBlasTeardown();

    /* Release OpenCL working objects. */
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);

    return ret;
}
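To check what the reproducer above should print when it does not crash, a host-side reference can be handy. The sketch below is not part of clAmdBlas: zdouble is a hypothetical stand-in for cl_double2 so that the code builds without OpenCL headers, zaxpy_ref is a made-up name, and only positive increments are handled.

```c
#include <stddef.h>

/* Hypothetical stand-in for cl_double2 (real part, imaginary part). */
typedef struct { double re, im; } zdouble;

/* Reference ZAXPY: y[i] = alpha * x[i] + y[i] for n complex elements.
 * Assumes strictly positive increments incx and incy. */
static void zaxpy_ref(size_t n, zdouble alpha,
                      const zdouble *x, size_t incx,
                      zdouble *y, size_t incy)
{
    for (size_t i = 0; i < n; i++) {
        const zdouble xi = x[i * incx];
        zdouble *yi = &y[i * incy];

        /* Complex multiply-add: (a + bi)(c + di) = (ac - bd) + (ad + bc)i */
        yi->re += alpha.re * xi.re - alpha.im * xi.im;
        yi->im += alpha.re * xi.im + alpha.im * xi.re;
    }
}
```

For the data in the reproducer (alpha = 10, X = 11, 21 and Y = 15, 11, all with zero imaginary parts), this yields Y = 125 and 221, again with zero imaginary parts.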

Getting the generated kernels

Is there a way to see what a generated kernel for a certain BLAS operation looks like? I am looking at how to use clBLAS kernels in another library, but cannot follow the generation process in clBLAS. Is there something like "dump the generated kernels to a string"?

Port clBLAS to MacOS!

Presumably AccelerEyes will use clMath in ArrayFire for Mac OS, so can we expect that they will share their port of clMath to Mac OS here on GitHub?

CL build error when running the tune tool

I ran the following command
tune.exe --gemm --double --store-kernels

and got the following output :

__attribute__((reqd_work_group_size(32, 2, 1)))
void __kernel
dgemmBlock(
uint M,
uint N,
uint K,
const double alpha,
const double beta,
const __global *restrict A,
const __global *restrict B,
__global *C,
uint lda,
uint ldb,
uint ldc)
{
double2 a0;
double2 b0, b1;
double2 c0, c1, c2, c3, c4, c5, c6, c7;
uint4 coord = 0u; /* contains coordB, coordA, k */

uint kif;
uint get_group_id_0;
uint get_global_id_0;
A += 8u * (uint)get_global_id(1) * lda;
get_group_id_0 = (get_group_id(0) + get_group_id(1))% get_num_groups(0);
get_global_id_0 = get_group_id_0 * get_local_size(0) + get_local_id(0);
kif = (N % 256 != 0);
get_global_id_0 = (kif*(uint)get_global_id(0)) + ((1-kif)*get_global_id_0);
B += get_global_id_0 * 2;
coord.y = 8u * (uint)get_global_id(1);
coord.x = 2u * (uint)get_global_id_0;
c0 = 0;
c1 = 0;
c2 = 0;
c3 = 0;
c4 = 0;
c5 = 0;
c6 = 0;
c7 = 0;

for (uint k1 = 0; k1 < K; k1 += 2) {
    /* -- Tiles multiplier -- */
    b0 = B[0];
    b0 = B[1];
    b1 = B[ldb];
    b1 = B[ldb + 1];

    a0 = A[0];
    a0 = A[1];

    c0 = mad(b0, a0.s0, c0);
    c0 = mad(b1, a0.s1, c0);

    a0 = A[lda];
    a0 = A[lda + 1];

    c1 = mad(b0, a0.s0, c1);
    c1 = mad(b1, a0.s1, c1);

    a0 = A[(lda << 1)];
    a0 = A[mad24(2u, lda, 1u)];

    c2 = mad(b0, a0.s0, c2);
    c2 = mad(b1, a0.s1, c2);

    a0 = A[mad24(3u, lda, 0u)];
    a0 = A[mad24(3u, lda, 1u)];

    c3 = mad(b0, a0.s0, c3);
    c3 = mad(b1, a0.s1, c3);

    a0 = A[(lda << 2)];
    a0 = A[mad24(4u, lda, 1u)];

    c4 = mad(b0, a0.s0, c4);
    c4 = mad(b1, a0.s1, c4);

    a0 = A[mad24(5u, lda, 0u)];
    a0 = A[mad24(5u, lda, 1u)];

    c5 = mad(b0, a0.s0, c5);
    c5 = mad(b1, a0.s1, c5);

    a0 = A[mad24(6u, lda, 0u)];
    a0 = A[mad24(6u, lda, 1u)];

    c6 = mad(b0, a0.s0, c6);
    c6 = mad(b1, a0.s1, c6);

    a0 = A[mad24(7u, lda, 0u)];
    a0 = A[mad24(7u, lda, 1u)];

    c7 = mad(b0, a0.s0, c7);
    c7 = mad(b1, a0.s1, c7);

    A += 2;
    B += (ldb << 1);
    /* ---------------------- */
}


GPtr uC;

uC.d = C + coord.y * ldc + coord.x;

__global  *pC = uC.d0v;

double2 tempC0, tempC1;

tempC0 = pC[0];
tempC0 = pC[1];
tempC1 = pC[ldc];
tempC1 = pC[ldc + 1];
tempC0 = mad(tempC0, beta, 0);
tempC1 = mad(tempC1, beta, 0);
tempC0 = mad(c0, alpha, tempC0);
tempC1 = mad(c1, alpha, tempC1);
pC[0] = tempC0;
pC[1] = tempC0;
pC[ldc] = tempC1;
pC[ldc + 1] = tempC1;

tempC0 = pC[(ldc << 1)];
tempC0 = pC[mad24(2u, ldc, 1u)];
tempC1 = pC[mad24(3u, ldc, 0u)];
tempC1 = pC[mad24(3u, ldc, 1u)];
tempC0 = mad(tempC0, beta, 0);
tempC1 = mad(tempC1, beta, 0);
tempC0 = mad(c2, alpha, tempC0);
tempC1 = mad(c3, alpha, tempC1);
pC[(ldc << 1)] = tempC0;
pC[mad24(2u, ldc, 1u)] = tempC0;
pC[mad24(3u, ldc, 0u)] = tempC1;
pC[mad24(3u, ldc, 1u)] = tempC1;

tempC0 = pC[(ldc << 2)];
tempC0 = pC[mad24(4u, ldc, 1u)];
tempC1 = pC[mad24(5u, ldc, 0u)];
tempC1 = pC[mad24(5u, ldc, 1u)];
tempC0 = mad(tempC0, beta, 0);
tempC1 = mad(tempC1, beta, 0);
tempC0 = mad(c4, alpha, tempC0);
tempC1 = mad(c5, alpha, tempC1);
pC[(ldc << 2)] = tempC0;
pC[mad24(4u, ldc, 1u)] = tempC0;
pC[mad24(5u, ldc, 0u)] = tempC1;
pC[mad24(5u, ldc, 1u)] = tempC1;

tempC0 = pC[mad24(6u, ldc, 0u)];
tempC0 = pC[mad24(6u, ldc, 1u)];
tempC1 = pC[mad24(7u, ldc, 0u)];
tempC1 = pC[mad24(7u, ldc, 1u)];
tempC0 = mad(tempC0, beta, 0);
tempC1 = mad(tempC1, beta, 0);
tempC0 = mad(c6, alpha, tempC0);
tempC1 = mad(c7, alpha, tempC1);
pC[mad24(6u, ldc, 0u)] = tempC0;
pC[mad24(6u, ldc, 1u)] = tempC0;
pC[mad24(7u, ldc, 0u)] = tempC1;
pC[mad24(7u, ldc, 1u)] = tempC1;

}

Build log:

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 2: warning: OpenCL
extension is now part of core
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 54: warning: explicit
type is missing ("int" assumed)
const __global *restrict A,
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 55: warning: explicit
type is missing ("int" assumed)
const __global *restrict B,
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 56: warning: explicit
type is missing ("int" assumed)
__global *C,
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 149: warning: a value
of type "__global int *" cannot be assigned to an entity of type
"__global double *"
uC.d = C + coord.y * ldc + coord.x;
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 151: warning:
explicit type is missing ("int" assumed)
__global *pC = uC.d0v;
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 151: error: union
"GPtr" has no field "d0v"
__global *pC = uC.d0v;
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 163: error: a value
of type "double2" cannot be assigned to an entity of type "int"
pC[0] = tempC0;
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 164: error: a value
of type "double2" cannot be assigned to an entity of type "int"
pC[1] = tempC0;
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 165: error: a value
of type "double2" cannot be assigned to an entity of type "int"
pC[ldc] = tempC1;
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 166: error: a value
of type "double2" cannot be assigned to an entity of type "int"
pC[ldc + 1] = tempC1;
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 176: error: a value
of type "double2" cannot be assigned to an entity of type "int"
pC[(ldc << 1)] = tempC0;
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 177: error: a value
of type "double2" cannot be assigned to an entity of type "int"
pC[mad24(2u, ldc, 1u)] = tempC0;
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 178: error: a value
of type "double2" cannot be assigned to an entity of type "int"
pC[mad24(3u, ldc, 0u)] = tempC1;
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 179: error: a value
of type "double2" cannot be assigned to an entity of type "int"
pC[mad24(3u, ldc, 1u)] = tempC1;
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 189: error: a value
of type "double2" cannot be assigned to an entity of type "int"
pC[(ldc << 2)] = tempC0;
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 190: error: a value
of type "double2" cannot be assigned to an entity of type "int"
pC[mad24(4u, ldc, 1u)] = tempC0;
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 191: error: a value
of type "double2" cannot be assigned to an entity of type "int"
pC[mad24(5u, ldc, 0u)] = tempC1;
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 192: error: a value
of type "double2" cannot be assigned to an entity of type "int"
pC[mad24(5u, ldc, 1u)] = tempC1;
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 202: error: a value
of type "double2" cannot be assigned to an entity of type "int"
pC[mad24(6u, ldc, 0u)] = tempC0;
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 203: error: a value
of type "double2" cannot be assigned to an entity of type "int"
pC[mad24(6u, ldc, 1u)] = tempC0;
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 204: error: a value
of type "double2" cannot be assigned to an entity of type "int"
pC[mad24(7u, ldc, 0u)] = tempC1;
^

"C:\Users\DELL2\AppData\Local\Temp\OCLD77D.tmp.cl", line 205: error: a value
of type "double2" cannot be assigned to an entity of type "int"
pC[mad24(7u, ldc, 1u)] = tempC1;
^

17 errors detected in the compilation of "C:\Users\DELL2\AppData\Local\Temp\OCLD
77D.tmp.cl".

Frontend phase failed compilation.

I don't get this output if I don't use the --store-kernels option,
but I guess the same problem exists.

clkern.c::launchClKernel()::clWaitForEvents() never called

Hi,
In solution_seq.c::enqueueKernel(), lines 186-188, nowait is set to 1 and needExecTime to 0.
In clkern.c::launchClKernel(), lines 104-120, there are two if-conditions that lead to a call to clWaitForEvents().

The problem is that, with the (nowait, needExecTime) values set above, execution does not seem to
enter either of these two conditions, hence the event is never waited on, which may cause a race condition.

status = clEnqueueNDRangeKernel(...);
if ((status == CL_SUCCESS) && !kernDesc->nowait) {
       status = clWaitForEvents(1, kernDesc->event);
}
...
if ((status == CL_SUCCESS) && kernDesc->needExecTime && kernDesc->event) {
       if (kernDesc->nowait) {
            status = clWaitForEvents(1, kernDesc->event);
             ...
       }
}
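The two conditions quoted above can be modelled with plain booleans to see which flag combinations ever reach clWaitForEvents(). This is only a sketch of the control flow as quoted: struct wait_flags and count_waits are hypothetical names, the elided parts of the original are ignored, and status is assumed to remain CL_SUCCESS throughout.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the descriptor fields used in the snippet. */
struct wait_flags {
    bool nowait;          /* kernDesc->nowait */
    bool need_exec_time;  /* kernDesc->needExecTime */
    bool has_event;       /* kernDesc->event != NULL */
};

/* Count how many times the quoted code path would call clWaitForEvents()
 * after a successful enqueue. */
static int count_waits(struct wait_flags f)
{
    int waits = 0;

    /* First condition: blocking mode waits right after the enqueue. */
    if (!f.nowait)
        waits++;

    /* Second condition: profiling waits only if nothing has waited yet. */
    if (f.need_exec_time && f.has_event) {
        if (f.nowait)
            waits++;
    }
    return waits;
}
```

For nowait = 1 and needExecTime = 0 the function returns 0, which matches the observation in this report; every other combination with a valid event returns 1.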

Furthermore, I don't quite understand the semantics of 'nowait' here: when we run kernels in sequence,
don't they sometimes (even often) need to be synchronized to update results?

Regards
Quan

Compiled libraries for Mac

It would be nice if you can provide compiled libraries for the Mac platform.

make fails at linking

I'm trying to compile clBLAS on Debian, but it fails at linking:

https://gist.github.com/Alessandro-Barbieri/9a808ac4d2f65eb3402f#file-gistfile1-txt

Problems when working with OSX + multiple GPUs on same context

This is related to:
arrayfire/arrayfire#338

The stand alone code can be found here:
arrayfire/arrayfire#338 (comment)

The issue is not reproducible on Windows (as tested by @TimmyLiu) or Linux. It is also not reproducible on OSX machines with just one GPU. The problem only occurs when the context is created to handle both GPUs (NVIDIA and Intel).

This looks like an OSX-specific problem. We are trying to implement a workaround in ArrayFire that uses only one device per context.

How can I build libclBLAS.so for OSX?

Hi again,

I'm new to the world of cmake and building source code so I'm looking for some help with basic information. If you'd prefer to refer me to materials that I should read I'd be happy to do so :)

Ultimately my goal is to write a Ruby FFI into clBLAS and to get started I just need a libclBLAS.so file (like the ones that are available for Linux and Windows on the releases page).

It seems at the moment that I'll need to build it myself, and I'm not sure I'm even starting in the right place. I pointed cmake at the src/ directory and, as described in the wiki, it generated a bunch of cmake files. But that's as far as I'm getting 😊

Reading through the wiki page on this, it seems that I need to install a few dependencies before I can move on. Mostly they are for testing and unfortunately some of them (for example the AMD APP SDK) only have Windows/Linux releases.

How can I go about getting ahold of a clBLAS binary for Mac for use with Ruby's FFI library?

crash when running the tune tool

OS : tested on win7 64 bits
Card : tested on W9000 (tahiti asic)
driver : tested on 12.104.2 and 13.15

I ran the command tune.exe --gemm --double --store-kernels
It crashes or triggers a TDR in the middle of the tuning process.

blas 3 matrix multiplication kernel build error when called from ArrayFire

I am indirectly calling the clBLAS gemm function from ArrayFire. I found that ArrayFire's matmul function crashes when it calls clBLAS's gemmS function. So, I activated the build error log in clBLAS (https://github.com/clMathLibraries/clBLAS/blob/54e949e2db8ee49a170b1bbc829ceed07f16533f/src/library/blas/generic/common.c) and noticed that there is a build error in clBLAS.

When I run the matrix multiplication example in clBLAS repo directly (https://github.com/clMathLibraries/clBLAS/blob/54e949e2db8ee49a170b1bbc829ceed07f16533f/src/samples/example_sgemm.c), everything works just fine.

What could be the issue here? (I am posting this as an ArrayFire issue as well.)

========================================================

AN INTERNAL KERNEL BUILD ERROR OCCURRED!
device name = HD Graphics 4000
error = 0
memory pattern = Cached global memory based block gemm, computing kernel generator
Subproblem dimensions: dims[0].itemY = 8, dims[0].itemX = 8, dims[0].y = 8, dims[0].x = 8, dims[0].bwidth = 1; ; dims[1].itemY = 1, dims[1].itemX = 1, dims[1].y = 1, dims[1].x = 1, dims[1].bwidth = 1; ; 
Parallelism granularity: pgran->wgDim = 2, pgran->wgSize[0] = 8, pgran->wgSize[1] = 8, pgran->wfSize = 64
Kernel extra flags: 31508
Source:

typedef union GPtr {
    __global float *f;
    __global float2 *f2v;
    __global float4 *f4v;
    __global float8 *f8v;
    __global float16 *f16v;
} GPtr;

typedef union LPtr {
    __local float *f;
    __local float2 *f2v;
    __local float4 *f4v;
    __local float8 *f8v;
    __local float16 *f16v;
} LPtr;

typedef union PPtr {
    float *f;
    float2 *f2v;
    float4 *f4v;
    float8 *f8v;
    float16 *f16v;
} PPtr;

__attribute__((reqd_work_group_size(8, 8, 1)))
void __kernel
sgemmBlock(
    uint M,
    uint N,
    uint K,
    const float alpha,
    const float beta,
    const __global float *restrict A,
    const __global float *restrict B,
    __global float *C,
    uint lda,
    uint ldb,
    uint ldc)
{
    float a0;
    float b0;
    float c0;
    uint4 coord = 0u; /* contains coordB, coordA, k */

    uint kif;
    uint get_group_id_1;
    uint get_global_id_1;
    A += (uint)get_global_id(0);
    get_group_id_1 = (get_group_id(0) + get_group_id(1))% get_num_groups(1);
    get_global_id_1 = get_group_id_1 * get_local_size(1) + get_local_id(1);
    kif = (N % 512 != 0);
    get_global_id_1 = (kif*(uint)get_global_id(1)) + ((1-kif)*get_global_id_1);
    B += get_global_id_1;
    coord.y = 1u * (uint)get_global_id(0);
    coord.x = 1u * (uint)get_global_id_1;
    if ((coord.y >= M) || (coord.x >= N)) {
        return;
    }

    c0 = 0;

    for (uint k1 = 0; k1 < K; k1 += 1) {
        /* -- Tiles multiplier -- */
        b0 = B[0];

        a0 = A[0];

        c0 += a0 * b0;

        A += lda;
        B += ldb;
        /* ---------------------- */
    }


    GPtr uC;

    uC.f = C + coord.x * ldc + coord.y;

    __global float *pC = uC.f;

    float tempC0;

    tempC0 = c0 * alpha + 0;
    pC[0] = tempC0;
}



--------------------------------------------------------

Build log:


========================================================

The program has unexpectedly finished.

What if maxWorkGroupSize < 64

Hello,
I'm working on porting clBLAS to my company's accelerator. Our OpenCL library only supports a maximum of
16 work-items per work-group, so I fall into the unimplemented case in the kernel generator (solution_seq_make.cpp:getDefaultStepGranulation()) where maxWorkGroupSize < 64 is not supported. I would like to implement this, but I don't know how to do it or what algorithm is behind the scenes. Can anyone help or explain it to me? Thanks in advance.
Quan
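As a starting point for discussion only, one simple way to derive a 2D work-group shape for a small device limit is to start from a preferred shape (8 x 8 is the one seen in the dumps on this page) and halve the larger dimension until the product fits. This is not how clBLAS decides its granulation; fit_work_group is a hypothetical helper, and the per-item tile sizes that the generator pairs with the work-group shape would still have to be adjusted separately.

```c
#include <stddef.h>

/*
 * Shrink a preferred 2D work-group shape in place until it holds at most
 * max_group work-items, always halving the currently larger dimension so
 * that the shape stays as square as possible.
 * Hypothetical helper for illustration only.
 */
static void fit_work_group(size_t wg[2], size_t max_group)
{
    if (max_group == 0)
        max_group = 1;
    if (wg[0] == 0)
        wg[0] = 1;
    if (wg[1] == 0)
        wg[1] = 1;

    while (wg[0] * wg[1] > max_group) {
        size_t d = (wg[0] >= wg[1]) ? 0 : 1;  /* index of larger dimension */
        wg[d] = (wg[d] + 1) / 2;              /* halve, rounding up */
    }
}
```

Starting from 8 x 8 with a limit of 16 work-items, this ends at 4 x 4; with a limit of 64 or more the shape is left unchanged.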

Tests fail on many Level-3 functions with Intel OpenCL runtime

Every GEMM correctness test I've been able to run gets bad results immediately at index (0, 0). The error is very large, so it is likely the result of triggering some kind of undefined behavior. SYMM, SYRK, TRMM, and TRSM also seem to fail (although not always).

Has anyone else tried running the tests on the Intel runtime?

Travis CI build fail due to OpenCL 1.2 header requirement

Travis CI seems to provide OpenCL 1.2 headers only. The recently pushed code requires OpenCL 1.2 or later header files.

build fails on Mac OSX 10.10.3, clang 6.1.0.6020049

Can't get a freshly cloned instance to build. The cmake, makefile outputs are here

The problem is unresolved symbols:

ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [staging/test-correctness] Error 1
make[1]: *** [tests/CMakeFiles/test-correctness.dir/all] Error 2
make: *** [all] Error 2

Am I not invoking cmake properly for OSX? The same error occurs whether or not I run cmake with -G "Unix Makefiles".

[OSX] crash in clGetProgramInfo called by fullKernelSize

Using the vanilla example program (sgemm) as a test case, I get a crash on OSX (10.9) inside fullKernelSize(). Maybe this is a bug in the OpenCL implementation on OSX (similar to [1]), because it chokes on a strlen called with NULL inside clGetProgramInfo(). Commenting out that line makes the test program work.
Stacktrace is attached.

[1] http://www.mail-archive.com/[email protected]/msg00414.html

* thread #1: tid = 0x5463cc, 0x00007fff908c0812 libsystem_c.dylib`strlen + 18, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x00007fff908c0812 libsystem_c.dylib`strlen + 18
    frame #1: 0x00007fff85134c60 OpenCL`clGetProgramInfo + 625
    frame #2: 0x00000001000075cc libclBLAS.2.dylib`fullKernelSize(kern=0x000000010054ed60) + 236 at kern_cache.c:428
    frame #3: 0x00000001000072b6 libclBLAS.2.dylib`addKernelToCache(kcache=0x00000001002177f0, sid=18, kern=0x000000010054ed60, key=0x00007fff5fbff000, extraCmp=0x000000010003c5e0) + 198 at kern_cache.c:311
    frame #4: 0x0000000100038c73 libclBLAS.2.dylib`makeSolutionSeq(funcID=CLBLAS_GEMM, args=0x00007fff5fbff4d0, numCommandQueues=1, commandQueues=0x00007fff5fbff830, numEventsInWaitList=0, eventWaitList=0x0000000000000000, events=0x00007fff5fbff810, seq=0x00007fff5fbff2d8) + 2915 at solution_seq_make.c:599
    frame #5: 0x000000010001503e libclBLAS.2.dylib`doGemm(kargs=0x00007fff5fbff4d0, order=clblasRowMajor, transA=clblasNoTrans, transB=clblasNoTrans, M=3, N=2, K=4, A=0x000000010021a190, offA=6, lda=5, B=0x000000010021a410, offB=4, ldb=3, C=0x000000010021a690, offC=4, ldc=3, numCommandQueues=1, commandQueues=0x00007fff5fbff830, numEventsInWaitList=0, eventWaitList=0x0000000000000000, events=0x00007fff5fbff810) + 1118 at xgemm.c:102
    frame #6: 0x0000000100014bc8 libclBLAS.2.dylib`clblasSgemm(order=clblasRowMajor, transA=clblasNoTrans, transB=clblasNoTrans, M=3, N=2, K=4, alpha=10, A=0x000000010021a190, offA=6, lda=5, B=0x000000010021a410, offB=4, ldb=3, beta=20, C=0x000000010021a690, offC=4, ldc=3, numCommandQueues=1, commandQueues=0x00007fff5fbff830, numEventsInWaitList=0, eventWaitList=0x0000000000000000, events=0x00007fff5fbff810) + 680 at xgemm.c:145
    frame #7: 0x0000000100001aaf clblastest`main + 1295 at main.cpp:156
    frame #8: 0x00007fff908765fd libdyld.dylib`start + 1

clBLAS on Mac OS X (Yosemite) with gcc 4.9.2 (from MacPorts)

[ 0%] Creating directories for 'tplgen'
[ 1%] Performing download step (DIR copy) for 'tplgen'
[ 2%] No patch step for 'tplgen'
[ 2%] No update step for 'tplgen'
[ 2%] Performing configure step for 'tplgen'
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/admin_naths/Binaries/clBlas/library/tplgen-prefix/src/tplgen-build
[ 3%] Performing build step for 'tplgen'
Scanning dependencies of target tplgen
[100%] Building CXX object CMakeFiles/tplgen.dir/tplgen.cpp.o
Linking CXX executable tplgen
[100%] Built target tplgen
[ 4%] No install step for 'tplgen'
[ 4%] Completed 'tplgen'
[ 4%] Built target tplgen
TPLGEN Running.....
[ 4%] Built target GENERATE_CLT
[ 4%] Building C object library/CMakeFiles/clBLAS.dir/common/list.c.o
[ 4%] Building C object library/CMakeFiles/clBLAS.dir/common/clkern.c.o
[ 5%] Building C object library/CMakeFiles/clBLAS.dir/common/kern_cache.c.o
[ 5%] Building C object library/CMakeFiles/clBLAS.dir/common/kerngen_core.c.o
[ 6%] Building C object library/CMakeFiles/clBLAS.dir/common/kgen_basic.c.o
[ 6%] Building C object library/CMakeFiles/clBLAS.dir/common/kgen_loop_helper.c.o
[ 7%] Building C object library/CMakeFiles/clBLAS.dir/common/kgen_guard.c.o
[ 7%] Building C object library/CMakeFiles/clBLAS.dir/common/misc.c.o
[ 8%] Building C object library/CMakeFiles/clBLAS.dir/common/devinfo.c.o
[ 8%] Building C object library/CMakeFiles/clBLAS.dir/common/devinfo-cache.c.o
/Users/admin_naths/Sources/clBLAS/src/library/common/devinfo-cache.c:299:17: warning:
'clCreateImage2D' is deprecated: first deprecated in OS X 10.8
[-Wdeprecated-declarations]
imgIn = clCreateImage2D(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
^
/System/Library/Frameworks/OpenCL.framework/Headers/cl.h:1170:1: note:
'clCreateImage2D' has been explicitly marked deprecated here
clCreateImage2D(cl_context /* context */,
^
/Users/admin_naths/Sources/clBLAS/src/library/common/devinfo-cache.c:670:17: warning:
'clCreateImage2D' is deprecated: first deprecated in OS X 10.8
[-Wdeprecated-declarations]
imgIn = clCreateImage2D(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
^
/System/Library/Frameworks/OpenCL.framework/Headers/cl.h:1170:1: note:
'clCreateImage2D' has been explicitly marked deprecated here
clCreateImage2D(cl_context /* context */,
^
2 warnings generated.
[ 8%] Building C object library/CMakeFiles/clBLAS.dir/common/mutex.c.o
[ 9%] Building C object library/CMakeFiles/clBLAS.dir/common/trace_malloc.c.o
[ 9%] Building C object library/CMakeFiles/clBLAS.dir/common/gens/dblock_kgen.c.o
[ 10%] Building C object library/CMakeFiles/clBLAS.dir/blas/init.c.o
[ 10%] Building C object library/CMakeFiles/clBLAS.dir/blas/impl.c.o
[ 11%] Building C object library/CMakeFiles/clBLAS.dir/blas/scimage.c.o
/Users/admin_naths/Sources/clBLAS/src/library/blas/scimage.c:248:13: warning:
'clCreateImage2D' is deprecated: first deprecated in OS X 10.8
[-Wdeprecated-declarations]
image = clCreateImage2D(context, CL_MEM_READ_WRITE, &IMAGE_FORMAT,
^
/System/Library/Frameworks/OpenCL.framework/Headers/cl.h:1170:1: note:
'clCreateImage2D' has been explicitly marked deprecated here
clCreateImage2D(cl_context /* context */,
^
1 warning generated.
[ 11%] Building C object library/CMakeFiles/clBLAS.dir/blas/xgemv.c.o
[ 12%] Building C object library/CMakeFiles/clBLAS.dir/blas/xsymv.c.o
[ 12%] Building C object library/CMakeFiles/clBLAS.dir/blas/xgemm.c.o
[ 12%] Building C object library/CMakeFiles/clBLAS.dir/blas/xtrmm.c.o
[ 13%] Building C object library/CMakeFiles/clBLAS.dir/blas/xtrsm.c.o
[ 13%] Building C object library/CMakeFiles/clBLAS.dir/blas/xsyrk.c.o
[ 14%] Building C object library/CMakeFiles/clBLAS.dir/blas/xsyr2k.c.o
[ 14%] Building C object library/CMakeFiles/clBLAS.dir/blas/xtrmv.c.o
[ 15%] Building C object library/CMakeFiles/clBLAS.dir/blas/xtrsv.c.o
[ 15%] Building C object library/CMakeFiles/clBLAS.dir/blas/xsymm.c.o
/Users/admin_naths/Sources/clBLAS/src/library/blas/xsymm.c:238:16: warning:
enumeration value 'TYPE_UNSIGNED_INT' not handled in switch [-Wswitch]
switch(kargs->dtype)
^
1 warning generated.
[ 16%] Building C object library/CMakeFiles/clBLAS.dir/blas/xgemm2.c.o
[ 16%] Building C object library/CMakeFiles/clBLAS.dir/blas/xger.c.o
[ 16%] Building C object library/CMakeFiles/clBLAS.dir/blas/xsyr.c.o
[ 17%] Building C object library/CMakeFiles/clBLAS.dir/blas/xsyr2.c.o
[ 17%] Building C object library/CMakeFiles/clBLAS.dir/blas/xher.c.o
[ 18%] Building C object library/CMakeFiles/clBLAS.dir/blas/xher2.c.o
[ 18%] Building C object library/CMakeFiles/clBLAS.dir/blas/xhemv.c.o
[ 19%] Building C object library/CMakeFiles/clBLAS.dir/blas/xhemm.c.o
[ 19%] Building C object library/CMakeFiles/clBLAS.dir/blas/xherk.c.o
[ 20%] Building C object library/CMakeFiles/clBLAS.dir/blas/xhpmv.c.o
[ 20%] Building C object library/CMakeFiles/clBLAS.dir/blas/xspmv.c.o
[ 20%] Building C object library/CMakeFiles/clBLAS.dir/blas/xgbmv.c.o
[ 21%] Building C object library/CMakeFiles/clBLAS.dir/blas/xtbmv.c.o
[ 21%] Building C object library/CMakeFiles/clBLAS.dir/blas/xshbmv.c.o
[ 22%] Building C object library/CMakeFiles/clBLAS.dir/blas/xtbsv.c.o
[ 22%] Building C object library/CMakeFiles/clBLAS.dir/blas/xher2k.c.o
[ 23%] Building C object library/CMakeFiles/clBLAS.dir/blas/xswap.c.o
[ 23%] Building C object library/CMakeFiles/clBLAS.dir/blas/xscal.c.o
[ 24%] Building C object library/CMakeFiles/clBLAS.dir/blas/xcopy.c.o
[ 24%] Building C object library/CMakeFiles/clBLAS.dir/blas/xaxpy.c.o
[ 24%] Building C object library/CMakeFiles/clBLAS.dir/blas/xdot.c.o
[ 25%] Building C object library/CMakeFiles/clBLAS.dir/blas/xrotg.c.o
[ 25%] Building C object library/CMakeFiles/clBLAS.dir/blas/xrotmg.c.o
[ 26%] Building C object library/CMakeFiles/clBLAS.dir/blas/xrot.c.o
[ 26%] Building C object library/CMakeFiles/clBLAS.dir/blas/xrotm.c.o
[ 27%] Building C object library/CMakeFiles/clBLAS.dir/blas/ixamax.c.o
[ 27%] Building C object library/CMakeFiles/clBLAS.dir/blas/xnrm2.c.o
[ 28%] Building C object library/CMakeFiles/clBLAS.dir/blas/xasum.c.o
[ 28%] Building C object library/CMakeFiles/clBLAS.dir/blas/generic/common.c.o
[ 28%] Building C object library/CMakeFiles/clBLAS.dir/blas/generic/blas_funcs.c.o
[ 29%] Building C object library/CMakeFiles/clBLAS.dir/blas/generic/events.c.o
[ 29%] Building C object library/CMakeFiles/clBLAS.dir/blas/generic/matrix_props.c.o
[ 30%] Building C object library/CMakeFiles/clBLAS.dir/blas/generic/matrix_dims.c.o
[ 30%] Building C object library/CMakeFiles/clBLAS.dir/blas/generic/kdump.c.o
[ 31%] Building C object library/CMakeFiles/clBLAS.dir/blas/generic/solution_assert.c.o
[ 31%] Building C object library/CMakeFiles/clBLAS.dir/blas/generic/solution_seq.c.o
[ 32%] Building C object library/CMakeFiles/clBLAS.dir/blas/generic/solution_seq_make.c.o
/Users/admin_naths/Sources/clBLAS/src/library/blas/generic/solution_seq_make.c:2033:13: warning:
enumeration value 'TYPE_UNSIGNED_INT' not handled in switch [-Wswitch]
switch (kargs->dtype) {
^
/Users/admin_naths/Sources/clBLAS/src/library/blas/generic/solution_seq_make.c:2319:13: warning:
enumeration value 'TYPE_UNSIGNED_INT' not handled in switch [-Wswitch]
switch (kargs->dtype) {
^
2 warnings generated.
[ 32%] Building C object library/CMakeFiles/clBLAS.dir/blas/generic/problem_iter.c.o
[ 32%] Building C object library/CMakeFiles/clBLAS.dir/blas/generic/kernel_extra.c.o
[ 33%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/gen_init.c.o
[ 33%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/blas_kgen.c.o
[ 34%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/blas_subgroup.c.o
/Users/admin_naths/Sources/clBLAS/src/library/blas/gens/blas_subgroup.c:89:12: warning:
enumeration value 'TYPE_UNSIGNED_INT' not handled in switch [-Wswitch]
switch(dtype){
^
/Users/admin_naths/Sources/clBLAS/src/library/blas/gens/blas_subgroup.c:194:12: warning:
enumeration value 'TYPE_UNSIGNED_INT' not handled in switch [-Wswitch]
switch(dtype){
^
2 warnings generated.
[ 34%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/gen_helper.c.o
[ 35%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/tilemul.c.o
[ 35%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/fetch.c.o
[ 36%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/tile.c.o
[ 36%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/tile_iter.c.o
[ 36%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/decomposition.c.o
[ 37%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/gemv.c.o
[ 37%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/symv.c.o
[ 38%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/gemm.c.o
[ 38%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/trmm.c.o
/Users/admin_naths/Sources/clBLAS/src/library/blas/gens/trmm.c:1248:5: warning:
cast to 'void *' from smaller integer type 'unsigned int'
[-Wint-to-void-pointer-cast]
(void*)subdimsNum;
^
/Users/admin_naths/Sources/clBLAS/src/library/blas/gens/trmm.c:1248:5: warning:
expression result unused; should this cast be to 'void'? [-Wunused-value]
(void*)subdimsNum;
^ ~
2 warnings generated.
[ 39%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/trsm.c.o
[ 39%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/syrxk.c.o
[ 40%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/trxm_common.c.o
[ 40%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/trsm_kgen.c.o
[ 40%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/xxmv_common.c.o
[ 41%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/legacy/blas_kgen_legacy.c.o
[ 41%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/legacy/gen_helper_legacy.c.o
[ 42%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/legacy/trxm_common_legacy.c.o
[ 42%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/legacy/trsm_kgen_legacy.c.o
[ 43%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/legacy/blkmul.c.o
[ 43%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/legacy/gemm_lds.c.o
[ 44%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/legacy/gemm_img.c.o
[ 44%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/legacy/trmm_lds.c.o
[ 44%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/legacy/trmm_img.c.o
[ 45%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/legacy/trsm_lds.c.o
[ 45%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/legacy/trsm_img.c.o
[ 46%] Building C object library/CMakeFiles/clBLAS.dir/blas/gens/legacy/trsm_cached_lds.c.o
[ 46%] Building CXX object library/CMakeFiles/clBLAS.dir/blas/gens/trmv_reg.cpp.o
In file included from /Users/admin_naths/Sources/clBLAS/src/library/blas/gens/trmv_reg.cpp:32:
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:20:37: warning: missing
terminating '"' character [-Winvalid-pp-token]
static const char *trmv_CU_kernel = "
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:20:37: error: expected
expression
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:114:3: warning: missing
terminating '"' character [-Winvalid-pp-token]
\n
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:116:1: warning: missing
terminating '"' character [-Winvalid-pp-token]
"
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:208:3: warning: missing
terminating '"' character [-Winvalid-pp-token]
\n
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:210:1: warning: missing
terminating '"' character [-Winvalid-pp-token]
"
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:302:3: warning: missing
terminating '"' character [-Winvalid-pp-token]
\n
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:304:1: warning: missing
terminating '"' character [-Winvalid-pp-token]
"
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:395:3: warning: missing
terminating '"' character [-Winvalid-pp-token]
\n
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:397:1: warning: missing
terminating '"' character [-Winvalid-pp-token]
"
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:489:2: warning: missing
terminating '"' character [-Winvalid-pp-token]
}";
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:495:37: warning: missing
terminating '"' character [-Winvalid-pp-token]
static const char *trmv_CL_kernel = "
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:590:3: warning: missing
terminating '"' character [-Winvalid-pp-token]
\n
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:592:1: warning: missing
terminating '"' character [-Winvalid-pp-token]
"
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:683:3: warning: missing
terminating '"' character [-Winvalid-pp-token]
\n
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:685:1: warning: missing
terminating '"' character [-Winvalid-pp-token]
"
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:776:3: warning: missing
terminating '"' character [-Winvalid-pp-token]
\n
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:778:1: warning: missing
terminating '"' character [-Winvalid-pp-token]
"
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:869:3: warning: missing
terminating '"' character [-Winvalid-pp-token]
\n
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:871:1: warning: missing
terminating '"' character [-Winvalid-pp-token]
"
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:956:2: warning: missing
terminating '"' character [-Winvalid-pp-token]
}";
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:968:38: warning: missing
terminating '"' character [-Winvalid-pp-token]
static const char *trmv_CLT_kernel = "
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:1062:3: warning: missing
terminating '"' character [-Winvalid-pp-token]
\n
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:1064:1: warning: missing
terminating '"' character [-Winvalid-pp-token]
"
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:1154:3: warning: missing
terminating '"' character [-Winvalid-pp-token]
\n
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:1156:1: warning: missing
terminating '"' character [-Winvalid-pp-token]
"
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:1248:3: warning: missing
terminating '"' character [-Winvalid-pp-token]
\n
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:1250:1: warning: missing
terminating '"' character [-Winvalid-pp-token]
"
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:1289:2: warning: missing
terminating '"' character [-Winvalid-pp-token]
}";
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:1297:38: warning: missing
terminating '"' character [-Winvalid-pp-token]
static const char *trmv_CUT_kernel = "
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:1390:3: warning: missing
terminating '"' character [-Winvalid-pp-token]
\n
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:1392:1: warning: missing
terminating '"' character [-Winvalid-pp-token]
"
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:1485:3: warning: missing
terminating '"' character [-Winvalid-pp-token]
\n
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:1487:1: warning: missing
terminating '"' character [-Winvalid-pp-token]
"
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:1580:3: warning: missing
terminating '"' character [-Winvalid-pp-token]
\n
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:1582:1: warning: missing
terminating '"' character [-Winvalid-pp-token]
"
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:1674:3: warning: missing
terminating '"' character [-Winvalid-pp-token]
\n
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:1676:1: warning: missing
terminating '"' character [-Winvalid-pp-token]
"
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:1728:2: warning: missing
terminating '"' character [-Winvalid-pp-token]
}";
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:1299:2: error: unterminated
conditional directive
#ifdef DOUBLE_PRECISION
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:970:2: error: unterminated
conditional directive
#ifdef DOUBLE_PRECISION
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:497:2: error: unterminated
conditional directive
#ifdef DOUBLE_PRECISION
^
/Users/admin_naths/Binaries/clBlas/include/trmv.clT:22:2: error: unterminated
conditional directive
#ifdef DOUBLE_PRECISION
^
/Users/admin_naths/Sources/clBLAS/src/library/blas/gens/trmv_reg.cpp:136:59: error:
member access into incomplete type 'const SolutionStep'
const CLBlasKargs *kargs = (const CLBlasKargs *)(&step->args);
^
/Users/admin_naths/Sources/clBLAS/src/library/blas/include/clblas-internal.h:39:8: note:
forward declaration of 'SolutionStep'
struct SolutionStep;
^
/Users/admin_naths/Sources/clBLAS/src/library/blas/gens/trmv_reg.cpp:144:14: error:
member access into incomplete type 'const SolutionStep'
if( (step->funcID == CLBLAS_HEMV) || (kargs->pigFuncID == CLBLAS_HPM...
^
/Users/admin_naths/Sources/clBLAS/src/library/blas/include/clblas-internal.h:39:8: note:
forward declaration of 'SolutionStep'
struct SolutionStep;
^
/Users/admin_naths/Sources/clBLAS/src/library/blas/gens/trmv_reg.cpp:330:38: error:
use of undeclared identifier 'trmv_CL_kernel'
(strcpy(tempTemplate, (char*)trmv_CL_ker...
^
/Users/admin_naths/Sources/clBLAS/src/library/blas/gens/trmv_reg.cpp:346:38: error:
use of undeclared identifier 'trmv_CLT_kernel'
(strcpy(tempTemplate, (char*)trmv_CLT_ke...
^
/Users/admin_naths/Sources/clBLAS/src/library/blas/gens/trmv_reg.cpp:346:87: error:
use of undeclared identifier 'trmv_CUT_kernel'
...(char*)trmv_CLT_kernel)) : (strcpy(tempTemplate, (char*)trmv_CUT_kernel));
^
/Users/admin_naths/Sources/clBLAS/src/library/blas/gens/trmv_reg.cpp:410:23: error:
member access into incomplete type 'SolutionStep'
SolutionStep *step = container_of(blasArgs, args, SolutionStep);
^
/Users/admin_naths/Sources/clBLAS/src/include/list.h:41:34: note: expanded from
macro 'container_of'
(type*)((prt_size_t)(node) - offset_of(field, type))
^
/Users/admin_naths/Sources/clBLAS/src/include/list.h:38:29: note: expanded from
macro 'offset_of'
(prt_size_t)(&((type*)0)->field)
^
/Users/admin_naths/Sources/clBLAS/src/library/blas/include/clblas-internal.h:39:8: note:
forward declaration of 'SolutionStep'
struct SolutionStep;
^
/Users/admin_naths/Sources/clBLAS/src/library/blas/gens/trmv_reg.cpp:418:14: error:
member access into incomplete type 'SolutionStep'
if( (step->funcID == CLBLAS_HEMV) || (blasArgs->pigFuncID == CLBLAS_...
^
/Users/admin_naths/Sources/clBLAS/src/library/blas/include/clblas-internal.h:39:8: note:
forward declaration of 'SolutionStep'
struct SolutionStep;
^
/Users/admin_naths/Sources/clBLAS/src/library/blas/gens/trmv_reg.cpp:445:11: error:
member access into incomplete type 'SolutionStep'
if( (step->funcID == CLBLAS_HEMV) || (blasArgs->pigFuncID == CLB...
^
/Users/admin_naths/Sources/clBLAS/src/library/blas/include/clblas-internal.h:39:8: note:
forward declaration of 'SolutionStep'
struct SolutionStep;
^
38 warnings and 13 errors generated.
make[2]: *** [library/CMakeFiles/clBLAS.dir/blas/gens/trmv_reg.cpp.o] Error 1
make[1]: *** [library/CMakeFiles/clBLAS.dir/all] Error 2
make: *** [all] Error 2
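
The `missing terminating '"' character` warnings in the log come from `trmv.clT` opening a string literal with `"` and continuing the kernel text on the following lines, which C and C++ do not allow. Once the literal is broken, the `#ifdef DOUBLE_PRECISION` lines inside the kernel text are most likely seen as real preprocessor directives, which would explain the `unterminated conditional directive` errors. As a rough sketch of what a valid embedding looks like, each kernel line can be emitted as its own escaped literal. This is not clBLAS's actual template generator: `to_c_literal` and the sample kernel are hypothetical names made up for illustration.

```python
def to_c_literal(name: str, source: str) -> str:
    """Return a C declaration embedding `source` as adjacent string literals."""
    lines = []
    for line in source.splitlines():
        # Escape backslashes first, then double quotes, so the text stays a valid literal.
        escaped = line.replace("\\", "\\\\").replace('"', '\\"')
        # Each physical line is closed with its own quote and an explicit \n escape.
        lines.append(f'    "{escaped}\\n"')
    if not lines:
        # An empty source still needs one (empty) literal to form a valid declaration.
        lines.append('    ""')
    body = "\n".join(lines)
    return f"static const char *{name} =\n{body};\n"


# Hypothetical kernel text, for illustration only.
sample = """#ifdef DOUBLE_PRECISION
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#endif
__kernel void demo(__global float *x) { x[get_global_id(0)] *= 2.0f; }
"""

print(to_c_literal("demo_kernel", sample))
```

Because every line of the generated declaration opens and closes its own quotes, the C compiler concatenates the pieces into one string and never sees `#ifdef` at the start of an unquoted line.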

numerical problems with clMagma and clBLAS

I tested clMAGMA 1.0.0 with clBLAS on an AMD FirePro W8000 and got the strange test results below: the error column becomes nan for some of the larger problem sizes.

The same problems occur with both MKL 11.0.2.146 and ACML 5.3.0.

We are wondering whether it is a driver issue. Have you seen this kind of error before? Any suggestions? Thanks.

./testing_sgemm

   M     N     K   clAmdBlas GFLop/s (sec)   CPU GFlop/s (sec)   error
1024  1024  1024            335.91 ( 0.01)       50.63 ( 0.04)   1.831055e-04
1280  1280  1280            614.83 ( 0.01)      103.87 ( 0.04)   2.746582e-04
1600  1600  1600            605.87 ( 0.01)      138.04 ( 0.06)   3.433228e-04
2000  2000  2000           1198.67 ( 0.01)      136.84 ( 0.12)   4.577637e-04
2500  2500  2500           1173.58 ( 0.03)       37.49 ( 0.83)   8.087158e-04
3125  3125  3125           1250.90 ( 0.05)      151.47 ( 0.40)   1.068115e-03
3906  3906  3906           1226.68 ( 0.10)      166.15 ( 0.72)   1.312256e-03
4882  4882  4882           1250.68 ( 0.19)      194.26 ( 1.20)   nan
6102  6102  6102           1260.94 ( 0.36)      217.01 ( 2.09)   nan

./testing_dgemm

   M     N     K   clAmdBlas GFLop/s (sec)   CPU GFlop/s (sec)   error
1024  1024  1024            210.07 ( 0.01)       39.60 ( 0.05)   3.126388e-13
1280  1280  1280            334.66 ( 0.01)       55.77 ( 0.08)   5.115908e-13
1600  1600  1600            501.54 ( 0.02)       36.27 ( 0.23)   6.536993e-13
2000  2000  2000            505.43 ( 0.03)       71.73 ( 0.22)   8.242296e-13
2500  2500  2500            482.68 ( 0.06)       83.53 ( 0.37)   nan
3125  3125  3125            431.77 ( 0.14)       94.39 ( 0.65)   nan
3906  3906  3906            424.89 ( 0.28)      100.95 ( 1.18)   nan
4882  4882  4882            419.76 ( 0.55)      112.94 ( 2.06)   nan
6102  6102  6102            422.29 ( 1.08)      128.65 ( 3.53)   5.627498e-12

./testing_sgesv_gpu

   N   NRHS   GPU GFlop/s (sec)   ||B - AX|| / ||A||*||X||
1024    100       31.37 ( 0.03)   7.35e-08
2048    100       77.00 ( 0.09)   1.35e-07
3072    100      130.27 ( 0.16)   2.02e-07
4032    100      214.66 ( 0.22)   2.27e-07
5184    100      270.23 ( 0.36)   nan
magma_sposv had error 257.
6384    100      308.80 ( 0.59)   nan
magma_sposv had error 195.
8385    100      434.03 ( 0.94)   nan

According to clinfo, the maximum memory allocation is 1073741824 bytes (1 GiB) and the global memory size is 3221225472 bytes (3 GiB).

From dmesg:
[fglrx] module loaded - fglrx 12.10.5 [May 30 2013]
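
As a quick sanity check on the clinfo limits quoted above, the sketch below estimates the size of one dense N x N matrix at the sizes where nan first appears in each test and compares it with the reported limits. It assumes square, unpadded matrices stored one per buffer, which may not match how the clMAGMA testers actually allocate memory. The names `matrix_bytes` and `cases` are illustrative only.

```python
MAX_ALLOC = 1073741824    # bytes, "Max memory allocation" reported by clinfo
GLOBAL_MEM = 3221225472   # bytes, "Global memory size" reported by clinfo


def matrix_bytes(n: int, elem_size: int) -> int:
    """Bytes needed for one dense n x n matrix (assumes no padding)."""
    return n * n * elem_size


# (test name, first N that reported nan in the tables, bytes per element)
cases = [("sgemm", 4882, 4), ("dgemm", 2500, 8), ("sgesv", 5184, 4)]

for name, n, size in cases:
    one = matrix_bytes(n, size)
    # GEMM touches three matrices (A, B, C); used here as a rough upper bound.
    total = 3 * one
    print(f"{name}: N={n}, one matrix = {one / 2**20:.1f} MiB, "
          f"three = {total / 2**20:.1f} MiB, "
          f"fits max alloc: {one <= MAX_ALLOC}, fits global: {total <= GLOBAL_MEM}")
```

Under these assumptions each matrix is on the order of 100 MiB, far below both limits, so the allocation limits alone do not obviously explain the nan results.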

Please tag the new upstream release

I noticed that the version number in master was bumped to 2.6 after a recent merge from develop. I am guessing this means a new upstream release has been made. Could you please push a tag for it?

Many thanks.

license issue

I was trying to find out who owned this repo, looked in the license and found:

Copyright [yyyy] [name of copyright owner]

You should fill in the blanks :)

cmake of clBLAS doesn't complete?

I apologize if this is extremely basic, but I cannot seem to get clBLAS to compile using the ArrayFire instructions for a Linux system.

git clone https://github.com/arrayfire/clBLAS.git
cd clBLAS
mkdir build && cd build
cmake ../src -DCMAKE_BUILD_TYPE=Release
make && make install

This results in the following error whether I run as root or not:

 -- Installing: /usr/local/./include/clBLAS.h
 CMake Error at cmake_install.cmake:36 (FILE):
 file INSTALL cannot copy file "/home/cdeterman/clBLAS/src/clBLAS.h" to
 "/usr/local/./include/clBLAS.h".

Have I missed something here? I am hesitant to just start copying files to the other location.

clLAPACK ?

Hi all,

Just a quick question: are there plans for an OpenCL implementation of LAPACK (or CLAPACK)? Or is that unnecessary? It seems to me that LAPACK just calls BLAS, so if you point it at clBLAS it will get all of the HSA goodness you could want.

I'm not an expert on this stuff so I thought I'd ask the experts.

What do you think?
