I just got clBLAS compiling on OS X (see <a class="issue-link js-issue-link" data-erro

Documentation for the callbacks on the CBLAS-like API (clBLAS issue, closed)

clmathlibraries commented on June 29, 2024

Documentation for the callbacks on the CBLAS-like API

from clblas.

Comments (17)

kknox commented on June 29, 2024

Hi @fommil,
Because OpenCL is meant to run on devices with disparate memory address spaces, it treats memory in a 'black box' fashion. Allocating OpenCL memory does not return a pointer; it returns a handle that is meant to be used in further OpenCL operations. These handles are cl_mem objects, and our APIs take cl_mem objects as parameters. Because you cannot apply pointer arithmetic to a handle, we add an extra offset parameter for every cl_mem parameter so that a user can specify a starting offset into the buffer.
The extra parameters appended to the BLAS API are the OpenCL objects that control the execution of the OpenCL kernels. If you set them to NULL, the API will not do anything, and I'm sure it will appear to run very fast.
We provide library documentation for our API, but it assumes that you are already familiar and comfortable with OpenCL. If you would like to start learning OpenCL, the OpenCL specification is not a terrible read, and AMD has additional resources for developers.


fommil commented on June 29, 2024

@kknox a little example of how to call the BLAS functions wouldn't go amiss. the equivalent cuBLAS functions are much more closely aligned with the original BLAS API in comparison... although it is rather frustrating that neither library actually implements the BLAS that decades of middleware has conformed to. Hence my wrapper layer.


fommil commented on June 29, 2024

you don't have an explicit dgemm example, but the C examples you pointed me at were useful.

it looks to me like you're still some way from users being able to call you as BLAS. I'll attempt to wrap DDOT and DGEMM over the coming months, but I'll pause at that point to see where to go.


kknox commented on June 29, 2024

Hi @fommil
If you are looking for code examples of how to call the BLAS functions, take a look at the samples directory of the repository. We have simple single-precision examples that call almost every routine we support. You should be able to compile a sample, step through it in a debugger, and see what is needed to initialize OpenCL and call into a BLAS routine.

We recognize that the clBLAS API differs slightly from the traditional Netlib BLAS definition; we did not break the BLAS API lightly or arbitrarily. Designing for heterogeneous platforms like modern GPU systems necessitates different decisions than were made 30 years ago for homogeneous platforms like traditional CPU servers. There is a heavy cost in transferring data to and from the heterogeneous device (i.e. the GPU over the PCI Express bus), and if data is managed carelessly, performance will actually be worse than if the computation had never been offloaded.

Our API, built on top of OpenCL, allows our clients to manage their own data. They control when and where data is transferred to and from the heterogeneous device. This is the reason we added the extra OpenCL parameters to the BLAS APIs: the user manages the OpenCL state and passes it into the library, which ultimately generates OpenCL kernels and enqueues them into the command queue. With this API, the client controls when data is transferred to the device, executes a series of BLAS calls (or user-defined kernels) while the data remains on the device, and then transfers data back to the host only when processing is done. Otherwise, data makes a round trip to the device and back on every BLAS call, and you find yourself in the uncomfortable situation where you would have been better off not offloading to the device in the first place 😃
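
A back-of-the-envelope sketch of the argument above, in C. It only counts bytes that would cross the bus for a chain of double-precision GEMM calls and ignores latency, kernel time, and any reuse of operands between calls. The function names are hypothetical and are not part of clBLAS.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes held by an M x K matrix A, a K x N matrix B and an M x N matrix C
 * of doubles. */
static size_t gemm_operand_bytes(size_t m, size_t n, size_t k)
{
    return (m * k + k * n + m * n) * sizeof(double);
}

/* Hypothetical "drop-in BLAS" wrapper: every call uploads A, B and C and
 * downloads C again, so the traffic grows with the number of calls. */
size_t bytes_moved_round_trip(size_t calls, size_t m, size_t n, size_t k)
{
    assert(m > 0 && n > 0 && k > 0);
    size_t upload = gemm_operand_bytes(m, n, k);
    size_t download = m * n * sizeof(double);
    return calls * (upload + download);
}

/* Client-managed buffers: upload once, chain any number of calls on the
 * device, download C once at the end. Traffic does not depend on the
 * number of calls. */
size_t bytes_moved_resident(size_t m, size_t n, size_t k)
{
    assert(m > 0 && n > 0 && k > 0);
    return gemm_operand_bytes(m, n, k) + m * n * sizeof(double);
}
```

Under this toy model, ten chained calls through a round-trip wrapper move ten times as many bytes as the same ten calls on buffers that stay resident on the device.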


fommil commented on June 29, 2024

@kknox can you please take a look at this? It's a translation of your sgemm sample.

https://github.com/fommil/netlib-java/blob/master/perf/src/main/c/clwrapper.c

When I run my test file

https://github.com/fommil/netlib-java/blob/master/perf/src/main/c/dgemmtest.c

(compilation instructions at the top)

I see this :-(

found 1 OpenCL platforms
found 1 OpenCL devices
created context
created command queue
setup clblas
created buffers
enqueud buffers
Segmentation fault: 11

I'm on OS X. Note that I changed the CL_DEVICE_TYPE_GPU as I was getting 0 devices with it. I have another machine that I can try this out on... perhaps my laptop doesn't have GPU OpenCL (first I've heard of it! It's an Intel HD Graphics 3000).


fommil commented on June 29, 2024

for completeness, I thought I would note that my Macbook Air doesn't seem to support OpenCL on the GPU :-( http://forums.macrumors.com/showthread.php?t=1119312


simonmcs commented on June 29, 2024

Apple will provide OpenCL 1.2 support on the integrated Iris graphics of Haswell-based MBA's in Mavericks when that's released soon. Sounds like you'll be justified in treating yourself to a new laptop! ;-)

http://forums.macrumors.com/showthread.php?t=1620203

http://docs.huihoo.com/apple/wwdc/2013/session_508__working_with_opencl.pdf

Simon

On 11 Sep 2013, at 15:49, Sam Halliday [email protected] wrote:

for completeness, I thought I would note that my Macbook Air doesn't seem to support OpenCL on the GPU :-( http://forums.macrumors.com/showthread.php?t=1119312


Head of Microelectronics Group and University of Bristol Business Fellow
High Performance Computing and Architectures, Department of Computer Science
University of Bristol, Merchant Venturers Building, Woodland Road, Clifton, Bristol, BS8 1UB, UK
Phone: +44 (0)117 331 5324, Twitter: simonmcs, Web: http://www.cs.bris.ac.uk/~simonm/
Microelectronics Group webpage: http://www.cs.bris.ac.uk/Research/Micro/


fommil commented on June 29, 2024

@simonmcs heh, nah... I've got a relatively new iMac that I'll use for GPU performance tests. And clBLAS needs to work without segfaults before I can rationalise a frivolous upgrade :-P


pavanky commented on June 29, 2024

@fommil

I don't understand what you are trying to do here

size_t off  = 1;
size_t offA = K + 1;   /* K + off */
size_t offB = N + 1;   /* N + off */
size_t offC = N + 1;   /* N + off */

To use clBLAS, all you need to do is set the offsets to 0 and pass the other parameters as they are. You are making it more complicated than it is worth. The segmentation fault is likely occurring because you are using your CPU as your OpenCL device and the wrapper code you have written is trying to access elements that are out of bounds.
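
One way a non-zero offset can push accesses out of bounds is sketched below in C. It assumes row-major storage and a leading dimension equal to the row length, which is an assumption about the wrapper rather than something stated in this thread. The helper names are hypothetical and not part of clBLAS.

```c
#include <assert.h>
#include <stddef.h>

/* Index one past the last element touched by a row-major rows x cols view
 * that starts `off` elements into a buffer with leading dimension `ld`. */
size_t elems_touched(size_t off, size_t rows, size_t cols, size_t ld)
{
    if (rows == 0 || cols == 0) {
        return 0;
    }
    return off + (rows - 1) * ld + cols;
}

/* Returns 1 if the view stays inside a buffer of buf_elems elements. */
int view_fits(size_t buf_elems, size_t off, size_t rows, size_t cols, size_t ld)
{
    assert(ld > 0);
    return cols <= ld && elems_touched(off, rows, cols, ld) <= buf_elems;
}
```

For a 4 x 5 matrix A (20 elements, ld = 5) and an offset of K + 1 = 6, `view_fits(20, 6, 4, 5, 5)` is 0 because the full M x K view would run past the end of the buffer, while the shrunken 3 x 4 view `view_fits(20, 6, 3, 4, 5)` is 1. With an offset of 0 the full view fits.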


fommil commented on June 29, 2024

@pavanky I am copying the code from the example. I don't understand why the offsets are +1! I thought it was some device-specific nonsense.


pavanky commented on June 29, 2024

The example has the following line.
/* Call clblas extended function. Perform gemm for the lower right sub-matrices */

Since you want matrix multiplication on the entire matrix, try setting offsets to 0 for your case. Use M, N, K, LDA, LDB, LDC directly.


fommil commented on June 29, 2024

oh, I missed that bit :-D

now, why would a gemm example not do gemm?


pavanky commented on June 29, 2024

@fommil it is doing gemm, but only on the bottom right corner of the buffers.

The equivalent in standard gemm would've used something like A + offA, B + offB and C + offC.

This kind of an API is necessary for OpenCL because such offsets to pointers are not possible from the host side. But such offsets are required for some libraries that are downstream from BLAS (such as various LAPACK implementations).
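
The correspondence between the two calling styles can be sketched in plain host C, with ordinary arrays standing in for device buffers. This is a naive reference loop, not clBLAS code: it assumes row-major storage with no transposes, alpha = 1 and beta = 0, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Offset style, mirroring a (buffer, offset, leading dimension) triple per
 * matrix: C = A * B on sub-matrices that start off_x elements into each
 * buffer. */
void gemm_with_offsets(size_t m, size_t n, size_t k,
                       const double *a, size_t off_a, size_t lda,
                       const double *b, size_t off_b, size_t ldb,
                       double *c, size_t off_c, size_t ldc)
{
    assert(lda >= k && ldb >= n && ldc >= n);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p) {
                sum += a[off_a + i * lda + p] * b[off_b + p * ldb + j];
            }
            c[off_c + i * ldc + j] = sum;
        }
    }
}

/* Pointer style, as in host BLAS: the caller has already applied
 * A + offA, B + offB and C + offC, so every offset is zero here. */
void gemm_with_pointers(size_t m, size_t n, size_t k,
                        const double *a, size_t lda,
                        const double *b, size_t ldb,
                        double *c, size_t ldc)
{
    gemm_with_offsets(m, n, k, a, 0, lda, b, 0, ldb, c, 0, ldc);
}
```

Calling `gemm_with_offsets(m, n, k, a, offA, lda, b, offB, ldb, c, offC, ldc)` writes the same elements as `gemm_with_pointers(m, n, k, a + offA, lda, b + offB, ldb, c + offC, ldc)`. With all offsets set to 0 and the full M, N, K, the first form multiplies the entire matrices.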


fommil commented on June 29, 2024

@pavanky I'm still getting the segfault with no offsets. Actually, this happened last night too and that's why I added all the offsets (I thought it was some hocus pocus and didn't see the note about sub matrices).


fommil commented on June 29, 2024

I get the segfaults when on a GPU device as well. I won't be able to test this again until next weekend.


fommil commented on June 29, 2024

@kknox @pavanky I'm still unable to get results with clBLAS, but I've been able to run some DGEMM tests with CUDA to confirm your comments about the memory overhead. Indeed, it is pretty spectacular. Turquoise (the light blue below the red lines, keeping pace with the green ATLAS) is CUDA + overhead; dark blue is CUDA with just the dgemm call (and I checked that it is computing the result correctly!)


kknox commented on June 29, 2024

Closing old clBLAS issues for the new year

I believe that this question has been answered, in part here and in part with the comments in #12.
