
bolt's Introduction

Bolt is a C++ template library optimized for heterogeneous computing. Bolt is designed to provide high-performance library implementations for common algorithms such as scan, reduce, transform, and sort. The Bolt interface was modeled on the C++ Standard Template Library (STL). Developers familiar with the STL will recognize many of the Bolt APIs and customization techniques.

The primary goal of Bolt is to make it easier for developers to utilize the inherent performance and power efficiency benefits of heterogeneous computing. It has interfaces that are easy to use, and has comprehensive documentation for the library routines, memory management, control interfaces, and host/device code sharing.

Compared to writing the equivalent functionality in OpenCL™, you’ll find that Bolt requires significantly fewer lines-of-code and less developer effort. Bolt is designed to provide a standard way to develop an application that can execute on either a regular CPU, or use any available OpenCL™ capable accelerated compute unit, with a single code path.

Here's a link to our BOLT wiki page.

Prerequisites

Windows

  1. Visual Studio 2010 or later (VS2012 for C++ AMP)
  2. Tested with 32/64-bit Windows® 7/8 and Windows® Blue
  3. CMake 2.8.10
  4. TBB 4.1 Update 1 or above (for the Multicore CPU path only). See Building Bolt with TBB.
  5. APP SDK 2.8 or later.

Note: If both Visual Studio 2012 and Visual Studio 2010 are installed, Visual Studio 2010 should be updated to SP1.

Linux

  1. GCC 4.6.3 and above
  2. Tested with openSUSE 12.3, RHEL 6.4 64-bit, RHEL 6.3 32-bit, Ubuntu 13.04
  3. CMake 2.8.10
  4. TBB 4.1 Update 1 or above (for the Multicore CPU path only). See Building Bolt with TBB.
  5. APP SDK 2.8 or later.

Note: The pre-built Bolt binaries for Linux are built with GCC 4.7.3, and applications should be built with the same version. Otherwise, build Bolt from source with GCC 4.6.3 or higher.

Catalyst™ package

The latest Catalyst driver contains the most recent OpenCL runtime. The recommended package is the latest 13.11 beta driver.

Catalyst 13.4 and higher is supported.

Note: Catalyst 13.9 is not supported.

Supported Devices

AMD APU Family with AMD Radeon™ HD Graphics

  • A-Series
  • C-Series
  • E-Series
  • E2-Series
  • G-Series
  • R-Series

AMD Radeon™ HD Graphics

  • 7900 Series (7990, 7970, 7950)
  • 7800 Series (7870, 7850)
  • 7700 Series (7770, 7750)

AMD Radeon™ HD Graphics

  • 6900 Series (6990, 6970, 6950)
  • 6800 Series (6870, 6850)
  • 6700 Series (6790, 6770, 6750)
  • 6600 Series (6670)
  • 6500 Series (6570)
  • 6400 Series (6450)
  • 6xxxM Series

AMD Radeon™ Rx 2xx Graphics

  • R9 2xx Series
  • R8 2xx Series
  • R7 2xx Series

AMD FirePro™ Professional Graphics

  • W9100

Compiled Windows binary packages (zip archives) for Bolt may be downloaded from the Bolt landing page hosted on AMD's Developer Central website.

Examples

The simple example below shows how to use Bolt to sort a random array of 8192 integers.

#include <bolt/cl/sort.h>
#include <vector>
#include <algorithm>

int main()
{
    // generate random data (on host)
    size_t length = 8192;
    std::vector<int> a(length);
    std::generate(a.begin(), a.end(), rand);

    // sort, run on best device in the platform
    bolt::cl::sort(a.begin(), a.end());
    return 0;
}

The code will be familiar to programmers who have used the C++ Standard Template Library; the difference is the include file (bolt/cl/sort.h) and the bolt::cl namespace before the sort call. Bolt developers do not need to learn a new device-specific programming model to leverage the power and performance advantages of heterogeneous computing.

#include <bolt/cl/device_vector.h>
#include <bolt/cl/scan.h>
#include <vector>
#include <numeric>

int main()
{
    size_t length = 1024;

    // create a device_vector and initialize it to 1
    bolt::cl::device_vector< int > boltInput( length, 1 );

    // calculate the inclusive scan of the device_vector
    bolt::cl::inclusive_scan( boltInput.begin(), boltInput.end(), boltInput.begin() );

    // create a std::vector and initialize it to 1
    std::vector<int> stdInput( length, 1 );

    // calculate the inclusive scan of the std::vector
    bolt::cl::inclusive_scan( stdInput.begin(), stdInput.end(), stdInput.begin() );
    return 0;
}

This example shows how Bolt simplifies management of heterogeneous memory. The creation and destruction of device resident memory is abstracted inside of the bolt::cl::device_vector <> class, which provides an interface familiar to nearly all C++ programmers. All of Bolt’s provided algorithms can take either the normal std::vector or the bolt::cl::device_vector<> class, which allows the user to control when and where memory is transferred between host and device to optimize performance.

Copyright and Licensing information

© 2012,2014 Advanced Micro Devices, Inc. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

bolt's People

Contributors

avinashcpandey, bensander, guacamoleo, hsa-libraries, jayavanth, mattpd, ravibanger


bolt's Issues

ConstantIteratorTest

The following code from ConstantIteratorTest should use bolt::cl::device_vector, but uses std::vector instead:

TYPED_TEST_P( CountingIterator, DeviceTransformVector )
{
    // initialize the data vector to be sequential numbers
    std::vector< TypeParam > devVec( 3 );
    bolt::cl::transform( devVec.begin( ), devVec.end( ), bolt::cl::make_counting_iterator( 42 ),
                         devVec.begin( ), bolt::cl::plus< TypeParam >( ) );
    EXPECT_EQ( 42, devVec[ 0 ] );
    EXPECT_EQ( 43, devVec[ 1 ] );
    EXPECT_EQ( 44, devVec[ 2 ] );
}

bolt 1.2, typo in transform_reduce.inl

Bolt 1.2, file include/bolt/cl/detail/transform_reduce.inl, lines 446-447:

dblog->CodePathTaken(BOLTLOG::BOLT_TRANSFORMREDUCE,BOLTLOG::BOLT_MULTICORE_CPU,"

::Transform_Reduce::MULTICORE_CPU");

The string literal is split across two source lines, which makes the compiler issue unnecessary warnings.

Z Koza

Bolt 1.2: bolt::cl::min_element and bolt::cl::max_element have issues when used on a device_vector with iterators.

This issue can be observed only with Bolt, i.e., when calling bolt::cl::min_element and bolt::cl::max_element on a device_vector. std::min_element and std::max_element work fine.

CODE:

//code for BOLT_MIN_ELEMENT:

TEST(sanity_min_element_2bolt_cl_device_vect_loop, ints_loop){
    int size = 10;
    bolt::cl::device_vector<int> intStdVect (size);
    bolt::cl::device_vector<int> intBoltVect (size);

    for (int i = 0; i < size; i++){
        intBoltVect[i] = (int)std::rand() % 65535;
        intStdVect[i] = intBoltVect[i];
    }

    bolt::cl::device_vector<int>::iterator std_min_ele;
    for (int i = 0; i < 1000; i++){
        std_min_ele = std::min_element (intStdVect.begin(), intStdVect.end());
    }

    bolt::cl::device_vector<int>::iterator bolt_min_ele;
    for (int i = 0; i < 1000; i++){
        bolt_min_ele = bolt::cl::min_element(intBoltVect.begin(), intBoltVect.end());
    }
    EXPECT_EQ(*std_min_ele, *bolt_min_ele) << std::endl;
}

// code for BOLT_MAX_ELEMENT:

TEST(sanity_max_element_2bolt_cl_device_vect_loop, ints_loop){
    int size = 10;
    bolt::cl::device_vector<int> intStdVect (size);
    bolt::cl::device_vector<int> intBoltVect (size);

    for (int i = 0; i < size; i++){
        intBoltVect[i] = (int)std::rand() % 65535;
        intStdVect[i] = intBoltVect[i];
    }

    bolt::cl::device_vector<int>::iterator std_max_ele;
    for (int i = 0; i < 1000; i++){
        std_max_ele = std::max_element (intStdVect.begin(), intStdVect.end());
    }

    bolt::cl::device_vector<int>::iterator bolt_max_ele;
    for (int i = 0; i < 1000; i++){
        bolt_max_ele = bolt::cl::max_element(intBoltVect.begin(), intBoltVect.end());
    }
    EXPECT_EQ(*std_max_ele, *bolt_max_ele) << std::endl;
}

please remove boost

The boost dependency is not needed and creates compiler issues with the different versions of boost out there.

You should use C++11 directly. And please update the CMake files to support C++11 flags for gcc.

Missing Linux installation instructions

After downloading the binary tarball for Linux and unpacking it, I find a directory with some stuff in it, but no instructions on installation. For example, should one copy include, lib and lib64 to /usr/local? Or is it intended that Bolt applications should set up -I and -L compiler flags to wherever one unpacked Bolt? Is any special care needed if one already has Boost (and no doubt a different version of Boost) installed to avoid version conflicts?

throw opencl kernel compile issue when run test opencl case for example clBolt.Test.StableSort

Hi,
I cloned the Bolt code and built it on ROCm 1.9 with the OpenCL runtime. The project builds successfully with the CMake command "cmake -DBOOST_LIBRARYDIR=/home/qcxie/software/boost_1_65_1/stage/lib -DBOOST_ROOT=/home/qcxie/software/boost_1_65_1 -DGTEST_ROOT=/home/qcxie/software/boost_1_65_1 -DCMAKE_BUILD_TYPE=Debug -DBolt_BUILD64=1 -DCMAKE_CXX_FLAGS="-std =c++14 -fpermissive -I /opt/rocm/opencl/include -L/opt/rocm/opencl/lib/x86_64 -lOpenCL" ../",
but running a test case such as clBolt.Test.StableSort throws an OpenCL kernel compile error:
" error: unknown type name 'namespace' namespace bolt { namespace cl { "
How should I configure the build, or which build-program options should I set, to fix this issue? Thanks very much.

Missing includes of <boost/thread/lock_guard.hpp>

When I try to build Bolt from source, I get numerous errors about lock_guard because <boost/thread/lock_guard.hpp> is not included anywhere. This could be because I'm using my system's version of Boost (1.53) rather than using the superbuild.

Bolt 1.2: std::partial_sum has compilation issues when used with a UDD through a transform_iterator.

CODE:

int get_global_id(int i);

int global_id;

BOLT_FUNCTOR(UDD_trans,
struct UDD_trans
{
    int i;
    float f;

    UDD_trans() { }

    UDD_trans(int val1)
    {
        i = val1;
    }

    bool operator == (const UDD_trans& other) const
    {
        return ((i + f) == (other.i + other.f));
    }

    UDD_trans operator() () const
    {
        UDD_trans temp;
        temp.i = get_global_id(0);
        return temp;
        //return get_global_id(0);
    }
};
);

int get_global_id(int i)
{
    return global_id++;
}

BOLT_FUNCTOR(add_UDD,
struct add_UDD
{
    int operator() (const UDD_trans x) const { return x.i + 3; }
    typedef int result_type;
};
);

BOLT_TEMPLATE_REGISTER_NEW_ITERATOR( bolt::cl::device_vector, int, UDD_trans );
BOLT_TEMPLATE_REGISTER_NEW_TRANSFORM_ITERATOR( bolt::cl::transform_iterator, add_UDD, UDD_trans );

int main()
{
    int length = 5;

    std::vector< UDD_trans > svInVec1( length );
    std::vector< int > stlOut( length );

    add_UDD add1;
    UDD_trans gen_udd(0);

    // ADD
    bolt::cl::transform_iterator< add_UDD, std::vector< UDD_trans >::const_iterator > sv_trf_begin1( svInVec1.begin(), add1 );
    bolt::cl::transform_iterator< add_UDD, std::vector< UDD_trans >::const_iterator > sv_trf_end1( svInVec1.end(), add1 );

    global_id = 0;
    std::generate( svInVec1.begin(), svInVec1.end(), gen_udd );

    bolt::cl::plus<int> pls;
    bolt::cl::control ctrl = bolt::cl::control::getDefault();

    global_id = 0;

    // STD_PARTIAL_SUM
    std::partial_sum( sv_trf_begin1, sv_trf_end1, stlOut.begin(), pls );

    for (int i = 0; i < length; i++)
    {
        std::cout << "Val = " << stlOut[i] << "\n";
    }

    return 0;
}

Style checker to use for Bolt

Does Bolt have a recommended style checker? I noticed that even though Bolt has a coding style guideline, not all of it is followed or enforced. In particular, the use of tab characters and indentation seems inconsistent, varying depending on who checks in the code.

The guideline states "Use only spaces, and indent 2 spaces at a time", it seems most of the code is indented with 4 spaces and uses tab characters as well as spaces.

(screenshot of inconsistent indentation omitted)

CountingIterator with bolt::cl::copy

Hello,

I'm testing new functionality of Bolt 1.3 library on GPU.

I've followed the example from the following page: http://developer.amd.com/community/blog/2013/04/26/details-of-the-bolt-beta/

I found the iterator functionality very attractive, but the example from that page does not seem to work correctly, at least under Linux:

#include "bolt/cl/device_vector.h"
#include "bolt/cl/iterator/counting_iterator.h"
#include "bolt/cl/copy.h"

int main( int argc, char* argv[] )
{
    bolt::cl::device_vector< int > devV( 100 );
    bolt::cl::copy( bolt::cl::make_counting_iterator< int >( 10 ),
                    bolt::cl::make_counting_iterator< int >( 10 + devV.size( ) ),
                    devV.begin( ) );
}

The destination vector is not changed at all. When I execute it using bolt::cl::transform instead of bolt::cl::copy, it works correctly.

scatter_if in bolt::amp not possible?

Hi,
I'm studying Bolt and wanted to implement an example program that needs scatter_if operation (http://thrust.github.io/doc/group__scattering.html#ga1079bc05bcb3d4b5080f1e07444fee37). I started to port thrust scatter_if code, which uses permutation_iterator but came across this (https://groups.google.com/forum/#!topic/thrust-users/Xe2JkFy_hUk). The Google Group post claims that permutation_iterator in AMP kernel is not possible because of the restriction AMP put on the use of pointer in kernel (ie. restrict(amp)). Is this true? If so, is it possible to implement permutation_iterator in bolt::cl?

BTW, Bolt forum in AMD Dev Central does not seem to work correctly. It is set to private and my post there does not seem to go through. :(

Problems coaxing CMake to not include 32-bit compiler flags.

Just as the title says, I'm having trouble convincing CMake to let go of the -m32 compiler flag. Everything is working except for that one crucial hangup.

I'm using CMake-3.0.

I've tried to manually edit the CMAKE_CXX_COMPILER and CMAKE_EXE_LINKER variables from the command line with little success.

Any assistance would be most well received.

Cross platform issues when importing into OpenCV

While developing the OpenCL module of OpenCV (the open-source computer vision library), I intended to import the Bolt library and use its sorting and scan APIs, but in browsing the source code I noticed that the AMD-only static OpenCL C++ template extension is heavily used in the OpenCL kernels.

For OpenCV, we must ensure that OpenCL works on most platforms, not only AMD's. Do you have any plans to make Bolt available on non-AMD APP SDK platforms, such as NVIDIA's and Intel's OpenCL SDKs?


To work around this, I adapted some of the code from Bolt and used macros to simulate templates. What I added in this branch is sort_by_key using OpenCL. I also included a radix sort for float types, which Bolt does not support yet.

Please see the following links:
Host code
Radix sort kernel file

Thanks!

Bolt 1.3, Windows: scan-family test cases fail for the OpenCL CPU path.

Test case failures were observed for the bolt::cl scan family (exclusive_scan, inclusive_scan, transform_exclusive_scan, transform_inclusive_scan), but only on the OpenCL CPU path. The same test cases pass on the other paths (GPU, Automatic, MultiCoreCpu, and serial CPU).

EXAMPLE:
TEST (sanity_exclusive_scan_stdVectVsDeviceVectWithIters, floatSameValuesSerialRange){
    int size = 1000;
    TAKE_THIS_CONTROL_PATH
    bolt::cl::device_vector< float > boltInput( size, 1.125f );
    bolt::cl::device_vector< float >::iterator boltEnd =
        bolt::cl::exclusive_scan( my_ctl, boltInput.begin( ), boltInput.end( ), boltInput.begin( ), 2.0f );

    std::vector< float > stdInput( size, 1.125f );
    std::vector< float >::iterator stdEnd =
        bolt::cl::exclusive_scan( stdInput.begin( ), stdInput.end( ), stdInput.begin( ), 2.0f );

    EXPECT_FLOAT_EQ( (*(boltEnd - 1)), (*(stdEnd - 1)) ) << std::endl;
}

Don't default to 32 bit builds.

Linux systems do not have 32-bit headers and libraries installed by default. Debugging the errors that arise from this creates a lot of overhead on the user's end.
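A possible workaround, assuming the Bolt_BUILD64 option that appears in the configure command quoted in the StableSort issue above, is to request a 64-bit build explicitly at configure time:

```shell
# request a 64-bit build so CMake does not add the -m32 flag
# (Bolt_BUILD64 is the option name seen elsewhere on this page;
# verify it against your checkout's CMakeLists.txt)
cmake -DBolt_BUILD64=1 ..
```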

unable to input bolt::cl::transform_iterator into bolt::cl::copy

#include <iostream>
#include <vector>
#include <bolt/cl/iterator/counting_iterator.h>
#include <bolt/cl/iterator/transform_iterator.h>
#include <bolt/cl/functional.h>
#include <bolt/cl/device_vector.h>
#include <bolt/cl/copy.h>

BOLT_FUNCTOR(GetSquare,
struct GetSquare
{
public:
    int operator()(const int& globalId) const
    {
        return globalId*globalId;
    }
};);

int main()
{
    const std::size_t n=10;

    bolt::cl::control ctrl = bolt::cl::control::getDefault();

    bolt::cl::device_vector<int> debug(n);

    auto globalId = bolt::cl::make_counting_iterator(0);

    // This is OK
    // bolt::cl::transform(globalId, globalId + n, debug.begin(), GetSquare());

    // This causes compilation error
    auto square = bolt::cl::make_transform_iterator(globalId, GetSquare());
    bolt::cl::copy(square, square + n, debug.begin());

    for(int i = 0; i < n; i++)
    {
        std::cout << i << ": " << debug[i] << std::endl;
    }

    return 0;
}

This problem seems to arise because bolt::cl::transform_iterator provides getContainer() only as a template method:

template<typename Container >
Container& getContainer() const
{
    return this->base().getContainer( );
}

whereas bolt::cl::copy needs ITERATOR::getContainer() without any template argument:

V_OPENCL( kernels[whichKernel].setArg( 0, first.getContainer().getBuffer()), "Error setArg kernels[ 0 ]" );

This can be solved with a C++11 trailing return type:

auto getContainer() const -> decltype(base().getContainer())
{
    return this->base().getContainer();
}

I don't have any idea in C++03 (boost::result_of?).

Differences in develop and master branch.

I noticed that there are some differences between the develop branch and the master/v1.0 branch. Is that intentional? Most of the differences are minor documentation changes that probably won't matter much, but it seems something was missed when syncing the v1.0/master/develop branches in preparation for the v1.0 release.

I'm just getting started with the Git way of doing things; I've only been using SVN. So it's possible that I'm missing something...

Also, to submit an entry for Bolt Sample Code Contest, I should open a pull request to develop branch, right? Or does it not matter?

Java-bindings for Bolt

Enhancement:
Some JNA or JNI binding for a Bolt wrapper, to get a Bolt-like API with GPU access and performance from Java.

Since Aparapi seems to be on hold, experimental, and solving different problems than Bolt does, and Sumatra may not even be available with Java 9 in 2016, it would be very nice to get Java access to GPU reduce, sort, and similar operations.

Open source OpenCL Static C++ Kernel Language Extension

Hi,
Sorry, as this is surely not the best place to post, but I remember a presentation at APU13 saying AMD would open-source its "OpenCL Static C++ Kernel Language Extension" so that all OpenCL implementations could take advantage of it. I think it was in the Bolt presentation, and it was said to arrive in either Q1 2014 or H1 2014, so it must be coming soon, right?
Also, Bolt takes advantage of it right now in the OpenCL path, right?
