rocm / hip-cpu Goto Github PK

View Code? Open in Web Editor NEW

107.0 19.0 19.0 795 KB

An implementation of HIP that works on CPUs, across OSes.

License: MIT License

CMake 2.08% C++ 94.55% C 3.37%

hip hip-runtime hip-portability hip-kernel-language cuda cuda-programming cpp17 stl-algorithms parallel-algorithms spmd

hip-cpu's Introduction

HIP CPU Runtime

Repository | Issues | Documentation | Code Samples

The HIP CPU Runtime is a header-only library that allows CPUs to execute unmodified HIP code. It is generic and does not assume a particular CPU vendor or architecture. Please note the library is being actively developed, and is known to be incomplet; it might also be incorrekt and there could be a few bad bugs lurking.

Overview and Tutorials

HIP CPU Runtime tutorials per compiler and platform

GCC on Linux
GCC on MacOS
Microsoft C++ Compiler (MSVC) on Windows
Others (TODO)

Quick Links

HIP GPU Runtime FAQ

Questions and Feedback

FAQs
Please consult the FAQ before submitting a question.

Known Issues
If a pre-existing issue encompasses your feedback, please leave a reaction on the issue to up-vote or down-vote it, which will help us in prioritisation.

Provide Feedback
Submit questions, issues or feature requests.

Contribution

Contributions are most welcome and strongly encouraged. Please consult the contributing guide for details.

hip-cpu's People

Contributors

Stargazers

Watchers

Forkers

cjatin bjoo sheidanava mathiasmagnus matevzfa stanleyjacob mfkiwl ueqri michaelsmiller alexdutu yanxing-shi fodinabor cowkeyman garymm dc-dc-dc naraenda onlyoh fwyzard ashishd

hip-cpu's Issues

Use of Cuda terminology instead of HIP terminology in coordinate built-ins

In the file "include/hip/hip_device_launch_parameters.h", the Cuda terminology is used instead of the native HIP terminology for both thread indexes, block indexes, block dims and grid dims definition:

Thread-index: threadIdx.x should be hipThreadIdx_x
Block-index: blockIdx.x should be hipBlockIdx_x
Block-dim: blockDim.x should be hipBlockDim_x
Grid-dim: gridDim.x should be hipGridDim_x

This triggers compilation errors and does definitely compromise the portability of codes written in pure HIP or coming out of the "hipify" tool.
https://github.com/ROCm-Developer-Tools/HIP/blob/main/docs/markdown/hip_kernel_language.md#coordinate-built-ins

Dynamic shared memory failes to compile

Trying to compile the HIP program

#include <hip/hip_runtime.h>

__global__ void my_kernel()
{
    extern __shared__ int dyn_shmem[];
}

int main()
{
    int dyn_shmem_size = 64;
    hipLaunchKernelGGL(my_kernel, 4, 32, dyn_shmem_size, 0);

    hipDeviceSynchronize();

    return 0;
}

using the command

g++ -g -O2 -std=c++17 -I/home/jakub/apps/HIP-CPU/include source.hip.cpp -o program.x -ltbb -lpthread

produces the following compiler error

source.hip.cpp: In function ‘void my_kernel()’:
source.hip.cpp:5:37: error: conflicting specifiers in declaration of ‘dyn_shmem’
    5 |     extern __shared__ int dyn_shmem[];
      |                                     ^

This is probably cause by the extern and static specifiers being combined, as __shared__ is defined as #define __shared__ thread_local static.
I understand, that using the HIP_DYNAMIC_SHARED macro would solve the issue, but as written in the description of the HIP-CPU library, it "allows CPUs to execute unmodified HIP code", and the extern __shared__ int dyn_shmem[]; is now correct HIP code.

Problems to compile

Hello! We are facing problems compiling HIP-CPU and I am wondering if someone could help us out. Maybe we are missing something...

Trying to compile in a machine with Intel CPUs and AMD GPUs. We have HIP 4.4.21401-bedc5f61 (the latest one we could install), TBB 2018, GCC 7.2.0 and CMake 3.16.4.

When we run the following commands:

mkdir build
cd build
cmake .. -DCMAKE_CXX_FLAGS="$(hipconfig --cpp_config)"
make install

This is what we get:

Scanning dependencies of target bit_extract
[ 1%] Building CXX object examples/bit_extract/CMakeFiles/bit_extract.dir/bit_extract.cpp.o
/u/cruzeiro/hip-cpu/examples/bit_extract/bit_extract.cpp: In function ‘void bit_extract_kernel(uint32_t*, const uint32_t*, size_t)’:
/u/cruzeiro/hip-cpu/examples/bit_extract/bit_extract.cpp:24:21: error: ‘blockIdx’ was not declared in this scope
size_t offset = blockIdx.x * blockDim.x + threadIdx.x;
^~~~~~~~
/u/cruzeiro/hip-cpu/examples/bit_extract/bit_extract.cpp:24:21: note: suggested alternative: ‘clock’
size_t offset = blockIdx.x * blockDim.x + threadIdx.x;
^~~~~~~~
clock
/u/cruzeiro/hip-cpu/examples/bit_extract/bit_extract.cpp:24:34: error: ‘blockDim’ was not declared in this scope
size_t offset = blockIdx.x * blockDim.x + threadIdx.x;
^~~~~~~~
/u/cruzeiro/hip-cpu/examples/bit_extract/bit_extract.cpp:24:34: note: suggested alternative: ‘clock’
size_t offset = blockIdx.x * blockDim.x + threadIdx.x;
^~~~~~~~
clock
/u/cruzeiro/hip-cpu/examples/bit_extract/bit_extract.cpp:24:47: error: ‘threadIdx’ was not declared in this scope
size_t offset = blockIdx.x * blockDim.x + threadIdx.x;
^~~~~~~~~
/u/cruzeiro/hip-cpu/examples/bit_extract/bit_extract.cpp:24:47: note: suggested alternative: ‘pthread_t’
size_t offset = blockIdx.x * blockDim.x + threadIdx.x;
^~~~~~~~~
pthread_t
/u/cruzeiro/hip-cpu/examples/bit_extract/bit_extract.cpp:25:34: error: ‘gridDim’ was not declared in this scope
size_t stride = blockDim.x * gridDim.x;
^~~~~~~
/u/cruzeiro/hip-cpu/examples/bit_extract/bit_extract.cpp:29:18: error: ‘__bitextract_u32’ was not declared in this scope
C_d[i] = __bitextract_u32(A_d[i], 8, 4);
^~~~~~~~~~~~~~~~
/u/cruzeiro/hip-cpu/examples/bit_extract/bit_extract.cpp:29:18: note: suggested alternative: ‘__restrict_arr’
C_d[i] = __bitextract_u32(A_d[i], 8, 4);
^~~~~~~~~~~~~~~~
__restrict_arr
/u/cruzeiro/hip-cpu/examples/bit_extract/bit_extract.cpp: In function ‘int main()’:
/u/cruzeiro/hip-cpu/examples/bit_extract/bit_extract.cpp:73:9: error: ‘hipLaunchKernelGGL’ was not declared in this scope
hipLaunchKernelGGL(bit_extract_kernel, dim3(blocks), dim3(threadsPerBlock), 0, 0, C_d, A_d, N);
^~~~~~~~~~~~~~~~~~
/u/cruzeiro/hip-cpu/examples/bit_extract/bit_extract.cpp:73:9: note: suggested alternative: ‘hipLaunchKernel’
hipLaunchKernelGGL(bit_extract_kernel, dim3(blocks), dim3(threadsPerBlock), 0, 0, C_d, A_d, N);
^~~~~~~~~~~~~~~~~~
hipLaunchKernel
make[2]: *** [examples/bit_extract/CMakeFiles/bit_extract.dir/bit_extract.cpp.o] Error 1
make[1]: *** [examples/bit_extract/CMakeFiles/bit_extract.dir/all] Error 2
make: *** [all] Error 2

We also tried using hipcc instead of GCC: CXX=hipcc cmake .. -DCMAKE_CXX_FLAGS="$(hipconfig --cpp_config)" . And this is what we get:

Scanning dependencies of target occupancy
[ 20%] Building CXX object examples/occupancy/CMakeFiles/occupancy.dir/occupancy.cpp.o
/u/cruzeiro/hip-cpu/examples/occupancy/occupancy.cpp:54:12: error: no matching function for call to 'hipOccupancyMaxPotentialBlockSize'
HIP_CHECK(hipOccupancyMaxPotentialBlockSize(&mingridSize, &blockSize, multiply, 0, 0));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/u/cruzeiro/hip-cpu/examples/occupancy/occupancy.cpp:11:9: note: expanded from macro 'HIP_CHECK'
if (status != hipSuccess) {
^~~~~~
/opt/rocm-4.5.0/hip/include/hip/hip_runtime_api.h:4473:35: note: candidate function template not viable: no known conversion from 'uint32_t *' (aka 'unsigned int ') to 'int ' for 1st argument
static hipError_t host inline hipOccupancyMaxPotentialBlockSize(int gridSize, int blockSize,
^
/opt/rocm-4.5.0/hip/include/hip/hip_runtime_api.h:3722:12: note: candidate function not viable: no known conversion from 'uint32_t *' (aka 'unsigned int ') to 'int ' for 1st argument
hipError_t hipOccupancyMaxPotentialBlockSize(int gridSize, int blockSize,
^
/opt/rocm-4.5.0/hip/include/hip/hip_runtime_api.h:4526:19: note: candidate function template not viable: no known conversion from 'uint32_t *' (aka 'unsigned int ') to 'int ' for 1st argument
inline hipError_t hipOccupancyMaxPotentialBlockSize(int gridSize, int blockSize,
^
/u/cruzeiro/hip-cpu/examples/occupancy/occupancy.cpp:54:12: error: no matching function for call to 'hipOccupancyMaxPotentialBlockSize'
HIP_CHECK(hipOccupancyMaxPotentialBlockSize(&mingridSize, &blockSize, multiply, 0, 0));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/u/cruzeiro/hip-cpu/examples/occupancy/occupancy.cpp:12:40: note: expanded from macro 'HIP_CHECK'
std::cout << "Got Status: " << status << " at Line: " << LINE << std::endl;
^~~~~~
/opt/rocm-4.5.0/hip/include/hip/hip_runtime_api.h:4473:35: note: candidate function template not viable: no known conversion from 'uint32_t *' (aka 'unsigned int ') to 'int ' for 1st argument
static hipError_t host inline hipOccupancyMaxPotentialBlockSize(int gridSize, int blockSize,
^
/opt/rocm-4.5.0/hip/include/hip/hip_runtime_api.h:3722:12: note: candidate function not viable: no known conversion from 'uint32_t *' (aka 'unsigned int ') to 'int ' for 1st argument
hipError_t hipOccupancyMaxPotentialBlockSize(int gridSize, int blockSize,
^
/opt/rocm-4.5.0/hip/include/hip/hip_runtime_api.h:4526:19: note: candidate function template not viable: no known conversion from 'uint32_t *' (aka 'unsigned int ') to 'int ' for 1st argument
inline hipError_t hipOccupancyMaxPotentialBlockSize(int gridSize, int blockSize,
^
2 errors generated when compiling for gfx803.
make[2]: *** [examples/occupancy/CMakeFiles/occupancy.dir/occupancy.cpp.o] Error 1
make[1]: *** [examples/occupancy/CMakeFiles/occupancy.dir/all] Error 2

Any ideas of what is wrong?

Thank you very much beforehand!

Vinicius Cruzeiro and Michael Miller

Linking to system TBB considered harmful (primarily on Ubuntu 18.04)

I know that HIP-CPU may not share the support matrix of ROCm, however projects that rely on ROCm often do (and HIP-CPU targets these applications). I'm sure you're aware of the sad situation of the parallel STL and Ubuntu 18.04, the gist of the problem being that while installing GCC 8 and 9 sporting parallel STL is simple enough from the ppa:ubuntu-toolchain-r/test PPA, it requires a version of TBB newer than the one provided by the system.

In this case it is painfully clear that simply adding -ltbb to the linker command line will not fly. There are two issues with this:

It doesn't compile.
Making it compile robustly is like walking on broken glass. Adding a non-system lib as a dependency while potentially said lib being installed on the system is tricky. If not using proper CMake detection and relying on SYSTEM BEFORE flags, one must tread carefully.
- For those living the life and surfing the waves of package managers (🙌) installing TBB comes by for eg. saying vcpkg install tbb, which in turn will install a statically built libtbb.a, which with the current build scripts will sit right next to -ltbb on the linker command line, simultaneously linking to an older system shared library and a cutting edge static version. What could go wrong?

Even if HIP-CPU does not intend on supporting this quirky situation, blindly adding -ltbb to the command line (beside being bad form) opens the door to consuming TBB twice, moreover with different versions. (Projects consuming HIP-CPU may want to depend on TBB on their own right as well.)

I would urge trying to detect libstdc++ instead off being on Linux (after all this situation is the consequence of an implementation detail) using CMake's try_run():

Try compiling a <srcfile>. Returns TRUE or FALSE for success or failure in <compileResultVar>. If the compile succeeded, runs the executable and returns its exit code in <runResultVar>.

or using check_cxx_symbol_exists, whichever is easiest. (I don't have a turnkey solution, but we should.) If STL is libc++, only then do

check_cxx_symbol_exists(SOME_LIBCXX_ONLY_SYMBOLNAME "iostream" USING_LIBCXX)
if(USING_LIBCXX)
  if(NOT TARGET TBB::tbb)
    find_package(TBB REQUIRED)
  endif()
  if(NOT TARGET Threads::Threads)
    find_package(Threads REQUIRED)
  endif()
endif(USING_LIBCXX)
target_link_libraries(${PROJECT_NAME}
  INTERFACE
    $<$<BOOL:${USING_LIBCXX}>:
      ${CMAKE_DL_LIBS}
      Threads::Threads
      TBB::tbb
    >
)

(Note: AFAIK having to find and link to Threads::Threads (and perhaps ${CMAKE_DL_LIBS} too) is yet another quirkiness libc++, but don't call me out on that one. It isn't painful to have on non-Linux systems, it's minimal bloat to configuration time and console output on Windows.)

missing hipLaunchKernel() - the version with void **args

It looks like HIP has provided since 2019:

hipError_t  hipLaunchKernel(const void* function_address,
			    dim3 numBlocks, dim3 dimBlocks, void** args,
				size_t sharedMemBytes, hipStream_t stream);

HIP-CPU only provides hipLaunchKernelGGL with variadic macro expansion of kernel arguments.
I want to use the cleaner void **args API provided by CUDA and HIP.
Please implement this component of the HIP API

inconsistent warpSize in host and "device" code

The values of warpSize read from the hipDeviceProps_t variable and the kernel builtin variable warpSize are different, which is very unexpected.

Consider the following HIP program:

#include <cstdio>
#include <hip/hip_runtime.h>


__global__ void print_warp_size_kernel()
{
    printf("DEVICE warpSize = %d\n", warpSize);
}


int main()
{
    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, 0);
    printf("HOST   warpSize = %d\n", prop.warpSize);

    hipLaunchKernelGGL(print_warp_size_kernel, 1, 1, 0, 0);

    hipDeviceSynchronize();

    return 0;
}

compiled using the command

g++ -std=c++17 -I/home/jakub/apps/HIP-CPU/include source.hip.cpp -o program.x -ltbb -pthread

When I run the program, this is the output:

HOST   warpSize = 4
DEVICE warpSize = 64

I don't care whether it is 4 or 64 or whatever power of 2, we should be creating wave-aware code (warpSize independent code) anyway, but this behaviour seems really wrong.

I am using the current master branch of the HIP-CPU library.

Feature request for hipLaunchKernel

Hi there,

I have a library that optimises kernel performance through a brute force search over block size. With ROCM it uses the hipLaunchKernel function to handle kernels with varying number of arguments. I see this function appears to be missing in the CPU library.

Could we please have hipLaunchKernel available in HIP-CPU? Or, at least a way to launch a kernel if we just have the kernel stub function pointer (of type const void*) and an array of kernel arguments (of type void**).

Thankyou!

Kind regards,
Toby

Implement missing function: `hipEventQuery`

Would it be possible to implement hipEventQuery ?

According to the current HIP documentation:

hipError_t hipEventQuery(hipEvent_t event)
Query the status of the specified event. This function will return hipSuccess if all commands in the appropriate stream (specified to hipEventRecord()) have completed.

The documentation goes on to say that

If that work has not completed, or if hipEventRecord() was not called on the event, then hipErrorNotReady is returned.

However this is not what I observe: hipEventQuery() returns hipSuccess for an event that has been just created, before hipEventRecord() was ever called with it.

Wtf is this for?

I don't understand why tf this exists? What's the point, we (non-AMD employees) target HIP-CPU code carefully and it works on GPUs? Or what? What's the benefit for the user?

Types of functions in hip_api.h not consistent with HIP runtime and GCC

For example, in HIP-CPU __ffsll is defined with std::uint64_t:

std::uint32_t __ffsll(std::uint64_t x) noexcept
{
    return hip::detail::bit_scan_forward(x);
}

But in the HIP runtime (CLR) and in GCC it is defined with unsigned long long int:

__device__ static inline unsigned int __ffsll(unsigned long long int input) {
    return ( input == 0 ? -1 : __builtin_ctzll(input) ) + 1;
}

When calling HIP-CPU's implementation of __ffsll with unsigned long long int, it will throw a compiler error:

/workspaces/amd/libraries/rocRAND/library/include/rocrand/rocrand_sobol64.h:178:26: error: call to '__ffsll' is ambiguous
        unsigned int z = __ffsll(~x);
                         ^~~~~~~
/workspaces/amd/libraries/rocRAND/build/deps/hip-cpu/include/hip/hip_api.h:95:15: note: candidate function
std::uint32_t __ffsll(std::int64_t x) noexcept
              ^
/workspaces/amd/libraries/rocRAND/build/deps/hip-cpu/include/hip/hip_api.h:101:15: note: candidate function
std::uint32_t __ffsll(std::uint64_t x) noexcept

In the above code block, x is defined as unsigned long long int x.

This makes HIP-CPU not 100% compatible with otherwise valid HIP (GPU) code.

Should I make a PR for this or does this require some discussion first?

Support for building hip-tests

Would it be possible to add some kind of support for building the official (?) HIP tests at https://github.com/ROCm/hip-tests/ with HIP-CPU ?

Newer TBB variants incompatible with libstdc++ 9 and 10

See: https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-threading-building-blocks-release-notes.html. We need to keep this on the radar and work around it (it's already breaking in MacOS with latest brew-ed TBB).

CMakeLists.txt references moodycamel which has been removed

moodycamel was removed in 5eeb1b3 but this change was not reflected in CMakeLists.txt. We came upon this issue while building rocRAND with HIP CPU:

In hip-cpu-download/hip-cpu-download-prefix/src/hip-cpu-download-stamp/hip-cpu-download-install-err.log

CMake Error at cmake_install.cmake:81 (file):
  file INSTALL cannot find
  "/workspaces/amd/libraries/rocRAND/build/hip-cpu-src/external/moodycamel":
  No such file or directory.

After removing the relevant section in the CMake script it works again.

libco #defines conflict with Qt

These 2 symbols are colliding, is there a "nice" way to put libco in its own namespace? (I guess not since it's C89)

/opt/HIP-CPU/external/libco/settings.h:99: note: macro "section" defined here
   99 |   #define section(name) __attribute__((section(#name)))
      |
In file included from /opt/Qt/5.12.6/gcc_64/include/QtCore/QString:1,
                 from ../modules/Writer.cpp:13:
/opt/Qt/5.12.6/gcc_64/include/QtCore/qstring.h:376:104: error: macro "section" passed 4 arguments, but takes just 1
  376 |     QString section(const QString &in_sep, int start, int end = -1, SectionFlags flags = SectionDefault) const;

Missing API function: hipMallocPitch & hipMemcpy2D

According to HIP API docs, hipMallocPitch and related hipMemcpy2D are fully supported by HIP.

But when I build the HIP codes with HIP-CPU, the errors occur on hipMallocPitch and hipMemcpy2D, which are not declared in HIP-CPU runtime library.

The minimal code is here:

#include <hip/hip_runtime_api.h>

int main()
{
  int row = 10, col = 10;
  float h_ptr[row][col];
  float *d_ptr;
  size_t pitch;

  hipMallocPitch(&d_ptr, &pitch, col*sizeof(float), row);
  hipMemcpy2D(d_ptr, pitch, h_ptr, col*sizeof(float), col*sizeof(float), row, hipMemcpyHostToDevice);

  return 0;
}

After looking through the src/include/hip/detail/api.hpp & include/hip/hip_api.h, I found hipMemcpy2DAsync is the only implemented function related to the problem.

Since we could use the trivial way to implement those pitched memory in CPU simulation, could we just add the wrap for hipMallocPitch and hipMemcpy2D in HIP-CPU to avoid compiler errors? I'd be glad to contribute for this.

warp shuffle functions behave incorrectly

Consider the following HIP program:

#include <cstdio>
#include <hip/hip_runtime.h>



__global__ void my_kernel(int * data_in, int * data_out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data_out[idx] = __shfl_down(data_in[idx], 16);
}



int main()
{
    int count = 1024;

    int * data_in;
    int * data_out;
    hipMallocManaged((void**)&data_in, count * sizeof(int));
    hipMallocManaged((void**)&data_out, count * sizeof(int));

    for(int i = 0; i < count; i++)
    {
        data_in[i] = i;
        data_out[i] = -1;
    }
    
    printf("Input:");
    for(int i = 0; i < count; i++)
    {
        if(i % 32 == 0)
            printf("\n");
        printf("%5d ", data_in[i]);
    }
    printf("\n");

    int tpb = 256;
    int bpg = count / tpb;
    hipLaunchKernelGGL(my_kernel, bpg, tpb, 0, 0, data_in, data_out);
    hipDeviceSynchronize();

    printf("Output:");
    for(int i = 0; i < count; i++)
    {
        if(i % 32 == 0)
            printf("\n");
        printf("%5d ", data_out[i]);
    }
    printf("\n");

    printf("Diff:");
    for(int i = 0; i < count; i++)
    {
        if(i % 32 == 0)
            printf("\n");
        printf("%5d ", data_out[i] - data_in[i]);
    }
    printf("\n");

    hipFree(data_in);
    hipFree(data_out);

    return 0;
}

The only thing the program does, is it shifts the values from the input buffer and stores the shifted data to the output buffer, using warp shuffle function __shfl_down.

Compiling it using hipcc and running it on the GPU produces expected results, the data in each warp are shifted by 16 values, except for the last 16 values in the warp, which maintained their original value.

But using the HIP-CPU library, the results are incorrect. It seems that the warp shuffle is performed only in the first warp of each threadblock.
The function __shfl_up has similarly incorrect behaviour, and even weirder.
__shfl_xor seems totally wrong, it works as if just __shfl was used, forgetting about the xor.
__shfl behavior seems ok.

Note that I am not assuming any warpSize, if the warpSize was anything, the output would still be wrong.

Compilation of the HIP-CPU program was performed using the command

g++ -g -O2 -std=c++17 -I/home/jakub/apps/HIP-CPU/include source.hip.cpp -o program.x -ltbb -pthread

I am attaching the outputs of the programs. The GPU runs were on an AMD GPU.
out_down_hipCpu.txt
out_down_hipGpu.txt
out_up_hipCpu.txt
out_up_hipGpu.txt
out_xor_hipCpu.txt
out_xor_hipGpu.txt

Missing API function: hipMemcpyToSymbol(const T& symbol, ...)

While it is undocumented, hip_runtime_api.h has an overload to hipMemcpyToSymbol that takes a reference to a variable, instead of a pointer to one (or it's name). This is likely a facilty to not have to use the HIP_SYMBOL which does nothing, just prepends the addressof operator (but only for the nvcc back-end, the macro for the HCC/AMD back-end just repeats the var name verbatim, effectively relaying to the missing API function.)

We came across this issue while porting user code to using HIP-CPU.

Compilation Warning resulting from non-trivial half precision type (MacOS build, homebrew gcc-10)

Hi,
As discussed in my MacOS documentation PR, the compilation gives a warning related to using memcpy to copy a non-trivial type used for half precision

/Users/36j/Devel/HIP-CPU/HIP-CPU/examples/bit_extract/../../include/hip/hip_fp16.h:898:43:   required from here
/Users/36j/Devel/HIP-CPU/HIP-CPU/examples/bit_extract/../../include/hip/../../src/include/hip/detail/helpers.hpp:46:32: warning: 'void* memcpy(void*, const void*, size_t)' copying an object of non-trivial type 'class half_float::half' from an array of 'const short int' [-Wclass-memaccess]
   46 |                     std::memcpy(&r, &v, sizeof(T));
      |                     ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~

This appearst to be known (and is benign) I am just noting it here for later cleanup.

Clang on Windows: 't' cannot be thread local when declared 'dllexport'

Trying to use HIP-CPU on Windows using Clang results in a compiler error:

deps/hip-cpu/include\hip/../../src/include/hip/detail/runtime.hpp:180:55: error: 't' cannot be thread local when declared 'dllexport'
                static thread_local std::vector<Task> t;
                                                      ^

Note to "self": change hip/hip_defines.h from

#if defined(_WIN32)

#if defined(_WIN32) && (!defined(__clang__))

solves this issue. Likely a more sophisticated mechanism will be needed instead of this hotfix.

HIP-CPU targets C++17 compatibility but uses <version> (which is a C++20 feature)

Currently, a lot of headers include <version> to check for availability of parallel algorithms. However, this header is not available in C++17, but only from C++20 onwards.

Is it possible to build MIOpen on HIP-CPU?

Hi all,

does anyone try on this target in order to run tensorflow etc. on HIP-CPU?

[Feature]: Cooperative groups API support

Suggestion Description

Currently HIP code that includes <hip/hip_cooperative_groups.h> cannot be compiled using HIP-CPU because support is currently missing.

Operating System

No response

GPU

No response

ROCm Component

No response

Missing hipMemsetAsync and hipMallocAsync

Title says it all. Would be awesome if these were implemented.

Support for AddressSanitizer

Since HIP-CPU uses libco fiber mechanism to mimic GPU thread behaviour, the stack gets constantly re-written and that leads to ASan reporting false positives all over the place.

Seems like there is a way to annotate fibers that other coroutine libraries use, see this, this or this.

Maybe the solution would be to have different backends for other coroutine libraries? Eventually we would want to use the std version, if that makes sense.

Missing hipRuntimeGetVersion() in hip/hip_runtime.h

hipError_t hipRuntimeGetVersion(int* runtimeVersion) is not defined in HIP-CPU runtime.

See: https://github.com/ROCm-Developer-Tools/HIP/blob/d62eaeeda560c0c33d1e58fe9e44d7fd28220ac1/include/hip/hip_runtime_api.h#L1377

<version> header requires C++ 20

Hi,
It looks like the <version> header included in hip_runtime.h is a C++ 20 thing. I tried to build with:
env CXX="/usr/tce/packages/clang/clang-10.0.1/bin/clang++ -std=c++17" cmake ..
And got the error:

In file included from /g/g20/nissen5/proj/HIP-CPU/examples/bit_extract/bit_extract.cpp:5:
/g/g20/nissen5/proj/HIP-CPU/examples/bit_extract/../../include/hip/hip_runtime.h:7:10: fatal error: 'version' file not found
#include <version>
         ^~~~~~~~~
1 error generated.

I'm just playing around for a hackathon today so I can use whatever compiler I like, but if you're targeting C++ 17 you may want to use something else. Thanks.

__shfl() operation lanes don't wrap around

__shfl() segfaults when being fed a negative lane id. While HIP on it's own right doesn't define what the behavior of __shfl() should be in this case, the CUDA docs do:

If srcLane is outside the range [0:width-1], the value returned corresponds to the value of var held by the srcLane modulo width (i.e. within the same subsection).

HIP follows this behavior as it relays to __builtin_amdgcn_ds_bpermute described in the Vega ISA docs in the case of width == 64 under opcode 63:

// VGPR[index][thread_id] is the VGPR RAM
// VDST, ADDR and DATA0 are from the microcode DS encoding tmp[0..63] = 0 for i in 0..63 do
// ADDR needs to be divided by 4.
// High-order bits are ignored. src_lane = floor((VGPR[ADDR][i] + OFFSET) / 4) mod 64

In practice, the code in question which triggered the segfault issued a __shfl() with negative source lane id but then discards the result (doesn't commit it to memory, but saves the branching for checking if said operation is valid in the first place). Because CUDA and HIP handle this is device code, I'd lean toward the code being valid (ugly nontheless).

HIP-CPU should take a stand and define some kind of behavior that doesn't segfault.

(Note that the issue is slightly graver, because HIP in general doesn't define any behavior for these operations. According to the RDNA ISA docs p. 211, the wrapping behavior for the permute operations are still based on 64, even though the underlying arch uses 32 wide wavefronts. The underdefined nature of __shfl() in HIP, while following CUDA behavior when using GCN/CDNA derivates but deviating for RDNA derivates is going to bite hard in the future.)

Help on understand processor() initialization and threadIdx iteration

I am having difficulty understanding how the processor_ function interacts with the co_thread and how threadIdx dimensions are used. Is there a thread pool created? Where is the processor_ function called? How is thread initialized at the start in the processor_()?

I see how the blocks are iterated in the Tile.hpp but how are threadIdx iterated?

Internal compiler error building hip-cpu with gcc 12.3

Using the current version of hip-cpu (1bf89aa Merge pull request #49 from Naraenda/remove-moodycamel-cmake) with GCC 12.3 on Ubuntu 22.04 results in an internal compiler error.

This can be seen with the hip-cpu tests:

$ git log -n1 --oneline
1bf89aa Merge pull request #49 from Naraenda/remove-moodycamel-cmake

$ mkdir build

$ cd build

$ cmake .. -DCMAKE_CXX_COMPILER=g++-12 -L
-- The CXX compiler identification is GNU 12.3.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/g++-12 - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Looking for __PSTL_PAR_BACKEND_TBB
-- Looking for __PSTL_PAR_BACKEND_TBB - not found
-- Looking for _PSTL_PAR_BACKEND_TBB
-- Looking for _PSTL_PAR_BACKEND_TBB - found
-- Configuring done
-- Generating done
-- Build files have been written to: /home/fwyzard/src/ROCm/hip-cpu/build
-- Cache values
BUILD_TESTING:BOOL=ON
CMAKE_BUILD_TYPE:STRING=
CMAKE_INSTALL_PREFIX:PATH=/usr/local
TBB_DIR:PATH=/usr/lib/x86_64-linux-gnu/cmake/TBB
hip_cpu_rt_BUILD_EXAMPLES:BOOL=ON
hip_cpu_rt_BUILD_TESTING:BOOL=ON
hip_cpu_rt_INSTALL_LIBRARY:BOOL=ON

$ cmake --build .
[  1%] Building CXX object examples/bit_extract/CMakeFiles/bit_extract.dir/bit_extract.cpp.o
In file included from /usr/include/c++/12/pstl/parallel_backend_tbb.h:26,
                 from /usr/include/c++/12/pstl/parallel_backend.h:20,
                 from /usr/include/c++/12/pstl/algorithm_impl.h:22,
                 from /usr/include/c++/12/pstl/glue_execution_defs.h:50,
                 from /usr/include/c++/12/execution:32,
                 from /home/fwyzard/src/ROCm/hip-cpu/src/include/hip/detail/tile.hpp:21,
                 from /home/fwyzard/src/ROCm/hip-cpu/src/include/hip/detail/coordinates.hpp:12,
                 from /home/fwyzard/src/ROCm/hip-cpu/src/include/hip/detail/grid_launch.hpp:11,
                 from /home/fwyzard/src/ROCm/hip-cpu/src/include/hip/detail/api.hpp:15,
                 from /home/fwyzard/src/ROCm/hip-cpu/include/hip/hip_api.h:14,
                 from /home/fwyzard/src/ROCm/hip-cpu/include/hip/hip_runtime.h:14,
                 from /home/fwyzard/src/ROCm/hip-cpu/examples/bit_extract/bit_extract.cpp:5:
/usr/include/tbb/task.h:21:139: note: ‘#pragma message: TBB Warning: tbb/task.h is deprecated. For details, please see Deprecated Features appendix in the TBB reference manual.’
   21 | #pragma message("TBB Warning: tbb/task.h is deprecated. For details, please see Deprecated Features appendix in the TBB reference manual.")
      |                                                                                                                                           ^
/home/fwyzard/src/ROCm/hip-cpu/src/include/hip/detail/grid_launch.hpp: In instantiation of ‘void hip::detail::launch(const Dim3&, const Dim3&, uint32_t, Stream*, F, std::tuple<_UTypes ...>) [with F = launch_kernel_from_so(Function*, uint32_t, uint32_t, uint32_t, uint32_t, uint32_t, uint32_t, uint32_t, Stream*, void**, void**)::<lambda()>; Args = {}; uint32_t = unsigned int; Stream = Flat_combiner<std::vector<std::packaged_task<void(bool&)> > >]’:
/home/fwyzard/src/ROCm/hip-cpu/src/include/hip/detail/api.hpp:691:19:   required from here
/home/fwyzard/src/ROCm/hip-cpu/src/include/hip/detail/grid_launch.hpp:40:17: internal compiler error: Segmentation fault
   40 |                 ts.emplace_back(
      |                 ^~
0x7ff84904251f ???
        ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
0x7ff849029d8f __libc_start_call_main
        ../sysdeps/nptl/libc_start_call_main.h:58
0x7ff849029e3f __libc_start_main_impl
        ../csu/libc-start.c:392
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See <file:///usr/share/doc/gcc-12/README.Bugs> for instructions.
gmake[2]: *** [examples/bit_extract/CMakeFiles/bit_extract.dir/build.make:76: examples/bit_extract/CMakeFiles/bit_extract.dir/bit_extract.cpp.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:1138: examples/bit_extract/CMakeFiles/bit_extract.dir/all] Error 2
gmake: *** [Makefile:166: all] Error 2

Do some template magic so we don't have to cast to void**

The following example line of code

hipMalloc(&d_x, count * sizeof(float));

fails to compile (using g++ 9.4.0) with error

saxpy.hip.cpp:46:15: error: invalid conversion from ‘float**’ to ‘void**’ [-fpermissive]
   46 |     hipMalloc(&d_x, count * sizeof(float));
      |               ^~~~
      |               |
      |               float**

I know the fix is to just cast the pointer,

hipMalloc((void**)&d_x, count * sizeof(float));

but who likes that?
CUDA and HIP don't require the cast, so neither should HIP-CPU. In the classic HIP (include/hip_runtime_api.h right at the bottom in 4.3) there is the function

template <class T>
static inline hipError_t hipMalloc(T** devPtr, size_t size) {
    return hipMalloc((void**)devPtr, size);
}

so please do something similar in HIP-CPU too, for all similar functions.
If there are any cons to this, please explain or point me to an explanation.

Thanks,
Jakub Homola

Edit: I am using the current HIP-CPU master branch, changelog says version 0.1.4142 from December 2020

[MacOS] legacy_async_memcpy test appears to hang ( homebrew build, gcc-10)

Ctest test 23: legacy_async_memcpy appears to hang on MacOS when built with gcc-10 from homebrew -- MacOS Catalina (10.15.7) - similar behaviour is seen on MacOS CI where this test times out along with test 27: legacy_memcpy

Provide FindHIP.cmake, HIPConfig.cmake, hip-config.cmake to aid building other HIP Libraries

HIP-CPU supplies

./share/hip_cpu_rt/cmake/hip_cpu_rtConfig.cmake
./share/hip_cpu_rt/cmake/hip_cpu_rtTargets.cmake
./share/hip_cpu_rt/cmake/hip_cpu_rtConfigVersion.cmake

but some ecosystem libraries (e.g. hipCUB) look for FindHIP.cmake, HIPConfig.cmake or hip-config.cmake. Can one do a direct symlink of e.g. hip_cpu_rtConfig.cmake to hip-config.cmake to help build these libraries, or can an appropriate FindHIP/HIPConfig/hip-config.cmake be provided?

Missing some warp cross lane functions

According to HIP programming guides, Warp Cross Lane Functions are well supported in HIP. But I couldn't build the HIP code with some of these wrap functions e.g. int __all(int predicate) and int __any(int predicate) using HIP-CPU library.

Since those APIs are widely used in warp-level programing, I was wondering if there are some plans for these functions, __all and __any (__ballot is already implemented in HIP-CPU).

SIMD related warnings for half precision types (MacOS build, homebrew gcc-10)

A variety of compilation warnings come from hip/hip_fp16.h, similar in nature to:

/Users/36j/Devel/HIP-CPU/HIP-CPU/tests/../include/hip/hip_fp16.h:467:9: warning: unsupported return type '__half2' {aka 'hip::detail::Vector_type<half_float::half, 2>'} for simd
  467 | __half2 __hsub2(__half2 x, __half2 y) noexcept
      |         ^~~~~~~

This is benign but noisy and is known, I am noting it here for later cleanup

amd64.inl triggers warning C4706

Using HIP-CPU with MSVC and compiling with /W4 triggers the following warning within the library sources when consuming HIP-CPU:

external\libco\amd64.inl(134) : warning C4706: assignment within conditional expression

If the behavior really is intended (I couldn't tell at a first glance), please suppress warning locally.

Since the library partly aims at helping debugging device code, it were nice if I could get all the help I can get from the compiler (/W4) and the library wouldn't increase diagnostic noise. I know it isn't easy compiling cleanly under /W4 but it is much appreciated.

[Feature]: missing synchronisation functions

Suggestion Description

The CUDA/HIP synchronisation with predicate functions are missing:

int __syncthreads_count(int predicate);
int __syncthreads_and(int predicate);
int __syncthreads_or(int predicate);

Operating System

Any

GPU

Any CPU

ROCm Component

Kernel language

HIP-CPU for other CPU architectures?

Hello,

I have a few questions about HIP-CPU. First, is the intent for HIP-CPU to only support AMD CPUs, or will the project accept upstream efforts to run this on other architectures (ARM, RISC-V, etc)?

Also, what does it use to parallelize? From what I can see, it doesn't use OpenMP, so does it do it all itself with pthreads or something similar? According to this, ROCm can use OpenMP as the "backend" so perhaps this is how HIP-CPU operates?
https://docs.amd.com/en/docs-5.3.0/reference/openmp/openmp.html

Thank you!

rocm / hip-cpu Goto Github PK

hip-cpu's Introduction

HIP CPU Runtime

Repository | Issues | Documentation | Code Samples

Overview and Tutorials

Quick Links

Questions and Feedback

Contribution

hip-cpu's People

Contributors

Stargazers

Watchers

Forkers

hip-cpu's Issues

Suggestion Description

Operating System

GPU

ROCm Component

Suggestion Description

Operating System

GPU

ROCm Component

Recommend Projects

Recommend Topics

Recommend Org