nsubtil / lift Goto Github PK

View Code? Open in Web Editor NEW

12.0 5.0 3.0 317 KB

A collection of parallel programming primitives for CPUs and GPUs

License: Other

C++ 87.24% C 1.97% CMake 2.14% Shell 0.24% Cuda 8.41%

lift's Introduction

Lift

Lift is a collection of parallel programming primitives for CPUs and GPUs.

Documentation: https://lift.readthedocs.org/

Google Group: https://groups.google.com/forum/#!forum/lift-parallel-programming

lift's People

Contributors

Stargazers

Watchers

Forkers

vucinicv chuckseberino

lift's Issues

Tests Do Not Build on Mac

thread_local storage is unsupported on some versions(Mavericks at least) of mac.

Clean up suballocator implementation

The current suballocator has a lot of cruft left over from the original implementation that should be cleaned up.

Consider gracefully handling for_each calls with size 0

Algorithms that involve stream compaction can often end up launching for_each calls of size zero. This triggers CUDA runtime errors today, since it generates an invalid launch configuration.

It's not clear what the right thing to do here would be.

One option is to have Lift ignore zero-sized for_each calls, either silently or with run-time warnings (which would probably require some sort of debug or diagnostics infrastructure).

Another option is to leave the code as-is and let the CUDA runtime throw errors. This is not as bad as it seems, since these are very easy to spot in cuda-gdb, but it might be nicer if Lift itself can flag such situations.

Optional (opt-out) automatic temporary storage management for parallel primitives

Having a mechanism where temporary storage for parallel primitives is managed automatically (and efficiently) under the covers by Lift would be very desirable.

This would require per-thread, per-device context to be tracked by Lift.

Enable warnings-as-errors for Travis CI build

The automated build system won't notify when there are warnings in the build. Since we're looking to be free of warnings, it's probably best to enable -Werror for automated builds to catch such issues ahead of time.

Document the various function object types used in parallel

The various functor types used in most parallel primitives should be documented somewhere. At a minimum, they should be doxygenated. It would be desirable to have an actual interface object definition as well, even in the absence of C++ concepts.

Investigate profiling annotations

It would be very useful if Lift could implement a common interface for various profiling annotation APIs (nvtx, vtune tasks).

For CUDA profiling in particular, it can be hard to tell which kernel is running given the extremely long names generated by the compiler for each kernel entry point.

Allow timers to measure throughput

It would be useful to add the ability to measure throughput to Lift timers. This can be accomplished by expanding the existing timer API to keep a counter of the amount of data each step will process (which would have to be optional).

If that doesn't work, a separate class derived from timer that can track throughput is probably the next best option.

Runtime feedback for suballocator

It would be very useful to allow the suballocator to log what is happening in order to get a feel for how well it's behaving on any given workload.

There is existing code to do that by printing messages to the console (currently disabled). This would be a useful starting point, though we may need to add more information to the output.

CUB sorting is broken

sort_test_cuda is showing failures as of commit dad6254

Expose random number generator in test harness

A number of tests based on Lift currently use their own built-in RNGs for simplicity and consistency across platforms. We should implement a single version in Lift.

Support building with CMake 3.0

Debian Jessie shipped with CMake 3.0. This seems like a reasonable target to support.

Lift should be able to build with CMake 3.0. Right now it demands 3.2. Need to investigate what (if any) changes are required to lower the version requirements.

Clean up declaration of parallel::copy_if

Similar to #12, copy_if and copy_flagged could use some cleanup as well. It's unclear whether input and output should be different types and the predicate in copy_if needs documenting.

Additionally, it would be good to provide (Lift) pointer versions with implicit size.

Add example code

Lift could use a few examples to document how the library should be used.

pointer::range() is confusing

The way pointer::range() works is confusing --- range should take a start and end index instead.

Add "large" memory containers for 64-bit addressing

All Lift memory containers and pointers use 32-bit addresses by default. We should add "large" versions of the same containers to perform 64-bit arithmetic instead.

Note that this implies modifying certain return types. E.g., a large_allocation + offset must yield a large_pointer, not a 32-bit pointer as it does today.

GPU Launch parameter calculation oddities

The current launch parameter calculation code has a tendency to fill up blocks rather than queuing up larger numbers of smaller blocks. In many instances this seems to lead to worse performance.

It would be best to try to come up with a strategy that tries to keep block size manageable. The usual rule of thumb of 128 threads/block being a good number for most cases would probably apply here.

Add unit tests

Catch-all issue to document and track work required to add unit tests for Lift.

Move to CUB 1.5.x

Clean up the declaration of parallel::inclusive_scan

Three clean-up items for inclusive_scan:

Evaluate whether parallel::inclusive_scan can really handle different InputIterator and OutputIterator types

It's not immediately obvious in which cases it's useful to have different input and output data types, nor is it obvious how that affects the implementation of the scan predicate function.

The right thing to do might be to just have a single data type for both input and output. Either we need to check that the underlying type for InputIterator and OutputIterator is the same, or (preferably) we need to modify inclusive_scan to only accept the same input and output iterator data types.

Rename and document the Predicate argument

Predicate is really a binary operator. It would be good to rename it. This ties in with #13.

Implement a Lift pointer version with implicit size

'make lift-run-tests' from client projects doesn't work

It seems the path to the generated liftest binary is wrong when Lift is used as a nested CMake project.

Better error messages when passing a non-pointer into a method expecting a pointer

For code like this:

struct s my_var;
pointer<system, uint32> p = my_var;

lift generates a somewhat obscure syntax error. It would be nice to generate a better message via compile-time checks and static_assert() that checks that the input for operator=() (or other similar entrypoints) is indeed a pointer type;

Tune suballocator for real-world usage

The existing suballocator parameters are not tested on real-world cases and are likely ineffective. This issue tracks the effort to improve this situation.

One likely outcome is that we may split allocation sizes into buckets that are handled by different suballocator objects.

Reconsider allowing timers to be copied

Lift GPU timers are "very stateful" and have a destructor with lots of side effects. This is proving problematic in several use cases. Firepony in particular seems to rely on host timers being copiable, which has forced a corresponding and hopefully temporary change in Lift to allow host timers to be copied (cuda timers remain uncopiable).

We should reconsider whether this is the best approach.

One possible alternative would be to rely on the Lift context to track timers and keep all non-POD data on that side, turning the timer object itself into a shallow handle to the actual timer. This also has the advantage of allowing CUDA timer objects to be pooled and reused on a per-GPU basis. The drawbacks are potential performance issues with starting/stopping timers, depending on how the timer vs. state approach is designed (start/stop may have to do an expensive names array lookup).

Clean up static pointer compatibility check code

Most of lift/memory/type_assignment_checks.h is no longer relevant. It is also inside a namespace that no longer makes sense to preserve.

Disallow copy-constructor for scoped_allocation

I've seen code written where functions return a scoped_allocation that was defined inside the function itself. This is obviously dangerous and can never work.

We should investigate whether there's some way of disallowing this without removing the ability to do implicit conversion to allocation or pointer.

Add "benchmark mode" to test harness

We'll need to track performance across a variety of operations. The test harness needs the ability to run benchmarks in addition to tests.

A benchmark would return some sort of performance metric, which would have to be output by the test harness in some way. The obvious performance metric to track is throughput, though we might need different metrics for different tests, so it would be useful to allow the test to specify what metric is being output.

Allow Lift projects to build without nvcc

It can be desirable to allow code that uses Lift to be built with the host compiler directly. The obvious use case is a CPU-only project, but it can be beneficial for GPU projects as well to avoid the overhead of compiling through nvcc.

This will require some surgery to isolate everything related to Thrust from the host-only build.

Command line parser for test harness

The test harness functionality desperately needs a command line parser. We need to be able to do things such as:

Run one specific test by name
Run a set of tests (see #35)
Run tests from a pre-defined list
Enable/disable compute devices on the fly (e.g., run the CPU path only, or pick one specific GPU out of a multi-GPU setup)

The exact set of features that we'll implement on top of this is TBD, but they all require some way to receive runtime parameters, hence the need for a command line parser of some sort.

pointer::peek() should be const

pointer::peek() is non-const, but is not supposed to allow modifying anything behind the pointer. It should be made const.

Avoid duplicate header names

Lift currently lift/test.h and lift/test/test.h header files. This arrangement is somewhat confusing.

In a more general sense, having duplicate header file names in different directories is probably best avoided.

Test harness should check that test names are unique when registering a new test

Ensure that the vector of test names contains only unique names

Hide generated test object names from the namespace

The current LIFT_TEST_FUNC* macros generate objects whose names are directly based on the test name argument. This easily leads to naming conflicts.

We need to make these object names less likely to clash with any other names in the same namespace. Adding some sort of atypical prefix (e.g, '__generated_test_object' or something along those lines) would probably suffice.

parallel API inconsistencies

This issue attempts to document the current inconsistencies in the lift::parallel API.

Inconsistent usage of iterators vs. pointers

Some of the parallel operators take as input abstract iterator types. Others require pointers as input. It would be nice. The current status seems to be:

sort/sort-by-key (pointers)
reduce-by-key (pointers or iterators)
run-length-encode (iterators only)
copy-flagged/copy-if (iterators only)
inclusive-scan (iterators only)
for-each (pointers or iterators)

Inconsistent usage of first+last iterator vs first iterator + length

Similar to the issue above, some operators take in first and last iterators while others take in first iterators + a numerical length. This should be made uniform.

Inconsistent parameter order

By convention, temp_storage is usually the last parameter. Except for reduce_by_key, where it isn't.

Unclear behavior for pointer version of reduce-by-key

The output buffers for the pointer version of reduce-by-key are resized to match the input size, even though the output size is expected to be smaller. The original reasoning behind this was to avoid reallocations of client memory which might be reused across calls to reduce-by-key, but this seems confusing. It might be best to leave allocations up to the caller entirely.

Any change here will probably have to be reflected in other operators that implement similar behavior.

Template argument naming

run-length-encode calls it's key iterator InputIterator. reduce-by-key calls it KeyIterator. This seems confusing and unnecessarily inconsistent.

run-length-encode is confusing

run-length-encode seems to assume that the input and output key types are different. They probably can't be.

Because it outputs a well-defined quantity (integral length of the runs), it probably makes sense to restrict this to integral quantities if possible.

Allow different test categories in test harness

We need the ability to isolate tests into different sets within the test harness. The idea is to enable users to run only CPU tests or only "fast" tests during a run.

We'll need to hash out what this should look like, given that client code will need to define their own sets of tests.

Thrust include path insanity

FindCUDA.cmake seems to have an incredibly strong desire to add the CUDA toolkit include path to the very beginning of the list no matter what. This makes it hard to point both Lift as well as Lift client code at Lift's version of the Thrust toolkit.

This needs to be fixed somehow. Alternatives include:

Forcing client code to include a .cmake file provided by Lift after calling find_package(CUDA). This would sanitize the CUDA_INCLUDE_DIRS (e.g., point it at "/this-path-does-not-exist") to force the CUDA toolkit include path to come last in the generated command line
Give up on packaging Thrust and use the toolkit version.
Ship a custom version of FindCUDA and force client code to use it.

std::fpclassify won't build under nvcc

Attempting to use std::fpclassify on code built through nvcc seems to generate errors like this:

/usr/include/c++/4.9/cmath(562): error: calling a host function("builtin_fpclassify") from a __device function("std::fpclassify") is not allowed

This happens even though the caller of std::fpclassify is undecorated (i.e., host only).

Reuse CUDA event objects

The timer class creates and destroys CUDA events every time it's started / stopped. CUDA events should be reused instead.

Const Lift pointers do not convert well to Thrust iterators

Calling t_begin() on a const Lift pointer seems to return a const Thrust iterator to a non-const data type, which the compiler rightfully complains about.

The likely cause is that the const_thrust_iterator type is missing the const qualifier on the data type.

METHOD_INSTANTIATE triggers nvcc warnings

Usage of the current implementation of METHOD_INSTANTIATE yields warnings from nvcc:

/Users/rapososn/snail/oc_calibration/oc_calibration.cu(35): warning: nonstandard conversion of pointer-to-member to a function pointer

Define the list of supported CUDA toolkit versions

It would be useful to support more than one version of the CUDA toolkit. Currently Lift seems to build with CUDA 7.0 and 7.5, but only 7.5 is actively tested.

This issue tracks the effort of creating a list of suitable CUDA toolkit versions and implementing automated build/test environments to track all supported versions.

Improve documentation on inclusive_scan

I was aware of what inclusive_scan was doing, but I was not sure how to implement/create a predicate object/functor based on the current documentation. It would be helpful to have improved documentation on inclusive_scan

CPU atomic path triggers warnings

Seen in Travis builds:

/home/travis/build/nsubtil/lift/lift/atomics_host.inl: In static member function static float lift::atomics<system>::add(float*, float) [with lift::target_system system = (lift::target_system)0u]: /home/travis/build/nsubtil/lift/lift/atomics_host.inl:56:111: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] __atomic_compare_exchange((volatile uint32 *)address, ^ /home/travis/build/nsubtil/lift/lift/atomics_host.inl:61:29: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] if (*((uint32 *)&expected) == *((uint32 *)&oldval)) ^ /home/travis/build/nsubtil/lift/lift/atomics_host.inl:61:57: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing] if (*((uint32 *)&expected) == *((uint32 *)&oldval)) ^

https://travis-ci.org/nsubtil/lift/jobs/110036115