jaredhoberock / bulk
Launching collective tasks in bulk
We have to manually zero-init each member of the struct; otherwise `-Wextra` complains.
`destructive_reduce_n` exploits commutativity, so `bulk::accumulate` cannot use it.
We can get into a situation where a kernel has very little occupancy to begin with (i.e., `occupancy == 1`). In this case, we should not attempt to bump up occupancy, because the result would be a kernel with no occupancy at all.
The expression for `spine_n` needs to account for grainsize:

```cpp
const size_type spine_n = (n >= g.size() * g.this_exec.grainsize())
                            ? g.size()
                            : (n + g.this_exec.grainsize() - 1) / g.this_exec.grainsize();
```
We could eliminate the use of the buffer in `small_inplace_exclusive_scan_with_buffer`. The buffer doesn't improve performance, and eliminating it would make it possible to implement a bounded version of scan which doesn't stage data and doesn't require temporary storage.
Something like Sean's LaunchBox. Merge sort requires significantly different tunings for Fermi & Kepler.
Causes problems for nvcc's sm_1x code generation. It seems to work serendipitously for the cases we test, but not in general.
On a 32-bit system,

```cpp
struct block
{
  size_t size;
  block *prev;
  size_t is_free;
};
```

is 12 bytes. That means allocations for, e.g., `double` will be misaligned.
It makes the expression's type ambiguous. Just use an if/else here.
If we write a specialized `merge_and_indices` function instead of using `merge_by_key` and a `counting_iterator`, we can improve the performance a bit.
Right now it accumulates into `input_type`.
We'll need to cast to the appropriate types at the triple chevron.
Split up the `zip_iterator`s used in the `scatter_if` for `CUDA_ARCH < 200`.
Needs an additional barrier after the final call to `binary_op`.
Hi!
First of all, thank you so much for creating this library.
I'm curious what its status is. I've seen the same code in Thrust. Will all upcoming development happen there (with extended support for async)?
When passing element 0 as `init`, we should convert to the `intermediate_type` so as to respect Thrust's semantics. We may want to change this behavior when Thrust gets an `inclusive_scan` which takes an `init` parameter; that would make it easier for the user to control this conversion behavior.
We can replace code like

```cpp
if(i < n)
{
  key = first1[i];
}
```

with

```cpp
key = first1[min(i, n1 - 1)];
```

for a significant performance increase.
Is this project still alive?
There don't seem to have been any major updates in years, and I'm curious whether it even runs with current CUDA versions.
Has this been used at all by anyone recently? Will there be any future development?
Implement this at the same time Thrust does.
Dear Jared, is there any chance of updating the `triple_chevron_launcher` and removing the calls to the deprecated (now removed) CUDA Execution Control API?
We use bulk + thrust in Hydra here at CERN, and would like to keep running it with newer versions of CUDA.
Cheers
Windows doesn't obey the bit fields and creates a 12-byte `block`. I guess we need to encode `is_free` and `size` into a single word manually.
We require an extra register in `for_each` because we maintain both the current agent index, `i`, as well as the parameter, `first`. Each agent increments `first` by its index to find its iterator. That means if parallel agents happen to be implemented in sequence, then over the course of the sequential loop, we have to remember the value of the `first` parameter even though the next agent scheduled in sequence could have gotten its value of `first` from the previous agent.

We could reclaim the register by explicitly targeting a `par(seq)` nesting instead of flat `par`. The idea would be for each parallel agent to find its index and then call `for_each(seq, first + offset, last, f)`. `offset` would be something like `i * grainsize`.
Assigning to a tuple from a ternary operator expression confuses nvcc's pointer space tracking. The workaround is to avoid use of the ternary operator.
Only thread 0 should do this.
Not handling `occupancy == 0` leads to division by zero inside `proportional_smem_allocation`. When `occupancy == 0`, we should return early.
This might improve the performance on small problem sizes.
It should be possible to `#include` a Bulk header without generating a compiler error, even if the compiler is not nvcc. We should guard the use of things like `__syncthreads`, `threadIdx`, etc. from foreign compilers.
This page suggests they were added in gcc 4.3.