jaredhoberock / bulk
Launching collective tasks in bulk
We have to manually zero-init each member of the struct; otherwise `-Wextra` complains.
`destructive_reduce_n` exploits commutativity, so `bulk::accumulate` cannot use it.
We can get into a situation where a kernel has very little occupancy to begin with (i.e., `occupancy == 1`). In this case, we should not attempt to bump up occupancy, because the result would be a kernel with no occupancy at all.
The expression for `spine_n` needs to account for grainsize:

```cpp
const size_type spine_n = (n >= g.size() * g.this_exec.grainsize())
                            ? g.size()
                            : (n + g.this_exec.grainsize() - 1) / g.this_exec.grainsize();
```
We could eliminate the use of the buffer in `small_inplace_exclusive_scan_with_buffer`. The buffer doesn't improve performance, and eliminating it would make it possible to implement a bounded version of scan which doesn't stage data and doesn't require temporary storage.
Something like Sean's LaunchBox. Merge sort requires significantly different tunings for Fermi & Kepler.
Causes problems for nvcc's sm_1x code generation. It seems to work serendipitously for the cases we test, but not in general.
On a 32-bit system,

```cpp
struct block
{
  size_t size;
  block *prev;
  size_t is_free;
};
```

is 12 bytes. That means allocations for, e.g., `double` will be misaligned.
It makes the expression's type ambiguous. Just use an if/else here.
If we write a specialized `merge_and_indices` function instead of using `merge_by_key` and a `counting_iterator`, we can improve the performance a bit.
Right now it accumulates into `input_type`.
We'll need to cast to the appropriate types at the triple chevron.
Split up the `zip_iterator`s used in the `scatter_if` for `CUDA_ARCH < 200`.
Needs an additional barrier after the final call to `binary_op`.
Hi!
First of all, thank you so much for creating this library.
I'm curious what its status is. I've seen the same code in Thrust. Will all upcoming development happen there (with extended support for async)?
When passing element 0 as `init`, we should convert to the `intermediate_type` so as to respect Thrust's semantics. We may want to change this behavior when Thrust gets an `inclusive_scan` which takes an `init` parameter; that would make it easier for the user to control this conversion behavior.
We can replace code like

```cpp
if(i < n)
{
  key = first1[i];
}
```

with

```cpp
key = first1[min(i, n1 - 1)];
```

for a significant performance increase.
Is this project still alive?
There don't seem to have been any major updates in years, and I'm curious whether it even runs with current CUDA versions.
Has this been used at all by anyone recently? Will there be any future development?
Implement this at the same time Thrust does.
Dear Jared, is there any chance of updating the `triple_chevron_launcher` and removing the calls to the deprecated (now removed) CUDA Execution Control API?
We use bulk + thrust in Hydra here at CERN, and would like to keep running it with newer versions of CUDA.
Cheers
Windows doesn't obey the bit fields and creates a 12-byte `block`. I guess we need to encode `is_free` and `size` into a single word manually.
We require an extra register in `for_each` because we maintain both the current agent index, `i`, as well as the parameter, `first`. Each agent increments `first` by its index to find its iterator. That means if parallel agents happen to be implemented in sequence, then over the course of the sequential loop, we have to remember the value of the `first` parameter even though the next agent scheduled in sequence could have gotten its value of `first` from the previous agent.

We could reclaim the register by explicitly targeting a `par(seq)` nesting instead of flat `par`. The idea would be for each parallel agent to find its index and then call `for_each(seq, first + offset, last, f)`. `offset` would be something like `i * grainsize`.
Assigning to a tuple from a ternary operator expression confuses nvcc's pointer space tracking. The workaround is to avoid use of the ternary operator.
Only thread 0 should do this.
Not handling `occupancy == 0` leads to division by zero inside `proportional_smem_allocation`. When `occupancy == 0`, we should return early.
This might improve the performance on small problem sizes.
It should be possible to `#include` a Bulk header without generating a compiler error, even if the compiler is not nvcc. We should guard the use of things like `__syncthreads`, `threadIdx`, etc. from foreign compilers.
This page suggests they were added in gcc 4.3.