eyalroz / libgiddy
Giddy - A lightweight GPU decompression library
License: BSD 3-Clause "New" or "Revised" License
We get the following results decompressing incidence bitmaps:
hostname | timestamp | db_name | table_name | column_name | length (bytes) | compression_ratio | execution_time | time_unit | bandwidth | bandwidth_unit | test_adapter |
---|---|---|---|---|---|---|---|---|---|---|---|
bricks02.scilens.private | 2017-02-14T12:41:31+01:00 | tpch-sf-10 | lineitem | l_returnflag | 59986052 | 8/3 | 1.493719 | ms | 40.158 | GB/sec | decompression::IncidenceBitmaps<unsigned int, unsigned char, true, 8u> |
bricks02.scilens.private | 2017-02-14T12:41:31+01:00 | tpch-sf-10 | lineitem | l_linestatus | 59986052 | 8/2 | 1.47756 | ms | 40.598 | GB/sec | decompression::IncidenceBitmaps<unsigned int, unsigned char, true, 8u> |
The GPU's theoretical memory bandwidth is 336 GB/sec, and with MODEL we've been able to get close to 300. It's true we should also count the bandwidth used for reading the compressed input data; but even if that's read with perfect efficiency (which I would think it isn't), it only adds a factor of 11/8 or 5/4 (respectively) to the bandwidth, still leaving us at no more than about 1/6 of the maximum. I'm pretty sure this can be improved.
(copied from Issue #163 in the kernel testbench)
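The effective-bandwidth arithmetic above can be recomputed from the table directly. The sketch below (hypothetical helper names; 336 GB/sec is the peak figure quoted above) accounts for reading 1/ratio bytes of compressed input per byte of decompressed output:

```cpp
#include <cassert>
#include <cmath>

// Reading the compressed input adds 1/ratio bytes of traffic per output byte.
inline double traffic_factor(double compression_ratio) {
    return 1.0 + 1.0 / compression_ratio;
}

// Fraction of peak memory bandwidth actually utilized, counting both the
// decompressed output written and the compressed input read.
inline double fraction_of_peak(double measured_gbps, double compression_ratio,
                               double peak_gbps = 336.0) {
    return measured_gbps * traffic_factor(compression_ratio) / peak_gbps;
}

// l_returnflag: ratio 8/3 -> factor 11/8, ~55.2 GB/sec total, ~0.164 of peak
// l_linestatus: ratio 8/2 -> factor 5/4,  ~50.7 GB/sec total, ~0.151 of peak
```

Both runs come out at roughly a sixth of the theoretical maximum once input traffic is included.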
At the moment, most of our decompressors can be used with segment anchors or without them - but not both:
Scheme | Segmented | Unsegmented |
---|---|---|
BITMAP | N/A | ✓ |
DELTA | ✓ | ✗ |
DICT | ✗ | ✓ |
FOR | ✓ | ✗ |
MODEL | ✗ | ✓ |
NS | N/A | ✓ |
NSV | ✓ | ✗ |
RPE | ✓ | ✗ |
RLE | ✓ | ✗ |
(✓ = supported, ✗ = not yet supported)
First, segmented execution is important even as a single option; so MODEL and DICT should definitely have it. Then, it would be nice to support, at least for the sake of completeness, the unsegmented versions of the other schemes: especially DELTA, for benchmarking purposes, and RPE, for cases where the overall support of the column is so small that segmentation is mostly a hassle.
Typically, the IncidenceBitmaps compression scheme would be used when there are very few distinct input values (beyond 32 distinct values there's no longer any storage-space benefit, although if one pre-filters and sends just a subset of the values, the scheme could be beneficial in other contexts). But even that is not very efficient, and one can expect this scheme to be used mostly with 2, 3, 4 or 5 distinct values.
The case of 2 values is special, in that a single bitmap suffices, so let's put it aside. But in the other cases of a small number of bitmaps, the dictionary is also very small. So small, in fact, that instead of looking values up in a shared-memory copy of it, one might be better served by a case statement over these values, or an if-else chain, which might incur predication but might still be faster than going to shared memory.
In these cases, and with the uncompressed data being not-too-large, it might be possible to fit the entire dictionary into each single thread's
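The case-statement idea can be sketched as follows (hypothetical names, written as plain host-side C++; in a kernel this would be a `__device__` function and the four values would live in registers, so the compiler emits predicated selects rather than memory loads):

```cpp
#include <cassert>
#include <cstdint>

// For a 4-entry dictionary, branch over the 2-bit index instead of
// indexing into a shared-memory copy of the dictionary.
inline uint32_t lookup_4_entry(uint8_t index,
                               uint32_t v0, uint32_t v1,
                               uint32_t v2, uint32_t v3) {
    switch (index & 0x3) {   // mask to the 2 relevant bits
        case 0:  return v0;
        case 1:  return v1;
        case 2:  return v2;
        default: return v3;
    }
}
```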
While slower to decompress, it is certainly useful to support data whose compressed elements have sizes which are not whole multiples of a byte. This is particularly relevant for element sizes of 1 to 7 bits.
Bit-resolution element sizes should be supported for the decompressed type by all compression schemes which have a theoretically-unrestricted `Uncompressed` typename template parameter. As for the compressed input, bit-resolution sizes should be supported at least by the following schemes:
And we should perhaps consider also:
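Whichever schemes get it first, the core primitive is extracting a k-bit element from a packed bit stream. A host-side sketch (hypothetical function, little-endian bit order assumed):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Extract the i-th element of bits_per_element bits (1..7) from a packed,
// little-endian-bit-order byte stream.
inline uint8_t get_bits(const std::vector<uint8_t>& packed,
                        size_t i, unsigned bits_per_element) {
    size_t bit_offset = i * bits_per_element;
    size_t byte       = bit_offset / 8;
    unsigned shift    = bit_offset % 8;
    // An element may straddle a byte boundary, so read a two-byte window.
    uint16_t window = packed[byte];
    if (byte + 1 < packed.size())
        window |= uint16_t(packed[byte + 1]) << 8;
    return uint8_t((window >> shift) & ((1u << bits_per_element) - 1));
}
```

For example, the 3-bit elements 5, 3, 7, 0 pack into the two bytes {221, 1}, with element 2 straddling the byte boundary.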
While this library is currently C++-only, there's no reason why it shouldn't be usable from C code.
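One possible shape for such a wrapper, sketched below with entirely made-up names (none of these exist in the library): the C++ machinery hides behind an opaque handle with plain C types at the boundary. The "decompressor" here is stubbed as a pass-through copy, just to show the ABI pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Opaque to C callers; in a real wrapper this would hold the C++ object.
struct giddy_decompressor { size_t element_size; };

extern "C" {

giddy_decompressor* giddy_decompressor_create(size_t element_size) {
    return new giddy_decompressor{element_size};
}

// Returns 0 on success, nonzero on error. Stub: real code would dispatch
// on a scheme selector to the appropriate template instantiation.
int giddy_decompress(giddy_decompressor* d,
                     const void* compressed, size_t compressed_size,
                     void* decompressed, size_t decompressed_capacity) {
    (void)d;
    if (compressed_size > decompressed_capacity) return 1;
    std::memcpy(decompressed, compressed, compressed_size);
    return 0;
}

void giddy_decompressor_destroy(giddy_decompressor* d) { delete d; }

}  // extern "C"
```

The opaque-handle approach keeps all templates out of the public header, so C callers never see C++ types.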
Our current implementations of compression schemes (see Issues #24, #25) all assume the uncompressed data type is known at compile time. In other words, there is a separate instantiation for each type; this has its downsides even for smaller-sized types, but it's especially inopportune for fixed-length strings, whose length one cannot make assumptions about at compile time (it's uniform between records of a DB table, but not between columns and schemata).
So, we need variants of at least some of those compression schemes which produce untyped bytes, and take the output length as a run-time parameter.
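Such an untyped variant could look like this host-side sketch (hypothetical name), with the element size a run-time parameter rather than a template argument:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Dictionary decompression producing untyped bytes. element_size is passed
// at run time, e.g. for fixed-length strings whose length is only known
// per-column, not per-type.
std::vector<uint8_t> dict_decompress_untyped(
        const std::vector<uint8_t>& indices,     // one dictionary index per element
        const std::vector<uint8_t>& dictionary,  // entries laid out back-to-back
        size_t element_size)
{
    std::vector<uint8_t> out(indices.size() * element_size);
    for (size_t i = 0; i < indices.size(); ++i)
        std::memcpy(&out[i * element_size],
                    &dictionary[indices[i] * element_size],
                    element_size);
    return out;
}
```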
Hey, @eyalroz, love the project, can't wait to try it out.
I'm trying to build libgiddy and I'm running into a lot of problems. Most of them are just version mismatches (e.g. my CUDA was too new, my CMake was too old, my Boost install was too new, etc.). I'm still running into build problems, but I have reason to believe they're versioning-related.
Could you provide a list of known good versions for libgiddy's requirements?
Even though this library is about doing work on the GPU, it wouldn't be a bad idea to have a utility which can compress and decompress data entirely on the host side, to facilitate playing around with the actual GPU-side code. At least initially, the code doesn't need to be fast, so it should not be that much of an effort to write.
Compression code can be lifted from my kernel testbench, simply taken out of the context of a test; decompression code I'll have to actually get down to writing. Of course, there would be a de/compression binary with lots of command-line options, etc., and that will be most of the work.
The current implementation of dictionary compression assumes a dictionary entry index is some basic unsigned integral type; so a dictionary of 128 entries, or 16, or 4, gains nothing over one of 256.
We need to specialize for the case of sub-byte dictionary indices, at least for units of 1, 2 and 4 bits, if not for arbitrary bit lengths (perhaps even lengths over one byte).
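A host-side sketch (hypothetical name) of the 2-bit case, packing four indices per byte:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Dictionary decompression with 2-bit indices (four per byte, LSB first),
// for a dictionary of up to 4 entries.
std::vector<uint32_t> dict_decompress_2bit(const std::vector<uint8_t>& packed,
                                           size_t n,
                                           const uint32_t dictionary[4])
{
    std::vector<uint32_t> out(n);
    for (size_t i = 0; i < n; ++i) {
        uint8_t idx = (packed[i / 4] >> ((i % 4) * 2)) & 0x3;
        out[i] = dictionary[idx];
    }
    return out;
}
```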
We currently have no executable binary using the library, neither as a test nor as an example. Now, testing we don't do in this repository; there's a separate one on BitBucket for that. But we should definitely have some examples to illustrate how the library is used.
There's currently almost no attempt to optimize performance for short inputs, e.g. tweaking the serialization factor, or even the core algorithm, to parallelize better in cases where we cannot fill entire blocks the way we would like to.
This would require scrutiny of the kernels, and no less importantly the launch configuration resolution code (both the general mechanism and the kernel-specific parts).
It should be possible to specify dictionary indices of sizes 3, 5, 6 or 7 bytes, not just 1, 2, 4 or 8; size 3 in particular is useful in many cases.
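As an illustration of the odd-size handling this requires, here is a sketch (hypothetical helper, not from the library) of loading a little-endian 3-byte index; 5-, 6- and 7-byte loads would be assembled the same way:

```cpp
#include <cassert>
#include <cstdint>

// Load a 24-bit little-endian dictionary index from an unaligned pointer.
inline uint32_t load_uint24_le(const uint8_t* p) {
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) | (uint32_t(p[2]) << 16);
}
```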
The codebase begins with an "inheritance" of a C++ utilities collection in the src/util directory. Some of that code requires linking and is not header-only; I would rather the users of this library not be encumbered by those objects.
After getting rid of many dependencies before the initial commit, we're left with two:
poor_mans_reflection.cpp
stack_trace.cpp
The former should be easier to remove, I believe; the latter, well, mostly emotionally painful I guess... I like having stack traces on my exceptions. But in the interest of simplicity we'll drop both until we can just depend on Polukhin's upcoming Boost.Stacktrace.
It is often the case that, locally, data in a column has limited support, while globally the support is much larger. If we're very lucky, this is captured by a model function, in which case we can use a compression scheme like FOR, FORLIN etc. But in other cases we just have a jumble of values which change over time. Example: the postal codes of students at a school.
For these cases it is useful to extend the Dictionary scheme so that once in a while the Dictionary can be replaced. Thus instead of the following inputs:
name | length | description |
---|---|---|
compressed_input | n | an index into the dictionary for each compressed element |
dictionary_entries | m | an uncompressed value corresponding to each possible dictionary index |
n | 1 | number of compressed (and decompressed) elements |
m | 1 | number of dictionary entries |
we'll have
name | length | description |
---|---|---|
compressed_input | n | an index into the dictionary for each compressed element |
dictionary_entries | (varies) | an uncompressed value corresponding to each possible dictionary index |
dictionary_application_start | d | the first position in the input (and the output) at which each dictionary comes into effect (the next dictionary start position is where the current dictionary goes out of effect) |
dictionary_lengths | d | the number of dictionary entries in each of the dictionaries |
n | 1 | number of compressed (and decompressed) elements |
d | 1 | number of dictionaries |
This should be useful for data such as that of the USDT-Ontime benchmark.
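A host-side sketch of decompression under the proposed input layout (hypothetical function; each dictionary takes effect at its start position and lapses at the next one, the last lasting to the end of the input):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint32_t> decompress_multi_dict(
        const std::vector<uint8_t>&  compressed_input,             // n indices
        const std::vector<uint32_t>& dictionary_entries,           // all dictionaries, concatenated
        const std::vector<size_t>&   dictionary_application_start, // d start positions
        const std::vector<size_t>&   dictionary_lengths)           // d entry counts
{
    std::vector<uint32_t> out(compressed_input.size());
    size_t entry_base = 0;  // where the current dictionary begins in dictionary_entries
    for (size_t j = 0; j < dictionary_application_start.size(); ++j) {
        size_t begin = dictionary_application_start[j];
        size_t end   = (j + 1 < dictionary_application_start.size())
                       ? dictionary_application_start[j + 1]
                       : compressed_input.size();
        for (size_t i = begin; i < end; ++i)
            out[i] = dictionary_entries[entry_base + compressed_input[i]];
        entry_base += dictionary_lengths[j];
    }
    return out;
}
```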
Why did I switch to only using sizes for Delta decompression? Let's be more flexible than that. It's true that it's not strictly necessary in many or most cases, but it's reasonable to expect to be able to pass signed integers, at least, as differences.
(copied from here)
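What signed differences would look like, as a host-side sketch (hypothetical name; a baseline plus a running sum of signed deltas reconstructs unsigned values, relying on wrap-around unsigned arithmetic):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

std::vector<uint32_t> delta_decompress(uint32_t baseline,
                                       const std::vector<int8_t>& deltas)
{
    std::vector<uint32_t> out;
    out.reserve(deltas.size() + 1);
    out.push_back(baseline);
    for (int8_t d : deltas)
        // Sign-extend the delta, then add with well-defined unsigned wrap-around.
        out.push_back(out.back() + uint32_t(int32_t(d)));
    return out;
}
```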
In the decompression code, we interchangeably use terms like "periods", "intervals" and "segments" to signify what is, extensionally if not intensionally, the same thing. The same goes for other terms. So let's unify the terminology somewhat, to refer to "segments" and "anchors" (or "anchor values", etc.) throughout the code.
In our RLE decoding, we have to account for the same run of data possibly being very long, making us take as input an extra anchoring column with offsets inside the run.
If it were possible to branch out into multiple threads for very long runs, perhaps this could speed up the kernel; and it's conceivable this might be possible with dynamic parallelism.
... on the other hand, it could be that the overhead is just too high.
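The role of the anchor column can be sketched host-side (hypothetical names): to decompress a segment of output positions independently, a block must know which run, and which offset within that run, its first position falls in.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Anchor { size_t run_index; size_t offset_in_run; };

// Decompress `count` output elements starting at the anchored position.
// A very long run may span many segments, each anchored partway into it.
void rle_decompress_segment(const std::vector<uint32_t>& run_values,
                            const std::vector<size_t>&   run_lengths,
                            Anchor anchor, size_t count, uint32_t* out)
{
    size_t r = anchor.run_index, pos = anchor.offset_in_run;
    for (size_t i = 0; i < count; ++i) {
        while (pos >= run_lengths[r]) { pos = 0; ++r; }  // advance past exhausted runs
        out[i] = run_values[r];
        ++pos;
    }
}
```

A dynamic-parallelism variant would instead spawn extra threads when `run_lengths[r]` is very large, rather than looping serially.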