eyalroz / libgiddy
Giddy - A lightweight GPU decompression library
License: BSD 3-Clause "New" or "Revised" License
We get the following results decompressing incidence bitmaps:
hostname | timestamp | db_name | table_name | column_name | length (bytes) | compression_ratio | execution_time | time_unit | bandwidth | bandwidth_unit | test_adapter |
---|---|---|---|---|---|---|---|---|---|---|---|
bricks02.scilens.private | 2017-02-14T12:41:31+01:00 | tpch-sf-10 | lineitem | l_returnflag | 59986052 | 8/3 | 1.493719 | ms | 40.158 | GB/sec | decompression::IncidenceBitmaps<unsigned int, unsigned char, true, 8u> |
bricks02.scilens.private | 2017-02-14T12:41:31+01:00 | tpch-sf-10 | lineitem | l_linestatus | 59986052 | 8/2 | 1.47756 | ms | 40.598 | GB/sec | decompression::IncidenceBitmaps<unsigned int, unsigned char, true, 8u> |
The GPU's theoretical memory bandwidth is 336 GB/sec, and with MODEL we've been able to get close to 300. It's true we should also count the bandwidth used for reading the compressed input data; but even if that's read with perfect efficiency (which I would think it isn't), it only adds a factor of 11/8 or 5/4 (respectively) to the bandwidth, still leaving us at no more than about 1/6 of the maximum. I'm pretty sure this can be improved.
(copied from Issue #163 in the kernel testbench)
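The effective-bandwidth arithmetic above can be recomputed from the table directly. The sketch below (hypothetical helper names; 336 GB/sec is the peak figure quoted above) accounts for reading 1/ratio bytes of compressed input per byte of decompressed output:

```cpp
#include <cassert>
#include <cmath>

// Reading the compressed input adds 1/ratio bytes of traffic per output byte.
inline double traffic_factor(double compression_ratio) {
    return 1.0 + 1.0 / compression_ratio;
}

// Fraction of peak memory bandwidth actually utilized, counting both the
// decompressed output written and the compressed input read.
inline double fraction_of_peak(double measured_gbps, double compression_ratio,
                               double peak_gbps = 336.0) {
    return measured_gbps * traffic_factor(compression_ratio) / peak_gbps;
}

// l_returnflag: ratio 8/3 -> factor 11/8, ~55.2 GB/sec total, ~0.164 of peak
// l_linestatus: ratio 8/2 -> factor 5/4,  ~50.7 GB/sec total, ~0.151 of peak
```

Both runs come out at roughly a sixth of the theoretical maximum once input traffic is included.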
At the moment, most of our decompressors can be used with segment anchors or without them - but not both:
Scheme | Segmented | Unsegmented |
---|---|---|
BITMAP | N/A | ✓ |
DELTA | ✓ | ✗ |
DICT | ✗ | ✓ |
FOR | ✓ | ✗ |
MODEL | ✗ | ✓ |
NS | N/A | ✓ |
NSV | ✓ | ✗ |
RPE | ✓ | ✗ |
RLE | ✓ | ✗ |
(✓ = supported, ✗ = not yet supported)
First, segmented execution is important even as a single option; so MODEL and DICT should definitely have it. Then, it would be nice to support, at least for the sake of completeness, the unsegmented versions of the other schemes: especially DELTA, for benchmarking purposes, and RPE, for cases where the overall support of the column is so small that segmentation is mostly a hassle.
Typically, the IncidenceBitmaps compression scheme would be used when there are very few distinct input values (beyond 32 distinct values there's no longer any storage-space benefit, although if one pre-filters and sends just a subset of the values, the scheme could be beneficial in other contexts). But even that is not very efficient, and one can expect this scheme to be used mostly with 2, 3, 4 or 5 distinct values.
The case of 2 values is special, in that a single bitmap suffices, so let's put it aside. But in the other cases of a small number of bitmaps, the dictionary is also very small. So small, in fact, that instead of looking values up in a shared-memory copy of it, one might be better served by a case statement over these values, or an if-else chain, which might incur predication but might still be faster than going to shared memory.
In these cases, and with the uncompressed data being not-too-large, it might be possible to fit the entire dictionary into each single thread's
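The case-statement idea can be sketched as follows (hypothetical names, written as plain host-side C++; in a kernel this would be a `__device__` function and the four values would live in registers, so the compiler emits predicated selects rather than memory loads):

```cpp
#include <cassert>
#include <cstdint>

// For a 4-entry dictionary, branch over the 2-bit index instead of
// indexing into a shared-memory copy of the dictionary.
inline uint32_t lookup_4_entry(uint8_t index,
                               uint32_t v0, uint32_t v1,
                               uint32_t v2, uint32_t v3) {
    switch (index & 0x3) {   // mask to the 2 relevant bits
        case 0:  return v0;
        case 1:  return v1;
        case 2:  return v2;
        default: return v3;
    }
}
```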
While slower to decompress, it is certainly useful to support data whose compressed elements have sizes which are not whole multiples of a byte. This is particularly relevant for element sizes of 1 to 7 bits.
Bit-resolution element sizes should be supported for the decompressed type by all compression schemes which have a theoretically-unrestricted `Uncompressed` typename template parameter. As for the compressed input, bit-resolution sizes should be supported at least by the following schemes:
And we should perhaps consider also:
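Whichever schemes get it first, the core primitive is extracting a k-bit element from a packed bit stream. A host-side sketch (hypothetical function, little-endian bit order assumed):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Extract the i-th element of bits_per_element bits (1..7) from a packed,
// little-endian-bit-order byte stream.
inline uint8_t get_bits(const std::vector<uint8_t>& packed,
                        size_t i, unsigned bits_per_element) {
    size_t bit_offset = i * bits_per_element;
    size_t byte       = bit_offset / 8;
    unsigned shift    = bit_offset % 8;
    // An element may straddle a byte boundary, so read a two-byte window.
    uint16_t window = packed[byte];
    if (byte + 1 < packed.size())
        window |= uint16_t(packed[byte + 1]) << 8;
    return uint8_t((window >> shift) & ((1u << bits_per_element) - 1));
}
```

For example, the 3-bit elements 5, 3, 7, 0 pack into the two bytes {221, 1}, with element 2 straddling the byte boundary.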
While this library is currently C++-only, there's no reason why it shouldn't be usable from C code.
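One possible shape for such a wrapper, sketched below with entirely made-up names (none of these exist in the library): the C++ machinery hides behind an opaque handle with plain C types at the boundary. The "decompressor" here is stubbed as a pass-through copy, just to show the ABI pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Opaque to C callers; in a real wrapper this would hold the C++ object.
struct giddy_decompressor { size_t element_size; };

extern "C" {

giddy_decompressor* giddy_decompressor_create(size_t element_size) {
    return new giddy_decompressor{element_size};
}

// Returns 0 on success, nonzero on error. Stub: real code would dispatch
// on a scheme selector to the appropriate template instantiation.
int giddy_decompress(giddy_decompressor* d,
                     const void* compressed, size_t compressed_size,
                     void* decompressed, size_t decompressed_capacity) {
    (void)d;
    if (compressed_size > decompressed_capacity) return 1;
    std::memcpy(decompressed, compressed, compressed_size);
    return 0;
}

void giddy_decompressor_destroy(giddy_decompressor* d) { delete d; }

}  // extern "C"
```

The opaque-handle approach keeps all templates out of the public header, so C callers never see C++ types.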
Our current implementations of compression schemes (see Issues #24, #25) all assume the uncompressed data type is known at compile time. In other words, there is a separate instantiation for each type; this has its downsides even for smaller-sized types, but it's especially inopportune for fixed-length strings, whose length one cannot make assumptions about at compile time (it's uniform between records of a DB table, but not between columns and schemata).
So, we need variants of at least some of those compression schemes which produce untyped bytes, and take the output length as a run-time parameter.
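Such an untyped variant could look like this host-side sketch (hypothetical name), with the element size a run-time parameter rather than a template argument:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Dictionary decompression producing untyped bytes. element_size is passed
// at run time, e.g. for fixed-length strings whose length is only known
// per-column, not per-type.
std::vector<uint8_t> dict_decompress_untyped(
        const std::vector<uint8_t>& indices,     // one dictionary index per element
        const std::vector<uint8_t>& dictionary,  // entries laid out back-to-back
        size_t element_size)
{
    std::vector<uint8_t> out(indices.size() * element_size);
    for (size_t i = 0; i < indices.size(); ++i)
        std::memcpy(&out[i * element_size],
                    &dictionary[indices[i] * element_size],
                    element_size);
    return out;
}
```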
Hey, @eyalroz, love the project, can't wait to try it out.
I'm trying to build libgiddy and I'm running into a lot of problems. Most of them are just version mismatches (e.g. my CUDA was too new, my CMake was too old, my Boost install was too new, etc.). I'm still running into build problems, but I have reason to believe they're versioning-related.
Could you provide a list of known good versions for libgiddy's requirements?
Even though this library is about doing work on the GPU, it wouldn't be a bad idea to have a utility which can compress and decompress data entirely on the host side, to facilitate playing around with the actual GPU-side code. At least initially, the code doesn't need to be fast, so it should not be that much of an effort to write.
Compression code can be lifted from my kernel testbench, simply taken out of the context of a test; decompression code I'll have to actually get down to writing. Of course, there would be a de/compression binary with lots of command-line options, etc., and that will be most of the work.
The current implementation of dictionary compression assumes a dictionary entry index is some basic unsigned integral type; so a dictionary of 128 entries, or 16, or 4, gains nothing over one of 256.
We need to specialize for the case of sub-byte dictionary indices, at least for units of 1, 2 and 4 bits, if not for arbitrary bit lengths (perhaps even lengths over one byte).
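A host-side sketch (hypothetical name) of the 2-bit case, packing four indices per byte:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Dictionary decompression with 2-bit indices (four per byte, LSB first),
// for a dictionary of up to 4 entries.
std::vector<uint32_t> dict_decompress_2bit(const std::vector<uint8_t>& packed,
                                           size_t n,
                                           const uint32_t dictionary[4])
{
    std::vector<uint32_t> out(n);
    for (size_t i = 0; i < n; ++i) {
        uint8_t idx = (packed[i / 4] >> ((i % 4) * 2)) & 0x3;
        out[i] = dictionary[idx];
    }
    return out;
}
```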
We currently have no executable binary using the library, neither as a test nor as an example. Now, testing we don't do in this repository; there's a separate one on BitBucket for that. But we should definitely have some examples to illustrate how the library is used.
There's currently almost no attempt to optimize performance for short inputs, e.g. tweaking the serialization factor, or even the core algorithm, to parallelize better in cases where we cannot fill entire blocks the way we would like to.
This would require scrutiny of the kernels, and no less importantly the launch configuration resolution code (both the general mechanism and the kernel-specific parts).
It should be possible to specify dictionary indices of sizes 3, 5, 6 or 7 bytes, not just 1, 2, 4 or 8; size 3 in particular is useful in many cases.
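As an illustration of the odd-size handling this requires, here is a sketch (hypothetical helper, not from the library) of loading a little-endian 3-byte index; 5-, 6- and 7-byte loads would be assembled the same way:

```cpp
#include <cassert>
#include <cstdint>

// Load a 24-bit little-endian dictionary index from an unaligned pointer.
inline uint32_t load_uint24_le(const uint8_t* p) {
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) | (uint32_t(p[2]) << 16);
}
```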
The codebase begins with an "inheritance" of a C++ utilities collection in the src/util directory. Some of that code requires linking and is not header-only; I would rather the users of this library not be encumbered by those objects.
After getting rid of many dependencies before the initial commit, we're left with two:
poor_mans_reflection.cpp
stack_trace.cpp
The former should be easier to remove, I believe; the latter, well, mostly emotionally painful I guess... I like having stack traces on my exceptions. But in the interest of simplicity we'll drop both until we can just depend on Polukhin's upcoming Boost.Stacktrace.
It is often the case that, locally, data in a column has limited support, while globally the support is much larger. If we're very lucky, this is captured by a model function, in which case we can use a compression scheme like FOR, FORLIN etc. But in other cases we just have a jumble of values which change over time. Example: the postal codes of students at a school.
For these cases it is useful to extend the Dictionary scheme so that once in a while the Dictionary can be replaced. Thus instead of the following inputs:
name | length | description |
---|---|---|
compressed_input | n | an index into the dictionary for each compressed element |
dictionary_entries | m | an uncompressed value corresponding to each possible dictionary index |
n | 1 | number of compressed (and decompressed) elements |
m | 1 | number of dictionary entries |
we'll have
name | length | description |
---|---|---|
compressed_input | n | an index into the dictionary for each compressed element |
dictionary_entries | (varies) | an uncompressed value corresponding to each possible dictionary index |
dictionary_application_start | d | the first position in the input (and the output) at which each dictionary comes into effect (the next dictionary start position is where the current dictionary goes out of effect) |
dictionary_lengths | d | the number of dictionary entries in each of the dictionaries |
n | 1 | number of compressed (and decompressed) elements |
d | 1 | number of dictionaries |
This should be useful for data such as that of the USDT-Ontime benchmark.
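A host-side sketch of decompression under the proposed input layout (hypothetical function; each dictionary takes effect at its start position and lapses at the next one, the last lasting to the end of the input):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint32_t> decompress_multi_dict(
        const std::vector<uint8_t>&  compressed_input,             // n indices
        const std::vector<uint32_t>& dictionary_entries,           // all dictionaries, concatenated
        const std::vector<size_t>&   dictionary_application_start, // d start positions
        const std::vector<size_t>&   dictionary_lengths)           // d entry counts
{
    std::vector<uint32_t> out(compressed_input.size());
    size_t entry_base = 0;  // where the current dictionary begins in dictionary_entries
    for (size_t j = 0; j < dictionary_application_start.size(); ++j) {
        size_t begin = dictionary_application_start[j];
        size_t end   = (j + 1 < dictionary_application_start.size())
                       ? dictionary_application_start[j + 1]
                       : compressed_input.size();
        for (size_t i = begin; i < end; ++i)
            out[i] = dictionary_entries[entry_base + compressed_input[i]];
        entry_base += dictionary_lengths[j];
    }
    return out;
}
```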
Why did I switch to only using sizes for Delta decompression? Let's be more flexible than that. It's true that it's not strictly necessary in many or most cases, but it's reasonable to expect to be able to pass signed integers, at least, as differences.
(copied from here)
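What signed differences would look like, as a host-side sketch (hypothetical name; a baseline plus a running sum of signed deltas reconstructs unsigned values, relying on wrap-around unsigned arithmetic):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

std::vector<uint32_t> delta_decompress(uint32_t baseline,
                                       const std::vector<int8_t>& deltas)
{
    std::vector<uint32_t> out;
    out.reserve(deltas.size() + 1);
    out.push_back(baseline);
    for (int8_t d : deltas)
        // Sign-extend the delta, then add with well-defined unsigned wrap-around.
        out.push_back(out.back() + uint32_t(int32_t(d)));
    return out;
}
```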
In the decompression code, we interchangeably use terms like "periods", "intervals" and "segments" to signify what is, extensionally if not intensionally, the same thing. The same goes for other terms. So let's unify the terminology somewhat, to refer to "segments" and "anchors" (or "anchor values", etc.) throughout the code.
In our RLE decoding, we have to account for the same run of data possibly being very long, making us take as input an extra anchoring column with offsets inside the run.
If it were possible to branch out into multiple threads for very long runs, perhaps this could speed up the kernel; and it's conceivable this might be possible with dynamic parallelism.
... on the other hand, it could be that the overhead is just too high.
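The role of the anchor column can be sketched host-side (hypothetical names): to decompress a segment of output positions independently, a block must know which run, and which offset within that run, its first position falls in.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Anchor { size_t run_index; size_t offset_in_run; };

// Decompress `count` output elements starting at the anchored position.
// A very long run may span many segments, each anchored partway into it.
void rle_decompress_segment(const std::vector<uint32_t>& run_values,
                            const std::vector<size_t>&   run_lengths,
                            Anchor anchor, size_t count, uint32_t* out)
{
    size_t r = anchor.run_index, pos = anchor.offset_in_run;
    for (size_t i = 0; i < count; ++i) {
        while (pos >= run_lengths[r]) { pos = 0; ++r; }  // advance past exhausted runs
        out[i] = run_values[r];
        ++pos;
    }
}
```

A dynamic-parallelism variant would instead spawn extra threads when `run_lengths[r]` is very large, rather than looping serially.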