
doublecompression

This project contains implementations of various compressors and decompressors for double[] data: Gorilla, Sprintz, FCM, DFCM, (D)FCM -> Gorilla/Sprintz pipelining, a ToBinary compressor that converts a double[] from ASCII to a binary representation, and a lossy ToInt compressor that converts each double to an integer while preserving as much of the double's precision as possible. There is also a variety of testing functions that verify the correctness of compression and decompression and automate measurement of the throughput and compression ratio of the various compression methods.

How to use this project

To compress data and generate reports, code should be written in Main.java. Some common function calls include:

Tester.generateReport(methodName, verbose, datasetName);
Tester.testCorrectnessFull(methodName);
Tester.testCorrectnessOneFile(fileName, methodName);

Note that GorillaRef and SprintzRef are the names of the Gorilla and Sprintz Compressors, respectively.
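
As a minimal sketch of a Main.java that ties these calls together (the dataset name, file path, and verbose flag below are placeholders, not names from this repository):

public class Main {
    public static void main(String[] args) throws Exception {
        // Verify that a method round-trips all files correctly, then a single file.
        Tester.testCorrectnessFull("SprintzRef");
        Tester.testCorrectnessOneFile("data/SomeDataset/series0.txt", "SprintzRef");

        // Generate a report (CR, compression throughput, decompression throughput).
        Tester.generateReport("GorillaRef", true, "SomeDataset");
    }
}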

Results so far

In my testing on the UCR dataset, the best method I have found is a slightly modified Sprintz (using XORing instead of diffing) with a block size of 2, which gives an average compression ratio of 0.54. For comparison, ToBinary gives a baseline CR of 0.67, while gzip9 gives a CR of 0.39. Sprintz (like Gorilla) has an advantage over gzip in that it is much faster, with an average compression throughput of ~85 MB/s as opposed to <1 MB/s for gzip. This speed advantage alone may not justify the larger CR, but with further work we can hopefully lower the CRs attained by the double compression methods and raise their throughput to levels that make them a better alternative to gzip. ToInt also achieves a CR of 0.39 with an average compression throughput of ~20 MB/s, though it is lossy.
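
As a rough illustration of the "XORing instead of diffing" modification (a minimal sketch, not the project's exact code): consecutive doubles are reinterpreted as 64-bit integers and XORed, so nearby values produce long runs of zero bits.

// Sketch: XOR-based "delta" of consecutive doubles, the transform used by Gorilla and by
// the modified Sprintz described above (illustrative only).
static long[] xorDeltas(double[] values) {
    long[] out = new long[values.length];
    long prevBits = Double.doubleToLongBits(values[0]);
    out[0] = prevBits;                          // the first value is stored verbatim
    for (int i = 1; i < values.length; i++) {
        long bits = Double.doubleToLongBits(values[i]);
        out[i] = bits ^ prevBits;               // close values share bits, giving many leading zeros
        prevBits = bits;
    }
    return out;
}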

Terminology and notes

  • For our purposes, compression ratio is defined as compressed size / uncompressed size, so lower is better (see the example after this list).
  • My implementation of Gorilla has an average compression throughput of ~20 MB/s, while a reference solution run on the same data has a compression throughput of ~110 MB/s, suggesting that my implementations can still be substantially optimized and that we can expect speedups of up to ~5x once that is done.
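
For example, a 100 KB file that compresses to 54 KB has a compression ratio of 54/100 = 0.54.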

Next steps

  • Investigate compression of ints and run the output of ToInt through those compressors to obtain lower CRs for lossy compression - this may involve modifying Gorilla, Sprintz, and (D)FCM to work on ints.

File Breakdown

Analyses

This folder contains CSV files with data generated by running tests; these data are used to make the figures that appear in the Figures folder. The folder also contains the Grapher.py files used to produce the figures.

Bit Report

Contains information on the percentage of zero and one bits produced by various manipulations of the doubles. The first column is the bit number, the second column is the percentage of doubles that have a one in that position, and the third column is the manipulation applied to the data. "Gorilla" refers to XORing each double with the previous double. The number after (D)FCM is the level used (how many previous doubles are used to make the prediction). "Full" after (D)FCM means all doubles are output, while "Xored" means only those for which a prediction was actually made. "Pred Or Xor" means outputting the XOR with the prediction when a prediction is made and the XOR with the previous double otherwise; "Correct Pred Or Xor" is the same, except that the prediction must also be correct. Note that I was not able to correctly decompress with these methods.

As can be seen in Figures/Bit Reports, XORing with the previous double (i.e., "Gorilla") yields the lowest overall percentage of one bits, which is what we want. The only exception is "Correct Pred Or Xor", which is not feasible because the extra work lowers the throughput (and, as mentioned above, I was not able to decompress it properly), and its benefit is only ~1% and thus negligible. DFCM lowers the percentage of one bits towards the end of the double, but at the expense of a higher percentage at the beginning of the double, which negates any benefit. For (D)FCM, lower levels resulted in lower percentages.
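
As a minimal sketch of how such per-bit statistics could be computed for the "Gorilla" manipulation (illustrative only; not necessarily the code used to produce these reports, and bit 0 here is the least significant bit):

// Sketch: for each of the 64 bit positions, the fraction of XORed values with a one
// in that position. Assumes at least two input values.
static double[] oneBitFractions(double[] values) {
    long[] counts = new long[64];
    long prevBits = Double.doubleToLongBits(values[0]);
    for (int i = 1; i < values.length; i++) {
        long bits = Double.doubleToLongBits(values[i]);
        long xor = bits ^ prevBits;
        for (int b = 0; b < 64; b++) {
            if (((xor >>> b) & 1L) != 0) counts[b]++;
        }
        prevBits = bits;
    }
    double[] fractions = new double[64];
    for (int b = 0; b < 64; b++) fractions[b] = (double) counts[b] / (values.length - 1);
    return fractions;
}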

Compression vs. Throughput

Used to make the figure Compression vs. Throughput. This figure shows that, for Gorilla and Sprintz, lower compression ratios often yield higher throughputs.

FCM Predictions

Used to make the figure FCM Level Prediction Comparison, which relates the (D)FCM level to the number of predictions and correct predictions made. The figure shows that lower FCM levels yield higher percentages of predictions and correct predictions.
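
To make "level" concrete, here is a minimal sketch of a level-N FCM predictor over the bit patterns of doubles (the class name, table size, and hash function are invented for this example; they are not the project's implementation):

// Sketch of a level-N FCM predictor: a hash of the last `level` bit patterns indexes a
// table of "what came next"; the table entry is the prediction and is updated as data arrives.
final class FcmPredictor {
    private final int level;
    private final long[] table;     // context hash -> last value seen after that context
    private final long[] history;   // the last `level` bit patterns, most recent last
    private int seen = 0;

    FcmPredictor(int level, int tableBits) {
        this.level = level;
        this.table = new long[1 << tableBits];
        this.history = new long[level];
    }

    // Returns the prediction for the next value, then records the actual value.
    long predictAndUpdate(long actualBits) {
        int idx = contextIndex();
        long prediction = table[idx];
        table[idx] = actualBits;                              // learn what followed this context
        System.arraycopy(history, 1, history, 0, level - 1);  // slide the context window
        history[level - 1] = actualBits;
        seen++;
        return prediction;
    }

    // True once enough values have been seen for the context to be meaningful.
    boolean contextFilled() { return seen >= level; }

    private int contextIndex() {
        long h = 0;
        for (long bits : history) h = (h << 6) ^ (bits >>> 48);
        return (int) (h & (table.length - 1));
    }
}

A prediction is counted as correct when the returned bits equal the next value's actual bits.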

FCM Throughput

Used to make the figures FCM Compression Throughput and FCM Decompression Throughput. These show that lower levels yield higher throughputs, and that FCM has higher throughputs than DFCM, especially for decompression.

Front Coding

Used to make the figure Front Coding. The figure shows that front coding (which refers to Gorilla reusing the old leading- and trailing-zero counts when the new meaningful bits fall within the old meaningful bits) leads to slightly lower compression ratios.
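
A minimal sketch of the reuse check described above (variable and method names are illustrative, not taken from the project):

// Sketch: the new XORed value's meaningful bits fit inside the previous window exactly
// when it has at least as many leading and at least as many trailing zeros as before.
static boolean canReusePreviousWindow(long xor, int prevLeadingZeros, int prevTrailingZeros) {
    return Long.numberOfLeadingZeros(xor) >= prevLeadingZeros
        && Long.numberOfTrailingZeros(xor) >= prevTrailingZeros;
}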

gzip

Used to make the figures gzip CR and gzip Throughput. These figures compare the compression ratios and throughputs of various levels of gzip. They show that gzip9 yields a slightly lower compression ratio than gzip1 with no loss in throughput, and thus gzip9 should be used for comparisons against other compression methods.

nBits vs. Block Size

Used to make the figure nBits vs. Block Size, where block size is the number of doubles in a Sprintz block and nBits is the average number of significant bits stored in each block. As we can see in the figure, lower block sizes yield lower average nBits (green triangles are means). However, lower block sizes are not necessarily better: metadata must be stored for every block, so smaller blocks also mean more metadata. As we will see, a block size of two yields the greatest compression.
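
A minimal sketch of how nBits could be computed for one block, assuming the XORed 64-bit representation described earlier (not the project's exact code):

// Sketch: nBits for a block is the bit width of the widest value in the block;
// every value in the block is then stored using that many bits.
static int nBitsForBlock(long[] block) {
    int maxBits = 0;
    for (long value : block) {
        maxBits = Math.max(maxBits, 64 - Long.numberOfLeadingZeros(value));
    }
    return maxBits;
}

Because nBits is a maximum over the block, smaller blocks are less likely to contain one wide value that forces every other value in the block to be stored wide, which is why lower block sizes yield lower average nBits.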

Reports

Contains the reports generated for each compression method, each of which lists the compression ratio, compression throughput, and decompression throughput for each file. Used to make the figures CR Comparison and Throughput Comparison. The former shows that gzip9 gives the lowest average CR at 0.39, with Sprintz2 giving 0.54, Gorilla giving 0.61, and ToBinary giving 0.67; ToBinary should be considered the baseline (so these CRs could be normalized by dividing them by 0.67). The latter figure, however, shows that gzip9 is also by far the slowest method, with an average compression throughput under 1 MB/s, while the others range from 15-20 MB/s. These throughputs should be considered low estimates for everything but gzip: the reference Gorilla implementation has an average throughput of 110 MB/s, suggesting that with optimization my compression methods may become up to 5x faster than they currently are.
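
For example, normalizing by the ToBinary baseline gives Sprintz2 a relative CR of 0.54 / 0.67 ≈ 0.81, Gorilla 0.61 / 0.67 ≈ 0.91, and gzip9 0.39 / 0.67 ≈ 0.58.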

data

Contains the raw data files, which are ASCII representations of the doubles. Each subfolder is a different dataset, each of which contains multiple files that each contain one time series.

gorilla-tsc

Contains the reference implementation of Gorilla from https://github.com/burmanm/gorilla-tsc. I modified it very slightly, mainly just making some functions public so that they could be accessed in my files.

gzipbin

Raw data for making reports on how gzip performs on binary data. The existing gzip reports come from running gzip on the raw ASCII data, but we figured we should see whether it performs any better on binary data. For each file there are three lines: the first gives the percentage the file was compressed by (NOTE: if it was compressed by 60%, the compression ratio is 0.4, not 0.6), the second gives the compression time, and the third gives the decompression time; this pattern then repeats for the next file. This is a project that should be taken on in future research.

gzipDataAscii

A copy of data for gzip ASCII tests.

gzipDataBinary

A copy of data run through ToBinary, for gzip binary tests.

gzipTester

An executable file that tests the CR and time of gzip. NOTE: This takes many hours to run.

Output

Contains the compressed data files. Structured in the same way as data.

src/DoubleCompression

Compressor.java

Contains convenience functions that take data (either directly or as a filename) along with a method name, and compress or decompress the data.

DFCMCompressor.java

Contains the DFCM Compression code. Also contains a convert() function that can be used for converting a raw double[] into a double[] of differences.
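
A minimal sketch of what such a conversion does (the actual convert() signature and behavior may differ):

// Sketch: convert a raw double[] into a double[] of consecutive differences.
// The first element is kept as-is so the original series can be reconstructed.
static double[] toDifferences(double[] values) {
    double[] diffs = new double[values.length];
    diffs[0] = values[0];
    for (int i = 1; i < values.length; i++) {
        diffs[i] = values[i] - values[i - 1];
    }
    return diffs;
}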

DFCMDecompressor.java

Contains the DFCM Decompression code.

FCMCompressor.java

Contains the FCM Compression code.

FCMDecompressor.java

Contains the FCM Decompression code.

GorillaCompressor.java

Contains the Gorilla Compression code.

GorillaDecompressor.java

Contains the Gorilla Decompression code.

GorillaReferenceCompressor.java

Contains the reference Gorilla Compression code.

GorillaReferenceDecompressor.java

Contains the reference Gorilla Decompression code.

Main.java

Functions can be called here for testing and for generating reports. In general, only functions from Tester.java should be called here.

Reader.java

Contains code for reading in data from a file.

SprintzCompressor.java

Contains the Sprintz Compression code.

SprintzDecompressor.java

Contains the Sprintz Decompression code.

Tester.java

Contains many functions for testing compression methods and outputting reports about them. These are the functions that will usually be called in Main.java.

ToBinaryCompressor.java

Contains the ToBinary Compression code.

ToBinaryDecompressor.java

Contains the ToBinary Decompression code.

ToIntCompressor.java

Contains the ToInt Compression code.

ToIntDecompressor.java

Contains the ToInt Decompression code.

Util.java

Contains utility functions used for a variety of purposes throughout the rest of the code.

Writer.java

Contains code for writing data to a file.
