
shoco: a fast compressor for short strings

shoco is a C library to compress and decompress short strings. It is very fast and easy to use. The default compression model is optimized for English words, but you can generate your own compression model based on your specific input data.

shoco is free software, distributed under the MIT license.

Quick Start

Copy shoco.c, shoco.h and shoco_model.h from the shoco repository into your project. Include shoco.h and you are ready to use the API!

API

Here is all of it:

size_t shoco_compress(const char * in, size_t len, char * out, size_t bufsize);
size_t shoco_decompress(const char * in, size_t len, char * out, size_t bufsize);

If the len argument for shoco_compress is 0, the input string is assumed to be null-terminated. If it’s a positive integer, parsing the input will stop after this length, or at a null-char, whichever comes first. shoco_decompress, however, needs a positive integer for len (most likely you should pass the return value of shoco_compress).

The return value is the number of bytes written. If it is less than bufsize, all is well. In case of decompression, a null-terminator is written. If the return value is exactly bufsize, the output is all there, but not null-terminated. It is up to you to decide if that’s an error or not. If the buffer is not large enough for the output, the return value will be bufsize + 1. You might want to allocate a bigger output buffer. The compressed string will never be null-terminated.

If you are sure that the input data is plain ASCII, your out buffer for shoco_compress only needs to be as large as the input string. Otherwise, the output buffer may need to be up to 2x as large as the input, if it’s a 1-byte encoding, or even larger for multi-byte or variable-width encodings like UTF-8.

With the default model, maximum compression is 50%, so the out buffer for shoco_decompress needs to be at most twice the size of the compressed string.
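Putting the above together, a typical round trip looks like this (a minimal sketch; the error handling just follows the return-value rules described above):

#include <stdio.h>
#include <string.h>
#include "shoco.h"

int main(void) {
    const char *text = "hello world";
    char comp[64];  /* plain ASCII input: strlen(text) bytes would suffice */
    char back[64];

    size_t comp_len = shoco_compress(text, 0, comp, sizeof comp);
    if (comp_len > sizeof comp) {          /* bufsize + 1: buffer too small */
        fprintf(stderr, "compression buffer too small\n");
        return 1;
    }

    size_t back_len = shoco_decompress(comp, comp_len, back, sizeof back);
    if (back_len >= sizeof back) {         /* full buffer: no null-terminator */
        fprintf(stderr, "decompression buffer too small\n");
        return 1;
    }
    printf("%zu -> %zu bytes, round trip: %s\n", strlen(text), comp_len, back);
    return 0;
}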

How It Works

Have you ever tried compressing the string “hello world” with gzip? Let’s do it now:

$ echo "hello world" | gzip -c | wc -c
32

So the output is actually larger than the input string. And gzip does comparatively well with short input: xz produces an output size of 68 bytes for the same string. Of course, compressing short strings is not what these tools are made for, because you rarely need to make small strings even smaller – except when you do. That’s why shoco was written.

shoco works best if your input is ASCII. In fact, the most remarkable property of shoco is that the compressed size will never exceed the size of your input string, provided it is plain ASCII. What is more: An ASCII string is suitable input for the decompressor (which will return the exact same string, of course). That property comes at a cost, however: If your input string is not entirely (or mostly) ASCII, the output may grow. For some inputs, it can grow quite a lot. That is especially true for multibyte encodings such as UTF-8. Latin-1 and comparable encodings fare better, but will still increase your output size, if you don’t happen to hit a common character. Why is that so?

In every language, some characters are used more often than others. English is no exception to this rule. So if one simply makes a list of the, say, sixteen most common characters, four bits would be sufficient to refer to them (as opposed to the eight bits – one byte – used by ASCII). But what if the input string includes an uncommon character that is not in this list? Here’s the trick: We use the first bit of a char to indicate whether the following bits encode short indices of common characters or a normal ASCII byte. Since the first bit in plain ASCII is always 0, setting the first bit to 1 says “the next bits represent short indices for common chars”. But what if our character is not ASCII (meaning the first bit of the input char is not 0)? Then we insert a marker that says “copy the next byte over as-is”, and we’re done. That explains the growth for non-ASCII characters: This marker takes up a byte, doubling the effective size of the character.

How shoco actually marks these packed representations is a bit more complicated than that (e.g., we also need to specify how many packed characters follow, so a single leading bit won’t be sufficient), but the principle still holds.
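Conceptually, the decompressor’s dispatch on the leading bits looks roughly like this (an illustrative sketch, not the actual shoco source; unpack_common() is a hypothetical helper, and the 0x00 pass-through marker is the one the issue reports below refer to):

unsigned char b = (unsigned char)*in;
if (b == 0x00) {
    /* marker: copy the next input byte over as-is (non-ASCII passthrough) */
    *out++ = *++in;
} else if ((b & 0x80) == 0) {
    /* leading bit 0: a plain ASCII byte, copied verbatim */
    *out++ = (char)b;
} else {
    /* leading bit 1: the remaining bits select a packing format and
       contain short indices of common characters */
    unpack_common(&in, &out);  /* hypothetical helper */
}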

But shoco is a bit smarter than simply abbreviating characters based on absolute frequency – languages have more regularities than that. Some characters are more likely to be encountered next to others; the canonical example would be q, which is almost always followed by a u. In English, the, she, he, then are all very common words – and all have an h followed by an e. So if we assemble a list of common characters following common characters, we can do with even fewer bits to represent these successor characters, and still have a good hit rate. That’s the idea of shoco: Provide short representations of characters based on the previous character.
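As an illustration, mapping a character pair to a short successor index is just a table lookup. The table name below is the one that appears in the generated shoco_model.h (see the issue reports further down); chr_id_of() is a hypothetical helper standing in for the model’s character-to-index mapping:

/* Sketch: find the index of 'e' among the common successors of 'h'.
   Assumes both characters are in the model's common-character set. */
int lead_id = chr_id_of('h');   /* hypothetical helper: id of 'h', or -1 */
int succ_id = successor_ids_by_chr_id_and_chr_id[lead_id][chr_id_of('e')];
/* A small non-negative succ_id fits in a few bits and can be packed;
   -1 means 'e' is not among the common successors of 'h'. */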

This does not allow for optimal compression – by far. But if one carefully aligns the representation packs to byte boundaries, and uses the ASCII-first-bit-trick above to encode the indices, it works well enough. Moreover, it is blazingly fast. You wouldn’t want to use shoco for strings larger than, say, a hundred bytes, because then the overhead of a full-blown compressor like gzip begins to be dwarfed by the advantages of the much more efficient algorithms it uses.

If one wanted to classify shoco, it would be an entropy encoder, because the length of the representation of a character is determined by the probability of encountering it in a given input string. That’s opposed to dictionary coders, which maintain a dictionary of common substrings. An optimal compression for short strings could probably be achieved using an arithmetic coder (also a type of entropy encoder), but most likely one could not achieve the same kind of performance that shoco delivers.

How does shoco get the information about character frequencies? They are not pulled out of thin air, but instead generated by analyzing text with a relatively simple script. It counts all bigrams – two successive characters – in the text and orders them by frequency. If wished for, it also tests for the best encodings (like: Is it better to spend more bits on the leading character or on the successor character?), and then outputs its findings as a header file for shoco.c to include. That means the statistical model is compiled in; we simply can’t add it to the compressed string without blowing it out of proportion (and defeating the whole purpose of this exercise). This script is shipped with shoco, and the next section is about how you can use it to generate a model that’s optimized for your kind of data. Just remember that, with shoco, you need to control both ends of the chain (compression and decompression), because you can’t decompress data correctly if you’re not sure that the compressor has used the same model.
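The counting step itself is simple. Here is the idea in C rather than Python (a self-contained sketch of what the script conceptually does, not its actual code):

#include <stdio.h>

int main(void) {
    static long counts[256][256];  /* counts[previous][successor] */
    int prev = EOF, c;
    while ((c = getchar()) != EOF) {
        if (prev != EOF)
            counts[prev][c]++;
        prev = c;
    }
    /* e.g., report the most common successor of 'q' */
    int best = 0;
    for (int i = 1; i < 256; ++i)
        if (counts['q'][i] > counts['q'][best])
            best = i;
    printf("most common successor of 'q': '%c'\n", best);
    return 0;
}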

Generating Compression Models

Maybe your typical input isn’t English words. Maybe it’s German or French – or whole sentences. Or file system paths. Or URLs. While the standard compression model of shoco should work for all of these, it might be worthwhile to train shoco for this specific type of input data.

Fortunately, that’s really easy: shoco includes a Python script called generate_compression_model.py that takes one or more text files and outputs a header file ready for shoco to use. Here’s an example that trains shoco with a dictionary (by the way, not the best kind of training data, because it’s dominated by uncommon words):

$ ./generate_compression_model.py /usr/share/dict/words -o shoco_model.h

There are options on how to chunk and strip the input data – for example, if we want to train shoco with the words in a readme file, but without punctuation and whitespace, we could do

$ ./generate_compression_model.py --split=whitespace --strip=punctuation README.md

Since we haven’t specified an output file, the resulting table file is printed on stdout.

This is most likely all you’ll need to generate a good model, but if you are adventurous, you might want to play around with all the options of the script: Type generate_compression_model.py --help to get a friendly help message. We won’t dive into the details here, though – just one word of warning: Generating tables can be slow if your input data is large, and especially so if you use the --optimize-encoding option. Using pypy can significantly speed up the process.

Comparisons With Other Compressors

smaz

There’s another good small-string compressor out there: smaz. smaz seems to be dictionary based, while shoco is an entropy encoder. As a result, smaz will often do better than shoco when compressing common English terms. However, shoco typically beats smaz for more obscure input, as long as it’s ASCII. smaz may enlarge your string for uncommon words (like numbers); shoco will never do that for ASCII strings.

Performance-wise, shoco is typically faster by at least a factor of 2. As an example, compressing and decompressing all words in /usr/share/dict/words with smaz takes around 0.325s on my computer and compresses on average by 28%, while shoco has a compression average of 33% (with the standard model; an optimized model will do even better) and takes around 0.145s. shoco is especially fast at decompression.

shoco can be trained with user data, while smaz’s dictionary is built-in. That said, the maximum compression rate of smaz is hard to reach for shoco, so depending on your input type, you might fare better with smaz (there’s no way around it: You have to measure it yourself).

gzip, xz

As mentioned, shoco’s compression ratio can’t (and doesn’t want to) compete with gzip et al. for strings larger than a few bytes. But for very small strings, it will always be better than standard compressors.

The performance of shoco should always be several times faster than just about any standard compression tool. For testing purposes, there’s a binary included (unsurprisingly called shoco) that compresses and decompresses single files. The following timings were made with this command line tool. The data is /usr/share/dict/words (size: 4,953,680 bytes), compressed as a whole (not a strong point of shoco):

compressor | compression time | decompression time | compressed size
-----------|------------------|--------------------|----------------
shoco      | 0.070s           | 0.010s             | 3,393,975
gzip       | 0.470s           | 0.048s             | 1,476,083
xz         | 3.300s           | 0.148s             | 1,229,980

This demonstrates quite clearly that shoco’s compression rate sucks, but also that it’s very fast.

JavaScript Version

For showing off, shoco ships with a JavaScript version (shoco.js) that’s generated with emscripten. If you change the default compression model, you need to re-generate it by typing make js (you do need to have emscripten installed). The output is asm.js with a small shim to provide a convenient API:

compressed = shoco.compress(input_string);
output_string = shoco.decompress(compressed);

The compressed string is really a Uint8Array, since that resembles a C string more closely. The JavaScript version is not as furiously fast as the C version because there’s dynamic (heap) memory allocation involved, but I guess there’s no way around it.

shoco.js should be usable as a node.js module.

Tools And Other Included Extras

Most of them have been mentioned already, but for the sake of completeness – let’s have a quick overview of what you’ll find in the repo:

shoco.c, shoco.h, shoco_model.h

The heart of the project. If you don’t want to bother with nitty-gritty details, and the compression works for you, it’s all you’ll ever need.

models/*

As examples, there are more models included. Feel free to use one of them instead of the default model: Just copy it over shoco_model.h and you’re all set. Re-build them with make models.

training_data/*

Some books from Project Gutenberg used for generating the default model.

shoco.js

JavaScript library, generated by emscripten. Also usable as a node.js module (put it in node_modules and require it). Re-build with make js.

shoco.html

An example of how to use shoco.js in a website.

shoco

A testing tool for compressing and decompressing files. Build it with make shoco or just make. Use it like this:

$ shoco compress file-to-compress.txt compressed-file.shoco
$ shoco decompress compressed-file.shoco decompressed-file.txt

It’s not meant for production use, because I can’t imagine why one would want to use shoco on entire files.

test_input

Another testing tool for compressing and decompressing every line in the input file. Build it with make test_input. Usage example:

$ time ./test_input < /usr/share/dict/words 
Number of compressed strings: 479828, average compression ratio: 33%

real   0m0.158s
user   0m0.145s
sys    0m0.013s

Adding the command line switch -v gives line-by-line information about the compression ratios.

Makefile

It’s not the cleanest or l33test Makefile ever, but it should give you hints for integrating shoco into your project.

tests

Invoke them with make check. They should pass.

Things Still To Do

shoco is stable, and it works well – but I’ve only tested it with gcc/clang on x86_64 Linux. Feedback on how it runs on other OSes, compilers and architectures would be highly appreciated! If it fails, it’s a bug (and given the size of the project, it should be easy to fix). Other than that, there are a few issues that could stand some improvement:

  • There should be more tests, because there’s never enough tests. Ever. Patches are very welcome!
  • Tests should include model generation. As that involves re-compilation, these should probably be written as a Makefile, or in bash or Python (maybe using ctypes to call the shoco functions directly).
  • The Python script for model generation should see some clean-up, as well as documentation. It should also utilize all CPU cores (presumably via the multiprocessing module). This is a good task for new contributors!
  • Again for model generation: Investigate why pypy isn’t as fast as expected (jitviewer might be of help here).
  • Make a real node.js module.
  • The current SSE2 optimization is probably not optimal. Anyone who loves to tinker with these kinds of micro-optimizations is invited to try his or her hand here.
  • Publishing/packaging it as a real library probably doesn’t make much sense, as the model is compiled-in, but maybe we should be making it easier to use shoco as a git submodule (even if it’s just about adding documentation), or finding other ways to avoid the copy&paste installation.

Feedback

If you use shoco, or like it for whatever reason, I’d really love to [hear from you](mailto:christian.h.m.schramm at gmail.com - replace the 'at' with @ and delete this sentence-)! If wished for, I can provide integration with shoco for your commercial services (at a price, of course), or for your totally awesome free and open source software (for free, if I find the time). Also, a nice way of saying thanks is to support me financially via Gittip or Flattr.

If you find a bug, or have a feature request, file it! If you have a question about usage or internals of shoco, ask it on stackoverflow for good exposure – and write me a mail, so that I don’t miss it.

Authors

shoco is written by [Christian Schramm](mailto:christian.h.m.schramm at gmail.com - replace the 'at' with @ and delete this sentence).


shoco's Issues

`make js` --> Error: Variable Module declared more than once. (and a `cwrap` error)

Came across a few errors when trying to build my own model for running in JavaScript, and thought I'd post it in case it helps other newbs like me. Seems like there have been some breaking changes in emscripten since this repo was last updated, so I had to do two things to get the WASM and JavaScript files to build:

  1. Remove var from var Module = { ... at the top of the pre.js
  2. Rather than using make js, use the below command instead:
emcc shoco.c -O3 -o shoco.js --closure 1 -s EXPORTED_FUNCTIONS="['_shoco_compress', '_shoco_decompress']" -s 'EXTRA_EXPORTED_RUNTIME_METHODS=["ccall", "cwrap"]' --pre-js pre.js

This seems to get it working fine for emscripten 1.38.22

Model generator generates non-C compliant code

Hi guys,

Thank you for the library. I am using it to compress the underlying data in the Dictionary library on ESP32 and ESP8266 microcontrollers (you can reference those platforms as "works on" from now on).

While playing with a specific training data set I noticed that generate_compressor_model.py produced strings like this:

static const int8_t successor_ids_by_chr_id_and_chr_id[32][32] = {
  {-1, 13, 6, 9, -1, 7, -1, 15, 1, 11, 3, 4, 2, 10, -1, -1, -1, 0, 12, 5, 8, -1, 14, -1, -1, -1, -1, -1, None, None, None, None},
  {0, 6, 2, 5, -1, 1, 3, -1, 8, 7, 15, 4, 12, -1, -1, 9, 11, -1, 10, -1, 14, -1, -1, -1, -1, -1, -1, 13, None, None, None, None},
  {15, -1, -1, 3, 12, 0, 1, -1, 9, 2, 5, 7, 6, -1, 14, -1, 10, -1, 11, 4, -1, -1, 13, 8, -1, -1, -1, -1, None, None, None, None},

Note all the "None"s in the code.
Obviously the C compiler didn't know what to do with this.
I replaced the Nones with 0s manually and it compiled and worked (compressed the strings), so I assume those are meant to be zeros, but I am not sure.
The training dataset is attached (it's a test data set for random key-value creation). Also attached is the resulting model generated with Nones.

Hope you can fix this.
dictionary-test.txt
dictioanry-test.h.txt

Buffer overflows on malformed input

shoco_decompress has buffer overflows in two places when given malformed input.

I would submit a patch but shoco_decompress returns a size_t and I am unsure how you would like to signal the error in the API.

The two places are these:

  • if the input ends directly after an 0x00 code
  • if the input ends midway through the packed bytes (for example, directly after any byte with the high-bit set that is not itself immediately preceded by the 0x00 code.)
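For reference, the kind of guard that would prevent both cases might look like this (purely illustrative, not the actual shoco source; how to signal the error is exactly the open question above):

/* Before consuming `need` more input bytes, check that they exist: */
if ((size_t)(in_end - in) < need)
    return bufsize + 1;  /* one possible error signal */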

JSON output for generate_successor_table.py

I've ported shoco to Haxe (and thus all its targets, incl JS): https://github.com/clemos/shocohx
Now I would like to add a JSON output to generate_successor_table.py, because I can't work directly with the .h files that the script outputs currently.
I thought about simply adding a new --output-format {json,headerfile} flag.
Would you welcome a PR with this feature, or do you think it would make more sense to just add it to my repo?
In any case, thanks a lot for your work :D

global buffer overflow in shoco_decompress()

Compiled with AFL with ASan like CC=afl-clang-fast make and then run like ./shoco decompress test000 /dev/null which produces this:

==19039==ERROR: AddressSanitizer: global-buffer-overflow on address 0x0000004d0548 at pc 0x0000004bfdda bp 0x7ffd2945a650 sp 0x7ffd2945a648
READ of size 4 at 0x0000004d0548 thread T0
    #0 0x4bfdd9 in shoco_decompress (/root/shoco/shoco+0x4bfdd9)
    #1 0x4c017c in main (/root/shoco/shoco+0x4c017c)
    #2 0x7f542c310b44 in __libc_start_main /build/glibc-qK83Be/glibc-2.19/csu/libc-start.c:287
    #3 0x4bd56c in _start (/root/shoco/shoco+0x4bd56c)

0x0000004d0548 is located 24 bytes to the left of global variable 'chrs_by_chr_and_successor_id' defined in './shoco_model.h:58:21' (0x4d0560) of size 1328
0x0000004d0548 is located 8 bytes to the right of global variable 'chrs_by_chr_id' defined in './shoco_model.h:15:19' (0x4d0520) of size 32
SUMMARY: AddressSanitizer: global-buffer-overflow ??:0 shoco_decompress

test000.zip

Proposal for better compression with non-ascii characters

shoco_decompress can take a null-terminated ascii input string and return the same string. If that can be relaxed a little then I see an opportunity to better handle unicode characters.

The current scheme encodes non-ascii characters by prefixing each byte with the 0x00 code. This doubles the length of the character.

But many of the characters below 0x20 are non-printable and uncommon—except the whitespace characters in the range 0x09–0x0D inclusive. If these uncommon characters are treated like the non-ascii characters it gives room for encoding a length for subsequent special characters.

This is especially useful for utf-8 where the non-ascii characters are multiple bytes.

Take the example utf-8 strings "μ", "μδ" and "😁":

μ
Decoded_both....... cebc
Encoded_current.... 00ce 00bc
Encoded_proposed... 01ce bc

μδ
Decoded_both....... cebc ceb4
Encoded_current.... 00ce 00bc 00ce 00b4
Encoded_proposed... 03ce bcce b4

😁
Decoded_both....... f09f 9881
Encoded_current.... 00f0 009f 0098 0081
Encoded_proposed... 03f0 9f98 81

The length code gives the length of the following bytes less 1.
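A minimal sketch of an encoder for the proposed scheme (illustrative only; the function name is hypothetical):

#include <stddef.h>
#include <string.h>

/* Emit one marker byte storing (run length - 1), then the raw bytes. */
static size_t emit_special_run(const unsigned char *in, size_t run,
                               unsigned char *out) {
    out[0] = (unsigned char)(run - 1);  /* e.g. 0x01 for the 2-byte "μ" */
    memcpy(out + 1, in, run);
    return run + 1;                     /* bytes written */
}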

Compress tokenized strings

Hi,

For compressing tokenized strings similar to:

//HEADER//CODE1//decimals_here//CODE4//decimals_here//CODE3//decimals_here//CODE5//decimals_here//ENDING//

Do I need to split them on the slashes for it to be effective? I ran the model generator on a giant example with lots of these, and the output model does not actually compress anything; I see the last array filled with 00s.

Any hint? thanks!

Non-minimized JS port

Thanks for the library.
Would it be possible to also upload the non-minimized JS port? I'd like to include the library in a Firefox addon, and it would be problematic to get reviewed by AMO if it's minimized.

Input string in compress must be null terminated even when len provided

I had some code where the input string "in" was not null-terminated for:
size_t shoco_compress(const char * in, size_t len, char * out, size_t bufsize);

Specifically
in := Building.txt/237/245/0
len := 12

shoco_compress encoded the two junk characters /237/245 and returned encoded array of length 10.

By providing a copy of the "in" string with terminating null:
in := Building.txt/0
len := 12
it works fine, returning an encoded string length 8.
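As a compact sketch, the workaround amounts to this (a hypothetical helper, assuming len stays below the temporary buffer size):

#include <string.h>
#include "shoco.h"

size_t compress_unterminated(const char *in, size_t len,
                             char *out, size_t bufsize) {
    char tmp[256];            /* assumes len < sizeof tmp for this sketch */
    memcpy(tmp, in, len);
    tmp[len] = '\0';          /* guarantee the terminator shoco relies on */
    return shoco_compress(tmp, len, out, bufsize);
}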

Easy work around but I was going off the API description: "If [len is] a positive integer, parsing the input will stop after this length, or at a null-char, whatever comes first." - which doesn't seem to be the case.

Hope this makes sense; and thanks for providing this library.

Does not work as a node.js module

Download this repo, cd there, and run node. Type this in:

var shoco = require("./shoco.js");shoco.decompress(shoco.compress("Hello, World!"));

I get this error:

TypeError: shoco.compress is not a function
    at repl:1:58
    at sigintHandlersWrap (vm.js:22:35)
    at sigintHandlersWrap (vm.js:96:12)
    at ContextifyScript.Script.runInThisContext (vm.js:21:12)
    at REPLServer.defaultEval (repl.js:340:29)
    at bound (domain.js:280:14)
    at REPLServer.runBound [as eval] (domain.js:293:12)
    at REPLServer.<anonymous> (repl.js:538:10)
    at emitOne (events.js:101:20)
    at REPLServer.emit (events.js:188:7)

instead of the expected "Hello, World!". FYI, var shoco = require("./shoco.js");Object.keys(shoco); returns:

[ 'preRun',
  'print',
  'printErr',
  'read',
  'readBinary',
  'load',
  'arguments',
  'I',
  'postRun',
  'Runtime',
  'ccall',
  'cwrap',
  'setValue',
  'getValue',
  'ALLOC_NORMAL',
  'ALLOC_STACK',
  'ALLOC_STATIC',
  'ALLOC_DYNAMIC',
  'ALLOC_NONE',
  'allocate',
  'Pointer_stringify',
  'UTF16ToString',
  'stringToUTF16',
  'UTF32ToString',
  'stringToUTF32',
  'HEAP',
  'HEAP8',
  'HEAP16',
  'HEAP32',
  'HEAPU8',
  'HEAPU16',
  'HEAPU32',
  'HEAPF32',
  'HEAPF64',
  'Md',
  'addOnPreRun',
  'Jd',
  'addOnInit',
  'Ld',
  'addOnPreMain',
  'Id',
  'addOnExit',
  'Kd',
  'addOnPostRun',
  'intArrayFromString',
  'intArrayToString',
  'writeStringToMemory',
  'writeArrayToMemory',
  'writeAsciiToMemory',
  'addRunDependency',
  'removeRunDependency',
  'preloadedImages',
  'preloadedAudios',
  '_malloc',
  '_memset',
  '_strlen',
  '_memcpy',
  '_free',
  'requestFullScreen',
  'requestAnimationFrame',
  'setCanvasSize',
  'pauseMainLoop',
  'resumeMainLoop',
  'getUserMedia',
  'FS_createFolder',
  'FS_createPath',
  'FS_createDataFile',
  'FS_createPreloadedFile',
  'FS_createLazyFile',
  'FS_createLink',
  'FS_createDevice',
  '_shoco_decompress',
  '_shoco_compress',
  'runPostSets',
  'Rd',
  'callMain',
  're',
  'run',
  'Vd',
  'exit',
  'abort',
  'calledRun',
  'stdin',
  'stdout',
  'stderr',
  'preloadPlugins',
  'Ta' ]

BTW, var shoco = require("./shoco.js");shoco._shoco_decompress(shoco._shoco_compress("hi")); returns 0.

TypeError: ord() expected string of length 1, but int found

On a large input file with words to train I get (the input CSV file is actually just a pure 1-column list):

zangetsu@venus: pts/11: 16 files 192Kb -> ./generate_compressor_model.py ~/tmp/csv-compress-string/my_events.csv -o shoco_model.h 
finding bigrams ... done.
Traceback (most recent call last):
  File "./generate_compressor_model.py", line 391, in <module>
    main()
  File "./generate_compressor_model.py", line 305, in main
    max_chr = ord(max(successors.keys())) + 1
TypeError: ord() expected string of length 1, but int found

Unable to "make js" on Ubuntu

When I attempt to call make js on Ubuntu I get the following error.

/tmp/shoco$ make js
emcc shoco.c -O3 -o shoco.js --closure 1 -s EXPORTED_FUNCTIONS="['_shoco_compress', '_shoco_decompress']" --pre-js pre.js
INFO root: =======================================
INFO root: bootstrapping relooper...
INFO root: bootstrap phase 1
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 389, in _handle_results
    task = get()
TypeError: ('__init__() takes at least 3 arguments (1 given)', <class 'subprocess.CalledProcessError'>, ())

It seems to be the same as here https://bugs.launchpad.net/ubuntu/+source/emscripten/+bug/1474449 and http://stackoverflow.com/questions/15314189/python-multiprocessing-pool-hangs-at-join

Does anyone know of a work around?

Thanks

generate_compressor_model.py generates invalid .h files

I'm using a quite large file (7 MB), ASCII chars only. The words look like this:

hninetrakierf hnisnoitknuf hrettuf hcsnedobSuf htsrUf g hbeilnetrag htsag...

With default options, the tool generates a model file where each element in successor_ids_by_chr_id_and_chr_id looks like this:

static const int8_t successor_ids_by_chr_id_and_chr_id[32][32] = {
  {-1, 9, 3, 10, 14, -1, 4, 1, 6, 5, -1, 0, ... -1, -1, -1, -1, -1, None},
  {0, -1, 1, 7, 4, 6, 2, 3, 5, 8, 12, 10, 1..., -1, -1, -1, -1, -1, None},

If you have no time to fix this, do you have any idea how I could repair this file?
If I replace None with -1, the compression is a little bit better than with the default shoco_model.h file, but maybe I could do better?

Off-by-one error in shoco.c function: shoco_compress

Unable to determine where this error occurs but the tests.c asserts fail when checking how many bytes are written to the output buffer.

For example, shoco_compress(MESSAGE, 1, Out_buf, 4) where MESSAGE is 3 characters long will return 2. If we change the 1 to n, then the function will return n+1.

Weissman Score for Short String Compression

There is a benchmark for compression: https://www.wikiwand.com/en/Weissman_score
This can be applied to short strings too, in comparison to other algorithms:

The equation for a single algorithm: Compression Ratio / Log(Time Required)

Issue: How does one check the speed if the string is short?
Solution: Aggregation of results based on compression of multiple short strings

Issue: How can the data be aggregated?
A: On a String-to-String basis then average the score
B: Concatenate all Strings and their time usage, then calculate the score.

Issue: Using Python to time the speed of each algorithm
Solution: Maybe use C instead? Or there is a way to do this with Python?

TypeError: can't concat str to bytes

zangetsu@venus: pts/11: 16 files 192Kb -> ./generate_compressor_model.py --split=whitespace --strip=punctuation README.md
Traceback (most recent call last):
  File "./generate_compressor_model.py", line 391, in <module>
    main()
  File "./generate_compressor_model.py", line 287, in main
    chunks = list(chunkinator(args.file, args.split, args.strip))
  File "./generate_compressor_model.py", line 245, in chunkinator
    for chunk in chunks:
  File "./generate_compressor_model.py", line 242, in <genexpr>
    chunks = itertools.chain.from_iterable(re.split(b"[" + WHITESPACE + "]", data) for data in all_in)
TypeError: can't concat str to bytes

Probably related to the fact that the script is Python 2 and I run on v3...
