nayuki / reference-arithmetic-coding Goto Github PK

Clear implementation of arithmetic coding for educational purposes in Java, Python, C++.

Home Page: https://www.nayuki.io/page/reference-arithmetic-coding

Java 44.06% Python 25.12% C++ 30.27% Makefile 0.55%

java arithmetic-coding python library reference-implementation c-plus-plus

reference-arithmetic-coding's Introduction

Reference arithmetic coding

This project is a clear implementation of arithmetic coding, suitable as a reference for educational purposes. It is provided separately in Java, Python, C++, and is open source.

The code can be used for study, and as a solid basis for modification and extension. Consequently, the codebase optimizes for readability and avoids fancy logic, and does not target the best speed/memory/performance.

Home page with detailed description: https://www.nayuki.io/page/reference-arithmetic-coding

License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
The Software is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the Software or the use or other dealings in the Software.

reference-arithmetic-coding's People

Contributors

Stargazers

Watchers

Forkers

developster moversti balamurugan-r indrasat eemmanuel7 wrmsr meghagoyal2410 rchurianov hungnguyengoc bestplay9384 hodgka yuyobit nasirawon zoukaifa shuaihuaiyi stefanbirkner wardmohanna cqqqh samyoo9696 xuwangfmc benjamin-etheredge rongjiewang angeloobeta verdeeco jmswhtsn teamday chrislit charpoint cuthbertcai frankpolte mian8mian smileformylove ahmedmed7t amallia igordsmk hadeelyounes tongxyh jiuzhuanzhuan pdrgrc bigdavidwang imagingbook jnorthrup abcdpm vsmawoex happyxuwork jim-song amo5 albinchang chnvigny katrin408 zwmdr flasdev khloudpppp skn123 lijaiwei20180818 qianqiusuili vani88-git yuhongjiu weafre bwshan abhishekunnithan sunset-clouds todagami yifeipet back2yes akuljuneja dushishuang kzhang2 uyw4687 svengali galliumwang filipjosifoski icodein ravikumar-pawar vuiseng9 meikai1993 comydream zwj326 kamranabbaszadeibrahim mcognetta xnowbzhao mcourteaux hieule96 raitisvilums alexmojaki saramisajlovska tongdaxu noor0101 joaomrcarvalho xomadev jq12 pure-water paulhuang01 kaname-madoka18 ethankowalski vitaly-z namlee2 qihang-zhang nguyenxp24 bmillikanphd

reference-arithmetic-coding's Issues

Why flush() in python's arithmeticcoding's ArithmeticEncoder just need to write out 1?

I have read the original Mark Nelson's blog, and his implementation is below:

  pending_bits++;
  if ( low < MODEL::ONE_FOURTH )
    put_bit_plus_pending(0, pending_bits);
  else
    put_bit_plus_pending(1, pending_bits);

So, I am a little confused about it, could you kindly give me more explaination? In addition, thanks for your so clear code and sharing spirit!

About encode range

Can this implementation be used to encode 16bit images?

Can we add the LICENSE to the repository?

Hi @nayuki - thanks so much for the clean implementations, it's brilliant! :)

I noticed you've listed the licenses for your projects (thanks for making so many of the OSS!) and that this repository is listed under the MIT License.

Can the MIT License be added as a LICENSE to the repository? If so, I'm happy for you to reply with a Yes and I can do the work on adding and sending a pull request to minimize the work for you ^_^

As background, I'm interested in modifying this for research code but as I'm at a commercial research lab (though we publish openly, both papers and code! ^_^) I wanted to make sure there's no disagreement between this index page and a project’s main page (and I think having the LICENSE in the repo would make legal happier) :)

How to compress a list of signed integers?

Hello @nayuki, the codes are great!
For my application, I need to compress a list of signed integers (in python) by arithmetic coding (AC), and calculate the size of the compressed file (the smaller, the better). The question is, how can I firstly write these integer into an appropriate InputFile?

In your webpage, it says that 'a symbol is a non-negative integer' and 'e.g., a symbol limit of 4 means that the set of allowed symbols is {0, 1, 2, 3}'. For instance, if my symbols are {-20, -19, ..., 10}, I can firstly shift them to non-negative to obtain {0, 1, ..., 30}. Then according to the usage of the software, I need to write these integers into a file at first.

The problem is, if I write in (not necessarily in txt mode) content2 or content3 (please see the snippet below), the file size is almost double than a file with content1, since the symbols ' ' or ',' between integers are also compressed. But I cannot remove these delimiters as the integers are not only single digits from 0 to 9. So is there a proper way to write these integers into a file, without coding the delimiters between them?
I am new to entropy coding and any suggestion would be really appreciated.
Thanks!

import os, sys, time

print(os.getcwd())
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
FILE_DIR = os.path.join(BASE_DIR, 'my_file.txt')
OUT_DIR = os.path.join(BASE_DIR, 'out.txt')
REC_DIR = os.path.join(BASE_DIR, 'rec.txt')

python3 = sys.version_info.major >= 3

content1 = 1000 * '012345666666666666666666789999999999'
content2 = 1000 * '0 1 2 3 4 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 8 9 9 9 9 9 9 9 9 9 9'
content3 = 1000 * '0,1,2,3,4,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,7,8,9,9,9,9,9,9,9,9,9,9'

f = open(FILE_DIR, 'wt')
f.write(content1)
f.close()

t_start = time.time()
os.system('python arithmetic-compress.py {InputFile} {OutputFile}'.format(InputFile=FILE_DIR, OutputFile=OUT_DIR))
os.system('python arithmetic-decompress.py {InputFile} {OutputFile}'.format(InputFile=OUT_DIR, OutputFile=REC_DIR))
t_end = time.time() - t_start
rate_ori = os.path.getsize(FILE_DIR)  # file size in bytes
rate_out = os.path.getsize(OUT_DIR)
rate_rec = os.path.getsize(REC_DIR)
os.remove(OUT_DIR)  # remove a file
os.remove(REC_DIR)

parallel arithmetic coding

Thanks for sharing your implementation, it's really well-organized and easy to follow.
I am trying to use arithmetic coding for neural network-based text compression. It is very time-consuming when encoding large text corpus because it is essentially sequential. I did not find a parallel open-source implementation. So I am wondering if it is easy to be parallel. Could you please give me some hints?

compressing 2 streams into 1 BitOutputStream

Hello,
First of all Thank you very much for this great implementation. I am using your c++ arithmetic coder and I am trying to compress 2 different vectors of integers (with values between [0,255]) into 1 BitOutputStream, but unfortunately, I face with runtime_error("End of stream") when I am trying to decompress the second stream. Here is a sample code that I am using. Could you please tell me how I can change my code to solve the problem?

#include <stdlib.h>
#include <fstream>
#include <iostream>
#include <stdexcept>
#include "ArithmeticCoder.hpp"
#include "BitIoStream.hpp"
#include "FrequencyTable.hpp"
using namespace std;

static int compressPrecomputeFrequencyTable(vector<int>& in, BitOutputStream &bOut)
{

    SimpleFrequencyTable freqs(std::vector<uint32_t>(257, 0));
    freqs.increment(256);  // EOF symbol gets a frequency of 1
    for (unsigned int i=0; i < in.size(); i++) {
        int b = in[i];
        if (b < 0 || b > 255)
            throw std::logic_error("Assertion error");
        freqs.increment(static_cast<uint32_t>(b));
    }

    try {

        // Write frequency table
        for (uint32_t i = 0; i < 256; i++) {
            uint32_t freq = freqs.get(i);
            for (int j = 31; j >= 0; j--)
                bOut.write(static_cast<int>((freq >> j) & 1));  // Big endian
        }

        ArithmeticEncoder enc(32, bOut);
        for (unsigned int i=0; i < in.size(); i++) {
            // Read and encode one byte
            int symbol = in[i];
            if (symbol < 0 || symbol > 255)
                throw std::logic_error("Assertion error");
            enc.write(freqs, static_cast<uint32_t>(symbol));
        }

        enc.write(freqs, 256);  // EOF
        enc.finish(); // Flush remaining code bits

        return EXIT_SUCCESS;

    } catch (const char *msg) {
        std::cerr << msg << std::endl;
        return EXIT_FAILURE;
    }
}

static int decompressPrecomputeFrequencyTable(BitInputStream &bIn, vector<int> &out)
{
    // Perform file decompression

    try {

        // Read frequency table
        SimpleFrequencyTable freqs(std::vector<uint32_t>(257, 0));
        for (uint32_t i = 0; i < 256; i++) {
            uint32_t freq = 0;
            for (int j = 0; j < 32; j++)
                freq = (freq << 1) | bIn.readNoEof();  // Big endian
            freqs.set(i, freq);
        }
        freqs.increment(256);  // EOF symbol

        ArithmeticDecoder dec(32, bIn);
        while (true) {
            uint32_t symbol = dec.read(freqs);
            if (symbol == 256)  // EOF symbol
                break;
            int b = static_cast<int>(symbol);

            out.push_back(b);
        }
        return EXIT_SUCCESS;

    } catch (const char *msg) {
        std::cerr << msg << std::endl;
        return EXIT_FAILURE;
    }

}

int main()
{
    vector<int> encodingStream1 = {0, 0, 1, 1, 1, 2, 2, 254, 254, 255, 255, 10, 10};
    vector<int> encodingStream2 = {1,0,1};

    {
        int ok;
        std::ofstream out("enc.bin", std::ios::binary);
        BitOutputStream bOut(out);

        ok = compressPrecomputeFrequencyTable(encodingStream1, bOut);
        if(ok ==EXIT_FAILURE)
        {
            cerr << "can not compress stream 1" << endl;
            return 0;
        }
        ok = compressPrecomputeFrequencyTable(encodingStream2, bOut);
        if(ok ==EXIT_FAILURE)
        {
            cerr << "can not compress stream 2" << endl;
            return 0;
        }

        bOut.finish();


    }
    vector<int> decodedStream1;
    vector<int> decodedStream2;
    {

        int ok;
        std::ifstream in("enc.bin", std::ios::binary);
        BitInputStream bIn(in);

        ok =decompressPrecomputeFrequencyTable(bIn, decodedStream1);
        if(ok ==EXIT_FAILURE)
        {
            cerr << "can not decompress stream 1" << endl;
            return 0;
        }
        ok = decompressPrecomputeFrequencyTable(bIn, decodedStream2);
        if(ok ==EXIT_FAILURE)
        {
            cerr << "can not decompress stream 2" << endl;
            return 0;
        }

    }


    if(decodedStream1.size() != encodingStream1.size())
    {
        cerr << "Error in decoding. size do not matches in the first stream" << endl;
        return -1;
    }

    for(unsigned int i=0; i < decodedStream1.size(); i++)
    {
        if(decodedStream1[i] != encodingStream1[i])
        {
            cerr << "decoded symbol is not equal to encoded symbol for the first stream" << endl;
            return -1;
        }
    }

    if(decodedStream2.size() != encodingStream2.size())
    {
        cerr << "Error in decoding. size do not matches in the second stream" << endl;
        return -1;
    }

    for(unsigned int i=0; i < decodedStream2.size(); i++)
    {
        if(decodedStream2[i] != encodingStream2[i])
        {
            cerr << "decoded symbol is not equal to encoded symbol for the second stream" << endl;
            return -1;
        }
    }

    cout << "successfully decoded" << endl;

    return 0;
}

Proposal: to use std::Map for FrequencyTable

Hi @nayuki
I like your implementation of the arithmetic encoding and I think I can make it work with any UTF-8 characters if we add a Binary tree-based FrequencyTable implementation.

If you are accepting pull requests I can add the implementation.

Error in arithmetic decoder

Awesome resource here, thanks for putting this together! I think there's a small error in the decoder - specifically this line should be -1 instead of +1 (which would match the update function. Looks like the issue is present in the C++ code as well. I tested the decoding and changing this does fix the "code out of range" error I was getting.

PpmDecompress.decompress() freezes when decoding certain BitInputStream

Hi Nayuki,

I use your java PPM algorithm as a similarity measure between text strings to detect similar sentences.

When I compress "Hello World! " with PpmCompress.compress() and decompress the resulting OutputStream with PpmDecompress.decompress() again I get, as expected, "Hello World! ". When I do the same with the String "Hello World! Hello World! " then PpmDecompress.decompress() freezes and 1 core is 100% utilized. Maybe a thread is caught in a loop?

Here is my clojure code to recreate the bug:

(let [input "Hello World! Hello World! "
      baos-c (java.io.ByteArrayOutputStream.)
      charset "UTF-8"]
  (io.nayuki.PpmCompress/compress
    (-> input
        (.getBytes charset)
        (java.io.ByteArrayInputStream.))
    baos-c)
  (let [compressed (.toByteArray baos-c)
        _ (println (count compressed))
        intermediate (java.io.ByteArrayInputStream. compressed)
        baos-d (java.io.ByteArrayOutputStream.)]
    (println "before")
    (io.nayuki.PpmDecompress/decompress intermediate baos-d)
    (println "after")
    (String. (.toByteArray baos-d))))

The Bitout object has't close?

class ArithmeticEncoder(ArithmeticCoderBase):
    def finish(self):
        self.output.write(1)

the ArithmeticEncoder finish just write bit 1 ,but it has not close the Bitout object. It will decompress extra symbol.
We can fix the problem as

    def finish(self):
        self.output.write(1)
        self.output.close()

Error finishing encoder stream

(c++)I am using Encode in own stream and if I write 3 symbol, no byte is outputted.

void ArithmeticEncoder::finish() {
	output.write(1);
}

must also call BitOutputStream::finish()
in my test program:

#include <iostream>
#include <fstream>
#include "ArithmeticEncoder.h" //I changed *.hpp to my own files, but It is small change
#include "ArithmeticDecoder.h"
#include "SimpleFrequencyTable.h"

int main()
{
	std::filebuf fb;
	fb.open("test.txt", std::ios::out);
	std::ostream os(&fb);
	std::vector<uint32_t> freqvec;
	freqvec.push_back(10);
	freqvec.push_back(20);
	freqvec.push_back(30);
	SimpleFrequencyTable freqs(freqvec);
	BitOutputStream out(os);
	{
		ArithmeticEncoder encoder(9, out);
		encoder.write(freqs, 0);
		encoder.write(freqs, 1);
		encoder.write(freqs, 2);	
		encoder.finish();
	}
	fb.close();
	std::filebuf fb1;
	fb1.open("test.txt", std::ios::in);
	std::istream in(&fb1);
	BitInputStream inp(in);
	ArithmeticDecoder decoder(9, inp);
	for (int i=0; i<4; i++) //4 because I test symbol out of stream
	{
		uint32_t symbol = decoder.read(freqs);
		printf("symbol=%d\n", symbol);
	}
}

Problem with processing ByteBuffers - works with Huffman coder, but doesn't with Arithmetic coder

Hello.

I use Huffman coder by nayuki in my small application and it works flawlessly. Since Huffman and Arithmetic coders by nayuki both have the same interface, I tried replacing Huffman coder with Arithmetic and compare compression ratios and times.

However, it doesn't work. This should be just one life of code, right?

I created a simple example which is basically how my application works, but very simplified. (see below)

When I use Huffman coder, I get output "abc", when I use arithmetic coder, I get "abc" followed by random characters.

I can see the same behavior in my application when I try to process whole files like text or pictures.

I am getting really frustrated, I don't know how to proceed.

Does anybody know how to solve this? I have to take ByteBuffers as input, compress with AC/HC and output List< byte[]> and another way around with decompression. This is because in my application I chain multiple algorithms like MTF and RLE which takes different arguments.

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class DummyClass {
    public static void main(String[] args) throws IOException {
        List<byte[]> list = new ArrayList();

        ByteBuffer byteBuffer = ByteBuffer.allocate(3);
        byteBuffer.put((byte)'a');
        byteBuffer.put((byte)'b');
        byteBuffer.put((byte)'c');

        //COMPRESS

        OutputStream os = new ByteArrayOutputStream();
        InputStream in = new ByteArrayInputStream(byteBuffer.array());
        BitOutputStream out = new BitOutputStream(os);

        //======================================================================================
        //AdaptiveHuffmanCompress.compress(in,out); <---- THIS WORKS
        AdaptiveArithmeticCompress.compress(in, out);
        //======================================================================================

        list.add(((ByteArrayOutputStream) os).toByteArray());

        //DECOMPRESS

        BitInputStream in2 = new BitInputStream(new ByteArrayInputStream(list.get(0)));
        OutputStream out2 = new ByteArrayOutputStream();

        //======================================================================================
//        try{
//            AdaptiveHuffmanDecompress.decompress(in2, out2); <---- THIS WORKS
//        }
//        catch (EOFException ex){
//            //IDK why is this happening but it doesn't affect the funcionality
//        }
        AdaptiveArithmeticDecompress.decompress(in2, out2);
        //======================================================================================

        ByteBuffer output = ByteBuffer.wrap(((ByteArrayOutputStream) out2).toByteArray());

        for (int i = 0; i < output.capacity(); i++){
            System.out.println((char)output.get());
        }
    }
}

Mismatch decoding encoded stream random bytes

Steps:

my test c++ program returns that in some cases decoded symbols mismatch with random()

#include <iostream>
#include <fstream>
#include "ArithmeticEncoder.h"
#include "ArithmeticDecoder.h"
#include "FlatFrequencyTable.h"
#include "SimpleFrequencyTable.h"
#include "FenwickFrequencyTable.h"

int testforErrors(int seed)
{
	std::filebuf fb;
	fb.open("test.txt", std::ios::out);
	std::ostream os(&fb);
	BitOutputStream out(os);
	std::vector<uint32_t> freqvec;
	for (int i = 0; i < 10; i++)
		freqvec.push_back(i+1);
	//SimpleFrequencyTable freqs(freqvec);
	FlatFrequencyTable freqs(10);
	//FenwickFrequencyTable freqs(freqvec);//my ext: https://gist.github.com/borneq/8598b8693f7c72d0c1cc2452e1094cc1
	ArithmeticEncoder encoder(33, out);
	srand(seed);
	for (int i = 0; i < 100; i++)
	{
		encoder.write(freqs, rand()% 10);
	}	
	encoder.finish();

	fb.close();
	std::filebuf fb1;
	fb1.open("test.txt", std::ios::in);
	std::istream in(&fb1);
	BitInputStream inp(in);
	ArithmeticDecoder decoder(33, inp);
	srand(seed);
	for (int i = 0; i < 100; i++)
	{
		uint32_t symbol = decoder.read(freqs);
		int randsym = rand() % 10;
		if (symbol != randsym)
			return i;			
	}
	return -1;
}

int main()
{
	for (int i = 0; i < 100; i++)
	{
		int pos = testforErrors(i);
		if (pos>=0 && pos<20)
			printf("%d: %d\n", i, pos);
	}
}