In the current implementation, BZ3OmpCompressor does not perform a compression until its internal buffer holds enough data for all threads to work on, which means it has to wait until num_threads blocks have been received via the compress method; before that it simply returns b"". This is called lazy compression, and it guarantees maximum throughput. That is fine for the compressor, but the decompressor is a different story: it has no idea when the stream ends. If the decompressor buffered its input the same way, the caller might conclude that the input is not enough for a block decompression and drop it. So the decompressor must assume that the input could end at any time. As a result, it cannot buffer input to perform a multi-threaded decompression: it decompresses as soon as its buffer holds enough for a single thread (even though it would be better to receive more blocks at once), which degenerates it into a single-threaded decompressor. An effective way to avoid this is to feed it at least num_threads blocks at a time. For BZIP3File usage, you can
from bz3 import compression
compression.BUFFER_SIZE = 300*10**6
to increase the buffer size, so that more blocks are received at the same time.
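The batching behavior described above can be modeled with a short sketch. This is a hypothetical toy class, not the actual bz3 internals: input is buffered until num_threads full blocks are available, and only then is a whole batch handed out for parallel processing; until that point the caller gets nothing back.

```python
class LazyBlockBuffer:
    """Toy model (assumed, not bz3's real code) of the lazy-compress
    strategy: buffer input until num_threads full blocks are ready."""

    def __init__(self, block_size: int, num_threads: int) -> None:
        self.block_size = block_size
        self.num_threads = num_threads
        self.buf = bytearray()

    def feed(self, data: bytes) -> list[bytes]:
        self.buf += data
        needed = self.block_size * self.num_threads
        if len(self.buf) < needed:
            return []  # lazy: not enough for every thread yet
        # Slice one full batch of blocks off the front of the buffer.
        blocks = [bytes(self.buf[i:i + self.block_size])
                  for i in range(0, needed, self.block_size)]
        del self.buf[:needed]  # keep any partial remainder buffered
        return blocks
```

A decompressor cannot safely use this strategy, because holding data back while waiting for a full batch would look to the caller like the data was silently dropped if the stream happens to end early.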
The GitHub description has a typo: it says "python bingding for bzip3", but it should presumably say "python binding for bzip3" or "python bindings for bzip3".
Why is it so slow compared to the original implementation? The compression speed is only about 4 MB/s, much slower than the original. Maybe we should replace the bytearray with a uint8_t* buffer managed via PyMem_Realloc.