
cramjam's Introduction

cramjam


API Documentation

Install

pip install --upgrade cramjam  # Requires no Python or system dependencies!

CLI

A command-line interface is available as cramjam-cli.

libcramjam

A Rust crate and C-friendly library are available as libcramjam.


Extremely thin and easy-to-install Python bindings to de/compression algorithms in Rust. Allows using algorithms such as Snappy without any system or other Python dependencies.


Benchmarks

Some basic benchmarks are available in the benchmarks directory.


Available algorithms:

  • Snappy: cramjam.snappy
  • Brotli: cramjam.brotli
  • Bzip2: cramjam.bzip2
  • Lz4: cramjam.lz4
  • Gzip: cramjam.gzip
  • Deflate: cramjam.deflate
  • ZSTD: cramjam.zstd
  • XZ / LZMA: cramjam.xz
  • Blosc2: cramjam.experimental.blosc2

All available for use as:

>>> import cramjam
>>> import numpy as np
>>> compressed = cramjam.snappy.compress(b"bytes here")
>>> decompressed = cramjam.snappy.decompress(compressed)
>>> decompressed
cramjam.Buffer(len=10)  # an object which implements the buffer protocol
>>> bytes(decompressed)
b"bytes here"
>>> np.frombuffer(decompressed, dtype=np.uint8)
array([ 98, 121, 116, 101, 115,  32, 104, 101, 114, 101], dtype=uint8)

The API is cramjam.<compression-variant>.compress/decompress and accepts bytes, bytearray, numpy.array, cramjam.File, cramjam.Buffer, and memoryview objects.
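These accepted types can be mixed freely. For instance, a quick sketch using only the API shown above:

import cramjam

# bytearray in, Buffer out; the returned Buffer implements the buffer
# protocol, so a memoryview over it can be fed straight back in.
compressed = cramjam.snappy.compress(bytearray(b"bytes here"))
decompressed = cramjam.snappy.decompress(memoryview(compressed))
assert bytes(decompressed) == b"bytes here"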

de/compress_into

Additionally, all variants support decompress_into and compress_into. For example:

>>> import numpy as np
>>> from cramjam import snappy, Buffer
>>>
>>> data = np.frombuffer(b'some bytes here', dtype=np.uint8)
>>> data
array([115, 111, 109, 101,  32,  98, 121, 116, 101, 115,  32, 104, 101,
       114, 101], dtype=uint8)
>>>
>>> compressed = Buffer()
>>> snappy.compress_into(data, compressed)
33  # 33 bytes written to compressed buffer
>>>
>>> compressed.tell()  # Where is the buffer position?
33  # goodie!
>>>
>>> compressed.seek(0)  # Go back to the start of the buffer so we can prepare to decompress
0
>>> decompressed = b'0' * len(data)  # let's write to `bytes` as output
>>> decompressed
b'000000000000000'
>>>
>>> snappy.decompress_into(compressed, decompressed)
15  # 15 bytes written to decompressed
>>> decompressed
b'some bytes here'


cramjam's Issues

Please coordinate PyPI and crates.io releases if possible

I’m almost at the point of introducing a python-cramjam package to Fedora Linux. I’m currently updating my proposed package from 2.8.0 to 2.8.1.

I am able to use bundled or vendored dependencies when I need to under our packaging guidelines, but it’s preferred to build against system copies. I have already introduced a rust-libcramjam package, and I can build a python-cramjam-2.8.0 package using it (in lieu of the copy included in the GitHub archive or PyPI sdist).

With 2.8.1, though, the included libcramjam is updated to 0.2.0, which hasn’t yet been released on crates.io. I can temporarily switch to using the bundled libcramjam again, but it would make life easier if the version used by the Python extension were reliably available on crates.io. (Rust library crates that are published on crates.io are required to be packaged from crates.io sources, so I can’t just package a snapshot from GitHub as rust-libcramjam.)

What do you think? Does this sound reasonable?

cargo_audit does not like brotli-sys

When trying to package pyrus-cramjam for openSUSE Tumbleweed, I get this:

info:obs-service-cargo_audit: Running OBS Source Service : obs-service-cargo_audit
ERROR:obs-service-cargo_audit:  possible vulnerabilties: 1
ERROR:obs-service-cargo_audit: /tmp/tmptxa26w30/pyrus-cramjam/Cargo.lock
ERROR:obs-service-cargo_audit: For more information you SHOULD inspect the output of cargo audit manually
ERROR:obs-service-cargo_audit: * RUSTSEC-2021-0131 -> crate: brotli-sys, cvss: None, class: ['memory-corruption']
ERROR:obs-service-cargo_audit: ⚠️  Vulnerabilities may have been found. You must review these.
Aborting: service call failed:  /usr/lib/obs/service/cargo_audit --srcdir pyrus-cramjam --outdir /home/ben/src/osc/home:bnavigator:branches:devel:languages:python:numeric/python-cramjam/tmpegree2c6.cargo_audit.service

obs-service-cargo_audit uses the local Cargo.lock file to determine whether the related sources in a Rust application have known security vulnerabilities. If vulnerabilities are found, the source service raises an alert, allowing you to update and to help upstream update their sources.

https://github.com/milesgranger/pyrus-cramjam/blob/29d9e3b4e1e116761637b7a0f3ac8830f2f1541b/Cargo.lock#L23-L31

The cited security advisory is here: https://rustsec.org/advisories/RUSTSEC-2021-0131.html, bitemyapp/brotli2-rs#45

blosc?

I may have asked this before....

numcodecs from zarr may be interested in depending on cramjam to simplify its build process ( zarr-developers/numcodecs#464 ). The default and most common codec used by zarr for new datasets is blosc v1. Blosc v2 also exists and is able to read v1. Is there any interest in adding blosc (v1 or v2) to cramjam? There seem to be quite a few crates in the area.
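(Blosc2 support has since landed as cramjam.experimental.blosc2, per the algorithm list above. Assuming it follows the same de/compress API as the other variants, usage would look like:)

>>> import cramjam
>>> compressed = cramjam.experimental.blosc2.compress(b"bytes here")
>>> bytes(cramjam.experimental.blosc2.decompress(compressed))
b'bytes here'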

Python test test_variants_different_dtypes[brotli] sometimes times out

Environment:

  • Fedora Linux 39
  • x86_64 architecture
  • Python 3.12.1

I originally saw this while working on a python-cramjam package for Fedora Linux, but I’m able to reproduce it in a simple virtual environment.

To reproduce:

Check out current master, a1c0c02, and cd to the cramjam-python/ directory.

rm -rf _e && python3 -m build && python3 -m venv _e && . _e/bin/activate && pip install ./dist/cramjam-2.7.0-cp312-cp312-linux_x86_64.whl && pip install numpy pytest pytest-xdist hypothesis && python3 -m pytest -v -n 16 tests/ && deactivate

Sometimes, all tests pass:

================================================ 564 passed in 34.75s ================================================

…but if I run the command repeatedly, I often see this:

====================================================== FAILURES ======================================================
_______________________________________ test_variants_different_dtypes[brotli] _______________________________________
[gw1] linux -- Python 3.12.1 /home/ben/src/forks/cramjam/cramjam-python/_e/bin/python3

variant_str = 'brotli'

    @pytest.mark.parametrize("variant_str", VARIANTS)
>   @given(arr=st_np.arrays(st_np.scalar_dtypes(), shape=st.integers(0, int(1e5))))

tests/test_variants.py:35: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

args = ('brotli', array([0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j])), kwargs = {}
arg_drawtime = 0.0020151169737800956, initial_draws = 1, start = 463383.986474383, result = None
finish = 463385.225831358, internal_draw_time = 0, runtime = datetime.timedelta(seconds=1, microseconds=239357)
current_deadline = timedelta(milliseconds=1000)

    @proxies(self.test)
    def test(*args, **kwargs):
        arg_drawtime = sum(data.draw_times)
        initial_draws = len(data.draw_times)
        start = time.perf_counter()
        try:
            result = self.test(*args, **kwargs)
        finally:
            finish = time.perf_counter()
            internal_draw_time = sum(data.draw_times[initial_draws:])
            runtime = datetime.timedelta(
                seconds=finish - start - internal_draw_time
            )
            self._timing_features = {
                "time_running_test": finish - start - internal_draw_time,
                "time_drawing_args": arg_drawtime,
                "time_interactive_draws": internal_draw_time,
            }
    
        current_deadline = self.settings.deadline
        if not is_final:
            current_deadline = (current_deadline // 4) * 5
        if runtime >= current_deadline:
>           raise DeadlineExceeded(runtime, self.settings.deadline)
E           hypothesis.errors.DeadlineExceeded: Test took 1239.36ms, which exceeds the deadline of 1000.00ms
E           Falsifying example: test_variants_different_dtypes(
E               variant_str='brotli',
E               arr=array([0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]),
E           )

_e/lib64/python3.12/site-packages/hypothesis/core.py:845: DeadlineExceeded
----------------------------------------------------- Hypothesis -----------------------------------------------------
WARNING: Hypothesis has spent more than five minutes working to shrink a failing example, and stopped because it is making very slow progress.  When you re-run your tests, shrinking will resume and may take this long before aborting again.
PLEASE REPORT THIS if you can provide a reproducing example, so that we can improve shrinking performance for everyone.
============================================== short test summary info ===============================================
FAILED tests/test_variants.py::test_variants_different_dtypes[brotli] - hypothesis.errors.DeadlineExceeded: Test took 1239.36ms, which exceeds the deadline of 1000.00ms
===================================== 1 failed, 563 passed in 335.91s (0:05:35) ======================================

… or this:

====================================================== FAILURES ======================================================
_______________________________________ test_variants_different_dtypes[brotli] _______________________________________
[gw1] linux -- Python 3.12.1 /home/ben/src/forks/cramjam/cramjam-python/_e/bin/python3

variant_str = 'brotli'

    @pytest.mark.parametrize("variant_str", VARIANTS)
>   @given(arr=st_np.arrays(st_np.scalar_dtypes(), shape=st.integers(0, int(1e5))))

tests/test_variants.py:35: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

args = ('brotli', array([0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j])), kwargs = {}
arg_drawtime = 0.0022371799568645656, initial_draws = 1, start = 464013.190208792, result = None
finish = 464014.414762949, internal_draw_time = 0, runtime = datetime.timedelta(seconds=1, microseconds=224554)
current_deadline = timedelta(milliseconds=1000)

    @proxies(self.test)
    def test(*args, **kwargs):
        arg_drawtime = sum(data.draw_times)
        initial_draws = len(data.draw_times)
        start = time.perf_counter()
        try:
            result = self.test(*args, **kwargs)
        finally:
            finish = time.perf_counter()
            internal_draw_time = sum(data.draw_times[initial_draws:])
            runtime = datetime.timedelta(
                seconds=finish - start - internal_draw_time
            )
            self._timing_features = {
                "time_running_test": finish - start - internal_draw_time,
                "time_drawing_args": arg_drawtime,
                "time_interactive_draws": internal_draw_time,
            }
    
        current_deadline = self.settings.deadline
        if not is_final:
            current_deadline = (current_deadline // 4) * 5
        if runtime >= current_deadline:
>           raise DeadlineExceeded(runtime, self.settings.deadline)
E           hypothesis.errors.DeadlineExceeded: Test took 1224.55ms, which exceeds the deadline of 1000.00ms
E           Falsifying example: test_variants_different_dtypes(
E               variant_str='brotli',
E               arr=array([0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]),
E           )

_e/lib64/python3.12/site-packages/hypothesis/core.py:845: DeadlineExceeded
============================================== short test summary info ===============================================
FAILED tests/test_variants.py::test_variants_different_dtypes[brotli] - hypothesis.errors.DeadlineExceeded: Test took 1224.55ms, which exceeds the deadline of 1000.00ms
=========================================== 1 failed, 563 passed in 31.16s ===========================================

In my testing, it seems like increasing the deadline, e.g.

diff --git a/cramjam-python/tests/test_variants.py b/cramjam-python/tests/test_variants.py
index 4ee4ca3..97e287a 100644
--- a/cramjam-python/tests/test_variants.py
+++ b/cramjam-python/tests/test_variants.py
@@ -12,7 +12,7 @@ VARIANTS = ("snappy", "brotli", "bzip2", "lz4", "gzip", "deflate", "zstd")
 
 # Some OS can be slow or have higher variability in their runtimes on CI
 settings.register_profile(
-    "local", deadline=timedelta(milliseconds=1000), max_examples=100
+    "local", deadline=timedelta(milliseconds=10000), max_examples=100
 )
 settings.register_profile("CI", deadline=None, max_examples=25)
 if os.getenv("CI"):

is enough to resolve the problem. Note that I am testing on a fairly fast workstation (AMD Ryzen 9 5950X); I haven’t yet tried this on slower CI machines, particularly those of other architectures like ppc64le.
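An alternative to raising the profile-wide deadline would be to relax it only for the affected test. A sketch using Hypothesis's standard settings decorator (the test body here is illustrative, not the actual test):

from datetime import timedelta

from hypothesis import given, settings, strategies as st

import cramjam

@settings(deadline=timedelta(seconds=10))  # or deadline=None to disable it
@given(data=st.binary(max_size=100_000))
def test_brotli_roundtrip(data):
    # brotli is the variant that exceeds the 1s deadline in these reports
    assert bytes(cramjam.brotli.decompress(cramjam.brotli.compress(data))) == data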

lzma / xz support?

Hi!

Is it planned or already in the works to have lzma support in cramjam?

A bunch of high energy particle physicists would be rather grateful for that capability in this package.
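(XZ / LZMA support is now listed among the available algorithms above as cramjam.xz. Assuming it mirrors the other variants:)

>>> import cramjam
>>> compressed = cramjam.xz.compress(b"bytes here")
>>> bytes(cramjam.xz.decompress(compressed))
b'bytes here'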

Support for memoryview and PickleBuffer

As of 2.7.0rc1, cramjam seems to be incompatible with memoryview and PickleBuffer objects.
This is a blocker to adoption in dask/distributed.

>>> cramjam.bzip2.compress(memoryview(b"123"))
TypeError: argument 'data': failed to extract enum BytesType ('bytes | bytearray | File | Buffer | numpy')
- variant Bytes (bytes): TypeError: failed to extract field BytesType::Bytes.0, caused by TypeError: 'memoryview' object cannot be converted to 'PyBytes'
- variant ByteArray (bytearray): TypeError: failed to extract field BytesType::ByteArray.0, caused by TypeError: 'memoryview' object cannot be converted to 'PyByteArray'
- variant RustyFile (File): TypeError: failed to extract field BytesType::RustyFile.0, caused by TypeError: 'memoryview' object cannot be converted to 'File'
- variant RustyBuffer (Buffer): TypeError: failed to extract field BytesType::RustyBuffer.0, caused by TypeError: 'memoryview' object cannot be converted to 'Buffer'
- variant NumpyArray (numpy): TypeError: failed to extract field BytesType::NumpyArray.0, caused by TypeError: 'memoryview' object cannot be converted to 'PyArray<T, D>'

>>> import pickle, numpy
>>> a = numpy.ones(10)
>>> buffers = []
>>> pickle.dumps(a, protocol=5, buffer_callback=buffers.append)
>>> buffers
[<pickle.PickleBuffer at 0x7ff1f5b28540>]
>>> cramjam.bzip2.compress(buffers[0])
TypeError: argument 'data': failed to extract enum BytesType ('bytes | bytearray | File | Buffer | numpy')
- variant Bytes (bytes): TypeError: failed to extract field BytesType::Bytes.0, caused by TypeError: 'PickleBuffer' object cannot be converted to 'PyBytes'
- variant ByteArray (bytearray): TypeError: failed to extract field BytesType::ByteArray.0, caused by TypeError: 'PickleBuffer' object cannot be converted to 'PyByteArray'
- variant RustyFile (File): TypeError: failed to extract field BytesType::RustyFile.0, caused by TypeError: 'PickleBuffer' object cannot be converted to 'File'
- variant RustyBuffer (Buffer): TypeError: failed to extract field BytesType::RustyBuffer.0, caused by TypeError: 'PickleBuffer' object cannot be converted to 'Buffer'
- variant NumpyArray (numpy): TypeError: failed to extract field BytesType::NumpyArray.0, caused by TypeError: 'PickleBuffer' object cannot be converted to 'PyArray<T, D>'
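For reference, the desired behavior matches the API description at the top of this page, where memoryview is among the accepted types. A sketch, assuming any buffer-protocol object is accepted:

import pickle

import numpy as np

import cramjam

# memoryview should round-trip cleanly.
mv = memoryview(b"123")
assert bytes(cramjam.bzip2.decompress(cramjam.bzip2.compress(mv))) == b"123"

# PickleBuffer exposes the buffer protocol, so it should be accepted too
# (buffer_callback requires pickle protocol 5).
buffers = []
pickle.dumps(np.ones(10), protocol=5, buffer_callback=buffers.append)
compressed = cramjam.bzip2.compress(buffers[0])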

Handle numpy import error if numpy not installed.

Problem

The expectation was to optionally support numpy input/output, while still working fine if the user only wanted to make use of bytes/bytearray/Buffer/File objects. This, however, does not appear to be the case.

Example with an environment without numpy installed

>>> import cramjam
>>> compressed = cramjam.snappy.compress(b'bytes')
>>> out = cramjam.snappy.decompress(compressed)
thread '<unnamed>' panicked at 'Failed to import numpy module', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/numpy-0.13.1/src/npyffi/mod.rs:16:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
pyo3_runtime.PanicException: Failed to import numpy module
>>> 

Solution?

Not sure, but hoping there is a way to not force a numpy install if the user doesn't want it.
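The intended behavior, sketched for an environment where numpy is not installed:

import cramjam  # no numpy available in this environment

# Pure bytes/Buffer usage should round-trip without ever importing numpy.
compressed = cramjam.snappy.compress(b"bytes")
out = cramjam.snappy.decompress(compressed)
assert bytes(out) == b"bytes"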

Compatibility tests with variants

We assert that round-tripped data matches the original, but the compatibility of de/compression output with the original variants has only been checked during manual testing; these checks should be their own tests.

i.e.

import gzip
import cramjam

data = b"some bytes"
assert gzip.decompress(cramjam.gzip.compress(data)) == data
assert cramjam.gzip.decompress(gzip.compress(data)) == data
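A sketch of what such dedicated tests might look like, pairing each variant with its stdlib counterpart (the pairings are assumptions about which stdlib modules match each variant):

import bz2
import gzip

import pytest

import cramjam

@pytest.mark.parametrize(
    "cj_variant, std_compress, std_decompress",
    [
        (cramjam.gzip, gzip.compress, gzip.decompress),
        (cramjam.bzip2, bz2.compress, bz2.decompress),
    ],
)
def test_stdlib_compatibility(cj_variant, std_compress, std_decompress):
    data = b"some bytes"
    # cramjam output must be readable by the reference implementation...
    assert std_decompress(bytes(cj_variant.compress(data))) == data
    # ...and vice versa.
    assert bytes(cj_variant.decompress(std_compress(data))) == data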

zlib-ng?

According to the flate2 crate's README, you can opt to compile with zlib-ng for better performance. Is it a good idea?

The length argument cannot be a kwarg

In the docstring for gzip decompression:

```python
>>> cramjam.gzip.decompress(compressed_bytes, output_len=Optional[int])
```

it turns out that cramjam.gzip.decompress(compressed_bytes) is OK and cramjam.gzip.decompress(compressed_bytes, length) is OK, but you cannot explicitly pass output_len= as a keyword argument. This is weird behaviour on pyo3's part, but the doc should show invocations that actually work.
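Concretely, a sketch of the reported behavior:

>>> import cramjam
>>> compressed = cramjam.gzip.compress(b"bytes here")
>>> cramjam.gzip.decompress(compressed)                  # OK
>>> cramjam.gzip.decompress(compressed, 10)              # OK, length passed positionally
>>> cramjam.gzip.decompress(compressed, output_len=10)   # TypeError at the time of this report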

Python: tests/test_variants.py::test_variants_different_dtypes randomly fail with `hypothesis.errors.FailedHealthCheck: Examples routinely exceeded the max allowable size`

When running the Python test suite, I'm frequently getting health check failures on different tests. For example:

$ python -m pytest tests/test_variants.py::test_variants_different_dtypes
========================================================= test session starts =========================================================
platform linux -- Python 3.11.8, pytest-8.0.2, pluggy-1.4.0
rootdir: /tmp/cramjam/cramjam-python
plugins: hypothesis-6.98.13, xdist-3.5.0
collected 8 items                                                                                                                     

tests/test_variants.py FF......                                                                                                 [100%]

============================================================== FAILURES ===============================================================
_______________________________________________ test_variants_different_dtypes[snappy] ________________________________________________

variant_str = 'snappy'

    @pytest.mark.parametrize("variant_str", VARIANTS)
>   @given(arr=st_np.arrays(st_np.scalar_dtypes(), shape=st.integers(0, int(1e4))))
E   hypothesis.errors.FailedHealthCheck: Examples routinely exceeded the max allowable size. (20 examples overran while generating 8 valid ones). Generating examples this large will usually lead to bad results. You could try setting max_size parameters on your collections and turning max_leaves down on recursive() calls.
E   See https://hypothesis.readthedocs.io/en/latest/healthchecks.html for more information about this. If you want to disable just this health check, add HealthCheck.data_too_large to the suppress_health_check settings for this test.

tests/test_variants.py:42: FailedHealthCheck
------------------------------------------------------------- Hypothesis --------------------------------------------------------------
You can add @seed(297719150791330741877614251129208577971) to this test or run pytest with --hypothesis-seed=297719150791330741877614251129208577971 to reproduce this failure.
_______________________________________________ test_variants_different_dtypes[brotli] ________________________________________________

variant_str = 'brotli'

    @pytest.mark.parametrize("variant_str", VARIANTS)
>   @given(arr=st_np.arrays(st_np.scalar_dtypes(), shape=st.integers(0, int(1e4))))
E   hypothesis.errors.FailedHealthCheck: Examples routinely exceeded the max allowable size. (20 examples overran while generating 9 valid ones). Generating examples this large will usually lead to bad results. You could try setting max_size parameters on your collections and turning max_leaves down on recursive() calls.
E   See https://hypothesis.readthedocs.io/en/latest/healthchecks.html for more information about this. If you want to disable just this health check, add HealthCheck.data_too_large to the suppress_health_check settings for this test.

tests/test_variants.py:42: FailedHealthCheck
------------------------------------------------------------- Hypothesis --------------------------------------------------------------
You can add @seed(151523981063034797703438801667290859669) to this test or run pytest with --hypothesis-seed=151523981063034797703438801667290859669 to reproduce this failure.
======================================================= short test summary info =======================================================
FAILED tests/test_variants.py::test_variants_different_dtypes[snappy] - hypothesis.errors.FailedHealthCheck: Examples routinely exceeded the max allowable size. (20 examples overran while generating 8 v...
FAILED tests/test_variants.py::test_variants_different_dtypes[brotli] - hypothesis.errors.FailedHealthCheck: Examples routinely exceeded the max allowable size. (20 examples overran while generating 9 v...
=============================================== 2 failed, 6 passed in 153.35s (0:02:33) ===============================================

I can reproduce with 496c1ab.

My reproducer:

pip install . pytest pytest-xdist hypothesis numpy
python -m pytest tests/test_variants.py::test_variants_different_dtypes

pyo3_runtime.PanicException: Failed to import NumPy module

When I call snappy.compress with an invalid argument cramjam displays this cryptic exception mentioning NumPy:

>>> from cramjam import snappy
>>> d = snappy.compress(b'b'*1024) # bytes works fine
>>> d
cramjam.Buffer(len=70)
>>> d = snappy.compress(1) # Passing invalid argument produces cryptic error:
thread '<unnamed>' panicked at 'Failed to import NumPy module', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/numpy-0.17.2/src/npyffi/mod.rs:22:9
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
pyo3_runtime.PanicException: Failed to import NumPy module

Env info:

$  python -m pip list
Package    Version
---------- -------
cramjam    2.6.2
pip        22.3
setuptools 65.5.0
$ python
Python 3.11.0 (main, Apr  4 2023, 20:04:59) [GCC 8.4.1 20200928 (Red Hat 8.4.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.

Please consider adding license metadata to PyPI

Hi there,

I'm looking at using fastparquet, which has a dependency on cramjam. I'm also using liccheck to check that any Python packages we use are compatible with our license needs. It looks like this package is licensed under MIT; however, because this information is not included in the PyPI package metadata, liccheck flags cramjam as "unknown license" (see the META section of cramjam compared with e.g. fastparquet).

I'm not familiar with maturin (which I think is used to build this package), but possibly you need to add something like https://www.maturin.rs/metadata.html#python-project-metadata or https://stackoverflow.com/a/73274312/5179470 ?

Many thanks for any help, and for this great package! :)

use with cargo?

How would I import this as a crate in another Python-facing Rust package? I would like to use RustyBuffer as a zero-copy way of passing read()-able byte chunks to Python within rfsspec. Later, I would also use the (stream) de/compressors. Naively adding cramjam to my Cargo.toml causes a long list of compiler errors related to linker symbols.

Incorrect docstring, and Compressor

cramjam.lz4.Compressor's docstring is "Snappy Compressor object for streaming compression".

I found this because I wanted to know whether the stream compressor is for the simple or block variant; both should be available, no? The output appears closer to the block version, but not identical.

Same comment for snappy, where I find Compressor is for the framed format, with no raw variant.

Panic on invalid input is bad

In [1]: import cramjam

In [2]: cramjam.snappy_decompress(b'abc')
Out[2]: b''

In [3]: cramjam.snappy_decompress(b'abcdefgh')
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { kind: Other, error: StreamHeader { byte: 97 } }', src/snappy.rs:21:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
[1]    74251 abort      py -m IPython

I would expect it to raise a Python exception instead of aborting the process.
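The fix would be to surface such failures as a catchable Python exception, e.g. cramjam.DecompressionError (the error type seen elsewhere on this page). A sketch of the expected behavior:

>>> import cramjam
>>> try:
...     cramjam.snappy.decompress(b"abcdefgh")
... except cramjam.DecompressionError as exc:
...     print(f"raised cleanly: {exc}")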

Add output_len and _into to snappy raw

Snappy framed compress allows output_len and has the _into variants, but snappy_raw does not. Would be nice!

btw: am I right in thinking (see the sketch after this list)

  • snappy.compress -> cramjam.snappy.compress_raw
  • snappy.StreamCompressor -> cramjam.snappy.compress?
    (snappy.stream_compress uses StreamCompressor, but works on file-likes)
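A sketch of that mapping, assuming the _raw functions pair up as described:

import cramjam

data = b"some bytes here"

# Raw block format: what python-snappy's snappy.compress produces.
raw = cramjam.snappy.compress_raw(data)
assert bytes(cramjam.snappy.decompress_raw(raw)) == data

# Framed/streaming format: cramjam's default compress/decompress.
framed = cramjam.snappy.compress(data)
assert bytes(cramjam.snappy.decompress(framed)) == data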

Support PyPy output of `bytes` and `memoryview` for de/compress_into functions

When running the test suite using PyPy3.10 7.3.15 release, I'm getting lots of test failures. For example:

______________________________ test_obj_api[File] ______________________________
[gw0] linux -- Python 3.10.13 /tmp/cramjam/cramjam-python/.venv/bin/python

tmpdir = local('/tmp/pytest-of-mgorny/pytest-3/popen-gw0/test_obj_api_File_0')
Obj = <class 'File'>

    @pytest.mark.parametrize("Obj", (File, Buffer))
    def test_obj_api(tmpdir, Obj):
        if isinstance(Obj, File):
            buf = File(str(tmpdir.join("file.txt")))
        else:
            buf = Buffer()
    
        assert buf.write(b"bytes") == 5
        assert buf.tell() == 5
        assert buf.seek(0) == 0
        assert buf.read() == b"bytes"
        assert buf.seek(-1, 2) == 4  # set one byte backwards from end; position 4
        assert buf.read() == b"s"
        assert buf.seek(-2, whence=1) == 3  # set two bytes from current (end): position 3
        assert buf.read() == b"es"
    
        with pytest.raises(ValueError):
            buf.seek(1, 3)  # only 0, 1, 2 are valid seek from positions
    
        for out in (
            b"12345",
            bytearray(b"12345"),
            File(str(tmpdir.join("test.txt"))),
            Buffer(),
        ):
            buf.seek(0)
    
            expected = b"bytes"
    
            buf.readinto(out)
    
            # Will update the output buffer
            if isinstance(out, (File, Buffer)):
                out.seek(0)
                assert out.read() == expected
            elif isinstance(out, bytearray):
                assert out == bytearray(expected)
            else:
>               assert out == expected
E               AssertionError: assert b'12345' == b'bytes'
E                 
E                 At index 0 diff: b'1' != b'b'
E                 Use -v to get more diff

tests/test_rust_io.py:44: AssertionError
____________________ test_variants_different_dtypes[snappy] ____________________
[gw0] linux -- Python 3.10.13 /tmp/cramjam/cramjam-python/.venv/bin/python

variant_str = 'snappy'

    @pytest.mark.parametrize("variant_str", VARIANTS)
>   @given(arr=st_np.arrays(st_np.scalar_dtypes(), shape=st.integers(0, int(1e4))))

tests/test_variants.py:42: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

variant_str = 'snappy', arr = array([], shape=(2, 0), dtype=bool)

    @pytest.mark.parametrize("variant_str", VARIANTS)
    @given(arr=st_np.arrays(st_np.scalar_dtypes(), shape=st.integers(0, int(1e4))))
    def test_variants_different_dtypes(variant_str, arr):
        variant = getattr(cramjam, variant_str)
        compressed = variant.compress(arr)
        decompressed = variant.decompress(compressed)
        assert same_same(bytes(decompressed), arr.tobytes())
    
        # And compress n dims > 1
        if arr.shape[0] % 2 == 0:
            arr = arr.reshape((2, -1))
>           compressed = variant.compress(arr)
E           TypeError: argument 'data': failed to extract enum BytesType ('Buffer | File | pybuffer')
E           - variant RustyBuffer (Buffer): TypeError: failed to extract field BytesType::RustyBuffer.0, caused by TypeError: 'ndarray' object cannot be converted to 'Buffer'
E           - variant RustyFile (File): TypeError: failed to extract field BytesType::RustyFile.0, caused by TypeError: 'ndarray' object cannot be converted to 'File'
E           - variant PyBuffer (pybuffer): TypeError: failed to extract field BytesType::PyBuffer.0, caused by BufferError: Buffer is not C contiguous
E           Falsifying example: test_variants_different_dtypes(
E               variant_str='snappy',
E               arr=array([], dtype=bool),
E           )

tests/test_variants.py:52: TypeError

They all look quite serious. This is with 2b90ebb.

To reproduce, using pypy3.10 venv:

pip install . pytest pytest-xdist hypothesis numpy
python -m pytest -n auto tests

Full test log (1.3M): test.txt

failure of compress_into

data = b"oh what a beautiful morning, oh what a beautiful day!!" * 5000000   # 270000000 bytes
x = np.zeros(270000000, dtype='uint8') # plenty of space
size = cramjam.gzip.compress_into(data, x)
cramjam.gzip.decompress(x.tobytes()[:size])

gives DecompressionError: corrupt deflate stream. The value of size, 15787, is much smaller than the 785746 bytes produced by the one-shot compression function(s).

Cramjam wrong architecture error when installing on M1 with conda (mambaforge)

I'm working on Apple M1 with a mambaforge deployment of conda.
When trying to save a dataframe to parquet using fastparquet I'm getting the following error:

ImportError: dlopen(/Users/rpelgrim/mambaforge/envs/cramjam/lib/python3.9/site-packages/cramjam.cpython-39-darwin.so, 2): no suitable image found.  Did find:
	/Users/rpelgrim/mambaforge/envs/cramjam/lib/python3.9/site-packages/cramjam.cpython-39-darwin.so: mach-o, but wrong architecture
	/Users/rpelgrim/mambaforge/envs/cramjam/lib/python3.9/site-packages/cramjam.cpython-39-darwin.so: mach-o, but wrong architecture

Seems like conda is giving me the wrong version of cramjam?

MRE:

conda create -n cramjam
conda install cramjam ipython
ipython
import cramjam

ImportError: dlopen(/Users/rpelgrim/mambaforge/envs/cramjam/lib/python3.9/site-packages/cramjam.cpython-39-darwin.so, 2): no suitable image found.  Did find:
	/Users/rpelgrim/mambaforge/envs/cramjam/lib/python3.9/site-packages/cramjam.cpython-39-darwin.so: mach-o, but wrong architecture
	/Users/rpelgrim/mambaforge/envs/cramjam/lib/python3.9/site-packages/cramjam.cpython-39-darwin.so: mach-o, but wrong architecture

Unused `Cargo.lock` files?

I've noticed that in addition to the top-level Cargo.lock file, there are Cargo.lock files in individual directories. However, from what I can see cargo only uses the file from the workspace root, i.e. the top-level file, and the other files are unused. Am I missing something, or can they be removed?

Publish Python 3.12 wheel?

I'm hoping to be able to use this library rather than python-snappy as this seems easier to install (whereas python-snappy needs system installs). But currently when trying to install on Python 3.12 it tries to install the tar.gz because no wheel exists. Would it be possible to create a Python 3.12 wheel on the next release?

Thanks!

Add arm64/universal wheels for Python 3.11 under MacOS

Trying to install

  • cramjam 2.6.2
  • with pip 22.3.1
  • on MacOS 13.0.1 (Ventura)
  • In a Python 3.11.0 environment

It finds no wheels, attempts to build from source, and fails because I don't have the Rust toolchain installed:

pip install cramjam
Collecting cramjam
  Using cached cramjam-2.6.2.tar.gz (1.1 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Preparing metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]

      Cargo, the Rust package manager, is not installed or is not on PATH.
      This package requires Rust and Cargo to compile extensions. Install it through
      the system's package manager or via https://rustup.rs/

      Checking for Rust toolchain....
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Looking at the list of files on PyPI, there are macOS universal binaries for Python 3.8, 3.9, and 3.10, but not for 3.11.

bytearray decompress failure in snappy

In [45]: data = b"oh what a beautiful morning, oh what a beautiful day!!" * 5000000

In [47]: out = cramjam.snappy.compress(data)

In [48]: bout = bytearray(out)

In [49]: assert cramjam.snappy.decompress(out) == data

In [50]: assert cramjam.snappy.decompress(bout) == data
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-50-a424b56cb9e7> in <module>
----> 1 assert cramjam.snappy.decompress(bout) == data

AssertionError:

Decompressing the bytearray returns only the first 895 bytes of the expected output.

Feature request: add xxhash for use with LZ4

The C implementation of LZ4 includes xxhash.h, and I guess that's why LZ4-compressed buffers sometimes use xxhash as a checksum on the contents. In Python, we get this through two libraries: it used to be lz4 and xxhash, but now the first library is cramjam.

It would be great if cramjam could include xxhash and we'd use just one library, especially since cramjam works in Pyodide and the xxhash library doesn't.
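For reference, the current two-library status quo looks something like this (a sketch; the separate xxhash package provides the digest):

import cramjam
import xxhash  # the second library this request would fold into cramjam

data = b"some bytes here"
compressed = cramjam.lz4.compress(data)
checksum = xxhash.xxh32(data).intdigest()  # checksum over the uncompressed contents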


TypeError: 'Buffer' does not support the buffer interface

When using pypy 3.8 I get the error mentioned in the title. This doesn't seem to happen on pypy 3.9. Do you know what might be going on?

To reproduce, save the following contents as compress.py:

from io import BytesIO
from cramjam import snappy

compressed = snappy.compress_raw(b"123")
BytesIO(compressed)

And then you can use the following Dockerfile:

FROM pypy:3.8-bullseye

RUN pip install cramjam
COPY compress.py /

RUN pypy compress.py

Note: change the Dockerfile to use pypy:3.9-bullseye to see that it works on 3.9.

collaborate on ak.str

@milesgranger - this seems like the best way for me to get in touch, and I don't mind if this is public.

I have been involved in the awkward-array project, which brings numpy-like, vectorised processing to variable-length and nested data schemas, i.e., deep parquet or array-like things. This includes numba-compiled functions and GPU ops.

The library was designed for high-energy physics, i.e., numerical work. However, we are building out dask-awkward and want to promote it to a much wider audience, since there's nothing else in the Python realm that does this kind of work. One major missing piece is (utf8) string handling: all the Python str methods, or pandas' .str accessor methods. UTF-8 handling in C/C++ exists but is non-standard, whereas it is native in Rust. ...so I am thinking that an external library could exist for string operations on awkward arrays. These arrays are just uint8 numpy arrays/buffers and int32/64 offsets. The point would be to pass buffers around without copy and just rely on Rust for the string ops. I don't really know if this is a wise idea!

I am writing here to see whether you might be interested in applying your python-rust buffer passing knowhow to the problem.

macos wheels aren't built for the macos versions they advertise

The macOS wheels on PyPI indicate that they are for macOS >= 10.7 and are installed as such, but they are actually built for macOS 11.0.

Importing the module on macOS 10.13.6 results in:

ImportError: dlopen(<snip>/.venv/lib/python3.9/site-packages/cramjam/cramjam.cpython-39-darwin.so, 2): Symbol not found: ____chkstk_darwin
  Referenced from: <snip>/.venv/lib/python3.9/site-packages/cramjam/cramjam.cpython-39-darwin.so (which was built for Mac OS X 11.0)
  Expected in: /usr/lib/libSystem.B.dylib

Find blocks?

If you were to seek to some arbitrary location in compressed data and attempt to start decompressing, you would fail. However, all of the algorithms have some level of block-wise operation.

Is it possible, with the dependency libraries used here, to find byte offsets in the original compressed data at which decompression can start? I could imagine doing this in brute-force fashion: try each byte offset and, if some decompression does happen, check whether the output is contained in the decompressed output of the whole.
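A brute-force sketch of that idea (illustrative only: it attempts a decompression at every offset and assumes the full decompressed output is available for comparison):

import cramjam

def find_start_offsets(compressed: bytes, full_output: bytes, variant=cramjam.snappy):
    # Try decompressing from every byte offset; keep the offsets whose
    # output appears somewhere in the full decompressed data.
    offsets = []
    for i in range(len(compressed)):
        try:
            out = bytes(variant.decompress(compressed[i:]))
        except Exception:  # invalid streams raise (e.g. DecompressionError)
            continue
        if out and out in full_output:
            offsets.append(i)
    return offsets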
