milesgranger / cramjam
Your go-to for easy access to a plethora of compression algorithms, all neatly bundled in one simple installation.
License: MIT License
I may have asked this before....
numcodecs from zarr may be interested in depending on cramjam to simplify its build process ( zarr-developers/numcodecs#464 ). The default and most common codec used by zarr for new datasets is blosc v1. Blosc v2 also exists and is able to read v1. Is there any interest in adding blosc (v1 or v2) to cramjam? There seem to be quite a few crates in the area.
Presently only 2.8.0 is available on conda; when you have time, could you please update it?
This is blocking conda-forge/uproot-feedstock#144 since the dependency doesn't yet exist.
Looks like there are pypy wheels for linux and macos, but not windows. Are there plans to release pypy wheels for windows?
It was my expectation that numpy input/output would be optional, and that things would also work fine if the user only wanted to make use of bytes/bytearray/Buffer/File objects. This, however, does not appear to be the case.
Example with an environment without numpy installed
>>> import cramjam
>>> compressed = cramjam.snappy.compress(b'bytes')
>>> out = cramjam.snappy.decompress(compressed)
thread '<unnamed>' panicked at 'Failed to import numpy module', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/numpy-0.13.1/src/npyffi/mod.rs:16:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
pyo3_runtime.PanicException: Failed to import numpy module
>>>
Not sure, but hoping there is a way to not force a numpy install if the user doesn't want it.
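A common way to keep numpy strictly optional is to guard the import and only special-case arrays when the module is importable. A minimal stdlib sketch (the to_bytes helper is hypothetical, not cramjam's API):

```python
# Sketch of the usual "optional numpy" pattern: works with or without
# numpy installed, and only touches ndarray when numpy is importable.
try:
    import numpy as np
except ImportError:
    np = None

def to_bytes(obj):
    # Accept numpy arrays only when numpy is available; everything else
    # goes through the normal buffer-protocol path.
    if np is not None and isinstance(obj, np.ndarray):
        return obj.tobytes()
    return bytes(obj)
```

With this pattern, `to_bytes(b"abc")` works in an environment without numpy, avoiding the panic shown above.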
In the gzip decompression docstring:

```python
Gzip decompression.

Python Example
--------------

>>> cramjam.gzip.decompress(compressed_bytes, output_len=Optional[int])
```

it turns out that `cramjam.gzip.decompress(compressed_bytes)` is OK and `cramjam.gzip.decompress(compressed_bytes, length)` is OK, but you cannot explicitly give `output_len=` as a keyword argument. This is weird behaviour on pyo3's part, but the doc should show calls that actually work.
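The reported behaviour can be reproduced with plain Python positional-only parameters (the `/` marker), which is effectively the signature pyo3 generated here. A minimal sketch with a hypothetical stand-in for decompress:

```python
# Stdlib sketch (not cramjam's actual code): '/' makes both parameters
# positional-only, matching the behaviour reported above.
def decompress(data, output_len=None, /):
    return (data, output_len)

try:
    decompress(b"x", output_len=3)   # keyword form
    keyword_rejected = False
except TypeError:
    keyword_rejected = True          # rejected, just like cramjam.gzip.decompress
```

Calling `decompress(b"x", 3)` positionally works; the keyword form raises TypeError.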
Right now we just use docs.rs, but that's pretty Rust-specific; maybe use pdoc or something?
They aren't accounted for in the regular log cleanup routine in the maintenance stack.
blosc2 was first added in v2.8.4-rc1 in the experimental module cramjam.experimental.blosc2
This is the tracking issue to either move it out of experimental or remove it.
A rarer compression algorithm for parquet is LZO. This library seems to do it: https://badboy.github.io/minilzo-rs/minilzo/index.html
As of 2.7.0rc1, cramjam seems to be incompatible with memoryview and PickleBuffer objects.
This is a blocker to the adoption in dask/distributed.
>>> cramjam.bzip2.compress(memoryview(b"123"))
TypeError: argument 'data': failed to extract enum BytesType ('bytes | bytearray | File | Buffer | numpy')
- variant Bytes (bytes): TypeError: failed to extract field BytesType::Bytes.0, caused by TypeError: 'memoryview' object cannot be converted to 'PyBytes'
- variant ByteArray (bytearray): TypeError: failed to extract field BytesType::ByteArray.0, caused by TypeError: 'memoryview' object cannot be converted to 'PyByteArray'
- variant RustyFile (File): TypeError: failed to extract field BytesType::RustyFile.0, caused by TypeError: 'memoryview' object cannot be converted to 'File'
- variant RustyBuffer (Buffer): TypeError: failed to extract field BytesType::RustyBuffer.0, caused by TypeError: 'memoryview' object cannot be converted to 'Buffer'
- variant NumpyArray (numpy): TypeError: failed to extract field BytesType::NumpyArray.0, caused by TypeError: 'memoryview' object cannot be converted to 'PyArray<T, D>'
>>> import pickle, numpy
>>> a = numpy.ones(10)
>>> buffers = []
>>> pickle.dumps(a, buffer_callback=buffers.append)
>>> buffers
[<pickle.PickleBuffer at 0x7ff1f5b28540>]
>>> cramjam.bzip2.compress(buffers[0])
TypeError: argument 'data': failed to extract enum BytesType ('bytes | bytearray | File | Buffer | numpy')
- variant Bytes (bytes): TypeError: failed to extract field BytesType::Bytes.0, caused by TypeError: 'PickleBuffer' object cannot be converted to 'PyBytes'
- variant ByteArray (bytearray): TypeError: failed to extract field BytesType::ByteArray.0, caused by TypeError: 'PickleBuffer' object cannot be converted to 'PyByteArray'
- variant RustyFile (File): TypeError: failed to extract field BytesType::RustyFile.0, caused by TypeError: 'PickleBuffer' object cannot be converted to 'File'
- variant RustyBuffer (Buffer): TypeError: failed to extract field BytesType::RustyBuffer.0, caused by TypeError: 'PickleBuffer' object cannot be converted to 'Buffer'
- variant NumpyArray (numpy): TypeError: failed to extract field BytesType::NumpyArray.0, caused by TypeError: 'PickleBuffer' object cannot be converted to 'PyArray<T, D>'
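For context, PickleBuffer is the out-of-band buffer type introduced by pickle protocol 5; it wraps any buffer-protocol object without copying, which is exactly what dask/distributed hands to a compressor. A stdlib-only illustration:

```python
import pickle

# PickleBuffer wraps a buffer-protocol object with zero copy, and itself
# exposes the buffer protocol (so memoryview works on it).
data = bytearray(b"x" * 100)
buf = pickle.PickleBuffer(data)
view = memoryview(buf)

# With protocol 5, PickleBuffer objects are serialized out-of-band:
# buffer_callback receives them instead of embedding the bytes in-band.
buffers = []
payload = pickle.dumps(buf, protocol=5, buffer_callback=buffers.append)
```

These PickleBuffer objects are what a compressor would need to accept for zero-copy de/compression.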
Line 22 in 91a329f
They do re-export, not sure how I got to that conclusion.
https://github.com/gyscos/zstd-rs/blob/ed106b224b2d386a95dd54c7feb7b558ea2745d0/src/lib.rs#L25
I'm hoping to be able to use this library rather than python-snappy, as this seems easier to install (whereas python-snappy needs system installs). But currently, when trying to install on Python 3.12, it tries to install the tar.gz because no wheel exists. Would it be possible to create a Python 3.12 wheel on the next release?
Thanks!
This seems like a very useful package, if it is small, fast and compliant!
I'm sure users would appreciate installing pre-compiled versions from conda.
https://conda-forge.org/docs/maintainer/adding_pkgs.html
When running the Python test suite, I'm frequently getting health check failures on different tests. For example:
$ python -m pytest tests/test_variants.py::test_variants_different_dtypes
========================================================= test session starts =========================================================
platform linux -- Python 3.11.8, pytest-8.0.2, pluggy-1.4.0
rootdir: /tmp/cramjam/cramjam-python
plugins: hypothesis-6.98.13, xdist-3.5.0
collected 8 items
tests/test_variants.py FF...... [100%]
============================================================== FAILURES ===============================================================
_______________________________________________ test_variants_different_dtypes[snappy] ________________________________________________
variant_str = 'snappy'
@pytest.mark.parametrize("variant_str", VARIANTS)
> @given(arr=st_np.arrays(st_np.scalar_dtypes(), shape=st.integers(0, int(1e4))))
E hypothesis.errors.FailedHealthCheck: Examples routinely exceeded the max allowable size. (20 examples overran while generating 8 valid ones). Generating examples this large will usually lead to bad results. You could try setting max_size parameters on your collections and turning max_leaves down on recursive() calls.
E See https://hypothesis.readthedocs.io/en/latest/healthchecks.html for more information about this. If you want to disable just this health check, add HealthCheck.data_too_large to the suppress_health_check settings for this test.
tests/test_variants.py:42: FailedHealthCheck
------------------------------------------------------------- Hypothesis --------------------------------------------------------------
You can add @seed(297719150791330741877614251129208577971) to this test or run pytest with --hypothesis-seed=297719150791330741877614251129208577971 to reproduce this failure.
_______________________________________________ test_variants_different_dtypes[brotli] ________________________________________________
variant_str = 'brotli'
@pytest.mark.parametrize("variant_str", VARIANTS)
> @given(arr=st_np.arrays(st_np.scalar_dtypes(), shape=st.integers(0, int(1e4))))
E hypothesis.errors.FailedHealthCheck: Examples routinely exceeded the max allowable size. (20 examples overran while generating 9 valid ones). Generating examples this large will usually lead to bad results. You could try setting max_size parameters on your collections and turning max_leaves down on recursive() calls.
E See https://hypothesis.readthedocs.io/en/latest/healthchecks.html for more information about this. If you want to disable just this health check, add HealthCheck.data_too_large to the suppress_health_check settings for this test.
tests/test_variants.py:42: FailedHealthCheck
------------------------------------------------------------- Hypothesis --------------------------------------------------------------
You can add @seed(151523981063034797703438801667290859669) to this test or run pytest with --hypothesis-seed=151523981063034797703438801667290859669 to reproduce this failure.
======================================================= short test summary info =======================================================
FAILED tests/test_variants.py::test_variants_different_dtypes[snappy] - hypothesis.errors.FailedHealthCheck: Examples routinely exceeded the max allowable size. (20 examples overran while generating 8 v...
FAILED tests/test_variants.py::test_variants_different_dtypes[brotli] - hypothesis.errors.FailedHealthCheck: Examples routinely exceeded the max allowable size. (20 examples overran while generating 9 v...
=============================================== 2 failed, 6 passed in 153.35s (0:02:33) ===============================================
I can reproduce with 496c1ab.
My reproducer:
pip install . pytest pytest-xdist hypothesis numpy
python -m pytest tests/test_variants.py::test_variants_different_dtypes
Already done for gzip in #22.
#22 allows support for both PyBytes and PyByteArray; between the variants there is a lot of repeated/very similar code. Let's fix that.
This should work: intake/python-snappy#130 (comment)
When using pypy 3.8 I get the error mentioned in the title. This doesn't seem to happen on pypy 3.9. Do you know what might be going on?
To reproduce, save the following contents as compress.py:
from io import BytesIO
from cramjam import snappy
compressed = snappy.compress_raw(b"123")
BytesIO(compressed)
And then you can use the following Dockerfile:
FROM pypy:3.8-bullseye
RUN pip install cramjam
COPY compress.py /
RUN pypy compress.py
Note: Change the Dockerfile to use pypy:3.9-bullseye to see that it works in 3.9.
cramjam.lz4.Compressor's docstring is "Snappy Compressor object for streaming compression".
I found this because I wanted to know whether the stream compressor is for the simple or block variant - both should be available, no? The output appears closer to the block version, but not identical.
Same comment for snappy, where I find that Compressor is for the framed format, with no raw variant.
In [1]: import cramjam
In [2]: cramjam.snappy_decompress(b'abc')
Out[2]: b''
In [3]: cramjam.snappy_decompress(b'abcdefgh')
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { kind: Other, error: StreamHeader { byte: 97 } }', src/snappy.rs:21:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
[1] 74251 abort py -m IPython
I would expect it to raise a Python exception instead of aborting the process.
de/compress directly into a Python buffer
Hi there,
I'm looking at using fastparquet, which seems to have a dependency on cramjam. I'm also using liccheck to check that any Python packages we use are compatible with our license needs. It looks like this package is licensed under MIT; however, because this information is not included in the PyPI package metadata, liccheck is flagging cramjam as "unknown license" (see the META section of cramjam compared with e.g. fastparquet).
I'm not familiar with maturin (which I think is used to build this package), but possibly you need to add something like https://www.maturin.rs/metadata.html#python-project-metadata or https://stackoverflow.com/a/73274312/5179470 ?
Many thanks for any help, and for this great package! :)
Something I've stumbled upon packaging this for nixpkgs NixOS/nixpkgs#124862.
Here is the build log https://gist.github.com/veprbl/345d882b01923f31fcfc75c1238d305b
Fastparquet's round trip tests seem to pass just fine. Perhaps they don't use this code path?
In [45]: data = b"oh what a beautiful morning, oh what a beautiful day!!" * 5000000
In [47]: out = cramjam.snappy.compress(data)
In [48]: bout = bytearray(out)
In [49]: assert cramjam.snappy.decompress(out) == data
In [50]: assert cramjam.snappy.decompress(bout) == data
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-50-a424b56cb9e7> in <module>
----> 1 assert cramjam.snappy.decompress(bout) == data
AssertionError:
Decompressing the bytearray returns only the first 895 bytes of the expected output.
Environment: x86_64 architecture.
I originally saw this while working on a python-cramjam package for Fedora Linux, but I’m able to reproduce it in a simple virtual environment.
To reproduce:
Check out current master, a1c0c02, and cd to the cramjam-python/ directory.
rm -rf _e && python3 -m build && python3 -m venv _e && . _e/bin/activate && pip install ./dist/cramjam-2.7.0-cp312-cp312-linux_x86_64.whl && pip install numpy pytest pytest-xdist hypothesis && python3 -m pytest -v -n 16 tests/ && deactivate
Sometimes, all tests pass:
================================================ 564 passed in 34.75s ================================================
…but if I run the command repeatedly, I often see this:
====================================================== FAILURES ======================================================
_______________________________________ test_variants_different_dtypes[brotli] _______________________________________
[gw1] linux -- Python 3.12.1 /home/ben/src/forks/cramjam/cramjam-python/_e/bin/python3
variant_str = 'brotli'
@pytest.mark.parametrize("variant_str", VARIANTS)
> @given(arr=st_np.arrays(st_np.scalar_dtypes(), shape=st.integers(0, int(1e5))))
tests/test_variants.py:35:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
args = ('brotli', array([0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j])), kwargs = {}
arg_drawtime = 0.0020151169737800956, initial_draws = 1, start = 463383.986474383, result = None
finish = 463385.225831358, internal_draw_time = 0, runtime = datetime.timedelta(seconds=1, microseconds=239357)
current_deadline = timedelta(milliseconds=1000)
@proxies(self.test)
def test(*args, **kwargs):
arg_drawtime = sum(data.draw_times)
initial_draws = len(data.draw_times)
start = time.perf_counter()
try:
result = self.test(*args, **kwargs)
finally:
finish = time.perf_counter()
internal_draw_time = sum(data.draw_times[initial_draws:])
runtime = datetime.timedelta(
seconds=finish - start - internal_draw_time
)
self._timing_features = {
"time_running_test": finish - start - internal_draw_time,
"time_drawing_args": arg_drawtime,
"time_interactive_draws": internal_draw_time,
}
current_deadline = self.settings.deadline
if not is_final:
current_deadline = (current_deadline // 4) * 5
if runtime >= current_deadline:
> raise DeadlineExceeded(runtime, self.settings.deadline)
E hypothesis.errors.DeadlineExceeded: Test took 1239.36ms, which exceeds the deadline of 1000.00ms
E Falsifying example: test_variants_different_dtypes(
E variant_str='brotli',
E arr=array([0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]),
E )
_e/lib64/python3.12/site-packages/hypothesis/core.py:845: DeadlineExceeded
----------------------------------------------------- Hypothesis -----------------------------------------------------
WARNING: Hypothesis has spent more than five minutes working to shrink a failing example, and stopped because it is making very slow progress. When you re-run your tests, shrinking will resume and may take this long before aborting again.
PLEASE REPORT THIS if you can provide a reproducing example, so that we can improve shrinking performance for everyone.
============================================== short test summary info ===============================================
FAILED tests/test_variants.py::test_variants_different_dtypes[brotli] - hypothesis.errors.DeadlineExceeded: Test took 1239.36ms, which exceeds the deadline of 1000.00ms
===================================== 1 failed, 563 passed in 335.91s (0:05:35) ======================================
… or this:
====================================================== FAILURES ======================================================
_______________________________________ test_variants_different_dtypes[brotli] _______________________________________
[gw1] linux -- Python 3.12.1 /home/ben/src/forks/cramjam/cramjam-python/_e/bin/python3
variant_str = 'brotli'
@pytest.mark.parametrize("variant_str", VARIANTS)
> @given(arr=st_np.arrays(st_np.scalar_dtypes(), shape=st.integers(0, int(1e5))))
tests/test_variants.py:35:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
args = ('brotli', array([0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j])), kwargs = {}
arg_drawtime = 0.0022371799568645656, initial_draws = 1, start = 464013.190208792, result = None
finish = 464014.414762949, internal_draw_time = 0, runtime = datetime.timedelta(seconds=1, microseconds=224554)
current_deadline = timedelta(milliseconds=1000)
@proxies(self.test)
def test(*args, **kwargs):
arg_drawtime = sum(data.draw_times)
initial_draws = len(data.draw_times)
start = time.perf_counter()
try:
result = self.test(*args, **kwargs)
finally:
finish = time.perf_counter()
internal_draw_time = sum(data.draw_times[initial_draws:])
runtime = datetime.timedelta(
seconds=finish - start - internal_draw_time
)
self._timing_features = {
"time_running_test": finish - start - internal_draw_time,
"time_drawing_args": arg_drawtime,
"time_interactive_draws": internal_draw_time,
}
current_deadline = self.settings.deadline
if not is_final:
current_deadline = (current_deadline // 4) * 5
if runtime >= current_deadline:
> raise DeadlineExceeded(runtime, self.settings.deadline)
E hypothesis.errors.DeadlineExceeded: Test took 1224.55ms, which exceeds the deadline of 1000.00ms
E Falsifying example: test_variants_different_dtypes(
E variant_str='brotli',
E arr=array([0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]),
E )
_e/lib64/python3.12/site-packages/hypothesis/core.py:845: DeadlineExceeded
============================================== short test summary info ===============================================
FAILED tests/test_variants.py::test_variants_different_dtypes[brotli] - hypothesis.errors.DeadlineExceeded: Test took 1224.55ms, which exceeds the deadline of 1000.00ms
=========================================== 1 failed, 563 passed in 31.16s ===========================================
In my testing, it seems like increasing the deadline, e.g.
diff --git a/cramjam-python/tests/test_variants.py b/cramjam-python/tests/test_variants.py
index 4ee4ca3..97e287a 100644
--- a/cramjam-python/tests/test_variants.py
+++ b/cramjam-python/tests/test_variants.py
@@ -12,7 +12,7 @@ VARIANTS = ("snappy", "brotli", "bzip2", "lz4", "gzip", "deflate", "zstd")
# Some OS can be slow or have higher variability in their runtimes on CI
settings.register_profile(
- "local", deadline=timedelta(milliseconds=1000), max_examples=100
+ "local", deadline=timedelta(milliseconds=10000), max_examples=100
)
settings.register_profile("CI", deadline=None, max_examples=25)
if os.getenv("CI"):
is enough to resolve the problem. Note that I am testing on a fairly fast workstation (AMD Ryzen 9 5950X); I haven’t yet tried this on slower CI machines, particularly those of other architectures like ppc64le.
Trying to install, pip finds no wheels, attempts to build from source, and fails because I don't have the Rust toolchain installed:
pip install cramjam
Collecting cramjam
Using cached cramjam-2.6.2.tar.gz (1.1 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... error
error: subprocess-exited-with-error
× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [6 lines of output]
Cargo, the Rust package manager, is not installed or is not on PATH.
This package requires Rust and Cargo to compile extensions. Install it through
the system's package manager or via https://rustup.rs/
Checking for Rust toolchain....
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Looking at the list of files on PyPI, it looks like there are macOS universal binaries for Python 3.8, 3.9, and 3.10, but not for 3.11.
data = b"oh what a beautiful morning, oh what a beautiful day!!" * 5000000 # 270000000 bytes
x = np.zeros(270000000, dtype='uint8') # plenty of space
size = cramjam.gzip.compress_into(data, x)
cramjam.gzip.decompress(x.tobytes()[:size])
gives DecompressionError: corrupt deflate stream. The value of size, 15787, is much smaller than the size you get with the compression function(s), 785746.
Snappy framed compress allows output_len and has the _into variants, but snappy_raw does not. Would be nice!
btw: am I right in thinking
I’m almost at the point of introducing a python-cramjam package to Fedora Linux. I’m currently updating my proposed package from 2.8.0 to 2.8.1.
I am able to use bundled or vendored dependencies when I need to under our packaging guidelines, but it’s preferred to build against system copies. I have already introduced a rust-libcramjam package, and I can build a python-cramjam-2.8.0 package using it (in lieu of the copy included in the GitHub archive or PyPI sdist).
With 2.8.1, though, the included libcramjam is updated to 0.2.0, which hasn’t yet been released on crates.io. I can temporarily switch to using the bundled libcramjam again, but it would make life easier if the version used by the Python extension were reliably available on crates.io. (Rust library crates that are published on crates.io are required to be packaged from crates.io sources, so I can’t just package a snapshot from GitHub as rust-libcramjam.)
What do you think? Does this sound reasonable?
I've noticed that in addition to the top-level Cargo.lock file, there are Cargo.lock files in individual directories. However, from what I can see, cargo only uses the file from the workspace root, i.e. the top-level file, and the other files are unused. Am I missing something, or can they be removed?
All other compression libraries don't have this caveat and are happy to ingest any PickleBuffer; could you fix it? (no rush)
Originally posted by @crusaderky in #99 (comment)
Hi!
Is it planned or already in the works to have lzma support in cramjam?
A bunch of high energy particle physicists would be rather grateful for that capability in this package.
If you were to seek to some arbitrary location in compressed data, and attempt to start decompressing, you would fail. However, all of the algorithms have some level of block-wise operation.
Is it possible, with the dependent libraries here, to find byte offsets in the original compressed version of some data at which decompression can start? I could imagine doing this in a brute-force fashion: try at each byte offset and, if some decompression does happen, see if the output is contained in the decompressed output of the whole.
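The brute-force idea above can be sketched with stdlib zlib for raw-deflate data: try to start decompression at every byte offset and keep offsets whose probe output appears in the full decompressed stream. The helper name is illustrative, not part of any library:

```python
import zlib

def candidate_offsets(blob, whole, probe=64):
    # Try starting a raw-deflate decompression at each byte offset; keep
    # offsets whose probe output is a substring of the full decompressed
    # data, filtering out spurious "successes".
    hits = []
    for i in range(len(blob)):
        d = zlib.decompressobj(wbits=-15)  # raw deflate, no header
        try:
            out = d.decompress(blob[i:], probe)
        except zlib.error:
            continue
        if out and out in whole:
            hits.append(i)
    return hits

# The true stream start (offset 0) is always among the candidates:
data = b"the quick brown fox jumps over the lazy dog " * 50
comp = zlib.compressobj(9, zlib.DEFLATED, -15)
blob = comp.compress(data) + comp.flush()
offsets = candidate_offsets(blob, data)
```

This is quadratic in the worst case and will report false positives that happen to decode to substrings, so it is only a feasibility probe, not a practical index.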
https://pypi.org/project/cramjam/#files
It should only contain 3 abi3 (macOS, Windows, Linux) wheels and a source distribution, for example: https://pypi.org/project/rjieba/#files
See this discussion here:
https://stackoverflow.com/questions/77347182/how-to-tell-why-an-unknown-feature-feature-is-needed
I think it would be convenient to update the locked version to 1.0.60. This helps with downstream code.
When running the test suite using PyPy3.10 7.3.15 release, I'm getting lots of test failures. For example:
______________________________ test_obj_api[File] ______________________________
[gw0] linux -- Python 3.10.13 /tmp/cramjam/cramjam-python/.venv/bin/python
tmpdir = local('/tmp/pytest-of-mgorny/pytest-3/popen-gw0/test_obj_api_File_0')
Obj = <class 'File'>
@pytest.mark.parametrize("Obj", (File, Buffer))
def test_obj_api(tmpdir, Obj):
if isinstance(Obj, File):
buf = File(str(tmpdir.join("file.txt")))
else:
buf = Buffer()
assert buf.write(b"bytes") == 5
assert buf.tell() == 5
assert buf.seek(0) == 0
assert buf.read() == b"bytes"
assert buf.seek(-1, 2) == 4 # set one byte backwards from end; position 4
assert buf.read() == b"s"
assert buf.seek(-2, whence=1) == 3 # set two bytes from current (end): position 3
assert buf.read() == b"es"
with pytest.raises(ValueError):
buf.seek(1, 3) # only 0, 1, 2 are valid seek from positions
for out in (
b"12345",
bytearray(b"12345"),
File(str(tmpdir.join("test.txt"))),
Buffer(),
):
buf.seek(0)
expected = b"bytes"
buf.readinto(out)
# Will update the output buffer
if isinstance(out, (File, Buffer)):
out.seek(0)
assert out.read() == expected
elif isinstance(out, bytearray):
assert out == bytearray(expected)
else:
> assert out == expected
E AssertionError: assert b'12345' == b'bytes'
E
E At index 0 diff: b'1' != b'b'
E Use -v to get more diff
tests/test_rust_io.py:44: AssertionError
____________________ test_variants_different_dtypes[snappy] ____________________
[gw0] linux -- Python 3.10.13 /tmp/cramjam/cramjam-python/.venv/bin/python
variant_str = 'snappy'
@pytest.mark.parametrize("variant_str", VARIANTS)
> @given(arr=st_np.arrays(st_np.scalar_dtypes(), shape=st.integers(0, int(1e4))))
tests/test_variants.py:42:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
variant_str = 'snappy', arr = array([], shape=(2, 0), dtype=bool)
@pytest.mark.parametrize("variant_str", VARIANTS)
@given(arr=st_np.arrays(st_np.scalar_dtypes(), shape=st.integers(0, int(1e4))))
def test_variants_different_dtypes(variant_str, arr):
variant = getattr(cramjam, variant_str)
compressed = variant.compress(arr)
decompressed = variant.decompress(compressed)
assert same_same(bytes(decompressed), arr.tobytes())
# And compress n dims > 1
if arr.shape[0] % 2 == 0:
arr = arr.reshape((2, -1))
> compressed = variant.compress(arr)
E TypeError: argument 'data': failed to extract enum BytesType ('Buffer | File | pybuffer')
E - variant RustyBuffer (Buffer): TypeError: failed to extract field BytesType::RustyBuffer.0, caused by TypeError: 'ndarray' object cannot be converted to 'Buffer'
E - variant RustyFile (File): TypeError: failed to extract field BytesType::RustyFile.0, caused by TypeError: 'ndarray' object cannot be converted to 'File'
E - variant PyBuffer (pybuffer): TypeError: failed to extract field BytesType::PyBuffer.0, caused by BufferError: Buffer is not C contiguous
E Falsifying example: test_variants_different_dtypes(
E variant_str='snappy',
E arr=array([], dtype=bool),
E )
tests/test_variants.py:52: TypeError
They all look quite serious. This is with 2b90ebb.
To reproduce, using pypy3.10 venv:
pip install . pytest pytest-xdist hypothesis numpy
python -m pytest -n auto tests
Full test log (1.3M): test.txt
When trying to package pyrus-cramjam for openSUSE Tumbleweed, I get this:
info:obs-service-cargo_audit: Running OBS Source Service : obs-service-cargo_audit
ERROR:obs-service-cargo_audit: possible vulnerabilties: 1
ERROR:obs-service-cargo_audit: /tmp/tmptxa26w30/pyrus-cramjam/Cargo.lock
ERROR:obs-service-cargo_audit: For more information you SHOULD inspect the output of cargo audit manually
ERROR:obs-service-cargo_audit: * RUSTSEC-2021-0131 -> crate: brotli-sys, cvss: None, class: ['memory-corruption']
ERROR:obs-service-cargo_audit: ⚠️ Vulnerabilities may have been found. You must review these.
Aborting: service call failed: /usr/lib/obs/service/cargo_audit --srcdir pyrus-cramjam --outdir /home/ben/src/osc/home:bnavigator:branches:devel:languages:python:numeric/python-cramjam/tmpegree2c6.cargo_audit.service
obs-service-cargo_audit uses the local Cargo.lock file to determine if the related sources in a Rust application have known security vulnerabilities. If vulnerabilities are found, the source service will alert you, allowing you to update and to help upstream update their sources.
The cited security advisory is here: https://rustsec.org/advisories/RUSTSEC-2021-0131.html, bitemyapp/brotli2-rs#45
It seems that the length of the output is the last 4 bytes of the encoded input, as a u32.
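If this refers to the gzip container, the observation matches the format: gzip's trailer stores the uncompressed length modulo 2^32 as a little-endian u32 in the final 4 bytes (the ISIZE field). A stdlib check:

```python
import gzip
import struct

# gzip's trailer ends with ISIZE: the uncompressed length mod 2**32,
# encoded as a little-endian u32 in the final 4 bytes.
data = b"oh what a beautiful morning " * 1000
blob = gzip.compress(data)
(isize,) = struct.unpack("<I", blob[-4:])
```

Reading ISIZE lets a caller pre-size the output buffer without decompressing, though it is only reliable for single-member streams under 4 GiB.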
The macOS wheels on PyPI indicate that they are for macOS >=10.7 and they get installed as such, but they are actually built for macOS 11.0.
Importing the module on macOS 10.13.6 results in
ImportError: dlopen(<snip>/.venv/lib/python3.9/site-packages/cramjam/cramjam.cpython-39-darwin.so, 2): Symbol not found: ____chkstk_darwin
Referenced from: <snip>/.venv/lib/python3.9/site-packages/cramjam/cramjam.cpython-39-darwin.so (which was built for Mac OS X 11.0)
Expected in: /usr/lib/libSystem.B.dylib
@milesgranger - this seems like the best way for me to get in touch, and I don't mind if this is public.
I have been involved in the awkward-array project, which brings numpy-like and vectorised processing of variable-length and nested data schemas, i.e., deep parquet or array-like things. This includes numba-compiled functions and GPU ops.
The library was designed for high-energy physics, i.e., numerical work. However, we are building out dask-awkward and want to promote it to a much wider audience, since there's nothing else in the Python realm that does this kind of work. One major missing piece is (UTF-8) string handling, like all the Python str methods or pandas' .str accessor methods. UTF-8 handling in C/C++ exists but is non-standard, whereas it is native in Rust. So I am thinking that an external library could exist for string operations on awkward arrays. These arrays are just uint8 numpy arrays/buffers plus int32/64 offsets. The point would be to pass buffers around without copying and to rely on Rust for the string ops. I don't really know if this is a wise idea!
I am writing here to see whether you might be interested in applying your python-rust buffer passing knowhow to the problem.
When I call snappy.compress with an invalid argument cramjam displays this cryptic exception mentioning NumPy:
>>> from cramjam import snappy
>>> d = snappy.compress(b'b'*1024) # bytes works fine
>>> d
cramjam.Buffer(len=70)
>>> d = snappy.compress(1) # Passing invalid argument produces cryptic error:
thread '<unnamed>' panicked at 'Failed to import NumPy module', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/numpy-0.17.2/src/npyffi/mod.rs:22:9
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
pyo3_runtime.PanicException: Failed to import NumPy module
Env info:
$ python -m pip list
Package Version
---------- -------
cramjam 2.6.2
pip 22.3
setuptools 65.5.0
$ python
Python 3.11.0 (main, Apr 4 2023, 20:04:59) [GCC 8.4.1 20200928 (Red Hat 8.4.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Define one or more exception types to catch the ugliness that happens during errors, such as the example in #5.
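A minimal sketch of what such a hierarchy could look like (the names are illustrative, not necessarily cramjam's actual exception types):

```python
# Hypothetical exception hierarchy: a shared base class lets callers
# catch every library error with a single except clause, instead of
# pyo3's PanicException leaking through.
class CramjamError(Exception):
    """Base class for all errors raised by the bindings."""

class CompressionError(CramjamError):
    """Raised when compression fails."""

class DecompressionError(CramjamError):
    """Raised when decompression fails, e.g. on corrupt input."""
```

Callers can then write `except CramjamError:` to handle both compression and decompression failures uniformly.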
According to the flate2 crate's README, you can opt to compile with zlib-ng for better performance. Is it a good idea?
I'm working on Apple M1 with a mambaforge deployment of conda.
When trying to save a dataframe to parquet using fastparquet, I'm getting the following error:
ImportError: dlopen(/Users/rpelgrim/mambaforge/envs/cramjam/lib/python3.9/site-packages/cramjam.cpython-39-darwin.so, 2): no suitable image found. Did find:
/Users/rpelgrim/mambaforge/envs/cramjam/lib/python3.9/site-packages/cramjam.cpython-39-darwin.so: mach-o, but wrong architecture
/Users/rpelgrim/mambaforge/envs/cramjam/lib/python3.9/site-packages/cramjam.cpython-39-darwin.so: mach-o, but wrong architecture
Seems like conda is giving me the wrong version of cramjam?
MRE:
conda create -n cramjam
conda install cramjam ipython
ipython
import cramjam
ImportError: dlopen(/Users/rpelgrim/mambaforge/envs/cramjam/lib/python3.9/site-packages/cramjam.cpython-39-darwin.so, 2): no suitable image found. Did find:
/Users/rpelgrim/mambaforge/envs/cramjam/lib/python3.9/site-packages/cramjam.cpython-39-darwin.so: mach-o, but wrong architecture
/Users/rpelgrim/mambaforge/envs/cramjam/lib/python3.9/site-packages/cramjam.cpython-39-darwin.so: mach-o, but wrong architecture
The C implementation of LZ4 includes xxhash.h, and I guess that's why LZ4-compressed buffers sometimes use xxhash as a checksum on the contents. In Python, we get this through two libraries: it used to be lz4 and xxhash, but now the first library is cramjam.
It would be great if cramjam could include xxhash and we'd use just one library, especially since cramjam works in Pyodide and the xxhash library doesn't.
We assert that round-tripped data comes back identical to the original, but we have only checked during manual testing that de/compression output is compatible with the reference implementations; these should be their own tests.
ie
import gzip
import cramjam
data = b"some bytes"
assert gzip.decompress(cramjam.gzip.compress(data)) == data
assert cramjam.gzip.decompress(gzip.compress(data)) == data
How would I import this as a crate in another python-facing Rust package? I would like to use RustyBuffer as a zero-copy way of passing read()-able byte chunks to Python within rfsspec. Later, I would also use the (stream) de/compressors. Just naively adding cramjam to my Cargo.toml causes a big long list of compiler errors related to linker symbols.
Looks like https://github.com/milesgranger/pyrus-cramjam/blob/4b1c21a34195198d24ffdd9730815f4e5cb0d240/.github/workflows/CI.yml#L41 is evaluated false for '3.10', while the wheels are built for 3.9.
Ref: last release, showing 3.10 being skipped.
https://github.com/milesgranger/pyrus-cramjam/runs/4150238631?check_suite_focus=true
conda-forge/cramjam-feedstock#20 (comment)
Will try to fix later, as time permits unless anyone beats me to it.