tskit-dev / tszip
Gzip-like compression for tskit tree sequences
Home Page: https://tszip.readthedocs.io/
License: MIT License
I'm trying out tszip and I got the same error on my own simulation and on the example from the docs.
On macOS, I installed tszip 0.1.0 with pip:
import msprime
import tszip
ts = msprime.simulate(10, random_seed=1)
tszip.compress(ts, "simulation.trees.tsz")
I got the error:
Traceback (most recent call last):
File "src/data/test_tszip.py", line 5, in <module>
tszip.compress(ts, "simulation.trees.tsz")
File "/Users/agladsteinNew/.local/share/virtualenvs/cnn_classify_demography-zZ5GtBgD/lib/python3.7/site-packages/tszip/compression.py", line 97, in compress
compress_zarr(ts, root, variants_only=variants_only)
File "/Users/agladsteinNew/.local/share/virtualenvs/cnn_classify_demography-zZ5GtBgD/lib/python3.7/site-packages/tszip/compression.py", line 240, in compress_zarr
compressor = numcodecs.Blosc(cname='zstd', clevel=9, shuffle=numcodecs.Blosc.SHUFFLE)
AttributeError: module 'numcodecs' has no attribute 'Blosc'
I'm guessing it's a dependency issue, but I'm not sure.
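If it helps with debugging, here is a minimal check (not part of tszip) that reproduces the failing attribute lookup from compression.py, independent of tszip itself:

```python
# Quick diagnostic: does the installed numcodecs actually expose the Blosc
# codec that tszip's compressor setup needs? The hasattr check mirrors the
# line that raises AttributeError in compression.py.
try:
    import numcodecs
    blosc_available = hasattr(numcodecs, "Blosc")
    print("numcodecs", numcodecs.__version__, "Blosc available:", blosc_available)
except ImportError:
    blosc_available = None
    print("numcodecs is not installed")
```

If this prints `Blosc available: False`, the installed numcodecs was likely built without its compiled Blosc extension, which would explain the error above.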
To match other tskit projects
Great utility, but I recently ran into an error when using the tszip.compress function within a Python script. It occurs even on very simple, freshly simulated tree sequences.
The error:
Traceback (most recent call last):
File "<string>", line 4, in <module>
File "/home/kele/anaconda3/envs/testenv/lib/python3.10/site-packages/tszip/compression.py", line 97, in compress
compress_zarr(ts, root, variants_only=variants_only)
File "/home/kele/anaconda3/envs/testenv/lib/python3.10/site-packages/tszip/compression.py", line 239, in compress_zarr
).compress(root, compressor)
File "/home/kele/anaconda3/envs/testenv/lib/python3.10/site-packages/tszip/compression.py", line 127, in compress
shape = self.array.shape
AttributeError: 'str' object has no attribute 'shape'
I can recreate the error via:
conda create --name testenv msprime tszip --channel conda-forge
conda activate testenv
python -c """import msprime
import tszip
ts = msprime.simulate(10, random_seed=1)
tszip.compress(ts, 'simulation.trees.tsz')"""
This env has Python 3.10.2, msprime 1.1.0, and tszip 0.2.0
Edit: I don't see this problem with tszip v0.1.0.
Discussion at tskit-dev/tskit#1566 (comment)
I am experiencing an error when trying to compress certain tree sequence files:
ValueError: Codec does not support buffers of > 2147483647 bytes
The error seems to originate from a chunking-related function in the zarr package. It occurs with both the Python and command-line versions of tszip.
I can't upload an example here because even the gzipped version of my file is too large (72.3MB).
Any insight into why this is happening and how I might resolve it?
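For what it's worth, 2147483647 is 2**31 - 1: Blosc refuses to compress any single buffer larger than that, so a very large unchunked zarr array trips the limit. A rough sketch (the helper and its numbers are illustrative, not tszip's actual chunking logic) of picking a chunk length that stays under the cap:

```python
# Blosc cannot compress a single buffer of more than 2**31 - 1 bytes, so each
# zarr chunk must stay below that size. This helper (hypothetical, not tszip
# code) picks the largest row count per chunk that still fits.
BLOSC_MAX_BUFFER = 2**31 - 1  # the 2147483647 from the error message

def max_rows_per_chunk(n_cols, itemsize):
    """Largest number of rows whose chunk stays under Blosc's buffer limit."""
    return max(1, BLOSC_MAX_BUFFER // (n_cols * itemsize))

# e.g. a genotype-like matrix with 500,000 one-byte columns per row:
rows = max_rows_per_chunk(n_cols=500_000, itemsize=1)
```

So the likely fix on tszip's side is to pass explicit chunk shapes to zarr for the very large arrays rather than storing them as one chunk.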
For tsinferred tree sequences, we have started to stuff quite a lot of information into the metadata fields, e.g. ancestor_data_id, sample_data_id, etc. This is not necessary for a lossy representation of the data, and there should probably be an option to remove it. We might also want to remove other sorts of metadata (e.g. for mutations, possibly even for nodes), and perhaps remove provenance, tree-sequence-level metadata, etc. I'm guessing we would have extra parameters to .compress() that would do this.
The reason I ask is that it would be nice to have some way of constructing a "minimal" TS file that simply reconstructs the sites x samples matrix. I could use this to test how well we think inference is doing, in terms of compression.
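As a possible starting point, tszip's existing variants_only flag (visible in the tracebacks above) already does part of this, discarding information not needed to rebuild the genotype matrix. A sketch, assuming that flag behaves as documented; the file name is illustrative:

```python
# Sketch using tszip's variants_only flag: write a lossy file that should
# still reconstruct the sites x samples matrix exactly.
import os
import tempfile

try:
    import msprime
    import tszip
    ts = msprime.simulate(10, mutation_rate=1, random_seed=1)
    path = os.path.join(tempfile.mkdtemp(), "minimal.trees.tsz")
    tszip.compress(ts, path, variants_only=True)
    decoded = tszip.decompress(path)
    # Genotypes should survive the lossy round trip even though other
    # information (node times, provenance, etc.) does not.
    genotypes_match = bool((decoded.genotype_matrix() == ts.genotype_matrix()).all())
except ImportError:
    genotypes_match = None  # msprime/tszip not installed here
```

An extra metadata-stripping option could then layer on top of this, rather than being a new mode of its own.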
When using the temporary directory we should make it hidden, as it'll be confusing to have a directory lying around.
Columns that need special treatment (strings and byte fields) are currently special-cased by name. This is brittle as we have to update tszip every time tskit changes its column and attribute specification. It would be better to store the type information in zarr attrs to use on decode. This would be a new file format, but could be made backwards compatible.
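The idea can be sketched like this, with plain dicts standing in for zarr arrays and their attrs (the key names are hypothetical, not a proposed format): record each column's dtype at encode time and trust that, not the column name, at decode time.

```python
# Sketch: store enough type information alongside each column that decoding
# needs no name-based special cases. A real implementation would put "attrs"
# in the zarr array's .attrs rather than a plain dict.
import numpy as np

def encode_column(values):
    """Store raw bytes plus the type info needed to decode later."""
    arr = np.asarray(values)
    return {
        "data": arr.tobytes(),
        "attrs": {"dtype": str(arr.dtype), "shape": list(arr.shape)},
    }

def decode_column(stored):
    """Rebuild the array from attrs alone, regardless of the column's name."""
    attrs = stored["attrs"]
    flat = np.frombuffer(stored["data"], dtype=attrs["dtype"])
    return flat.reshape(attrs["shape"])

col = encode_column(np.array([0.0, 0.5, 1.0]))
restored = decode_column(col)
```

String and ragged byte columns would additionally need a "kind" attr recording how to reassemble them from their data/offset pairs.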
I'm getting an error from my simulation code:
def sim_single_pulse_uni_AB(sample_size, L, u, r, locus_replicates, fixed_params=None):
    if fixed_params['u']:
        u = fixed_params['u']
    if not fixed_params['NA']:
        NA = selectVal(1000, 40000)
    else:
        NA = fixed_params['NA']
    if not fixed_params['NB']:
        NB = selectVal(1000, 40000)
    else:
        NB = fixed_params['NB']
    if not fixed_params['td']:
        td = selectVal(100, 3499.99)
    else:
        td = fixed_params['td']
    if not fixed_params['tm']:
        tm = td / 10
    else:
        tm = fixed_params['tm']
    if not fixed_params['m_AB']:
        m_AB = selectVal(0.1, 0.9)
    else:
        m_AB = fixed_params['m_AB']
    if not fixed_params['seed']:
        seed = random.randint(1, 2**32 - 1)
    else:
        seed = fixed_params['seed']
    print('NA: {}'.format(NA))
    print('NB: {}'.format(NB))
    print('tm: {}'.format(tm))
    print('td: {}'.format(td))
    print('m_AB: {}'.format(m_AB))
    print('seed: {}'.format(seed))
    A, B = 0, 1
    population_configurations = [
        msprime.PopulationConfiguration(
            sample_size=int(sample_size / 2), initial_size=NA),
        msprime.PopulationConfiguration(
            sample_size=int(sample_size / 2), initial_size=NB)
    ]
    demographic_events = [
        msprime.MassMigration(
            time=tm, source=A, destination=B, proportion=m_AB),
        msprime.MassMigration(
            time=td, source=B, destination=A, proportion=1.0)
    ]
    tree = msprime.simulate(population_configurations=population_configurations,
                            demographic_events=demographic_events,
                            length=L,
                            recombination_rate=r,
                            mutation_rate=u,
                            random_seed=seed,
                            num_replicates=locus_replicates)
    tszip.compress(tree, "simulation.trees.tsz")
Traceback (most recent call last):
File "src/data/simulate_msprime.py", line 218, in <module>
main()
File "src/data/simulate_msprime.py", line 206, in main
max_snps = sim_locus_reps(model_func, sample_size, L, u, r, fixed_params, param_file_path, j, out_file_path, max_snps, locus_replicates)
File "src/data/simulate_msprime.py", line 156, in sim_locus_reps
tree_replicates, params, y, label = model_func(sample_size, L, u, r, locus_replicates, fixed_params)
File "/Users/agladsteinNew/dev/cnn_classify_demography/src/data/demographic_models.py", line 336, in sim_single_pulse_uni_AB
tszip.compress(tree, "simulation.trees.tsz")
File "/Users/agladsteinNew/.local/share/virtualenvs/cnn_classify_demography-zZ5GtBgD/lib/python3.7/site-packages/tszip/compression.py", line 97, in compress
compress_zarr(ts, root, variants_only=variants_only)
File "/Users/agladsteinNew/.local/share/virtualenvs/cnn_classify_demography-zZ5GtBgD/lib/python3.7/site-packages/tszip/compression.py", line 160, in compress_zarr
tables = ts.tables
AttributeError: 'generator' object has no attribute 'tables'
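The cause here is that passing num_replicates makes msprime.simulate return an iterator of tree sequences rather than a single one, so tszip.compress is handed a generator. A sketch of the fix (file names illustrative): compress each replicate separately.

```python
# With num_replicates set, msprime.simulate() yields tree sequences one by
# one; each must be compressed to its own file.
import os
import tempfile

try:
    import msprime
    import tszip
    outdir = tempfile.mkdtemp()
    replicates = msprime.simulate(10, random_seed=1, num_replicates=3)
    for i, ts in enumerate(replicates):
        tszip.compress(ts, os.path.join(outdir, f"simulation_{i}.trees.tsz"))
    n_written = len(os.listdir(outdir))
except ImportError:
    n_written = None  # msprime/tszip not installed here
```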
#44 added the ability to decompress tszipped files directly to stdout. We would like to be able to do the inverse operation too. This should be straightforward enough, as we should be able to tell Zarr's zipfile implementation to write to a file handle.
Can follow examples from msprime.
We probably need at least 0.6.0. See #25.
We should record the provenance somewhere --- versions of tskit, zarr and numcodecs used.
I installed tszip via conda, and -c (and --stdout) are not recognized as arguments
I tried:
tszip -c test.ts
usage: tszip [-h] [-V] [-v] [--variants-only] [-S SUFFIX] [-k] [-f] [-d | -l]
files [files ...]
tszip: error: unrecognized arguments: -c
hullo, this is possibly not an 'issue' so much as my own problem.
I'm trying to decompress your 1000G tree sequences from Zenodo, but am getting this error message. Do you know what might be going on here?
6200D-132482-M:raw gtsambos$ tsunzip 1kg_chr19.trees.tsz
Traceback (most recent call last):
File "/Users/gtsambos/miniconda3/bin/tsunzip", line 10, in <module>
sys.exit(tsunzip_main())
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/tszip/cli.py", line 170, in tsunzip_main
main(args)
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/tszip/cli.py", line 153, in main
run_decompress(args)
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/tszip/cli.py", line 139, in run_decompress
ts = tszip.decompress(file_arg)
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/tszip/compression.py", line 113, in decompress
return decompress_zarr(root)
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/tszip/compression.py", line 282, in decompress_zarr
coordinates = root["coordinates"][:]
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/zarr/hierarchy.py", line 326, in __getitem__
synchronizer=self._synchronizer, cache_attrs=self.attrs.cache)
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/zarr/core.py", line 124, in __init__
self._load_metadata()
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/zarr/core.py", line 141, in _load_metadata
self._load_metadata_nosync()
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/zarr/core.py", line 169, in _load_metadata_nosync
self._compressor = get_codec(config)
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/numcodecs/registry.py", line 36, in get_codec
raise ValueError('codec not available: %r' % codec_id)
ValueError: codec not available: 'blosc'
We're getting a symbol resolution error when trying to import h5py in tests: https://github.com/tskit-dev/tszip/runs/4021050960?check_suite_focus=true#step:10:111
Looks like the same issue as this one: h5py/h5py#1880
I enabled the pep8speaks GitHub app for this repo, but it doesn't seem to be working.
It would be very handy to have a function that loads either an uncompressed tskit file or a tszipped one.
I guess mimicking tskit.load would be a natural choice?
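A minimal sketch of such a loader (the function name and the exception handling are assumptions, not an existing tszip API):

```python
# Hypothetical load_any(): accept either a plain tskit .trees file or a
# tszipped .tsz file, trying the native loader first.
try:
    import tskit
    import tszip

    def load_any(path):
        """Load either an uncompressed tskit file or a tszipped one."""
        try:
            return tskit.load(path)
        except tskit.FileFormatError:
            # Not a native tskit file; assume it is tszip-compressed.
            return tszip.decompress(path)

    have_loader = True
except ImportError:
    have_loader = None  # tskit/tszip not installed here
```

Sniffing the file's magic bytes instead of catching the exception would be another option, and would give a clearer error for files that are neither format.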
Extra CLI script that mimics gunzip.
Summarise #53
Hello, I was running into issues installing tszip because the numcodecs build was failing. It seems to be a common issue. When I then manually installed an earlier version of numcodecs (0.6.2), everything worked fine. Perhaps worth a note in the Installation docs?
Following usual tskit patterns.
Forgot to add the new feature from #44
Seeing some very strange behaviour:
$ pytest tests/test_compression.py
==================================== test session starts =====================================
platform linux -- Python 3.8.10, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
rootdir: /home/jk/work/github/tszip, configfile: setup.cfg
plugins: forked-1.3.0, timeout-1.4.2, regressions-2.2.0, mock-3.5.1, cov-2.10.1, xdist-2.1.0, datadir-1.3.1, hypothesis-6.12.0
gw0 [37] / gw1 [37] / gw2 [37] / gw3 [37]
...........ss........................ [100%]
=============================== 35 passed, 2 skipped in 1.68s ================================
$ pytest -n0 tests/test_compression.py
================================================================================== short test summary info ===================================================================================
FAILED tests/test_compression.py::TestGenotypeRoundTrip::test_small_msprime_individuals_metadata - AssertionError: <tskit.tables.IndividualTable object at 0x7f61ec07d2b0> != <tskit.tables...
FAILED tests/test_compression.py::TestExactRoundTrip::test_all_fields - AssertionError: Metadata schemas differ: self=OrderedDict([('codec', 'json')]) other=None
FAILED tests/test_compression.py::TestExactRoundTrip::test_small_msprime_individuals_metadata - AssertionError: IndividualTable row 1 differs:
FAILED tests/test_compression.py::TestExactRoundTrip::test_small_msprime_migration - AssertionError: MutationTable row 0 differs:
FAILED tests/test_compression.py::TestExactRoundTrip::test_small_msprime_no_recomb - AssertionError: MutationTable row 0 differs:
FAILED tests/test_compression.py::TestExactRoundTrip::test_small_msprime_recomb - AssertionError: MutationTable row 0 differs:
FAILED tests/test_compression.py::TestExactRoundTrip::test_small_msprime_top_level_metadata - AssertionError: Metadata schemas differ: self=OrderedDict([('codec', 'json'),
========================================================================== 7 failed, 28 passed, 2 skipped in 1.61s ===========================================================================
It seems like we're loading the installed version of tszip rather than the local one when we're running tests?
Is this possibly the cause of this weird failure we're getting in #45?
Possibly an issue with pytest 6.1.2 and maybe we could downgrade?
Get rid of Appveyor at a minimum
We're not quite handling this properly at the moment:
@aabiddanda would you mind giving an example of what's happening here please?
Should probably change to using a regular file.
Back in April 2021, I had some TreeSequence objects with top-level metadata that I wanted to write out using tszip, so I proposed PR #35. I've stuck to a version of tszip with that change for the past year.
PR #42 involved a rewrite of tszip that mostly supports legacy versions, but I have found that the top-level metadata in files I had previously written seems to be missing. Top-level metadata from newly tszipped files works fine, as it is stored in root.items() and is correctly handled, but the previous metadata is in root.attrs.items(). One piece of good news is that PR #35 never got its own tszip version on PyPI, so I'm probably the only one who uses it.
After quite a bit of debugging, trying to understand what PR #42 did, I came up with a solution that seems to work locally.
def decompress_zarr(root):
    coordinates = root["coordinates"][:]
    dict_repr = {"sequence_length": root.attrs["sequence_length"]}
    # Added by brianzhang01
    for key, value in root.attrs.items():
        if key == "metadata_schema":
            dict_repr[key] = json.dumps(value)
        elif key == "metadata":
            dict_repr[key] = json.dumps(value).encode("utf-8")
Would someone with more knowledge of the codebase be able to take a look and modify as necessary? My PR #35 includes a test, test_small_msprime_top_level_metadata, that can be used to construct a TreeSequence with some top-level metadata. If you write it out with a version right after PR #35 and then read it in with the latest version, I see the following keys for root.attrs.items():
format_name
format_version
metadata
metadata_schema
provenance
sequence_length
and the following are the keys of root.items():
coordinates
edges
individuals
migrations
mutations
nodes
populations
provenances
sites
I would also suggest checking whether the format_name, format_version, and provenance fields of root.attrs are correctly handled for the legacy versions. It seemed fine to me, with the provenance correctly carried over, but I notice that only root.attrs["sequence_length"] is accessed in the current decompress_zarr() function.
Thank you very much.
The tszip tests were passing on Nov 8, 2021. Then sometime between then and Jan 20, 2022, they began failing. Here are the latest CI runs, showing 46 failing tests.
During that window, tskit 0.4 was released: https://pypi.org/project/tskit/#history.
I ran some things locally that suggest tszip 0.2.0 works fine with tskit 0.3.7 but not with tskit 0.4.0, so I'm pinning the blame on changes in tskit 0.4. For now, I'll stick with tskit 0.3.7 until this is resolved.
Make sure we have test coverage of metadata schema round trips across all tables.
Currently we're explicitly writing down the columns that we compress, leading to loss of data when we have format updates in tskit. We should use the dict representation of the tables instead, and automatically create Column objects for the data with good default compression options. We can make some special cases for particular columns if we like, but we should by default always store all the data that comes from the input tree sequence.
related to #35
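The dict-driven approach above could be sketched like this (the path strings are illustrative, not tszip's actual storage layout): walk the tables' dict representation so that columns added by future tskit releases are picked up automatically.

```python
# Sketch: enumerate columns from TableCollection.asdict() rather than a
# hand-maintained list, so tskit format updates don't silently drop data.
try:
    import msprime
    ts = msprime.simulate(5, random_seed=1)
    column_paths = []
    for table_name, table in ts.tables.asdict().items():
        # Top-level scalars like sequence_length are not tables; skip them.
        if isinstance(table, dict):
            for col_name in table:
                column_paths.append(f"{table_name}/{col_name}")
    n_columns = len(column_paths)
except ImportError:
    n_columns = None  # msprime not installed here
```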