tskit-dev / tszip
Gzip-like compression for tskit tree sequences
Home Page: https://tszip.readthedocs.io/
License: MIT License
I'm trying out tszip and I got the same error on my own simulation and on the example from the docs.
On macOS, I installed tszip 0.1.0 with pip:
import msprime
import tszip
ts = msprime.simulate(10, random_seed=1)
tszip.compress(ts, "simulation.trees.tsz")
I got the error:
Traceback (most recent call last):
File "src/data/test_tszip.py", line 5, in <module>
tszip.compress(ts, "simulation.trees.tsz")
File "/Users/agladsteinNew/.local/share/virtualenvs/cnn_classify_demography-zZ5GtBgD/lib/python3.7/site-packages/tszip/compression.py", line 97, in compress
compress_zarr(ts, root, variants_only=variants_only)
File "/Users/agladsteinNew/.local/share/virtualenvs/cnn_classify_demography-zZ5GtBgD/lib/python3.7/site-packages/tszip/compression.py", line 240, in compress_zarr
compressor = numcodecs.Blosc(cname='zstd', clevel=9, shuffle=numcodecs.Blosc.SHUFFLE)
AttributeError: module 'numcodecs' has no attribute 'Blosc'
I'm guessing it's a dependency issue, but I'm not sure.
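If it helps with debugging, here is a minimal check (not part of tszip) that reproduces the failing attribute lookup from compression.py, independent of tszip itself:

```python
# Quick diagnostic: does the installed numcodecs actually expose the Blosc
# codec that tszip's compressor setup needs? The hasattr check mirrors the
# line that raises AttributeError in compression.py.
try:
    import numcodecs
    blosc_available = hasattr(numcodecs, "Blosc")
    print("numcodecs", numcodecs.__version__, "Blosc available:", blosc_available)
except ImportError:
    blosc_available = None
    print("numcodecs is not installed")
```

If this prints `Blosc available: False`, the installed numcodecs was likely built without its compiled Blosc extension, which would explain the error above.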
To match other tskit projects
Great utility, but I recently ran into an error when using the tszip.compress function within a Python script. It occurs even on very simple, freshly simulated tree sequences.
The error:
Traceback (most recent call last):
File "<string>", line 4, in <module>
File "/home/kele/anaconda3/envs/testenv/lib/python3.10/site-packages/tszip/compression.py", line 97, in compress
compress_zarr(ts, root, variants_only=variants_only)
File "/home/kele/anaconda3/envs/testenv/lib/python3.10/site-packages/tszip/compression.py", line 239, in compress_zarr
).compress(root, compressor)
File "/home/kele/anaconda3/envs/testenv/lib/python3.10/site-packages/tszip/compression.py", line 127, in compress
shape = self.array.shape
AttributeError: 'str' object has no attribute 'shape'
I can recreate the error via:
conda create --name testenv msprime tszip --channel conda-forge
conda activate testenv
python -c """import msprime
import tszip
ts = msprime.simulate(10, random_seed=1)
tszip.compress(ts, 'simulation.trees.tsz')"""
This env has Python 3.10.2, msprime 1.1.0, and tszip 0.2.0
Edit: I don't see this problem with tszip v0.1.0.
Discussion at tskit-dev/tskit#1566 (comment)
I am experiencing an error when trying to compress certain tree sequence files:
ValueError: Codec does not support buffers of > 2147483647 bytes
The error seems to originate from a chunking-related function in the zarr package. It occurs with both the Python and command-line versions of tszip.
I can't upload an example here because even the gzipped version of my file is too large (72.3MB).
Any insight into why this is happening and how I might resolve it?
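For what it's worth, 2147483647 is 2**31 - 1: Blosc refuses to compress any single buffer larger than that, so a very large unchunked zarr array trips the limit. A rough sketch (the helper and its numbers are illustrative, not tszip's actual chunking logic) of picking a chunk length that stays under the cap:

```python
# Blosc cannot compress a single buffer of more than 2**31 - 1 bytes, so each
# zarr chunk must stay below that size. This helper (hypothetical, not tszip
# code) picks the largest row count per chunk that still fits.
BLOSC_MAX_BUFFER = 2**31 - 1  # the 2147483647 from the error message

def max_rows_per_chunk(n_cols, itemsize):
    """Largest number of rows whose chunk stays under Blosc's buffer limit."""
    return max(1, BLOSC_MAX_BUFFER // (n_cols * itemsize))

# e.g. a genotype-like matrix with 500,000 one-byte columns per row:
rows = max_rows_per_chunk(n_cols=500_000, itemsize=1)
```

So the likely fix on tszip's side is to pass explicit chunk shapes to zarr for the very large arrays rather than storing them as one chunk.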
For tsinferred tree sequences, we have started to stuff quite a lot of information into the metadata fields, e.g. ancestor_data_id, sample_data_id, etc. This is not necessary for a lossy representation of the data, and there should probably be an option to remove it. We might also want to remove other sorts of metadata (e.g. for mutations, possibly even for nodes), and perhaps remove provenance, tree-sequence-level metadata, etc. I'm guessing we would have extra parameters to .compress() that would do this.
The reason I ask is that it would be nice to have some way of constructing a "minimal" TS file that simply reconstructs the sites x samples matrix. I could use this to test how well we think inference is doing, in terms of compression.
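As a possible starting point, tszip's existing variants_only flag (visible in the tracebacks above) already does part of this, discarding information not needed to rebuild the genotype matrix. A sketch, assuming that flag behaves as documented; the file name is illustrative:

```python
# Sketch using tszip's variants_only flag: write a lossy file that should
# still reconstruct the sites x samples matrix exactly.
import os
import tempfile

try:
    import msprime
    import tszip
    ts = msprime.simulate(10, mutation_rate=1, random_seed=1)
    path = os.path.join(tempfile.mkdtemp(), "minimal.trees.tsz")
    tszip.compress(ts, path, variants_only=True)
    decoded = tszip.decompress(path)
    # Genotypes should survive the lossy round trip even though other
    # information (node times, provenance, etc.) does not.
    genotypes_match = bool((decoded.genotype_matrix() == ts.genotype_matrix()).all())
except ImportError:
    genotypes_match = None  # msprime/tszip not installed here
```

An extra metadata-stripping option could then layer on top of this, rather than being a new mode of its own.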
When using the temporary directory we should make it hidden, as it'll be confusing to have a directory lying around.
Columns that need special treatment (strings and byte fields) are currently special-cased by name. This is brittle as we have to update tszip every time tskit changes its column and attribute specification. It would be better to store the type information in zarr attrs to use on decode. This would be a new file format, but could be made backwards compatible.
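The idea can be sketched like this, with plain dicts standing in for zarr arrays and their attrs (the key names are hypothetical, not a proposed format): record each column's dtype at encode time and trust that, not the column name, at decode time.

```python
# Sketch: store enough type information alongside each column that decoding
# needs no name-based special cases. A real implementation would put "attrs"
# in the zarr array's .attrs rather than a plain dict.
import numpy as np

def encode_column(values):
    """Store raw bytes plus the type info needed to decode later."""
    arr = np.asarray(values)
    return {
        "data": arr.tobytes(),
        "attrs": {"dtype": str(arr.dtype), "shape": list(arr.shape)},
    }

def decode_column(stored):
    """Rebuild the array from attrs alone, regardless of the column's name."""
    attrs = stored["attrs"]
    flat = np.frombuffer(stored["data"], dtype=attrs["dtype"])
    return flat.reshape(attrs["shape"])

col = encode_column(np.array([0.0, 0.5, 1.0]))
restored = decode_column(col)
```

String and ragged byte columns would additionally need a "kind" attr recording how to reassemble them from their data/offset pairs.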
I'm getting an error from my simulation code:
def sim_single_pulse_uni_AB(sample_size, L, u, r, locus_replicates, fixed_params=None):
    if fixed_params['u']:
        u = fixed_params['u']
    if not fixed_params['NA']:
        NA = selectVal(1000, 40000)
    else:
        NA = fixed_params['NA']
    if not fixed_params['NB']:
        NB = selectVal(1000, 40000)
    else:
        NB = fixed_params['NB']
    if not fixed_params['td']:
        td = selectVal(100, 3499.99)
    else:
        td = fixed_params['td']
    if not fixed_params['tm']:
        tm = td / 10
    else:
        tm = fixed_params['tm']
    if not fixed_params['m_AB']:
        m_AB = selectVal(0.1, 0.9)
    else:
        m_AB = fixed_params['m_AB']
    if not fixed_params['seed']:
        seed = random.randint(1, 2**32 - 1)
    else:
        seed = fixed_params['seed']
    print('NA: {}'.format(NA))
    print('NB: {}'.format(NB))
    print('tm: {}'.format(tm))
    print('td: {}'.format(td))
    print('m_AB: {}'.format(m_AB))
    print('seed: {}'.format(seed))
    A, B = 0, 1
    population_configurations = [
        msprime.PopulationConfiguration(
            sample_size=int(sample_size / 2), initial_size=NA),
        msprime.PopulationConfiguration(
            sample_size=int(sample_size / 2), initial_size=NB)
    ]
    demographic_events = [
        msprime.MassMigration(
            time=tm, source=A, destination=B, proportion=m_AB),
        msprime.MassMigration(
            time=td, source=B, destination=A, proportion=1.0)
    ]
    tree = msprime.simulate(population_configurations=population_configurations,
                            demographic_events=demographic_events,
                            length=L,
                            recombination_rate=r,
                            mutation_rate=u,
                            random_seed=seed,
                            num_replicates=locus_replicates)
    tszip.compress(tree, "simulation.trees.tsz")
Traceback (most recent call last):
File "src/data/simulate_msprime.py", line 218, in <module>
main()
File "src/data/simulate_msprime.py", line 206, in main
max_snps = sim_locus_reps(model_func, sample_size, L, u, r, fixed_params, param_file_path, j, out_file_path, max_snps, locus_replicates)
File "src/data/simulate_msprime.py", line 156, in sim_locus_reps
tree_replicates, params, y, label = model_func(sample_size, L, u, r, locus_replicates, fixed_params)
File "/Users/agladsteinNew/dev/cnn_classify_demography/src/data/demographic_models.py", line 336, in sim_single_pulse_uni_AB
tszip.compress(tree, "simulation.trees.tsz")
File "/Users/agladsteinNew/.local/share/virtualenvs/cnn_classify_demography-zZ5GtBgD/lib/python3.7/site-packages/tszip/compression.py", line 97, in compress
compress_zarr(ts, root, variants_only=variants_only)
File "/Users/agladsteinNew/.local/share/virtualenvs/cnn_classify_demography-zZ5GtBgD/lib/python3.7/site-packages/tszip/compression.py", line 160, in compress_zarr
tables = ts.tables
AttributeError: 'generator' object has no attribute 'tables'
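The cause here is that passing num_replicates makes msprime.simulate return an iterator of tree sequences rather than a single one, so tszip.compress is handed a generator. A sketch of the fix (file names illustrative): compress each replicate separately.

```python
# With num_replicates set, msprime.simulate() yields tree sequences one by
# one; each must be compressed to its own file.
import os
import tempfile

try:
    import msprime
    import tszip
    outdir = tempfile.mkdtemp()
    replicates = msprime.simulate(10, random_seed=1, num_replicates=3)
    for i, ts in enumerate(replicates):
        tszip.compress(ts, os.path.join(outdir, f"simulation_{i}.trees.tsz"))
    n_written = len(os.listdir(outdir))
except ImportError:
    n_written = None  # msprime/tszip not installed here
```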
#44 added the ability to decompress tszipped files directly to stdout. We would like to be able to do the inverse operation too. This should be straightforward enough, as we should be able to tell Zarr's zipfile implementation to write to a file handle.
Can follow examples from msprime.
We probably need at least 0.6.0. See #25.
We should record the provenance somewhere --- versions of tskit, zarr and numcodecs used.
I installed tszip via conda, and -c (and --stdout) are not recognized as arguments
I tried:
tszip -c test.ts
usage: tszip [-h] [-V] [-v] [--variants-only] [-S SUFFIX] [-k] [-f] [-d | -l]
files [files ...]
tszip: error: unrecognized arguments: -c
hullo, this is possibly not an 'issue' so much as my own problem.
I'm trying to decompress your 1000G tree sequences from Zenodo, but am getting this error message. Do you know what might be going on here?
6200D-132482-M:raw gtsambos$ tsunzip 1kg_chr19.trees.tsz
Traceback (most recent call last):
File "/Users/gtsambos/miniconda3/bin/tsunzip", line 10, in <module>
sys.exit(tsunzip_main())
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/tszip/cli.py", line 170, in tsunzip_main
main(args)
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/tszip/cli.py", line 153, in main
run_decompress(args)
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/tszip/cli.py", line 139, in run_decompress
ts = tszip.decompress(file_arg)
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/tszip/compression.py", line 113, in decompress
return decompress_zarr(root)
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/tszip/compression.py", line 282, in decompress_zarr
coordinates = root["coordinates"][:]
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/zarr/hierarchy.py", line 326, in __getitem__
synchronizer=self._synchronizer, cache_attrs=self.attrs.cache)
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/zarr/core.py", line 124, in __init__
self._load_metadata()
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/zarr/core.py", line 141, in _load_metadata
self._load_metadata_nosync()
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/zarr/core.py", line 169, in _load_metadata_nosync
self._compressor = get_codec(config)
File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/numcodecs/registry.py", line 36, in get_codec
raise ValueError('codec not available: %r' % codec_id)
ValueError: codec not available: 'blosc'
We're getting a symbol resolution error when trying to import h5py in tests: https://github.com/tskit-dev/tszip/runs/4021050960?check_suite_focus=true#step:10:111
Looks like the same issue as this one: h5py/h5py#1880
I enabled the pep8speaks GitHub app for this repo, but it doesn't seem to be working.
It would be very handy to have a function that loads either an uncompressed tskit file or a tszipped one.
I guess mimicking tskit.load would be a natural choice?
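A minimal sketch of such a loader (the function name and the exception handling are assumptions, not an existing tszip API):

```python
# Hypothetical load_any(): accept either a plain tskit .trees file or a
# tszipped .tsz file, trying the native loader first.
try:
    import tskit
    import tszip

    def load_any(path):
        """Load either an uncompressed tskit file or a tszipped one."""
        try:
            return tskit.load(path)
        except tskit.FileFormatError:
            # Not a native tskit file; assume it is tszip-compressed.
            return tszip.decompress(path)

    have_loader = True
except ImportError:
    have_loader = None  # tskit/tszip not installed here
```

Sniffing the file's magic bytes instead of catching the exception would be another option, and would give a clearer error for files that are neither format.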
Extra CLI script that mimics gunzip.
Summarise #53
Hello, I was running into issues installing tszip because the numcodecs build was failing. It seems to be a common issue. When I then manually installed an earlier version of numcodecs (0.6.2), everything worked fine. Perhaps worth a note in the Installation docs?
Following usual tskit patterns.
Forgot to add the new feature from #44
Seeing some very strange behaviour:
$ pytest tests/test_compression.py
==================================== test session starts =====================================
platform linux -- Python 3.8.10, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
rootdir: /home/jk/work/github/tszip, configfile: setup.cfg
plugins: forked-1.3.0, timeout-1.4.2, regressions-2.2.0, mock-3.5.1, cov-2.10.1, xdist-2.1.0, datadir-1.3.1, hypothesis-6.12.0
gw0 [37] / gw1 [37] / gw2 [37] / gw3 [37]
...........ss........................ [100%]
=============================== 35 passed, 2 skipped in 1.68s ================================
$ pytest -n0 tests/test_compression.py
================================================================================== short test summary info ===================================================================================
FAILED tests/test_compression.py::TestGenotypeRoundTrip::test_small_msprime_individuals_metadata - AssertionError: <tskit.tables.IndividualTable object at 0x7f61ec07d2b0> != <tskit.tables...
FAILED tests/test_compression.py::TestExactRoundTrip::test_all_fields - AssertionError: Metadata schemas differ: self=OrderedDict([('codec', 'json')]) other=None
FAILED tests/test_compression.py::TestExactRoundTrip::test_small_msprime_individuals_metadata - AssertionError: IndividualTable row 1 differs:
FAILED tests/test_compression.py::TestExactRoundTrip::test_small_msprime_migration - AssertionError: MutationTable row 0 differs:
FAILED tests/test_compression.py::TestExactRoundTrip::test_small_msprime_no_recomb - AssertionError: MutationTable row 0 differs:
FAILED tests/test_compression.py::TestExactRoundTrip::test_small_msprime_recomb - AssertionError: MutationTable row 0 differs:
FAILED tests/test_compression.py::TestExactRoundTrip::test_small_msprime_top_level_metadata - AssertionError: Metadata schemas differ: self=OrderedDict([('codec', 'json'),
========================================================================== 7 failed, 28 passed, 2 skipped in 1.61s ===========================================================================
It seems like we're loading the installed version of tszip rather than the local one when we're running tests?
Is this possibly the cause of this weird failure we're getting in #45?
Possibly an issue with pytest 6.1.2 and maybe we could downgrade?
Get rid of Appveyor at a minimum
We're not quite handling this properly at the moment:
@aabiddanda would you mind giving an example of what's happening here please?
Should probably change to using a regular file.
Back in April 2021, I had some TreeSequence objects with top-level metadata that I wanted to write out using tszip, so I proposed PR #35. I've stuck to a version of tszip with that change for the past year.
PR #42 involved a rewrite of tszip that mostly supports legacy versions, but I have found that the top-level metadata in files I had previously written seems to be missing. Top-level metadata from newly tszipped files works fine, as it is stored in root.items() and is correctly handled, but the previous metadata is in root.attrs.items(). One piece of good news is that PR #35 never got its own tszip version on PyPI, so I'm probably the only one who uses it.
After quite a bit of debugging, trying to understand what PR #42 did, I came up with a solution that seems to work locally.
def decompress_zarr(root):
    coordinates = root["coordinates"][:]
    dict_repr = {"sequence_length": root.attrs["sequence_length"]}
    # Added by brianzhang01
    for key, value in root.attrs.items():
        if key == "metadata_schema":
            dict_repr[key] = json.dumps(value)
        elif key == "metadata":
            dict_repr[key] = json.dumps(value).encode("utf-8")
Would someone with more knowledge of the codebase be able to take a look and modify as necessary? My PR #35 includes a test, test_small_msprime_top_level_metadata, that can be used to construct a TreeSequence with some top-level metadata. If you write it out with a version right after PR #35 and then read it in with the latest version, I see the following keys for root.attrs.items():
format_name
format_version
metadata
metadata_schema
provenance
sequence_length
and the following are the keys of root.items():
coordinates
edges
individuals
migrations
mutations
nodes
populations
provenances
sites
I would also suggest checking whether the format_name, format_version, and provenance fields of root.attrs are correctly handled for the legacy versions. It seemed fine to me, with the provenance correctly carried over, but I notice that only root.attrs["sequence_length"] is accessed in the current decompress_zarr() function.
Thank you very much.
The tszip tests were passing on Nov 8, 2021. Then sometime between then and Jan 20, 2022, they began failing. Here are the latest CI runs, showing 46 failing tests.
During that window, tskit 0.4 was released: https://pypi.org/project/tskit/#history.
I ran some things locally that suggest tszip 0.2.0 works fine with tskit 0.3.7 but not with tskit 0.4.0, so I'm pinning the blame on changes in tskit 0.4. For now, I'll stick with tskit 0.3.7 until this is resolved.
Make sure we have test coverage of metadata schema round trips across all tables.
Currently we're explicitly writing down the columns that we compress, leading to loss of data when we have format updates in tskit. We should use the dict representation of the tables instead, and automatically create Column objects for the data with good default compression options. We can make some special cases for particular columns if we like, but we should by default always store all the data that comes from the input tree sequence.
related to #35
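The dict-driven approach above could be sketched like this (the path strings are illustrative, not tszip's actual storage layout): walk the tables' dict representation so that columns added by future tskit releases are picked up automatically.

```python
# Sketch: enumerate columns from TableCollection.asdict() rather than a
# hand-maintained list, so tskit format updates don't silently drop data.
try:
    import msprime
    ts = msprime.simulate(5, random_seed=1)
    column_paths = []
    for table_name, table in ts.tables.asdict().items():
        # Top-level scalars like sequence_length are not tables; skip them.
        if isinstance(table, dict):
            for col_name in table:
                column_paths.append(f"{table_name}/{col_name}")
    n_columns = len(column_paths)
except ImportError:
    n_columns = None  # msprime not installed here
```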