
tszip's People

Contributors

aabiddanda, benjeffery, brianzhang01, hyanwong, jeromekelleher, mergify[bot]


tszip's Issues

AttributeError: module 'numcodecs' has no attribute 'Blosc'

I'm trying out tszip and I get the same error on my own simulation and on the example from the docs.

On macOS, I installed tszip 0.1.0 with pip:

import msprime
import tszip

ts = msprime.simulate(10, random_seed=1)
tszip.compress(ts, "simulation.trees.tsz")

I got the error:

Traceback (most recent call last):
  File "src/data/test_tszip.py", line 5, in <module>
    tszip.compress(ts, "simulation.trees.tsz")
  File "/Users/agladsteinNew/.local/share/virtualenvs/cnn_classify_demography-zZ5GtBgD/lib/python3.7/site-packages/tszip/compression.py", line 97, in compress
    compress_zarr(ts, root, variants_only=variants_only)
  File "/Users/agladsteinNew/.local/share/virtualenvs/cnn_classify_demography-zZ5GtBgD/lib/python3.7/site-packages/tszip/compression.py", line 240, in compress_zarr
    compressor = numcodecs.Blosc(cname='zstd', clevel=9, shuffle=numcodecs.Blosc.SHUFFLE)
AttributeError: module 'numcodecs' has no attribute 'Blosc'

I'm guessing it's a dependency issue, but I'm not sure.

New time_units field breaks compression

Great utility, but I recently ran into an error when trying to use the tszip.compress function within a Python script. It occurs even on very simple, freshly simulated tree sequences.

The error:

Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "/home/kele/anaconda3/envs/testenv/lib/python3.10/site-packages/tszip/compression.py", line 97, in compress
    compress_zarr(ts, root, variants_only=variants_only)
  File "/home/kele/anaconda3/envs/testenv/lib/python3.10/site-packages/tszip/compression.py", line 239, in compress_zarr
    ).compress(root, compressor)
  File "/home/kele/anaconda3/envs/testenv/lib/python3.10/site-packages/tszip/compression.py", line 127, in compress
    shape = self.array.shape
AttributeError: 'str' object has no attribute 'shape'

I can recreate the error via:

conda create --name testenv msprime tszip --channel conda-forge
conda activate testenv
python -c """import msprime
import tszip
ts = msprime.simulate(10, random_seed=1)
tszip.compress(ts, 'simulation.trees.tsz')"""

This env has Python 3.10.2, msprime 1.1.0, and tszip 0.2.0.

Edit: this problem does not seem to occur with tszip v0.1.0.

Cannot handle columns > 2GiB

I am experiencing an error when trying to compress certain tree sequence files:

ValueError: Codec does not support buffers of > 2147483647 bytes

This error seems to originate from a chunking-related function in the zarr package. It occurs with both the Python and command-line versions of tszip.

I can't upload an example here because even the gzipped version of my file is too large (72.3MB).

Any insight into why this is happening and how I might resolve it?

Remove (some?) metadata

For tsinferred tree sequences, we have started to stuff quite a lot of information into the metadata fields, e.g. ancestor_data_id, sample_data_id, etc. This is not necessary for a lossy representation of the data, and there should probably be an option to remove it. We might also want to remove other sorts of metadata (e.g. for mutations, possibly even for nodes), and perhaps remove provenance, tree-sequence-level metadata, etc. I'm guessing we would add extra parameters to .compress() for this.

The reason I ask is that it would be nice to have some way of constructing a "minimal" TS file that simply reconstructs the sites x samples matrix. I could use this to test how well we think inference is doing, in terms of compression.

Refactor column transforms

Columns that need special treatment (strings and byte fields) are currently special-cased by name. This is brittle as we have to update tszip every time tskit changes its column and attribute specification. It would be better to store the type information in zarr attrs to use on decode. This would be a new file format, but could be made backwards compatible.

AttributeError: 'generator' object has no attribute 'head'

Hello, I'm new to coding and a new assignment recently requires me to perform association rules mining. I'm getting the following error and would like some help to resolve it.

[screenshot of the error]

Do let me know if you require more information. Thanks in advance!

AttributeError: 'generator' object has no attribute 'tables'

I'm getting an error from my simulation code:

def sim_single_pulse_uni_AB(sample_size, L, u, r, locus_replicates, fixed_params=None):
    if fixed_params['u']:
        u = fixed_params['u']
    if not fixed_params['NA']:
        NA = selectVal(1000, 40000)
    else:
        NA = fixed_params['NA']
    if not fixed_params['NB']:
        NB = selectVal(1000, 40000)
    else:
        NB = fixed_params['NB']
    if not fixed_params['td']:
        td = selectVal(100, 3499.99)
    else:
        td = fixed_params['td']
    if not fixed_params['tm']:
        tm = td/10
    else:
        tm = fixed_params['tm']
    if not fixed_params['m_AB']:
        m_AB = selectVal(0.1, 0.9)
    else:
        m_AB = fixed_params['m_AB']
    if not fixed_params['seed']:
        seed = random.randint(1, 2**32-1)
    else:
        seed = fixed_params['seed']
    print('NA: {}'.format(NA))
    print('NB: {}'.format(NB))
    print('tm: {}'.format(tm))
    print('td: {}'.format(td))
    print('m_AB: {}'.format(m_AB))
    print('seed: {}'.format(seed))
    A, B = 0, 1
    population_configurations = [
        msprime.PopulationConfiguration(
            sample_size=int(sample_size/2), initial_size=NA),
        msprime.PopulationConfiguration(
            sample_size=int(sample_size/2), initial_size=NB)
    ]
    demographic_events = [
        msprime.MassMigration(
            time=tm, source=A, destination=B, proportion=m_AB),
        msprime.MassMigration(
            time=td, source=B, destination=A, proportion=1.0)
    ]
    tree = msprime.simulate(population_configurations=population_configurations,
                            demographic_events=demographic_events,
                            length=L,
                            recombination_rate=r,
                            mutation_rate=u,
                            random_seed=seed,
                            num_replicates=locus_replicates)
    tszip.compress(tree, "simulation.trees.tsz")

Traceback (most recent call last):
  File "src/data/simulate_msprime.py", line 218, in <module>
    main()
  File "src/data/simulate_msprime.py", line 206, in main
    max_snps = sim_locus_reps(model_func, sample_size, L, u, r, fixed_params, param_file_path, j, out_file_path, max_snps, locus_replicates)
  File "src/data/simulate_msprime.py", line 156, in sim_locus_reps
    tree_replicates, params, y, label = model_func(sample_size, L, u, r, locus_replicates, fixed_params)
  File "/Users/agladsteinNew/dev/cnn_classify_demography/src/data/demographic_models.py", line 336, in sim_single_pulse_uni_AB
    tszip.compress(tree, "simulation.trees.tsz")
  File "/Users/agladsteinNew/.local/share/virtualenvs/cnn_classify_demography-zZ5GtBgD/lib/python3.7/site-packages/tszip/compression.py", line 97, in compress
    compress_zarr(ts, root, variants_only=variants_only)
  File "/Users/agladsteinNew/.local/share/virtualenvs/cnn_classify_demography-zZ5GtBgD/lib/python3.7/site-packages/tszip/compression.py", line 160, in compress_zarr
    tables = ts.tables
AttributeError: 'generator' object has no attribute 'tables'

Compress to stdout

#44 added the ability to decompress tszipped files directly to stdout. We would like to be able to do the inverse operation too. This should be straightforward enough, as we should be able to tell Zarr's zipfile implementation to write to a file handle.

cannot write to stdout

I installed tszip via conda, and -c (and --stdout) are not recognized as arguments.

I tried:
tszip -c test.ts

usage: tszip [-h] [-V] [-v] [--variants-only] [-S SUFFIX] [-k] [-f] [-d | -l]
files [files ...]
tszip: error: unrecognized arguments: -c

ValueError: codec not available: 'blosc'

Hullo, this is possibly not an 'issue' so much as my own problem.
I'm trying to decompress your 1000G tree sequences from Zenodo, but I'm getting this error message. Do you know what might be going on here?

6200D-132482-M:raw gtsambos$ tsunzip 1kg_chr19.trees.tsz 
Traceback (most recent call last):
  File "/Users/gtsambos/miniconda3/bin/tsunzip", line 10, in <module>
    sys.exit(tsunzip_main())
  File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/tszip/cli.py", line 170, in tsunzip_main
    main(args)
  File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/tszip/cli.py", line 153, in main
    run_decompress(args)
  File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/tszip/cli.py", line 139, in run_decompress
    ts = tszip.decompress(file_arg)
  File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/tszip/compression.py", line 113, in decompress
    return decompress_zarr(root)
  File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/tszip/compression.py", line 282, in decompress_zarr
    coordinates = root["coordinates"][:]
  File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/zarr/hierarchy.py", line 326, in __getitem__
    synchronizer=self._synchronizer, cache_attrs=self.attrs.cache)
  File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/zarr/core.py", line 124, in __init__
    self._load_metadata()
  File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/zarr/core.py", line 141, in _load_metadata
    self._load_metadata_nosync()
  File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/zarr/core.py", line 169, in _load_metadata_nosync
    self._compressor = get_codec(config)
  File "/Users/gtsambos/miniconda3/lib/python3.7/site-packages/numcodecs/registry.py", line 36, in get_codec
    raise ValueError('codec not available: %r' % codec_id)
ValueError: codec not available: 'blosc'

Load from compressed or uncompressed

It would be very handy to have a function that loads either an uncompressed tskit file or a tszipped one.

I guess mimicking tskit.load would be a natural choice?

Latest version of numcodecs doesn't build

Hello, I was running into issues installing tszip because the numcodecs build was failing. It seems to be a common issue. When I manually installed an earlier version of numcodecs (0.6.2), everything worked fine. Perhaps worth a note in the installation docs?

Tests pass in pytest by default but not with -n0

Seeing some very strange behaviour:

$ pytest tests/test_compression.py
==================================== test session starts =====================================
platform linux -- Python 3.8.10, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
rootdir: /home/jk/work/github/tszip, configfile: setup.cfg
plugins: forked-1.3.0, timeout-1.4.2, regressions-2.2.0, mock-3.5.1, cov-2.10.1, xdist-2.1.0, datadir-1.3.1, hypothesis-6.12.0
gw0 [37] / gw1 [37] / gw2 [37] / gw3 [37]
...........ss........................                                                  [100%]
=============================== 35 passed, 2 skipped in 1.68s ================================

$ pytest -n0 tests/test_compression.py
================================================================================== short test summary info ===================================================================================
FAILED tests/test_compression.py::TestGenotypeRoundTrip::test_small_msprime_individuals_metadata - AssertionError: <tskit.tables.IndividualTable object at 0x7f61ec07d2b0> != <tskit.tables...
FAILED tests/test_compression.py::TestExactRoundTrip::test_all_fields - AssertionError: Metadata schemas differ: self=OrderedDict([('codec', 'json')]) other=None
FAILED tests/test_compression.py::TestExactRoundTrip::test_small_msprime_individuals_metadata - AssertionError: IndividualTable row 1 differs:
FAILED tests/test_compression.py::TestExactRoundTrip::test_small_msprime_migration - AssertionError: MutationTable row 0 differs:
FAILED tests/test_compression.py::TestExactRoundTrip::test_small_msprime_no_recomb - AssertionError: MutationTable row 0 differs:
FAILED tests/test_compression.py::TestExactRoundTrip::test_small_msprime_recomb - AssertionError: MutationTable row 0 differs:
FAILED tests/test_compression.py::TestExactRoundTrip::test_small_msprime_top_level_metadata - AssertionError: Metadata schemas differ: self=OrderedDict([('codec', 'json'),
========================================================================== 7 failed, 28 passed, 2 skipped in 1.61s ===========================================================================

It seems like we're loading the installed version of tszip rather than the local one when running the tests?

Is this possibly the cause of this weird failure we're getting in #45?

Possibly an issue with pytest 6.1.2 and maybe we could downgrade?

Edge case of decompressing legacy file with top-level metadata

Back in April 2021, I had some TreeSequence objects with top-level metadata that I wanted to write out using tszip, so I proposed PR #35. I've stuck to a version of tszip with that change for the past year.

PR #42 involved a rewrite of tszip that mostly supports legacy versions, but I have found that the top-level metadata in files I had previously written seems to be missing. Top-level metadata from newly tszipped files works fine, as it is stored in root.items() and is correctly handled, but the previous metadata is in root.attrs.items(). Another piece of good news is that PR #35 never got its own tszip version on PyPI, so I'm probably the only one who uses it. 😄

After quite a bit of debugging, trying to understand what PR #42 did, I came up with a solution that seems to work locally.

def decompress_zarr(root):
    coordinates = root["coordinates"][:]
    dict_repr = {"sequence_length": root.attrs["sequence_length"]}

    # Added by brianzhang01
    for key, value in root.attrs.items():
        if key == "metadata_schema":
            dict_repr[key] = json.dumps(value)
        elif key == "metadata":
            dict_repr[key] = json.dumps(value).encode("utf-8")

Would someone with more knowledge of the codebase be able to take a look and modify as necessary? My PR #35 includes a test, test_small_msprime_top_level_metadata, that can be used to construct a TreeSequence with some top-level metadata. If you write it out with a version right after PR #35 and then read it in with the latest version, I see the following keys for root.attrs.items():

format_name
format_version
metadata
metadata_schema
provenance
sequence_length

and the following are the keys of root.items():

coordinates
edges
individuals
migrations
mutations
nodes
populations
provenances
sites

I would also suggest checking whether the format_name, format_version, and provenance fields of root.attrs are correctly handled for the legacy versions. It seemed fine to me, with provenance correctly carried over, but I note that only root.attrs["sequence_length"] is accessed in the current decompress_zarr() function.

Thank you very much.

tskit 0.4 seems to be breaking tszip tests

The tszip tests were passing on Nov 8, 2021. Sometime between then and Jan 20, 2022, they began failing. Here are the latest CI runs, showing 46 failing tests:

https://app.circleci.com/pipelines/github/tskit-dev/tszip/102/workflows/085bcc06-8109-4b39-b907-4ab8047524ff/jobs/104

During that window, tskit 0.4 was released: https://pypi.org/project/tskit/#history.

I ran some things locally that suggest tszip 0.2.0 works fine with tskit 0.3.7 but not with tskit 0.4.0, so I'm pinning this on changes in tskit 0.4. For now, I'll stick with tskit 0.3.7 until this is resolved.

Refactor to use the dict representation of tree sequence

Currently we explicitly write down the columns that we compress, which leads to loss of data when tskit updates its format. We should use the dict representation of the tables instead, and automatically create Column objects with good default compression options. We can special-case particular columns if we like, but by default we should always store all the data from the input tree sequence.

related to #35
