
vcfnp's People

Contributors

alimanfoo, lbeltrame, travc


vcfnp's Issues

Use (optionally) pandas instead of view2d

Pandas is very useful for displaying tabular data with mixed dtypes. It might be nice to optionally use it to view heterogeneous dtyped data, in the same way view2d is used for homogeneous data.

The disadvantage is that you can't create a view: a new object would need to be created.
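A minimal sketch of what this could look like (the example data is hypothetical, not vcfnp output): pandas accepts a numpy structured array directly, though as noted it copies into per-column blocks rather than creating a view.

```python
import numpy as np
import pandas as pd

# A small structured array with mixed dtypes, standing in for the kind
# of array that vcfnp builds (hypothetical example data).
variants = np.array(
    [(b'chr1', 100, 30.0), (b'chr1', 200, 99.0)],
    dtype=[('CHROM', 'S8'), ('POS', 'i4'), ('QUAL', 'f4')],
)

# pandas can wrap a structured array directly; note this copies the
# data into per-column blocks, so it is a new object, not a view.
df = pd.DataFrame(variants)
print(df.dtypes)
```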

vcfnp.calldata bug when differing number of FORMAT fields

It seems that in cases where not all FORMAT fields are populated, default fill values override the actual values in the fields. Here is a small VCF file and some commands that hopefully show the problem. I've tried debugging, but I think my Python skills are not yet up to it...

##fileformat=VCFv4.1
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths">
##FORMAT=<ID=ZZ,Number=1,Type=String,Description="Dummy annotation">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT test1 test2 test3 test4
chr1 1 . A G . PASS DP=2 GT:AD 0:1,0 .:1,0 0:0,0 .:0,0
chr1 2 . A G . PASS DP=2 GT:AD:ZZ 0:1,0:dummy 0:1,0 0:0,0 .:0,0
chr1 3 . A G . PASS DP=2 GT:AD:ZZ 0:1,0:dummy .:1,0 0:0,0 .:0,0
import vcfnp
calls = vcfnp.calldata('test.vcf', ploidy=1, progress=20000)
vcfnp.view2d(calls)['AD']
array([[1, 1, 0, 0],
       [1, 0, 0, 0],
       [1, 0, 0, 0]], dtype=uint16)

As you can see above, the allele depths of 1,0 for the test2 sample are not pulled in.

I believe the above VCF file is on spec. The VCF spec states that "Trailing fields can be dropped (with the exception of the GT field, which should always be present if specified in the FORMAT field)". This issue arose when looking at the output of GATK's CombineVariants with the option -genotypeMergeOptions UNIQUIFY.

This example also highlights a second problem: only the first AD value gets pulled in. This can be worked around by using arities=dict(GT=1, AD=2) to get the correct result. I've also found I get the correct result if putting Number=G in the AD FORMAT header line, but I think the correct value here is A (there should be a number for each allele, not for each genotype, though of course in the haploid case these amount to the same thing).

Missing genotype but non-missing calldata

In some circumstances calldata() returns data where the genotype is missing but other fields have values (when they should instead have been filled).

It looks like the sample data is not getting cleared out between parsing variants, so sample data can get carried over from one variant to the next.

vcfnpy2blz

Add a script similar to vcfnpy2hdf5 but loading data into a BLZ barray.

Convert C++ exceptions into Python exceptions

So right now, difficulties parsing VCF files raise unhandled C++ exceptions that eventually lead to program termination. Ideally, we could catch some of these exceptions so that the outcome is a Python exception (IOError?) rather than a full stop. Below are two test cases I've encountered:

import vcfnp

# echo "hey" > test.txt
v = vcfnp.variants("./test.txt")

[vcfnp] 2015-09-03 09:58:24.050742 :: caching is disabled
[vcfnp] 2015-09-03 09:58:24.050779 :: building array
error: no VCF header


import vcfnp

v = vcfnp.variants(".")
[vcfnp] 2015-09-03 09:57:32.435456 :: caching is disabled
[vcfnp] 2015-09-03 09:57:32.435487 :: building array
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr
~/opt/bin/python.app: line 3: 84448 Abort trap: 6  
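Until the C++ exceptions are wrapped, a Python-side pre-check could at least turn the cases above into clean Python exceptions. A sketch (check_vcf is a hypothetical helper, not part of vcfnp):

```python
import os
import tempfile

def check_vcf(path):
    """Hypothetical pre-check (not part of vcfnp): raise IOError with a
    clear message instead of letting the C++ parser abort on bad input."""
    if not os.path.isfile(path):
        raise IOError('not a regular file: %r' % path)
    with open(path, 'rb') as f:
        first = f.readline()
    if not first.startswith(b'##fileformat=VCF'):
        raise IOError('no VCF header: %r' % path)

# Reproduce the first test case: a plain text file with no VCF header.
tmp = os.path.join(tempfile.mkdtemp(), 'test.txt')
with open(tmp, 'wb') as f:
    f.write(b'hey\n')
try:
    check_vcf(tmp)
    caught = None
except IOError as e:
    caught = str(e)
print(caught)
```

The directory case (`vcfnp.variants(".")`) would similarly fail the isfile test rather than aborting in basic_string::substr.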

calldata_2d

Proposed to implement a calldata_2d() convenience function which is effectively the composition of calldata() and view2d().

auto cache

Proposed to implement a cache keyword argument to variants(), calldata() and calldata_2d() which, if True, will automatically save any built numpy arrays to a .npy file, using a filename convention based on the VCF filename plus a suffix, perhaps within a dedicated directory.

N.B., need to incorporate region string if given.
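The convention could be sketched like this (cache_path and load_or_build are hypothetical names; the build callable stands in for the actual parser):

```python
import os
import tempfile
import numpy as np

def cache_path(vcf_filename, array_type, region=None, cachedir=None):
    """Hypothetical filename convention: <vcf>.<type>[.<region>].npy,
    optionally inside a separate cache directory."""
    suffix = '.%s' % array_type
    if region is not None:
        # incorporate the region string if given
        suffix += '.%s' % region.replace(':', '_').replace('-', '_')
    fn = os.path.basename(vcf_filename) + suffix + '.npy'
    return os.path.join(cachedir or os.path.dirname(vcf_filename), fn)

def load_or_build(vcf_filename, array_type, build, **kwargs):
    # build is a callable returning a numpy array (stands in for the parser)
    fn = cache_path(vcf_filename, array_type, **kwargs)
    if os.path.exists(fn):
        return np.load(fn)
    a = build()
    np.save(fn, a)
    return a

d = tempfile.mkdtemp()
vcf = os.path.join(d, 'sample.vcf')
a1 = load_or_build(vcf, 'variants', lambda: np.arange(3))
a2 = load_or_build(vcf, 'variants', lambda: np.zeros(99))  # hits the cache
```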

script for loading hdf5

Add a command line script for loading data from one or more npy splits into an HDF5 file.

combine info() and variants()

Generally I've found there is no utility in separating the INFO fields from the other fixed fields into two arrays; it would often be more useful to have all variant-related fields in a single structured array/table. E.g., if you wanted to use the PyTables in-kernel query engine to express a query for variants based on QUAL or FILTER or num_alleles and any INFO field, you could only do that with a single structured array. Proposed to deprecate info() and extend variants() to also include the INFO fields.

This is a non-backwards compatible change so should imply a major version bump.
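For illustration, numpy can already merge the two structured arrays into one flat table (the field names and data here are hypothetical stand-ins for the arrays that variants() and info() return):

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical stand-ins for the fixed fields and INFO arrays.
fixed = np.array([(b'chr1', 100, 30.0)],
                 dtype=[('CHROM', 'S8'), ('POS', 'i4'), ('QUAL', 'f4')])
info = np.array([(12, 0.5)], dtype=[('DP', 'i4'), ('AF', 'f4')])

# One flat structured array, as proposed for the extended variants().
combined = rfn.merge_arrays([fixed, info], flatten=True)
print(combined.dtype.names)
```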

Fix calldata fields such that all dimensions are included in shape

Currently fields like 'genotype', where there is a third dimension, get stored in the HDF5 file as a 2-dimensional shape with a compound dtype. This complicates some reading operations. Proposed to change the vcfnpy2hdf5 script so that shape and dtype are properly preserved between the input numpy arrays and the HDF5 file.
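The difference can be sketched in plain numpy (the field name and shapes are hypothetical): the ploidy dimension is hidden inside the dtype rather than appearing in the array shape.

```python
import numpy as np

# Genotypes for 3 variants x 4 samples, with the length-2 ploidy
# dimension hidden inside a compound dtype (hypothetical data).
a = np.zeros((3, 4), dtype=[('gt', 'i1', (2,))])
print(a.shape)         # (3, 4) -- ploidy dimension not visible

# Extracting the field exposes all three dimensions in the shape,
# which is the layout proposed for the HDF5 datasets.
g = a['gt']
print(g.shape)         # (3, 4, 2)
```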

user-specified cachedir

Add a cachedir option to variants(), calldata() and calldata_2d() to manually override the cache directory.

git clone error

Might be a problem on my end, but in case it isn't...
It looks like there is something odd with the way vcflib is linked as a submodule in vcfnp.

$ git clone -v --recursive git://github.com/alimanfoo/vcfnp.git 
...normal stuff...
Submodule 'vcflib' (git@github.com:alimanfoo/vcflib.git) registered for path 'vcflib'
Cloning into 'vcflib'...
Permission denied (publickey).
fatal: Could not read from remote repository.

If I clone (recursively) your repo of vcflib by hand within the vcfnp directory, it seems to work fine. So I'm thinking that something is wrong with the submodule link.

bug generating hdf5 from vcf

There is a bug when generating HDF5 from VCF...
When vcflib (via htslib) gets a VariantCallFile iterator for a region, it will include variants which have a position before the region start if their length 'overlaps' the region. E.g., a length-5 deletion at position 97 will be included in the output for region chr:100.

This can cause duplicates when .npy files for regions are collected into an HDF5 file. It can even cause non-monotonic POS values in cases like:
98 length-5 deletion
99 SNP
101 SNP
if the regions were chr:1-100 and chr:101-200.

The behavior of vcflib isn't really wrong, so these cases should be handled at a higher level. The simple fix is to just trim off any variants which have a POS value < the region start. I'm not sure where the best place to put that test would be, though. I'll poke around some more myself, though I suspect you probably have an idea where to put the fix.
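The trimming itself is a one-line boolean mask, wherever it ends up living (example POS values taken from the case above):

```python
import numpy as np

# POS values returned for region chr:101-200, where the first variant
# is a deletion starting before the region but overlapping it.
pos = np.array([98, 101, 150])
region_start = 101

# Trim variants that start before the region, so that concatenating
# per-region .npy files yields no duplicates and monotonic POS.
trimmed = pos[pos >= region_start]
print(trimmed)  # [101 150]
```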

Rtfd

Put the docs on Read the Docs.

EFF defaults

Add a default transformers capability and include EFF, so that EFF fields get parsed by default when using the command line utilities.

flatten FILTER

Add an option (on by default) to flatten the FILTER field into multiple fields prefixed with FILTER_, rather than using a structured datatype, so that vcfnp.variants() returns something closer to a flat table. E.g., this would simplify querying via PyTables.
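A sketch of the flattening (the filter names and data are hypothetical): each filter in the structured FILTER sub-array becomes a separately named boolean array.

```python
import numpy as np

# FILTER as a structured sub-array of booleans (hypothetical data).
filt = np.array([(True, False), (False, True)],
                dtype=[('PASS', '?'), ('LowQual', '?')])

# Flatten into separately named top-level fields, one per filter.
flat = {'FILTER_' + name: filt[name] for name in filt.dtype.names}
print(sorted(flat))  # ['FILTER_LowQual', 'FILTER_PASS']
```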

Flatten EFF/ANN annotations?

Would be more convenient if all sub-fields within SNPEFF annotations could be broken out into separate arrays, rather than a single structured array?

EFF Amino_Acid_Change field being truncated

The EFF Amino_Acid_Change field is truncated on long transcripts (>10000 residues) since its dtype is a6.

Changing EFF_DEFAULT_DTYPE at the top of eff.py to something larger (I'd suggest a8, to be safe and even) should fix the problem, at least for SNPs. I'm checking now but will probably sleep before it is done.

I should probably be using ANN, but having the single-letter AA residue to residue mapping right there is just too convenient for my current task.
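The truncation is just numpy's silent behavior when assigning into a fixed-width bytes field; a 7-character change string doesn't survive a6 (the value below is a made-up example):

```python
import numpy as np

aa_change = b'R12345C'  # 7 characters: too long for a6

# Current dtype: numpy silently truncates on assignment.
a = np.zeros(1, dtype=[('Amino_Acid_Change', 'S6')])
a['Amino_Acid_Change'] = aa_change
print(a['Amino_Acid_Change'][0])  # b'R12345'

# Wider dtype, as suggested, preserves the value.
b = np.zeros(1, dtype=[('Amino_Acid_Change', 'S8')])
b['Amino_Acid_Change'] = aa_change
print(b['Amino_Acid_Change'][0])  # b'R12345C'
```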

workaround badly declared info types

GATK's VCF output declares MQ0Fraction as type Integer. The right thing to do is fix the VCF, but for convenience it is proposed to add an option to vcfnp to treat all numbers as floats when parsing.

Move parameters from comments to docstrings

Currently it is impossible to infer what parameters each function takes. A good idea would be to move the existing comments into the docstrings.

If this is desirable, I can make a PR.

full vcf to hdf5...

This is probably not much of a priority, but it is bugging me so I'm going to ask...

Is there a more streamlined way of generating HDF5 from VCF without going through the .npy intermediate files? If not, is it likely to be difficult?

Also, unless I'm confused, you normally run vcf2npy twice (variants and calldata_2d) before collecting the output into an HDF5 file. Is there a reason why vcf2npy cannot output both in a single pass? The code isn't structured to make that easy, but I haven't dug deeply enough to tell whether it is doing fundamentally different things in the variants pass and the calldata_2d pass. (Though variants is about 5x faster than calldata_2d.)

add more default INFO dtypes

The following are probably all sensible defaults for GATK INFO annotations:

        'ABHet': 'f2',
        'ABHom': 'f2',
        'AC': 'u2',
        'AF': 'f2',
        'AN': 'u2',
        'BaseQRankSum': 'f2',
        'DP': 'u4',
        'Dels': 'f2',
        'FS': 'f2',
        'HRun': 'u1',
        'HaplotypeScore': 'f2',
        'InbreedingCoeff': 'f2',
        'IndelType': 'a12',
        'MQ': 'f2',
        'MQ0Fraction': 'f2',
        'MQRankSum': 'f2',
        'OND': 'f2',
        'QD': 'f2',
        'RPA': 'u2',
        'RU': 'a12',
        'ReadPosRankSum': 'f2',
        'STR': 'b1'
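For reference, a mapping like this plugs straight into a numpy structured dtype (a subset is shown; how vcfnp merges it into its own defaults is left open):

```python
import numpy as np

# Subset of the proposed GATK INFO defaults from above.
DEFAULT_INFO_DTYPES = {
    'AC': 'u2',
    'AF': 'f2',
    'DP': 'u4',
    'MQ': 'f2',
    'STR': 'b1',
}

# Assemble a structured dtype from the (name, dtype) pairs.
dt = np.dtype(sorted(DEFAULT_INFO_DTYPES.items()))
print(dt.names)
```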

Remove numpy build dependency

Numpy is currently required for compilation only in a very minimal way that probably doesn't improve performance much, if at all; propose to remove it as a build dependency.

Reorganise variants as group with child dataset per field

Proposed to change the vcfnpy2hdf5 script's default organisation for the variants data, such that variants becomes a group and each field becomes a child dataset. (The current arrangement is variants as a single dataset with a compound dtype.) The group arrangement is faster for retrieval, and also faster for querying via numexpr.
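A minimal h5py sketch of the proposed layout (the field names and data are hypothetical; this is not the vcfnpy2hdf5 implementation):

```python
import os
import tempfile
import numpy as np
import h5py

# Hypothetical variants data as separate per-field arrays.
fields = {
    'POS': np.array([100, 200, 300], dtype='i4'),
    'QUAL': np.array([30.0, 99.0, 10.0], dtype='f4'),
}

fn = os.path.join(tempfile.mkdtemp(), 'variants.h5')
with h5py.File(fn, 'w') as h5:
    grp = h5.create_group('variants')        # a group, not a compound dataset
    for name, data in fields.items():
        grp.create_dataset(name, data=data)  # one child dataset per field

with h5py.File(fn, 'r') as h5:
    pos = h5['variants/POS'][:]              # read a single field directly
```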

vcfnpy2hdf5 add chunk width option

Currently all 2D datasets are created with chunk width 1 (i.e., a single column), which is optimal for accessing data on a single sample but performs badly in other access scenarios (e.g., reading the whole dataset, or reading contiguous rows). Benchmarking shows that a wider chunk improves all-round performance with little cost when accessing data on a single sample:

http://nbviewer.ipython.org/gist/alimanfoo/67fdcf58e364763fd0b6/benchmark_hdf5_blz.ipynb

Proposed to add an option to the vcfnpy2hdf5 script to allow configuration of chunk width.

Genotype packed

Add call data field genotype_packed which packs diploid genotype into single byte as per anhima.
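One plausible packing scheme is a nibble per allele (this may not be anhima's exact encoding; here missing alleles, coded -1, map to 15, so a fully missing call packs to 255):

```python
import numpy as np

# Genotypes: (variants, samples, ploidy); -1 encodes a missing allele.
g = np.array([[[0, 0], [0, 1]],
              [[1, 1], [-1, -1]]], dtype='i1')

# Assumed packing: first allele in the high nibble, second in the low
# nibble, missing mapped to 15. Supports allele values 0-14.
alleles = g.astype('u1')
alleles[g < 0] = 15
genotype_packed = (alleles[..., 0] << 4) | alleles[..., 1]
print(genotype_packed)
```

So 0/0 packs to 0, 0/1 to 1, 1/1 to 17 and ./. to 255 under this scheme.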

genotype_ac

Idea: add a derived field "genotype_ac" to output the per-genotype allele count. E.g., a diploid hom ref 0/0 call becomes (2, 0), a diploid het 0/1 call becomes (1, 1), a triploid het call 0/0/1 becomes (2, 1), etc. This could provide a pathway for analysis of aneuploid data, e.g., Leishmania or cancer, where different individuals have different ploidy in different genome regions.
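A sketch of the derivation using np.bincount per call, handling mixed ploidy (the example calls mirror the ones above; -1 would encode missing alleles):

```python
import numpy as np

# Genotype calls with differing ploidy per sample (hypothetical data).
calls = [
    [0, 0],      # diploid hom ref  -> (2, 0)
    [0, 1],      # diploid het      -> (1, 1)
    [0, 0, 1],   # triploid het     -> (2, 1)
]

n_alleles = 2
genotype_ac = np.array([
    # count occurrences of each allele, ignoring missing (-1) alleles
    np.bincount([a for a in call if a >= 0], minlength=n_alleles)
    for call in calls
])
print(genotype_ac)
```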

Compress files created by vcf2npy

The files created by vcf2npy are currently uncompressed .npy files, which can take up a lot of space on large builds. Add an option to save as compressed .npz files.
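numpy already provides the compressed format; a sketch of the two outputs side by side (the array here is dummy data, which compresses unrealistically well, but it shows the mechanics):

```python
import os
import tempfile
import numpy as np

d = tempfile.mkdtemp()
variants = np.zeros(100000, dtype=[('POS', 'i4'), ('QUAL', 'f4')])

# Uncompressed, as vcf2npy currently writes:
npy = os.path.join(d, 'variants.npy')
np.save(npy, variants)

# Compressed alternative (named arrays inside a zip archive):
npz = os.path.join(d, 'variants.npz')
np.savez_compressed(npz, variants=variants)

smaller = os.path.getsize(npz) < os.path.getsize(npy)
loaded = np.load(npz)['variants']  # load back by name
```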

Wish: support for writing in vcflib's binding

The reason for this request is to enable filtering and manipulation like you can do with PyVCF, but with a much faster (numpy + C++ backend) interface.

I wrote "in vcflib's binding" rather than in vcfnp itself because I think that pushing data from numpy back to VCF wouldn't be exactly trivial, especially considering that vcfnp parses the various bits of a VCF file independently.
