alimanfoo / vcfnp Goto Github PK

View Code? Open in Web Editor NEW

31.0 31.0 8.0 4.06 MB

Load numpy arrays and HDF5 files from VCF (variant call format)

License: MIT License

Python 85.59% Jupyter Notebook 12.00% PowerShell 1.52% Batchfile 0.89%

vcfnp's People

Contributors

Stargazers

Watchers

Forkers

podpearson jdiez pombredanne mattbawn benjeffery hardingnj travc drorhilman

vcfnp's Issues

Use (optionally) pandas instead of view2d

Pandas is very useful to display tabular data with mixed dtypes. It might be nice to be optionally used to view also heterogeneous dtyped data like view2d uses for homogeneous data.

The disadvantage is that you can't create a view: a new object would need to be created.

Remove version logging from scripts

Currently scripts (e.g., vcf2npy) log dependency versions in a fairly ugly way that also comes out when calling --help.

qsub_vcf2npy breaks for -tc or any similar qsub option

Get's broken apart into -t c by argparse.

scripts for parallelizing conversion to npy

Add command line scripts to facilitate conversion of a VCF file to npy files optionally split by genome region.

vcfnp.calldata bug when differing number of FORMAT fields

It seems that in cases where not all FORMAT fields are populated, default fill values override the actual values in the fields. Here is a small vcf file and commands that hopefully show the problem. I've tried debugging, but I think my python skills are not yet up to it...

##fileformat=VCFv4.1
##INFO=
##FORMAT=
##FORMAT=
##FORMAT=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT test1 test2 test3 test4
chr1 1 . A G . PASS DP=2 GT:AD 0:1,0 .:1,0 0:0,0 .:0,0
chr1 2 . A G . PASS DP=2 GT:AD:ZZ 0:1,0:dummy 0:1,0 0:0,0 .:0,0
chr1 3 . A G . PASS DP=2 GT:AD:ZZ 0:1,0:dummy .:1,0 0:0,0 .:0,0

import vcfnp
calls = vcfnp.calldata('test.vcf', ploidy=1, progress=20000)
vcfnp.view2d(calls)['AD']

array([[1, 1, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0]], dtype=uint16)

As you can see in the above, the allele depths of 1,0 for test2 sample are not pulled in.

I believe the above vcf file is on spec. It states in the vcf spec that "Trailing fields can be dropped (with the exception of the GT field, which should always be present if specified in the FORMAT field)". This issue arose when looking at the output of GATK's CombineVariants with the option -genotypeMergeOptions UNIQUIFY

This example also highlights a second problem, that only the first AD gets pulled in. This can be overridden by using arities=dict(GT=1, AD=2) to get the correct result. I've also found I get the correct result if putting Number=G in the AD FORMAT header line, but I think the correct thing here is A (there should be a number for each allele, not for each genotype, though of course in the haploid case these amount to the same thing)

vcfnp2hdf5 add vcf header definitions as attributes on datasets

Add INFO, FILTER and FORMAT field descriptions from the VCF header to the datasets as attributes (if possible).

Missing genotype but non-missing calldata

In some circumstances calldata() returns data where the genotype is missing but other fields have values (but should have been filled).

It looks like the sample data is not getting cleared out between parsing variants, so sample data can get carried over from one variant to the next.

vcfnpy2blz

Add a script similar to vcfnpy2hdf5 but loading data into a BLZ barray.

Convert C++ exceptions into Python exceptions

So right now, difficulties parsing VCF files raise unhandled C++ exceptions that eventually lead to program termination. Ideally, we could wrap catch some of these exceptions so that the outcome is a python exception (IOError?) rather than a full stop. Below are two test cases that I've encountered:

import vcfnp

# echo "hey" > test.txt
v = vcfnp.variants("./test.txt")

[vcfnp] 2015-09-03 09:58:24.050742 :: caching is disabled
[vcfnp] 2015-09-03 09:58:24.050779 :: building array
error: no VCF header


import vcfnp

v = vcfnp.variants(".")
[vcfnp] 2015-09-03 09:57:32.435456 :: caching is disabled
[vcfnp] 2015-09-03 09:57:32.435487 :: building array
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr
~/opt/bin/python.app: line 3: 84448 Abort trap: 6

Gracefully handle VCFs with duplicate field definitions in header

Currently vcfnp fails to parse some VCF files merged from the GATK and annotated with snpEff. The error is simply "two fields with the same name", but it's not very useful, as it does not tell which field is duplicated.

Is this an issue upstream in vcflib?

Local parallelisation for vcf2npy

Add a script local_vcf2npy or similar name which is like qsub_vcf2npy but uses multiple local processes for parallelisation.

calldata_2d

Proposed to implement a calldata_2d() convenience function which is effectively composition of calldata() and view2d().

if no variants match region, defaults to all variants?

auto cache

Proposed to implement a cache keyword argument to variants(), calldata() and calldata_2d() which if True will automatically save any built numpy arrays to a .npy file, using a filename convention based on the vcf filename but with some suffix, maybe within a directory.

N.B., need to incorporate region string if given.

script for loading hdf5

Add a command line script for loading data from one or more npy splits into an HDF5 file.

combine info() and variants()

Generally I've found there is no utility in separating the INFO from the other fixed fields into two arrays, it would often be more useful to have all variant-related fields in a single structured array/table. E.g., if you wanted to use PyTables in-kernel query engine to express a query for variants based on QUAL or FILTER or num_alleles and any INFO field you could only do that if you have a single structured array. Proposed to deprecate info() and extend variants() to also include INFO fields.

This is a non-backwards compatible change so should imply a major version bump.

Fix calldata fields such that all dimensions are included in shape

Currently fields like 'genotype' where there is a third dimension get stored in the h5 file as a 2 dimensional shape with a compound dtype. This complicates some reading operations. Proposed to change the vcfnpy2hdf5 script such that shape and dtype are properly preserved across input numpy arrays and h5 file.

user-specified cachedir

Add cachedir option to variants, calldata, calldata_2d to manually override cache directory.

git clone error

Might be a problem on my end, but in case it isn't...
It looks like there is something odd with the way vcflib is linked as a submodule in vcfnp

$ git clone -v --recursive git://github.com/alimanfoo/vcfnp.git 
...normal stuff...
Submodule 'vcflib' ([email protected]:alimanfoo/vcflib.git) registered for path 'vcflib'
Cloning into 'vcflib'...
Permission denied (publickey).
fatal: Could not read from remote repository.

If I clone (recursively) your repo of vcflib by hand within the vcfnp directory, it seem to work fine. So I'm thinking that something is wrong with the submodule link.

bug generating hdf5 from vcf

There is a bug when generating hdf5 from vcf...
When vcflib (via htslib) gets a VariantCallFile iterator for a region, it will include variants which have a position before the region start if their length 'overlaps' the region. Eg: a len 5 del at pos 97 will be included in the output for region chr:100.

This can cause duplicates when .npy files for regions are collected into an hdf5. It can even cause non-monotonic POS values in cases like:
98 len 5 del
99 snp
101 snp
if the regions were chr:1-100 and chr:101-200

The behavior of vcflib isn't really wrong, so these cases should be handled at a higher level. The simple way is to just trim off any variants which have a POS value < the region start. I'm not sure where would be best place to put that test though. I'll poke around some more myself, though I suspect you probably have an idea where to put the fix.

Rtfd

Put docs on rtfd.

support for custom parsers/transformers of INFO field values

Propose support for custom transformer functions which can be used to transform or parse values from a given INFO field prior to being loaded into an array. (Motivating use case is SNPEff annotations.)

variants() throws ValueError reporting two fields with the same name?

add allele lengths custom field to variants array

Proposed to add allele lengths custom field to variants array to query allele lengths even when ref and alt fields have been truncated by dtype selection.

EFF defaults

Add a default transformers capability and include EFF so that EFF fields get parsed by default when using command line utilities.

fix bug for string fields with multiple values

Bug where string fields have multiple values.

flatten FILTER

Add option (default) to flatten the FILTER field into multiple fields prefixed by FILTER_ rather than a structured datatype, to make it such that vcfnp.variants() returns something that is closer to a straight table. E.g., would simplify querying via pytables.

Support PY3

vcfnpy2hdf5 increase default chunk size to 1Mb

From benchmarking ~1Mb seems a better default than 100k, change the default.

Change samples to dataset

Proposed to change samples from attribute to dataset so it can be used as a dimension scale.

titv function

Propose to implement titv convenience function.

Flatten EFF/ANN annotations?

Would be more convenient if all sub-fields within SNPEFF annotations could be broken out into separate arrays, rather than a single structured array?

EFF Amino_Acid_Change field being truncated

EFF Amino_Acid_Change field is truncated on long transcripts (>10000 residues) since it is dtype a6.

Changing it at the top of eff.py where you define EFF_DEFAULT_DTYPE to something larger (I'd suggest a8 to be safe and even) should fix the problem, at least for SNPs. I'm checking now but will probably sleep before it is done.

I should probably be using ANN, but having the single-letter AA residue to residue mapping right there is just too convenient for my current task.

workaround badly declared info types

GATK's VCF output declares MQ0Fraction as type Integer. The right thing to do is fix the VCF, but for convenience proposed to add an option to treat all numbers as floats in vcfnp when parsing.

fix command line script incorrect use of nargs

Move parameters from comments to docstrings

Currently it is impossible to infer what parameters each function takes. A good idea would be move the existing comments to the docstring.

If this is desirable, I can make a PR.

full vcf to hdf5...

This is probably not much of a priority, but it is bugging me so I'm going to ask...

Is there are more streamlined way of generating hdf5 from vcf without going through the .npy intermediate files? If not, is it likely to be difficult?

Also, unless I'm being confused, you're normally running vcf2npy twice (variants and caldata_2d) before collecting that output into an hdf5. Is there a reason why vcf2npy cannot output both in a single pass? The code isn't structured to make that easy, but I haven't dug deeply enough to tell if it is doing fundamentally different things in the variant pass and the calldata_2d pass. (Though variant is about 5x faster than calldata_2d.)

add more default INFO dtypes

The following are probably all sensible defaults for GATK INFO annotations:

        'ABHet': 'f2',
        'ABHom': 'f2',
        'AC': 'u2',
        'AF': 'f2',
        'AN': 'u2',
        'BaseQRankSum': 'f2',
        'DP': 'u4',
        'Dels': 'f2',
        'FS': 'f2',
        'HRun': 'u1',
        'HaplotypeScore': 'f2',
        'InbreedingCoeff': 'f2',
        'IndelType': 'a12',
        'MQ': 'f2',
        'MQ0Fraction': 'f2',
        'MQRankSum': 'f2',
        'OND': 'f2',
        'QD': 'f2',
        'RPA': 'u2',
        'RU': 'a12',
        'ReadPosRankSum': 'f2',
        'STR': 'b1'

Remove numpy build dependency

Numpy is currently only required for compilation in a very minimal way that probably doesn't improve performance much if at all, propose to remove this as a build dependency.

Reorganise variants as group with child dataset per field

Proposed to change the vcfnpy2hdf5 script default organisation for the variants data such that variants becomes a group and each field becomes a child dataset. (The current arrangement is variants as a dataset with a compound dtype.) This arrangement is faster for retrieval, and also faster for querying via numexpr.

log last variant position with progress

Log last variant position with progress.

vcfnpy2hdf5 add chunk width option

Currently all 2D datasets are created with chunk width 1 (i.e., single column) which is optimal for accessing data on a single sample but performs badly in other access scenarios (e.g., read whole dataset, read contiguous rows). Benchmarking shows that a wider chunk improves all-round performance with little cost when accessing data on a single sample:

http://nbviewer.ipython.org/gist/alimanfoo/67fdcf58e364763fd0b6/benchmark_hdf5_blz.ipynb

Proposed to add an option to the vcfnpy2hdf5 script to allow configuration of chunk width.

Genotype packed

Add call data field genotype_packed which packs diploid genotype into single byte as per anhima.

genotype_ac

Idea: add a derived field "genotype_ac" to output the per-genotype allele count. E.g., a diploid hom ref 0/0 call becomes (2, 0), a diploid het 0/1 call becomes (1, 1), a triploid het call 0/0/1 becomes (2, 1), etc. This could provide a pathway for analysis of aneuploid data, e.g., Leishmania or cancer, where different individuals have different ploidy in different genome regions.