alimanfoo / vcfnp
Load numpy arrays and HDF5 files from VCF (variant call format)
License: MIT License
Pandas is very useful for displaying tabular data with mixed dtypes. It might be nice to optionally use it to view heterogeneous-dtype data too, in the way view2d is used for homogeneous data.
The disadvantage is that you can't create a view: a new object would need to be created.
Currently scripts (e.g., vcf2npy) log dependency versions in a fairly ugly way that also comes out when calling --help.
Gets broken apart into -t c by argparse.
Add command line scripts to facilitate conversion of a VCF file to npy files optionally split by genome region.
It seems that in cases where not all FORMAT fields are populated, default fill values override the actual values in the fields. Here is a small vcf file and commands that hopefully show the problem. I've tried debugging, but I think my python skills are not yet up to it...
##fileformat=VCFv4.1
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=A,Type=Integer,Description="Allele depth">
##FORMAT=<ID=ZZ,Number=1,Type=String,Description="Dummy field">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT test1 test2 test3 test4
chr1 1 . A G . PASS DP=2 GT:AD 0:1,0 .:1,0 0:0,0 .:0,0
chr1 2 . A G . PASS DP=2 GT:AD:ZZ 0:1,0:dummy 0:1,0 0:0,0 .:0,0
chr1 3 . A G . PASS DP=2 GT:AD:ZZ 0:1,0:dummy .:1,0 0:0,0 .:0,0
import vcfnp
calls = vcfnp.calldata('test.vcf', ploidy=1, progress=20000)
vcfnp.view2d(calls)['AD']
array([[1, 1, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0]], dtype=uint16)
As you can see above, the allele depths of 1,0 for the test2 sample are not pulled in.
I believe the above vcf file is on spec. The VCF spec states that "Trailing fields can be dropped (with the exception of the GT field, which should always be present if specified in the FORMAT field)". This issue arose when looking at the output of GATK's CombineVariants with the option -genotypeMergeOptions UNIQUIFY.
This example also highlights a second problem: only the first AD value gets pulled in. This can be overridden by using arities=dict(GT=1, AD=2) to get the correct result. I've also found I get the correct result if I put Number=G in the AD FORMAT header line, but I think the correct thing here is A (there should be a number for each allele, not for each genotype, though of course in the haploid case these amount to the same thing).
Add INFO, FILTER and FORMAT field descriptions from the VCF header to the datasets as attributes (if possible).
In some circumstances calldata() returns data where the genotype is missing but other fields have values (but should have been filled).
It looks like the sample data is not getting cleared out between parsing variants, so sample data can get carried over from one variant to the next.
Add a script similar to vcfnpy2hdf5 but loading data into a BLZ barray.
So right now, difficulties parsing VCF files raise unhandled C++ exceptions that eventually lead to program termination. Ideally, we could catch some of these exceptions so that the outcome is a Python exception (IOError?) rather than a full stop. Below are two test cases that I've encountered:
import vcfnp
# echo "hey" > test.txt
v = vcfnp.variants("./test.txt")
[vcfnp] 2015-09-03 09:58:24.050742 :: caching is disabled
[vcfnp] 2015-09-03 09:58:24.050779 :: building array
error: no VCF header
import vcfnp
v = vcfnp.variants(".")
[vcfnp] 2015-09-03 09:57:32.435456 :: caching is disabled
[vcfnp] 2015-09-03 09:57:32.435487 :: building array
terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::substr
~/opt/bin/python.app: line 3: 84448 Abort trap: 6
Currently vcfnp fails to parse some VCF files merged from the GATK and annotated with snpEff. The error is simply "two fields with the same name", but it's not very useful, as it does not tell which field is duplicated.
Is this an issue upstream in vcflib?
Add a script local_vcf2npy or similar name which is like qsub_vcf2npy but uses multiple local processes for parallelisation.
Proposed to implement a calldata_2d() convenience function which is effectively composition of calldata() and view2d().
Proposed to implement a cache keyword argument to variants(), calldata() and calldata_2d() which if True will automatically save any built numpy arrays to a .npy file, using a filename convention based on the vcf filename but with some suffix, maybe within a directory.
N.B., need to incorporate region string if given.
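As a sketch of one possible naming convention for such a cache (the directory name, suffix scheme and helper function are hypothetical, not vcfnp's actual behaviour):

```python
import os

def cache_path(vcf_filename, array_type, region=None):
    """Build a cache filename for a built array: a .vcfnp_cache
    directory alongside the VCF, with the array type and optional
    region string in the file name."""
    base = os.path.dirname(vcf_filename) or '.'
    name = os.path.basename(vcf_filename)
    suffix = array_type if region is None else '%s.%s' % (array_type, region)
    return os.path.join(base, name + '.vcfnp_cache', suffix + '.npy')

p = cache_path('/data/sample.vcf', 'variants', region='chr1:1-100')
# p -> '/data/sample.vcf.vcfnp_cache/variants.chr1:1-100.npy'
```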
Add a command line script for loading data from one or more npy splits into an HDF5 file.
Generally I've found there is no utility in separating the INFO from the other fixed fields into two arrays, it would often be more useful to have all variant-related fields in a single structured array/table. E.g., if you wanted to use PyTables in-kernel query engine to express a query for variants based on QUAL or FILTER or num_alleles and any INFO field you could only do that if you have a single structured array. Proposed to deprecate info() and extend variants() to also include INFO fields.
This is a non-backwards compatible change so should imply a major version bump.
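The motivation can be sketched with a toy structured array standing in for the merged variants table (the field values here are made up):

```python
import numpy as np

# Toy merged variants array: fixed fields and an INFO field (DP)
# side by side in one structured array.
variants = np.array(
    [(30.0, b'PASS', 2, 12),
     (10.0, b'LowQual', 2, 3),
     (55.0, b'PASS', 3, 40)],
    dtype=[('QUAL', 'f4'), ('FILTER', 'S8'),
           ('num_alleles', 'u1'), ('DP', 'u4')])

# With everything in one array, a combined query over fixed fields
# and INFO fields is a single expression.
mask = ((variants['QUAL'] > 20)
        & (variants['FILTER'] == b'PASS')
        & (variants['DP'] > 10))
selected = variants[mask]
# selected has two rows, with DP values 12 and 40
```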
Currently fields like 'genotype' where there is a third dimension get stored in the h5 file as a 2 dimensional shape with a compound dtype. This complicates some reading operations. Proposed to change the vcfnpy2hdf5 script such that shape and dtype are properly preserved across input numpy arrays and h5 file.
Add cachedir option to variants, calldata, calldata_2d to manually override cache directory.
Might be a problem on my end, but in case it isn't...
It looks like there is something odd with the way vcflib is linked as a submodule in vcfnp:
$ git clone -v --recursive git://github.com/alimanfoo/vcfnp.git
...normal stuff...
Submodule 'vcflib' (git@github.com:alimanfoo/vcflib.git) registered for path 'vcflib'
Cloning into 'vcflib'...
Permission denied (publickey).
fatal: Could not read from remote repository.
If I clone (recursively) your repo of vcflib by hand within the vcfnp directory, it seems to work fine. So I'm thinking that something is wrong with the submodule link.
There is a bug when generating hdf5 from vcf...
When vcflib (via htslib) gets a VariantCallFile iterator for a region, it will include variants which have a position before the region start if their length 'overlaps' the region. Eg: a len 5 del at pos 97 will be included in the output for region chr:100.
This can cause duplicates when .npy files for regions are collected into an hdf5. It can even cause non-monotonic POS values in cases like:
98 len 5 del
99 snp
101 snp
if the regions were chr:1-100 and chr:101-200
The behavior of vcflib isn't really wrong, so these cases should be handled at a higher level. The simple way is to just trim off any variants which have a POS value < the region start. I'm not sure where would be best place to put that test though. I'll poke around some more myself, though I suspect you probably have an idea where to put the fix.
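A minimal sketch of the proposed trimming step, assuming a structured variants array with a 'POS' field as returned by vcfnp.variants() (the helper name is hypothetical):

```python
import numpy as np

def trim_region(variants, region_start):
    """Drop records whose POS falls before the region start, to avoid
    duplicating upstream deletions that vcflib includes because their
    length overlaps the region."""
    return variants[variants['POS'] >= region_start]

# Mirrors the case above: the len-5 del at POS 98 also shows up when
# iterating region chr:101-200, duplicating the record already
# captured for chr:1-100.
v = np.array([(98,), (101,), (150,)], dtype=[('POS', 'i4')])
trimmed = trim_region(v, 101)
# trimmed['POS'] -> [101, 150]
```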
Put docs on rtfd.
Propose support for custom transformer functions which can be used to transform or parse values from a given INFO field prior to being loaded into an array. (Motivating use case is SNPEff annotations.)
Proposed to add allele lengths custom field to variants array to query allele lengths even when ref and alt fields have been truncated by dtype selection.
Add a default transformers capability and include EFF so that EFF fields get parsed by default when using command line utilities.
Bug where string fields have multiple values.
Add option (default) to flatten the FILTER field into multiple fields prefixed by FILTER_ rather than a structured datatype, to make it such that vcfnp.variants() returns something that is closer to a straight table. E.g., would simplify querying via pytables.
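A rough sketch of the proposed flattening, using a toy structured FILTER array (the filter names are illustrative):

```python
import numpy as np

# Toy FILTER field with one boolean sub-field per declared filter.
filter_field = np.array([(True, False), (False, True)],
                        dtype=[('PASS', '?'), ('LowQual', '?')])

# Flatten into separate FILTER_-prefixed arrays, one per filter.
flat = {'FILTER_' + name: filter_field[name]
        for name in filter_field.dtype.names}
# flat keys -> 'FILTER_PASS', 'FILTER_LowQual'
```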
From benchmarking, ~1Mb seems a better default than 100k; change the default.
Proposed to change samples from attribute to dataset so it can be used as a dimension scale.
Propose to implement titv convenience function.
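One possible shape for such a function, sketched in pure Python over parallel ref/alt allele lists (biallelic SNPs assumed; the names are hypothetical):

```python
def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    purines = {'A', 'G'}
    pyrimidines = {'C', 'T'}
    return ((ref in purines and alt in purines)
            or (ref in pyrimidines and alt in pyrimidines))

def titv(refs, alts):
    """Transition/transversion ratio over parallel allele lists."""
    ti = sum(is_transition(r, a) for r, a in zip(refs, alts))
    tv = len(refs) - ti
    return ti / tv if tv else float('inf')

ratio = titv(['A', 'C', 'G'], ['G', 'T', 'T'])
# A->G and C->T are transitions, G->T is a transversion, so ratio -> 2.0
```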
Would it be more convenient if all sub-fields within SNPEFF annotations could be broken out into separate arrays, rather than a single structured array?
The EFF Amino_Acid_Change field is truncated on long transcripts (>10000 residues) since its dtype is a6. Changing it at the top of eff.py, where EFF_DEFAULT_DTYPE is defined, to something larger (I'd suggest a8 to be safe and even) should fix the problem, at least for SNPs. I'm checking now but will probably sleep before it is done.
I should probably be using ANN, but having the single-letter AA residue to residue mapping right there is just too convenient for my current task.
GATK's VCF output declares MQ0Fraction as type Integer. The right thing to do is fix the VCF, but for convenience proposed to add an option to treat all numbers as floats in vcfnp when parsing.
Currently it is impossible to infer what parameters each function takes. A good idea would be to move the existing comments to the docstrings.
If this is desirable, I can make a PR.
This is probably not much of a priority, but it is bugging me so I'm going to ask...
Is there a more streamlined way of generating hdf5 from vcf without going through the .npy intermediate files? If not, is it likely to be difficult?
Also, unless I'm confused, you're normally running vcf2npy twice (variants and calldata_2d) before collecting that output into an hdf5. Is there a reason why vcf2npy cannot output both in a single pass? The code isn't structured to make that easy, but I haven't dug deeply enough to tell if it is doing fundamentally different things in the variants pass and the calldata_2d pass. (Though variants is about 5x faster than calldata_2d.)
The following are probably all sensible defaults for GATK INFO annotations:
{
    'ABHet': 'f2',
    'ABHom': 'f2',
    'AC': 'u2',
    'AF': 'f2',
    'AN': 'u2',
    'BaseQRankSum': 'f2',
    'DP': 'u4',
    'Dels': 'f2',
    'FS': 'f2',
    'HRun': 'u1',
    'HaplotypeScore': 'f2',
    'InbreedingCoeff': 'f2',
    'IndelType': 'a12',
    'MQ': 'f2',
    'MQ0Fraction': 'f2',
    'MQRankSum': 'f2',
    'OND': 'f2',
    'QD': 'f2',
    'RPA': 'u2',
    'RU': 'a12',
    'ReadPosRankSum': 'f2',
    'STR': 'b1',
}
Numpy is currently only required for compilation in a very minimal way that probably doesn't improve performance much, if at all. Propose to remove this as a build dependency.
Proposed to change the vcfnpy2hdf5 script default organisation for the variants data such that variants becomes a group and each field becomes a child dataset. (The current arrangement is variants as a dataset with a compound dtype.) This arrangement is faster for retrieval, and also faster for querying via numexpr.
Log last variant position with progress.
Currently all 2D datasets are created with chunk width 1 (i.e., single column) which is optimal for accessing data on a single sample but performs badly in other access scenarios (e.g., read whole dataset, read contiguous rows). Benchmarking shows that a wider chunk improves all-round performance with little cost when accessing data on a single sample:
http://nbviewer.ipython.org/gist/alimanfoo/67fdcf58e364763fd0b6/benchmark_hdf5_blz.ipynb
Proposed to add an option to the vcfnpy2hdf5 script to allow configuration of chunk width.
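A sketch of what a configurable chunk width might look like when creating a dataset with h5py (the dataset name, shape and chunk values are illustrative, not the script's actual defaults):

```python
import numpy as np
import h5py

# Toy 2D calldata: 10000 variants x 100 samples.
data = np.random.randint(0, 50, size=(10000, 100)).astype('u2')

# In-memory file for the sketch; a real run would write to disk.
with h5py.File('calldata.h5', 'w', driver='core',
               backing_store=False) as f:
    # Chunks of 1000 variants x 10 samples, rather than the current
    # single-column (n, 1) chunks.
    dset = f.create_dataset('DP', data=data, chunks=(1000, 10),
                            compression='gzip')
    chunk_shape = dset.chunks
# chunk_shape -> (1000, 10)
```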
Add call data field genotype_packed which packs diploid genotype into single byte as per anhima.
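One possible packing scheme, sketched with numpy; this illustrates the idea but is not necessarily the exact byte layout anhima uses:

```python
import numpy as np

def pack_diploid(genotypes):
    """Pack diploid calls into one byte per call: first allele in the
    high nibble, second in the low nibble; missing alleles (-1) are
    mapped to 0xE. Sketch only, not anhima's definitive layout."""
    g = np.asarray(genotypes)
    a1 = np.where(g[..., 0] < 0, 0xE, g[..., 0]).astype(np.uint8)
    a2 = np.where(g[..., 1] < 0, 0xE, g[..., 1]).astype(np.uint8)
    return (a1 << 4) | a2

# One variant, three samples: hom ref, het, missing.
g = np.array([[[0, 0], [0, 1], [-1, -1]]])
packed = pack_diploid(g)
# packed -> [[0x00, 0x01, 0xEE]]
```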
Idea: add a derived field "genotype_ac" to output the per-genotype allele count. E.g., a diploid hom ref 0/0 call becomes (2, 0), a diploid het 0/1 call becomes (1, 1), a triploid het call 0/0/1 becomes (2, 1), etc. This could provide a pathway for analysis of aneuploid data, e.g., Leishmania or cancer, where different individuals have different ploidy in different genome regions.
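A sketch of how such a derived field could be computed for a single call (the helper name is hypothetical):

```python
import numpy as np

def genotype_ac(genotype, max_allele):
    """Count how many times each allele index appears in one call,
    ignoring missing alleles (-1). E.g., diploid 0/1 -> (1, 1),
    triploid 0/0/1 -> (2, 1)."""
    g = np.asarray(genotype)
    return np.bincount(g[g >= 0], minlength=max_allele + 1)

ac = genotype_ac([0, 0, 1], max_allele=1)  # triploid het 0/0/1
# ac -> [2, 1]
```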
The files created by vcf2npy are currently uncompressed .npy files. These can take up a lot of space on large builds. Add an option to save as compressed .npz files.
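For example, numpy's savez_compressed could be used in place of save (a sketch, round-tripping through an in-memory buffer; the array name is illustrative):

```python
import io
import numpy as np

# A toy variants field to save.
variants_pos = np.arange(1, 1001, dtype='u4')

# Save compressed (.npz) instead of an uncompressed .npy file.
buf = io.BytesIO()
np.savez_compressed(buf, POS=variants_pos)

# Load back to check the round trip.
buf.seek(0)
loaded = np.load(buf)
# loaded['POS'] reproduces the original array
```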
If possible, enable manual control of loading of info fields where the header definition is missing, emitting a warning, rather than failing.
Add a script for converting fixed fields or calldata to CSV file.
To update on progress while loading, only output one line per batch.
The reason for this request is to enable filtering and manipulation like you can do with PyVCF, but with a much faster (numpy + C++ backend) interface.
I wrote "in vcflib's binding" rather than in vcfnp itself because I think that pushing data from numpy to VCF wouldn't be exactly trivial, also considering that vcfnp parses the various bits of a VCF file independently.