gqt's People

Contributors

arq5x, brentp, jeromekelleher, nkindlon, ryanlayer

gqt's Issues

Unable to compile gqt from source

I have followed the instructions in the installation section carefully, but whenever I run make, I get the following error:

/bin/ld: /home/azureuser/vcf-browsers/htslib/libhts.a(hfile_s3.o): undefined reference to symbol 'EVP_sha256@@OPENSSL_1_1_0'
/bin/ld: /lib/x86_64-linux-gnu/libcrypto.so.1.1: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
make[1]: *** [Makefile:73: ../bin/read_binary_uints] Error 1
make[1]: Leaving directory '/home/azureuser/vcf-browsers/gqt/src'
make: *** [Makefile:8: all] Error 2

I have deduced that this is a problem with the openssl package, but I cannot figure out how to solve it. I have tried downloading different versions of the openssl library, but nothing seems to work. I am working on Ubuntu 20.04 on an Azure VM.
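
For reference, "DSO missing from command line" usually means the linker needs libcrypto named explicitly, since libhts.a's hfile_s3.o calls EVP_sha256 from OpenSSL. A likely fix (untested on this exact setup) is to append -lcrypto to the library list on the link line in src/Makefile, e.g. changing a trailing

-ldl -lz -lm -pthread

to

-ldl -lz -lm -pthread -lcrypto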

Phase 3 1000G files with complete annotations?

A proper comparison of the size of the WAH index to the original BCF should be based upon a fully annotated BCF. The Phase 1 VCFs had extensive annotations in the INFO field, but the more recent Phase 3 VCFs have minimal annotations. We need to double-check whether fully annotated Phase 3 VCFs do in fact exist. If not, we will need to annotate them ourselves with VEP.

These files (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp//release/20130502/supporting/functional_annotation/) appear to have annotations, but they do not include genotypes (sites only). However, as a compromise, we could "paste" the genotype info back on, assuming the sites match exactly.
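
If the sites do line up, the "paste" could likely be done with bcftools annotate rather than literal pasting; a hedged sketch with hypothetical file names (the sites-only annotation file must be bgzipped and tabix-indexed first):

bcftools annotate \
        -a phase3.sites.annotated.vcf.gz \
        -c INFO \
        -O b -o phase3.genotypes.annotated.bcf \
        phase3.genotypes.bcf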

Caching queries?

Saw your excellent talk at ASHG. I was just wondering if there would be support for query caching (memcached, Redis, etc.) to speed things up even further, to interactive speed, especially since there are certain numbers (e.g., per-variant metadata across the data set) that would not need to be calculated more than once for each version of the database. For instance, shouldn't pulling up all variants with an allele frequency of 10% in a certain 1kG population be able to pull from pre-calculated allele frequencies for that slice?

Creating sample database on disk very slow

It appears that SQLite is flushing to disk quite aggressively, which makes importing large pedigree databases extremely slow. Even on an SSD, the inserts take a very long time to run even though the CPU load is low. Running all the inserts within a single transaction would probably speed things up quite a lot, as sketched below.
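
A minimal sketch (not gqt's actual code) of batching the inserts in one transaction, so SQLite syncs to disk once at COMMIT instead of once per INSERT; the "samples(name)" table is hypothetical:

#include <sqlite3.h>

int bulk_insert(sqlite3 *db, const char **names, int n)
{
    sqlite3_stmt *stmt;
    // One transaction -> one fsync at COMMIT instead of one per row.
    if (sqlite3_exec(db, "BEGIN TRANSACTION;", NULL, NULL, NULL) != SQLITE_OK)
        return -1;
    if (sqlite3_prepare_v2(db, "INSERT INTO samples(name) VALUES (?1);",
                           -1, &stmt, NULL) != SQLITE_OK)
        return -1;
    int i;
    for (i = 0; i < n; i++) {
        sqlite3_bind_text(stmt, 1, names[i], -1, SQLITE_STATIC);
        if (sqlite3_step(stmt) != SQLITE_DONE) {
            sqlite3_finalize(stmt);
            return -1;
        }
        sqlite3_reset(stmt);  // reuse the prepared statement for the next row
    }
    sqlite3_finalize(stmt);
    return sqlite3_exec(db, "COMMIT;", NULL, NULL, NULL) == SQLITE_OK ? 0 : -1;
}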

Can I use gqt for Imputation data generated by IMPUTE2?

Hello Folks,

I have 1.5 TB of imputation data, one file per chromosome. I am usually asked to search for the data for a list of SNPs. Currently, I am using a very naive technique, i.e., fgrep, but it takes several days to complete the search when I have a long list of SNPs (e.g., 400). (FYI, I am doing the search with 64 GB of RAM.) I am just curious whether I can use the gqt software for imputation data. If so, how can I accomplish it? I mean, is there any tutorial available for that?

Thanks for your help.

Best,
Tushar
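
(A possible route, sketched here without testing: bcftools can convert IMPUTE2 .gen/.sample pairs to BCF, after which the normal gqt indexing workflow applies. File names below are hypothetical.)

# convert one chromosome's IMPUTE2 output to BCF, then build the GQT index
bcftools convert --gensample2vcf chr20.gen.gz,chr20.sample -O b -o chr20.bcf
bcftools index chr20.bcf
gqt convert bcf -i chr20.bcf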

make Error in MacOS Sierra

I am compiling gqt on a Mac and ran into the error below. I have installed htslib successfully and added its path in gqt's src/Makefile. Can anyone help me fix this?

$ make
cd src; /Applications/Xcode.app/Contents/Developer/usr/bin/make
flex -oparse_q.yy.c parse_q.l
gcc -c -g -D_FILE_OFFSET_BITS=64 -o ../obj/parse_q.yy.o parse_q.yy.c
-I/Users/xxxx/db/htslib-1.9 -I/Users/xxxx/db/sqlite-amalgamation-3080701
gcc -g -D_FILE_OFFSET_BITS=64 -o ../bin/read_binary_uints
read_binary_uints.c ../obj/timer.o ../obj/misc.o ../obj/query.o
../obj/pop.o ../obj/bim.o ../obj/vid.o ../obj/off.o ../obj/parse_q.yy.o
../obj/parse_q.o ../obj/ped.o ../obj/pq.o ../obj/bcf.o
../obj/quick_file.o ../obj/output_buffer.o ../obj/pthread_pool.o
../obj/genotq.o ../obj/convert.o ../obj/plt.o ../obj/wah.o
../obj/wahbm.o ../obj/wahbm_in_place.o
../obj/wahbm_compressed_in_place.o ../obj/ubin.o ../obj/view.o
../obj/sort.o ../obj/count.o ../obj/sandbox.o
-I/Users/xxxx/db/htslib-1.9
-I/Users/xxxx/db/sqlite-amalgamation-3080701
-DSAMTOOLS=1
/Users/xxxx/db/htslib-1.9/libhts.a
/Users/xxxx/db/sqlite-amalgamation-3080701/sqlite3.o
-ldl -lz -lm -pthread
Undefined symbols for architecture x86_64:
"_BZ2_bzBuffToBuffCompress", referenced from:
_cram_compress_block in libhts.a(cram_io.o)
_cram_compress_by_method in libhts.a(cram_io.o)
"_BZ2_bzBuffToBuffDecompress", referenced from:
_cram_uncompress_block in libhts.a(cram_io.o)
"_curl_easy_cleanup", referenced from:
_libcurl_open in libhts.a(hfile_libcurl.o)
_libcurl_close in libhts.a(hfile_libcurl.o)
_restart_from_position in libhts.a(hfile_libcurl.o)
"_curl_easy_duphandle", referenced from:
_restart_from_position in libhts.a(hfile_libcurl.o)
"_curl_easy_getinfo", referenced from:
_easy_errno in libhts.a(hfile_libcurl.o)
_libcurl_open in libhts.a(hfile_libcurl.o)
"_curl_easy_init", referenced from:
_libcurl_open in libhts.a(hfile_libcurl.o)
"_curl_easy_pause", referenced from:
_libcurl_read in libhts.a(hfile_libcurl.o)
_libcurl_write in libhts.a(hfile_libcurl.o)
_libcurl_close in libhts.a(hfile_libcurl.o)
_restart_from_position in libhts.a(hfile_libcurl.o)
"_curl_easy_reset", referenced from:
_restart_from_position in libhts.a(hfile_libcurl.o)
"_curl_easy_setopt", referenced from:
_libcurl_open in libhts.a(hfile_libcurl.o)
_restart_from_position in libhts.a(hfile_libcurl.o)
"_curl_global_cleanup", referenced from:
_hfile_plugin_init_libcurl in libhts.a(hfile_libcurl.o)
_libcurl_exit in libhts.a(hfile_libcurl.o)
"_curl_global_init", referenced from:
_hfile_plugin_init_libcurl in libhts.a(hfile_libcurl.o)
"_curl_multi_add_handle", referenced from:
_libcurl_open in libhts.a(hfile_libcurl.o)
_restart_from_position in libhts.a(hfile_libcurl.o)
"_curl_multi_cleanup", referenced from:
_libcurl_open in libhts.a(hfile_libcurl.o)
_libcurl_close in libhts.a(hfile_libcurl.o)
"_curl_multi_fdset", referenced from:
_wait_perform in libhts.a(hfile_libcurl.o)
"_curl_multi_info_read", referenced from:
_wait_perform in libhts.a(hfile_libcurl.o)
"_curl_multi_init", referenced from:
_libcurl_open in libhts.a(hfile_libcurl.o)
"_curl_multi_perform", referenced from:
_wait_perform in libhts.a(hfile_libcurl.o)
"_curl_multi_remove_handle", referenced from:
_libcurl_open in libhts.a(hfile_libcurl.o)
_libcurl_close in libhts.a(hfile_libcurl.o)
_restart_from_position in libhts.a(hfile_libcurl.o)
"_curl_multi_timeout", referenced from:
_wait_perform in libhts.a(hfile_libcurl.o)
"_curl_share_cleanup", referenced from:
_hfile_plugin_init_libcurl in libhts.a(hfile_libcurl.o)
_libcurl_exit in libhts.a(hfile_libcurl.o)
"_curl_share_init", referenced from:
_hfile_plugin_init_libcurl in libhts.a(hfile_libcurl.o)
"_curl_share_setopt", referenced from:
_hfile_plugin_init_libcurl in libhts.a(hfile_libcurl.o)
"_curl_version_info", referenced from:
_hfile_plugin_init_libcurl in libhts.a(hfile_libcurl.o)
"_lzma_code", referenced from:
_cram_uncompress_block in libhts.a(cram_io.o)
"_lzma_easy_buffer_encode", referenced from:
_cram_compress_block in libhts.a(cram_io.o)
_cram_compress_by_method in libhts.a(cram_io.o)
"_lzma_easy_decoder_memusage", referenced from:
_cram_uncompress_block in libhts.a(cram_io.o)
"_lzma_end", referenced from:
_cram_uncompress_block in libhts.a(cram_io.o)
"_lzma_stream_buffer_bound", referenced from:
_cram_compress_block in libhts.a(cram_io.o)
_cram_compress_by_method in libhts.a(cram_io.o)
"_lzma_stream_decoder", referenced from:
_cram_uncompress_block in libhts.a(cram_io.o)
ld: symbol(s) not found for architecture x86_64
collect2: error: ld returned 1 exit status
make[1]: *** [../bin/read_binary_uints] Error 1
make: *** [all] Error 2

$ gcc --version
gcc (MacPorts gcc5 5.5.0_1) 5.5.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
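
For reference, every missing symbol here (_BZ2_*, _lzma_*, _curl_*) comes from htslib 1.9's optional compression and network dependencies, which a static libhts.a does not pull in by itself. A likely fix (untested on this setup) is to extend the library list on the link line in src/Makefile so that it ends with:

-ldl -lz -lm -pthread -lbz2 -llzma -lcurl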

The `convert wahbm` tool should infer `-f` and `-v` from BCF

Currently, one must supply -f (the number of samples) and -r (the number of variants) on the command line. It is easy enough to grab these from the BCF using bcftools stats, but it would be even better if they were calculated automatically from the BCF. This probably requires a pass through the file unless the number of records is available in the header.
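
A minimal sketch, using htslib (which gqt already links against), of how both values could be inferred; infer_counts is a hypothetical helper, and the record count falls back to a full pass through the file:

#include "htslib/vcf.h"

// Hypothetical helper: fill in the number of samples (-f) and the number
// of variant records (-r) for a VCF/BCF. Returns 0 on success.
int infer_counts(const char *fn, int *n_samples, int *n_variants)
{
    htsFile *fp = hts_open(fn, "r");
    if (!fp) return -1;
    bcf_hdr_t *hdr = bcf_hdr_read(fp);
    if (!hdr) { hts_close(fp); return -1; }

    *n_samples = bcf_hdr_nsamples(hdr);  // comes straight from the header

    // No record count in the header, so make one pass over the file.
    bcf1_t *rec = bcf_init();
    int n = 0;
    while (bcf_read(fp, hdr, rec) == 0)
        n += 1;
    *n_variants = n;

    bcf_destroy(rec);
    bcf_hdr_destroy(hdr);
    hts_close(fp);
    return 0;
}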

GQT error with URL data input

I wonder if it is possible to run the gqt commands on remote data by providing URLs. This would remove a big hurdle: transferring huge VCF files with thousands of individuals to the local servers or systems where gqt is installed. Reading the remote file directly would save a lot of time.

Try 1: the vcf.gz and index files are in the FTP path, but GQT is not able to read the ".gqt" file, although it looks in the right path, as shown below.

$ /gqt/bin/gqt query -i http://share/chr20.vcf.gz -v -d Test.phe.db -p "Phenotype=2" -g "HOM_REF HOM_ALT UNKNOWN" > common.vcf
Auto detect failure: GQT file 'http://share/chr20.vcf.gz.gqt' not found

Try 2: the .gqt file is passed with the optional parameter -G, but now it reports an input/output error.

$ /gqt/bin/gqt query -i http://share/shinyTest/chr20.vcf.gz -v -d Test.phe.db -p "Phenotype=2" -g "HOM_REF HOM_ALT UNKNOWN" -G http://share/chr20.vcf.gz.gqt > common.vcf
gqt: Cannot open WAHBM file 'http://share/chr20.vcf.gz.gqt': Input/output error

Compilation Problems and Handling Imputed Dosages?

Hi Ryan,

My group (Channing Division of Network Medicine, Brigham and Women's Hospital) is always looking for new solutions for large-scale data storage/retrieval/query and is taking a look at gqt. I encountered a few problems during compilation and also have a question:

Problem 1) I had to change line 60 of src/c/Makefile to:

$(LEX) -oparse_q.yy.c parse_q.l

Note: no space in "-oparse_q.yy.c"

Problem 2) I don't have an "immintrin.h", so I removed all includes of this file. gqt still compiled without them, but I am curious about your thoughts on how this will impact performance.

Question) Do you have any plans to add support for quantitative values such as imputed dosages?

Thanks,
JZ

Support free form queries based on WAH index and PED

Once there is support for an arbitrary query in each of the WAH functions, we should extend the functionality in the sum tool to create a more general query tool. Example (intentionally asinine) queries:

Report all variants where all CEU females are heterozygous.

gqt query -i 1kg.chr22.bcf.wahbm \
                -d integrated_call_samples.20130502.ALL.1092.db \
                -q "Population = 'CEU' and Gender = 2" \
                -r "==HET"

Report all variants where all CEU females are heterozygous AND all CEU males are homozygous for the reference allele.

gqt query -i 1kg.chr22.bcf.wahbm \
                -d integrated_call_samples.20130502.ALL.1092.db \
                -q "Population = 'CEU' and Gender = 2" \
                -r "==HET" \
                -q "Population = 'CEU' and Gender = 1" \
                -r "==HOM_REF"

Report all variants where >=20 CEU females are heterozygous AND >=30 CEU males are homozygous for the reference allele.

gqt query -i 1kg.chr22.bcf.wahbm \
                -d integrated_call_samples.20130502.ALL.1092.db \
                -q "Population = 'CEU' and Gender = 2" \
                -r "count(==HET) >= 20" \
                -q "Population = 'CEU' and Gender = 1" \
                -r "count(==HOM_REF) >= 30"

Report all variants where >=20% CEU females are heterozygous AND >=30% CEU males are homozygous for the reference allele.

gqt query -i 1kg.chr22.bcf.wahbm \
                -d integrated_call_samples.20130502.ALL.1092.db \
                -q "Population = 'CEU' and Gender = 2" \
                -r "pct(==HET) >= 0.20" \
                -q "Population = 'CEU' and Gender = 1" \
                -r "pct(==HOM_REF) >= 0.30"

Once this is in place, we can support very powerful queries by piping the BCF (TODO) output of the query to bcftools in order to filter the returned variants based on metadata in the variant records.

#1. Report all variants where >=20% CEU females are heterozygous 
# AND >=30% CEU males are homozygous for the reference allele
#2. Then use bcftools to filter the returned variants to include solely
# those variants with min(DP) >= 20 and are missense according to VEP
gqt query -i 1kg.chr22.bcf.wahbm \
                -d integrated_call_samples.20130502.ALL.1092.db \
                -q "Population = 'CEU' and Gender = 2" \
                -r "pct(==HET) >= 0.20" \
                -q "Population = 'CEU' and Gender = 1" \
                -r "pct(==HOM_REF) >= 0.30" | \
bcftools view -i 'min(DP)>=20 && CSQ[*] ~ "missense_variant"' - \
> interesting_variants.bcf  

gqt query crashes with large annotations (fix inside)

I have a BCF file, and one line has a very large annotation field (2 million characters - almost 10MB), which crashes gqt query.

This stems from the conservative 1 MiB output_buffer_main_buf_size defined in src/output_buffer.c, so the issue can be solved by increasing that constant and recompiling.

I'm wondering whether you think it's worth increasing the default buffer size, or allowing dynamic growth or automatic flushing of the output buffer?

If not, at least this issue will help other people solve the problem if they run into it :)
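
A minimal sketch of the dynamic-growth option, with assumed names (this is not gqt's actual output_buffer implementation): the buffer starts at the current 1 MiB default and doubles whenever an append would overflow it.

#include <stdlib.h>
#include <string.h>

struct out_buf { char *data; size_t len, cap; };

// Append n bytes, growing the buffer geometrically instead of crashing
// when a single record exceeds the initial capacity.
int out_buf_append(struct out_buf *b, const char *s, size_t n)
{
    if (b->len + n > b->cap) {
        size_t cap = b->cap ? b->cap : (size_t)1 << 20;  // 1 MiB default
        while (b->len + n > cap)
            cap *= 2;
        char *p = realloc(b->data, cap);
        if (!p) return -1;
        b->data = p;
        b->cap = cap;
    }
    memcpy(b->data + b->len, s, n);
    b->len += n;
    return 0;
}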

perhaps a typo in functional_tests.sh

Thanks for providing this tool. Upon installing it today, functional_tests.sh failed on RHEL5 with this error:

SUCCESS(555): GQT count matches BCFTOOLS count
./functional_tests.sh: line 565: BCFTOOLS: command not found
tail: cannot open `../data/A_vs_B.weir.fst' for reading: No such file or directory
ERROR(603): GQT Fst does not match VCFTOOLS fst
cat: ../data/A_vs_B.weir.fst: No such file or directory

On line 565, replacing 'BCFTOOLS' with '$BCFTOOLS' resolved the issue in my environment.

Need to handle multiallelic variants that have been decomposed and normalized.

The appropriate workflow for multi-allelic variants is to decompose and normalize them so that each REF/ALT combination produces a distinct record. This is a preprocessing step that will be done to the VCF before it is given to GQT. For example, consider the following multi-allelic record:

2   44101649    .   G   GC,GCC,GCCC,GCCCC   1/3:0,20,15,0,0:42:99:1528,741,642,703,0,777,1173,533,675,1117,1221,504,873,1158,1478

After decomposing and normalizing with vt, this will be split into 4 records.

2   44101649    .   G   GC  1/.:0,20,15,0,0:42:99:1528,741,642
2   44101649    .   G   GCC ./.:0,20,15,0,0:42:99:1528,703,777
2   44101649    .   G   GCCC    ./1:0,20,15,0,0:42:99:1528,1173,1117
2   44101649    .   G   GCCCC   ./.:0,20,15,0,0:42:99:1528,1221,1478

Notice that since the genotype in the original record is a heterozygote 1/3, the individual's genotype is a heterozygote for the first and third new record. GQT needs to recognize genotypes such as "1/." and "./1" as heterozygotes, not unknowns.

Now that GEMINI handles these decomposed VCFs (work from @brentp), adding this functionality will facilitate using GQT as the pre-processing engine underlying GEMINI.
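
A minimal sketch of the required classification, not GQT's actual code: genotypes are read with htslib's bcf_get_genotypes(), and a half-missing allele is treated as REF, which makes "1/." and "./1" come out as HET (and "0/." as HOM_REF, which is one possible convention).

#include "htslib/vcf.h"

enum gt_class { GT_UNKNOWN, GT_HOM_REF, GT_HET, GT_HOM_ALT };

// gt points at the two packed GT values for one diploid sample.
static enum gt_class classify_diploid(const int32_t *gt)
{
    int miss0 = bcf_gt_is_missing(gt[0]);
    int miss1 = gt[1] == bcf_int32_vector_end || bcf_gt_is_missing(gt[1]);
    if (miss0 && miss1) return GT_UNKNOWN;            // ./.
    int a0 = miss0 ? 0 : bcf_gt_allele(gt[0]);        // missing -> treat as REF
    int a1 = miss1 ? 0 : bcf_gt_allele(gt[1]);
    if (a0 == 0 && a1 == 0) return GT_HOM_REF;        // 0/0, 0/., ./0
    if (a0 > 0 && a1 > 0)   return GT_HOM_ALT;        // 1/1 (ALT is allele 1 after decomposition)
    return GT_HET;                                    // 0/1, 1/0, 1/., ./1
}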

Possible typo in functional_tests.sh script

Hello Authors,

At line 573 of the script functional_tests.sh under gqt/test/func, the script should have $BCFTOOLS, but it has BCFTOOLS, which causes an error.

Also, in several places in the same script, it has bcftools instead of $BCFTOOLS. Can someone explain why the script uses both the BCFTOOLS variable and bcftools (the command that invokes the bcftools utility)?

I hope all of this makes sense.

Thanks.

Best,
Tushar

error with example from README

$ wget --trust-server-names http://bit.ly/gqt_bcf
...
$ bcftools index chr11.11q14.3.bcf 
[bcf_sync] incorrect number of fields (6 != 5) at 50463490:50463490

$ gqt convert bcf -i chr11.11q14.3.bcf
Attempting to autodetect number of variants and samples from chr11.11q14.3.bcf
Could not load CSI index: chr11.11q14.3.bcf

PED v. VCF/BCF sanity checking.

Currently, we only enforce that the number of individuals in the PED matches the number in the GQT index of the VCF/BCF. However, this is still a bit dangerous, since we really want to ensure that both the sample IDs and the order of the sample IDs match.
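
A minimal sketch of the stricter check, assuming the PED-derived sample names have already been pulled out of the database in row order; check_sample_order is a hypothetical helper, and hdr->samples is htslib's sample-name array:

#include <string.h>
#include "htslib/vcf.h"

// Returns 0 when the PED sample list and the VCF/BCF header agree on
// both the set of sample IDs and their order.
int check_sample_order(const bcf_hdr_t *hdr, char **ped_samples, int n_ped)
{
    if (n_ped != bcf_hdr_nsamples(hdr))    // the current count-only check
        return -1;
    int i;
    for (i = 0; i < n_ped; i++)            // the proposed ID + order check
        if (strcmp(ped_samples[i], hdr->samples[i]) != 0)
            return -1;
    return 0;
}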

-h returns 64 exit code

It should return 0; otherwise many automated build/packaging systems will break :_(

mba-2:bin romanvg$ ./gqt
gqt, v1.1.2
usage:   gqt <command> [options]
         convert    Convert between file types
         query      Query the index
         pca-shared Compute the similarity matrix for PCA base
                    on the number of shared non-reference loci.
         calpha     Calculate C-alpha paramters (Neal 2011)
         gst        Calculate Gst statistic (Neil 1973)
         fst        Calculate Fst statistic (Weir and Cockerham 1984)
mba-2:bin romanvg$ echo $?
64
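
A minimal sketch of the suggested behavior (a hypothetical simplification of gqt's main, assuming the 64 comes from a sysexits-style EX_USAGE): print usage either way, but return 0 when help was explicitly requested.

#include <stdio.h>
#include <string.h>
#include <sysexits.h>

int main(int argc, char **argv)
{
    int asked_for_help = argc > 1 &&
        (strcmp(argv[1], "-h") == 0 || strcmp(argv[1], "--help") == 0);
    if (argc == 1 || asked_for_help) {
        fprintf(stderr, "usage:   gqt <command> [options]\n");
        return asked_for_help ? 0 : EX_USAGE;  // 0 for -h/--help, 64 for misuse
    }
    /* ... dispatch to convert/query/etc. ... */
    return 0;
}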

Create high-level and low-level APIs

Motivated by suggestions from @brentp and Manuel Rivas (https://twitter.com/manuelrivascruz/status/590250919570186240), it is clear that a simple API would be beneficial. In particular, providing a well-designed C API will have multiple benefits:

  • It will allow the GQT indexing, querying, and bitmap comparison operations to be accessible to other programming languages (in particular, Python and R).
  • It will allow GQT's efficient genotype bitmaps to be used by others for the application of statistics, method development, etc. For example, the existing PCA functionality is a perfect example of using the WAH-encoded bitmaps to quickly compute metrics that are otherwise very computationally intensive.
  • It will make it easier for others to understand and therefore contribute to the codebase.

In my mind, the API needs high-level functions such as the following to support point 1:

// create a GQT index of a VCF, VCF.GZ, or BCF file. Returns indicator of success
int  gqt_index(const char *vcf_fn, const char *gqt_fn);

// open a GQT index; return a pointer to a GQT file struct containing file handle, index information, etc.
gqt_file *gqt_open(const char *fn);

// open a sample database; return a pointer to a struct with an SQLite database handle.
smp_file *db_open(const char *fn);

// query the gqt index to return the set of variant row numbers that match
// the set of phenotype and genotype criteria
int *gqt_query(const char **phenotype_rules, const char **genotype_rules); 

// close
int gqt_close(gqt_file *);
int db_close(smp_file *);

In addition, in order to support goals 2 and 3, users will need a lower-level API to gain access to the set of bitmaps for a given individual, compare bit vectors, etc. This one is a bit trickier and will require discussion, since users will need to be able to access individual bit vectors for individual samples, compare bit vectors across samples, and apply comparison operators that assess all of an individual's bit vectors for a given attribute (that is, genotypes, depths, phases, PLs, etc.).
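
To make the high-level API concrete, here is a hypothetical usage sketch; none of these functions exist yet, "gqt.h" is an assumed header name, and (as proposed above) gqt_query takes only the rule lists, so a final design would presumably thread the open handles through as well:

#include <stdlib.h>
#include "gqt.h"  // hypothetical header for the proposed API

int main(void)
{
    if (gqt_index("1kg.chr22.bcf", "1kg.chr22.bcf.gqt") != 0)
        return 1;

    gqt_file *g  = gqt_open("1kg.chr22.bcf.gqt");
    smp_file *db = db_open("integrated_call_samples.20130502.ALL.1092.db");

    const char *phenotype_rules[] = { "Population = 'CEU' and Gender = 2", NULL };
    const char *genotype_rules[]  = { "count(==HET) >= 20", NULL };
    int *rows = gqt_query(phenotype_rules, genotype_rules);
    /* ... iterate over the matching variant row numbers ... */
    free(rows);

    db_close(db);
    gqt_close(g);
    return 0;
}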

compiling error

Dear GQT developers,

I got the following error while compiling gqt; could you give some advice?

convert.c:122:17: warning: implicit declaration of function 'hts_get_format' is invalid in C99 [-Wimplicit-function-declaration]
if (hts_get_format(fp)->format==vcf) {
^
convert.c:122:37: error: member reference type 'int' is not a pointer
if (hts_get_format(fp)->format==vcf) {
~~~~~~~~~~~~~~~~~~ ^
convert.c:122:45: error: use of undeclared identifier 'vcf'
if (hts_get_format(fp)->format==vcf) {
^
convert.c:128:45: error: member reference type 'int' is not a pointer
} else if ( hts_get_format(fp)->format==bcf ) {
~~~~~~~~~~~~~~~~~~ ^
convert.c:128:53: error: use of undeclared identifier 'bcf'
} else if ( hts_get_format(fp)->format==bcf ) {
^
1 warning and 4 errors generated.

Thanks,
Zhihao
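
For reference, an implicit declaration of hts_get_format typically means the htslib headers being compiled against predate the function (it was added around htslib 1.2), which also explains the follow-on "member reference type 'int'" errors. A likely fix, sketched under the assumption that the htslib checkout sits alongside gqt:

cd ../htslib
git checkout 1.9
make
# then point the htslib include/library paths in gqt's src/Makefile
# at this newer build and re-run make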
