emerald's Introduction

emeraLD

Tools for rapid on-the-fly LD calculation

About

  • Exploits sparsity and haplotype structure to efficiently calculate LD
  • Uses tabix indexes to support rapid querying of genomic regions
  • Supports VCF (phased or unphased) and M3VCF formats
  • Supports integration with Python and R

Installing

git clone https://github.com/statgen/emeraLD.git  
cd emeraLD  
make  

Usage

  • Example usage from command line
# example usage for calculating LD in a region:
bin/emeraLD -i example/chr20.1KG.25K_m.m3vcf.gz --region 20:60479-438197 --stdout | bgzip -c > output.txt.gz
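The compressed output written by the command above can be loaded for downstream analysis. The following is a minimal sketch, not part of emeraLD: `read_ld_table` is a hypothetical helper, and it assumes the file is whitespace-delimited with a `#`-prefixed header such as `#CHR POS1 POS2 R Rsq`, with positions in the second and third columns and numeric statistics after them (the layout seen in the sample output in the issues further down).

```python
import gzip

def read_ld_table(path):
    """Read a gzip/bgzip-compressed, whitespace-delimited LD table.

    Assumes a header line starting with '#', integer positions in
    columns 2 and 3, and numeric values in all later columns.
    """
    rows, header = [], None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue
            if fields[0].startswith("#"):
                # Header line, e.g. "#CHR POS1 POS2 R Rsq"
                header = [name.lstrip("#") for name in fields]
                continue
            if header is None or len(fields) < 3 or len(fields) != len(header):
                continue  # not a data row (log text, footer, ...)
            if not (fields[1].isdigit() and fields[2].isdigit()):
                continue  # positions must be integers
            row = dict(zip(header, fields))
            row[header[1]] = int(fields[1])
            row[header[2]] = int(fields[2])
            for name in header[3:]:
                row[name] = float(row[name])
            rows.append(row)
    return rows
```

Calling `read_ld_table("output.txt.gz")` returns a list of dictionaries keyed by the header names, one per SNP pair.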

Software References

Libraries and resources used or adapted in emeraLD:

Contributors

Special thanks to Daniel Taliun and Ryan Welch

Citation

Feedback and bug reports

emerald's People

Contributors

corbinq, dtaliun, welchr

emerald's Issues

LD between chromosomes

Hi,
Is it possible to calculate LD between chromosomes? (They are not actual chromosomes; they are scaffolds which might be physically linked.) I have not been able to find the proper option.

Alternative question: do you think it would be possible to calculate LD for a whole-genome set of SNPs (about 1 000 000)? Or is that unrealistic, so that I should draw a subset by scaffold instead?
I'm interested in getting an LD matrix for subsequent analysis, with LDna for instance.

Thanks a lot for this promising new tool,
Claire
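For a sense of scale on the whole-genome question, the number of SNP pairs grows quadratically. The arithmetic below is a rough sketch only: the 4 bytes per stored value is an assumption, and it says nothing about how emeraLD itself stores or streams results.

```python
def ld_matrix_cost(n_snps, bytes_per_value=4):
    """Return (number of unordered SNP pairs, bytes to store one value per pair)."""
    pairs = n_snps * (n_snps - 1) // 2
    return pairs, pairs * bytes_per_value

pairs, size_bytes = ld_matrix_cost(1_000_000)
print(f"{pairs:,} pairs")               # 499,999,500,000 pairs
print(f"{size_bytes / 1e12:.2f} TB")    # 2.00 TB
```

At roughly 5 x 10^11 pairs, a dense all-pairs matrix for a million SNPs is unlikely to fit in memory, which is one reason to work per scaffold or within a distance window.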

Ofast is Generally Not Safe

Would you please consider changing the default makefile to use -O3 rather than -Ofast? Or perhaps give a warning about the risks of -Ofast compilation in the readme file?

My understanding is that -Ofast allows for unsafe math shortcuts that probably don't belong in scientific software, and if one is going to use them then each user should carefully verify the output of their compiled program built on their specific build toolchain. My particular version of gcc seems to do exactly the same thing for -Ofast and -O3 for this program, as my binary file seems identical, but that is not guaranteed. Please see https://simonbyrne.github.io/notes/fastmath/ for a more complete discussion of fast math.

Thanks for this software; it is indeed very fast. On my system the two binaries appear to be the same (they are the same size, though I didn't verify that their contents are byte-for-byte identical), but others might see a speed difference between -O3 and -Ofast. In my case there was no speed penalty from dropping -Ofast. Those who do see a difference are the ones who should be most cautious about results from the -Ofast build.
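One reason reordering matters: floating-point addition is not associative, so a compiler that is allowed to regroup sums (as fast-math modes are) can change results in the last digits. The snippet below only illustrates that property in Python; it does not reproduce what any particular compiler does with emeraLD.

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # evaluates to 0.6000000000000001
right = a + (b + c)   # evaluates to 0.6

print(left == right)  # False
```

Whether such differences are large enough to matter depends on the computation, which is why comparing output between an -O3 build and an -Ofast build is a reasonable check.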

LD difference between plink and emeraLD using unphased 1kg data

Hello:

I am trying to compare LD results on unphased 1KG data between plink and emeraLD, and I get totally different results. I also found that many of the LD values generated by emeraLD have R > 1 or R < -1. Can you explain the difference in LD calculation between plink and emeraLD? Here is one example. Thanks.

$emeraLD -i test.vcf.gz --stdout --dstats --no-phase
emeraLD v0.1 (c) 2018 corbin quick ([email protected])

reading from vcf file...

assuming unphased data (reporting diploid genotype LD)...

processed genotype data for 503 individuals...

calculating LD for 2 SNPs...

#CHR POS1 POS2 R Rsq D Dprime
9 133830619 134457580 -15.71375 246.92204 -0.66269 -99.00000
done!! thanks for using emeraLD

plink1.9 --vcf test.vcf.gz --ld rs3780269 rs111207562
PLINK v1.90b3.44 64-bit (17 Nov 2016) https://www.cog-genomics.org/plink2
(C) 2005-2016 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to plink.log.
Options in effect:
--ld rs3780269 rs111207562
--vcf test.vcf.gz

257652 MB RAM detected; reserving 128826 MB for main workspace.
--vcf: plink-temporary.bed + plink-temporary.bim + plink-temporary.fam written.
2 variants loaded from .bim file.
503 people (0 males, 0 females, 503 ambiguous) loaded from .fam.
Ambiguous sex IDs written to plink.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 503 founders and 0 nonfounders present.
Calculating allele frequencies... done.
2 variants and 503 people pass filters and QC.
Note: No phenotypes present.

--ld rs3780269 rs111207562:

R-sq = 0.0145356 D' = 0.17131

Haplotype Frequency Expectation under LE


      AT      0.137429                0.165839
      GT      0.360583                0.332173
      AC      0.195573                0.167163
      GC      0.306415                0.334825

In phase alleles are AC/GT
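As a cross-check on the plink numbers, r-squared and D' can be recomputed directly from the four haplotype frequencies printed above. This is a sketch of the standard two-locus formulas; `ld_stats` is a hypothetical helper, and it is not a description of how either plink or emeraLD computes LD internally.

```python
def ld_stats(f11, f12, f21, f22):
    """LD statistics from haplotype frequencies.

    f11: first allele at SNP 1 with first allele at SNP 2
    f12: first allele at SNP 1 with second allele at SNP 2
    f21: second allele at SNP 1 with first allele at SNP 2
    f22: second allele at SNP 1 with second allele at SNP 2
    """
    p = f11 + f12            # frequency of the first allele at SNP 1
    q = f11 + f21            # frequency of the first allele at SNP 2
    d = f11 - p * q          # deviation from linkage equilibrium
    r = d / (p * (1 - p) * q * (1 - q)) ** 0.5
    if d >= 0:
        d_max = min(p * (1 - q), (1 - p) * q)
    else:
        d_max = min(p * q, (1 - p) * (1 - q))
    return {"D": d, "R": r, "Rsq": r * r, "Dprime": d / d_max}

# Haplotype frequencies from the plink output: AT, AC, GT, GC
stats = ld_stats(0.137429, 0.195573, 0.360583, 0.306415)
print(round(stats["Rsq"], 5), round(abs(stats["Dprime"]), 4))
```

With these inputs the result agrees with plink's R-sq = 0.0145356 and D' = 0.17131 to the precision of the printed frequencies. By construction R lies in [-1, 1], so values such as R = -15.71 cannot be valid correlation coefficients.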

non-symmetric LD matrix

I'm calculating an LD matrix from UK Biobank data. When I use it as input to another program in R, it gives an error: isSymmetric = FALSE
I've checked the values to identify which ones are causing the problem, and the mismatched entries seem to be fairly close to each other. Should I expect the output matrix to be symmetric, or can it differ due to rounding or approximation somewhere in the emeraLD code?
Example 1:

temp[1919,1064]
V1064
-0.02305
temp[1064,1919]
V1919
-0.02601

Example 2:

temp[648,650]
V650
1
temp[650,648]
V648
0.92538
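To see how widespread the mismatches are, one option is to list every pair (i, j) whose two entries differ by more than a tolerance. The helper below is a hypothetical sketch on a plain list-of-lists matrix; the tolerance of 1e-5 is an assumption based on the five decimal places shown in the examples.

```python
def asymmetric_pairs(matrix, tol=1e-5):
    """Return (i, j, m[i][j], m[j][i]) for every i < j where the entries differ by more than tol."""
    mismatches = []
    n = len(matrix)
    for i in range(n):
        for j in range(i + 1, n):
            upper, lower = matrix[i][j], matrix[j][i]
            if abs(upper - lower) > tol:
                mismatches.append((i, j, upper, lower))
    return mismatches

# Toy 3x3 matrix reusing the values from the two examples above
toy = [
    [1.0,      -0.02305, 1.0],
    [-0.02601,  1.0,     0.5],
    [0.92538,   0.5,     1.0],
]
for i, j, upper, lower in asymmetric_pairs(toy):
    print(i, j, upper, lower)
```

Differences of the size shown (1 versus 0.92538) are larger than rounding at five decimals would produce, so simply averaging the two triangles would hide the discrepancy rather than explain it.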

make error

Thanks for developing and sharing this wonderful tool! I was trying to install the software but unfortunately got different errors on macOS and on Linux. Below are the error messages. I wonder if you could help identify the problems at your convenience.

error in macOS

# xiangzhu @ stanford in ~/GitHub/emeraLD on git:master o [15:58:14] C:2
$ make
c++ -std=c++11 -Ofast -flto -pipe  -c -o src/Main.o src/Main.cpp
In file included from src/Main.cpp:1:
In file included from src/processGenotypes.hpp:13:
src/boost/dynamic_bitset.hpp:15:10: fatal error: 'boost/dynamic_bitset/dynamic_bitset.hpp' file not
      found
#include "boost/dynamic_bitset/dynamic_bitset.hpp"
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
make: *** [src/Main.o] Error 1

Here is the info about this MacOS environment:

$ uname -a
Darwin stanford 17.7.0 Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64 x86_64

error in Linux

# xiangzhu @ midway-login1 in ~/emeraLD on git:master o [18:03:41]
$ make
g++ -std=c++11 -Ofast -flto -pipe  -c -o src/calcLD.o src/calcLD.cpp
cc1plus: error: invalid option argument ‘-Ofast’
cc1plus: error: unrecognized command line option "-std=c++11"
cc1plus: error: unrecognized command line option "-flto"
make: *** [src/calcLD.o] Error 1

Here is the info about this Linux environment:

$ cat /proc/version
Linux version 2.6.32-696.23.1.el6.x86_64 ([email protected]) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-18) (GCC) ) #1 SMP Tue Mar 13 17:46:31 CDT 2018
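The Linux errors are consistent with the compiler version in that string: GCC 4.4.7 predates all three flags the makefile passes. The sketch below encodes the GCC releases in which each flag first appeared (4.5 for -flto, 4.6 for -Ofast, 4.7 for the -std=c++11 spelling); `unsupported_flags` is a hypothetical helper, and accepting -std=c++11 does not by itself guarantee that every C++11 feature used by the code is implemented.

```python
import re

# GCC release in which each flag was first accepted
MIN_GCC = {"-flto": (4, 5), "-Ofast": (4, 6), "-std=c++11": (4, 7)}

def unsupported_flags(version_text):
    """Return the flags from MIN_GCC that the GCC version named in version_text predates."""
    match = re.search(r"gcc version (\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError("no 'gcc version X.Y' found in text")
    version = (int(match.group(1)), int(match.group(2)))
    return sorted(flag for flag, minimum in MIN_GCC.items() if version < minimum)

print(unsupported_flags("gcc version 4.4.7 20120313 (Red Hat 4.4.7-18)"))
```

For the environment above this lists all three flags, so building there would need a newer compiler rather than a change to a single option.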

Seg fault on large VCF

Hi,
I wanted to try this tool as it seems very promising. I have several hundred human samples with SNPs called and wanted to run emeraLD. However, within seconds I get a seg fault:

NOTE: genotype data appear to be unphased
reporting genotype LD rather than haplotype LD
use "--phased" option to override this behaviour
./run_LDanalysis.sh: line 14: 15620 Segmentation fault ~/mydir/programs/emeraLD/bin/emeraLD -i $reads --out output_chr22.txt --region chr22:10511578-20511578

where $reads is my bgzipped and indexed VCF file. I also run into the same problem when I don't define a region.
Here is a short sample of the VCF file (without the header), showing just the first two samples. As you can see, the genotypes are not phased.

Please let me know what is going wrong, as I would really like to run emeraLD on that data set.
Thanks
Fritz

chr22   10511193        .       T       C       56      .       .       GT:GQ:PL:DP:RR:VR:FT:RNC        ./.:.:0,27,30:0:0:0:No_data:..  1/1:.:27,3,0:2:0:2:low_coverage;low_Var
chr22   10511228        .       T       A       98      .       .       GT:GQ:PL:DP:RR:VR:FT:RNC        ./.:.:0,27,30:0:0:0:No_data:..  0/0:.:0,35,114:2:2:0:No_var:..  ./.:.:0
chr22   10511254        .       A       G       87      .       .       GT:GQ:PL:DP:RR:VR:FT:RNC        0/0:.:0,32,80:1:1:0:No_var:..   0/0:.:0,35,114:2:2:0:No_var:..  ./.:.:0
chr22   10511255        .       C       A       73      .       .       GT:GQ:PL:DP:RR:VR:FT:RNC        0/0:.:0,32,80:1:1:0:No_var:..   0/0:.:0,35,114:2:2:0:No_var:..  ./.:.:0
chr22   10511270        .       T       A       229     .       .       GT:GQ:PL:DP:RR:VR:FT:RNC        0/0:.:0,32,80:1:1:0:No_var:..   0/0:.:0,35,114:2:2:0:No_var:..  ./.:.:0
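When preparing a report like this, it can help to summarise what the genotype column actually contains. The sketch below tallies missing, phased and unphased calls for one VCF data line; `genotype_counts` is a hypothetical helper that assumes tab-delimited input with GT listed in the FORMAT column. Whether missing calls or the extra FORMAT fields are related to the crash is not established here.

```python
from collections import Counter

def genotype_counts(vcf_line):
    """Count missing, phased and unphased GT calls in one tab-delimited VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    gt_index = fields[8].split(":").index("GT")  # FORMAT is the 9th column
    counts = Counter()
    for sample in fields[9:]:
        parts = sample.split(":")
        gt = parts[gt_index] if gt_index < len(parts) else "."
        if "." in gt:
            counts["missing"] += 1
        elif "|" in gt:
            counts["phased"] += 1
        else:
            counts["unphased"] += 1
    return counts

line = "chr22\t10511254\t.\tA\tG\t87\t.\t.\tGT:GQ\t0/0:.\t./.:.\t0|1:30"
print(genotype_counts(line))
```

Running this over every record gives a quick picture of how much of the data is missing or unphased before handing the file to an LD tool.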

Savvy Library input

Hi Corbin,

Thank you so much for developing emeraLD. Your tool is lightning fast.

Do you think there is a chance that you will add support for the savvy format (https://github.com/statgen/savvy) to emeraLD in the near future? It could be a great combination of highly efficient tools.

Thanks,
Lars
