lh3 / chm-eval
License: MIT License
I downloaded the 20180222 release, but it doesn't appear to include the run-flt script. So I cloned the master branch, which does have run-flt but has a very different directory structure from the one described in the README. When I try to run run-flt, I get error messages like this:
sh: CHM-eval/k8: No such file or directory
sh: CHM-eval/k8: No such file or directory
sh: CHM-eval/k8: No such file or directory
sh: CHM-eval/htsbox: No such file or directory
sh: CHM-eval/htsbox: No such file or directory
You may need to include the run-flt script in the release, and/or update paths/documentation for the master branch.
Thanks!
I've noticed a few instances where the SynDip truth set asserts heterozygous calls that are actually different representations of the same haplotype. For example, in chr1:2106223-2107071 of full.38.vcf.gz we have:
chr1 2106469 . A G 30 . . GT:AD 1|1:0,2
chr1 2106527 . GGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCCA G 30 . . GT:AD 0|1:1,1
chr1 2106546 . A G,* 30 . . GT:AD 1|2:0,1,1
chr1 2106547 . C T,* 30 . . GT:AD 1|2:0,1,1
chr1 2106584 . C T,* 30 . . GT:AD 1|2:0,1,1
chr1 2106585 . A G,* 30 . . GT:AD 1|2:0,1,1
chr1 2106604 . A G 30 . . GT:AD 1|1:0,2
chr1 2106605 . C T 30 . . GT:AD 1|1:0,2
chr1 2106634 . AGCCCCTCTGGTGGGCGAGGACCTCCACGTGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCCGGTGGGCGAGGACCTCCATGCGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCTGGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGTGACTCTTCAGCAG 30 . . GT:AD 1|0:1,1
chr1 2106692 . AGCCCCTCCGGTGGGCGAGGACCTCCATGCGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCTGGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGTGACTCTTCAGCAG A,* 30 . . GT:AD 2|1:0,1,1
chr1 2106855 . C T 30 . . GT:AD 1|1:0,2
If I decompose these into haplotypes:
CHM1
chr1 2106469 . A G 30 . . GT 1
chr1 2106546 . A G 30 . . GT 1
chr1 2106547 . C T 30 . . GT 1
chr1 2106584 . C T 30 . . GT 1
chr1 2106585 . A G 30 . . GT 1
chr1 2106604 . A G 30 . . GT 1
chr1 2106605 . C T 30 . . GT 1
chr1 2106634 . AGCCCCTCTGGTGGGCGAGGACCTCCACGTGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCCGGTGGGCGAGGACCTCCATGCGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCTGGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGTGACTCTTCAGCAG 30 . . GT 1
chr1 2106855 . C T 30 . . GT 1
CHM13
chr1 2106469 . A G 30 . . GT 1
chr1 2106527 . GGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCCA G 30 . . GT 1
chr1 2106604 . A G 30 . . GT 1
chr1 2106605 . C T 30 . . GT 1
chr1 2106692 . AGCCCCTCCGGTGGGCGAGGACCTCCATGCGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCTGGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGTGACTCTTCAGCAG A 30 . . GT 1
chr1 2106855 . C T 30 . . GT 1
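For anyone who wants to reproduce this decomposition, here is a minimal Python sketch. It assumes phased, whitespace-separated, single-sample records like the ones quoted above, takes the first GT allele as CHM1 and the second as CHM13 (consistent with the listing), and drops `*` alleles, since they only mark positions spanned by a deletion on that haplotype. The function name `decompose` is made up for illustration and is not part of CHM-eval.

```python
def decompose(vcf_lines):
    """Split phased diploid VCF records into two haploid variant lists.

    Returns (hap1, hap2), each a list of (pos, ref, alt) tuples.
    Assumes one sample column with GT as the first FORMAT field.
    """
    haps = ([], [])
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        f = line.split()
        pos, ref, alts = int(f[1]), f[3], f[4].split(",")
        gt = f[9].split(":")[0]
        for h, idx in enumerate(gt.split("|")):
            if idx in ("0", "."):
                continue  # reference or missing allele: nothing to record
            alt = alts[int(idx) - 1]
            if alt == "*":
                continue  # position is covered by an upstream deletion
            haps[h].append((pos, ref, alt))
    return haps


records = [
    "chr1 2106469 . A G 30 . . GT:AD 1|1:0,2",
    "chr1 2106546 . A G,* 30 . . GT:AD 1|2:0,1,1",
]
chm1, chm13 = decompose(records)
```

With these two records, the first haplotype keeps both SNVs while the second keeps only the homozygous one, matching the listing above.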
The consensus sequence of both is identical:
>chr1:2106223-2107071
CTCCAGCCAAGGCATCCAAACATCAAAAGGCAGAACTGAGCGGCTTGGTACTTGAAAAGT
TTTTATTAGGAAAAATGCCAAACTGACAGAAGTAGAGAGAATTACATAGTGAGGCCTCGT
GCACACCCTGCCTGGCTCCTGGCAACCTGCACTCCAGCCGATACCTGTGACTCTCAGCAA
GCCCCTCTAGTGGGCGAGGACCTCCACACGTGTCGCCAGGCCAGGCGACTCTCAGCAAGC
CCCTCCGGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCC
CTCTGGTGGGCGAGGACCTCCACGTGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCT
CTGGTGGGCGAGGACCTCCACGTGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCT
GGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGCGACTCTTCAGCAAGCCCCTCCA
CACGTGTCACCAGGCCAGGTGACTCTCAGCAAGCCCCTCCGGTGGGCGAGGACCTCTGCA
CGTGTCTCCAGAGGCCAAAGCAGAAGAAAACGTTAGCACAGGAGTCACTTGACTTCACCA
AACGCAGCCAGGATTGCGGTTTCTCCGGCTCGGCTGTCTCAGTTGTTTAAGAGAGTTCAT
GCTTTTGAGATCAA
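A minimal sketch of how such a consensus can be built and compared: apply each haplotype's variants to the reference from right to left, then compare the resulting strings. The toy reference and the two single-deletion representations below are made up for illustration; a real check would use the reference sequence for chr1:2106223-2107071 and the haploid variant lists above. The helper name `apply_variants` is hypothetical.

```python
def apply_variants(ref_seq, variants, start=1):
    """Apply non-overlapping haploid variants to a reference string.

    variants: iterable of (pos, ref, alt) with 1-based pos.
    start: genomic coordinate of ref_seq[0].
    """
    seq = ref_seq
    # Work right to left so that upstream coordinates stay valid.
    for pos, ref, alt in sorted(variants, reverse=True):
        i = pos - start
        if seq[i:i + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at {pos}")
        seq = seq[:i] + alt + seq[i + len(ref):]
    return seq


toy_ref = "GATTTTACA"
# Two different VCF representations of deleting one T from the T run.
rep_a = [(2, "AT", "A")]
rep_b = [(5, "TT", "T")]
same_haplotype = apply_variants(toy_ref, rep_a) == apply_variants(toy_ref, rep_b)
```

Here both representations yield the same sequence, which is the situation described for CHM1 and CHM13 in this region.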
I'm assuming this is a result of the way that SynDip calls were generated, and in particular because the merging process was not haplotype aware. I guess that it's not technically wrong, but is somewhat confusing, and does throw off comparison tools. For example, if I try to evaluate the following calls
chr1 2106468 . CAGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCTGGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGTAACTCTCAGCAA C 421.84 PASS . GT 1|1
chr1 2106577 . GCCCCTCCA G 314.04 PASS . GT 1|1
chr1 2106700 . C T 270.54 PASS . GT 1|1
chr1 2106719 . T C 336.06 PASS . GT 1|1
chr1 2106721 . C T 336.06 PASS . GT 1|1
chr1 2106796 . TGACTCTTCAGCAGGCCCCTCTGGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGC T 125.96 PASS . GT 1|1
which are yet another representation of the same haplotype, with RTG Tools vcfeval (without the --ref-overlap option), then all the calls are reported as false positives and all the SynDip calls as false negatives. Adding --ref-overlap does resolve the issue, but IMO it shouldn't be necessary.
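For reference, the two vcfeval runs being compared could be set up roughly as follows. This is only a sketch: the file names are placeholders, the helper `vcfeval_cmd` is hypothetical, and it only builds the argument list using the vcfeval options --baseline, --calls, --template, --output and --ref-overlap.

```python
def vcfeval_cmd(baseline, calls, template_sdf, outdir, ref_overlap=False, rtg="rtg"):
    """Build an argument list for an RTG Tools vcfeval run."""
    cmd = [
        rtg, "vcfeval",
        "--baseline", baseline,      # truth set, e.g. the SynDip VCF
        "--calls", calls,            # call set under evaluation
        "--template", template_sdf,  # reference in SDF format (from 'rtg format')
        "--output", outdir,          # output directory, must not exist yet
    ]
    if ref_overlap:
        cmd.append("--ref-overlap")
    return cmd


# Placeholder file names; substitute your own paths.
strict = vcfeval_cmd("full.38.vcf.gz", "calls.vcf.gz", "hs38.sdf", "eval-strict")
relaxed = vcfeval_cmd("full.38.vcf.gz", "calls.vcf.gz", "hs38.sdf", "eval-overlap",
                      ref_overlap=True)
# To actually run one: import subprocess; subprocess.run(relaxed, check=True)
```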
Hi,
Does the procedure outlined at https://github.com/lh3/CHM-eval/tree/master/dip-call identify structural variants from split alignments? I noticed there were no inversions called. Are only variants spanned by reads called?
Regards,
Wouter
The line:
CHM-eval.kit/rtg -o hs37.sdf hs37.fa # if you haven't done this before
should be:
CHM-eval.kit/rtg format -o hs37.sdf hs37.fa # if you haven't done this before
I understand the reason behind simply providing the locations, but in many real-world scenarios, especially in coding regions, it's important to get the exact variant right.
Is there any plan to provide a SynDip set with the REF and ALT alleles inferred from the PacBio data?
Is there an hg38-aligned BAM (Illumina reads) available for download for CHM1, CHM13 and CHM1_CHM13? I've only been able to find the hs37-aligned BAMs.
Hey, I'm new to bioinformatics.
I have a question: what is the file "rep2.37.broad.hc.raw.vcf.gz" in the release? Is this the truth dataset? If not, how can I get the SynDip dataset with all variants in a single VCF file?
Thank you!
Hi Heng,
I noticed that there are variants in rep2.37.broad.hc.raw.vcf.gz that fall beyond the end of their chromosomes. An example is on chromosome 9, position 138395067.
I did not look into this in any further detail.
Is there another VCF you would recommend as the truth set for SynDip? Or would it be best to filter this VCF?
Thanks,
Wouter
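A quick way to check for such records is to compare each POS against the contig lengths of the reference the VCF is supposed to match. The sketch below assumes a samtools-style .fai index (contig name and length in the first two tab-separated columns); the function names and the toy lengths in the example are made up for illustration.

```python
def read_fai(path):
    """Read contig lengths from a FASTA index (.fai): name<TAB>length<TAB>..."""
    lengths = {}
    with open(path) as fh:
        for line in fh:
            name, length = line.split("\t")[:2]
            lengths[name] = int(length)
    return lengths


def out_of_range(vcf_lines, contig_lengths):
    """Return (chrom, pos) for records on unknown contigs or past the contig end."""
    bad = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        chrom, pos = line.split()[:2]
        length = contig_lengths.get(chrom)
        if length is None or int(pos) > length:
            bad.append((chrom, int(pos)))
    return bad


# Toy example with a made-up contig length of 1000 bp.
toy_lengths = {"9": 1000}
toy_vcf = ["##fileformat=VCFv4.2", "9\t1000\t.\tA\tG", "9\t1001\t.\tC\tT"]
flagged = out_of_range(toy_vcf, toy_lengths)
```

If records are flagged, it is worth confirming that the .fai comes from the same genome build as the VCF before filtering anything.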