chm-eval's People

Contributors

lh3

chm-eval's Issues

Missing run-flt in release

I downloaded the 20180222 release, but it doesn't appear to include the run-flt script. So I cloned the master branch, which does have run-flt but has a very different directory structure from the one described in the README. When I try to run run-flt, I get error messages like this:

sh: CHM-eval/k8: No such file or directory
sh: CHM-eval/k8: No such file or directory
sh: CHM-eval/k8: No such file or directory
sh: CHM-eval/htsbox: No such file or directory
sh: CHM-eval/htsbox: No such file or directory

You may need to include the run-flt script in the release, and/or update paths/documentation for the master branch.

Thanks!

SynDip reports different representations of the same haplotype

I've noticed a few instances where the SynDip truth set asserts heterozygous calls that are actually different representations of the same haplotype. For example, in chr1:2106223-2107071 of full.38.vcf.gz we have:

chr1	2106469	.	A	G	30	.	.	GT:AD	1|1:0,2
chr1	2106527	.	GGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCCA	G	30	.	.	GT:AD	0|1:1,1
chr1	2106546	.	A	G,*	30	.	.	GT:AD	1|2:0,1,1
chr1	2106547	.	C	T,*	30	.	.	GT:AD	1|2:0,1,1
chr1	2106584	.	C	T,*	30	.	.	GT:AD	1|2:0,1,1
chr1	2106585	.	A	G,*	30	.	.	GT:AD	1|2:0,1,1
chr1	2106604	.	A	G	30	.	.	GT:AD	1|1:0,2
chr1	2106605	.	C	T	30	.	.	GT:AD	1|1:0,2
chr1	2106634	.	AGCCCCTCTGGTGGGCGAGGACCTCCACGTGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCCGGTGGGCGAGGACCTCCATGCGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCTGGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGTGACTCTTCAGCAG	30	.	.	GT:AD	1|0:1,1
chr1	2106692	.	AGCCCCTCCGGTGGGCGAGGACCTCCATGCGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCTGGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGTGACTCTTCAGCAG	A,*	30	.	.	GT:AD	2|1:0,1,1
chr1	2106855	.	C	T	30	.	.	GT:AD	1|1:0,2

If I decompose these into haplotypes:

CHM1

chr1	2106469	.	A	G	30	.	.	GT	1
chr1	2106546	.	A	G	30	.	.	GT	1
chr1	2106547	.	C	T	30	.	.	GT	1
chr1	2106584	.	C	T	30	.	.	GT	1
chr1	2106585	.	A	G	30	.	.	GT	1
chr1	2106604	.	A	G	30	.	.	GT	1
chr1	2106605	.	C	T	30	.	.	GT	1
chr1	2106634	.	AGCCCCTCTGGTGGGCGAGGACCTCCACGTGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCCGGTGGGCGAGGACCTCCATGCGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCTGGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGTGACTCTTCAGCAG	30	.	.	GT	1
chr1	2106855	.	C	T	30	.	.	GT	1

CHM13

chr1	2106469	.	A	G	30	.	.	GT	1
chr1	2106527	.	GGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCCA	G	30	.	.	GT	1
chr1	2106604	.	A	G	30	.	.	GT	1
chr1	2106605	.	C	T	30	.	.	GT	1
chr1	2106692	.	AGCCCCTCCGGTGGGCGAGGACCTCCATGCGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCTGGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGTGACTCTTCAGCAG	A	30	.	.	GT	1
chr1	2106855	.	C	T	30	.	.	GT	1
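The decomposition above can be sketched in a few lines of Python. This is only an illustration, not the procedure used to build SynDip: the function name `split_haplotypes` is hypothetical, it assumes a single-sample VCF with phased genotypes and GT as the first FORMAT field, and it drops reference, missing, and `*` (spanning deletion) alleles.

```python
def split_haplotypes(vcf_lines):
    """Split phased diploid VCF records into two per-haplotype lists.

    Assumptions (not taken from the SynDip pipeline): one sample column,
    GT is the first FORMAT field, and genotypes are phased with '|'.
    Returns two lists of (chrom, pos, ref, alt) tuples.
    """
    haps = ([], [])
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        alts = fields[4].split(",")
        gt = fields[9].split(":")[0]
        if "|" not in gt:
            continue  # unphased genotype: cannot be assigned to a haplotype
        for hap, idx in zip(haps, gt.split("|")):
            if idx in ("0", "."):
                continue  # reference or missing allele on this haplotype
            alt = alts[int(idx) - 1]
            if alt == "*":
                continue  # spanning deletion, described by another record
            hap.append((chrom, pos, ref, alt))
    return haps


if __name__ == "__main__":
    # Toy records in the same layout as the excerpt (REF strings shortened).
    records = [
        "chr1\t2106469\t.\tA\tG\t30\t.\t.\tGT:AD\t1|1:0,2",
        "chr1\t2106527\t.\tGGTG\tG\t30\t.\t.\tGT:AD\t0|1:1,1",
        "chr1\t2106546\t.\tA\tG,*\t30\t.\t.\tGT:AD\t1|2:0,1,1",
    ]
    for name, hap in zip(("hap1", "hap2"), split_haplotypes(records)):
        print(name, hap)
```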

The consensus sequence of both is identical:

>chr1:2106223-2107071
CTCCAGCCAAGGCATCCAAACATCAAAAGGCAGAACTGAGCGGCTTGGTACTTGAAAAGT
TTTTATTAGGAAAAATGCCAAACTGACAGAAGTAGAGAGAATTACATAGTGAGGCCTCGT
GCACACCCTGCCTGGCTCCTGGCAACCTGCACTCCAGCCGATACCTGTGACTCTCAGCAA
GCCCCTCTAGTGGGCGAGGACCTCCACACGTGTCGCCAGGCCAGGCGACTCTCAGCAAGC
CCCTCCGGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCC
CTCTGGTGGGCGAGGACCTCCACGTGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCT
CTGGTGGGCGAGGACCTCCACGTGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCT
GGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGCGACTCTTCAGCAAGCCCCTCCA
CACGTGTCACCAGGCCAGGTGACTCTCAGCAAGCCCCTCCGGTGGGCGAGGACCTCTGCA
CGTGTCTCCAGAGGCCAAAGCAGAAGAAAACGTTAGCACAGGAGTCACTTGACTTCACCA
AACGCAGCCAGGATTGCGGTTTCTCCGGCTCGGCTGTCTCAGTTGTTTAAGAGAGTTCAT
GCTTTTGAGATCAA
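A consensus comparison of this kind can be sketched as follows. The helper `apply_variants` is hypothetical, and the reference and variants are a toy example rather than the chr1 region above. It assumes the variants on a haplotype do not overlap and that `offset` is the 1-based coordinate of the first base of the reference string.

```python
def apply_variants(ref_seq, variants, offset=1):
    """Apply non-overlapping (pos, ref, alt) variants to a reference string.

    pos is a 1-based coordinate; offset is the coordinate of ref_seq[0].
    Raises ValueError if variants overlap or REF does not match ref_seq.
    """
    out = []
    cursor = 0  # index of the next reference base not yet emitted
    for pos, ref, alt in sorted(variants):
        start = pos - offset
        if start < cursor:
            raise ValueError(f"overlapping variant at {pos}")
        if ref_seq[start:start + len(ref)] != ref:
            raise ValueError(f"REF mismatch at {pos}")
        out.append(ref_seq[cursor:start])  # untouched reference bases
        out.append(alt)                    # substituted allele
        cursor = start + len(ref)
    out.append(ref_seq[cursor:])
    return "".join(out)


if __name__ == "__main__":
    reference = "TCACACAG"
    # Two ways of deleting one "CA" unit from the repeat.
    left_aligned = [(1, "TCA", "T")]
    right_shifted = [(5, "ACA", "A")]
    same = (apply_variants(reference, left_aligned)
            == apply_variants(reference, right_shifted))
    print("identical consensus:", same)
```

Two call sets whose records differ line by line can still produce the same string here, which is the situation described in this issue.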

I'm assuming this is a result of the way the SynDip calls were generated, in particular because the merging process was not haplotype-aware. I guess it's not technically wrong, but it is somewhat confusing and does throw off comparison tools. For example, if I try to evaluate the following calls

chr1	2106468	.	CAGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGTAACTCTCAGCAAGCCCCTCTGGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGTAACTCTCAGCAA	C	421.84	PASS	.	GT	1|1
chr1	2106577	.	GCCCCTCCA	G	314.04	PASS	.	GT	1|1
chr1	2106700	.	C	T	270.54	PASS	.	GT	1|1
chr1	2106719	.	T	C	336.06	PASS	.	GT	1|1
chr1	2106721	.	C	T	336.06	PASS	.	GT	1|1
chr1	2106796	.	TGACTCTTCAGCAGGCCCCTCTGGTGGGCGAGGACCTCCACACGTGTCACCAGGCCAGGC	T	125.96	PASS	.	GT	1|1

(which are yet another representation of the same haplotype) with RTG Tools vcfeval, without the --ref-overlap option, all of the calls are reported as false positives and all of the SynDip calls as false negatives. Adding --ref-overlap does resolve the issue, but IMO it shouldn't be necessary.

Minor command line typo in README

The line:

CHM-eval.kit/rtg -o hs37.sdf hs37.fa   # if you haven't done this before

should be:

CHM-eval.kit/rtg format -o hs37.sdf hs37.fa   # if you haven't done this before

best-effort set with actual alleles?

I understand the reason behind simply providing the locations, but in many real-world scenarios, especially in coding regions, it's important to get the exact variant right.

Is there any plan to provide a SynDip set with the REF and ALT alleles inferred from the PacBio data?

hg38 bam available?

Is there an hg38-aligned BAM (Illumina reads) available for download for CHM1, CHM13 and CHM1_CHM13? I've only been able to find the hs37-aligned BAMs.

rep2.37.broad.hc.raw.vcf.gz

Hey, I'm new to bioinformatics.
I have a question: what is the file "rep2.37.broad.hc.raw.vcf.gz" in the release? Is it the truth dataset? If not, how can I get the SynDip dataset with all variants in a single VCF file?
Thank you!

Variants out of range in rep2.37.broad.hc.raw.vcf.gz

Hi Heng,

I noticed that there are variants in rep2.37.broad.hc.raw.vcf.gz that are out of range of their chromosomes. An example is on chromosome 9, position 138395067.
I did not look into this in any further detail.

Is there another VCF you would recommend as the truth set for SynDip? Or would it be best to filter this VCF?

Thanks,
Wouter
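If filtering turns out to be the way to go, a range check might look like the sketch below. It is not part of CHM-eval: the function names are hypothetical, and it assumes the VCF header carries `##contig` lines with `length` attributes (lengths could equally be read from a `.fai` index). Records on contigs with no known length are kept, which is a choice made here for illustration only.

```python
import re


def contig_lengths(header_lines):
    """Collect contig lengths from '##contig=<ID=...,length=...>' header lines."""
    lengths = {}
    for line in header_lines:
        if not line.startswith("##contig="):
            continue
        m_id = re.search(r"ID=([^,>]+)", line)
        m_len = re.search(r"length=(\d+)", line)
        if m_id and m_len:
            lengths[m_id.group(1)] = int(m_len.group(1))
    return lengths


def in_range(record, lengths):
    """Return True if the record's REF allele lies within its contig.

    Contigs missing from `lengths` are treated as in range (assumption).
    """
    fields = record.rstrip("\n").split("\t")
    chrom, pos, ref = fields[0], int(fields[1]), fields[3]
    length = lengths.get(chrom)
    if length is None:
        return True
    return pos >= 1 and pos + len(ref) - 1 <= length


def filter_vcf(lines):
    """Yield header lines unchanged and only the in-range records."""
    lines = list(lines)
    lengths = contig_lengths(l for l in lines if l.startswith("##"))
    for line in lines:
        if line.startswith("#") or in_range(line, lengths):
            yield line
```

For a gzipped file, the lines could be read with `gzip.open(path, "rt")` and passed to `filter_vcf`.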
