jharri34 / krueger

Closed - Moved to new repo

Home Page: https://www.biology.cis.uab.edu/krueger

Shell 0.97% Ruby 0.02% Java 0.37% C 77.94% Makefile 1.20% Roff 7.46% JavaScript 0.90% Perl 7.70% C++ 0.85% M4 1.16% Objective-C 0.37% Scilab 0.07% Lua 0.70% Python 0.19% R 0.08% PHP 0.02%
marine-evolutionary-ecology population-genetics molecular-ecology phycology scientific-communication

krueger's People

Contributors

jharri34, jimmycoder, radamec

Forkers

smk5g5

krueger's Issues

Design Workflow Framework

WDL (Workflow Description Language) will be the framework for our pipeline. All of the command-line invocations of GATK, Picard, Samtools and similar tools will be housed in WDL. We are currently designing the pipeline and deciding how to approach the process.

convert bam to gvcf

HaplotypeCaller
Call germline SNPs and indels via local re-assembly of haplotypes
Overview
The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper.

In the so-called GVCF mode, used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per sample to generate an intermediate genomic VCF (gVCF), which can then be used for joint genotyping of multiple samples very efficiently. This enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes (e.g. the 92K exomes of ExAC).

In addition, HaplotypeCaller can handle non-diploid organisms as well as pooled experiment data. Note, however, that the algorithms used to calculate variant likelihoods are not well suited to extreme allele frequencies (relative to ploidy), so its use is not recommended for somatic (cancer) variant discovery. For that purpose, use MuTect2 instead.

Finally, HaplotypeCaller is also able to correctly handle the splice junctions that make RNAseq a challenge for most variant callers.

Input
Input bam file(s) from which to make calls

Output
Either a VCF or gVCF file with raw, unfiltered SNP and indel calls. Regular VCFs must be filtered either by variant recalibration (best) or hard-filtering before use in downstream analyses. If using the reference-confidence model workflow for cohort analysis, the output is a GVCF file that must first be run through GenotypeGVCFs and then filtering before further analysis.

Usage examples
Single-sample GVCF calling on DNAseq (for -ERC GVCF cohort analysis workflow)
java -jar GenomeAnalysisTK.jar \
    -R reference.fasta \
    -T HaplotypeCaller \
    -I sample1.bam \
    --emitRefConfidence GVCF \
    [--dbsnp dbSNP.vcf] \
    [-L targets.interval_list] \
    -o output.raw.snps.indels.g.vcf
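The Output section notes that gVCFs must be run through GenotypeGVCFs before filtering. A minimal sketch of that follow-up step is shown here, assuming GATK 3.x (the same GenomeAnalysisTK.jar as above). The function name genotype_gvcfs_cmd and all file names are hypothetical, and the function only prints the command so it can be reviewed before it is run.

```shell
#!/bin/sh
# Print (do not run) a GATK 3.x GenotypeGVCFs command that joint-genotypes
# every per-sample gVCF passed as an argument.
# Usage: genotype_gvcfs_cmd <reference.fasta> <output.vcf> <s1.g.vcf> [s2.g.vcf ...]
# Assumption: paths contain no spaces (the command is built as a plain string).
genotype_gvcfs_cmd() {
    ref=$1
    out=$2
    shift 2
    cmd="java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R $ref"
    for gvcf in "$@"; do
        cmd="$cmd --variant $gvcf"   # one --variant per sample gVCF
    done
    printf '%s\n' "$cmd -o $out"
}

# Hypothetical two-sample cohort
genotype_gvcfs_cmd reference.fasta cohort.raw.vcf sample1.g.vcf sample2.g.vcf
```

The resulting VCF still holds raw calls, so the filtering requirement described under Output applies to it as well.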
Caveats
We have not yet fully tested the interaction between the GVCF-based calling or the multisample calling and the RNAseq-specific functionalities. Use those in combination at your own risk.
Many users have reported issues running HaplotypeCaller with the -nct argument, so we recommend using Queue to parallelize HaplotypeCaller instead of multithreading.
Special note on ploidy
This tool is able to handle almost any ploidy (except very high ploidies in large pooled experiments); the ploidy can be specified using the -ploidy argument for non-diploid organisms.
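
As an illustration of the -ploidy argument, the sketch below prints a single-sample GVCF command with an explicit ploidy. It assumes GATK 3.x, where the default ploidy is 2. The function name haplotypecaller_cmd, the reference path and the sample file name are hypothetical.

```shell
#!/bin/sh
# Print (do not run) a HaplotypeCaller GVCF-mode command with an explicit ploidy.
# Usage: haplotypecaller_cmd <sample.bam> [ploidy]
# Assumption: reference.fasta is a placeholder for the project's reference.
haplotypecaller_cmd() {
    bam=$1
    ploidy=${2:-2}                       # fall back to diploid when not given
    sample=$(basename "$bam" .bam)       # output is named after the input BAM
    printf '%s\n' "java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I $bam --emitRefConfidence GVCF -ploidy $ploidy -o $sample.g.vcf"
}

# Hypothetical haploid sample
haplotypecaller_cmd haploid_sample.bam 1
```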

convert ubam -> bam

The second step of the workflow marks adapter sequences, e.g. those arising from read-through of short inserts, using MarkIlluminaAdapters so that they contribute minimally to alignments and the aligner can map otherwise unmappable reads. The third step pipes three processes to produce the final BAM. Piping SamToFastq, BWA-MEM and MergeBamAlignment saves time and avoids storing large intermediate FASTQ and SAM files. In particular, MergeBamAlignment merges defined information from the aligned SAM with that of the uBAM to conserve read data and, importantly, generates additional meta information and unifies metadata. The resulting clean BAM is coordinate sorted and indexed.
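
The two steps described above can be sketched roughly as follows. This is a sketch under assumptions, not the exact commands used in this project: the function name align_cmds, the derived file names and the thread count are hypothetical, and the Picard options are trimmed to a small subset. The reference is assumed to already have a BWA index and a sequence dictionary. The function prints the commands rather than running them.

```shell
#!/bin/sh
# Print (do not run) the adapter-marking command followed by the piped
# SamToFastq | bwa mem | MergeBamAlignment command.
# Usage: align_cmds <unmapped.bam> <reference.fasta> <output.bam>
align_cmds() {
    ubam=$1                                  # uBAM produced by FastqToSam
    ref=$2                                   # reference FASTA
    out=$3                                   # final aligned BAM
    marked=${ubam%.bam}.markadapters.bam     # hypothetical intermediate name
    cat <<EOF
java -Xmx8G -jar picard.jar MarkIlluminaAdapters I=$ubam O=$marked M=${ubam%.bam}.markadapters_metrics.txt
java -Xmx8G -jar picard.jar SamToFastq I=$marked FASTQ=/dev/stdout CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true NON_PF=true |
bwa mem -M -t 4 -p $ref /dev/stdin |
java -Xmx16G -jar picard.jar MergeBamAlignment ALIGNED_BAM=/dev/stdin UNMAPPED_BAM=$ubam OUTPUT=$out R=$ref CREATE_INDEX=true
EOF
}

# Hypothetical file names
align_cmds sample.unmapped.bam reference.fasta sample.aligned.bam
```

Interleaved FASTQ is streamed through /dev/stdout and /dev/stdin, so no intermediate FASTQ or SAM file is written. The -p flag tells bwa mem to treat its input as interleaved pairs.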

convert fastq to .ubam

(A) Convert FASTQ to uBAM and add read group information using FastqToSam

Picard's FastqToSam transforms a FASTQ file into an unmapped BAM. It requires two read group fields and optionally accepts others. In the command below we note which fields are required for GATK Best Practices Workflows. All other read group fields are optional.

# Required for GATK Best Practices: READ_GROUP_NAME (changed from the default of A), SAMPLE_NAME, LIBRARY_NAME
# Recommended: PLATFORM
# FASTQ is the first read file of the pair; FASTQ2 is the second
java -Xmx8G -jar picard.jar FastqToSam \
    FASTQ=6484_snippet_1.fastq \
    FASTQ2=6484_snippet_2.fastq \
    OUTPUT=6484_snippet_fastqtosam.bam \
    READ_GROUP_NAME=H0164.2 \
    SAMPLE_NAME=NA12878 \
    LIBRARY_NAME=Solexa-272222 \
    PLATFORM_UNIT=H0164ALXX140820.2 \
    PLATFORM=illumina \
    SEQUENCING_CENTER=BI \
    RUN_DATE=2014-08-20T00:00:00-0400
Some details on select parameters:

For paired reads, specify each FASTQ file with FASTQ and FASTQ2 for the first read file and the second read file, respectively. Records in each file must be queryname sorted as the tool assumes identical ordering for pairs. The tool automatically strips the /1 and /2 read name suffixes and adds SAM flag values to indicate reads are paired. Do not provide a single interleaved fastq file, as the tool will assume reads are unpaired and the SAM flag values will reflect single ended reads.
For single ended reads, specify the input file with FASTQ.
QUALITY_FORMAT is detected automatically if unspecified.
SORT_ORDER by default is queryname.
PLATFORM_UNIT is often in run_barcode.lane format. Include if sample is multiplexed.
RUN_DATE is in ISO 8601 date format.
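
Because FastqToSam assumes identical ordering in the two files of a pair, a quick sanity check before conversion can catch mismatched inputs. The sketch below is one possible pre-flight check, not part of Picard. The helper names fastq_records, first_read_name and check_pair are hypothetical, and the check assumes uncompressed FASTQ with the common four-lines-per-record layout. It compares record counts and the first read name only, so it does not prove that every record is in the same order.

```shell
#!/bin/sh
# Count FASTQ records, assuming four lines per record.
fastq_records() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Name of the first read: drop the leading '@', any comment after
# whitespace, and a trailing /1 or /2 suffix.
first_read_name() {
    head -n 1 "$1" | sed -e 's/^@//' -e 's/[[:space:]].*$//' -e 's,/[12]$,,'
}

# Succeed only if both files hold the same number of records and
# start with the same read name.
check_pair() {
    [ "$(fastq_records "$1")" -eq "$(fastq_records "$2")" ] &&
        [ "$(first_read_name "$1")" = "$(first_read_name "$2")" ]
}

# Example (file names from the command above):
#   check_pair 6484_snippet_1.fastq 6484_snippet_2.fastq || echo "pair mismatch"
```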
