jharri34 / krueger
Closed - Moved to new repo
Home Page: https://www.biology.cis.uab.edu/krueger
http://gatkforums.broadinstitute.org/gatk/discussion/2908/howto-revert-a-bam-file-to-fastq-format
Revert a BAM file back to FastQ. This comes in handy when you receive data that has been processed but not according to GATK Best Practices, and you want to reset and reprocess it properly.
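A minimal dry-run sketch of the revert step, assuming Picard's RevertSam and SamToFastq are the tools used and that picard.jar sits in the working directory. The function name, file names and output prefix are illustrative and are not taken from the linked how-to. The script only prints the commands so they can be reviewed before anything is run.

```shell
#!/usr/bin/env bash
# Dry run: print the two Picard commands that take a processed BAM back to FASTQ.
# Assumptions (hypothetical): picard.jar is in the working directory and the
# output files are named after a caller-supplied prefix.

revert_commands() {
  local bam="$1" prefix="$2"
  # Step 1: strip alignment information, producing an unmapped BAM.
  printf 'java -jar picard.jar RevertSam I=%s O=%s.reverted.bam\n' "$bam" "$prefix"
  # Step 2: write the reverted reads out as a pair of FASTQ files.
  printf 'java -jar picard.jar SamToFastq I=%s.reverted.bam FASTQ=%s_1.fastq SECOND_END_FASTQ=%s_2.fastq\n' \
    "$prefix" "$prefix" "$prefix"
}

revert_commands processed.bam reads
```

Piping the printed lines to a shell would execute them in order; for single-ended data the SECOND_END_FASTQ argument would be dropped.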
Report findings and differences
Remove aln prefix from .bam files
WDL will be the framework for our pipeline. All of the command-line invocations of GATK, Picard, Samtools and other tools will be housed under WDL. We are currently designing the pipeline and deciding how to approach the process.
HaplotypeCaller
Call germline SNPs and indels via local re-assembly of haplotypes
Overview
The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper.
In the so-called GVCF mode used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per-sample to generate an intermediate genomic VCF (gVCF), which can then be used for joint genotyping of multiple samples in a very efficient way. This enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes (e.g. the 92K exomes of ExAC).
In addition, HaplotypeCaller is able to handle non-diploid organisms as well as pooled experiment data. Note however that the algorithms used to calculate variant likelihoods are not well suited to extreme allele frequencies (relative to ploidy), so its use is not recommended for somatic (cancer) variant discovery. For that purpose, use MuTect2 instead.
Finally, HaplotypeCaller is also able to correctly handle the splice junctions that make RNAseq a challenge for most variant callers.
Input
Input bam file(s) from which to make calls
Output
Either a VCF or gVCF file with raw, unfiltered SNP and indel calls. Regular VCFs must be filtered either by variant recalibration (best) or hard-filtering before use in downstream analyses. If using the reference-confidence model workflow for cohort analysis, the output is a GVCF file that must first be run through GenotypeGVCFs and then filtering before further analysis.
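The joint-genotyping step mentioned above can be sketched as follows. This is a dry run that assumes the same GATK 3 style `java -jar GenomeAnalysisTK.jar -T ...` invocation used elsewhere in this document, with one `--variant` argument per gVCF. The function name and all file names are hypothetical.

```shell
#!/usr/bin/env bash
# Dry run: build a GenotypeGVCFs command from a list of per-sample gVCFs.
# Assumptions (hypothetical): reference.fasta and GenomeAnalysisTK.jar are in
# the working directory; the first argument is the output VCF name.

joint_genotype_command() {
  local out="$1"
  shift
  local cmd="java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta"
  local gvcf
  # Add one --variant argument for each gVCF passed in.
  for gvcf in "$@"; do
    cmd="$cmd --variant $gvcf"
  done
  printf '%s -o %s\n' "$cmd" "$out"
}

joint_genotype_command cohort.raw.vcf sample1.g.vcf sample2.g.vcf
```

The resulting VCF is still raw, so filtering (variant recalibration or hard-filtering) would follow before downstream analysis.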
Usage examples
Single-sample GVCF calling on DNAseq (for -ERC GVCF cohort analysis workflow)
java -jar GenomeAnalysisTK.jar \
    -R reference.fasta \
    -T HaplotypeCaller \
    -I sample1.bam \
    --emitRefConfidence GVCF \
    [--dbsnp dbSNP.vcf] \
    [-L targets.interval_list] \
    -o output.raw.snps.indels.g.vcf
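Because GVCF mode runs once per sample, the command above is typically repeated for every BAM in a cohort. The sketch below prints one such command per input file as a dry run. The function name and the `<sample>.g.vcf` output naming are assumptions made for illustration, and the optional `--dbsnp` and `-L` arguments are left out.

```shell
#!/usr/bin/env bash
# Dry run: print one HaplotypeCaller GVCF command per input BAM.
# Assumptions (hypothetical): GenomeAnalysisTK.jar and reference.fasta are in
# the working directory; each output is named <sample>.g.vcf after its BAM.

gvcf_command() {
  local bam="$1"
  local sample
  # Derive the sample name by dropping the directory and the .bam extension.
  sample="$(basename "$bam" .bam)"
  printf 'java -jar GenomeAnalysisTK.jar -R reference.fasta -T HaplotypeCaller -I %s --emitRefConfidence GVCF -o %s.g.vcf\n' \
    "$bam" "$sample"
}

for bam in sample1.bam sample2.bam; do
  gvcf_command "$bam"
done
```

Each printed line is independent, which is what allows samples to be processed incrementally as they arrive.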
Caveats
We have not yet fully tested the interaction between the GVCF-based calling or the multisample calling and the RNAseq-specific functionalities. Use those in combination at your own risk.
Many users have reported issues running HaplotypeCaller with the -nct argument, so we recommend using Queue to parallelize HaplotypeCaller instead of multithreading.
Special note on ploidy
This tool is able to handle almost any ploidy (except very high ploidies in large pooled experiments); the ploidy can be specified using the -ploidy argument for non-diploid organisms.
The second step of the workflow has you marking adapter sequences, e.g. those arising from read-through of short inserts, using MarkIlluminaAdapters so that they contribute minimally to alignments and allow the aligner to map otherwise unmappable reads. The third step pipes three processes to produce the final BAM. Piping SamToFastq, BWA-MEM and MergeBamAlignment saves time and allows you to bypass storage of larger intermediate FASTQ and SAM files. In particular, MergeBamAlignment merges defined information from the aligned SAM with that of the uBAM to conserve read data, and importantly, it generates additional meta information and unifies metadata. The resulting clean BAM is coordinate sorted and indexed.
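The three-process pipe described above might look roughly like the following dry run. It shows only a minimal subset of options, chosen here as assumptions: interleaved FASTQ written to standard output, adapter clipping driven by the XT tag, `bwa mem -p` reading interleaved pairs from standard input, and MergeBamAlignment creating the index. The function name and all file names are hypothetical, and the memory settings are placeholders.

```shell
#!/usr/bin/env bash
# Dry run: print a SamToFastq | bwa mem | MergeBamAlignment pipe.
# Assumptions (hypothetical): picard.jar is in the working directory, bwa is on
# PATH, and the reference has already been indexed for BWA and Picard.

align_pipe_command() {
  local marked_bam="$1" ubam="$2" ref="$3" out="$4"
  # Stage 1: adapter-marked BAM to interleaved FASTQ on standard output.
  printf 'java -Xmx8G -jar picard.jar SamToFastq I=%s FASTQ=/dev/stdout CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true |\n' \
    "$marked_bam"
  # Stage 2: align the interleaved pairs read from standard input.
  printf '  bwa mem -p %s /dev/stdin |\n' "$ref"
  # Stage 3: merge alignments with the uBAM and write an indexed BAM.
  printf '  java -Xmx8G -jar picard.jar MergeBamAlignment ALIGNED_BAM=/dev/stdin UNMAPPED_BAM=%s OUTPUT=%s R=%s CREATE_INDEX=true\n' \
    "$ubam" "$out" "$ref"
}

align_pipe_command markilluminaadapters.bam unmapped.bam reference.fasta piped.bam
```

Ending each stage with a pipe character lets the shell treat the three printed lines as one command, so no intermediate FASTQ or SAM file is written to disk.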
(A) Convert FASTQ to uBAM and add read group information using FastqToSam
Picard's FastqToSam transforms a FASTQ file into an unmapped BAM. The tool requires two read group fields and accepts other read group fields optionally. In the command below we note which fields are required for GATK Best Practices Workflows. All other read group fields are optional.
# FASTQ is the first read file of the pair; FASTQ2 is the second.
# READ_GROUP_NAME (changed from default of A), SAMPLE_NAME and LIBRARY_NAME
# are required; PLATFORM is recommended.
java -Xmx8G -jar picard.jar FastqToSam \
    FASTQ=6484_snippet_1.fastq \
    FASTQ2=6484_snippet_2.fastq \
    OUTPUT=6484_snippet_fastqtosam.bam \
    READ_GROUP_NAME=H0164.2 \
    SAMPLE_NAME=NA12878 \
    LIBRARY_NAME=Solexa-272222 \
    PLATFORM_UNIT=H0164ALXX140820.2 \
    PLATFORM=illumina \
    SEQUENCING_CENTER=BI \
    RUN_DATE=2014-08-20T00:00:00-0400
Some details on select parameters:
For paired reads, specify each FASTQ file with FASTQ and FASTQ2 for the first read file and the second read file, respectively. Records in each file must be queryname sorted as the tool assumes identical ordering for pairs. The tool automatically strips the /1 and /2 read name suffixes and adds SAM flag values to indicate reads are paired. Do not provide a single interleaved fastq file, as the tool will assume reads are unpaired and the SAM flag values will reflect single ended reads.
For single ended reads, specify the input file with FASTQ.
QUALITY_FORMAT is detected automatically if unspecified.
SORT_ORDER by default is queryname.
PLATFORM_UNIT is often in run_barcode.lane format. Include if sample is multiplexed.
RUN_DATE is in ISO 8601 date format.
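Since the tool assumes identical ordering for pairs, it can help to confirm that ordering before conversion. The sketch below is one possible check, not part of Picard: it assumes uncompressed FASTQ with four lines per record, and the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: confirm two FASTQ files list the same read names in the same order.
# Assumptions (hypothetical): plain-text FASTQ with exactly four lines per record.

read_names() {
  # Print the name from line 1 of every record, without the leading "@",
  # any description after whitespace, or a trailing /1 or /2 suffix.
  awk 'NR % 4 == 1 { name = $1; sub(/^@/, "", name); sub(/\/[12]$/, "", name); print name }' "$1"
}

pairs_in_sync() {
  # Return 0 only when both files yield identical name lists.
  local names1 names2 status
  names1="$(mktemp)"
  names2="$(mktemp)"
  read_names "$1" > "$names1"
  read_names "$2" > "$names2"
  cmp -s "$names1" "$names2"
  status=$?
  rm -f "$names1" "$names2"
  return "$status"
}

# Example call (files not included here):
#   pairs_in_sync 6484_snippet_1.fastq 6484_snippet_2.fastq && echo "pairs are in sync"
```

A non-zero return means the files differ in read count, read names or order, in which case they should be fixed before running FastqToSam.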