jharri34 / krueger

Closed - Moved to new repo

Home Page: https://www.biology.cis.uab.edu/krueger

Shell 0.97% Ruby 0.02% Java 0.37% C 77.94% Makefile 1.20% Roff 7.46% JavaScript 0.90% Perl 7.70% C++ 0.85% M4 1.16% Objective-C 0.37% Scilab 0.07% Lua 0.70% Python 0.19% R 0.08% PHP 0.02%

marine-evolutionary-ecology population-genetics molecular-ecology phycology scientific-communication

krueger's People

Forkers

smk5g5

krueger's Issues

revert bam file fastq

http://gatkforums.broadinstitute.org/gatk/discussion/2908/howto-revert-a-bam-file-to-fastq-format

Revert a BAM file back to FastQ. This comes in handy when you receive data that has been processed but not according to GATK Best Practices, and you want to reset and reprocess it properly.
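
One possible way to script this step is with Picard's SamToFastq. This is a minimal sketch, not necessarily the method the linked forum post uses; the function name, file names and the picard.jar location are assumptions for illustration.

```python
def sam_to_fastq_cmd(bam, fastq1, fastq2=None, picard_jar="picard.jar"):
    """Build a Picard SamToFastq command line as an argument list.

    picard_jar is an assumed path; adjust it to the local install.
    """
    cmd = ["java", "-jar", picard_jar, "SamToFastq",
           f"INPUT={bam}", f"FASTQ={fastq1}"]
    if fastq2 is not None:
        # Second-of-pair reads go to their own file for paired-end data.
        cmd.append(f"SECOND_END_FASTQ={fastq2}")
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True) once Java and Picard are available.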

convert fastq files into .bam files

research wdl

https://software.broadinstitute.org/wdl/userguide/

https://software.broadinstitute.org/wdl/userguide/topic?name=wdl-tutorials

update file path with .bam file path ext

add filepath to each sample into database

research docker and its uses for the project

added fields to database for preprocessing steps file

rename aln.bam files to just .bam

remove aln prefix from .bam files
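
A minimal sketch of the rename, assuming the files are named like sample.aln.bam and should become sample.bam (the exact naming scheme is not stated in the issue, and the function name is hypothetical). It defaults to a dry run so the planned renames can be reviewed first.

```python
from pathlib import Path

def strip_aln_suffix(directory, dry_run=True):
    """Rename '<name>.aln.bam' to '<name>.bam' and return the (old, new) pairs."""
    pairs = []
    for old in sorted(Path(directory).glob("*.aln.bam")):
        new = old.with_name(old.name[: -len(".aln.bam")] + ".bam")
        if new.exists():
            # Refuse to overwrite an existing file.
            raise FileExistsError(new)
        pairs.append((old, new))
    if not dry_run:
        for old, new in pairs:
            old.rename(new)
    return pairs
```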

WDL will be the framework for our pipeline. All of the command-line calls we use to invoke GATK, Picard, Samtools, etc. will be housed in WDL. We are currently designing the pipeline and deciding how to approach the process.

automate generating fai file

add columns representing files state

update .fastq file with .bam

research MuTect

generate dict file

automate generate .bai file
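
The three index-generation tasks in this list (.fai, .dict, .bai) could be automated roughly as follows. This is a sketch that assumes samtools is on the PATH and that picard.jar sits in the working directory; the function name is hypothetical.

```python
from pathlib import Path

def index_commands(reference_fasta, bam, picard_jar="picard.jar"):
    """Return the commands that create the .fai, .dict and .bai files."""
    ref = Path(reference_fasta)
    return {
        # samtools faidx writes <reference>.fai next to the FASTA.
        "fai": ["samtools", "faidx", str(ref)],
        # Picard CreateSequenceDictionary writes the sequence dictionary.
        "dict": ["java", "-jar", picard_jar, "CreateSequenceDictionary",
                 f"R={ref}", f"O={ref.with_suffix('.dict')}"],
        # samtools index writes the BAM index for a coordinate-sorted BAM.
        "bai": ["samtools", "index", str(bam)],
    }
```

Each value is an argument list suitable for subprocess.run(cmd, check=True).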

discuss whether we would want to merge gVCFs for expandability

create variant discovery pipeline

create database update script
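
A rough sketch of what such an update script might look like. The project's database engine and schema are not described here, so SQLite, the samples table and the name, file_path and file_state columns are all assumptions made purely for illustration.

```python
import sqlite3

def add_state_columns(conn):
    """Add hypothetical file-path and file-state columns to a samples table."""
    conn.execute("ALTER TABLE samples ADD COLUMN file_path TEXT")
    conn.execute("ALTER TABLE samples ADD COLUMN file_state TEXT DEFAULT 'fastq'")

def record_bam(conn, sample_name, bam_path):
    """Point a sample at its .bam file; return the number of rows updated."""
    cur = conn.execute(
        "UPDATE samples SET file_path = ?, file_state = 'bam' WHERE name = ?",
        (bam_path, sample_name),
    )
    conn.commit()
    return cur.rowcount
```

A return value of 0 from record_bam signals that the sample name was not found, which the calling script can report instead of failing silently.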

figure out what .sai files are and what to do with them
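
For context: .sai files are the intermediate alignment output of bwa aln, and bwa samse (single-end) or bwa sampe (paired-end) turns them into SAM. A minimal sketch of building that command, with a hypothetical function name and placeholder file names:

```python
def sai_to_sam_cmd(reference, sai_files, fastq_files):
    """Build a bwa samse/sampe command that converts .sai files to SAM.

    bwa writes the SAM records to standard output, so the caller is
    expected to redirect stdout to a file.
    """
    if len(sai_files) != len(fastq_files) or len(sai_files) not in (1, 2):
        raise ValueError("expected one or two matching .sai/.fastq files")
    subcommand = "samse" if len(sai_files) == 1 else "sampe"
    return ["bwa", subcommand, reference, *sai_files, *fastq_files]
```

With subprocess, the output can be captured by opening the target SAM file and passing it as stdout.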

convert bam to gvcf

HaplotypeCaller
Call germline SNPs and indels via local re-assembly of haplotypes
Overview
The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper.

In the so-called GVCF mode used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per-sample to generate an intermediate genomic VCF (gVCF), which can then be used for joint genotyping of multiple samples in a very efficient way. This enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes (e.g. the 92K exomes of ExAC).

In addition, HaplotypeCaller is able to handle non-diploid organisms as well as pooled experiment data. Note however that the algorithms used to calculate variant likelihoods are not well suited to extreme allele frequencies (relative to ploidy), so its use is not recommended for somatic (cancer) variant discovery. For that purpose, use MuTect2 instead.

Finally, HaplotypeCaller is also able to correctly handle the splice junctions that make RNAseq a challenge for most variant callers.

Input
Input bam file(s) from which to make calls

Output
Either a VCF or gVCF file with raw, unfiltered SNP and indel calls. Regular VCFs must be filtered either by variant recalibration (best) or hard-filtering before use in downstream analyses. If using the reference-confidence model workflow for cohort analysis, the output is a GVCF file that must first be run through GenotypeGVCFs and then filtering before further analysis.
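
The GenotypeGVCFs step mentioned above can be sketched as a command builder. This assumes the GATK 3 style command line (GenomeAnalysisTK.jar with -T), matching the usage examples in this issue; the function name and file names are placeholders.

```python
def genotype_gvcfs_cmd(reference, gvcfs, output_vcf,
                       gatk_jar="GenomeAnalysisTK.jar"):
    """Build a GATK 3 GenotypeGVCFs command for one or more per-sample gVCFs."""
    if not gvcfs:
        raise ValueError("at least one gVCF is required")
    cmd = ["java", "-jar", gatk_jar, "-T", "GenotypeGVCFs", "-R", reference]
    for gvcf in gvcfs:
        # One --variant argument per input gVCF.
        cmd += ["--variant", gvcf]
    cmd += ["-o", output_vcf]
    return cmd
```

The resulting VCF still needs filtering before downstream analysis, as noted above.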

Usage examples
Single-sample GVCF calling on DNAseq (for -ERC GVCF cohort analysis workflow)
java -jar GenomeAnalysisTK.jar \
    -R reference.fasta \
    -T HaplotypeCaller \
    -I sample1.bam \
    --emitRefConfidence GVCF \
    [--dbsnp dbSNP.vcf] \
    [-L targets.interval_list] \
    -o output.raw.snps.indels.g.vcf
Caveats
We have not yet fully tested the interaction between the GVCF-based calling or the multisample calling and the RNAseq-specific functionalities. Use those in combination at your own risk.
Many users have reported issues running HaplotypeCaller with the -nct argument, so we recommend using Queue to parallelize HaplotypeCaller instead of multithreading.
Special note on ploidy
This tool is able to handle almost any ploidy (except very high ploidies in large pooled experiments); the ploidy can be specified using the -ploidy argument for non-diploid organisms.
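
For the pipeline, the single-sample GVCF call shown in the usage example could be wrapped in a small builder so that the optional arguments are only added when supplied. This is a sketch using only the arguments that appear in this issue; the function name and default jar path are assumptions.

```python
def haplotype_caller_gvcf_cmd(reference, bam, output_gvcf, dbsnp=None,
                              intervals=None, ploidy=None,
                              gatk_jar="GenomeAnalysisTK.jar"):
    """Build the single-sample HaplotypeCaller command in GVCF mode."""
    cmd = ["java", "-jar", gatk_jar,
           "-R", reference,
           "-T", "HaplotypeCaller",
           "-I", bam,
           "--emitRefConfidence", "GVCF"]
    if dbsnp:
        cmd += ["--dbsnp", dbsnp]
    if intervals:
        cmd += ["-L", intervals]
    if ploidy is not None:
        # Only needed for non-diploid organisms or pooled data.
        cmd += ["-ploidy", str(ploidy)]
    cmd += ["-o", output_gvcf]
    return cmd
```

Given the caveat about -nct, the builder deliberately offers no multithreading option.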

create Data Pre-processing pipeline

convert ubam -> bam

The second step of the workflow has you marking adapter sequences, e.g. those arising from read-through of short inserts, using MarkIlluminaAdapters, so that they contribute minimally to alignments and allow the aligner to map otherwise unmappable reads. The third step pipes three processes to produce the final BAM. Piping SamToFastq, BWA-MEM and MergeBamAlignment saves time and allows you to bypass storage of larger intermediate FASTQ and SAM files. In particular, MergeBamAlignment merges defined information from the aligned SAM with that of the uBAM to conserve read data, and importantly, it generates additional meta information and unifies meta data. The resulting clean BAM is coordinate-sorted and indexed.
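
The three-process pipe in the third step can be sketched as a shell pipeline string assembled in Python. The option set below is a reduced, assumed subset (interleaved FASTQ on stdout, bwa mem reading interleaved pairs from stdin, MergeBamAlignment creating the index); the function name and file names are placeholders, and real runs will likely need more options.

```python
import shlex

def align_pipeline(marked_bam, ubam, reference, output_bam,
                   picard_jar="picard.jar"):
    """Return a shell pipeline: SamToFastq | bwa mem | MergeBamAlignment."""
    sam_to_fastq = ["java", "-jar", picard_jar, "SamToFastq",
                    f"I={marked_bam}", "FASTQ=/dev/stdout",
                    "CLIPPING_ATTRIBUTE=XT", "CLIPPING_ACTION=2",
                    "INTERLEAVE=true", "NON_PF=true"]
    # -p tells bwa mem the input is interleaved paired-end FASTQ.
    bwa_mem = ["bwa", "mem", "-M", "-p", reference, "/dev/stdin"]
    merge = ["java", "-jar", picard_jar, "MergeBamAlignment",
             "ALIGNED_BAM=/dev/stdin", f"UNMAPPED_BAM={ubam}",
             f"OUTPUT={output_bam}", f"R={reference}", "CREATE_INDEX=true"]
    stages = (sam_to_fastq, bwa_mem, merge)
    return " | ".join(shlex.join(stage) for stage in stages)
```

The string is meant for subprocess.run(pipeline, shell=True, check=True). By default only the last stage's exit status is checked, so running it under bash with set -o pipefail is worth considering.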

convert fastq to .ubam

(A) Convert FASTQ to uBAM and add read group information using FastqToSam

Picard's FastqToSam transforms a FASTQ file into an unmapped BAM. The tool itself requires two read group fields and accepts the others optionally. The command below notes which fields are required for GATK Best Practices Workflows. All other read group fields are optional.

# FASTQ is the first read file of the pair, FASTQ2 the second.
# READ_GROUP_NAME (changed from the default of A), SAMPLE_NAME and LIBRARY_NAME are required.
# PLATFORM is recommended.
java -Xmx8G -jar picard.jar FastqToSam \
    FASTQ=6484_snippet_1.fastq \
    FASTQ2=6484_snippet_2.fastq \
    OUTPUT=6484_snippet_fastqtosam.bam \
    READ_GROUP_NAME=H0164.2 \
    SAMPLE_NAME=NA12878 \
    LIBRARY_NAME=Solexa-272222 \
    PLATFORM_UNIT=H0164ALXX140820.2 \
    PLATFORM=illumina \
    SEQUENCING_CENTER=BI \
    RUN_DATE=2014-08-20T00:00:00-0400
Some details on select parameters:

For paired reads, specify each FASTQ file with FASTQ and FASTQ2 for the first read file and the second read file, respectively. Records in each file must be queryname sorted as the tool assumes identical ordering for pairs. The tool automatically strips the /1 and /2 read name suffixes and adds SAM flag values to indicate reads are paired. Do not provide a single interleaved fastq file, as the tool will assume reads are unpaired and the SAM flag values will reflect single ended reads.
For single ended reads, specify the input file with FASTQ.
QUALITY_FORMAT is detected automatically if unspecified.
SORT_ORDER by default is queryname.
PLATFORM_UNIT is often in run_barcode.lane format. Include if sample is multiplexed.
RUN_DATE is in ISO 8601 date format.
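
To run FastqToSam per sample from a script, the parameters above could be assembled as follows. This is a sketch with a hypothetical function name; it treats the three fields marked as required in the command above as mandatory and formats RUN_DATE in the same ISO 8601 style as the example.

```python
from datetime import datetime, timedelta, timezone

def fastq_to_sam_cmd(fastq, output_bam, read_group_name, sample_name,
                     library_name, fastq2=None, platform="illumina",
                     platform_unit=None, run_date=None,
                     picard_jar="picard.jar", java_heap="8G"):
    """Build a Picard FastqToSam command for single or paired FASTQ input."""
    if not (read_group_name and sample_name and library_name):
        raise ValueError("read group name, sample name and library name are required")
    cmd = ["java", f"-Xmx{java_heap}", "-jar", picard_jar, "FastqToSam",
           f"FASTQ={fastq}"]
    if fastq2:
        cmd.append(f"FASTQ2={fastq2}")  # second read file of the pair
    cmd += [f"OUTPUT={output_bam}",
            f"READ_GROUP_NAME={read_group_name}",
            f"SAMPLE_NAME={sample_name}",
            f"LIBRARY_NAME={library_name}"]
    if platform_unit:
        cmd.append(f"PLATFORM_UNIT={platform_unit}")
    if platform:
        cmd.append(f"PLATFORM={platform}")
    if run_date is not None:
        if run_date.tzinfo is None:
            raise ValueError("run_date must carry a UTC offset")
        cmd.append("RUN_DATE=" + run_date.strftime("%Y-%m-%dT%H:%M:%S%z"))
    return cmd
```

For example, a run date of datetime(2014, 8, 20, tzinfo=timezone(timedelta(hours=-4))) reproduces the RUN_DATE value used in the command above.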
