The rna_mutect_1.0-1 from deslauriers-andre

----------
RNA-MuTect
----------

*****************************************
The pipeline uses the following software:
1. samtools (1.2+)
2. bamtools (1.0.2+)
3. PLINK/SEQ (0.08+)
4. zlib (1.2.6+)
5. HiSat2 (2.0.3-beta)
6. MuTect (1.1.6)
7. Matlab (2016a+)
*****************************************

* Example outputs for a TCGA sample are provided 

Input: 
------
1. A matched-normal DNA BAM file
2. RNA BAM file aligned with STAR. Parameters used to run STAR are found in Supplementary Table 14.
3. A MAF file listing inital set of mutations called by MuTect with the -U ALLOW_N_CIGAR_READS flag. Other callers and filters may be used as well as long as the output file is in a MAF format.

* For example MAF see: example_RNA_vs_DNA_somatic_pair_RNA.txt in example_output directory

To run the RNA-MuTect pipeline please follow these steps:
---------------------------------------------------------
1. Realignment step:

  a. sh HiSat_realign_preprocess.sh <Dir> <MAF_file>  <RNA_BAM_file>  <Reference_genome_fasta_format> <case_sample_ID>
  b. sh HiSat_realign_preprocess.sh <Dir> <MAF_file>  <Matched-normal_DNA_BAM_file>  <Reference_genome_fasta_format> <control_sample_ID>
     
  Each of these steps will extract from the corresponding BAM file all the reads that are aligned with the called mutations regions and will write them into fastq files.
  A text file with the path to the new fastq files will be generated.
     
  <Dir> - directory with jar files (FilterSamReads; GenomeAnalysisTK; picard; SamToFastq) This jar files were taken from: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA, 2010 GENOME RESEARCH 20:1297-303 
  <MAF_file> - input MAF file
  <RNA_BAM_file> - BAM file of the RNA sample
  <Reference_genome_fasta_format> - reference genome in a fasta format that matches the input BAM file (hg19)
  <case_sample_ID>/<control_sample_ID> - ID of input sample
     
  Output:
  -------
  1. sample_ID.rna_reads_fastq_list.list - A text file with the path to the two generated fastq files
  Running time: ~1 hour, for ~2000 initially called variants

  c. run the HiSat2 aligner for the case sample with the parameters specified in Supplementary Table 14 and fastq files list generated in (a)
  d. run the HiSat2 aligner for the control sample with the parameters specified in Supplementary Table 14 and fastq files list generated in (b)
  
  Output:
    -------
  1. case_sample_ID.aligned.sorted_by_coord.hisat2.bam - A HiSat2 aligned BAM file for the case sample
  2. control_sample_ID.aligned.sorted_by_coord.hisat2.bam - A HiSat2 aligned BAM file for the control sample
  Running time: ~10 minutes, for ~2000 initially called variants
  
  e. Run MuTect with:
     1. The HiSat BAMs (generated in (c) and (d))
     2. The -U ALLOW_N_CIGAR_READS flag
     3. The --force_output flag
     4. An interval list containing chromosome and position of mutations listed in the input MAF file (format = 'chromosome:position', each mutation in different line)
     
  Output:
  -------
  1. pair_ID.call_stats.txt - A call_stats file which is used as put for the next step.  
  Running time: ~1 hour, for ~2000 initially called variants
  * For example MAF see: example_RNA_vs_DNA_somatic_pair_RNA.call_stats.no_germline.txt in example_output directory. Note that suspected germline variants were moved from this file.
  
     
2. Filtering steps (includes two matlab scripts):    
  a. ./run_FilterRNAMutations.sh <Matlab_Runtime_Location> <pair_ID> <MAF_file> <call_stats_file> <MIN_ALT_COUNT> <PON_THR> <Darned_mat.mat> <Radar_mat.mat> <Exac_mat.mat> <PoN_GTEx>
     <Matlab_Runtime_Location> - path to matlab runtime
     <pair_ID> - ID of the input pair 
     <MAF_file> - input MAF file
     <call_stats_file> - output of Step 1
     <MIN_ALT_COUNT> - the minimal number of reads required supporting the alternate allele (MIN_ALT_COUNT=3 was used in the paper for TCGA samples and MIN_ALT_COUNT=4 for GTEx samples)
     <PON_THR> - A threshold for PoN filtering (PON_THR=-3 was used in the paper)
     <Darned_mat.mat> - input file found in directory (mat_files) - downloaded on September 2015 - Kiran A, O'Mahony J, Sanjeev K, Baranov PV (2013) Darned in 2013: Inclusion of model organisms and linking with Wikipedia. Nucleic Acids Res, Epub ahead of print
     <Radar_mat.mat> - input file found in directory (mat_files)  - downloaded on September 2015 - Ramaswami, G. & Li, J. B. RADAR: a rigorously annotated database of A-to-I RNA editing. Nucleic Acids Res. 42, D109–D113 (2014)
     <Exac_mat.mat> - input file found in directory (mat_files) - downloaded on June 2015 - Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016)
     <PoN_GTEx> - input file that should be retreived from dbGAP. See below for an option to run this step without this file
     
     Output:
     -------
     1. pair_ID.intersect.txt - A MAF file containing variants detcted but both aligners (STAR and HiSat2)
     2. pair_ID.post_filtering.txt - A MAF file containing variants remained after applying various filtering criteria
     3. pair_ID.pre_filtering_plus_info.txt - A MAF file containing all input variants with additional columns per filtering criteria, indicating whether a variant was filtered (1) or not (0) by each criteria
     Running time: ~5 minutes, for ~2000 initially called variants
     * For example MAFs see: (1) example_RNA_vs_DNA_somatic.intersect.txt; (2) example_RNA_vs_DNA_somatic.post_filtering.txt; (3) example_RNA_vs_DNA_somatic.pre_filtering_plus_info.txt, in example_output directory
     
  b. ./run_FilterRNAMutationsBasedOnDuplicateReads <Matlab_Runtime_Location> <pair_ID> <RNA_BAM_file> <post_filtering_MAF_file> <MIN_ALT_COUNT>
     <Matlab_Runtime_Location> - path to matlab runtime
     <pair_ID> - ID of the input pair
     <RNA_BAM_file> - BAM file of the RNA sample
     <post_filtering_MAF_file> - A MAF file with variants remained after applying carious filtering criteria - output (2) of the previous step
     <MIN_ALT_COUNT> - the minimal number of reads required supporting the alternate allele (MIN_ALT_COUNT=3 was used in the paper for TCGA samples and MIN_ALT_COUNT=4 for GTEx samples)
	
     Output:
     -------
     1. pair_ID.post_filtering_remove_duplicates.txt - A MAF file containing variants remained after applying the duplicate reads filtering. This is the final MAF file.
     Running time: ~30 minutes, for ~60 called variants
     * For example MAF see: (1) example_RNA_vs_DNA_somatic.post_filtering_remove_duplicates.txt, in example_output directory
	
*** In case the PoN is not available, (2a) can be run as follows:
    ./run_FilterRNAMutationsNoPoN <Matlab_Runtime_Location> <pair_ID> <MAF_file> <call_stats_file> <MIN_ALT_COUNT> <Darned_mat.mat> <Radar_mat.mat> <Exac_mat.mat>
    <Matlab_Runtime_Location> - path to matlab runtime
    <pair_ID> - ID of the input pair 
    <MAF_file> - input MAF file
    <call_stats_file> - output of Step 1
    <MIN_ALT_COUNT> - the minimal number of reads required supporting the alternate allele (MIN_ALT_COUNT=3 was used in the paper for TCGA samples and MIN_ALT_COUNT=4 for GTEx samples)
    <Darned_mat.mat> - input file found in directory (mat_files)
    <Radar_mat.mat> - input file found in directory (mat_files)
    <Exac_mat.mat> - input file found in directory (mat_files)

    * For example MAFs see: (1) example_RNA_vs_DNA_somatic.intersect_no_pon.txt; (2) example_RNA_vs_DNA_somatic.post_filtering_no_pon.txt; (3) example_RNA_vs_DNA_somatic.pre_filtering_plus_info_no_pon.txt, in example_output directory
deslauriers-andre / rna_mutect_1.0-1 Goto Github PK

rna_mutect_1.0-1's Introduction

rna_mutect_1.0-1's People

Contributors

Stargazers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent