Giter VIP home page Giter VIP logo

rna_mutect_1.0-1's Introduction

----------
RNA-MuTect
----------

*****************************************
The pipeline uses the following software:
1. samtools (1.2+)
2. bamtools (1.0.2+)
3. PLINK/SEQ (0.08+)
4. zlib (1.2.6+)
5. HiSat2 (2.0.3-beta)
6. MuTect (1.1.6)
7. Matlab (2016a+)
*****************************************

* Example outputs for a TCGA sample are provided 

Input: 
------
1. A matched-normal DNA BAM file
2. RNA BAM file aligned with STAR. Parameters used to run STAR are found in Supplementary Table 14.
3. A MAF file listing inital set of mutations called by MuTect with the -U ALLOW_N_CIGAR_READS flag. Other callers and filters may be used as well as long as the output file is in a MAF format.

* For example MAF see: example_RNA_vs_DNA_somatic_pair_RNA.txt in example_output directory

To run the RNA-MuTect pipeline please follow these steps:
---------------------------------------------------------
1. Realignment step:

  a. sh HiSat_realign_preprocess.sh <Dir> <MAF_file>  <RNA_BAM_file>  <Reference_genome_fasta_format> <case_sample_ID>
  b. sh HiSat_realign_preprocess.sh <Dir> <MAF_file>  <Matched-normal_DNA_BAM_file>  <Reference_genome_fasta_format> <control_sample_ID>
     
  Each of these steps will extract from the corresponding BAM file all the reads that are aligned with the called mutations regions and will write them into fastq files.
  A text file with the path to the new fastq files will be generated.
     
  <Dir> - directory with jar files (FilterSamReads; GenomeAnalysisTK; picard; SamToFastq) This jar files were taken from: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA, 2010 GENOME RESEARCH 20:1297-303 
  <MAF_file> - input MAF file
  <RNA_BAM_file> - BAM file of the RNA sample
  <Reference_genome_fasta_format> - reference genome in a fasta format that matches the input BAM file (hg19)
  <case_sample_ID>/<control_sample_ID> - ID of input sample
     
  Output:
  -------
  1. sample_ID.rna_reads_fastq_list.list - A text file with the path to the two generated fastq files
  Running time: ~1 hour, for ~2000 initially called variants

  c. run the HiSat2 aligner for the case sample with the parameters specified in Supplementary Table 14 and fastq files list generated in (a)
  d. run the HiSat2 aligner for the control sample with the parameters specified in Supplementary Table 14 and fastq files list generated in (b)
  
  Output:
    -------
  1. case_sample_ID.aligned.sorted_by_coord.hisat2.bam - A HiSat2 aligned BAM file for the case sample
  2. control_sample_ID.aligned.sorted_by_coord.hisat2.bam - A HiSat2 aligned BAM file for the control sample
  Running time: ~10 minutes, for ~2000 initially called variants
  
  e. Run MuTect with:
     1. The HiSat BAMs (generated in (c) and (d))
     2. The -U ALLOW_N_CIGAR_READS flag
     3. The --force_output flag
     4. An interval list containing chromosome and position of mutations listed in the input MAF file (format = 'chromosome:position', each mutation in different line)
     
  Output:
  -------
  1. pair_ID.call_stats.txt - A call_stats file which is used as put for the next step.  
  Running time: ~1 hour, for ~2000 initially called variants
  * For example MAF see: example_RNA_vs_DNA_somatic_pair_RNA.call_stats.no_germline.txt in example_output directory. Note that suspected germline variants were moved from this file.
  
     
2. Filtering steps (includes two matlab scripts):    
  a. ./run_FilterRNAMutations.sh <Matlab_Runtime_Location> <pair_ID> <MAF_file> <call_stats_file> <MIN_ALT_COUNT> <PON_THR> <Darned_mat.mat> <Radar_mat.mat> <Exac_mat.mat> <PoN_GTEx>
     <Matlab_Runtime_Location> - path to matlab runtime
     <pair_ID> - ID of the input pair 
     <MAF_file> - input MAF file
     <call_stats_file> - output of Step 1
     <MIN_ALT_COUNT> - the minimal number of reads required supporting the alternate allele (MIN_ALT_COUNT=3 was used in the paper for TCGA samples and MIN_ALT_COUNT=4 for GTEx samples)
     <PON_THR> - A threshold for PoN filtering (PON_THR=-3 was used in the paper)
     <Darned_mat.mat> - input file found in directory (mat_files) - downloaded on September 2015 - Kiran A, O'Mahony J, Sanjeev K, Baranov PV (2013) Darned in 2013: Inclusion of model organisms and linking with Wikipedia. Nucleic Acids Res, Epub ahead of print
     <Radar_mat.mat> - input file found in directory (mat_files)  - downloaded on September 2015 - Ramaswami, G. & Li, J. B. RADAR: a rigorously annotated database of A-to-I RNA editing. Nucleic Acids Res. 42, D109โ€“D113 (2014)
     <Exac_mat.mat> - input file found in directory (mat_files) - downloaded on June 2015 - Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285โ€“291 (2016)
     <PoN_GTEx> - input file that should be retreived from dbGAP. See below for an option to run this step without this file
     
     Output:
     -------
     1. pair_ID.intersect.txt - A MAF file containing variants detcted but both aligners (STAR and HiSat2)
     2. pair_ID.post_filtering.txt - A MAF file containing variants remained after applying various filtering criteria
     3. pair_ID.pre_filtering_plus_info.txt - A MAF file containing all input variants with additional columns per filtering criteria, indicating whether a variant was filtered (1) or not (0) by each criteria
     Running time: ~5 minutes, for ~2000 initially called variants
     * For example MAFs see: (1) example_RNA_vs_DNA_somatic.intersect.txt; (2) example_RNA_vs_DNA_somatic.post_filtering.txt; (3) example_RNA_vs_DNA_somatic.pre_filtering_plus_info.txt, in example_output directory
     
  b. ./run_FilterRNAMutationsBasedOnDuplicateReads <Matlab_Runtime_Location> <pair_ID> <RNA_BAM_file> <post_filtering_MAF_file> <MIN_ALT_COUNT>
     <Matlab_Runtime_Location> - path to matlab runtime
     <pair_ID> - ID of the input pair
     <RNA_BAM_file> - BAM file of the RNA sample
     <post_filtering_MAF_file> - A MAF file with variants remained after applying carious filtering criteria - output (2) of the previous step
     <MIN_ALT_COUNT> - the minimal number of reads required supporting the alternate allele (MIN_ALT_COUNT=3 was used in the paper for TCGA samples and MIN_ALT_COUNT=4 for GTEx samples)
	
     Output:
     -------
     1. pair_ID.post_filtering_remove_duplicates.txt - A MAF file containing variants remained after applying the duplicate reads filtering. This is the final MAF file.
     Running time: ~30 minutes, for ~60 called variants
     * For example MAF see: (1) example_RNA_vs_DNA_somatic.post_filtering_remove_duplicates.txt, in example_output directory
	
*** In case the PoN is not available, (2a) can be run as follows:
    ./run_FilterRNAMutationsNoPoN <Matlab_Runtime_Location> <pair_ID> <MAF_file> <call_stats_file> <MIN_ALT_COUNT> <Darned_mat.mat> <Radar_mat.mat> <Exac_mat.mat>
    <Matlab_Runtime_Location> - path to matlab runtime
    <pair_ID> - ID of the input pair 
    <MAF_file> - input MAF file
    <call_stats_file> - output of Step 1
    <MIN_ALT_COUNT> - the minimal number of reads required supporting the alternate allele (MIN_ALT_COUNT=3 was used in the paper for TCGA samples and MIN_ALT_COUNT=4 for GTEx samples)
    <Darned_mat.mat> - input file found in directory (mat_files)
    <Radar_mat.mat> - input file found in directory (mat_files)
    <Exac_mat.mat> - input file found in directory (mat_files)

    * For example MAFs see: (1) example_RNA_vs_DNA_somatic.intersect_no_pon.txt; (2) example_RNA_vs_DNA_somatic.post_filtering_no_pon.txt; (3) example_RNA_vs_DNA_somatic.pre_filtering_plus_info_no_pon.txt, in example_output directory

rna_mutect_1.0-1's People

Contributors

kyizhak avatar

Stargazers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.