The epiteome from seanchen607

INSTALL

Dependencies
    All dependencies must be executable and findable in the user's PATH.

    perl (version 5.x): Generally installed on Linux and macOS by default. Expected to be installed at /usr/bin/perl.

    Perl modules (Bio::SeqIO, Bio::Tools::GFF, Bio::DB::Sam, Set::IntervalTree, Statistics::Descriptive)

    samtools (version 1.3.1 or higher)

    bedtools (version v2.26.0 or higher)

    ngsutils (version 0.5.7 or higher)

    segemehl (version 0.2.0 or higher)

    gzip/bzip2: Generally installed on Linux and macOS by default.
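
    Before a first run it can help to confirm that the executables listed above
    are on the PATH. The following is a minimal sketch, not part of epiTEome.
    The command names in `required` are assumptions: segemehl and ngsutils may
    be installed under different command names on your system, so they are left
    for you to add.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumed command names; extend with the segemehl and ngsutils
    # commands as they are named on your installation.
    required = ["perl", "samtools", "bedtools", "gzip", "bzip2"]
    for name in missing_tools(required):
        print(f"not found on PATH: {name}")
```

    This only checks that a command can be found. It does not check versions
    or Perl modules.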

Install
    epiTEome is a Perl program and does not need to be compiled.
    Make sure it is executable. For convenience, it can be added to your PATH.
    epiTEome assumes your Perl installation is at /usr/bin/perl.

Tips
    Each Perl library can be installed from CPAN, for example `perl -MCPAN -e 'install Bio::SeqIO'`
    ngsutils can be installed using `pip install ngsutils`

Test environment
    This release of epiTEome (v1) was tested on Mac OS X (10.11.6) with perl
    5.18.2, samtools 1.3.1, bedtools v2.26.0, ngsutils 0.5.7 and segemehl 0.2.0.

CITATION

If you use epiTEome in your work, please cite one of the following:

USAGE

INDEX: The reference FASTA file must be indexed in the segemehl index format.
       Prior to indexing, idxEpiTEome.pl masks the 3' edge of the LTR5
       and the 5' edge of the LTR3 to avoid multi-mapping read competition within
       a single TE. Because the two LTRs of a TE are identical at the time of
       insertion, this light masking prevents split-reads that should map at the
       junction between the TE and its flanking DNA from mapping inside the TE.

    Usage: idxEpiTEome.pl -l [max read length] -gff <gff3> -t <target> -ref <fasta>

    <gff3>    TE annotation in gff3 format.
        
    <target>  List of the TE IDs (teid) of interest.

    <ref>     FASTA formatted (.fa, .fna or .fasta) genome file.

    -l        Maximum read length present in FASTQ file.
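
    The masking idea described above can be sketched as follows. This is an
    illustration only and not the code of idxEpiTEome.pl: the function name,
    the use of 'N' as the mask character, the 0-based end-exclusive
    coordinates and the way the mask width is chosen are all assumptions, and
    only a TE on the + strand is handled.

```python
def mask_ltr_edges(seq, ltr5, ltr3, width, mask_char="N"):
    """Mask the 3' edge of LTR5 and the 5' edge of LTR3 of a + strand TE.

    ltr5 and ltr3 are (start, end) pairs, 0-based and end-exclusive
    (an assumed convention for this sketch).
    """
    chars = list(seq)
    # Last `width` bases of LTR5 (its 3' edge), clipped to the LTR itself.
    for i in range(max(ltr5[0], ltr5[1] - width), ltr5[1]):
        chars[i] = mask_char
    # First `width` bases of LTR3 (its 5' edge), clipped to the LTR itself.
    for i in range(ltr3[0], min(ltr3[1], ltr3[0] + width)):
        chars[i] = mask_char
    return "".join(chars)


# Toy sequence: LTR5 at 0..5, internal region at 5..15, LTR3 at 15..20.
print(mask_ltr_edges("AAAAACCCCCGGGGGTTTTT", (0, 5), (15, 20), width=2))
# prints AAANNCCCCCGGGGGNNTTT
```

    With the inner edges of both LTRs masked, a read that ends at one outer
    edge of the TE no longer has an equally good placement at the matching
    inner LTR edge.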


EPITEOME: Identify new TE insertion sites and quantify their methylation level from MethylC-seq datasets.

    Usage: epiTEome.pl [options] -gff <gff3> -t <target> -ref <fasta> -un <fastq>

    <gff3>    TE annotation in gff3 format.
        
    <target>  List of the TE IDs (teid) of interest.

    <ref>     FASTA formatted (.fa, .fna or .fasta) genome file.

    <un>      FASTQ file of reads that failed to map to the reference genome (unmapped reads)

OPTIONS
  EpiTEome Specific Options:
    -chop [integer] : length of the chopped read ends (default 25,30,40).
                      Using several lengths improves epiTEome sensitivity.
    -b    [integer] : number of TEs per batch (default 5000).
    -w    [integer] : window size for methylation metaplot analysis (default 10 bp).

  Alignment Options:
    -E    [integer] : segemehl max E-value (default 5).
    -p    [integer] : number of threads used by segemehl (default 1).
                      All other portions of epiTEome are single-threaded.
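
  To give an idea of what chopping read ends at several lengths means, here is
  a rough sketch. It is not taken from epiTEome.pl: the function name and the
  rule of skipping lengths longer than half the read are assumptions made for
  illustration.

```python
def chop_read_ends(read, lengths=(25, 30, 40)):
    """Yield (length, five_prime_end, three_prime_end) for each usable length."""
    for n in lengths:
        # Assumed rule: skip lengths that would make the two ends overlap.
        if 2 * n > len(read):
            continue
        yield n, read[:n], read[-n:]


# A made-up 60 nt read: only the 25 and 30 nt ends are produced.
for n, head, tail in chop_read_ends("A" * 30 + "C" * 30):
    print(n, head, tail)
```

  Shorter ends can be placed even when the junction lies close to one end of
  the read, while longer ends are more specific, which is one plausible reason
  why combining several lengths helps sensitivity.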

OUTPUT
  epiTEome outputs 4 files: .newInsertionSite.tab, .newInsertionSite.sam, .met.meta.tab and .met.row.tab

GFF3 INPUT FILE FORMAT

The GFF3 input file follows the standard GFF3 format, except for columns 3 and 9, which require specific tags.
TE annotated features (mandatory):
    - Column 3 (type) must be 'te'.
    - Column 9 (attributes) must contain the following attributes: ID=teid, sF=superfamily_name, fam=family_name.
LTR annotated features (optional):
    - Column 3 (type) must be LTR5 or LTR3.
    - Column 9 (attributes) must contain the tag Parent=teid.
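
The rules above can be checked before running epiTEome with a small script
such as the sketch below. It is not part of epiTEome. It assumes that
attributes in column 9 are separated by semicolons, as in standard GFF3, and
it leaves feature types other than te, LTR5 and LTR3 unchecked. The example
values (Chr2, TE1, Gypsy, ATGP1) are made up.

```python
REQUIRED_TE_ATTRS = ("ID", "sF", "fam")


def parse_attributes(column9):
    """Turn 'ID=x;sF=y' into {'ID': 'x', 'sF': 'y'}."""
    pairs = (item.split("=", 1) for item in column9.strip().split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def check_gff3_line(line):
    """Return a list of problems found in one GFF3 feature line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return [f"expected 9 columns, found {len(fields)}"]
    feature_type, attrs = fields[2], parse_attributes(fields[8])
    if feature_type == "te":
        return [f"missing attribute {a}" for a in REQUIRED_TE_ATTRS if a not in attrs]
    if feature_type in ("LTR5", "LTR3"):
        return [] if "Parent" in attrs else ["missing attribute Parent"]
    # Other feature types are not described above, so they are not checked.
    return []


line = "Chr2\t.\tte\t100\t900\t.\t+\t.\tID=TE1;fam=ATGP1"
print(check_gff3_line(line))
```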

OUTPUT FILE FORMAT

.newInsertionSite.tab: coordinates of non-reference TEs
    1.  chrom - name of the chromosome or scaffold
    2.  chromStart - start position of the feature containing the new insertion site (0-based)
    3.  chromEnd - end position of the feature containing the new insertion site
    4.  name - feature name
    5.  mapping type [uniq|multi] - whether the feature was identified using split-reads that map uniquely
        to the reference sequence (uniq) or that map to it multiple times (multi)
    6.  strand
    7.  tsdStart - Start of the TSD (target-site duplication)
    8.  tsdEnd - End of the TSD
    9.  nubReads - Number of split-reads aligned to identify this feature
    10. family - TE family name
    11. teid - teid name

.newInsertionSite.sam: standard SAM alignment file displaying the split-read alignment profile. Note that all
                       split-reads that could support a non-reference TE are stored in this file,
                       allowing the user to identify false negative predictions.

.met.row.tab: methylation level at each cytosine position (used for barplot, Figure 4A)
    1. methylation context [CG|CHG|CHH]
    2. location [neo|te] - neo (flanking DNA at the new insertion site) or te (at the TE)
    3. edge [5|3|8] - 5prime (5), 3prime (3), both (8)
    4. nbCm - number of cytosines methylated
    5. nbR - number of reads mapped
    6. name - feature name
    7. teid - teid name
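
One way to summarise these rows is to pool the counts per methylation context
and location. The sketch below assumes that the methylation level of a group
is 100 * sum(nbCm) / sum(nbR); this definition is an assumption for
illustration and may differ from how epiTEome derives its own percentages.
The example rows are made up.

```python
from collections import defaultdict


def methylation_by_context(rows):
    """rows: iterable of (context, location, edge, nbCm, nbR, name, teid)."""
    totals = defaultdict(lambda: [0, 0])
    for context, location, _edge, nb_cm, nb_r, _name, _teid in rows:
        totals[(context, location)][0] += int(nb_cm)
        totals[(context, location)][1] += int(nb_r)
    # Assumed definition: percentage of methylated counts among mapped reads.
    return {key: 100.0 * cm / r for key, (cm, r) in totals.items() if r > 0}


rows = [
    ("CG", "te", "5", "8", "10", "site1", "TE1"),
    ("CG", "te", "3", "2", "10", "site1", "TE1"),
    ("CHH", "neo", "8", "0", "4", "site1", "TE1"),
]
print(methylation_by_context(rows))
```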

.met.meta.tab: processed methylation levels for metaplot analysis (used for metaplot, Figure 4B)
    1. methylation context [CG|CHG|CHH]
    2. location [neo|te] - neo (flanking DNA at the new insertion site) or te (at the TE)
    3. edge: [5|3|8] - 5prime (5), 3prime (3), both (8)
    4. window id
    5. methylation level (%)
    6. confidence interval (95%)
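
To illustrate how per-position values could be grouped into windows of size -w
and summarised with a 95% confidence interval, here is a rough sketch. It is
not the calculation used by epiTEome: the window id as position // window and
the normal-approximation interval (1.96 * standard error) are assumptions.

```python
import math
import statistics


def window_summary(values_by_position, window=10):
    """Map window id -> (mean level, half-width of an approximate 95% CI)."""
    windows = {}
    for pos, level in values_by_position:
        # Assumed window numbering: consecutive bins of `window` bp from 0.
        windows.setdefault(pos // window, []).append(level)
    summary = {}
    for wid, levels in sorted(windows.items()):
        mean = statistics.mean(levels)
        if len(levels) > 1:
            half = 1.96 * statistics.stdev(levels) / math.sqrt(len(levels))
        else:
            half = 0.0  # no spread can be estimated from a single value
        summary[wid] = (mean, half)
    return summary


# Made-up (position, methylation level in %) pairs.
print(window_summary([(0, 10.0), (5, 30.0), (12, 50.0)], window=10))
```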

TEST

Test / demonstration data for epiTEome.
- Step 1: Indexing reference file
   $ idxEpiTEome.pl -l 85 -gff tair10TEs.gff3 -t subteid.lst -fasta Chr2.fasta

- Step 2: Run epiTEome analysis
   $ epiTEome.pl -gff tair10TEs.gff3 -ref Chr2.epiTEome.masked.fasta -un unmapped.fastq -t teid.lst

- Output: epiTEome automatically generates 4 files (unmapped.newInsertionSite.tab, unmapped.newInsertionSite.sam, unmapped.met.meta.tab and unmapped.met.row.tab) in the current working directory ($CWD). To check that epiTEome ran successfully, compare these files with the reference output files in the test folder (refOutput_*).
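
The comparison with the reference output can be scripted. The sketch below is
one possible way to do it and is not shipped with epiTEome. The exact names of
the reference files are not given here beyond the refOutput_* pattern, so the
prefixes are passed in as parameters, and the values used in the example call
are hypothetical.

```python
import filecmp
from pathlib import Path

SUFFIXES = (".newInsertionSite.tab", ".newInsertionSite.sam",
            ".met.meta.tab", ".met.row.tab")


def compare_outputs(run_dir, ref_dir, run_prefix, ref_prefix):
    """Return {suffix: True/False/None}; None means a file is missing."""
    results = {}
    for suffix in SUFFIXES:
        produced = Path(run_dir) / f"{run_prefix}{suffix}"
        expected = Path(ref_dir) / f"{ref_prefix}{suffix}"
        if not (produced.is_file() and expected.is_file()):
            results[suffix] = None
        else:
            # shallow=False compares file contents, not just size and mtime.
            results[suffix] = filecmp.cmp(produced, expected, shallow=False)
    return results


if __name__ == "__main__":
    # Hypothetical directory and prefix names; adjust to your test folder.
    for suffix, same in compare_outputs(".", "test", "unmapped", "refOutput_unmapped").items():
        print(suffix, same)
```

A byte-for-byte comparison may be too strict for the SAM file if its header
records run-specific details, in which case comparing only the alignment
lines is a reasonable alternative.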
