MeX Pipeline

A pipeline for identifying and annotating transposable element (TE) insertions from next-generation sequencing (NGS) data.

MeX Diagram

Pre-requisites

  1. Conda (Miniconda or Anaconda)
  2. Linux
  3. Git

Getting Started

Creating conda environment

git clone https://github.com/RawalTeam/MeX-Pipeline.git
cd MeX-Pipeline
conda env create -f envs/mex.yaml --name mex

Installing additional external dependencies

conda activate mex
conda install mamba -n base -c conda-forge
python install_deps.py --processes 2 --assembly GRCh38 --cachedir ~/.vep
usage: install_deps.py [-h] [-p PROCESSES] [-a ASSEMBLY] [-d CACHEDIR]
                       [-oa ONLY_ASSEMBLY]

optional arguments:
  -h, --help            show this help message and exit
  -p PROCESSES, --processes PROCESSES
                        Number of processes used (default: 2)
  -a ASSEMBLY, --assembly ASSEMBLY
                        Genome assembly ex., GRCh38, GRCh37, and other. See
                        VEP docs (https://www.ensembl.org/info/docs/tools/vep
                        /script/vep_other.html#assembly) (default: GRCh38)
  -d CACHEDIR, --cachedir CACHEDIR
                        VEP Data directory (default: /home/dell/.vep)
  -oa ONLY_ASSEMBLY, --only-assembly ONLY_ASSEMBLY
                        Download Genome assembly ex., GRCh38, GRCh37, and
                        other. See VEP docs (https://www.ensembl.org/info/doc
                        s/tools/vep/script/vep_other.html#assembly) in
                        existing VEP cache directory. Requires config.json in
                        installation directory (default: None)

Adding new human genome assembly into existing VEP cache

  • Requires config.json in the installation directory; the previous step creates this file automatically.
conda activate mex
python install_deps.py --only-assembly GRCh37

Downloading sample data (Human)
Contents

  • Paired-end human NGS read files
  • Genome FASTA for human chromosomes 1, 2, and 3
  • FASTA of the Alu element

Requires 50 GB of disk space.

conda activate mex
python download_example_files.py
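
Because the example data needs about 50 GB, it can help to check free space before starting the download. The following is a minimal sketch, not part of the MeX repository: the helper name `has_free_space` is hypothetical, and it treats 1 GB as 1024**3 bytes, which is an assumption.

```python
import shutil

REQUIRED_GB = 50  # disk space this README states for the example data


def has_free_space(path=".", required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free.

    Hypothetical helper; 1 GB is taken as 1024**3 bytes (an assumption).
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


if has_free_space("."):
    print("Enough free space for the example data.")
else:
    print(f"Less than {REQUIRED_GB} GB free here; the download may not complete.")
```

Run it from the directory where the example files will be stored, since the check applies to the filesystem that holds that path.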

Running MeX Pipeline

conda activate mex
python mex.py \
    --fq1 example/SRR622461_1.filt.fastq \
    --fq2 example/SRR622461_2.filt.fastq \
    --genome example/hg38_chr123.fa \
    --te example/RMRBSeqs_Original_Alu.fasta \
    --outdir example/results \
    --processes 4
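
When several samples need to be processed, the same command can be assembled programmatically. The sketch below only builds the argument list from flags documented in the help text of mex.py; the function name `build_mex_command` is hypothetical and not part of the repository.

```python
import shlex


def build_mex_command(fq1, genome, te, outdir, fq2=None, processes=2, assembly=None):
    """Assemble the argument list for one mex.py run (hypothetical helper)."""
    cmd = [
        "python", "mex.py",
        "--fq1", fq1,
        "--genome", genome,
        "--te", te,
        "--outdir", outdir,
        "--processes", str(processes),
    ]
    if fq2 is not None:
        # Only paired-end data has a second read file.
        cmd += ["--fq2", fq2]
    if assembly is not None:
        # When omitted, mex.py uses its documented default (GRCh38).
        cmd += ["--assembly", assembly]
    return cmd


cmd = build_mex_command(
    fq1="example/SRR622461_1.filt.fastq",
    fq2="example/SRR622461_2.filt.fastq",
    genome="example/hg38_chr123.fa",
    te="example/RMRBSeqs_Original_Alu.fasta",
    outdir="example/results",
    processes=4,
)
# Print the command in a form that can be pasted into a shell.
print(shlex.join(cmd))
```

To launch the run from Python, the list can be passed to `subprocess.run(cmd, check=True)` from the repository root with the mex environment active.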

Help

conda activate mex
python mex.py -h
usage: mex.py -1 FQ1 -g GENOME -te TE -O OUTDIR [-h] [-2 FQ2] [-p PROCESSES]
              [--force] [--annotation ANNOTATION] [--window WINDOW]
              [--min_mapq MIN_MAPQ] [--min_af MIN_AF] [--tsd_max TSD_MAX]
              [--gap_max GAP_MAX] [--keep_files] [--assembly ASSEMBLY]

required arguments:
  -1 FQ1, --fq1 FQ1     FASTQ Read 1 (default: None)
  -g GENOME, --genome GENOME
                        Genome FASTA (default: None)
  -te TE, --te TE       TE FASTA (default: None)
  -O OUTDIR, --outdir OUTDIR
                        Output Directory (default: None)

optional arguments:
  -h, --help            show this help message and exit
  -2 FQ2, --fq2 FQ2     FASTQ Read 2 (default: None)
  -p PROCESSES, --processes PROCESSES
                        Number of processes for multiprocessing (default: 2)
  --force               Rerun entire MeX pipeline (default: False)

ngs_te_mapper2 arguments:
  https://github.com/bergmanlab/ngs_te_mapper2#command-line-help-page

  --annotation ANNOTATION
                        reference TE annotation in GFF3 format (must have
                        'Target' attribute in the 9th column) (default: None)
  --window WINDOW       merge window for identifying TE clusters (default: 10)
  --min_mapq MIN_MAPQ   minimum mapping quality of alignment (default: 20)
  --min_af MIN_AF       minimum allele frequency (default: 0.1)
  --tsd_max TSD_MAX     maximum TSD size (default: 25)
  --gap_max GAP_MAX     maximum gap size (default: 5)
  --keep_files          If provided then all ngs_te_mapper2 intermediate files
                        will be kept (default: False)

Ensembl Variant Effect Predictor (VEP) arguments:
  https://asia.ensembl.org/info/docs/tools/vep/script/vep_options.html#basic

  --assembly ASSEMBLY   Genome assembly ex., GRCh38, GRCh37, and other. See
                        VEP docs (https://www.ensembl.org/info/docs/tools/vep/
                        script/vep_other.html#assembly) (default: GRCh38)

Components of MeX Pipeline

  • fastp
    A tool designed to provide fast all-in-one preprocessing for FASTQ files. It is written in C++ with multithreading support for high performance.

  • FastQC
    FastQC is a program designed to spot potential problems in high-throughput sequencing datasets. It runs a set of analyses on one or more raw sequence files in FASTQ or BAM format and produces a report that summarises the results.

  • ngs_te_mapper2
    ngs_te_mapper2 is a re-implementation of the method for detecting transposable element (TE) insertions from next-generation sequencing (NGS) data originally described in Linheiro and Bergman (2012) PLoS ONE 7(2): e30008. ngs_te_mapper2 uses a three-stage procedure to annotate non-reference TEs as the span of target site duplication (TSD), following the framework described in Bergman (2012) Mob Genet Elements. 2:51-54.

  • Ensembl Variant Effect Predictor (VEP)
    VEP determines the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions.

Inputs

Required

  1. FASTQ file 1 (--fq1, -1)
    Either the Read1 FASTQ file from a paired-end sequencing run, or the FASTQ file from an unpaired sequencing run.

  2. Genome FASTA file (--genome, -g)
    The genome sequence of the reference genome in FASTA format.

  3. TE FASTA file (--te, -te)
    A FASTA file containing a consensus sequence for each TE family.

Optional

  1. FASTQ file 2 (--fq2, -2)
    The Read2 FASTQ file from a paired-end sequencing run.
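
Since the TE FASTA is expected to hold one consensus sequence per family, a quick look at its record names can catch an empty file or a family listed twice. This is a minimal sketch using plain Python rather than any MeX code; the helper names and the two sample records written to the temporary file are made up for illustration.

```python
import os
import tempfile
from collections import Counter


def fasta_record_ids(path):
    """Return the ID (first word after '>') of every record in a FASTA file."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                ids.append(fields[0] if fields else "")
    return ids


def duplicate_ids(ids):
    """Return the IDs that occur more than once, in order of first appearance."""
    return [name for name, count in Counter(ids).items() if count > 1]


# Demonstration on a small made-up file (placeholder sequences, not real consensus data).
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "te_demo.fasta")
    with open(demo, "w") as handle:
        handle.write(">FamilyA illustrative record\nACGTACGT\n>FamilyB\nTTGGCCAA\n")
    ids = fasta_record_ids(demo)
    print("records:", ids)
    print("duplicates:", duplicate_ids(ids))
```

Pointing `fasta_record_ids` at the file passed to `--te` gives the list of family names the run would use.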

Outputs

--- /path/of/outdir
    |_ logs* (various log files)
    |_ outputs
        |_ fastp*
        |_ fastqc*
        |_ ngs_te_mapper2*
        |_ vep*
    |_ config.json (internal configuration file)
    |_ Snakefile (snakemake file)
    |_ workflow.html (snakemake report)

* indicates a directory
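
After a run, the layout above can be checked with a short script. This is a sketch under the assumption that every entry in the tree is present after a successful run; the function name `missing_outputs` is hypothetical, and the entry names are copied from the tree above.

```python
from pathlib import Path

# Entries taken from the output tree in this README.
EXPECTED_DIRS = [
    "logs",
    "outputs/fastp",
    "outputs/fastqc",
    "outputs/ngs_te_mapper2",
    "outputs/vep",
]
EXPECTED_FILES = ["config.json", "Snakefile", "workflow.html"]


def missing_outputs(outdir):
    """Return the expected entries that are absent from `outdir` (hypothetical helper)."""
    root = Path(outdir)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


# An empty list means every expected entry was found.
print(missing_outputs("example/results"))
```

A non-empty result does not say why an entry is missing; the files under logs are the place to look next.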
