
plasmoseq-dualtech's Introduction

A snakemake pipeline for variant calling from P. falciparum short amplicon reads

Motivation

PS: The pipeline is still in its infancy.

We sequenced an Illumina sequencing library on the Oxford Nanopore MinION (ONT) to evaluate the cost of this approach.

  • PCR amplicons from Plasmodium falciparum drug resistance markers (ama1, k13, dhps, dhfr and mdr1) were generated in duplicate.
  • Illumina sequencing libraries were generated using KAPA reagents and KAPA indexes.
  • Finally, ONT sequencing libraries were generated using a single set of ONT adapters and sequenced on the MinION using an R9.4.1 flow cell.
  • Because only one set of ONT adapters was used, the reads cannot be demultiplexed into individual samples, so further analyses were done at the population level.

Below are the project dependencies:

     Package management

  • conda - an open-source package and environment management system that runs on Windows, macOS, and Linux.

     Workflow management

  • snakemake - a workflow management system that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment, together with a clean and modern specification language in python style.

     Bioinformatics tools (packages)

  • fastqc - a quality control tool for high throughput sequence data
  • multiqc - a tool for aggregating bioinformatics analysis reports across many samples and tools
  • porechop - a tool for finding and removing adapters from Oxford Nanopore reads. Adapters on the ends of reads are trimmed off, and when a read has an adapter in its middle, it is treated as chimeric and chopped into separate reads. Porechop performs thorough alignments to effectively find adapters, even at low sequence identity.
  • cutadapt - a tool that finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.
  • bwa - an aligner for short-read alignment (see minimap2 for long-read alignment)
  • bedtools - allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF
  • bcftools - a set of utilities that manipulate variant calls in the Variant Call Format (VCF) and its binary counterpart BCF.
  • snpEff - a genetic variant annotation and effect prediction toolbox
  • SnpSift - a toolbox that allows you to filter and manipulate annotated files.
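
To confirm that the tools listed above are reachable from your shell, a check along these lines may help. It is a minimal sketch: `check_tools` is a hypothetical helper, and the executable names are assumed to match the package names used in this README.

```shell
# Report which of the listed tools are on the PATH.
# Assumption: each executable is named exactly like its package above.
check_tools() {
    status=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            status=1
        fi
    done
    # Non-zero exit status if at least one tool was not found
    return "$status"
}

check_tools conda snakemake fastqc multiqc porechop cutadapt bwa \
    bedtools bcftools snpEff SnpSift || echo "some tools are missing"
```

Most of these tools only become available once the conda environment described under "Running the analysis" has been created and activated.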

Where to start

  • Clone this project onto your computer using Git (installation instructions) with the following command:
    • git clone https://github.com/kevin-wamae/PlasmoSeq-DualTech.git
  • Navigate into the cloned directory using the following command:
    • cd PlasmoSeq-DualTech

Directory structure

  • Below is the default directory structure:
    • config/ - contains the workflow configuration files
    • env/ - contains the Conda environment files
    • input/ - contains fastq, adaptors and genome files
    • output/ - contains the output from the analysis
    • workflow/ - contains the Snakemake script (snakefile) and additional scripts
.
├── LICENSE
├── README.md
├── config
│ └── config.yaml
├── env
│ └── environment.yml
├── input
│ ├── 01_fastq
│ │ ├── file-1_0.fastq.gz
│ │ └── file-2_1.fastq.gz
│ ├── 02_adapters
│ │ ├── illumina-TruSeq-adapters.fasta
│ │ └── illumina-indexes.txt
│ └── 03_genome
│     ├── genome.fasta
│     └── genome_annotations.gff
├── output
└── workflow
    ├── scripts
    │ └── create_snpeff_db.sh
    └── snakefile
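
Before running anything, it can be useful to confirm that the cloned project matches the layout above. The following is a sketch only: `check_layout` is a hypothetical helper, and it checks just the directories shown in the tree, not the individual fastq, adapter or genome files.

```shell
# Check that the directories from the tree above exist under a project root.
# Usage: check_layout [project_root]   (defaults to the current directory)
check_layout() {
    root="${1:-.}"
    status=0
    for dir in config env input/01_fastq input/02_adapters \
               input/03_genome workflow; do
        if [ -d "$root/$dir" ]; then
            echo "ok:      $dir"
        else
            echo "missing: $dir"
            status=1
        fi
    done
    return "$status"
}

check_layout . || echo "layout differs from the README"
```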

Running the analysis

Install conda and execute the following commands:

1 - Create the conda analysis environment and install the dependencies listed in env/environment.yml by running the following command in your terminal:

  • conda env create --file env/environment.yml

2 - Activate the conda environment:

  • PS - This needs to be done every time you want to execute this pipeline:
  • conda activate ampseq-analysis

3 - Create the snpEff database by executing the bash script below. This script will download P. falciparum genome files from PlasmoDB and create a snpEff database:

  • PS - for this analysis, we will use genome data release-51 from PlasmoDB, and we only need to run it once:
  • bash workflow/scripts/create_snpeff_db.sh

4 - Finally, execute the whole Snakemake pipeline by running the following command in your terminal:

  • PS - Replace 4 in the command with the number of CPUs you wish to use
  • snakemake -c4

5 - Alternatively, you can execute a specific rule by running the following command in your terminal:

  • PS - Replace rule in the command with the respective rule name from workflow/snakefile
  • snakemake -c4 rule (for example snakemake -c4 qc_raw_files)
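
Before a full run, Snakemake can print the jobs it would execute without running them, using its `--dry-run` (`-n`) option. The wrapper below is a small sketch of that idea. `preview_pipeline` is a hypothetical helper name, not part of this repository, and it assumes it is called from the project root with the conda environment active.

```shell
# Show what Snakemake would run, without executing any job.
# Usage: preview_pipeline [cores] [rule ...]
preview_pipeline() {
    cores="${1:-4}"
    # Drop the cores argument so that "$@" holds only the optional rule names
    if [ "$#" -gt 0 ]; then
        shift
    fi
    snakemake --dry-run --cores "$cores" "$@"
}

# Examples (uncomment to run):
# preview_pipeline 4                # preview the whole pipeline
# preview_pipeline 4 qc_raw_files   # preview a single rule
```

If the dry run lists the jobs you expect, run the same command without `--dry-run`, as in steps 4 and 5 above.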

Expected output

Below is the expected directory structure of the output/ directory:

  • 01_snpeff_database/ - contains the snpEff database used for variant annotation
  • 02_qc_raw/ - contains the fastqc QC reports from the raw fastq files
  • 03_multiqc_raw/ - contains the aggregated fastqc QC reports
  • 04_trim_fastq_ont/ - contains fastq files after trimming ONT adaptors
  • 05_trim_fastq_illumina/ - contains fastq files after trimming Illumina adaptors
  • 06_qc_trimmed_files/ - contains the fastqc QC reports from the fastq files after quality trimming
  • 07_read_mapping/ - contains genome mapping files (index, bam and bed)
  • 08_variant_calling/ - contains variant calling files
output/
├── 01_snpeff_database
│   ├── P.falciparum
│   └── genomes
├── 02_qc_raw
├── 03_multiqc_raw
│   └── multiqc_data
├── 04_trim_fastq_ont
├── 05_trim_fastq_illumina
├── 06_qc_trimmed_files
├── 07_read_mapping
│   └── genomeIndex
└── 08_variant_calling
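
After a run, a quick way to see which stages produced files is to count the files in each output directory. This is a sketch under the assumption that the directory names match the list above. `summarize_outputs` is a hypothetical helper.

```shell
# Count the files in each expected output directory.
# Usage: summarize_outputs [output_dir]   (defaults to ./output)
summarize_outputs() {
    out="${1:-output}"
    for dir in 01_snpeff_database 02_qc_raw 03_multiqc_raw \
               04_trim_fastq_ont 05_trim_fastq_illumina \
               06_qc_trimmed_files 07_read_mapping 08_variant_calling; do
        if [ -d "$out/$dir" ]; then
            # tr strips the padding that some wc implementations add
            n=$(find "$out/$dir" -type f | wc -l | tr -d ' ')
            echo "$dir: $n file(s)"
        else
            echo "$dir: not created yet"
        fi
    done
}

summarize_outputs output
```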
