mbovpan

Mbovpan is a Nextflow bioinformatics pipeline for Mycobacterium bovis pangenome analysis. The goal of Mbovpan is to make insights from M. bovis genomics easily accessible to researchers.

Mbovpan can be run in three modes: SNP mode, which infers only single nucleotide polymorphisms; PAN mode, which assesses gene presence and absence; or ALL mode, which runs the full analysis (default). The figure below shows the main workflow implemented by mbovpan.

[Figure: mbovpan workflow]

Installation

First, download the pipeline with the following git command.

$git clone https://github.com/salvadorlab/mbovpan.git

We encourage the user to add the 'mbovpan' directory to the PATH variable. This makes the pipeline easier to run because the absolute path of the 'mbovpan.nf' file no longer needs to be specified.

#if NOT in the directory, give the absolute path
export PATH=$PATH:/path/to/mbovpan

#if already in the directory
export PATH=$PATH:$(pwd)

After downloading, create the mbovpan environment, which makes it possible to run the pipeline.

#create the environment with mamba as the first package, then activate
$conda create -p </path/to/>mbovpan 
$conda activate </path/to/>mbovpan
(</path/to/>mbovpan)$conda install -c conda-forge mamba
(</path/to/>mbovpan)$conda install -c bioconda -c conda-forge nextflow=22.10.6
(</path/to/>mbovpan)$conda install -c bioconda -c conda-forge pandas
(</path/to/>mbovpan)$conda install -c bioconda -c conda-forge panaroo

#now test that everything installed correctly with a simple help command
(</path/to/>mbovpan)$ nextflow run mbovpan --help 

Quickstart

Downloading example sequence data

#create a new directory that will house the NGS files, then navigate into it
(</path/to/>mbovpan)$mkdir mbovis_input
(</path/to/>mbovpan)$cd mbovis_input

#using the SRA Toolkit, download 5 M. bovis sequences isolated from United Kingdom badgers
(</path/to/>mbovpan)$fasterq-dump --verbose --split-3 SRR10482974
(</path/to/>mbovpan)$fasterq-dump --verbose --split-3 SRR10482944
(</path/to/>mbovpan)$fasterq-dump --verbose --split-3 ERR11893527
(</path/to/>mbovpan)$fasterq-dump --verbose --split-3 SRR23174187
(</path/to/>mbovpan)$fasterq-dump --verbose --split-3 SRR23174144

#once the downloads are complete, return to the parent directory
(</path/to/>mbovpan)$cd ../

Example run

(</path/to/>mbovpan)$nextflow run mbovpan --input ./mbovis_input/ --run snp --output ./ 

In this command, 'nextflow run' tells Nextflow to read and execute the pipeline instructions in 'mbovpan': it looks in the mbovpan directory and runs the workflow defined in 'main.nf'.

With the test data downloaded, the parameter '--input ./mbovis_input/' points the pipeline at the input directory, which is checked for paired-end sequence files before the pipeline starts. The raw sequence data are analyzed to generate the spoligotype pattern of each isolate; if the pattern does not match M. bovis, the isolate is filtered out of the analysis.
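The paired-end requirement can also be checked by hand before launching a run. The following is a minimal sketch, not mbovpan's own implementation: the function name `list_unpaired` is hypothetical, and it assumes the files are named `<sample>_1.fastq` and `<sample>_2.fastq`, which is how `fasterq-dump --split-3` names paired reads.

```shell
# list_unpaired DIR
# Print the name of every sample in DIR that is missing one of its two mates.
# Assumes <sample>_1.fastq / <sample>_2.fastq naming (an assumption, see above).
list_unpaired() {
  dir="$1"
  for f in "$dir"/*_1.fastq "$dir"/*_2.fastq; do
    [ -e "$f" ] || continue              # the glob matched nothing
    sample=$(basename "$f")
    sample=${sample%_[12].fastq}         # strip the mate suffix
    if [ ! -e "$dir/${sample}_1.fastq" ] || [ ! -e "$dir/${sample}_2.fastq" ]; then
      echo "$sample"
    fi
  done | sort -u
}
```

Running `list_unpaired ./mbovis_input` prints nothing when every sample has both mates.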

'--run snp' specifies which analysis mode mbovpan uses. 'snp' mode maps the paired-end reads to the reference genome, while 'pan' mode assembles genomes de novo. If no option is supplied, the pipeline runs both.

'--output ./' specifies where the output directory "mbovpan_results" will be created.

Additional Usages

# boost the number of threads utilized
(</path/to/>mbovpan)$nextflow run mbovpan/mbovpan.nf --input ./mbovpan/seqs/ --run snp --output ./ --threads 16

# adjust the minimum SNP quality and minimum SNP depth required
(</path/to/>mbovpan)$nextflow run mbovpan/mbovpan.nf --input ./mbovpan/seqs/ --run snp --output ./ --qual 20 --depth 20

# use most of the parameters that mbovpan offers to analyze the pangenome
(</path/to/>mbovpan)$nextflow run mbovpan/mbovpan.nf --input ./mbovpan/seqs/ --output ./ --qual 100 --depth 5 --mapq 50 --threads 30 --run pan

Inputs

mbovpan requires paired-end FASTQ files originating from Mycobacterium bovis as input. mbovpan runs the tool SpoTyping, which uses the reads to deduce which member of the Mycobacterium tuberculosis complex (MTBC) the sequences originate from.

Outputs

The descriptions below are taken from the mbovpan manuscript.

all mode - Maximum Likelihood Phylogenies

For each mode in Mbovpan, a maximum likelihood phylogeny based on SNPs is produced to visualize the genomic variation among the isolates. If run in ‘pan’ mode, the phylogeny is based on SNPs located in the core genes inferred by Roary. If the pipeline is run in ‘snp’ mode, SNPs are inferred directly from the sequences aligned to the reference genome. When the user supplies metadata, the tips of the tree are annotated with it. Phylogenies are generated using IQ-TREE with 1000 ultrafast bootstrap replicates for nodal support, and the nucleotide substitution model is inferred with the built-in ModelFinder approach (Kalyaanamoorthy et al., 2017; Hoang et al., 2018; Minh et al., 2020).

all mode - Spoligotyping and Lineage Classification

Computing the spoligotype pattern and lineage of M. bovis isolates is often an important part of outbreak investigations and analyses of global variation (Fuente et al., 2015; Zimpel et al., 2017). Mbovpan outputs data tables as CSV files that link each isolate with its spoligotype pattern and lineage, obtained with SpoTyping and TB-Profiler respectively (Xia et al., 2016; Phelan et al., 2019).

snp mode - Read mapping files and Variant Call Format

If the user is interested in SNPs only, mbovpan produces the files associated with read mapping to the integrated reference genome, as well as a version with duplicate reads removed, in BAM format. FreeBayes is then used to infer SNP sites and output the results in a VCF file. The SNPs are subsequently filtered using user-specified thresholds for mapping quality, SNP depth, and SNP quality.
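To illustrate what threshold filtering on a VCF file involves, here is a minimal awk sketch. It is not the filter mbovpan runs internally. The function name `filter_vcf` is hypothetical, and it assumes that depth is stored under the `DP` key of the INFO column and that mapping quality is stored under a key you pass in (default `MQ`; FreeBayes reports mean mapping quality as `MQM`, so that key would be passed for FreeBayes output).

```shell
# filter_vcf MIN_QUAL MIN_DP MIN_MQ [MQ_KEY] < in.vcf > out.vcf
# Keep header lines, and keep records whose QUAL, DP and mapping quality
# all meet the given minimums. Records missing any of the values are dropped.
filter_vcf() {
  awk -F'\t' -v q="$1" -v d="$2" -v m="$3" -v k="${4:-MQ}" '
    /^#/ { print; next }                 # pass header lines through untouched
    {
      dp = ""; mq = ""
      n = split($8, info, ";")           # column 8 is INFO: KEY=VALUE pairs
      for (i = 1; i <= n; i++) {
        split(info[i], kv, "=")
        if (kv[1] == "DP") dp = kv[2]
        if (kv[1] == k)    mq = kv[2]
      }
      if ($6 != "." && dp != "" && mq != "" &&
          $6 + 0 >= q + 0 && dp + 0 >= d + 0 && mq + 0 >= m + 0) print
    }'
}
```

For example, `filter_vcf 20 25 40 MQM < calls.vcf > filtered.vcf` would apply the default thresholds listed in the help message to a hypothetical `calls.vcf`. For multi-allelic sites, where a key may hold comma-separated values, only the leading number is compared.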

pan mode - Accessory Genome PCA and PanGWAS

To further assess the similarity of the accessory genome based on user-provided metadata, mbovpan produces a Principal Component Analysis (PCA) score plot that reduces the high dimensionality of the accessory genome. Given multiple fields of metadata in CSV format as input, one PCA plot is produced per metadata field. Using the same metadata, Mbovpan also produces a genomic variation visualization and a pangenome Genome Wide Association Study (panGWAS), implemented through the software Scoary (Brynildsrud et al., 2016), to check whether accessory genes can discriminate the supplied binary traits. Scoary requires as input the pangenome inferred by mbovpan alongside a traits file that links each trait with an isolate. The analysis outputs genes that have a significant association (based on a Chi-Squared test) with the trait of interest that is not explained by lineage alone. These results are returned as a CSV-formatted table.

pan mode - M. bovis virulence gene presence/absence matrix

If ‘pan’ mode is chosen, after the pangenome is inferred with Roary, Mbovpan takes the gene presence/absence data and keeps only the accessory genes that appear in a predetermined list of M. bovis virulence genes. This produces an output file containing only these genes, plus a visualization showing how patterns of virulence gene presence and absence relate to the user-supplied metadata. The matrix is paired with a hierarchical clustering dendrogram built by applying the Ward clustering algorithm to a Jaccard similarity matrix generated from the virulence gene presence/absence matrix (Ceres et al., 2022).
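The Jaccard similarity between two isolates is the number of genes present in both divided by the number of genes present in at least one of them. The sketch below computes that value for one pair of isolates and covers only this step, not the Ward clustering. It is an illustration under assumptions: `jaccard` is a hypothetical helper, and the input is assumed to be a simple CSV with a header row, gene names in the first column, and 0/1 values in one column per isolate, which may differ from the files mbovpan writes.

```shell
# jaccard FILE COL_A COL_B
# FILE: CSV with a header row, gene names in column 1, 0/1 per isolate column.
# COL_A, COL_B: 1-based column numbers of the two isolates to compare.
jaccard() {
  awk -F',' -v a="$2" -v b="$3" '
    NR == 1 { next }                         # skip the header row
    {
      if ($a == 1 && $b == 1) both++         # gene present in both isolates
      if ($a == 1 || $b == 1) either++       # gene present in at least one
    }
    END {
      if (either == 0) print "NA"            # no gene present in either isolate
      else printf "%.3f\n", both / either
    }' "$1"
}
```

Calling `jaccard matrix.csv 2 3` compares the isolates in columns 2 and 3 of a hypothetical `matrix.csv`; repeating the call over every pair of columns fills in the full similarity matrix.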

Help

    M B O V P A N (v0.1)    
=============================
A pangenomic pipeline for the analysis of
Mycobacterium bovis isolates 


usage: nextflow run mbovpan/mbovpan.nf [options] --input ./path/to/input --output ./path/to/output
  options:
    --run [all|snp|pan]: 
        Specifies in what mode to run mbovpan in [DEFAULT:all]
    --qual [INT]:
        The minimum QUAL score for a SNP to be considered [DEFAULT:20]
    --depth [INT]:
        The minimum DP score for a SNP to be considered [DEFAULT:25]
    --mapq [INT]:
        The minimum MQ score for a SNP to be considered [DEFAULT:40]
    --threads [INT]:
        How many threads to use for the programs [DEFAULT:(number of avail. threads)/2]
    --help
        Prints this help message
    --version
        Prints the current version 
=============================
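The default for --threads in the help message, half of the available threads, can be reproduced in the shell when the value needs to be set explicitly, for example in a job script. This is a minimal sketch: `default_threads` is a hypothetical helper and not part of mbovpan, and it assumes `getconf _NPROCESSORS_ONLN` is available to report the number of online processors.

```shell
# default_threads [TOTAL]
# Print half of TOTAL (rounded down, never less than 1).
# TOTAL defaults to the number of online processors reported by getconf.
default_threads() {
  total="${1:-$(getconf _NPROCESSORS_ONLN)}"
  half=$(( total / 2 ))
  if [ "$half" -lt 1 ]; then
    half=1
  fi
  echo "$half"
}
```

The result can then be passed to the pipeline with `--threads "$(default_threads)"`.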

Authors

  • Noah Legall, Ph. D. (conceptualization)
  • Bella Arenas (maintenance)
  • Liliana C. M. Salvador, Ph. D. (supervisor)

mbovpan's Issues

Put processes in separate code files

Currently the code is all in one master Nextflow file. It would be simpler to edit individual processes if they were in their own files.

  • Put each process, or group of related processes, in a separate file. Connect them based on the user's parameter specification.

Fix post_fastqc

In main.nf, both pre- and post-fastqc are run on the raw, untrimmed reads. spoligo_post contains the raw reads, not the trimmed reads (fastp_reads1) from fastp. Furthermore, fastp_reads1 is not used anywhere else in the file. spoligo_post should be removed from the channels.into in Line 184.

Line 247-248:
input:
tuple file(trim1), file(trim2) from spoligo_post

Fix gene_prab and accessory_pca

I still think these processes should be nextflow scripts, but they should be easier to start and test out. I'll work towards that

Basic Code Changes

  • default - half of available cores, otherwise specify thread number
  • certain output in certain folders
  • NOT CRITICAL, restructure code to be more readable, easier to edit.
  • Output file names cleaner

Convert nextflow code to DSL 2

DSL 2 is the standard for Nextflow code and will remain so in future updates. Best to refactor the code sooner rather than later.

SNP Calling Outputs

When running "snp" mode, we need some basic outputs for non-bioinformatician users.

  • SNP table (VCF summary)
  • ML phylogeny

Software Enhancements

  • replace Roary (outdated) with panaroo
  • test scoary functionality
  • revisit the implementation of coinfinder
  • get rid of warning symbols with params values.

Add Lineage determination

Got TB-Profiler to work; now create a Python script to parse the output and return the lineage of each sample.

mbovpan rewrite + update

Here, I think I have enough data to rewrite the manuscript for submission. Will transition and do work as necessary for the paper

Scripts to fix

  • accessory_pca
  • gene_prab
  • remove pan_curve and just use output from roary

Pangenome Outputs

When using "pan" mode, we need to have general outputs that non-bioinformaticians can use quickly.

  • Pangenome Curve
  • core genome phylogeny
  • Co-occurrence/Dissociation network

Downloading via bioconda

As the subject line suggests, I feel the installation is not user friendly. Adding the pipeline to Bioconda might alleviate some issues.

MultiQC reports

MultiQC can help summarize the quality of data from the SNP calling and Pangenome portions of the pipeline

  • Download BUSCO/QUAST/CheckM
  • Update MultiQC with conditional inputs based on what is available

Core phylogeny + hypothetical protein naming using BLAST

Would be useful to take the core genome phylogeny and annotate the edges with M. bovis genes that are considered virulent for the species.

Step 1) BLAST the pangenome_reference file against the M. bovis genome. Only have unique lines in the output
Step 2) Change the original names to their pseudogene result
Step 3) Filter by matching in the mbovis virulent list
Step 4) plot the accessory genes that pass the filter alongside the core genome phylogeny.

Send to Liliana

  • Download using instructions on README.md
  • Test SNP mode
  • Test PAN mode
  • update README
  • clean up the 'main' branch

Statistics

After analysis is complete, generate master statistic file as a supplementary file.

Quality Checks before running

Would be good to do some basic checks on the input to see if it is usable, at least to ensure that the sequences are actually M. bovis. Make this part of a parameter called 'careful'.

  • add 'spotyping' functionality
  • a quick diagnostic check of whether the input is paired (probably good enough to just ignore the ones that fail?)
  • update the main script

Standardizing Naming throughout the pipeline.

The addition of new scripts and linked metadata means the naming must be deterministic on every run of the pipeline. Clean up the code in mbovpan so that these links can be made automatically without any user polishing.
