
ARGprofiler

A tool for large-scale analysis of antimicrobial resistance genes (ARGs) and their flanking regions in metagenomic datasets.

ARGprofiler pipeline

Introduction

ARGprofiler is a newly developed Snakemake pipeline for analyzing the read distances, abundances, and genomic flanking regions of ARGs in metagenomic sequencing data. It is designed for short-read sequencing datasets. The pipeline also incorporates the recently created PanRes database, a combined collection of current ARG databases, and ARGextender, an assembly tool for extending the genomic flanking regions around genes of interest.

ARGprofiler uses the following tools:

  • fastq-dl for downloading raw reads from ENA
  • fastp for trimming and QC of raw reads
  • KMA for alignment of raw reads against reference databases
  • ARGextender for extracting the genomic flanking regions around ARGs
  • Mash for creating sketches to estimate genetic distances

The tool is described in the publication listed in the Citation section below.

Installation

The best way to install the ARGprofiler pipeline is to clone this GitHub repository. The pipeline uses the Conda package manager to deploy the required software packages in the specified versions without requiring admin or root privileges.

git clone https://github.com/genomicepidemiology/ARGprofiler.git

This command will create the ARGprofiler directory in the current directory.
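
Change into that directory before running the commands below:

cd ARGprofiler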

Since ARGprofiler is a Snakemake pipeline, the user should install the Snakemake workflow management system by following the guide here.
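
For reference, the Snakemake documentation recommends installing it into its own mamba environment, along the lines of the following (check the linked guide for the current instructions):

# create and activate a dedicated Snakemake environment (example; see the Snakemake documentation)
mamba create -c conda-forge -c bioconda -n snakemake snakemake
mamba activate snakemake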

Input

ARGprofiler takes as input a JSON file named input.json in the following format:

{"run_accession":{"type":"READ_TYPE"},"run_accession":{"type":"READ_TYPE"}}

run_accession is the ENA accession of the sequencing dataset, and READ_TYPE can be either PAIRED or SINGLE.

Example:

{"ERR3593315":{"type":"PAIRED"},"SRR7533096":{"type":"SINGLE"}}

The user can also specify a different input file name by editing the with open(...) statement in the Snakefile, which reads the JSON file (see the sketch below).
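
As a rough sketch of what this looks like (the variable names here are illustrative and the actual Snakefile may differ):

# minimal sketch of reading input.json in a Snakefile (illustrative names)
import json

with open("input.json") as handle:
    samples = json.load(handle)  # e.g. {"ERR3593315": {"type": "PAIRED"}, ...}

paired = [acc for acc, meta in samples.items() if meta["type"] == "PAIRED"]
single = [acc for acc, meta in samples.items() if meta["type"] == "SINGLE"]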

For instructions on how to analyze unpublished sequencing reads, see the Tips and Tricks section below.

Running ARGprofiler

The pipeline can be run either on an HPC or locally. For HPC runs, the workflow can be executed using either environment modules or conda packages.

HPC

The user should specify the preferred execution option in the config file: to use conda environments, keep use-conda: True; to use environment modules instead, replace it with use-envmodules: True (see the excerpt below).
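
As an illustration, the relevant lines of the profile config could look as follows (an excerpt only; use-conda and use-envmodules are standard Snakemake options, but the actual contents of the file in profile_argprofiler may differ):

# excerpt of a Snakemake profile config (illustrative)
use-conda: True          # deploy tools through conda environments
# use-envmodules: True   # use this instead when relying on environment modules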

To run ARGprofiler on an HPC with a queuing system, the user should execute the following command:

snakemake --profile profile_argprofiler

Locally

While ARGprofiler was designed to run in an HPC environment (specifically Computerome), it is also possible to run the pipeline locally. To do so, we recommend creating a mamba environment as follows:

mamba env create --name argprofiler --file rules/environment_argprofiler.yaml

Since ARGprofiler is not being executed on an HPC in this case, the user should remove the cluster and cluster-config flags from the config file and add the cores flag, set to the number of cores Snakemake should use (see the example below).
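
For example, the profile config for a local run could be adjusted roughly as follows (illustrative; the flags actually present depend on the provided profile):

# Snakemake profile config for a local run (illustrative)
# cluster: ...          # remove for local execution
# cluster-config: ...   # remove for local execution
cores: 8                # adjust to the number of cores Snakemake may use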

Then activate the environment and run Snakemake:

mamba activate argprofiler
snakemake --profile profile_argprofiler

Output

When successfully executed, ARGprofiler creates a directory named results, where the user can find the results of all analysis steps (results are separated into single-read and paired-read results). More specifically:

  • raw_reads directory contains all the downloaded sequencing datasets.
  • trimmed_reads directory contains all the trimmed sequencing datasets.
  • kma_mOTUs directory contains all the alignment result files with the mOTUs database.
    • .res = A result overview giving the most common statistics for each mapped template.
    • .mapstat = Summarized read abundances.
    • .fsa.gz = The consensus sequences drawn from the alignments (gzipped).
    • .mat.gz = Base counts on each position in each template (zipped).
    • .vcf.gz = Sequence variations.
  • kma_panres directory contains all the alignment result files with the PanRes database.
    • .res = A result overview giving the most common statistics for each mapped template.
    • .mapstat = Summarized read abundances.
    • .mapstat.filtered = Filtered summarized read abundances (Learn more about our filtering scheme here).
    • .bam = Alignment information (compressed).
    • .fsa.gz = The consensus sequences drawn from the alignments (gzipped).
    • .mat.gz = Base counts on each position in each template (zipped).
    • .vcf.gz = Sequence variations.
  • argextender directory contains the extracted genomic flanking regions around ARGs.
    • .fasta = Fasta file with the extracted flanking sequences.
    • .frag.gz = Overview file with the following columns: Contig_seq, Number of matching ARGs, Alignment score, Start pos., End pos., Template name, Contig name.
    • .frag_raw.gz = Similar to .frag.gz but includes all ARGs that can align to any of the contigs. Columns: Read, Number of equally well mapping templates, Mapping score, Start pos., End pos. (w.r.t. template), Chosen template(s).
    • .gfa.gz = Sequence graph.
  • Mash directory contains the mash sketches for each sequencing dataset.

Tips and Tricks

  • For the analysis of unpublished sequencing reads, the user should create a new directory named local_reads and place the sequencing reads (both paired and single) in that directory (see the example commands after this list).

  • The pipeline makes use of Snakemake profiles to specify the configuration of the pipeline. The required flags are specified in the files of the profile_argprofiler directory.

    • We provide a config file for executing the pipeline on an HPC with qsub.
  • To save space, ARGprofiler by default removes the raw and trimmed reads for each sample upon successful completion of all rules.

  • Besides the regular output files, the kma_PanRes rule also outputs a BAM file.

  • ARGextender will, by default, run until there is nothing more to extend.

  • The user should create a logs directory in the main directory; log files for each job will be placed there (see the example commands below).
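
The following commands illustrate the directory setup mentioned in the tips above (the read file names are placeholders):

# create the directory for unpublished reads and copy the FASTQ files into it
mkdir -p local_reads
cp /path/to/sample_R1.fastq.gz /path/to/sample_R2.fastq.gz local_reads/
# create the logs directory expected by the pipeline
mkdir -p logs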

Citation

Martiny, H. M., Pyrounakis, N., Petersen, T. N., Lukjančenko, O., Aarestrup, F. M., Clausen, P. T., & Munk, P. (2024). ARGprofiler—a pipeline for large-scale analysis of antimicrobial resistance genes and their flanking regions in metagenomic datasets. Bioinformatics, btae086. https://doi.org/10.1093/bioinformatics/btae086

Accompanying code and data for ARGprofiler publication

Code

Data

Feedback and issues

We welcome any comments, bug reports, and suggestions, as they will help us improve ARGprofiler. You can leave comments and bug reports in the repository issue tracker or reach out by e-mail to [email protected] or [email protected]
