ALFATClust - ALignment-Free Adaptive Threshold Clustering

Overview

ALFATClust is a biological sequence clustering tool that applies a dynamic threshold to each individual cluster. It is suitable for clustering multiple groups of homologous sequences.

Citation

Chiu, J.K.H., Ong, R.TH. Clustering biological sequences with dynamic sequence similarity threshold. BMC Bioinformatics 23, 108 (2022). https://doi.org/10.1186/s12859-022-04643-9

Sequence file requirements

The input sequence file must:

  1. Consist of either DNA or protein sequences
  2. Be in FASTA format
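
The two requirements above can be checked before a run. The following is a minimal sketch using only the Python standard library; the function names and the DNA alphabet (A, C, G, T, N) are assumptions made for illustration, and this heuristic is not ALFATClust's own input validation.

```python
from typing import Dict

# Assumed nucleotide alphabet for the heuristic below (not taken from ALFATClust)
DNA_LETTERS = set("ACGTN")


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            if not header:
                raise ValueError("empty FASTA header")
            if header in records:
                raise ValueError(f"duplicate FASTA header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[header] += line.upper()
    if not records:
        raise ValueError("no FASTA records found")
    if any(not seq for seq in records.values()):
        raise ValueError("at least one record has no sequence")
    return records


def guess_sequence_type(records: Dict[str, str]) -> str:
    """Return 'DNA' if every residue is in DNA_LETTERS, otherwise 'protein'."""
    letters = set("".join(records.values()))
    return "DNA" if letters <= DNA_LETTERS else "protein"
```

A file that raises `ValueError` here is not valid FASTA in the sense assumed by this sketch; a short protein sequence made up only of A, C, G, T and N would be misreported as DNA.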

Installation

  • Docker
    ALFATClust is available as a Docker package: a Docker image can be built from it and then used to create a Docker container that serves as a virtual environment.

    Step 1: The Docker image can be built via either the repository URL or the local directory:

    • Option 1: Build with repository URL
      The following command builds a Docker image managed by the host Docker engine:

      docker build -t <image name> github.com/phglab/ALFATClust

      The resulting Docker image is named <image name>. Users may choose their own image name, such as "phglab/alfatclust".

    • Option 2: Build locally
      The image can also be built after cloning or downloading the ALFATClust repository to a local directory:

      docker build -t <image name> <path of dockerfile>

      <path of dockerfile> is the path of the directory containing "Dockerfile". If the current directory is the root directory of ALFATClust (i.e. the original parent directory of Dockerfile), just specify "." for it.

    Step 2: Once the image is built, a Docker container can be created from it:

    docker run -it --mount type=bind,src=<path of host data directory>,dst=<path of container data directory> --name <container name> <image name>

    The directory <path of host data directory> on the host machine is mounted into the Docker container as <path of container data directory>. Full paths are recommended. Users may choose their own container name, such as "alfatclust".

    Refer to the Docker documentation on bind mounts for more mounting options.

    Step 3: To start an existing container:

    docker start -ai <container name>

  • Direct execution in host
    The source code of ALFATClust is under the directory "main". Simply copy the contents of the "main" folder to a local folder. Users may consider adding the path of this local folder to the PATH variable. Also, make sure the following tools and libraries are properly installed and can be invoked by ALFATClust. The tested version is indicated in parentheses.

    • Python runtime:

      • Python 3 (3.7.1)
    • Python packages:

      • numpy (1.20.1)
      • scipy (1.6.0)
      • pandas (1.2.2)
      • biopython (1.78)
      • python-igraph (0.8.3)
      • leidenalg (0.8.3)
    • Third-party tools:

      • Mash (2.2.2)
      • MMseqs2 (12.113e3)

Mash [1] can be installed using apt in Ubuntu; an alternative is to download its source code (requires compilation) or binaries from the Mash project site. MMseqs2 [2] is used for pre-clustering only. Make sure both tools are included in the system path.
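
For direct execution, the prerequisites listed above can be checked from Python before the first run. This is a minimal sketch, not part of ALFATClust: the function name is hypothetical, the executable names "mash" and "mmseqs" are assumed to be how the two tools are installed on the system path, and versions are not checked.

```python
import importlib.util
import shutil
from typing import List, Sequence

# Import names of the Python packages listed above
# (biopython is imported as "Bio", python-igraph as "igraph")
PYTHON_MODULES = ("numpy", "scipy", "pandas", "Bio", "igraph", "leidenalg")

# Assumed executable names for Mash and MMseqs2
EXECUTABLES = ("mash", "mmseqs")


def find_missing(modules: Sequence[str] = PYTHON_MODULES,
                 executables: Sequence[str] = EXECUTABLES) -> List[str]:
    """Return the modules that cannot be imported and the tools not found on PATH."""
    missing = [name for name in modules
               if importlib.util.find_spec(name) is None]
    missing += [name for name in executables
                if shutil.which(name) is None]
    return missing


if __name__ == "__main__":
    problems = find_missing()
    if problems:
        print("Missing prerequisites:", ", ".join(problems))
    else:
        print("All prerequisites were found.")
```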

Usage

Command

  • Docker:

    alfatclust [optional arguments] -i <sequence file path> -o <output cluster file path>

  • Direct execution (under the directory containing "alfatclust.py"):

    ./alfatclust.py [optional arguments] -i <sequence file path> -o <output cluster file path>
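
When ALFATClust is called from a pipeline, the command line can be assembled in Python. The sketch below follows the command layout shown above; the helper names are hypothetical, and the file names in the usage note are placeholders.

```python
import subprocess
from typing import List, Sequence


def build_alfatclust_command(seq_path: str,
                             out_path: str,
                             extra_args: Sequence[str] = (),
                             executable: str = "./alfatclust.py") -> List[str]:
    """Assemble the argument list: optional arguments first, then -i and -o."""
    return [executable, *extra_args, "-i", seq_path, "-o", out_path]


def run_alfatclust(seq_path: str,
                   out_path: str,
                   extra_args: Sequence[str] = (),
                   executable: str = "./alfatclust.py") -> None:
    """Run the command and raise CalledProcessError on a non-zero exit status."""
    command = build_alfatclust_command(seq_path, out_path, extra_args, executable)
    subprocess.run(command, check=True)
```

For example, `run_alfatclust("seqs.fa", "clusters.txt", ["-e", "eval.txt"])` would run direct execution with cluster evaluation enabled; inside the Docker container, pass `executable="alfatclust"` instead.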

Mandatory arguments

Argument name                           Description
-i/--input <sequence file path>         input DNA/protein sequence FASTA file path
-o/--output <output cluster file path>  output sequence cluster file path

Optional arguments

Argument name                           Description [default value]
-e/--evaluate <cluster eval file path>  evaluate the clusters and export the evaluation results to <cluster eval file path>
-l/--lower <lower>                      set the lower bound of the sequence distance estimate (resolution parameter) to <lower> [0.75]
-d/--step <step>                        set the step size of the sequence distance estimate range to <step> [0.025]
-p/--precluster                         always run pre-clustering
-k/--kmer <kmer>                        set the Mash kmer size parameter to <kmer> [DNA: 17; protein: 9]
-s/--sketch <sketch>                    set the Mash sketch size parameter to <sketch> [2000]
-m/--margin <margin>                    ignore any Mash distance above 1 - max(<lower> - <margin>, 0) [0.2]
-f/--filter <filter>                    discard a Mash distance when its shared hash ratio is below <filter>, NOT recommended
-t/--thread <thread>                    set the number of threads to <thread> (for Mash and cluster evaluation only) [all available CPU cores]
-S/--seed <seed>                        set the seed value to <seed>
-h/--help                               show help message and exit
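
The cutoff controlled by -m/--margin can be worked through numerically. The sketch below only evaluates the formula 1 - max(<lower> - <margin>, 0) from the table above; the function names are hypothetical and this is not ALFATClust's own code.

```python
def mash_distance_cutoff(lower: float = 0.75, margin: float = 0.2) -> float:
    """Largest Mash distance that is kept: 1 - max(lower - margin, 0)."""
    return 1.0 - max(lower - margin, 0.0)


def keep_distance(distance: float, lower: float = 0.75, margin: float = 0.2) -> bool:
    """Return True unless the distance lies above the cutoff and is ignored."""
    return distance <= mash_distance_cutoff(lower, margin)
```

With the default values (<lower> = 0.75, <margin> = 0.2) the cutoff is about 0.45, so a Mash distance of 0.5 is ignored while 0.4 is kept. When <margin> is larger than <lower>, the max() term is 0 and the cutoff becomes 1, so no distance is ignored.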

Evaluation report

The evaluation report consists of the following columns:

Column name                           Description
Cluster Id                            Cluster Id for the non-singleton cluster
No. of sequences                      Number of sequences in the cluster
Average sequence identity             Cluster average pairwise sequence identity with respect to the center sequence*
Min. sequence identity                Cluster minimum pairwise sequence identity with respect to the center sequence*
Center sequence                       Representative center sequence selected for cluster
Sequence for min. sequence identity   Sequence showing the lowest pairwise sequence identity with the center sequence

*Sequence identity = number of matched nucleotides or amino acids / (alignment length - terminal gaps)
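
The identity definition above can be illustrated on a pair of already-aligned sequences. This is a minimal sketch under stated assumptions: gaps are written as "-", "terminal gaps" is read as the leading and trailing alignment columns in which either sequence has a gap, and the function name is hypothetical. It does not reproduce the alignment step that ALFATClust performs during evaluation.

```python
def sequence_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Matched positions divided by (alignment length - terminal gap columns)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = list(zip(aligned_a, aligned_b))

    # Skip leading and trailing columns in which either sequence has a gap
    start, end = 0, len(columns)
    while start < end and gap in columns[start]:
        start += 1
    while end > start and gap in columns[end - 1]:
        end -= 1

    core = columns[start:end]
    if not core:
        raise ValueError("no aligned columns remain after trimming terminal gaps")

    # Count columns where both sequences carry the same non-gap character
    matches = sum(1 for x, y in core if x == y and x != gap)
    return matches / len(core)
```

For the alignment "--ACGTAC--" against "GGAC-TACTT", the two leading and two trailing columns are terminal gaps, leaving six columns of which five match, so the identity is 5/6.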

Configuration file

The configuration file "settings.cfg" is located in the directory "/usr/local/bin/phglab/alfatclust" inside the Docker container, or in the same host directory as the main Python script "alfatclust.py". It contains the default values of the following parameters, organized into categories:

  • EstimatedSimilarity

    • High: Upper bound of the sequence distance estimate (resolution parameter) range
    • Low: Lower bound of the sequence distance estimate range
    • StepSize: Step size of the sequence distance estimate range during clustering
  • Threshold

    • Precluster: No. of sequences above which pre-clustering is performed to partition sequences
  • DNAMash

    • Kmer: DNA k-mer size for Mash
    • Sketch: DNA sketch size for Mash
  • ProteinMash

    • Kmer: Protein k-mer size for Mash
    • Sketch: Protein sketch size for Mash
  • NoiseFilter

    • Margin: Any pairwise Mash distance d > 1 - max(<lower bound> - <margin>, 0) is regarded as noise and discarded
  • DNAEvaluation (for sequence alignment during DNA cluster evaluation)

    • MatchScore: Score for a nucleotide match
    • MismatchPenalty: Penalty for a nucleotide mismatch
    • GapOpeningPenalty: Penalty for opening a gap
    • GapExtensionPenalty: Penalty for extending a gap
  • ProteinEvaluation (for sequence alignment during protein cluster evaluation)

    • ScoreMatrix: Score matrix used for amino acid matching (refer to Biopython documentation for the built-in score matrices available)
    • GapOpeningPenalty: Penalty for opening a gap
    • GapExtensionPenalty: Penalty for extending a gap
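
The categories and parameters above can be read with Python's standard configparser module. The sketch below assumes that "settings.cfg" uses INI-style sections named after the categories, which is not confirmed here; the sample text is an illustration that contains only parameters whose default values are stated in the optional arguments table, not a copy of the real file.

```python
import configparser

# Illustrative excerpt only; the real settings.cfg may be laid out differently
SAMPLE_SETTINGS = """
[EstimatedSimilarity]
Low = 0.75
StepSize = 0.025

[DNAMash]
Kmer = 17
Sketch = 2000

[ProteinMash]
Kmer = 9
Sketch = 2000

[NoiseFilter]
Margin = 0.2
"""


def load_settings(text: str) -> configparser.ConfigParser:
    """Parse INI-style configuration text into a ConfigParser object."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


settings = load_settings(SAMPLE_SETTINGS)
low = settings.getfloat("EstimatedSimilarity", "Low")
dna_kmer = settings.getint("DNAMash", "Kmer")
protein_kmer = settings.getint("ProteinMash", "Kmer")
margin = settings.getfloat("NoiseFilter", "Margin")
```

To read an actual file instead of the sample text, `parser.read(path)` can be used in place of `read_string`, provided the file follows the assumed layout.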

Sample datasets

The sample datasets are available in the folder sample_datasets, which includes:

  1. Antimicrobial resistance (AMR) gene datasets (data sources: ARG-ANNOT [3], CARD [4-6], and ResFinder [7])

     • argdit_nt_06feb2020_full.fa (DNA)
     • argdit_aa_06feb2020_full.fa (protein)

  2. Non-AMR plasmid gene dataset (data source: PLSDB [8])

     • plasmid_genes_20191017.fa (DNA)

References

[1] Ondov, B. D., et al. (2016). Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17(1), 132.
[2] Steinegger, M. and J. Söding. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35, 1026.
[3] Gupta, S. K., et al. (2014). ARG-ANNOT, a New Bioinformatic Tool To Discover Antibiotic Resistance Genes in Bacterial Genomes. Antimicrobial Agents and Chemotherapy, 58(1), 212-220.
[4] Alcock, B. P., et al. (2019). CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Research, 48(D1), D517-D525.
[5] Jia, B., et al. (2017). CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Research, 45(D1), D566-D573.
[6] McArthur, A. G., et al. (2013). The Comprehensive Antibiotic Resistance Database. Antimicrobial Agents and Chemotherapy, 57(7), 3348-3357.
[7] Zankari, E., et al. (2012). Identification of acquired antimicrobial resistance genes. Journal of Antimicrobial Chemotherapy, 67(11), 2640-2644.
[8] Galata, V., et al. (2018). PLSDB: a resource of complete bacterial plasmids. Nucleic Acids Research, 47(D1), D195-D202.
