
swarm

A robust and fast clustering method for amplicon-based studies.

The purpose of swarm is to provide a novel clustering algorithm that handles massive sets of amplicons. Results of traditional clustering algorithms are strongly input-order dependent, and rely on an arbitrary global clustering threshold. swarm results are resilient to input-order changes and rely on a small local linking threshold d, representing the maximum number of differences between two amplicons. swarm forms stable, high-resolution clusters, with a high yield of biological information.

To help users, we describe a complete pipeline starting from raw fastq files, clustering with swarm and producing a filtered occurrence table.

swarm 3.0 introduces:

  • a much faster default algorithm,
  • a reduced memory footprint,
  • binaries for Windows x86-64, macOS ARM64, GNU/Linux ARM64, and GNU/Linux POWER8,
  • an updated, hardened, and thoroughly tested code (806 carefully crafted black-box tests),
  • strict dereplication of input sequences is now mandatory,
  • --seeds option (-w) now outputs results sorted by decreasing abundance, and then by alphabetical order of sequence labels.

swarm 2.0 introduced several novelties and improvements over swarm 1.0:

  • built-in breaking phase now performed automatically,
  • possibility to output cluster representatives in fasta format (option -w),
  • fast algorithm now used by default for d = 1 (linear time complexity),
  • a new option called fastidious that refines d = 1 results and reduces the number of small clusters.

Common misconceptions

swarm is a single-linkage clustering method, with some superficial similarities to other clustering methods (e.g., Huse et al, 2010). swarm's novelty is its iterative growth process and its use of sequence abundance values to delineate clusters. swarm properly delineates large clusters (high recall), and can distinguish clusters with as few as two differences between their centers (high precision).

swarm uses a local clustering threshold (d), not a global clustering threshold like other algorithms do. Users may be tempted to convert a 97%-global similarity threshold into a number of differences, and to use large d values. This is not a correct use of swarm. Clusters produced by swarm naturally span more than d differences, and tests have shown that the default value (d = 1) gives good results on most datasets. Using the fastidious option further improves the quality of results. For long amplicons or shallow sequencing depths, higher d values can be used (d = 2 or d = 3, very rarely more).

swarm produces high-resolution results, especially when using d = 1. Under certain rare conditions though, a given marker may not evolve fast enough to distinguish molecular taxa. If this affects abundant sequences, swarm may form a cluster with a large radius, whereas classic clustering methods will cut through it arbitrarily, forcing delineation wherever the 97%-threshold falls. So, keep in mind that molecular markers have limitations too.

Quick start

swarm's simplest usage is:

./swarm amplicons.fasta

That command will apply default parameters (-d 1) to the fasta file amplicons.fasta. The fasta file must be formatted as follows:

>seqID1_1000
acgtacgtacgtacgt
>seqID2_25
cgtcgtcgtcgtcgt

where sequence identifiers are unique and end with a value indicating the number of occurrences of the sequence (e.g., _1000). An alternative format is available with the option -z (please see the user manual). swarm requires each fasta entry to include a number of occurrences to work properly; that crucial information is produced during the dereplication step.

Use swarm -h to get a short help, or see the user manual for a complete description of input/output formats and command line options.

The memory footprint of swarm is roughly 0.6 times the size of the input fasta file. When using the fastidious option, memory footprint can increase significantly. See options -c and -y to control and cap swarm's memory consumption.
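For example, here is a sketch capping memory use during fastidious processing (assuming the long option names --ceiling for -c and --bloom-bits for -y, as documented in the user manual; the ceiling value is in megabytes):

./swarm --fastidious --ceiling 4096 amplicons.fasta > amplicons.swarms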

Install

Get the latest binaries for GNU/Linux, macOS or Windows from the release page. Get the source code from GitHub using the ZIP button or git clone, and compile swarm with GCC (version 4.8.5 or more recent) or with clang (version 9 or more recent):

git clone https://github.com/torognes/swarm.git
cd swarm/
make

# or, with clang
make CC="clang-9" CXX="clang++-9"

If you have administrator privileges, you can make swarm accessible for all users. Simply copy the binary ./bin/swarm to /usr/local/bin/ or to /usr/bin/. The man page can be installed this way:

cd ./man/
gzip -c swarm.1 > swarm.1.gz
mv swarm.1.gz /usr/local/share/man/man1/
# or
mv swarm.1.gz /usr/share/man/man1/

Once installed, the man page can be accessed with the command man swarm.

Install with conda

(thanks to GitHub user Gian77 for reporting this procedure)

Assuming you already have a conda set-up (anaconda or miniconda), start by activating an environment with python 3:

conda activate py3

Make sure you have all the necessary channels for the bioconda packages:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

List the different versions of swarm available and install one:

conda search -c bioconda swarm
conda install -c bioconda swarm=3.0.0=hc9558a2_0
swarm --version  # check

Prepare amplicon fasta files

To facilitate the use of swarm, we provide examples of shell commands that can be used to format and check the input fasta file. Be warned that these examples may not be suitable for very large files.

We assume your SFF or FASTQ files have been properly pair-assembled (with vsearch, for example), trimmed of adapters and primers (with cutadapt, for example), and converted to fasta.
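For example, here is a minimal sketch of such a preparation, assuming paired-end FASTQ files and standard vsearch and cutadapt options (primer sequences and file names are placeholders; adjust quality and trimming parameters to your data):

vsearch --fastq_mergepairs R1.fastq --reverse R2.fastq --fastqout merged.fastq
cutadapt -g FWD_PRIMER -a REV_PRIMER --discard-untrimmed -o trimmed.fastq merged.fastq
vsearch --fastq_filter trimmed.fastq --fastq_maxee 1.0 --fastaout amplicons.fasta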

Dereplication (mandatory)

In a sample, or a collection of samples, a given sequence may appear several times. That number of strictly identical occurrences represents the abundance value of the sequence. swarm requires all fasta entries to include abundance values to be able to produce high-resolution clusters, like this:

>seqID1_1000
acgtacgtacgtacgt
>seqID2_25
cgtcgtcgtcgtcgt

where seqID1 has an abundance of 1,000 and seqID2 has an abundance of 25 (alternative formats are possible, please see the user manual).

The role of the dereplication step is to identify, merge and sort identical sequences by decreasing abundance. Here is a command using vsearch v1.3.3 or later:

vsearch \
    --derep_fulllength amplicons.fasta \
    --sizeout \
    --relabel_sha1 \
    --fasta_width 0 \
    --output amplicons_linearized_dereplicated.fasta

The command performs the dereplication, the linearization (--fasta_width 0) and the renaming with hashing values (--relabel_sha1). If you can't or don't want to use vsearch, here is an example using standard command line tools:

grep -v "^>" amplicons_linearized.fasta | \
grep -v [^ACGTacgt] | sort -d | uniq -c | \
while read abundance sequence ; do
    hash=$(printf "${sequence}" | sha1sum)
    hash=${hash:0:40}
    printf ">%s_%d_%s\n" "${hash}" "${abundance}" "${sequence}"
done | sort -t "_" -k2,2nr -k1.2,1d | \
sed -e 's/\_/\n/2' > amplicons_linearized_dereplicated.fasta

Amplicons containing characters other than "ACGT" are discarded. The dereplicated amplicons receive a meaningful unique name (hash values), and are sorted by decreasing number of occurrences and by hash values (to guarantee a stable sorting). The use of a hashing function also provides an easy way to compare sets of amplicons. If two amplicons from two different sets have the same hash code, it means that the sequences they represent are identical.
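For example, since identical sequences receive identical hashes, two dereplicated sets can be compared with standard tools (a sketch; file names are placeholders):

grep "^>" set_A.fasta | cut -d "_" -f 1 | sort > A.ids
grep "^>" set_B.fasta | cut -d "_" -f 1 | sort > B.ids
comm -12 A.ids B.ids    # identifiers of sequences present in both sets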

If for some reason your fasta entries lack abundance values and you still want to run swarm (not recommended), you can use swarm's --append-abundance (-a) option to set a default abundance value to be used whenever abundance information is missing from a sequence.
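For example (a sketch; the value 1 is an arbitrary default abundance):

./swarm --append-abundance 1 amplicons.fasta > amplicons.swarms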

Launch swarm

Here is a typical way to use swarm:

./swarm \
    --fastidious \
    --threads 4 \
    --seeds cluster_representatives.fasta \
    amplicons.fasta > /dev/null

swarm will partition your dataset with the finest resolution (local number of differences d = 1 by default, built-in elimination of potential chained clusters, fastidious processing) using 4 CPU cores. Cluster representatives will be written to a new fasta file; all other results are discarded (/dev/null).

See the user manual for details on swarm's options and parameters.

Frequently asked questions

To facilitate the use of swarm, we provide examples of options and shell commands that can be used to parse swarm's output. We assume that the amplicon fasta file was prepared as described above (linearization and dereplication).

Refine swarm clusters

The chain-breaking, which used to be performed in a second step in swarm 1.0, is now built-in and performed by default. It is possible to deactivate it with the --no-otu-breaking option, but this is not recommended. The fastidious option is recommended when using d = 1, as it reduces the number of small clusters while maintaining a high clustering resolution (see the user manual for a description of the fastidious principle).

Count the number of amplicons per cluster

You might want to check the size distribution of clusters (number of amplicons in each cluster), and count the number of singletons (clusters containing only one amplicon). This is easily done with the --statistics-file filename option. Each line in the output file represents a cluster and provides different metrics. See the manual for a complete description.
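For example, since the first column of the statistics file holds the number of unique amplicons in each cluster (see the description of the -s output in the version 1.1.0 notes below), the size distribution and the number of singletons can be obtained with standard tools (a sketch; file names are placeholders):

./swarm --statistics-file amplicons.stats amplicons.fasta > /dev/null
awk '{print $1}' amplicons.stats | sort -n | uniq -c    # cluster-size distribution
awk '$1 == 1' amplicons.stats | wc -l                   # number of singleton clusters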

Get the seed sequence for each cluster

Subsequent analyses frequently keep only one representative amplicon per cluster (usually the seed) to reduce the computational burden. That operation is easily done with swarm's -w filename option.
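For example (a sketch; representative sequences are written with abundances updated to cluster totals, as noted in the version 2.0.5 entry below):

./swarm --seeds cluster_representatives.fasta amplicons.fasta > amplicons.swarms

Since the first amplicon on each line of the swarms output is the seed, seed names can also be listed with: cut -d " " -f 1 amplicons.swarms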

Get fasta sequences for all amplicons in a cluster

For each cluster, get the fasta sequences for all amplicons. Warning: this loop can generate a very large number of files. To limit the number of files, a test can be added to exclude swarms with fewer than n elements. See the project's wiki for more examples.

INPUT_SWARM="amplicons.swarms"
INPUT_FASTA="amplicons.fasta"
OUTPUT_FOLDER="swarms_fasta"
AMPLICONS=$(mktemp)
mkdir "${OUTPUT_FOLDER}"
while read -r CLUSTER ; do
    # write one fasta header per amplicon in the cluster
    tr " " "\n" <<< "${CLUSTER}" | sed -e 's/^/>/' > "${AMPLICONS}"
    seed=$(head -n 1 "${AMPLICONS}")
    # -x matches complete header lines, avoiding partial identifier matches
    grep -A 1 -F -x -f "${AMPLICONS}" "${INPUT_FASTA}" | \
        sed -e '/^--$/d' > "./${OUTPUT_FOLDER}/${seed/>/}.fasta"
done < "${INPUT_SWARM}"
rm "${AMPLICONS}"

Citation

To cite swarm, please refer to:

Swarm: robust and fast clustering method for amplicon-based studies.
Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M. (2014)
PeerJ 2:e593 doi: 10.7717/peerj.593

Swarm v2: highly-scalable and high-resolution amplicon clustering.
Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M. (2015)
PeerJ 3:e1420 doi: 10.7717/peerj.1420

Swarm v3: towards tera-scale amplicon clustering.
Mahé F, Czech L, Stamatakis A, Quince C, de Vargas C, Dunthorn M, Rognes T. (2021)
Bioinformatics doi: 10.1093/bioinformatics/btab493

Acknowledgments

Many thanks to the following people for their valuable contributions:

  • Lucas Czech
  • Etienne Platini

Contact

You are welcome to ask questions, report bugs, and suggest improvements via the project's GitHub page.

Third-party pipelines

swarm is available in third-party pipelines:

  • FROGS: a Galaxy/CLI workflow designed to produce a cluster count matrix from high depth sequencing amplicon data.
  • LotuS (v1.30): extremely fast cluster building, annotation, phylogeny and abundance matrix pipeline, based on raw sequencer output.
  • QIIME (v1.9): a multi-purpose pipeline for performing microbiome analysis from raw DNA sequencing data.

Alternatives

If you want to try alternative free and open-source clustering methods, a list of links is available on the project's GitHub page.

Roadmap

swarm adheres to semantic versioning 2.0.0:

Given a version number MAJOR.MINOR.PATCH, increment the:

MAJOR version when you make incompatible API changes, MINOR version when you add functionality in a backwards compatible manner, and PATCH version when you make backwards compatible bug fixes.

swarm 3.1.6:

  • use more C++11 and STL features,
  • replace pthreads with std::thread,
  • eliminate most of clang-tidy's warnings,
  • refactor to reduce cyclomatic complexity (simpler and shorter functions),
  • reduce/eliminate linuxisms to improve portability,
  • measure the effect of code modernization on run-time performance

swarm 3.2.0:

  • swarm can be compiled natively on a BSD or a Windows system

swarm 4.0.0:

  • rename option -n to --no-cluster-breaking (API change),
  • faster similarity search? (AVX2, GPU, symmetric deletion lookup algorithm, ...),
  • drop compatibility with GCC 4 and GCC 5 (by late 2024, GCC 8 will have become the de facto standard for HPC centers),
  • start using C++14 and C++17 features

Version history

version 3.1.5

swarm 3.1.5 fixes four minor bugs, improves code and documentation, and eliminates compilation warnings and static analysis warnings:

  • change: when using the fastidious and the ceiling options, minimal ceiling value is now 40 megabytes (instead of 8 megabytes),
  • add: more compilation checks (shadow, useless-cast, conversion, sign-conversion),
  • add: 42 new black-box tests,
  • fix: two minor bugs introduced in version 3.1.1 (alloc-dealloc-mismatch, and allocating too much memory; bugs had no impact on clustering results),
  • fix: a minor bug introduced in version 3.1.4 (misaligned memory under certain conditions; bug had no impact on clustering results),
  • fix: minor bug in the way available memory is estimated (buffer underflow; bug had no impact on clustering results),
  • fix: 50 warnings triggered by newly added compilation checks,
  • fix: 1,677 clang-tidy warnings (from 2,629 warnings, down to 952),
  • performance: generally stable but compiler-dependent, with the exception of a 5 to 10% increase in total memory footprint when d >= 2 (we expect these performance regressions to be temporary and to be fixed with further refactoring),
  • improve: documentation for output option --network_file (advanced users),
  • improve: build target platform detection,
  • improve: code-coverage of our test-suite,
  • improve: code modernization for long-term maintenance,
  • revert: due to inconsistent test results, reading from named pipes (fifo) is marked as experimental for now

version 3.1.4

swarm 3.1.4 fixes a minor bug, eliminates compilation warnings and static analysis warnings, and improves code:

  • fix: add checks to prevent silent overflow of short unsigned integers,
  • fix: compilation warnings with GCC 13 and clang 18,
  • fix: 1,040 clang-tidy warnings (from 3,669 warnings, down to 2,629),
  • improve: code modernization for long-term maintenance,
  • improve: double the maximal number of threads (from 256 threads to 512),
  • improve: make -DNDEBUG the default compilation behavior,
  • performance: stable for all modes, except a 6 to 10% increase in memory footprint when d > 2

version 3.1.3

swarm 3.1.3 fixes a few minor bugs, removes warnings, and improves code and documentation:

  • fix: bug introduced in version 3.1.1, that caused swarm to allocate way too much memory when d > 1 (bug had no impact on clustering results),
  • fix: off-by-one error when allocating memory for a Bloom filter (bug had no impact on clustering results),
  • fix: compilation warning with GCC 12 (and more recent) when using link-time optimization,
  • fix: compilation warning with clang 13 (and more recent): unused set variable,
  • fix: five clang-tidy warnings (readability-braces-around-statements),
  • fix: minor code refactoring,
  • improve: more uniform vocabulary throughout swarm's documentation (code, help, manpage, README, companion scripts and wiki),
  • improve: code coverage of our test suite (swarm-tests).

version 3.1.2

swarm 3.1.2 fixes a bug in fastidious mode, introduced in version 3.1.1, that could cause swarm to crash, probably due to allocating too much memory.

version 3.1.1

swarm 3.1.1 eliminates a risk of segmentation fault with extremely long sequence headers. Documentation and error messages have been improved, and code cleaning continued.

version 3.1

swarm 3.1 includes a fix for a bug in the 16-bit SIMD alignment code that was exposed by a combination of d > 1, long sequences, and very high gap penalties. The code has also been cleaned up, tested and improved substantially, and it is now fully C++11 compliant. Support for macOS on Apple Silicon (ARM64) has been added.

version 3.0

swarm 3.0 is much faster when d = 1, and consumes less memory. Strict dereplication is now mandatory.

version 2.2.2

swarm 2.2.2 fixes a bug causing Swarm to wait forever in very rare cases when multiple threads were used.

version 2.2.1

swarm 2.2.1 fixes a memory allocation bug for d=1.

version 2.2.0

swarm 2.2.0 fixes several problems and improves usability:

  • corrected output to structure and uclust files when using fastidious mode,
  • corrected abundance output in some cases,
  • added a check for duplicated sequences and fixed the check for duplicated sequence IDs,
  • added a check for empty sequences,
  • sorts sequences by additional fields to improve stability,
  • improved compatibility with compilers and operating systems,
  • outputs sequences in upper case,
  • allows 64-bit abundances,
  • shows a message when waiting for input from stdin,
  • improved error messages, warnings, and checking of command line options,
  • fixed remaining errors reported by the test suite,
  • updated documentation.

version 2.1.13

swarm 2.1.13 removes a bug in the progress bar when writing seeds.

version 2.1.12

swarm 2.1.12 removes a debugging message.

version 2.1.11

swarm 2.1.11 fixes two bugs related to the SIMD implementation of alignment that might result in incorrect alignments and scores. These bugs only apply when d > 1.

version 2.1.10

swarm 2.1.10 fixes two bugs related to gap penalties of alignments. The first bug could lead to wrong alignments and similarity percentages reported in UCLUST (.uc) files. The second bug made swarm use a slightly higher gap extension penalty than specified: the default gap extension penalty actually used was 4.5 instead of 4.

version 2.1.9

swarm 2.1.9 fixes a problem when compiling with GCC version 6.

version 2.1.8

swarm 2.1.8 fixes a rare bug triggered when clustering extremely short undereplicated sequences. Also, alignment parameters are not shown when d=1.

version 2.1.7

swarm 2.1.7 fixes more problems with seed output, ignores CR characters in FASTA files, and improves help and error messages.

version 2.1.6

swarm 2.1.6 fixes problems with older compilers that do not have the x86intrin.h header file. It also fixes a bug in the output of seeds with the -w option when d>1.

version 2.1.5

swarm 2.1.5 fixes minor bugs.

version 2.1.4

swarm 2.1.4 fixes minor bugs in the swarm algorithm used for d=1.

version 2.1.3

swarm 2.1.3 adds checks of numeric option arguments.

version 2.1.2

swarm 2.1.2 adds the -a (--append-abundance) option to set a default abundance value to be used when abundance information is missing from the input file. If this option is not specified, missing abundance information will result in a fatal error. The error message in that case is improved.

version 2.1.1

swarm 2.1.1 fixes a bug with the fastidious option that caused it to ignore some connections between heavy and light swarms.

version 2.1.0

swarm 2.1.0 marks the first official release of swarm 2.

version 2.0.7

swarm 2.0.7 writes abundance information in usearch style when using option -w (--seeds) in combination with -z (--usearch-abundance).

version 2.0.6

swarm 2.0.6 fixes a minor bug.

version 2.0.5

swarm 2.0.5 improves the implementation of the fastidious option and adds options to control the memory usage of the Bloom filter (-y and -c). In addition, an option (-w) allows outputting cluster representative sequences with updated abundances (the sum of all abundances inside each cluster). This version also enables dereplication when d = 0.

version 2.0.4

swarm 2.0.4 includes a fully parallelized fastidious option.

version 2.0.3

swarm 2.0.3 includes a working fastidious option.

version 2.0.2

swarm 2.0.2 fixes SSSE3 problems.

version 2.0.1

swarm 2.0.1 is a development release that partially implements the fastidious option.

version 2.0.0

swarm 2.0.0 simplifies the usage of swarm by using the fast algorithm and the built-in cluster breaking by default. Some options are changed and some new output options are introduced.

version 1.2.21

swarm 1.2.21 is supposed to fix some problems related to the use of the SSSE3 CPU instructions which are not always available.

version 1.2.20

swarm 1.2.20 presents a production-ready version of the alternative algorithm (option -a), with optional built-in cluster breaking (option -n). That alternative algorithmic approach (usable only with d = 1) is considerably faster than currently used clustering algorithms, and can deal with datasets of 100 million unique amplicons or more in a few hours. Of course, results are rigorously identical to the results previously produced with swarm. That release also introduces new options to control swarm's output (options -i and -l).

version 1.2.19

swarm 1.2.19 fixes a problem related to abundance information when the sequence identifier includes multiple underscore characters.

version 1.2.18

swarm 1.2.18 reenables the possibility of reading sequences from stdin if no file name is specified on the command line. It also fixes a bug related to CPU features detection.

version 1.2.17

swarm 1.2.17 fixes a memory allocation bug introduced in version 1.2.15.

version 1.2.16

swarm 1.2.16 fixes a bug in the abundance sort introduced in version 1.2.15.

version 1.2.15

swarm 1.2.15 sorts the input sequences in order of decreasing abundance unless they are detected to be sorted already. When using the alternative algorithm for d = 1 it also sorts all subseeds in order of decreasing abundance.

version 1.2.14

swarm 1.2.14 fixes a bug in the output with the swarm breaker option (-b) when using the alternative algorithm (-a).

version 1.2.13

swarm 1.2.13 updates the citation.

version 1.2.12

swarm 1.2.12 improves speed of new search strategy for d = 1.

version 1.2.11

swarm 1.2.11 corrects the number of differences reported in the break swarms output.

version 1.2.10

swarm 1.2.10 allows amplicon abundances to be specified using the usearch style in the sequence header (e.g. >id;size=1) when the -z option is chosen. Also fixes the bad URL shown in the previous version of swarm.

version 1.2.9

swarm 1.2.9 includes a parallelized variant of the new search strategy for d = 1. It seems to be fairly scalable up to about 16 threads for longer reads (~400 bp), and up to about 8 threads for shorter reads (~150 bp). Using about 50% more threads than available physical cores is recommended. This version also includes the d parameter in the beginning of the mothur-style output (e.g., swarm_1). Also, in the break_swarms output the real number of differences between the seed and the amplicon is indicated in the last column.

version 1.2.8

swarm 1.2.8 fixes an error with the gap extension penalty. Previous versions effectively used a gap penalty twice as large as intended. This version also introduces an experimental new search strategy in the case where d = 1 that appears to be almost linear and faster at least for datasets of about half a million sequences or more. The new strategy can be turned on with the -a option.

version 1.2.7

swarm 1.2.7 incorporates a few small changes and improvements to make it ready for integration into QIIME.

version 1.2.6

swarm 1.2.6 adds an option (-r or --mothur) to format the output file as a mothur-compatible list file instead of the native swarm format. When swarm encounters an illegal character in the input sequences, it now reports the illegal character and the line number.

version 1.2.5

swarm 1.2.5 can be run on CPUs without the POPCNT feature. It automatically checks whether the CPU feature is available and uses the appropriate code. The code that avoids POPCNT is just slightly slower. Only basic SSE2 is now required.

version 1.2.4

swarm 1.2.4 changes the name of the new option from --break_swarms to --break-swarms for consistency with other options, and also adds a companion script swarm_breaker.py to refine swarm results (scripts folder).

version 1.2.3

swarm 1.2.3 adds an option (-b or --break_swarms) to output all pairs of amplicons to stderr. The data can be used for post-processing of the results to refine the swarms. The syntax of the inline assembly code is also changed for compatibility with more compilers.

version 1.2.2

swarm 1.2.2 fixes an issue with incorrect values in the statistics file (maximum generation and radius of swarms). This version is also a bit faster.

version 1.2.1

swarm 1.2.1 removes the need for a SSE4.1 capable CPU and should now be able to run on most servers, desktops and laptops.

version 1.2.0

swarm 1.2.0 introduces a pre-filtering of similar amplicons based on k-mers. This eliminates most of the time-consuming pairwise alignments and greatly improves speed. The speedup can be more than 100-fold compared to previous swarm versions when using a single thread with a large set of amplicons. Using multiple threads induces a computational overhead, but becomes more and more efficient as the size of the amplicon set increases.

version 1.1.1

swarm now works on Apple computers. This version also corrects an issue in the pairwise global alignment step that could lead to sub-optimal alignments. Slightly different alignments may result relative to the previous version, giving slightly different swarms.

version 1.1.0

swarm 1.1.0 introduces new optimizations and is 20% faster than the previous version on our test dataset. It also introduces two new output options: statistics and uclust-like format.

By specifying the -s option to swarm it will now output detailed statistics about each swarm to a specified file. It will print the number of unique amplicons, the number of occurrences, the name of the seed and its abundance, the number of singletons (amplicons with an abundance of 1), the number of iterations and the maximum radius of the swarm (i.e. number of differences between the seed and the furthermost amplicon). When using input data sorted by decreasing abundance, the seed is the most abundant amplicon in the swarm.

Some pipelines use the uclust output format as input for subsequent analyses. swarm can now output results in this format to a specified file with the -u option.


swarm's Issues

Store sequences using 2 bits

Store sequences using 2 bits instead of 8 bits per nucleotide. Will allow larger datasets. Impact on performance uncertain.

Progress Indication

If it's possible given the implementation, some kind of progress indication would be very helpful. Maybe a count of sequences left unassigned to an OTU?

Can the stats file be used in any way to determine progress? We've had a job clustering 5,358,959,072 nucleotides in 12,783,921 (unique) sequences on 16 cores with 16GB of RAM for around a week now, and knowing when it may complete would be some peace of mind. It's chugging along quite nicely at 1000% CPU usage (according to top), and has survived when cd-hit and uclust would have filled up the RAM and forced the process into swap space.

Accelerate pairwise comparisons for the special case d = 1

By default, the d value is set to 1, to produce high-resolution OTUs, even for the densest regions of the amplicon-space. We can safely assume that d = 1 will be by far the most frequent choice in future studies.

There are several ways to speed up the pairwise comparisons when searching only for amplicons with one difference.

The most obvious way is to compare the strings character by character: compare the first characters of each sequence; if they match, compare the second characters, and so on. On the first mismatch, record the coordinate of the mismatch and break. Then start the comparisons from the end of the sequences, until a mismatch is found. If both mismatches occur at the same coordinate, then there is a single mutation between the sequences.

Of course, there is no need to compare the sequences if their length difference is greater than 1.

For sequences of length n (± 1), the number of character comparisons to do is at most n, which should be faster than a global pairwise alignment. Here is an implementation as a Python function:

def pairwise_alignment(seed, candidates):
    """
    Fast pairwise comparison (search only for sequences with a single
    mutation).
    """
    selected = []
    seed_length = len(seed)
    for candidate in candidates:
        candidate_length = len(candidate)
        # Eliminate sequences too long or too short
        if abs(seed_length - candidate_length) > 1:
            continue
        # Identify the longest sequence (in practice, turn all
        # insertion cases into deletions)
        if seed_length >= candidate_length:
            query, subject = seed, candidate
        else:
            query, subject = candidate, seed
        # Compare the sequences character by character
        length = len(query)
        stop_forward = length
        for i in range(length - 1):
            if query[i] != subject[i]:
                stop_forward = i
                break
        stop_reverse = stop_forward
        for i in range(1, length - stop_forward):
            if query[-i] != subject[-i]:
                stop_reverse = length - i
                break
        # Do we detect a single mutation or insertion-deletion?
        if stop_forward == stop_reverse:
            selected.append(candidate)
    return selected

The function arguments are the seed sequence and a list of candidate sequences to test. It is 40 times faster than a pure python Needleman-Wunsch function.

Swarm radius values should be available via an option

It would be interesting to get the radius of each swarm. The radius is the maximum distance (or the minimum percentage of identity) between the seed of the swarm and any other amplicon assigned to the swarm.

Swarm radius in statistics should be multiplied by d

The radius value output is a short integer. This integer represents the number of generations, or the number of iterations, that it took for the swarm to reach its maximum limits. If multiplied by the parameter d, it gives the maximum number of differences between the seed and the edge of the swarm.

I think that the radius value should be multiplied by d by default.

Allow ascii \x01 in headers

Allow the \x01 character (^A) (SOH) in headers. It is used by NCBI to separate entries in the FASTA headers of the NR and NT databases.

Use swarm for dereplication (-d 0)

Rohit Kolora suggested adding the possibility of running swarm with -d 0 to dereplicate datasets.

The recommended tool for dereplication is vsearch --derep_fulllength. Implementing an efficient dereplication in swarm is not a priority for us, and would lead to code duplication. However, it should be possible to implement some sort of (inefficient but functional) dereplication, just by keeping the present code and allowing the use of d values ranging from zero to 255. I guess the fastest way is to use the alternative algorithm (so the branching would be: alternative algorithm for d = 0 or 1, normal algorithm for higher d values).

It seems that users spontaneously try to run swarm -d 0. It has some logic to it, so we might as well try to make it possible (if it is not time consuming).

Downstream analysis on otutable or biom file

Hi,

What would be the best approach to continue with statistics (in R/phyloseq) after swarm clustering? Mapping raw reads to the reference sequences using vsearch and the --usearch_global option? If I were to use the mothur output, a count_table would still be needed to create a shared/biom file. Or is there another way to use the abundance information in the stats file, for example, to arrive at a table of OTU counts?

Thanks!

Exact string matching strategy (special case d = 1)

When d = 1, how many micro-variants can a sequence have?

At most 7L + 4, where L is the length of the seed.

A micro-variant is a sequence differing from the seed sequence by only one substitution, insertion or deletion. The number of possible micro-variants is surprisingly low, and it is reasonably fast to generate all possible sequences. If we store our fasta entries as a mapping of sequences to amplicon names and abundances, we can turn swarming into a problem of exact string matching.

It works as follows: parse the entire dataset and store it in a mapping structure. For a given seed sequence, produce all its unique micro-variants. Check whether these micro-variants are in the mapping structure. If so, add them to the swarm as sub-seeds and mark them as swarmed.

The complexity of the computation now depends on how fast the mapping structure can be queried, and how fast we can create the micro-variants sets. The micro-variant sets have to be created for all amplicons (9L + 4 operations), and at most 7L + 4 look-ups are necessary for each amplicon. Depending on the complexity of look-ups, the global complexity could be in O(n log n).

A pure python implementation is only 10 times slower than swarm (C++) on a dataset containing 77,693 unique sequences of average length 129 bp: 25 s for swarm, 240 s for swarm.py. I will run tests on much larger datasets to see if the new implementation outperforms the C++ implementation when n increases.
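To illustrate how few micro-variants there are, here is a minimal sketch in bash (assumes bash 4 or later for associative arrays, an ACGT alphabet, and sequences of length >= 2; an illustration only, not swarm's actual implementation):

# Generate all unique micro-variants (one substitution, insertion or
# deletion) of a seed sequence: 9L + 4 candidate strings are produced,
# of which at most 7L + 4 are unique.
generate_microvariants() {
    local seq="$1"
    local -A variants=()
    local -i i
    local nuc prefix suffix
    for ((i = 0; i < ${#seq}; i++)); do
        prefix="${seq:0:i}"
        suffix="${seq:i+1}"
        variants["${prefix}${suffix}"]=1              # deletion at position i
        for nuc in A C G T; do
            variants["${prefix}${nuc}${suffix}"]=1    # substitution at position i
            variants["${prefix}${nuc}${seq:i}"]=1     # insertion before position i
        done
    done
    for nuc in A C G T; do
        variants["${seq}${nuc}"]=1                    # insertion after the last position
    done
    unset "variants[${seq}]"  # same-nucleotide substitutions recreate the seed
    printf "%s\n" "${!variants[@]}"
}

generate_microvariants "ACGT" | wc -l    # prints 32, i.e. 7L + 4 with L = 4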

Support for usearch amplicon-abundance annotation style

Swarm assumes that in this type of fasta header:

>KR08766_2

The value 2 represents the abundance of that sequence (i.e., the number of times the sequence has been seen in the dataset). Usearch uses a different annotation style (see the usearch documentation for a full description):

>KR08766;size=1;
>KR08766;size=2

Swarm should support these two forms (with or without the terminal ;) to facilitate its use in uclust-based pipelines.
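For pipelines that nonetheless need swarm's default format, a hedged one-liner can convert usearch-style annotations (a sketch; it assumes the size annotation terminates the header):

sed -E '/^>/ s/;size=([0-9]+);?$/_\1/' amplicons_usearch.fasta > amplicons_swarm.fasta

With the -z option implemented, such a conversion is not needed when running swarm itself.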

Avoid SSE3 and SSSE3 dependency

The Swarm code is compiled with "-mssse3" in the Makefile. A few SSSE3 instructions are used in the alignment code if present; otherwise the code is written to only require SSE2, and the presence of SSE2 is tested for. However, it seems that the compiler also uses some SSE3 or SSSE3 instructions somewhere. This results in termination of Swarm with "Illegal instruction" fatal errors on some architectures that do not support all SSE3 or SSSE3 instructions. Two reports from users indicate this (both using pick_denovo_otus.py in qiime).

The code should be recompiled with other compiler options in the Makefile to avoid this problem.

Adapt to AVX2

Adapt SWARM to AVX2 and the 256-bit registers available in the Intel Haswell CPUs that became available in June 2013. This should allow 32-way SIMD parallelisation.

Fix bug in affine alignment backtracking

More information needs to be saved during the forward alignment in order to backtrack the correct alignment when using affine gap penalties. Otherwise the identified alignment may not be optimal.

Bug in the -b option output (critical)

That option is important for swarm breaking. In its present form, the output of the option is wrong, which gives wrong breaking results. The -b output is formatted as follows: @@ TAB father TAB son TAB number of differences.

As it is implemented now, the -b option always outputs the cluster's seed as "father". Here is a simple example. This fasta file:

>s01_100
ACGTACGT
>s02_50
ACGTTCGT
>s03_25
ACGTTAGT
>s04_1
ACGTTAAT

...yields the following cluster: s01--s02--s03--s04

Number of swarms:  1
Largest swarm:     4
Max generations:   3

...but the -b option output is:

@@  s01 s02 1
@@  s01 s03 1
@@  s01 s04 1

...instead of:

@@  s01 s02 1
@@  s02 s03 1
@@  s03 s04 1

Chimera checking and swarm

When clustering with swarm, when is an appropriate time to check for chimeras?

Developers of previous OTU picking algorithms have recommended chimera checking at different stages. For example, Robert Edgar recommends using uchime to remove chimeras after OTU picking with uparse. His recommendation is based, in part, on chimera removal heuristics built into uparse. When using uclust, he recommended reference-based chimera checking before OTU picking so that OTUs were not influenced by chimeras.

When using swarm, when and how should we attempt to remove chimeric sequences?

Unstable order of amplicons in swarms with new approach for d=1 with >1 thread

With the new approach for d=1, the order of amplicons in each swarm may be unstable when running with more than one thread; the actual swarms and their amplicons should be the same. This is because the list of subseeds for each seed is not in the order of the input file. To fix this we could sort the subseeds before proceeding to the next (sub)seed.

Output detailed statistics for each swarm

Add an option to output the following swarm statistics to a separate file:

  • maximum radius of cluster
  • number of unique amplicons
  • number of copies
  • name of the seed
  • seed abundance
  • number of singletons (amplicons with an abundance of 1)

Sort by abundance

Add option to sort input sequences according to their abundance if present

Segmentation fault (or SIGABRT) with -a

With several test datasets, I get a segmentation fault at the end of the clustering process when using the -a option. A gdb session shows

gdb swarm
...
run -a -b -d 1 < AF091148.fas
...
Number of swarms:  887
Largest swarm:     441
Max generations:   186

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff67f3700 (LWP 24981)]
0x00007ffff702d0c0 in __GI___call_tls_dtors () at cxa_thread_atexit_impl.c:83
83  cxa_thread_atexit_impl.c: Aucun fichier ou dossier de ce type.
(gdb) backtrace 
#0  0x00007ffff702d0c0 in __GI___call_tls_dtors () at cxa_thread_atexit_impl.c:83
#1  0x00007ffff7bc70b2 in start_thread (arg=0x7ffff67f3700) at pthread_create.c:319
#2  0x00007ffff70dac2d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

if I do the same using 2 threads, I obtain a different error:

gdb swarm
...
run -a -b -d 1 < AF091148.fas
...
[Thread 0x7ffff5ff2700 (LWP 25025) exited]
*** Error in `/home/fred/Science/Projects/Swarms/swarm/swarm': munmap_chunk(): invalid pointer: 0x00000000006403d0 ***
[Thread 0x7ffff57f1700 (LWP 25026) exited]

Program received signal SIGABRT, Aborted.
0x00007ffff702a077 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56  ../nptl/sysdeps/unix/sysv/linux/raise.c: Aucun fichier ou dossier de ce type.
(gdb) backtrace 
#0  0x00007ffff702a077 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff702b458 in __GI_abort () at abort.c:89
#2  0x00007ffff7067fb4 in __libc_message (do_abort=do_abort@entry=1, fmt=fmt@entry=0x7ffff715abc0 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007ffff706d78e in malloc_printerr (action=1, str=0x7ffff715abe8 "munmap_chunk(): invalid pointer", ptr=<optimized out>) at malloc.c:4996
#4  0x000000000040b76c in hash_free () at algod1.cc:142
#5  algo_d1_run () at algod1.cc:727
#6  0x000000000040169a in main (argc=<optimized out>, argv=0x7fffffffe2e8) at swarm.cc:428
(gdb) frame 4
#4  0x000000000040b76c in hash_free () at algod1.cc:142
142   free(hash_occupied);
(gdb) print hash_occupied 
$1 = (unsigned char *) 0x6403d0 "J\001"

Add the OTU number to the output of the -b option

Géraldine Pascal from Toulouse (France) suggested adding the OTU number to the output of the -b option. Each line would then contain a new column with a positive integer representing the OTU number in swarm's results: 1 for the first OTU, 2 for the second, etc.

That would facilitate the filtering of the -b output, allowing for an easy extraction of all amplicon pairs belonging to a given OTU. Implementation would require a small modification of swarm_breaker.py, if the script is still necessary for OTU breaking.

Refine clustering

When running swarm_breaker I get an error I cannot figure out. I would be grateful for help:

$ python ~/kode/swarm/scripts/swarm_breaker.py -f All_reads_U.fasta -s amplicons.swarms > amplicons.swarms2
Traceback (most recent call last):
  File "/home/andersl/kode/swarm/scripts/swarm_breaker.py", line 315, in <module>
    main()
  File "/home/andersl/kode/swarm/scripts/swarm_breaker.py", line 304, in main
    swarm_breaker(binary, all_amplicons, swarms, threshold)
  File "/home/andersl/kode/swarm/scripts/swarm_breaker.py", line 243, in swarm_breaker
    graph_raw_data = run_swarm(binary, all_amplicons, amplicons, threshold)
  File "/home/andersl/kode/swarm/scripts/swarm_breaker.py", line 127, in run_swarm
    close_fds=True)
  File "/usr/lib/python2.7/subprocess.py", line 679, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

swarm_breaker.py sometimes hangs forever

The python script uses subprocess to run swarm on sub-parts of the fasta dataset. The function wait() sometimes hangs forever, without doing anything. The bug is known to be triggered when using pipes to redirect stdout and/or stderr of the subprocess: when the amount of data is bigger than 64 KiB (the capacity of a pipe buffer), a deadlock happens.

One recommended solution is to write stdout and stderr to files. This is what we do here, so I don't understand why the bug happens.

Linearization awk code prints last line twice

The provided awk code to linearize fasta files prints the last line twice.

awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1} END {print}' amplicons.fasta > amplicons_linearized.fasta

doesn't work for me, while:

awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1}' amplicons.fasta > amplicons_linearized.fasta

is fine. The END {print} statement re-prints $0, which still holds the last sequence line, hence the duplication.

Easy test:

>1
AAACCC
>2
GGGTTT

with original code produces:

>1
AAACCC
>2
GGGTTTGGGTTT

with fixed code produces:

>1
AAACCC
>2
GGGTTT

Tested with GNU awk 4.0.1 and 4.1.1

Check for unique sequences

Check if sequences are unique and handle any duplicates; compute the hash of each sequence with MD5, SHA1 or similar.

Add the number of differences in the output of the d option

In its present form, the -d option outputs pairs of amplicons that have at most d differences, and a dummy value "1". It would be much more interesting to replace that dummy value with the actual number of differences between the two amplicons.
