Giter VIP home page Giter VIP logo

lightmotif's Introduction

🎼🧬 lightmotif Star me

A lightweight platform-accelerated library for biological motif scanning using position weight matrices.

Actions Coverage License Crate Docs Source Mirror GitHub issues Changelog

🗺️ Overview

Motif scanning with position weight matrices (also known as position-specific scoring matrices) is a robust method for identifying motifs of fixed length inside a biological sequence. They can be used to identify transcription factor binding sites in DNA, or protease cleavage site in polypeptides. Position weight matrices are often viewed as sequence logos:

MX000274.svg

The lightmotif library provides a Rust crate to run very efficient searches for a motif encoded in a position weight matrix. The position scanning combines several techniques to allow high-throughput processing of sequences:

  • Compile-time definition of alphabets and matrix dimensions.
  • Sequence symbol encoding for fast table look-ups, as implemented in HMMER[1] or MEME[2]
  • Striped sequence matrices to process several positions in parallel, inspired by Michael Farrar[3].
  • Vectorized matrix row look-up using permute instructions of AVX2.

Other crates from the ecosystem provide additional features if needed:

  • lightmotif-io is a crate with parser implementations for various count matrix, frequency matrix and position-specific scoring matrix formats such as TRANSFAC or JASPAR.
  • lightmotif-tfmpvalue is an exact reimplementation of the TFM-PVALUE[4] algorithm for converting between a score and a p-value for a given scoring matrix.

This is the Rust version, there is a Python package available as well.

💡 Example

use lightmotif::*;
use lightmotif::abc::Nucleotide;

// Create a count matrix from an iterable of motif sequences
let counts = CountMatrix::<Dna>::from_sequences(
    ["GTTGACCTTATCAAC", "GTTGATCCAGTCAAC"]
        .into_iter()
        .map(|s| EncodedSequence::encode(s).unwrap()),
)
.unwrap();

// Create a PSSM with 0.1 pseudocounts and uniform background frequencies.
let pssm = counts.to_freq(0.1).to_scoring(None);

// Use the pipeline to encode the target sequence into a striped matrix
let seq = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG";
let encoded = EncodedSequence::encode(seq).unwrap();
let mut striped = encoded.to_striped();

// Organize layout of striped matrix to allow scoring with PSSM.
striped.configure(&pssm);

// Compute scores for every position of the matrix.
let scores = pssm.score(&striped);

// Scores can be extracted into a Vec<f32>, or indexed directly.
let v = scores.unstripe();
assert_eq!(scores[0], -23.07094);
assert_eq!(v[0], -23.07094);

// Find the highest scoring position.
let best = scores.argmax().unwrap();
assert_eq!(best, 18);

// Find the positions above an absolute score threshold.
let indices = scores.threshold(10.0);
assert_eq!(indices, []);

This example uses a dynamic dispatch pipeline, which selects the best available backend (AVX2, SSE2, NEON, or a generic implementation) depending on the local platform.

⏱️ Benchmarks

Both benchmarks use the MX000001 motif from PRODORIC[5], and the complete genome of an Escherichia coli K12 strain. Benchmarks were run on a i7-10710U CPU running @1.10GHz, compiled with --target-cpu=native.

  • Score every position of the genome with the motif weight matrix:

    test bench_avx2    ... bench:   4,510,794 ns/iter (+/-     9,570) = 1029 MB/s
    test bench_sse2    ... bench:  26,773,537 ns/iter (+/-    57,891) =  173 MB/s
    test bench_generic ... bench: 317,731,004 ns/iter (+/- 2,567,370) =   14 MB/s
  • Find the highest-scoring position for a motif in a 10kb sequence (compared to the PSSM algorithm implemented in bio::pattern_matching::pssm):

    test bench_avx2    ... bench:      12,797 ns/iter (+/-   380) = 781 MB/s
    test bench_sse2    ... bench:      62,597 ns/iter (+/-    43) = 159 MB/s
    test bench_generic ... bench:     671,900 ns/iter (+/- 1,150) =  14 MB/s
    test bench_bio     ... bench:   1,193,911 ns/iter (+/- 2,519) =   8 MB/s

💭 Feedback

⚠️ Issue Tracker

Found a bug ? Have an enhancement request ? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing in on a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the open-source MIT license.

This project was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.

📚 References

  • [1] Eddy, Sean R. ‘Accelerated Profile HMM Searches’. PLOS Computational Biology 7, no. 10 (20 October 2011): e1002195. doi:10.1371/journal.pcbi.1002195.
  • [2] Grant, Charles E., Timothy L. Bailey, and William Stafford Noble. ‘FIMO: Scanning for Occurrences of a given Motif’. Bioinformatics 27, no. 7 (1 April 2011): 1017–18. doi:10.1093/bioinformatics/btr064.
  • [3] Farrar, Michael. ‘Striped Smith–Waterman Speeds Database Searches Six Times over Other SIMD Implementations’. Bioinformatics 23, no. 2 (15 January 2007): 156–61. doi:10.1093/bioinformatics/btl582.
  • [4] Touzet, Hélène, and Jean-Stéphane Varré. ‘Efficient and Accurate P-Value Computation for Position Weight Matrices’. Algorithms for Molecular Biology 2, no. 1 (2007): 1–12. doi:10.1186/1748-7188-2-15.
  • [5] Dudek, Christian-Alexander, and Dieter Jahn. ‘PRODORIC: State-of-the-Art Database of Prokaryotic Gene Regulation’. Nucleic Acids Research 50, no. D1 (7 January 2022): D295–302. doi:10.1093/nar/gkab1110.

lightmotif's People

Contributors

althonos avatar striezel avatar workingjubilee avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

lightmotif's Issues

Motif finding in protein sequences.

Hello! What a wonderful program you have here.

I was writing to ask if lightmotif is capable of searching for motifs in protein sequences. It looks like functionality is partially in place but I couldn't get it to run with some simple code tweaks. It kept crashing as it was trying to reverse compliment a protein sequence. Maybe I'm doing something wrong?

I'm looking to use motif searching techniques to find short, but conserved exons in genomes. Similarity approaches such as Blast, Diamond, HMMER aren't able to find these. We

When mining a genome, we know roughly where an exon is because it is nestled between or adjacent to longer exons. Thus we only have to search a miniscule fraction of the actual genome. We've done a basic proof of concept which shows the idea works. Unfortunately, all the references sequences are only available in amino acids. The nucleotide format is not available for the references. However, we are able to pass nucleotides or translations for the genome we are mining if that would help.

We were hoping to use your software so we didn't have to reinvent the wheel so to say.

Is this protein motif searching possible in lightmotif? If not, is it something you can add?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.