Giter VIP home page Giter VIP logo

hashchain's Introduction

HashChain

HashChain is a family of very fast factor-based sublinear exact-matching search algorithms.

They work by building a bloom-filter based on hashes of q-grams within the pattern to be searched, and hashes of their adjacent q-gram. This permits the algorithm to efficiently reject non-adjacent q-grams in the text with very high probability, allowing it to skip ahead of the mis-matching factor.

Publications

A pre-print of the paper describing the HashChain algorithm is available on arXiv at https://arxiv.org/abs/2310.15711v1.

A version of this paper was accepted at the Symposium on Experimental Algorithmics 2024: https://sea2024.univie.ac.at/accepted-papers/

Algorithms

Algorithms in the HashChain family include:

  • HashChain - the original HashChain algorithm.
  • SentinelHashChain - a faster HashChain using a search text modification hack.
  • WeakerHashChain - a faster HashChain algorithm which does not re-scan data during filtering.
  • LinearHashChain - HashChain with a guaranteed linear worst-case, based on Linear WFR.

Similar algorithms

Similar algorithms include:

  • Weak Factor Recognition (WFR) by Simone Faro,
  • Linear WFR (LWFR) by Domenico Cantone, Simone Faro and Arianna Pavone,
  • Q-gram filtering (QF) by Branislav Ďurian, Hannu Peltola, Leena Salmela, and Jorma Tarhio.

WFR uses a rolling hash function that gives it a longer pre-processing stage and which sets more bits in its filter, giving it a higher false positive rate. It is especially effective on low entropy data as the rolling hash expands the accessible space in the filter table.

LWFR uses two techniques to achieve linearity in the worst case, while remaining sublinear on average. First that it does not re-scan data it has previously scanned during the filtering phrase, and second, that it uses a forward linear matching algorithm (KMP) for verification that likewise guarantees linearity.

QF uses an alignment concept; that successive q-grams of size Q read back in the text should be aligned in the pattern in the same way. Either they belong to the chain of q-grams that start from the end of the pattern, or the second from the end, and so on up to Q chains. It appears to be an efficient filter, but the space it has to set bits is limited by Q rather than the size of a word in memory.

hashchain's People

Contributors

nishihatapalmer avatar

Stargazers

vvvggg avatar Stefano Scafiti avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.