
tantivy's Introduction

Docs · Build Status · codecov · Chat on Discord: https://discord.gg/MT27AG5EVE · License: MIT · Crates.io

Tantivy is a fast full-text search engine library written in Rust.

If you are looking for an alternative to Elasticsearch or Apache Solr, check out Quickwit, our distributed search engine built on top of Tantivy.

Tantivy is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is not an off-the-shelf search engine server, but rather a crate that can be used to build such a search engine.

Tantivy is, in fact, strongly inspired by Lucene's design.

Benchmark

The following benchmark breaks down performance for different types of queries and collections.

Your mileage WILL vary depending on the nature of queries and their load.

Details about the benchmark can be found at this repository.

Features

  • Full-text search
  • Configurable tokenizer (stemming available for 17 Latin languages) with third-party support for Chinese (tantivy-jieba and cang-jie), Japanese (lindera, Vaporetto, and tantivy-tokenizer-tiny-segmenter), and Korean (lindera + lindera-ko-dic-builder)
  • Fast (check out the 🐎 ✨ benchmark ✨ 🐎)
  • Tiny startup time (<10ms), perfect for command-line tools
  • BM25 scoring (the same as Lucene)
  • Natural query language (e.g. (michael AND jackson) OR "king of pop")
  • Phrase queries search (e.g. "michael jackson")
  • Incremental indexing
  • Multithreaded indexing (indexing English Wikipedia takes < 3 minutes on my desktop)
  • Mmap directory
  • SIMD integer compression when the platform/CPU includes the SSE2 instruction set
  • Single valued and multivalued u64, i64, and f64 fast fields (equivalent of doc values in Lucene)
  • &[u8] fast fields
  • Text, i64, u64, f64, dates, IP addresses, bool, and hierarchical facet fields
  • Compressed document store (LZ4, Zstd, None)
  • Range queries
  • Faceted search
  • Configurable indexing (optional term frequency and position indexing)
  • JSON Field
  • Aggregation Collector: histogram, range buckets, average, and stats metrics
  • LogMergePolicy with deletes
  • Searcher Warmer API
  • Cheesy logo with a horse

Non-features

Distributed search is out of the scope of Tantivy, but if you are looking for this feature, check out Quickwit.

Getting started

Tantivy works on stable Rust and supports Linux, macOS, and Windows.
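
Below is a minimal end-to-end sketch that follows the shape of tantivy's own examples; exact signatures (for instance, whether add_document returns a Result) vary between versions:

use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    // Define a schema with a single stored, tokenized text field.
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let schema = schema_builder.build();

    // Index one document in RAM with a 50 MB indexing budget.
    let index = Index::create_in_ram(schema);
    let mut writer = index.writer(50_000_000)?;
    writer.add_document(doc!(title => "The Old Man and the Sea"))?;
    writer.commit()?;

    // Parse a query and collect the top 10 hits.
    let reader = index.reader()?;
    let searcher = reader.searcher();
    let query = QueryParser::for_index(&index, vec![title]).parse_query("sea")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    println!("{} document(s) matched", top_docs.len());
    Ok(())
}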

How can I support this project?

There are many ways to support this project.

  • Use Tantivy and tell us about your experience on Discord or by email ([email protected])
  • Report bugs
  • Write a blog post
  • Help with documentation by asking questions or submitting PRs
  • Contribute code (you can join our Discord server)
  • Talk about Tantivy around you

Contributing code

We use the GitHub Pull Request workflow: reference a GitHub ticket and/or include a comprehensive commit message when opening a PR. Feel free to update CHANGELOG.md with your contribution.

Tokenizer

When implementing a tokenizer for Tantivy, depend on the tantivy-tokenizer-api crate.

Clone and build locally

Tantivy compiles on stable Rust. To check out and run tests, you can simply run:

git clone https://github.com/quickwit-oss/tantivy.git
cd tantivy
cargo test

Companies Using Tantivy

Etsy · Nuclia · Humanfirst.ai · Element.io

FAQ

Can I use Tantivy in other languages?

Yes. Python bindings are available as tantivy-py, and you can also find bindings for other languages on GitHub, but they may be less maintained.

What are some examples of Tantivy use?

  • seshat: A matrix message database/indexer
  • tantiny: Tiny full-text search for Ruby
  • lnx: an adaptable, typo-tolerant search engine with a REST API
  • and more!

On average, how much faster is Tantivy compared to Lucene?

  • It depends on the workload. See the benchmark above; your mileage will vary with the nature of the queries and their load.

Does tantivy support incremental indexing?

  • Yes.

How can I edit documents?

  • Data in tantivy is immutable. To edit a document, it must be deleted and reindexed (see the sketch below).
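
A sketch of that pattern, assuming an id field used as a unique key (field and variable names are illustrative):

use tantivy::{doc, Term};

// Delete the old version by its unique id, then add the new one.
let id_term = Term::from_field_text(id_field, "doc-42");
index_writer.delete_term(id_term);
index_writer.add_document(doc!(
    id_field => "doc-42",
    title_field => "Updated title",
))?;
index_writer.commit()?; // the delete and the add become visible together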

When will my documents be searchable during indexing?

  • Documents become searchable after commit is called on the IndexWriter. Existing IndexReaders also need to be reloaded to pick up the changes, and the changes are only visible to newly acquired Searchers (see the sketch below).
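
A sketch of the commit-then-reload sequence (a reader can also reload automatically, depending on its reload policy):

index_writer.commit()?;           // persist and publish the changes
reader.reload()?;                 // point the reader at the new segments
let searcher = reader.searcher(); // this Searcher sees the new documents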

tantivy's People

Contributors

adamreichold, appaquet, barrotsteindev, boraarslan, currymj, dependabot-preview[bot], dependabot[bot], drusellers, evanxg852000, fmassot, fulmicoton, guilload, halvorboe, jason-wolfe, k-yomo, kodraus, kryesh, lengyijun, lnicola, petr-tik, ppodolsky, pseitz, rihardsk, robyoung, saroh, shikhar, trinity-1686a, vigneshsarma, vishalsodani, waywardmonkeys


tantivy's Issues

Geo-search

This would require a lat/long field type (and plain coordinates if we want simple 2D as well).

Then the user would want to perform queries such as:

  • return documents at a distance of less than X km
  • return documents within this polygon

Add auto-publish

Publishing and committing don't have to be the same operation.

In many use cases, users want documents to be searchable as soon as possible. Let's call making a segment ready for search "publishing".

Committing, on the other hand, is about ordering consistency (after a commit, a document is searchable iff it was added before the commit) and, more importantly, persistence.

Parse multivalues

The current JSON parser does not allow for multivalues.

{
 "author": ["Les Paul", "Mary Ford"]
}

should be parsed correctly
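
For illustration, a sketch of the desired expansion using serde_json (this is not the actual parser):

use serde_json::Value as JsonValue;

// Expand a JSON value into one field value per array element.
fn field_values(json: &JsonValue) -> Vec<String> {
    match json {
        JsonValue::Array(items) => items
            .iter()
            .filter_map(|item| item.as_str().map(String::from))
            .collect(),
        JsonValue::String(s) => vec![s.clone()],
        _ => Vec::new(),
    }
}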

Stop copying Schema around

Schema is copied quite often.

Maybe adding a property to Index and wrapping the Schema in an Arc could be a solution, as sketched below.
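
A sketch of the suggested fix (method names are illustrative):

use std::sync::Arc;
use tantivy::schema::Schema;

struct Index {
    // One shared Schema instead of a copy per consumer.
    schema: Arc<Schema>,
}

impl Index {
    fn schema(&self) -> Arc<Schema> {
        Arc::clone(&self.schema) // cheap reference-count bump, no deep clone
    }
}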

LogMergePolicy

Implement a merge policy inspired by LogMergePolicy from Lucene.

Segment size will be determined by the number of documents.

Unless I am mistaken, this merge policy should not need to store any state.
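
A stateless sketch of the level computation, assuming segment size is measured in documents (illustrative, not Lucene's actual code):

use std::collections::BTreeMap;

// Bucket segments by the order of magnitude of their document count;
// any bucket holding at least `merge_factor` segments yields one merge.
fn merge_candidates(num_docs: &[u32], merge_factor: usize) -> Vec<Vec<usize>> {
    let mut levels: BTreeMap<u32, Vec<usize>> = BTreeMap::new();
    for (seg, &docs) in num_docs.iter().enumerate() {
        let level = (docs.max(1) as f64).log10().floor() as u32;
        levels.entry(level).or_default().push(seg);
    }
    levels
        .into_values()
        .filter(|segments| segments.len() >= merge_factor)
        .collect()
}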

File lifetime

Currently we do not handle removing segments.
This is a bit tricky because the files (or rather the ReadOnlySources) involved in a search must outlive the searcher.

Find a way to cascade writes to disk

Currently, each indexing thread writes to disk when its indexing buffer is full. Since the threads start in sync, they tend to spill their segments all at the same time.
This is suboptimal: the disk is idle for a while, and is then asked to write a huge amount of data concurrently.

Implement deletes

Until now, tantivy only handled adding documents. This issue is about adding deletes. Provided a unique id term exists in the schema, an update can then be done on the application side by expanding it into a delete and an add.

Since adding a document will no longer be the only operation, "docstamp" needs to be renamed "opstamp". For each segment, deletes are represented as a tombstone .del file that stores the bitset of deleted documents.

For this first iteration, only terms can be deleted. When a term is added for deletion, it is
appended to the delete queue without being processed.

Committed segments

Upon commit, all previously committed segments have their deletes up to date as of the last commit opstamp. Their .del files therefore need to be updated up to the new commit opstamp.

Uncommitted segments

These uncommitted segments carry a piece of transient information: the opstamp up to which deletes have been processed, which is not necessarily the last commit opstamp.

(All of this information being transient is fine, as uncommitted segments are by nature transient. If the indexing process crashes and restarts, indexing must resume from the last commit opstamp, and there is no way to reuse the uncommitted segments.)

When a segment being written gets closed, the deletes up to the last opstamp are read to build a bitset of the deleted documents and a mapping of the compacted doc ids, so the finalized and flushed segment already takes the deletes into account.

Because deletes should only apply to documents that were added before the delete, we keep a transient, monotonic mapping from doc id to opstamp. Identifying the documents affected by a delete is then a matter of binary-searching the delete opstamp in this mapping: find the smallest doc id with a greater opstamp, read the inverted list from the segment writer, and mark as deleted all matching doc ids below this limit.

The delete opstamp at this point is the last opstamp of the segment.
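
A sketch of that binary search, where doc_opstamps[doc] is the opstamp at which doc was added (illustrative):

// Returns the first doc id added at or after the delete operation;
// matching doc ids strictly below this limit are the ones to delete.
fn delete_limit(doc_opstamps: &[u64], delete_opstamp: u64) -> usize {
    // The mapping is monotonic, so partition_point performs a binary search.
    doc_opstamps.partition_point(|&opstamp| opstamp < delete_opstamp)
}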

Merging segments

When merging uncommitted segments, all of the segments' tombstone bitsets are updated to a common opstamp. (When merging committed segments, they are already at the same opstamp, so this is not required.)
Similarly to what is done when writing segments, we create a map of old doc id -> new doc id and physically remove the deleted doc ids.
After the merged segment is flushed, we check whether its delete opstamp is up to date with the committed delete opstamp. If not, we compute and create the tombstone file accordingly.

Dumping delete terms from memory

The delete queue is a linear queue over which we can easily iterate, with sections delimited by opstamps.
The timing at which the queue can be purged is non-trivial, as its last consumer can be the commit or a currently running merge.

Clients of the delete queue can subscribe to it and declare that they will need to consume it from a given starting point in the future.

For the first version, there will be no mechanism to purge the delete queue before a commit when the queue grows too big.

Make sure IndexWriter "locks" the directory

There should be only one IndexWriter working at a time (whether within our process or in another one). We should probably add a lockfile system to prevent people from making this mistake.
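
A minimal sketch of such a lockfile, relying on the atomicity of create_new (the file name is illustrative):

use std::fs::{File, OpenOptions};
use std::io;
use std::path::Path;

// Succeeds only if no other writer already holds the lock.
fn acquire_writer_lock(index_dir: &Path) -> io::Result<File> {
    OpenOptions::new()
        .write(true)
        .create_new(true) // fails with AlreadyExists if the file is present
        .open(index_dir.join(".tantivy-writer.lock"))
}

A real implementation would also need to deal with stale lock files left behind by a crashed process.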

Implement PhraseQuery

The current positions encoding is very slow, but we should already be able to implement PhraseQuery on top of it.

Reproducible benchmark against Lucene

It would be great to track our performance periodically and ideally compare it to Lucene's.

Maybe using the same dataset as
http://home.apache.org/~mikemccand/lucenebench/

Add merge policy

Currently, users are expected to merge segments manually. It would be nice to have some kind of merge strategy that works out of the box.

Add a lenient mode to QueryParser

The role of the QueryParser is to give an off-the-shelf solution to people who want to put a search box in front of end users.
Currently, QueryParser::parse can return an error if the query does not match our query grammar.

e.g.: +(happy : the parenthesis is not closed.

This is great, but how should somebody building a site handle this error? Maybe for a log-analysis search engine, displaying an explicit syntax error is a good idea, but for most use cases (e-commerce, document search, etc.) we never want to return an error; we should just do our best to handle the user's query.

For this reason, we want to add a new parse_query_lenient method that never fails and always returns a result.

pub fn parse_query_lenient(&self, query: &str) -> Box<dyn Query> {
    // ...
}

The behavior of this lenient mode does NOT have to be very smart. For instance, it could be a second, very naive parser that takes over when the initial parser fails: it could simply remove all special characters and re-run the initial parser.

Smarter attempts to interpret the faulty user query are also welcome, as long as we return a Box<dyn Query> and we are confident the code will never panic.
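
A sketch of that naive fallback (illustrative; EmptyQuery serves as a never-failing last resort):

use tantivy::query::{EmptyQuery, Query, QueryParser};

fn parse_query_lenient(parser: &QueryParser, query: &str) -> Box<dyn Query> {
    parser.parse_query(query).unwrap_or_else(|_| {
        // Second, very naive pass: blank out special characters and retry.
        let sanitized: String = query
            .chars()
            .map(|c| if "+-()\"^:".contains(c) { ' ' } else { c })
            .collect();
        parser
            .parse_query(&sanitized)
            // Guarantee a result rather than a panic.
            .unwrap_or_else(|_| Box::new(EmptyQuery))
    })
}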

Index frequent bigrams

High-frequency words like "the" hurt query performance.

They are, most of the time, not queried alone.
We should build an analyzer that generates the right tokens for indexing and for querying (a sketch follows the examples below).

In query

 "the lord of the ring" -> "the lord", "of", "the", "ring"

In index

 "the lord of the ring" -> [ ["the", "lord"], "the lord"], "of", "the", "ring"

Rethink searchers

They exist to keep segments alive for as long as queries are running on them.
The API should allow the index to advertise new segments rapidly.

Compilation error: missing simdcomp

running: "c++" "-O0" "-ffunction-sections" "-fdata-sections" "-g" "-m64" "-fPIC" "-I" "./cpp/simdcomp/include" "-std=c++11" "-O3" "-mssse3" "-o" "/home/maciej/git/tantivy/target/debug/build/tantivy-c843e95d6308187a/out/cpp/simdcomp_wrapper.o" "-c" "cpp/simdcomp_wrapper.cpp"
cargo:warning=cpp/simdcomp_wrapper.cpp:4:22: fatal error: simdcomp.h: No such file or directory
cargo:warning=compilation terminated.
ExitStatus(ExitStatus(256))

How should I install it? Documentation doesn't mention it.

Replace the `Vec`s in SegmentWriter

Vec is probably not the best data structure to store the postings in the segment writer.
What we probably want is a linked list of blocks, as sketched below.

Extra points if the blocks themselves are compressed every time they are "closed".
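
A sketch of the suggested layout (illustrative): fixed-size blocks chained through a shared arena, so that each block can be compressed individually once it is closed.

const BLOCK_LEN: usize = 128;
const NO_NEXT: u32 = u32::MAX;

// One node in a term's postings chain. All blocks live in one shared
// Vec<Block> arena; `next` is an index into that arena.
struct Block {
    doc_ids: [u32; BLOCK_LEN],
    len: usize, // number of doc ids currently filled in
    next: u32,  // NO_NEXT terminates the chain
}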

Add a macro to define documents simply.

This would be handy for unit tests for instance.

Maybe something like:

doc!( field1=>"toto", field2=>5, )

Value implements From<String> and From<u32>, so that should be feasible.
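
A sketch of such a macro (tantivy did later ship a real doc! macro; add_field_value is assumed here to accept any value convertible into Value):

macro_rules! doc {
    ( $( $field:expr => $value:expr ),* $(,)? ) => {{
        let mut document = Document::default();
        $( document.add_field_value($field, $value); )*
        document
    }};
}

// Usage: let d = doc!(field1 => "toto", field2 => 5u32);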

Reintroduce query explain

The current query explain uses term1, term2, etc. instead of the actual terms.

Using the actual terms would be nicer.
