
tantivy's Introduction

Docs · Build Status · codecov · Chat on Discord: https://discord.gg/MT27AG5EVE · License: MIT · Crates.io

Tantivy is a fast full-text search engine library written in Rust.

If you are looking for an alternative to Elasticsearch or Apache Solr, check out Quickwit, our distributed search engine built on top of Tantivy.

Tantivy is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is not an off-the-shelf search engine server, but rather a crate that can be used to build such a search engine.

Tantivy is, in fact, strongly inspired by Lucene's design.

Benchmark

The following benchmark breaks down performance for different types of queries and collections.

Your mileage WILL vary depending on the nature of queries and their load.

Details about the benchmark can be found at this repository.

Features

  • Full-text search
  • Configurable tokenizer (stemming available for 17 Latin languages) with third-party support for Chinese (tantivy-jieba and cang-jie), Japanese (lindera, Vaporetto, and tantivy-tokenizer-tiny-segmenter), and Korean (lindera + lindera-ko-dic-builder)
  • Fast (check out the 🐎 ✨ benchmark ✨ 🐎)
  • Tiny startup time (<10ms), perfect for command-line tools
  • BM25 scoring (the same as Lucene)
  • Natural query language (e.g. (michael AND jackson) OR "king of pop")
  • Phrase queries search (e.g. "michael jackson")
  • Incremental indexing
  • Multithreaded indexing (indexing English Wikipedia takes < 3 minutes on my desktop)
  • Mmap directory
  • SIMD integer compression when the platform/CPU includes the SSE2 instruction set
  • Single valued and multivalued u64, i64, and f64 fast fields (equivalent of doc values in Lucene)
  • &[u8] fast fields
  • Text, i64, u64, f64, dates, IP addresses, bool, and hierarchical facet fields
  • Compressed document store (LZ4, Zstd, None)
  • Range queries
  • Faceted search
  • Configurable indexing (optional term frequency and position indexing)
  • JSON Field
  • Aggregation Collector: histogram, range buckets, average, and stats metrics
  • LogMergePolicy with deletes
  • Searcher Warmer API
  • Cheesy logo with a horse

Non-features

Distributed search is out of the scope of Tantivy, but if you are looking for this feature, check out Quickwit.

Getting started

Tantivy works on stable Rust and supports Linux, macOS, and Windows.
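
Below is a minimal end-to-end sketch that follows the shape of tantivy's own examples; exact signatures (for instance, whether add_document returns a Result) vary between versions:

use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    // Define a schema with a single stored, tokenized text field.
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let schema = schema_builder.build();

    // Index one document in RAM with a 50 MB indexing budget.
    let index = Index::create_in_ram(schema);
    let mut writer = index.writer(50_000_000)?;
    writer.add_document(doc!(title => "The Old Man and the Sea"))?;
    writer.commit()?;

    // Parse a query and collect the top 10 hits.
    let reader = index.reader()?;
    let searcher = reader.searcher();
    let query = QueryParser::for_index(&index, vec![title]).parse_query("sea")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    println!("{} document(s) matched", top_docs.len());
    Ok(())
}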

How can I support this project?

There are many ways to support this project.

  • Use Tantivy and tell us about your experience on Discord or by email ([email protected])
  • Report bugs
  • Write a blog post
  • Help with documentation by asking questions or submitting PRs
  • Contribute code (you can join our Discord server)
  • Talk about Tantivy around you

Contributing code

We use the GitHub Pull Request workflow: reference a GitHub ticket and/or include a comprehensive commit message when opening a PR. Feel free to update CHANGELOG.md with your contribution.

Tokenizer

When implementing a tokenizer for Tantivy, depend on the tantivy-tokenizer-api crate.

Clone and build locally

Tantivy compiles on stable Rust. To check out and run tests, you can simply run:

git clone https://github.com/quickwit-oss/tantivy.git
cd tantivy
cargo test

Companies Using Tantivy

Etsy · Nuclia · Humanfirst.ai · Element.io

FAQ

Can I use Tantivy in other languages?

Yes. Python bindings are available as tantivy-py, and you can also find bindings for other languages on GitHub, but they may be less maintained.

What are some examples of Tantivy use?

  • seshat: A matrix message database/indexer
  • tantiny: Tiny full-text search for Ruby
  • lnx: an adaptable, typo-tolerant search engine with a REST API
  • and more!

On average, how much faster is Tantivy compared to Lucene?

  • It depends on the workload. See the benchmark above; your mileage will vary with the nature of the queries and their load.

Does tantivy support incremental indexing?

  • Yes.

How can I edit documents?

  • Data in tantivy is immutable. To edit a document, it must be deleted and reindexed (see the sketch below).
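
A sketch of that pattern, assuming an id field used as a unique key (field and variable names are illustrative):

use tantivy::{doc, Term};

// Delete the old version by its unique id, then add the new one.
let id_term = Term::from_field_text(id_field, "doc-42");
index_writer.delete_term(id_term);
index_writer.add_document(doc!(
    id_field => "doc-42",
    title_field => "Updated title",
))?;
index_writer.commit()?; // the delete and the add become visible together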

When will my documents be searchable during indexing?

  • Documents become searchable after commit is called on the IndexWriter. Existing IndexReaders also need to be reloaded to pick up the changes, and the changes are only visible to newly acquired Searchers (see the sketch below).
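
A sketch of the commit-then-reload sequence (a reader can also reload automatically, depending on its reload policy):

index_writer.commit()?;           // persist and publish the changes
reader.reload()?;                 // point the reader at the new segments
let searcher = reader.searcher(); // this Searcher sees the new documents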

tantivy's People

Contributors

adamreichold, appaquet, barrotsteindev, boraarslan, currymj, dependabot-preview[bot], dependabot[bot], drusellers, evanxg852000, fmassot, fulmicoton, guilload, halvorboe, jason-wolfe, k-yomo, kodraus, kryesh, lengyijun, lnicola, petr-tik, ppodolsky, pseitz, rihardsk, robyoung, saroh, shikhar, trinity-1686a, vigneshsarma, vishalsodani, waywardmonkeys


tantivy's Issues

Geo-search

This would require a lat/long field type (and plain coordinates if we want simple 2D as well).

Then the user would want to perform queries such as:

  • return documents at a distance of less than X km
  • return documents within this polygon

Add auto-publish

Publishing and committing don't have to be the same operation.

In many use cases, users want documents to be searchable as soon as possible. Let's call making a segment ready for search "publishing".

Committing, on the other hand, is about ordering consistency (after a commit, a document is searchable iff it was added before the commit) and, more importantly, persistence.

Parse multivalues

The current JSON parser does not allow for multivalues.

{
 "author": ["Les Paul", "Mary Ford"]
}

should be parsed correctly
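
For illustration, a sketch of the desired expansion using serde_json (this is not the actual parser):

use serde_json::Value as JsonValue;

// Expand a JSON value into one field value per array element.
fn field_values(json: &JsonValue) -> Vec<String> {
    match json {
        JsonValue::Array(items) => items
            .iter()
            .filter_map(|item| item.as_str().map(String::from))
            .collect(),
        JsonValue::String(s) => vec![s.clone()],
        _ => Vec::new(),
    }
}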

Stop copying Schema around

Schema is copied quite often.

Maybe adding a property to Index and wrapping the Schema in an Arc could be a solution, as sketched below.
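
A sketch of the suggested fix (method names are illustrative):

use std::sync::Arc;
use tantivy::schema::Schema;

struct Index {
    // One shared Schema instead of a copy per consumer.
    schema: Arc<Schema>,
}

impl Index {
    fn schema(&self) -> Arc<Schema> {
        Arc::clone(&self.schema) // cheap reference-count bump, no deep clone
    }
}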

LogMergePolicy

Implement a merge policy inspired by LogMergePolicy from Lucene.

Segment size will be determined by the number of documents.

Unless I am mistaken, this merge policy should not need to store any state.
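
A stateless sketch of the level computation, assuming segment size is measured in documents (illustrative, not Lucene's actual code):

use std::collections::BTreeMap;

// Bucket segments by the order of magnitude of their document count;
// any bucket holding at least `merge_factor` segments yields one merge.
fn merge_candidates(num_docs: &[u32], merge_factor: usize) -> Vec<Vec<usize>> {
    let mut levels: BTreeMap<u32, Vec<usize>> = BTreeMap::new();
    for (seg, &docs) in num_docs.iter().enumerate() {
        let level = (docs.max(1) as f64).log10().floor() as u32;
        levels.entry(level).or_default().push(seg);
    }
    levels
        .into_values()
        .filter(|segments| segments.len() >= merge_factor)
        .collect()
}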

File lifetime

Currently we do not handle removing segments.
This is a bit tricky because the files (or rather the ReadOnlySources) involved in a search must outlive the searcher.

Find a way to cascade writes to disk

Currently, each indexing thread writes to disk when its indexing buffer is full. Since the threads start in sync, they tend to spill their segments all at the same time.
This is suboptimal: the disk is idle for a while, and is then asked to write a huge amount of data concurrently.

Implement deletes

Until now, tantivy only handled adding documents. This issue is about adding deletes. Provided a unique id term exists in the schema, an update can then be done on the application side by expanding it into a delete and an add.

Since adding a document will no longer be the only operation, "docstamp" needs to be renamed "opstamp". For each segment, deletes are represented as a tombstone .del file that stores the bitset of deleted documents.

For this first iteration, only terms can be deleted. When a term is added for deletion, it is
appended to the delete queue without being processed.

Committed segments

Upon commit, all previously committed segments have their deletes up to date as of the last commit opstamp. Their .del files therefore need to be updated up to the new commit opstamp.

Uncommitted segments

These uncommitted segments carry a piece of transient information: the opstamp up to which deletes have been processed, which is not necessarily the last commit opstamp.

(All of this information being transient is fine, as uncommitted segments are by nature transient. If the indexing process crashes and restarts, indexing must resume from the last commit opstamp, and there is no way to reuse the uncommitted segments.)

When a segment being written gets closed, the deletes up to the last opstamp are read to build a bitset of the deleted documents and a mapping of the compacted doc ids, so the finalized and flushed segment already takes the deletes into account.

Because deletes should only apply to documents that were added before the delete, we keep a transient, monotonic mapping from doc id to opstamp. Identifying the documents affected by a delete is then a matter of binary-searching the delete opstamp in this mapping: find the smallest doc id with a greater opstamp, read the inverted list from the segment writer, and mark as deleted all matching doc ids below this limit.

The delete opstamp at this point is the last opstamp of the segment.
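
A sketch of that binary search, where doc_opstamps[doc] is the opstamp at which doc was added (illustrative):

// Returns the first doc id added at or after the delete operation;
// matching doc ids strictly below this limit are the ones to delete.
fn delete_limit(doc_opstamps: &[u64], delete_opstamp: u64) -> usize {
    // The mapping is monotonic, so partition_point performs a binary search.
    doc_opstamps.partition_point(|&opstamp| opstamp < delete_opstamp)
}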

Merging segments

When merging uncommitted segments, all of the segments' tombstone bitsets are updated to a common opstamp. (When merging committed segments, they are already at the same opstamp, so this is not required.)
Similarly to what is done when writing segments, we create a map of old doc id -> new doc id and physically remove the deleted doc ids.
After the merged segment is flushed, we check whether its delete opstamp is up to date with the committed delete opstamp. If not, we compute and create the tombstone file accordingly.

Dumping delete terms from memory

The delete queue is a linear queue over which we can easily iterate, with sections delimited by opstamps.
The timing at which the queue can be purged is non-trivial, as its last consumer can be the commit or a currently running merge.

Clients of the delete queue can subscribe to it and declare that they will need to consume it from a given starting point in the future.

For the first version, there will be no mechanism to purge the delete queue before a commit when the queue grows too big.

Make sure IndexWriter "locks" the directory

There should be only one IndexWriter working at a time (whether within our process or in another one). We should probably add a lockfile system to prevent people from making this mistake.
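
A minimal sketch of such a lockfile, relying on the atomicity of create_new (the file name is illustrative):

use std::fs::{File, OpenOptions};
use std::io;
use std::path::Path;

// Succeeds only if no other writer already holds the lock.
fn acquire_writer_lock(index_dir: &Path) -> io::Result<File> {
    OpenOptions::new()
        .write(true)
        .create_new(true) // fails with AlreadyExists if the file is present
        .open(index_dir.join(".tantivy-writer.lock"))
}

A real implementation would also need to deal with stale lock files left behind by a crashed process.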

Implement PhraseQuery

The current positions encoding is very slow, but we should already be able to implement PhraseQuery on top of it.

Reproducible benchmark against Lucene

It would be great to track our performance periodically and ideally compare it to Lucene's.

Maybe using the same dataset as
http://home.apache.org/~mikemccand/lucenebench/

Add merge policy

Currently, users are expected to merge segments manually. It would be nice to have some kind of merge strategy that works out of the box.

Add a lenient mode to QueryParser

The role of the QueryParser is to give an off-the-shelf solution to people who want to put a search box in front of end users.
Currently, QueryParser::parse can return an error if the query does not match our query grammar.

e.g.: +(happy : the parenthesis is not closed.

This is great, but how should somebody building a site handle this error? Maybe for a log-analysis search engine, displaying an explicit syntax error is a good idea, but for most use cases (e-commerce, document search, etc.) we never want to return an error; we should just do our best to handle the user's query.

For this reason, we want to add a new parse_query_lenient method that never fails and always returns a result.

pub fn parse_query_lenient(&self, query: &str) -> Box<dyn Query> {
    // ...
}

The behavior of this lenient mode does NOT have to be very smart. For instance, it could be a second, very naive parser that takes over when the initial parser fails: it could simply remove all special characters and re-run the initial parser.

Smarter attempts to interpret the faulty user query are also welcome, as long as we return a Box<dyn Query> and we are confident the code will never panic.
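
A sketch of that naive fallback (illustrative; EmptyQuery serves as a never-failing last resort):

use tantivy::query::{EmptyQuery, Query, QueryParser};

fn parse_query_lenient(parser: &QueryParser, query: &str) -> Box<dyn Query> {
    parser.parse_query(query).unwrap_or_else(|_| {
        // Second, very naive pass: blank out special characters and retry.
        let sanitized: String = query
            .chars()
            .map(|c| if "+-()\"^:".contains(c) { ' ' } else { c })
            .collect();
        parser
            .parse_query(&sanitized)
            // Guarantee a result rather than a panic.
            .unwrap_or_else(|_| Box::new(EmptyQuery))
    })
}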

Index frequent bigrams

High-frequency words like "the" hurt query performance.

They are, most of the time, not queried alone.
We should build an analyzer that generates the right tokens for indexing and for querying (a sketch follows the examples below).

In query

 "the lord of the ring" -> "the lord", "of", "the", "ring"

In index

 "the lord of the ring" -> [ ["the", "lord"], "the lord"], "of", "the", "ring"

Rethink searchers

They exist to keep segments alive for as long as queries are running on them.
The API should allow the index to advertise new segments rapidly.

Compilation error: missing simdcomp

running: "c++" "-O0" "-ffunction-sections" "-fdata-sections" "-g" "-m64" "-fPIC" "-I" "./cpp/simdcomp/include" "-std=c++11" "-O3" "-mssse3" "-o" "/home/maciej/git/tantivy/target/debug/build/tantivy-c843e95d6308187a/out/cpp/simdcomp_wrapper.o" "-c" "cpp/simdcomp_wrapper.cpp"
cargo:warning=cpp/simdcomp_wrapper.cpp:4:22: fatal error: simdcomp.h: No such file or directory
cargo:warning=compilation terminated.
ExitStatus(ExitStatus(256))

How should I install it? Documentation doesn't mention it.

Replace the `Vec`s in SegmentWriter

Vec is probably not the best data structure to store the postings in the segment writer.
What we probably want is a linked list of blocks, as sketched below.

Extra points if the blocks themselves are compressed every time they are "closed".
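
A sketch of the suggested layout (illustrative): fixed-size blocks chained through a shared arena, so that each block can be compressed individually once it is closed.

const BLOCK_LEN: usize = 128;
const NO_NEXT: u32 = u32::MAX;

// One node in a term's postings chain. All blocks live in one shared
// Vec<Block> arena; `next` is an index into that arena.
struct Block {
    doc_ids: [u32; BLOCK_LEN],
    len: usize, // number of doc ids currently filled in
    next: u32,  // NO_NEXT terminates the chain
}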

Add a macro to define documents simply.

This would be handy for unit tests for instance.

Maybe something like:

doc!( field1=>"toto", field2=>5, )

Value implements From<String> and From<u32>, so that should be feasible.
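
A sketch of such a macro (tantivy did later ship a real doc! macro; add_field_value is assumed here to accept any value convertible into Value):

macro_rules! doc {
    ( $( $field:expr => $value:expr ),* $(,)? ) => {{
        let mut document = Document::default();
        $( document.add_field_value($field, $value); )*
        document
    }};
}

// Usage: let d = doc!(field1 => "toto", field2 => 5u32);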

Reintroduce query explain

The current query explain uses term1, term2, etc. instead of the actual terms.

Using the actual terms would be nicer.
