
tantivy-tokenizer's Introduction

This is a fork of the official Tantivy package which only keeps the tokenizer, to be used in other projects.

Join the chat at https://discord.gg/MT27AG5EVE. License: MIT.

Tantivy

Tantivy is a full-text search engine library written in Rust.

It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense that it is not an off-the-shelf search engine server, but rather a crate that can be used to build such a search engine.

Tantivy is, in fact, strongly inspired by Lucene's design.

If you are looking for an alternative to Elasticsearch or Apache Solr, check out Quickwit, our search engine built on top of Tantivy.

Benchmark

The following benchmark breaks down performance for different types of queries/collections.

Your mileage WILL vary depending on the nature of queries and their load.

Features

  • Full-text search
  • Configurable tokenizer (stemming available for 17 Latin languages, with third-party support for Chinese (tantivy-jieba and cang-jie), Japanese (lindera, Vaporetto, and tantivy-tokenizer-tiny-segmenter) and Korean (lindera + lindera-ko-dic-builder))
  • Fast (check out the 🐎 ✨ benchmark ✨ 🐎)
  • Tiny startup time (<10ms), perfect for command-line tools
  • BM25 scoring (the same as Lucene)
  • Natural query language (e.g. (michael AND jackson) OR "king of pop")
  • Phrase queries search (e.g. "michael jackson")
  • Incremental indexing
  • Multithreaded indexing (indexing English Wikipedia takes < 3 minutes on my desktop)
  • Mmap directory
  • SIMD integer compression when the platform/CPU includes the SSE2 instruction set
  • Single valued and multivalued u64, i64, and f64 fast fields (equivalent of doc values in Lucene)
  • &[u8] fast fields
  • Text, i64, u64, f64, dates, and hierarchical facet fields
  • LZ4 compressed document store
  • Range queries
  • Faceted search
  • Configurable indexing (optional term frequency and position indexing)
  • JSON Field
  • Aggregation Collector: range buckets, average, and stats metrics
  • LogMergePolicy with deletes
  • Searcher Warmer API
  • Cheesy logo with a horse
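
To make the "configurable tokenizer" item above more concrete, here is a minimal, standard-library-only sketch of what an analysis pipeline does: split text into tokens, drop over-long tokens, and lowercase the rest. The function name `analyze` and the `max_len` parameter are hypothetical and chosen for illustration; this is not the crate's API, and it leaves out stemming entirely.

```rust
/// Toy analysis pipeline (hypothetical helper, NOT the crate's API).
/// Steps: split on non-alphanumeric characters, drop tokens longer than
/// `max_len` bytes, then lowercase what is left.
fn analyze(text: &str, max_len: usize) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        // Consecutive separators produce empty pieces; skip them.
        .filter(|raw| !raw.is_empty())
        // Length is measured in bytes, not in characters.
        .filter(|raw| raw.len() <= max_len)
        .map(|raw| raw.to_lowercase())
        .collect()
}
```

For example, `analyze("Hello, World! Rust-lang", 40)` yields the tokens `hello`, `world`, `rust`, and `lang`. A real pipeline would typically chain further filters, such as a stemmer, after the lowercasing step.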

Non-features

Distributed search is out of the scope of Tantivy, but if you are looking for this feature, check out Quickwit.

Getting started

Tantivy works on stable Rust and supports Linux, macOS, and Windows.

How can I support this project?

There are many ways to support this project.

  • Use Tantivy and tell us about your experience on Discord or by email ([email protected])
  • Report bugs
  • Write a blog post
  • Help with documentation by asking questions or submitting PRs
  • Contribute code (you can join our Discord server)
  • Talk about Tantivy around you

Contributing code

We use the GitHub Pull Request workflow: reference a GitHub ticket and/or include a comprehensive commit message when opening a PR.

Minimum supported Rust version

Tantivy currently requires Rust 1.62 or later to compile.

Clone and build locally

Tantivy compiles on stable Rust. To check out the source and build it, run:

    git clone https://github.com/quickwit-oss/tantivy.git
    cd tantivy
    cargo build

Run tests

Some tests will not run with just cargo test because of fail-rs. To run the tests exhaustively, run ./run-tests.sh.

Debug

You might find it useful to step through the programme with a debugger.

A failing test

Make sure you haven't run cargo clean after the most recent cargo test or cargo build to guarantee that the target/ directory exists. Use this bash script to find the name of the most recent debug build of Tantivy and run it under rust-gdb:

    find target/debug/ -maxdepth 1 -executable -type f -name "tantivy*" -printf '%TY-%Tm-%Td %TT %p\n' | sort -r | cut -d " " -f 3 | xargs -I RECENT_DBG_TANTIVY rust-gdb RECENT_DBG_TANTIVY

Now that you are in rust-gdb, you can set breakpoints on lines and methods that match your source code and run the debug executable with flags that you normally pass to cargo test like this:

    $gdb run --test-threads 1 --test $NAME_OF_TEST

An example

By default, Cargo compiles everything in the examples/ directory in debug mode. This makes it easy to write examples that reproduce bugs:

    rust-gdb target/debug/examples/$EXAMPLE_NAME
    $gdb run

Companies Using Tantivy

Etsy, Nuclia, Humanfirst.ai, Element.io

FAQ

Can I use Tantivy in other languages?

You can find bindings for other languages on GitHub, but they may be less actively maintained.

What are some examples of Tantivy use?

  • seshat: A matrix message database/indexer
  • tantiny: Tiny full-text search for Ruby
  • lnx: adaptable, typo tolerant search engine with a REST API
  • and more!

On average, how much faster is Tantivy compared to Lucene?

Does tantivy support incremental indexing?

  • Yes.

How can I edit documents?

  • Data in tantivy is immutable. To edit a document, the document needs to be deleted and reindexed.
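
The delete-then-reindex pattern described above can be pictured with a small, standard-library-only toy model. All names here (`ToyIndex`, `add`, `delete`, `update`, `live`) are hypothetical and exist only for illustration; this is not Tantivy's API or storage format, just a sketch of why an "edit" turns into a tombstone plus a new entry when stored data is never rewritten.

```rust
/// Toy append-only store (hypothetical, NOT Tantivy's API).
/// Entries are never modified in place; deletes only set a tombstone.
struct ToyIndex {
    docs: Vec<(u64, String)>, // (user-supplied id, body), append-only
    deleted: Vec<bool>,       // one tombstone flag per appended entry
}

impl ToyIndex {
    fn new() -> Self {
        ToyIndex { docs: Vec::new(), deleted: Vec::new() }
    }

    /// Append a new entry; existing entries are left untouched.
    fn add(&mut self, id: u64, body: &str) {
        self.docs.push((id, body.to_string()));
        self.deleted.push(false);
    }

    /// Mark every entry carrying this id as deleted.
    fn delete(&mut self, id: u64) {
        for (i, (doc_id, _)) in self.docs.iter().enumerate() {
            if *doc_id == id {
                self.deleted[i] = true;
            }
        }
    }

    /// An "edit" is a delete followed by a fresh add.
    fn update(&mut self, id: u64, new_body: &str) {
        self.delete(id);
        self.add(id, new_body);
    }

    /// Bodies of all entries that have not been tombstoned.
    fn live(&self) -> Vec<&str> {
        self.docs
            .iter()
            .zip(&self.deleted)
            .filter(|(_, dead)| !**dead)
            .map(|((_, body), _)| body.as_str())
            .collect()
    }
}
```

After `add(1, "old text")`, `add(2, "other")`, and `update(1, "new text")`, the toy store physically holds three entries, but only `other` and `new text` are live.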

When will my documents be searchable during indexing?

  • Documents become searchable after commit is called on an IndexWriter. Existing IndexReaders also need to be reloaded to reflect the changes. Finally, changes are only visible to newly acquired Searcher instances.
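
The three visibility rules above (commit, reload, newly acquired searcher) can be sketched as a toy snapshot model using only the standard library. The types `ToyWriter`, `ToyReader`, and `ToySearcher` are hypothetical stand-ins invented for this illustration; they are not Tantivy's IndexWriter, IndexReader, or Searcher, and real readers may be configured to reload differently.

```rust
use std::sync::Arc;

/// Toy model of snapshot visibility (hypothetical types, NOT Tantivy's API).
struct ToyWriter {
    pending: Vec<String>,        // added but not yet committed
    committed: Arc<Vec<String>>, // latest published snapshot
}

struct ToyReader {
    current: Arc<Vec<String>>, // snapshot this reader last loaded
}

struct ToySearcher {
    snapshot: Arc<Vec<String>>, // fixed for the searcher's lifetime
}

impl ToyWriter {
    fn new() -> Self {
        ToyWriter { pending: Vec::new(), committed: Arc::new(Vec::new()) }
    }

    fn add(&mut self, doc: &str) {
        self.pending.push(doc.to_string());
    }

    /// Publish pending documents as a new immutable snapshot.
    fn commit(&mut self) {
        let mut next = (*self.committed).clone();
        next.append(&mut self.pending);
        self.committed = Arc::new(next);
    }
}

impl ToyReader {
    fn open(writer: &ToyWriter) -> Self {
        ToyReader { current: Arc::clone(&writer.committed) }
    }

    /// Pick up the latest committed snapshot.
    fn reload(&mut self, writer: &ToyWriter) {
        self.current = Arc::clone(&writer.committed);
    }

    /// A searcher keeps whatever snapshot the reader held when it was acquired.
    fn searcher(&self) -> ToySearcher {
        ToySearcher { snapshot: Arc::clone(&self.current) }
    }
}

impl ToySearcher {
    fn num_docs(&self) -> usize {
        self.snapshot.len()
    }
}
```

In this model, a document added to the writer is invisible until `commit` runs, stays invisible to a reader until `reload` is called, and never appears in a searcher that was acquired before the reload.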

tantivy-tokenizer's People

Contributors

fulmicoton, pseitz, lnicola, dependabot[bot], evanxg852000, currymj, saroh, waywardmonkeys, boraarslan, petr-tik, k-yomo, kodraus, drusellers, barrotsteindev, vigneshsarma, adamreichold, trinity-1686a, dependabot-preview[bot], shikhar, guilload, rihardsk, jason-wolfe, kryesh, fmassot, lengyijun, appaquet, ppodolsky, robyoung, halvorboe, vishalsodani
