
anserini's Introduction

Anserini


Anserini is a toolkit for reproducible information retrieval research. By building on Lucene, we aim to bridge the gap between academic information retrieval research and the practice of building real-world search applications. Among other goals, our effort aims to be the opposite of this.* Anserini grew out of a reproducibility study of various open-source retrieval engines in 2016 (Lin et al., ECIR 2016). See Yang et al. (SIGIR 2017) and Yang et al. (JDIQ 2018) for overviews.

❗ Anserini was upgraded from JDK 11 to JDK 21 at commit 272565 (2024/04/03), which corresponds to the release of v0.35.0.

πŸ’₯ Try It!

Anserini is packaged in a self-contained fatjar, which also provides the simplest way to get started. Assuming you've already got Java installed, fetch the fatjar:

wget https://repo1.maven.org/maven2/io/anserini/anserini/0.36.1/anserini-0.36.1-fatjar.jar

The following commands generate a SPLADE++ ED run with the dev queries (encoded using ONNX) on the MS MARCO passage corpus:

java -cp anserini-0.36.1-fatjar.jar io.anserini.search.SearchCollection \
  -index msmarco-v1-passage.splade-pp-ed \
  -topics msmarco-v1-passage.dev \
  -encoder SpladePlusPlusEnsembleDistil \
  -output run.msmarco-v1-passage-dev.splade-pp-ed-onnx.txt \
  -impact -pretokenized

To evaluate:

java -cp anserini-0.36.1-fatjar.jar trec_eval -c -M 10 -m recip_rank msmarco-passage.dev-subset run.msmarco-v1-passage-dev.splade-pp-ed-onnx.txt
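Both commands operate on a run file in the standard six-column TREC format (qid, Q0, docid, rank, score, run tag); a minimal illustration, with hypothetical docids and scores:

```python
# Sketch of the six-column TREC run format that SearchCollection writes
# and trec_eval consumes. The docids and scores below are made up.
#   qid  Q0  docid  rank  score  run_tag
run_lines = [
    "1048585 Q0 7187158 1 21.532100 Anserini",
    "1048585 Q0 7187157 2 20.174900 Anserini",
]
for line in run_lines:
    qid, q0, docid, rank, score, tag = line.split()  # exactly six fields
    assert q0 == "Q0"  # the second column is always the literal "Q0"
```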

See detailed instructions for the current fatjar release of Anserini (v0.36.1) to reproduce regression experiments on the MS MARCO V2.1 corpora for TREC 2024 RAG, on MS MARCO V1 Passage, and on BEIR, all directly from the fatjar!

Older instructions

🎬 Installation

Most Anserini features are exposed in the Pyserini Python interface. If you're more comfortable with Python, start there; that said, Anserini is an important building block of Pyserini, so it remains worthwhile to learn how Anserini works.

You'll need Java 21 and Maven 3.9+ to build Anserini. Clone our repo with the --recurse-submodules option to make sure the eval/ submodule also gets cloned (alternatively, use git submodule update --init). Then, build using Maven:

mvn clean package

The tools/ directory, which contains evaluation tools and other scripts, is actually this repo, integrated as a Git submodule (so that it can be shared across related projects). Build as follows (you might see warnings, which are safe to ignore):

cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
cd tools/eval/ndeval && make && cd ../../..

With that, you should be ready to go. The onboarding path for Anserini starts here!

Windows tips

If you are using Windows, please use WSL2 to build Anserini; refer to the WSL2 Installation document to install WSL2 if you haven't already.

Note that on Windows without WSL2, tests may fail due to encoding issues; see #1466. A simple workaround is to skip tests by adding -Dmaven.test.skip=true to the mvn command above. See #1121 for additional discussion of debugging Windows build errors.

βš—οΈ End-to-End Regression Experiments

Anserini is designed to support end-to-end experiments on various standard IR test collections out of the box. Each of these end-to-end regressions starts from the raw corpus, builds the necessary index, performs retrieval runs, and generates evaluation results. See individual pages for details.

MS MARCO V1 Passage Regressions

| | dev | DL19 | DL20 |
|---|---|---|---|
| Unsupervised Sparse | | | |
| Lucene BoW baselines | πŸ”‘ | πŸ”‘ | πŸ”‘ |
| Quantized BM25 | πŸ”‘ | πŸ”‘ | πŸ”‘ |
| WordPiece baselines (pre-tokenized) | πŸ”‘ | πŸ”‘ | πŸ”‘ |
| WordPiece baselines (Huggingface) | πŸ”‘ | πŸ”‘ | πŸ”‘ |
| WordPiece + Lucene BoW baselines | πŸ”‘ | πŸ”‘ | πŸ”‘ |
| doc2query | πŸ”‘ | | |
| doc2query-T5 | πŸ”‘ | πŸ”‘ | πŸ”‘ |
| Learned Sparse (uniCOIL family) | | | |
| uniCOIL noexp | πŸ«™ | πŸ«™ | πŸ«™ |
| uniCOIL with doc2query-T5 | πŸ«™ | πŸ«™ | πŸ«™ |
| uniCOIL with TILDE | πŸ«™ | | |
| Learned Sparse (other) | | | |
| DeepImpact | πŸ«™ | | |
| SPLADEv2 | πŸ«™ | | |
| SPLADE++ CoCondenser-EnsembleDistil | πŸ«™πŸ…ΎοΈ | πŸ«™πŸ…ΎοΈ | πŸ«™πŸ…ΎοΈ |
| SPLADE++ CoCondenser-SelfDistil | πŸ«™πŸ…ΎοΈ | πŸ«™πŸ…ΎοΈ | πŸ«™πŸ…ΎοΈ |
| Learned Dense (HNSW indexes) | | | |
| cosDPR-distil | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| BGE-base-en-v1.5 | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| OpenAI Ada2 | full:πŸ«™ int8:πŸ«™ | full:πŸ«™ int8:πŸ«™ | full:πŸ«™ int8:πŸ«™ |
| Cohere English v3.0 | full:πŸ«™ int8:πŸ«™ | full:πŸ«™ int8:πŸ«™ | full:πŸ«™ int8:πŸ«™ |
| Learned Dense (flat indexes) | | | |
| cosDPR-distil | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| BGE-base-en-v1.5 | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| OpenAI Ada2 | full:πŸ«™ int8:πŸ«™ | full:πŸ«™ int8:πŸ«™ | full:πŸ«™ int8:πŸ«™ |
| Cohere English v3.0 | full:πŸ«™ int8:πŸ«™ | full:πŸ«™ int8:πŸ«™ | full:πŸ«™ int8:πŸ«™ |
| Learned Dense (inverted; experimental) | | | |
| cosDPR-distil w/ "fake words" | πŸ«™ | πŸ«™ | πŸ«™ |
| cosDPR-distil w/ "LexLSH" | πŸ«™ | πŸ«™ | πŸ«™ |

Key:

  • πŸ”‘ = keyword queries
  • "full" = full 32-bit floating precision
  • "int8" = quantized 8-bit precision
  • πŸ«™ = cached queries, πŸ…ΎοΈ = query encoding with ONNX

Available Corpora for Download

| Corpora | Size | Checksum |
|---|---|---|
| Quantized BM25 | 1.2 GB | 0a623e2c97ac6b7e814bf1323a97b435 |
| uniCOIL (noexp) | 2.7 GB | f17ddd8c7c00ff121c3c3b147d2e17d8 |
| uniCOIL (d2q-T5) | 3.4 GB | 78eef752c78c8691f7d61600ceed306f |
| uniCOIL (TILDE) | 3.9 GB | 12a9c289d94e32fd63a7d39c9677d75c |
| DeepImpact | 3.6 GB | 73843885b503af3c8b3ee62e5f5a9900 |
| SPLADEv2 | 9.9 GB | b5d126f5d9a8e1b3ef3f5cb0ba651725 |
| SPLADE++ CoCondenser-EnsembleDistil | 4.2 GB | e489133bdc54ee1e7c62a32aa582bc77 |
| SPLADE++ CoCondenser-SelfDistil | 4.8 GB | cb7e264222f2bf2221dd2c9d28190be1 |
| cosDPR-distil | 57 GB | e20ffbc8b5e7f760af31298aefeaebbd |
| BGE-base-en-v1.5 | 59 GB | 353d2c9e72e858897ad479cca4ea0db1 |
| OpenAI-ada2 | 109 GB | a4d843d522ff3a3af7edbee789a63402 |
| Cohere embed-english-v3.0 | 38 GB | 06a6e38a0522850c6aa504db7b2617f5 |
MS MARCO V1 Document Regressions

| | dev | DL19 | DL20 |
|---|---|---|---|
| Unsupervised Lexical, Complete Doc* | | | |
| Lucene BoW baselines | + | + | + |
| WordPiece baselines (pre-tokenized) | + | + | + |
| WordPiece baselines (Huggingface tokenizer) | + | + | + |
| WordPiece + Lucene BoW baselines | + | + | + |
| doc2query-T5 | + | + | + |
| Unsupervised Lexical, Segmented Doc* | | | |
| Lucene BoW baselines | + | + | + |
| WordPiece baselines (pre-tokenized) | + | + | + |
| WordPiece + Lucene BoW baselines | + | + | + |
| doc2query-T5 | + | + | + |
| Learned Sparse Lexical | | | |
| uniCOIL noexp | βœ“ | βœ“ | βœ“ |
| uniCOIL with doc2query-T5 | βœ“ | βœ“ | βœ“ |

Available Corpora for Download

| Corpora | Size | Checksum |
|---|---|---|
| MS MARCO V1 doc: uniCOIL (noexp) | 11 GB | 11b226e1cacd9c8ae0a660fd14cdd710 |
| MS MARCO V1 doc: uniCOIL (d2q-T5) | 19 GB | 6a00e2c0c375cb1e52c83ae5ac377ebb |
MS MARCO V2 Passage Regressions

| | dev | DL21 | DL22 | DL23 |
|---|---|---|---|---|
| Unsupervised Lexical, Original Corpus | | | | |
| baselines | + | + | + | + |
| doc2query-T5 | + | + | + | + |
| Unsupervised Lexical, Augmented Corpus | | | | |
| baselines | + | + | + | + |
| doc2query-T5 | + | + | + | + |
| Learned Sparse Lexical | | | | |
| uniCOIL noexp zero-shot | βœ“ | βœ“ | βœ“ | βœ“ |
| uniCOIL with doc2query-T5 zero-shot | βœ“ | βœ“ | βœ“ | βœ“ |
| SPLADE++ CoCondenser-EnsembleDistil (cached queries) | βœ“ | βœ“ | βœ“ | βœ“ |
| SPLADE++ CoCondenser-EnsembleDistil (ONNX) | βœ“ | βœ“ | βœ“ | βœ“ |
| SPLADE++ CoCondenser-SelfDistil (cached queries) | βœ“ | βœ“ | βœ“ | βœ“ |
| SPLADE++ CoCondenser-SelfDistil (ONNX) | βœ“ | βœ“ | βœ“ | βœ“ |

Available Corpora for Download

| Corpora | Size | Checksum |
|---|---|---|
| uniCOIL (noexp) | 24 GB | d9cc1ed3049746e68a2c91bf90e5212d |
| uniCOIL (d2q-T5) | 41 GB | 1949a00bfd5e1f1a230a04bbc1f01539 |
| SPLADE++ CoCondenser-EnsembleDistil | 66 GB | 2cdb2adc259b8fa6caf666b20ebdc0e8 |
| SPLADE++ CoCondenser-SelfDistil | 76 GB | 061930dd615c7c807323ea7fc7957877 |
MS MARCO V2 Document Regressions

| | dev | DL21 | DL22 | DL23 |
|---|---|---|---|---|
| Unsupervised Lexical, Complete Doc | | | | |
| baselines | + | + | + | + |
| doc2query-T5 | + | + | + | + |
| Unsupervised Lexical, Segmented Doc | | | | |
| baselines | + | + | + | + |
| doc2query-T5 | + | + | + | + |
| Learned Sparse Lexical | | | | |
| uniCOIL noexp zero-shot | βœ“ | βœ“ | βœ“ | βœ“ |
| uniCOIL with doc2query-T5 zero-shot | βœ“ | βœ“ | βœ“ | βœ“ |

Available Corpora for Download

| Corpora | Size | Checksum |
|---|---|---|
| MS MARCO V2 doc: uniCOIL (noexp) | 55 GB | 97ba262c497164de1054f357caea0c63 |
| MS MARCO V2 doc: uniCOIL (d2q-T5) | 72 GB | c5639748c2cbad0152e10b0ebde3b804 |
MS MARCO V2.1 Document Regressions

The MS MARCO V2.1 corpora were derived from the V2 corpora for the TREC 2024 RAG Track. The experiments below use topics and qrels that were originally targeted at the V2 corpora but have been "projected" over to the V2.1 corpora.

| | dev | DL21 | DL22 | DL23 | RAGgy dev |
|---|---|---|---|---|---|
| Unsupervised Lexical, Complete Doc | | | | | |
| baselines | + | + | + | + | + |
| Unsupervised Lexical, Segmented Doc | | | | | |
| baselines | + | + | + | + | + |
BEIR (v1.0.0) Regressions

Key:

  • F1 = "flat" baseline (Lucene analyzer), keyword queries (πŸ”‘)
  • F2 = "flat" baseline (pre-tokenized with bert-base-uncased tokenizer), keyword queries (πŸ”‘)
  • MF = "multifield" baseline (Lucene analyzer), keyword queries (πŸ”‘)
  • U1 = uniCOIL (noexp), cached queries (πŸ«™)
  • S1 = SPLADE++ CoCondenser-EnsembleDistil: cached queries (πŸ«™), ONNX (πŸ…ΎοΈ)
  • BGE (flat) = BGE-base-en-v1.5 (flat indexes)
    • original (float32) indexes: cached queries (πŸ«™), ONNX (πŸ…ΎοΈ)
    • quantized (int8) indexes: cached queries (πŸ«™), ONNX (πŸ…ΎοΈ)
  • BGE (HNSW) = BGE-base-en-v1.5 (HNSW indexes)
    • original (float32) indexes: cached queries (πŸ«™), ONNX (πŸ…ΎοΈ)
    • quantized (int8) indexes: cached queries (πŸ«™), ONNX (πŸ…ΎοΈ)

See instructions below the table for how to reproduce results for a model on all BEIR corpora "in one go".

| Corpus | F1 | F2 | MF | U1 | S1 | BGE (flat) | BGE (HNSW) |
|---|---|---|---|---|---|---|---|
| TREC-COVID | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| BioASQ | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| NFCorpus | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| NQ | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| HotpotQA | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| FiQA-2018 | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| Signal-1M(RT) | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| TREC-NEWS | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| Robust04 | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| ArguAna | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| Touche2020 | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Android | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-English | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Gaming | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Gis | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Mathematica | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Physics | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Programmers | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Stats | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Tex | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Unix | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Webmasters | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Wordpress | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| Quora | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| DBPedia | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| SCIDOCS | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| FEVER | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| Climate-FEVER | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| SciFact | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |

To reproduce the SPLADE++ CoCondenser-EnsembleDistil results, start by downloading the collection:

wget https://rgw.cs.uwaterloo.ca/pyserini/data/beir-v1.0.0-splade-pp-ed.tar -P collections/
tar xvf collections/beir-v1.0.0-splade-pp-ed.tar -C collections/

The tarball is 42 GB and has MD5 checksum 9c7de5b444a788c9e74c340bf833173b. Once you've unpacked the data, the following commands will loop over all BEIR corpora and run the regressions:

MODEL="splade-pp-ed"; CORPORA=(trec-covid bioasq nfcorpus nq hotpotqa fiqa signal1m trec-news robust04 arguana webis-touche2020 cqadupstack-android cqadupstack-english cqadupstack-gaming cqadupstack-gis cqadupstack-mathematica cqadupstack-physics cqadupstack-programmers cqadupstack-stats cqadupstack-tex cqadupstack-unix cqadupstack-webmasters cqadupstack-wordpress quora dbpedia-entity scidocs fever climate-fever scifact); for c in "${CORPORA[@]}"
do
    echo "Running $c..."
    python src/main/python/run_regression.py --index --verify --search --regression beir-v1.0.0-${c}-${MODEL} > logs/log.beir-v1.0.0-${c}-${MODEL} 2>&1
done

You can verify the results by examining the log files in logs/.

For the other models, modify the above commands as follows:

| Key | Corpus | Checksum | MODEL |
|---|---|---|---|
| F1 | corpus | faefd5281b662c72ce03d22021e4ff6b | flat |
| F2 | corpus-wp | 3cf8f3dcdcadd49362965dd4466e6ff2 | flat-wp |
| MF | corpus | faefd5281b662c72ce03d22021e4ff6b | multifield |
| U1 | unicoil-noexp | 4fd04d2af816a6637fc12922cccc8a83 | unicoil-noexp |
| S1 | splade-pp-ed | 9c7de5b444a788c9e74c340bf833173b | splade-pp-ed |
| BGE | bge-base-en-v1.5 | e4e8324ba3da3b46e715297407a24f00 | bge-base-en-v1.5-hnsw |

The "Corpus" above should be substituted into the full file name beir-v1.0.0-${corpus}.tar, e.g., beir-v1.0.0-bge-base-en-v1.5.tar.
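A small helper makes the naming convention explicit (the function name here is ours, for illustration only):

```python
# Build the download file name from the "Corpus" column of the table above.
def beir_tarball(corpus: str) -> str:
    return f"beir-v1.0.0-{corpus}.tar"

print(beir_tarball("bge-base-en-v1.5"))  # beir-v1.0.0-bge-base-en-v1.5.tar
```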

Cross-lingual and Multi-lingual Regressions

Other Regressions

πŸ“ƒ Additional Documentation

The experiments described below are not associated with rigorous end-to-end regression testing and thus provide a lower standard of reproducibility. For the most part, manual copying and pasting of commands into a shell is required to reproduce our results.

MS MARCO V1

MS MARCO V2

TREC-COVID and CORD-19

Other Experiments and Features

πŸ™‹ How Can I Contribute?

If you've found Anserini to be helpful, we have a simple request: contribute back. In the course of reproducing baseline results on standard test collections, please let us know if you're successful by sending us a pull request with a simple note, like what appears at the bottom of the page for Disks 4 & 5. Reproducibility is important to us, and we'd like to know about successes as well as failures. Since the regression documentation is auto-generated, pull requests should be sent against the raw templates; the regression documentation can then be regenerated using the bin/build.sh script. In turn, you'll be recognized as a contributor.

Beyond that, there are always open issues we would appreciate help on!

πŸ“œοΈ Release History

older... (and historic notes)

πŸ“œοΈ Historical Notes

  • Anserini was upgraded to Lucene 9.3 at commit 272565 (8/2/2022): this upgrade created backward compatibility issues; see #1952. Anserini will automatically detect Lucene 8 indexes and disable consistent tie-breaking to avoid runtime errors. However, Lucene 9 code running on Lucene 8 indexes may give slightly different results than Lucene 8 code running on Lucene 8 indexes, and Lucene 8 code will not run on Lucene 9 indexes at all. Pyserini has also been upgraded, and the same caveats apply.
  • Anserini was upgraded to Java 11 from Java 8 at commit 17b702d (7/11/2019). Maven 3.3+ is also required.
  • Anserini was upgraded to Lucene 8.0 as of commit 75e36f9 (6/12/2019); prior to that, the toolkit used Lucene 7.6. Based on preliminary experiments, query evaluation latency is much improved in Lucene 8. As a result of this upgrade, results of all regressions have changed slightly. To reproduce old results from Lucene 7.6, use v0.5.1.

✨ References

πŸ™ Acknowledgments

This research is supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada. Previous support came from the U.S. National Science Foundation under IIS-1423002 and CNS-1405688. Any opinions, findings, and conclusions or recommendations expressed do not necessarily reflect the views of the sponsors.

anserini's People

Contributors

16bitnarwhal, adamyy, arthurchen189, borislin, chriskamphuis, dependabot[bot], edwinzhng, emmileaf, iorixxx, jasper-xian, jimmy0017, jmmackenzie, justram, kytabyte, lintool, luchentan, lukuang, mofetoluwa, mxueguang, nikhilro, peilin-yang, rodrigonogueira4, ronakice, rosequ, shaneding, stephaniewhoo, toluclassics, tteofili, victor0118, yuki617


anserini's Issues

Create RerankerCascade abstraction

It would make sense to create a RerankerCascade abstraction for running a sequence of rerankers. Something like:

RerankerCascade cascade = new RerankerCascade(context).add(foo).add(bar);
cascade.run(docs);
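The idea can be sketched in Python (a hypothetical API mirroring the chaining in the snippet above, not the actual Anserini implementation):

```python
# Sketch of a reranker cascade: each stage receives the previous stage's ranking.
class RerankerCascade:
    def __init__(self):
        self.rerankers = []

    def add(self, reranker):
        self.rerankers.append(reranker)
        return self  # return self to allow chaining, as in the example above

    def run(self, docs):
        for rerank in self.rerankers:
            docs = rerank(docs)
        return docs

# Toy usage: first stage sorts, second stage truncates to the top 2.
cascade = RerankerCascade().add(lambda docs: sorted(docs)).add(lambda docs: docs[:2])
print(cascade.run([3, 1, 2]))  # [1, 2]
```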

Issue with using QueryParser to parse TREC topics

We're currently using QueryParser to parse TREC topics, which means that symbols in the topics like parentheses and quotes get interpreted as query operators... this isn't the desired behavior.

Code Comments

Before we get too far into hacking on Anserini, we should probably decide on how we want to deal with comments.

Do we want to do Javadoc? Something else?

Implement Tweets2011/2012 baseline

Get basic indexing/retrieval working on TREC Microblog track data from 2011 to 2014. Let's start with TREC 2011 and TREC 2012 microblog data since the corpus is smaller...

Refactor ClueWeb09b to parallel structure of IndexGov2

@iorixxx Please check out my branch cw09b-refactoring

I've pulled in your edits and started putting classes in the "right" package hierarchy, following the general layout of Lucene's package hierarchy. Can you please:

  • Clean up various references (e.g., pom.xml) to make sure everything still works?
  • Refactor out usage of Args class in IndexClueWeb09b (we should just be using commons-cli), and in general, make the logging, cmdline options, etc. consistent?

Thanks!

Experiment with different analyzers on Gov2

According to @iorixxx

EnglishAnalyzer: PorterStemmer is aggressive, and stop word removal would make certain queries (the wall, the current, the sun, to be or not to be) meaningless.
I think analysis should be minimum.

We should play with different analyzers and evaluate impact on effectiveness.

Indexing all of ClueWeb09

Quite impressively, I was able to index all of ClueWeb09 (English):

nohup sh target/appassembler/bin/IndexClueWeb09b \
  -input /scratch1/collections/ClueWeb09.English/data/ \
  -index lucene-index.cw09.cnt -threads 32 -optimize >& log.cw09.cnt.txt &

Took ~18 hours:

2015-10-16 07:51:04,775 INFO  [main] index.IndexClueWeb09b (IndexClueWeb09b.java:298) - Total 503781465 documents indexed in 18:01:04

Index size (note: no positions):

$ du -h lucene-index.cw09.cnt/
254G    lucene-index.cw09.cnt/

Indentation size

@iorixxx Do you mind if we agree on code indentation being two spaces, just to be consistent?

If so, can you please reformat your code? I'd rather you do it to better retain history for git blame. Please send a pull request.

Thanks!

Refactoring of the index and document

For now, everything is based on WARC-formatted records.
We'll want other record types too, e.g., TREC text, and perhaps others in the future.
It would be better to have a base record type that everything inherits from.

Implement DocumentReranker interface

It seems what we need is a generic document reranking interface: takes a document ranking and spits another document ranking back out. This would implement a standard multi-stage retrieval pipeline: e.g., BM25 (or QL) + 1st stage reranker + 2nd stage reranker, etc.

Lucene query parser cannot parse wildcard queries

Lucene query parser gives the following error if the query has wildcard characters in it:
'*' or '?' not allowed as first character in WildcardQuery
Ex: Cannot parse 'where is the Eldorado Casino in Reno ?': '*' or '?' not allowed as first character in WildcardQuery.
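One possible workaround (a sketch, not necessarily the fix adopted in Anserini) is to escape Lucene's query-syntax characters before handing the topic to the parser:

```python
# Lucene's query-syntax special characters; single & and | are escaped
# conservatively here, even though only && and || are operators in Lucene.
LUCENE_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')

def escape(query: str) -> str:
    # Prefix each special character with a backslash; plain terms pass through.
    return "".join("\\" + c if c in LUCENE_SPECIAL else c for c in query)

print(escape("where is the Eldorado Casino in Reno ?"))
```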

Implement RM3

We probably need some relevance feedback model... RM3 is probably our best bet.

Put example command to dump LTR features into the README.md

specifically

sh target/appassembler/bin/DumpTweetsLtrData -index tweets2011-index/ -topics src/main/resources/topics-and-qrels/topics.microblog2011.txt -output ltr.data.txt -qrels src/main/resources/topics-and-qrels/qrels.microblog2011.txt -ql

Connect NRTS demo with RTS mobile push broker

@aroegies @xeniaqian94 Can you two coordinate on making this happen?

RTS mobile push broker: https://github.com/aroegies/trecrts-tools

  • Decide on a REST API so that the NRTS demo can call the broker
  • Note that the REST API should have the notion of a queryid, user, and token (=password)
  • Modify the NRTS demo so that you pass in a queryid and a query (e.g., "birthday") on the command line, and also an interval, e.g., 1 minute. Every minute, the NRTS demo wakes up, runs the query, and pushes results to the RTS broker

Write an indexer for flat text files

@yb1 You probably want to dump out the cleaned text in a simple text format, something like this:

URL1 document1 ....
URL2 document2 ...

And write an indexer for it. Look at IndexTweets.java and IndexWebCollection.java here:
https://github.com/lintool/Anserini/tree/master/src/main/java/io/anserini/index

The tweets indexer should be fairly easy to understand - it's single-threaded so it's slower. IndexWebCollection is multi-threaded and thus much faster.

I would start with a single-threaded implementation. Call the class IndexPlainText or something like that.

Try out different analyzers on Tweets collection

@xeniaqian94 It would be great for you to get some experience running end-to-end ad hoc experiments, which is a core activity of IR research. Let's start with something simple, like playing with different analyzers - currently, the tweet indexing uses PorterStemFilter. Try removing it and see what the effect is. So:

  • change the analyzer to remove stemming
  • rebuild index
  • run retrieval experiments - report effects on MAP, P@30 (compare with original index).

It would be nice to also know the effects of indexing only English tweets, using same procedure above.

Integrate CACM collection

The CACM collection is small enough that we can include it in the repository... so we can have indexing/retrieval experiments completely integrated in with the system.

Simple LTR implementation for Tweets

@LuchenTan @xeniaqian94 Let's start with a simple two-feature LTR implementation for Tweets:

  • Start with the current tweet search implementation, which has two rerankers, rm3 and cleanup.
  • Your implementation is going to go in a third reranker that you tack on to the end.

Let's build an LTR implementation that has just two features: the RM3 score + the number of hashtags. Inside your new reranker, you already have the RM3 score; use getField on the document to pull out the text, and then just count the number of hashtags. Print out a line like this:

1 325263 0.432 3

Topic 1, docid 325263, RM3 score of 0.432, 3 hashtags. Dump this information for all docs.

You'll need to take this file and join it with the qrels to get the relevance judgments (i.e., write a simple Python script to do it). You'll end up with a file like:

1 325263 0.432 3 1

The final column is the relevance judgment. Now you can run learning to rank using http://sourceforge.net/p/lemur/wiki/RankLib/
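A minimal sketch of such a join script, assuming whitespace-delimited lines and the standard four-column qrels format (topic, iteration, docid, judgment):

```python
# Join per-document feature lines ("topic docid rm3_score num_hashtags")
# with TREC qrels ("topic iteration docid judgment") to append a label.
def load_qrels(lines):
    qrels = {}
    for line in lines:
        topic, _, docid, judgment = line.split()
        qrels[(topic, docid)] = judgment
    return qrels

def join(feature_lines, qrels):
    out = []
    for line in feature_lines:
        topic, docid, *features = line.split()
        label = qrels.get((topic, docid), "0")  # unjudged docs default to 0
        out.append(" ".join([topic, docid, *features, label]))
    return out

# Toy usage, matching the example lines above:
qrels = load_qrels(["1 0 325263 1"])
print(join(["1 325263 0.432 3"], qrels))  # ['1 325263 0.432 3 1']
```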

IndexCounter code broken

@LuchenTan The IndexCounter code doesn't compile, so master is currently broken.

  • Wrong package
  • Uses Args class which has been removed. See IndexGov2 for example of how to use args4j
  • Class has a weird name - can you please rename to DumpDocids or something like that?
  • Can you please change the indentation to 2 spaces instead of tabs? Search online for the Eclipse code formatter; one of the options is indentation. Change it to be consistent with everyone else.

Prepare TST data for baselines by Anserini

From @aroegies

If you tell me what fields are desired from:
https://github.com/trec-kba/streamcorpus/blob/master/if/streamcorpus-v0_3_0.thrift
Open questions:

  • Do we want raw HTML, cleaned up HTML, cleaned up visible only HTML?
  • Do we just want the sentences (e.g. for compatibility with TST eval)?
  • Do we want some combination of the above?
  • Likely don't want to save any of their tagging.
  • What metadata to retain though.

Then I should be able to quickly put together a script to re-crawl, format, and encode in JSON the documents.

Likely we just want to use the entire KBA dataset rather than the TST subset but whatever.

Nondeterminism in documents indexed for Gov2?

Indexing Gov2 on streeling at UMD, I get 24899563 docs.
Luchen reports 24900602 docs indexing on hops.

Weird: some non-determinism in the multi-threading?

Not that important if we can replicate effective results on standard test collections, but worth noting.

Implement indexing for selective search

In selective search, the document collection is divided into different partitions (e.g., by clustering). Write an indexer that takes a cluster mapping (docid to clusterid mapping) and builds the right indexes - i.e., puts the documents in the appropriate partition index.
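The routing step can be sketched as follows (docids and cluster ids are hypothetical):

```python
# Route documents to per-cluster partitions, given a docid -> clusterid mapping.
from collections import defaultdict

def partition(docs, cluster_of):
    partitions = defaultdict(list)
    for docid, text in docs:
        partitions[cluster_of[docid]].append((docid, text))
    return dict(partitions)

# Toy usage: three documents, two clusters.
mapping = {"d1": 0, "d2": 1, "d3": 0}
parts = partition([("d1", "a"), ("d2", "b"), ("d3", "c")], mapping)
print(sorted(parts[0]))  # [('d1', 'a'), ('d3', 'c')]
```

In a real indexer, each partition would then be fed to its own Lucene index writer.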

Try out RM3 on Gov2

Current implementation of RM3 works for Tweets... let's see if it works for Gov2.

Need to build a Gov2 index that stores doc vectors.
