
anserini's Introduction

Anserini


Anserini is a toolkit for reproducible information retrieval research. By building on Lucene, we aim to bridge the gap between academic information retrieval research and the practice of building real-world search applications. Among other goals, our effort aims to be the opposite of this.* Anserini grew out of a reproducibility study of various open-source retrieval engines in 2016 (Lin et al., ECIR 2016). See Yang et al. (SIGIR 2017) and Yang et al. (JDIQ 2018) for overviews.

❗ Anserini was upgraded from JDK 11 to JDK 21 at commit 272565 (2024/04/03), which corresponds to the release of v0.35.0.

πŸ’₯ Try It!

Anserini is packaged in a self-contained fatjar, which also provides the simplest way to get started. Assuming you've already got Java installed, fetch the fatjar:

wget https://repo1.maven.org/maven2/io/anserini/anserini/0.36.1/anserini-0.36.1-fatjar.jar

The following commands generate a SPLADE++ ED run with the dev queries (encoded using ONNX) on the MS MARCO passage corpus:

java -cp anserini-0.36.1-fatjar.jar io.anserini.search.SearchCollection \
  -index msmarco-v1-passage.splade-pp-ed \
  -topics msmarco-v1-passage.dev \
  -encoder SpladePlusPlusEnsembleDistil \
  -output run.msmarco-v1-passage-dev.splade-pp-ed-onnx.txt \
  -impact -pretokenized

To evaluate:

java -cp anserini-0.36.1-fatjar.jar trec_eval -c -M 10 -m recip_rank msmarco-passage.dev-subset run.msmarco-v1-passage-dev.splade-pp-ed-onnx.txt
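Both commands operate on a run file in the standard six-column TREC format (qid, Q0, docid, rank, score, run tag); a minimal illustration, with hypothetical docids and scores:

```python
# Sketch of the six-column TREC run format that SearchCollection writes
# and trec_eval consumes. The docids and scores below are made up.
#   qid  Q0  docid  rank  score  run_tag
run_lines = [
    "1048585 Q0 7187158 1 21.532100 Anserini",
    "1048585 Q0 7187157 2 20.174900 Anserini",
]
for line in run_lines:
    qid, q0, docid, rank, score, tag = line.split()  # exactly six fields
    assert q0 == "Q0"  # the second column is always the literal "Q0"
```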

See detailed instructions for the current fatjar release of Anserini (v0.36.1) to reproduce regression experiments on the MS MARCO V2.1 corpora for TREC 2024 RAG, on MS MARCO V1 Passage, and on BEIR, all directly from the fatjar!

Older instructions

🎬 Installation

Most Anserini features are exposed in the Pyserini Python interface. If you're more comfortable with Python, start there; that said, Anserini is an important building block of Pyserini, so it remains worthwhile to learn how Anserini works.

You'll need Java 21 and Maven 3.9+ to build Anserini. Clone our repo with the --recurse-submodules option to make sure the eval/ submodule also gets cloned (alternatively, use git submodule update --init). Then, build using Maven:

mvn clean package

The tools/ directory, which contains evaluation tools and other scripts, is actually this repo, integrated as a Git submodule (so that it can be shared across related projects). Build as follows (you might see warnings, which are safe to ignore):

cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
cd tools/eval/ndeval && make && cd ../../..

With that, you should be ready to go. The onboarding path for Anserini starts here!

Windows tips

If you are using Windows, please use WSL2 to build Anserini; refer to the WSL2 Installation document to install WSL2 if you haven't already.

Note that on Windows without WSL2, tests may fail due to encoding issues; see #1466. A simple workaround is to skip tests by adding -Dmaven.test.skip=true to the mvn command above. See #1121 for additional discussion of debugging Windows build errors.

βš—οΈ End-to-End Regression Experiments

Anserini is designed to support end-to-end experiments on various standard IR test collections out of the box. Each of these end-to-end regressions starts from the raw corpus, builds the necessary index, performs retrieval runs, and generates evaluation results. See individual pages for details.

MS MARCO V1 Passage Regressions

| | dev | DL19 | DL20 |
|---|---|---|---|
| Unsupervised Sparse | | | |
| Lucene BoW baselines | πŸ”‘ | πŸ”‘ | πŸ”‘ |
| Quantized BM25 | πŸ”‘ | πŸ”‘ | πŸ”‘ |
| WordPiece baselines (pre-tokenized) | πŸ”‘ | πŸ”‘ | πŸ”‘ |
| WordPiece baselines (Huggingface) | πŸ”‘ | πŸ”‘ | πŸ”‘ |
| WordPiece + Lucene BoW baselines | πŸ”‘ | πŸ”‘ | πŸ”‘ |
| doc2query | πŸ”‘ | | |
| doc2query-T5 | πŸ”‘ | πŸ”‘ | πŸ”‘ |
| Learned Sparse (uniCOIL family) | | | |
| uniCOIL noexp | πŸ«™ | πŸ«™ | πŸ«™ |
| uniCOIL with doc2query-T5 | πŸ«™ | πŸ«™ | πŸ«™ |
| uniCOIL with TILDE | πŸ«™ | | |
| Learned Sparse (other) | | | |
| DeepImpact | πŸ«™ | | |
| SPLADEv2 | πŸ«™ | | |
| SPLADE++ CoCondenser-EnsembleDistil | πŸ«™πŸ…ΎοΈ | πŸ«™πŸ…ΎοΈ | πŸ«™πŸ…ΎοΈ |
| SPLADE++ CoCondenser-SelfDistil | πŸ«™πŸ…ΎοΈ | πŸ«™πŸ…ΎοΈ | πŸ«™πŸ…ΎοΈ |
| Learned Dense (HNSW indexes) | | | |
| cosDPR-distil | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| BGE-base-en-v1.5 | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| OpenAI Ada2 | full:πŸ«™ int8:πŸ«™ | full:πŸ«™ int8:πŸ«™ | full:πŸ«™ int8:πŸ«™ |
| Cohere English v3.0 | full:πŸ«™ int8:πŸ«™ | full:πŸ«™ int8:πŸ«™ | full:πŸ«™ int8:πŸ«™ |
| Learned Dense (flat indexes) | | | |
| cosDPR-distil | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| BGE-base-en-v1.5 | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| OpenAI Ada2 | full:πŸ«™ int8:πŸ«™ | full:πŸ«™ int8:πŸ«™ | full:πŸ«™ int8:πŸ«™ |
| Cohere English v3.0 | full:πŸ«™ int8:πŸ«™ | full:πŸ«™ int8:πŸ«™ | full:πŸ«™ int8:πŸ«™ |
| Learned Dense (inverted; experimental) | | | |
| cosDPR-distil w/ "fake words" | πŸ«™ | πŸ«™ | πŸ«™ |
| cosDPR-distil w/ "LexLSH" | πŸ«™ | πŸ«™ | πŸ«™ |

Key:

  • πŸ”‘ = keyword queries
  • "full" = full 32-bit floating precision
  • "int8" = quantized 8-bit precision
  • πŸ«™ = cached queries, πŸ…ΎοΈ = query encoding with ONNX

Available Corpora for Download

| Corpora | Size | Checksum |
|---|---|---|
| Quantized BM25 | 1.2 GB | 0a623e2c97ac6b7e814bf1323a97b435 |
| uniCOIL (noexp) | 2.7 GB | f17ddd8c7c00ff121c3c3b147d2e17d8 |
| uniCOIL (d2q-T5) | 3.4 GB | 78eef752c78c8691f7d61600ceed306f |
| uniCOIL (TILDE) | 3.9 GB | 12a9c289d94e32fd63a7d39c9677d75c |
| DeepImpact | 3.6 GB | 73843885b503af3c8b3ee62e5f5a9900 |
| SPLADEv2 | 9.9 GB | b5d126f5d9a8e1b3ef3f5cb0ba651725 |
| SPLADE++ CoCondenser-EnsembleDistil | 4.2 GB | e489133bdc54ee1e7c62a32aa582bc77 |
| SPLADE++ CoCondenser-SelfDistil | 4.8 GB | cb7e264222f2bf2221dd2c9d28190be1 |
| cosDPR-distil | 57 GB | e20ffbc8b5e7f760af31298aefeaebbd |
| BGE-base-en-v1.5 | 59 GB | 353d2c9e72e858897ad479cca4ea0db1 |
| OpenAI-ada2 | 109 GB | a4d843d522ff3a3af7edbee789a63402 |
| Cohere embed-english-v3.0 | 38 GB | 06a6e38a0522850c6aa504db7b2617f5 |
MS MARCO V1 Document Regressions

| | dev | DL19 | DL20 |
|---|---|---|---|
| Unsupervised Lexical, Complete Doc* | | | |
| Lucene BoW baselines | + | + | + |
| WordPiece baselines (pre-tokenized) | + | + | + |
| WordPiece baselines (Huggingface tokenizer) | + | + | + |
| WordPiece + Lucene BoW baselines | + | + | + |
| doc2query-T5 | + | + | + |
| Unsupervised Lexical, Segmented Doc* | | | |
| Lucene BoW baselines | + | + | + |
| WordPiece baselines (pre-tokenized) | + | + | + |
| WordPiece + Lucene BoW baselines | + | + | + |
| doc2query-T5 | + | + | + |
| Learned Sparse Lexical | | | |
| uniCOIL noexp | βœ“ | βœ“ | βœ“ |
| uniCOIL with doc2query-T5 | βœ“ | βœ“ | βœ“ |

Available Corpora for Download

| Corpora | Size | Checksum |
|---|---|---|
| MS MARCO V1 doc: uniCOIL (noexp) | 11 GB | 11b226e1cacd9c8ae0a660fd14cdd710 |
| MS MARCO V1 doc: uniCOIL (d2q-T5) | 19 GB | 6a00e2c0c375cb1e52c83ae5ac377ebb |
MS MARCO V2 Passage Regressions

| | dev | DL21 | DL22 | DL23 |
|---|---|---|---|---|
| Unsupervised Lexical, Original Corpus | | | | |
| baselines | + | + | + | + |
| doc2query-T5 | + | + | + | + |
| Unsupervised Lexical, Augmented Corpus | | | | |
| baselines | + | + | + | + |
| doc2query-T5 | + | + | + | + |
| Learned Sparse Lexical | | | | |
| uniCOIL noexp zero-shot | βœ“ | βœ“ | βœ“ | βœ“ |
| uniCOIL with doc2query-T5 zero-shot | βœ“ | βœ“ | βœ“ | βœ“ |
| SPLADE++ CoCondenser-EnsembleDistil (cached queries) | βœ“ | βœ“ | βœ“ | βœ“ |
| SPLADE++ CoCondenser-EnsembleDistil (ONNX) | βœ“ | βœ“ | βœ“ | βœ“ |
| SPLADE++ CoCondenser-SelfDistil (cached queries) | βœ“ | βœ“ | βœ“ | βœ“ |
| SPLADE++ CoCondenser-SelfDistil (ONNX) | βœ“ | βœ“ | βœ“ | βœ“ |

Available Corpora for Download

| Corpora | Size | Checksum |
|---|---|---|
| uniCOIL (noexp) | 24 GB | d9cc1ed3049746e68a2c91bf90e5212d |
| uniCOIL (d2q-T5) | 41 GB | 1949a00bfd5e1f1a230a04bbc1f01539 |
| SPLADE++ CoCondenser-EnsembleDistil | 66 GB | 2cdb2adc259b8fa6caf666b20ebdc0e8 |
| SPLADE++ CoCondenser-SelfDistil | 76 GB | 061930dd615c7c807323ea7fc7957877 |
MS MARCO V2 Document Regressions

| | dev | DL21 | DL22 | DL23 |
|---|---|---|---|---|
| Unsupervised Lexical, Complete Doc | | | | |
| baselines | + | + | + | + |
| doc2query-T5 | + | + | + | + |
| Unsupervised Lexical, Segmented Doc | | | | |
| baselines | + | + | + | + |
| doc2query-T5 | + | + | + | + |
| Learned Sparse Lexical | | | | |
| uniCOIL noexp zero-shot | βœ“ | βœ“ | βœ“ | βœ“ |
| uniCOIL with doc2query-T5 zero-shot | βœ“ | βœ“ | βœ“ | βœ“ |

Available Corpora for Download

| Corpora | Size | Checksum |
|---|---|---|
| MS MARCO V2 doc: uniCOIL (noexp) | 55 GB | 97ba262c497164de1054f357caea0c63 |
| MS MARCO V2 doc: uniCOIL (d2q-T5) | 72 GB | c5639748c2cbad0152e10b0ebde3b804 |
MS MARCO V2.1 Document Regressions

The MS MARCO V2.1 corpora were derived from the V2 corpora for the TREC 2024 RAG Track. The experiments below use topics and qrels that were originally targeted at the V2 corpora but have been "projected" over to the V2.1 corpora.

| | dev | DL21 | DL22 | DL23 | RAGgy dev |
|---|---|---|---|---|---|
| Unsupervised Lexical, Complete Doc | | | | | |
| baselines | + | + | + | + | + |
| Unsupervised Lexical, Segmented Doc | | | | | |
| baselines | + | + | + | + | + |
BEIR (v1.0.0) Regressions

Key:

  • F1 = "flat" baseline (Lucene analyzer), keyword queries (πŸ”‘)
  • F2 = "flat" baseline (pre-tokenized with bert-base-uncased tokenizer), keyword queries (πŸ”‘)
  • MF = "multifield" baseline (Lucene analyzer), keyword queries (πŸ”‘)
  • U1 = uniCOIL (noexp), cached queries (πŸ«™)
  • S1 = SPLADE++ CoCondenser-EnsembleDistil: cached queries (πŸ«™), ONNX (πŸ…ΎοΈ)
  • BGE (flat) = BGE-base-en-v1.5 (flat indexes)
    • original (float32) indexes: cached queries (πŸ«™), ONNX (πŸ…ΎοΈ)
    • quantized (int8) indexes: cached queries (πŸ«™), ONNX (πŸ…ΎοΈ)
  • BGE (HNSW) = BGE-base-en-v1.5 (HNSW indexes)
    • original (float32) indexes: cached queries (πŸ«™), ONNX (πŸ…ΎοΈ)
    • quantized (int8) indexes: cached queries (πŸ«™), ONNX (πŸ…ΎοΈ)

See instructions below the table for how to reproduce results for a model on all BEIR corpora "in one go".

| Corpus | F1 | F2 | MF | U1 | S1 | BGE (flat) | BGE (HNSW) |
|---|---|---|---|---|---|---|---|
| TREC-COVID | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| BioASQ | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| NFCorpus | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| NQ | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| HotpotQA | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| FiQA-2018 | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| Signal-1M(RT) | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| TREC-NEWS | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| Robust04 | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| ArguAna | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| Touche2020 | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Android | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-English | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Gaming | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Gis | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Mathematica | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Physics | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Programmers | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Stats | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Tex | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Unix | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Webmasters | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| CQADupStack-Wordpress | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| Quora | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| DBPedia | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| SCIDOCS | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| FEVER | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| Climate-FEVER | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |
| SciFact | πŸ”‘ | πŸ”‘ | πŸ”‘ | πŸ«™ | πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ | full:πŸ«™πŸ…ΎοΈ int8:πŸ«™πŸ…ΎοΈ |

To reproduce the SPLADE++ CoCondenser-EnsembleDistil results, start by downloading the collection:

wget https://rgw.cs.uwaterloo.ca/pyserini/data/beir-v1.0.0-splade-pp-ed.tar -P collections/
tar xvf collections/beir-v1.0.0-splade-pp-ed.tar -C collections/

The tarball is 42 GB and has MD5 checksum 9c7de5b444a788c9e74c340bf833173b. Once you've unpacked the data, the following commands will loop over all BEIR corpora and run the regressions:

MODEL="splade-pp-ed"; CORPORA=(trec-covid bioasq nfcorpus nq hotpotqa fiqa signal1m trec-news robust04 arguana webis-touche2020 cqadupstack-android cqadupstack-english cqadupstack-gaming cqadupstack-gis cqadupstack-mathematica cqadupstack-physics cqadupstack-programmers cqadupstack-stats cqadupstack-tex cqadupstack-unix cqadupstack-webmasters cqadupstack-wordpress quora dbpedia-entity scidocs fever climate-fever scifact); for c in "${CORPORA[@]}"
do
    echo "Running $c..."
    python src/main/python/run_regression.py --index --verify --search --regression beir-v1.0.0-${c}-${MODEL} > logs/log.beir-v1.0.0-${c}-${MODEL} 2>&1
done

You can verify the results by examining the log files in logs/.

For the other models, modify the above commands as follows:

| Key | Corpus | Checksum | MODEL |
|---|---|---|---|
| F1 | corpus | faefd5281b662c72ce03d22021e4ff6b | flat |
| F2 | corpus-wp | 3cf8f3dcdcadd49362965dd4466e6ff2 | flat-wp |
| MF | corpus | faefd5281b662c72ce03d22021e4ff6b | multifield |
| U1 | unicoil-noexp | 4fd04d2af816a6637fc12922cccc8a83 | unicoil-noexp |
| S1 | splade-pp-ed | 9c7de5b444a788c9e74c340bf833173b | splade-pp-ed |
| BGE | bge-base-en-v1.5 | e4e8324ba3da3b46e715297407a24f00 | bge-base-en-v1.5-hnsw |

The "Corpus" above should be substituted into the full file name beir-v1.0.0-${corpus}.tar, e.g., beir-v1.0.0-bge-base-en-v1.5.tar.
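A small helper makes the naming convention explicit (the function name here is ours, for illustration only):

```python
# Build the download file name from the "Corpus" column of the table above.
def beir_tarball(corpus: str) -> str:
    return f"beir-v1.0.0-{corpus}.tar"

print(beir_tarball("bge-base-en-v1.5"))  # beir-v1.0.0-bge-base-en-v1.5.tar
```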

Cross-lingual and Multi-lingual Regressions

Other Regressions

πŸ“ƒ Additional Documentation

The experiments described below are not associated with rigorous end-to-end regression testing and thus provide a lower standard of reproducibility. For the most part, manual copying and pasting of commands into a shell is required to reproduce our results.

MS MARCO V1

MS MARCO V2

TREC-COVID and CORD-19

Other Experiments and Features

πŸ™‹ How Can I Contribute?

If you've found Anserini to be helpful, we have a simple request: contribute back. In the course of reproducing baseline results on standard test collections, please let us know if you're successful by sending us a pull request with a simple note, like what appears at the bottom of the page for Disks 4 & 5. Reproducibility is important to us, and we'd like to know about successes as well as failures. Since the regression documentation is auto-generated, pull requests should be sent against the raw templates; the regression documentation can then be regenerated using the bin/build.sh script. In turn, you'll be recognized as a contributor.

Beyond that, there are always open issues we would appreciate help on!

πŸ“œοΈ Release History

older... (and historic notes)

πŸ“œοΈ Historical Notes

  • Anserini was upgraded to Lucene 9.3 at commit 272565 (8/2/2022): this upgrade created backward compatibility issues; see #1952. Anserini will automatically detect Lucene 8 indexes and disable consistent tie-breaking to avoid runtime errors. However, Lucene 9 code running on Lucene 8 indexes may give slightly different results than Lucene 8 code running on Lucene 8 indexes, and Lucene 8 code will not run on Lucene 9 indexes at all. Pyserini has also been upgraded, and the same caveats apply.
  • Anserini was upgraded to Java 11 from Java 8 at commit 17b702d (7/11/2019). Maven 3.3+ is also required.
  • Anserini was upgraded to Lucene 8.0 as of commit 75e36f9 (6/12/2019); prior to that, the toolkit used Lucene 7.6. Based on preliminary experiments, query evaluation latency is much improved in Lucene 8. As a result of this upgrade, results of all regressions have changed slightly. To reproduce old results from Lucene 7.6, use v0.5.1.

✨ References

πŸ™ Acknowledgments

This research is supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada. Previous support came from the U.S. National Science Foundation under IIS-1423002 and CNS-1405688. Any opinions, findings, and conclusions or recommendations expressed do not necessarily reflect the views of the sponsors.

anserini's People

Contributors

16bitnarwhal, adamyy, arthurchen189, borislin, chriskamphuis, dependabot[bot], edwinzhng, emmileaf, iorixxx, jasper-xian, jimmy0017, jmmackenzie, justram, kytabyte, lintool, luchentan, lukuang, mofetoluwa, mxueguang, nikhilro, peilin-yang, rodrigonogueira4, ronakice, rosequ, shaneding, stephaniewhoo, toluclassics, tteofili, victor0118, yuki617


anserini's Issues

Create RerankerCascade abstraction

It would make sense to create a RerankerCascade abstraction for running a sequence of rerankers. Something like:

RerankerCascade cascade = new RerankerCascade(context).add(foo).add(bar);
cascade.run(docs);
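The idea can be sketched in Python (a hypothetical API mirroring the chaining in the snippet above, not the actual Anserini implementation):

```python
# Sketch of a reranker cascade: each stage receives the previous stage's ranking.
class RerankerCascade:
    def __init__(self):
        self.rerankers = []

    def add(self, reranker):
        self.rerankers.append(reranker)
        return self  # return self to allow chaining, as in the example above

    def run(self, docs):
        for rerank in self.rerankers:
            docs = rerank(docs)
        return docs

# Toy usage: first stage sorts, second stage truncates to the top 2.
cascade = RerankerCascade().add(lambda docs: sorted(docs)).add(lambda docs: docs[:2])
print(cascade.run([3, 1, 2]))  # [1, 2]
```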

Issue with using QueryParser to parse TREC topics

We're currently using QueryParser to parse TREC topics, which means that symbols in the topics like parentheses and quotes get interpreted as query operators... this isn't the desired behavior.

Code Comments

Before we get too far into hacking on Anserini, we should probably decide on how we want to deal with comments.

Do we want to do Javadoc? Something else?

Implement Tweets2011/2012 baseline

Get basic indexing/retrieval working on TREC Microblog track data from 2011 to 2014. Let's start with TREC 2011 and TREC 2012 microblog data since the corpus is smaller...

Refactor ClueWeb09b to parallel structure of IndexGov2

@iorixxx Please check out my branch cw09b-refactoring

I've pulled in your edits and started putting classes in the "right" package hierarchy, following the general layout of Lucene's package hierarchy. Can you please:

  • Clean up various references (e.g., pom.xml) to make sure everything still works?
  • Refactor out usage of Args class in IndexClueWeb09b (we should just be using commons-cli), and in general, make the logging, cmdline options, etc. consistent?

Thanks!

Experiment with different analyzers on Gov2

According to @iorixxx

EnglishAnalyzer: PorterStemmer is aggressive, and stop word removal would make certain queries (the wall, the current, the sun, to be or not to be) meaningless.
I think analysis should be minimum.

We should play with different analyzers and evaluate impact on effectiveness.

Indexing all of ClueWeb09

Quite impressively, I was able to index all of ClueWeb09 (English):

nohup sh target/appassembler/bin/IndexClueWeb09b \
  -input /scratch1/collections/ClueWeb09.English/data/ \
  -index lucene-index.cw09.cnt -threads 32 -optimize >& log.cw09.cnt.txt &

Took ~18 hours:

2015-10-16 07:51:04,775 INFO  [main] index.IndexClueWeb09b (IndexClueWeb09b.java:298) - Total 503781465 documents indexed in 18:01:04

Index size (note: no positions):

$ du -h lucene-index.cw09.cnt/
254G    lucene-index.cw09.cnt/

Indentation size

@iorixxx Do you mind if we agree on code indentation being two spaces, just to be consistent?

If so, can you please reformat your code? I'd rather you do it to better retain history for git blame. Please send a pull request.

Thanks!

Refactoring of the index and document

For now, everything is based on WARC-formatted records.
We'll want other record types too, e.g., TREC text, and perhaps others in the future.
It would be better to have a base record type that everything inherits from.

Implement DocumentReranker interface

It seems what we need is a generic document reranking interface: takes a document ranking and spits another document ranking back out. This would implement a standard multi-stage retrieval pipeline: e.g., BM25 (or QL) + 1st stage reranker + 2nd stage reranker, etc.

Lucene query parser cannot parse wildcard queries

Lucene query parser gives the following error if the query has wildcard characters in it:
'*' or '?' not allowed as first character in WildcardQuery
Ex: Cannot parse 'where is the Eldorado Casino in Reno ?': '*' or '?' not allowed as first character in WildcardQuery.
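One possible workaround (a sketch, not necessarily the fix adopted in Anserini) is to escape Lucene's query-syntax characters before handing the topic to the parser:

```python
# Lucene's query-syntax special characters; single & and | are escaped
# conservatively here, even though only && and || are operators in Lucene.
LUCENE_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')

def escape(query: str) -> str:
    # Prefix each special character with a backslash; plain terms pass through.
    return "".join("\\" + c if c in LUCENE_SPECIAL else c for c in query)

print(escape("where is the Eldorado Casino in Reno ?"))
```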

Implement RM3

We probably need some relevance feedback model... RM3 is probably our best bet.

Put example command to dump LTR features into the README.md

specifically

sh target/appassembler/bin/DumpTweetsLtrData -index tweets2011-index/ -topics src/main/resources/topics-and-qrels/topics.microblog2011.txt -output ltr.data.txt -qrels src/main/resources/topics-and-qrels/qrels.microblog2011.txt -ql

Connect NRTS demo with RTS mobile push broker

@aroegies @xeniaqian94 Can you two coordinate on making this happen?

RTS mobile push broker: https://github.com/aroegies/trecrts-tools

  • Decide on a REST API so that the NRTS demo can call the broker
  • Note that the REST API should have the notion of a queryid, user, and token (=password)
  • Modify the NRTS demo so that you pass in a queryid and a query (e.g., "birthday") on the command line, and also an interval, e.g., 1 minute. Every minute, the NRTS demo wakes up, runs the query, and pushes results to the RTS broker

Write an indexer for flat text files

@yb1 You probably want to dump out the cleaned text in a simple text format, something like this:

URL1 document1 ....
URL2 document2 ...

And write an indexer for it. Look at IndexTweets.java and IndexWebCollection.java here:
https://github.com/lintool/Anserini/tree/master/src/main/java/io/anserini/index

The tweets indexer should be fairly easy to understand - it's single-threaded so it's slower. IndexWebCollection is multi-threaded and thus much faster.

I would start with a single-threaded implementation. Call the class IndexPlainText or something like that.

Try out different analyzers on Tweets collection

@xeniaqian94 It would be great for you to get some experience running end-to-end ad hoc experiments, which is a core activity of IR research. Let's start with something simple, like playing with different analyzers - currently, the tweet indexing uses PorterStemFilter. Try removing it and see what the effect is. So:

  • change the analyzer to remove stemming
  • rebuild index
  • run retrieval experiments - report effects on MAP, P@30 (compare with original index).

It would be nice to also know the effects of indexing only English tweets, using same procedure above.

Integrate CACM collection

The CACM collection is small enough that we can include it in the repository... so we can have indexing/retrieval experiments completely integrated in with the system.

Simple LTR implementation for Tweets

@LuchenTan @xeniaqian94 Let's start with a simple two-feature LTR implementation for Tweets:

  • Start with the current tweet search implementation, which has two rerankers, rm3 and cleanup.
  • Your implementation is going to go in a third reranker that you tack on to the end.

Let's build an LTR implementation that has just two features: the RM3 score + the number of hashtags. Inside your new reranker, you already have the RM3 score; use getField on the document to pull out the text, and then just count the number of hashtags. Print out a line like this:

1 325263 0.432 3

Topic 1, docid 325263, RM3 score of 0.432, 3 hashtags. Dump this information for all docs.

You'll need to take this file and join it with the qrels to get the relevance judgments (i.e., write a simple Python script to do it). You'll end up with a file like:

1 325263 0.432 3 1

The final column is the relevance judgment. Now you can run learning to rank using http://sourceforge.net/p/lemur/wiki/RankLib/
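A minimal sketch of such a join script, assuming whitespace-delimited lines and the standard four-column qrels format (topic, iteration, docid, judgment):

```python
# Join per-document feature lines ("topic docid rm3_score num_hashtags")
# with TREC qrels ("topic iteration docid judgment") to append a label.
def load_qrels(lines):
    qrels = {}
    for line in lines:
        topic, _, docid, judgment = line.split()
        qrels[(topic, docid)] = judgment
    return qrels

def join(feature_lines, qrels):
    out = []
    for line in feature_lines:
        topic, docid, *features = line.split()
        label = qrels.get((topic, docid), "0")  # unjudged docs default to 0
        out.append(" ".join([topic, docid, *features, label]))
    return out

# Toy usage, matching the example lines above:
qrels = load_qrels(["1 0 325263 1"])
print(join(["1 325263 0.432 3"], qrels))  # ['1 325263 0.432 3 1']
```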

IndexCounter code broken

@LuchenTan The IndexCounter code doesn't compile, so master is currently broken.

  • Wrong package
  • Uses Args class which has been removed. See IndexGov2 for example of how to use args4j
  • Class has a weird name - can you please rename to DumpDocids or something like that?
  • Can you please change the indentation to 2 spaces instead of tabs? Search online for the Eclipse code formatter; one of the options is indentation. Change it to be consistent with everyone else.

Prepare TST data for baselines by Anserini

From @aroegies

If you tell me what fields are desired from:
https://github.com/trec-kba/streamcorpus/blob/master/if/streamcorpus-v0_3_0.thrift
Open questions:

  • Do we want raw HTML, cleaned up HTML, cleaned up visible only HTML?
  • Do we just want the sentences (e.g. for compatibility with TST eval)?
  • Do we want some combination of the above?
  • Likely don't want to save any of their tagging.
  • What metadata to retain though.

Then I should be able to quickly put together a script to re-crawl, format, and encode in JSON the documents.

Likely we just want to use the entire KBA dataset rather than the TST subset but whatever.

Nondeterminism in documents indexed for Gov2?

Indexing Gov2 on streeling at UMD, I get 24899563 docs.
Luchen reports 24900602 docs indexing on hops.

Weird: some non-determinism in the multi-threading?

Not that important if we can replicate effective results on standard test collections, but worth noting.

Implement indexing for selective search

In selective search, the document collection is divided into different partitions (e.g., by clustering). Write an indexer that takes a cluster mapping (docid to clusterid mapping) and builds the right indexes - i.e., puts the documents in the appropriate partition index.
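The routing step can be sketched as follows (docids and cluster ids are hypothetical):

```python
# Route documents to per-cluster partitions, given a docid -> clusterid mapping.
from collections import defaultdict

def partition(docs, cluster_of):
    partitions = defaultdict(list)
    for docid, text in docs:
        partitions[cluster_of[docid]].append((docid, text))
    return dict(partitions)

# Toy usage: three documents, two clusters.
mapping = {"d1": 0, "d2": 1, "d3": 0}
parts = partition([("d1", "a"), ("d2", "b"), ("d3", "c")], mapping)
print(sorted(parts[0]))  # [('d1', 'a'), ('d3', 'c')]
```

In a real indexer, each partition would then be fed to its own Lucene index writer.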

Try out RM3 on Gov2

Current implementation of RM3 works for Tweets... let's see if it works for Gov2.

Need to build a Gov2 index that stores doc vectors.
