terrier-micro's Introduction

Terrier Micro

This project provides a lightweight implementation of some query processing strategies built on top of Terrier 5. It re-implements the query processing pipeline of the Terrier search engine, removing all unnecessary features such as document score modifiers, multiple weighting models, etc.

If you use this package to conduct research or experimentation, whether it be for a research paper, dissertation, article, poster, presentation, or documentation, please cite the following paper:

@article{fnt,
    author = {Tonellotto, Nicola and Macdonald, Craig and Ounis, Iadh},
    issn = {1554-0669},
    journal = {Foundations and Trends in Information Retrieval},
    number = {4--5},
    pages = {319--492},
    title = {Efficient Query Processing for Scalable Web Search},
    volume = {12},
    year = {2018}
}

This package is free software distributed under the GNU Lesser General Public License.

Pre-requisites

Elias-Fano compression for Terrier is needed to run the tests, or if you plan to use it in your experiments. It is not otherwise required to use the Terrier Micro package.

To install the Elias-Fano compression for Terrier package (version 1.5.1) on your local machine, please run the following commands.

git clone https://github.com/tonellotto/terrier-ef
cd terrier-ef
git checkout 1.5.1
mvn install appassembler:assemble

Usage

If not already available, e.g. from Maven Central, you should git clone and install Terrier Micro (version 1.5.1):

git clone https://github.com/tonellotto/terrier-micro
cd terrier-micro
git checkout 1.5.1
mvn install appassembler:assemble

The main script to perform batch query processing is the retrieve tool.

If you want to use all available processors on your machine to perform batch query processing, use the parallel retrieve tool.

Two other scripts are provided, to support advanced query processing strategies: the ms-generate and bmw-generate tools.

Python

The python folder of this repo holds static copies of notebooks for learning to use Terrier Micro (Java). These notebooks are synced by hand with the notebooks in Colab. For convenience, a small pre-built index is available for download.

  • Terrier Micro demo on Robust 2004: local and colab notebooks.

Credits

Developed by Nicola Tonellotto, ISTI-CNR. Contributions by Craig Macdonald, University of Glasgow, and Matteo Catena, ISTI-CNR.

terrier-micro's Issues

TopQueue copy() is inefficient

Using object serialization just to make a copy of a queue seems like a big overhead. The easiest thing would be for fastutil's ObjectArrayPriorityQueue and ObjectHeapPriorityQueue to be cloneable (or to make subclasses that are). In the meantime, I have subclassed these to make cloneable versions.

E.g.

    static class MyObjectHeapPriorityQueue<K> extends ObjectHeapPriorityQueue<K> implements Cloneable {
        public MyObjectHeapPriorityQueue(int capacity, Comparator<? super K> c) {
            super(capacity, c);
        }

        public MyObjectHeapPriorityQueue(Comparator<? super K> c) {
            super(c);
        }

        public MyObjectHeapPriorityQueue() {
            super();
        }

        public MyObjectHeapPriorityQueue(int capacity) {
            super(capacity);
        }

        @Override
        @SuppressWarnings("unchecked")
        public Object clone() {
            try {
                MyObjectHeapPriorityQueue<K> rtr = (MyObjectHeapPriorityQueue<K>) super.clone();
                rtr.heap = this.heap.clone(); // copy the backing array so the two queues do not share it
                rtr.size = this.size;
                rtr.c = this.c; // a comparator should be stateless, no need to clone
                return rtr;
            } catch (CloneNotSupportedException e) {
                // should not happen: this class implements Cloneable
                throw new InternalError(e.toString());
            }
        }
    }
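
To give a feel for the difference, here is a minimal sketch that copies a queue both ways. It uses java.util.PriorityQueue as a stand-in for the fastutil queues (an assumption, made so the example runs without extra dependencies), and the class and method names are hypothetical, not part of Terrier Micro.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.PriorityQueue;

// Hypothetical illustration: java.util.PriorityQueue stands in for the fastutil queues.
final class QueueCopySketch {

    // Copy via a serialization round trip: the whole queue is written to a
    // byte buffer and then read back as a new object graph.
    @SuppressWarnings("unchecked")
    static <T> PriorityQueue<T> copyBySerialization(PriorityQueue<T> queue) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(queue);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (PriorityQueue<T>) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Direct copy: the copy constructor duplicates the backing array and
    // keeps the same element references (a shallow copy).
    static <T> PriorityQueue<T> copyDirectly(PriorityQueue<T> queue) {
        return new PriorityQueue<>(queue);
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> queue = new PriorityQueue<>();
        queue.add(3);
        queue.add(1);
        queue.add(2);

        PriorityQueue<Integer> viaBytes = copyBySerialization(queue);
        PriorityQueue<Integer> direct = copyDirectly(queue);

        // Removing from the copies leaves the original untouched.
        viaBytes.poll();
        direct.poll();
        System.out.println(queue.size() + " " + viaBytes.size() + " " + direct.size()); // prints "3 2 2"
    }
}
```

Both copies are independent of the original queue. The direct copy skips the byte buffer and the per-element (de)serialization, and it shares element references with the original, much like heap.clone() in the subclass above.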

Can a micro Manager know which micro Matching class to use?

E.g. if we are loading MaxScoreManager, why do I also have to specify -Dmicro.matching=it.cnr.isti.hpclab.matching.MaxScore?

Why not make the simpler constructor use the superclass constructor like this:

	public MaxScoreManager(final Index index)
	{
		super(index, MaxScore.class);
	}
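
A self-contained sketch of that idea follows. Index, MatchingAlgorithm, MaxScore and Manager here are hypothetical stand-ins for the real Terrier Micro classes, and the two-argument superclass constructor is assumed from the snippet above rather than checked against the code base.

```java
// Hypothetical stand-ins for the real Terrier Micro classes, for illustration only.
class Index {}

interface MatchingAlgorithm {}

class MaxScore implements MatchingAlgorithm {}

abstract class Manager {
    protected final Index index;
    protected final Class<? extends MatchingAlgorithm> matchingClass;

    // Assumed two-argument constructor, mirroring super(index, MaxScore.class).
    protected Manager(Index index, Class<? extends MatchingAlgorithm> matchingClass) {
        this.index = index;
        this.matchingClass = matchingClass;
    }

    Class<? extends MatchingAlgorithm> getMatchingClass() {
        return matchingClass;
    }
}

class MaxScoreManager extends Manager {
    // The manager supplies its own matching class, so the caller does not
    // have to name it separately.
    MaxScoreManager(final Index index) {
        super(index, MaxScore.class);
    }
}

final class ManagerSketch {
    public static void main(String[] args) {
        Manager manager = new MaxScoreManager(new Index());
        System.out.println(manager.getMatchingClass().getSimpleName()); // prints "MaxScore"
    }
}
```

With this shape, choosing the manager is enough to fix the matching class. A property such as micro.matching could still act as an override if that flexibility is wanted.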

Generates log4j warnings when used from the command line in Terrier core

 bin/terrier micro-ms-generator -I $PWD/var/index/ef.properties -w BM25
...
16:37:51.762 [main] INFO  i.c.isti.hpclab.maxscore.MSGenerator - Started MSGenerator with parallelism 1 (out of 3 max parallelism available)
16:37:51.765 [main] WARN  i.c.isti.hpclab.maxscore.MSGenerator - Multi-threaded MaxScore generation is experimental - caution advised due to threads competing for available memory! YMMV.
16:37:51.775 [main] INFO  i.c.isti.hpclab.maxscore.MSGenerator - Input index contains 1298 terms
log4j:WARN No appenders could be found for logger (it.cnr.isti.hpclab.ef.structures.EFDocumentIndex).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
16:37:51.958 [main] INFO  i.c.isti.hpclab.maxscore.MSGenerator - Parallel maxscore computation completed after 0 seconds
16:37:51.966 [main] INFO  i.c.isti.hpclab.maxscore.MSGenerator - Sequential writing completed after 0 seconds
16:37:51.975 [main] INFO  i.c.isti.hpclab.maxscore.MSGenerator - Multi-threaded MaxScore generation
