BM25 Java Implementation

BM25 (Best Matching 25) is a ranking function used by search engines to rank matching documents according to their relevance to a given search query.
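For reference, a commonly used formulation of the BM25 score of a document D for a query Q with terms q_1, ..., q_n is given below. This is a textbook form with a non-negative IDF variant. The README does not state which variant or which values of k_1 and b this library uses, so treat those details as assumptions.

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\,
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
\qquad
\mathrm{IDF}(q_i) = \ln\!\left(1 + \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right)
```

Here f(q_i, D) is the number of times q_i occurs in D, |D| is the length of D in tokens, avgdl is the average document length in the corpus, N is the number of documents, and n(q_i) is the number of documents containing q_i. The parameters k_1 (term-frequency saturation) and b (length normalization) are tunable; values such as k_1 between 1.2 and 2.0 and b = 0.75 are typical.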

Simple usage

import java.util.List;
import java.util.Map;

// Each string is one document; a document is identified by its index in the list.
List<String> corpus = List.of(
    "I love programming",
    "Java is my favorite programming language",
    "I enjoy writing code in Java",
    "Java is another popular programming language",
    "I find programming fascinating",
    "I love Java",
    "I prefer Java over Python"
);

// Index the corpus once, then run queries against it.
BM25 bm25 = new BM25(corpus);

// Each entry maps a document index to its BM25 score, highest score first.
List<Map.Entry<Integer, Double>> results = bm25.search("I love java");

for (Map.Entry<Integer, Double> entry : results) {
    System.out.println("Sentence " + entry.getKey() + " : Score = " + entry.getValue() + " - [" + corpus.get(entry.getKey()) + "]");
}

Sentence 5 : Score = 2.286729869084079 - [I love Java]
Sentence 0 : Score = 1.8387268317084793 - [I love programming]
Sentence 6 : Score = 0.7294916714788526 - [I prefer Java over Python]
Sentence 2 : Score = 0.6674701123652661 - [I enjoy writing code in Java]
Sentence 4 : Score = 0.40211004330297734 - [I find programming fascinating]
Sentence 1 : Score = 0.33373505618263305 - [Java is my favorite programming language]
Sentence 3 : Score = 0.33373505618263305 - [Java is another popular programming language]

bm25.search("programming");

Sentence 0 : Score = 0.687935390645563 - [I love programming]
Sentence 4 : Score = 0.6174639603843102 - [I find programming fascinating]
Sentence 1 : Score = 0.5124700885780712 - [Java is my favorite programming language]
Sentence 3 : Score = 0.5124700885780712 - [Java is another popular programming language]
Sentence 2 : Score = 0.0 - [I enjoy writing code in Java]
Sentence 5 : Score = 0.0 - [I love Java]
Sentence 6 : Score = 0.0 - [I prefer Java over Python]
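
The scores above can be checked by hand. The following is a minimal sketch, not the library's source: it assumes lowercased whitespace tokenization, k1 = 1.5, b = 0.75, and the IDF form ln(1 + (N - n + 0.5) / (n + 0.5)). The class `Bm25Sketch` and its method names are hypothetical. Under those assumptions it gives roughly 0.6879 for "I love programming" against the query "programming", in line with the output listed above, which suggests (but does not confirm) that the library uses similar defaults.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only. The class name, method names, and the constants
// K1 and B are assumptions; they are not taken from the library's source.
final class Bm25Sketch {
    static final double K1 = 1.5;  // term-frequency saturation (assumed)
    static final double B = 0.75;  // document-length normalization (assumed)

    static final List<String> CORPUS = List.of(
        "I love programming",
        "Java is my favorite programming language",
        "I enjoy writing code in Java",
        "Java is another popular programming language",
        "I find programming fascinating",
        "I love Java",
        "I prefer Java over Python");

    // Lowercase and split on whitespace; the real tokenizer may differ.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // BM25 score of corpus.get(docIndex) for the given query.
    static double score(String query, int docIndex, List<String> corpus) {
        int n = corpus.size();
        double totalLength = 0;
        for (String doc : corpus) {
            totalLength += tokenize(doc).size();
        }
        double avgLength = totalLength / n;

        List<String> target = tokenize(corpus.get(docIndex));
        double score = 0.0;
        for (String term : tokenize(query)) {
            // Document frequency: number of documents containing the term.
            int docFreq = 0;
            for (String doc : corpus) {
                if (tokenize(doc).contains(term)) {
                    docFreq++;
                }
            }
            int tf = Collections.frequency(target, term);
            double idf = Math.log(1.0 + (n - docFreq + 0.5) / (docFreq + 0.5));
            double norm = tf + K1 * (1 - B + B * target.size() / avgLength);
            score += idf * tf * (K1 + 1) / norm;
        }
        return score;
    }

    public static void main(String[] args) {
        // Prints a value close to 0.6879 for document 0.
        System.out.println(score("programming", 0, CORPUS));
    }
}
```

Documents that contain none of the query terms have a term frequency of zero for every term, which is why they appear with a score of 0.0 in the listing above.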

With stop words

Get better results by removing language-specific stop words.

The stop word lists are based on the ones provided by the stopwords-iso project: https://github.com/stopwords-iso

The current implementation supports English, French, German, Dutch, Italian, and Spanish stop words.

      BM25 bm25 = new BM25(corpus, StopWords.ENGLISH);

With Stemming

Get better results by using stemming.

Stemming maps different forms of the same word to a common "stem". For example, the English stemmer maps running, run, and runs to run, so a search for 'running' also finds documents that contain only the other forms.

      BM25 bm25 = new BM25(corpus, StopWords.ENGLISH, new EnglishStemmer());

The default implementation uses the Porter2 stemmer from Snowball.
You can plug in other Stemmer implementations, for example ones backed by CoreNLP or Lucene.
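
To make the idea of stemming concrete, here is a deliberately naive sketch. It is not Porter2 and not the library's code: the class `ToyStemmer` and its `stem` method are hypothetical, and it handles only a couple of English endings. The library's own Stemmer type is not shown in this README, so wrapping logic like this in it is left as an assumption.

```java
import java.util.Locale;

// Hypothetical toy stemmer for illustration. A real stemmer such as Porter2
// covers far more cases and has many more rules.
final class ToyStemmer {
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() > 5 && w.endsWith("ing")) {
            w = w.substring(0, w.length() - 3);           // "running" -> "runn"
            int last = w.length() - 1;
            if (last >= 1 && w.charAt(last) == w.charAt(last - 1)) {
                w = w.substring(0, last);                 // "runn" -> "run"
            }
        } else if (w.length() > 3 && w.endsWith("s") && !w.endsWith("ss")) {
            w = w.substring(0, w.length() - 1);           // "runs" -> "run"
        }
        return w;
    }

    public static void main(String[] args) {
        // All three forms map to the same stem, "run".
        for (String word : new String[] {"running", "run", "runs"}) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

Because both the documents and the query would be passed through the same function, a query for "running" and a document containing "runs" end up sharing the term "run".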

bm25's Issues

Add stop words removal

Remove stop words in the given corpus.

Define a Stop Word List: Create a collection containing common stop words. This could be a static list within the class or loaded from an external resource.
Filter Stop Words in Document Processing: When calculating term frequencies (tf) and document frequencies (docFreq), exclude terms that are in the stop word list.
Filter Stop Words in Query Processing: When processing the search query, exclude terms from the query that are in the stop word list before calculating BM25 scores.
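
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code that was merged: the class `StopWordFilterSketch`, its `tokenize` method, and the tiny stop word set are all hypothetical, and a real list would be much longer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of stop word removal; names and word list are assumptions.
final class StopWordFilterSketch {
    // Step 1: define the stop word list (here a small static set).
    static final Set<String> STOP_WORDS = Set.of("i", "is", "my", "in", "the", "a");

    // Steps 2 and 3: the same tokenizer is used for documents and for queries,
    // so stop words never contribute to term or document frequencies.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Java is my favorite programming language"));
        System.out.println(tokenize("I love java"));
    }
}
```

Running every document and every query through one shared tokenizer keeps the two sides consistent, which avoids a query term being filtered on one side but counted on the other.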

Add Stemming support using Snowball

"Stemming maps different forms of the same word to a common "stem" - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a searching for connected would also find documents which only have the other forms.

This stem form is often a word itself, but this is not always the case as this is not a requirement for text search systems, which are the intended field of use. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem), and over-stemming is more problematic than under-stemming so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word then Snowball's stemming algorithms likely aren't the right answer."

Source: https://snowballstem.org/

Details of the English stemmer: https://snowballstem.org/algorithms/english/stemmer.html

Repository: stephanj/bm25 on GitHub