bm25's Introduction

BM25 Java Implementation

BM25 (Best Matching 25) is a ranking function used by search engines to rank matching documents according to their relevance to a given search query.

See also https://en.wikipedia.org/wiki/Okapi_BM25

Simple usage

  import java.util.List;
  import java.util.Map;

  // Documents to index; each string is one document.
  List<String> corpus = List.of(
      "I love programming",
      "Java is my favorite programming language",
      "I enjoy writing code in Java",
      "Java is another popular programming language",
      "I find programming fascinating",
      "I love Java",
      "I prefer Java over Python"
  );

  // Build the index over the corpus.
  BM25 bm25 = new BM25(corpus);

  // Each entry maps a document index to its BM25 score, highest score first.
  List<Map.Entry<Integer, Double>> results = bm25.search("I love java");

  for (Map.Entry<Integer, Double> entry : results) {
      System.out.println("Sentence " + entry.getKey() + " : Score = " + entry.getValue() + " - [" + corpus.get(entry.getKey()) + "]");
  }

Output:

Sentence 5 : Score = 2.286729869084079 - [I love Java]
Sentence 0 : Score = 1.8387268317084793 - [I love programming]
Sentence 6 : Score = 0.7294916714788526 - [I prefer Java over Python]
Sentence 2 : Score = 0.6674701123652661 - [I enjoy writing code in Java]
Sentence 4 : Score = 0.40211004330297734 - [I find programming fascinating]
Sentence 1 : Score = 0.33373505618263305 - [Java is my favorite programming language]
Sentence 3 : Score = 0.33373505618263305 - [Java is another popular programming language]

Searching for a single term:

  bm25.search("programming");

Output:

Sentence 0 : Score = 0.687935390645563 - [I love programming]
Sentence 4 : Score = 0.6174639603843102 - [I find programming fascinating]
Sentence 1 : Score = 0.5124700885780712 - [Java is my favorite programming language]
Sentence 3 : Score = 0.5124700885780712 - [Java is another popular programming language]
Sentence 2 : Score = 0.0 - [I enjoy writing code in Java]
Sentence 5 : Score = 0.0 - [I love Java]
Sentence 6 : Score = 0.0 - [I prefer Java over Python]
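
To see where these numbers come from, here is a minimal sketch of BM25 scoring. It is not this library's code. It assumes lowercased whitespace tokenization, k1 = 1.5, b = 0.75, and the IDF variant ln(1 + (N - df + 0.5) / (df + 0.5)); the class and method names are hypothetical. Under those assumptions it gives roughly 0.688 for "I love programming" against the query "programming", in line with the listing above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class Bm25Sketch {
    // Assumed tuning parameters; the library's actual values are not documented here.
    static final double K1 = 1.5;
    static final double B = 0.75;

    // Assumed tokenization: lowercase, split on whitespace.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // BM25 score of corpus.get(docIndex) for the given query.
    static double score(List<String> corpus, int docIndex, String query) {
        List<List<String>> docs = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        double totalLength = 0;
        for (String text : corpus) {
            List<String> tokens = tokenize(text);
            docs.add(tokens);
            totalLength += tokens.size();
            // Count each term once per document.
            for (String term : new HashSet<>(tokens)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        double avgLength = totalLength / docs.size();
        List<String> doc = docs.get(docIndex);

        double score = 0.0;
        for (String term : tokenize(query)) {
            int tf = Collections.frequency(doc, term);
            if (tf == 0) {
                continue; // a term missing from the document contributes nothing
            }
            int df = docFreq.get(term);
            double idf = Math.log(1.0 + (docs.size() - df + 0.5) / (df + 0.5));
            // Length normalization: longer-than-average documents are penalized.
            double norm = tf + K1 * (1 - B + B * doc.size() / avgLength);
            score += idf * tf * (K1 + 1) / norm;
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> corpus = List.of(
            "I love programming",
            "Java is my favorite programming language",
            "I enjoy writing code in Java",
            "Java is another popular programming language",
            "I find programming fascinating",
            "I love Java",
            "I prefer Java over Python");
        for (int i = 0; i < corpus.size(); i++) {
            System.out.println("Sentence " + i + " : Score = " + score(corpus, i, "programming"));
        }
    }
}
```

Shorter documents score higher for the same term frequency because of the length normalization term, which is why "I love programming" outranks "I find programming fascinating".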

With stop words

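Removing language-specific stop words usually gives better results, because very common words such as "is" or "the" say little about relevance.

The stop word lists come from the stopwords-iso project: https://github.com/stopwords-iso

The current implementation supports English, French, German, Dutch, Italian, and Spanish stop words.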

      BM25 bm25 = new BM25(corpus, StopWords.ENGLISH);

With stemming

Stemming can improve results further.

Stemming maps different forms of the same word to a common "stem". For example, the English stemmer maps "running", "run", and "runs" to "run", so a search for "running" also finds documents that contain only the other forms.

      BM25 bm25 = new BM25(corpus, StopWords.ENGLISH, new EnglishStemmer());

The default implementation uses the Porter2 stemmer from Snowball.
You can add other Stemmer implementations, for example, CoreNLP or Lucene.
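
As a rough illustration of why stemming helps matching, the sketch below uses a deliberately naive suffix stripper. It is not the Porter2 algorithm and it does not implement this library's Stemmer interface, whose signature is not shown here; `toyStem` and `stemAll` are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class StemmingSketch {
    // Toy stemmer (assumption, NOT Porter2): strips "ing" or a plural "s".
    static String toyStem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("ing") && w.length() > 5) {
            w = w.substring(0, w.length() - 3);
            int n = w.length();
            // Undo consonant doubling, e.g. "runn" -> "run".
            if (n >= 2 && w.charAt(n - 1) == w.charAt(n - 2)) {
                w = w.substring(0, n - 1);
            }
        } else if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3) {
            w = w.substring(0, w.length() - 1);
        }
        return w;
    }

    // Tokenize on whitespace and stem every token.
    static List<String> stemAll(String text) {
        List<String> stems = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            stems.add(toyStem(token));
        }
        return stems;
    }

    public static void main(String[] args) {
        // The document only contains "runs", but the stemmed query still matches it.
        List<String> docStems = stemAll("He runs daily");
        String queryStem = toyStem("running");
        System.out.println(docStems.contains(queryStem));
    }
}
```

The key point is that the same stemmer is applied to both the indexed documents and the query, so that both sides are compared in stemmed form.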

bm25's People

Contributors

stephanj

bm25's Issues

Add stop words removal

Remove stop words in the given corpus.

  1. Define a Stop Word List: Create a collection containing common stop words. This could be a static list within the class or loaded from an external resource.

  2. Filter Stop Words in Document Processing: When calculating term frequencies (tf) and document frequencies (docFreq), exclude terms that are in the stop word list.

  3. Filter Stop Words in Query Processing: When processing the search query, exclude terms from the query that are in the stop word list before calculating BM25 scores.
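
The three steps above can be sketched as follows. This is an illustrative outline, not the code that was merged: the stop word set is a tiny hand-picked sample, and `StopWordFilterSketch` and `tokenize` are hypothetical names. Using one tokenizer for both documents and queries covers steps 2 and 3 at once.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

final class StopWordFilterSketch {
    // Step 1: a static stop word list (small sample; a real list would be loaded from a resource).
    static final Set<String> STOP_WORDS = Set.of("i", "is", "my", "in", "over", "another");

    // Steps 2 and 3: the same filter is applied to documents and to queries,
    // so stop words never reach the term frequency or document frequency counts.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).trim().split("\\s+"))
                .filter(token -> !token.isEmpty())
                .filter(token -> !STOP_WORDS.contains(token))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Java is my favorite programming language"));
        System.out.println(tokenize("I love java"));
    }
}
```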

Add Stemming support using Snowball

"Stemming maps different forms of the same word to a common "stem" - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a searching for connected would also find documents which only have the other forms.

This stem form is often a word itself, but this is not always the case as this is not a requirement for text search systems, which are the intended field of use. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem), and over-stemming is more problematic than under-stemming so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word then Snowball's stemming algorithms likely aren't the right answer."

Source: https://snowballstem.org/

Details of the English stemmer: https://snowballstem.org/algorithms/english/stemmer.html
