Giter VIP home page Giter VIP logo

ndx's Introduction

ndx · GitHub license

Lightweight Full-Text Indexing and Searching Library.

This library were designed for a specific use case when all documents are stored on a disk (IndexedDB) and can be dynamically added or removed to an index.

Query function supports only disjunction operators. Queries like one two will work as "one" or "two".

Inverted Index doesn't store term locations and query function won't be able to search for phrases like "Super Mario".

There are many alternative solutions with different tradeoffs that may better suit for your particular use cases. For a simple document search with a static dataset, I would recommend to use something like fst and deploy it as an edge function (wasm).

Features

  • Multiple fields full-text indexing and searching.
  • Per-field score boosting.
  • BM25 ranking function to rank matching documents.
  • Trie based dynamic Inverted Index.
  • Configurable tokenizer and term filter.
  • Free text queries with query expansion.

Example

import { createIndex, indexAdd } from "ndx";
import { indexQuery } from "ndx/query";

const termFilter = (term) => term.toLowerCase();

function createDocumentIndex(fields) {
  // `createIndex()` creates an index data structure.
  // First argument specifies how many different fields we want to index.
  const index = createIndex(
    fields.length,
    // Tokenizer is a function that breaks text into words, phrases, symbols,
    // or other meaningful elements called tokens.
    (s) => s.split(" "),
    // Filter is a function that processes tokens and returns terms, terms are
    // used in Inverted Index to index documents.
    termFilter,
  );
  // `fieldGetters` is an array with functions that will be used to retrieve
  // data from different fields.
  const fieldGetters = fields.map((f) => (doc) => doc[f.name]);
  // `fieldBoostFactors` is an array of boost factors for each field, in this
  // example all fields will have identical weight.
  const fieldBoostFactors = fields.map(() => 1);

  return {
    index,
    // `add()` will add documents to the index.
    add(doc) {
      indexAdd(
        index,
        fieldGetters,
        // Docum  ent key, it can be an unique document id or a refernce to a
        // document if you want to store all documents in memory.
        doc.id,
        // Document.
        doc,
      );
    },
    // `remove()` will remove documents from the index.
    remove(id) {
      // When document is removed we are just marking document id as being
      // removed. Index data structure still contains references to the removed
      // document.
      indexRemove(index, removed, id);
      if (removed.size > 10) {
        // `indexVacuum()` removes all references to removed documents from the
        // index.
        indexVacuum(index, removed);
      }
    },

    // `search()` will be used to perform queries.
    search(q) {
      return indexQuery(
        index,
        fieldBoostFactors,
        // BM25 ranking function constants:
        // BM25 k1 constant, controls non-linear term frequency normalization
        // (saturation).
        1.2,
        // BM25 b constant, controls to what degree document length normalizes
        // tf values.
        0.75,
        q,
      );
    }
  };
}

// Create a document index that will index `content` field.
const index = createDocumentIndex([{ name: "content" }]);

const docs = [
  {
    "id": "1",
    "content": "Lorem ipsum dolor",
  },
  {
    "id": "2",
    "content": "Lorem ipsum",
  }
];

// Add documents to the index.
docs.forEach((d) => { index.add(d); });

// Perform a search query.
index.search("Lorem");
// => [{ key: "2" , score: ... }, { key: "1", score: ... } ]
//
// document with an id `"2"` is ranked higher because it has a `"content"`
// field with a less number of terms than document with an id `"1"`.

index.search("dolor");
// => [{ key: "1", score: ... }]

Tokenizers and Filters

ndx library doesn't provide any tokenizers or filters. There are other libraries that implement tokenizers, for example Natural has a good collection of tokenizers and stemmers.

License

MIT

ndx's People

Contributors

daqst avatar localvoid avatar vladimiry avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

ndx's Issues

Recommended way to update an index?

Hello, is there a recommended way to update an index in place?

Building on examples in the docs, I am using this right now:

    update: (id, doc) => {
      this.remove(id);
      if (removed.size) vacuumIndex(index, removed);
      this.add(doc);
    },

I am worried that vacuumIndex has to recurse the entire tree, and therefore it won't scale as well as add and remove. Is this a valid concern?

ability to search by field

Thank you for this repo, can I ask you if is possible to search by field, for example,
search all the docks where field = address and the query is 'test' .

import { DocumentIndex } from "ndx";

const index = new DocumentIndex();
index.addField("title");
index.addField("content");

const documents = [
  {
    id: "doc1",
    title: "First Document",
    content: "Lorem ipsum dolor",
    address: "15",
  },
  {
    id: "doc2",
    title: "Second Document",
    content: "Lorem ipsum",
    address: "23",
  }
];

documents.forEach((doc) => {
  index.add(doc.id, doc);
});

index.search("First");
// => [{ docId: "doc1", score: ... }]

index.search("Lorem");
// => [{ docId: "doc2", score: ... }, { docId: "doc1", score: ... }]

Feature request: adding commons/shared/base project

Such commons project could contain for example base tslint/tsconfig/other configs. Maybe also it could incorporate the https://github.com/ndx-search/ndx-utils project. The idea is to have a common configurations base point that the rest of the projects would be based on extending the default configs. That would allow increasing code consistency across the poly-repo projects.

TS supports tsconfig.json inheritance via Node.js packages since 3.2 https://www.typescriptlang.org/docs/handbook/release-notes/typescript-3-2.html

Search by ending

For example if you have a body of sit amet and search with amet you'l get the results, but if you search with met only, you won't get anything. I was wondering if this isn't working by design or is this a bug?

Thanks!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.