
Comments (3)

davidmezzetti commented on July 29, 2024

Thank you for the feature suggestion. That is an interesting concept, basically looking for a search result re-ranker.

In reviewing how nboost works, it looks like it sits in the middle as a proxy: it modifies the query to return more than the requested number of results, re-ranks them, and returns the Top N results.
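As a rough illustration of that over-fetch and re-rank flow, here is a minimal sketch. It is not nboost's code; `search`, `score`, and `overfetch` are hypothetical names standing in for the existing search backend, any query/text scoring function, and the over-fetch multiplier.

```python
from typing import Callable, List, Tuple


def rerank_top_n(
    query: str,
    n: int,
    search: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    overfetch: int = 5,
) -> List[Tuple[float, str]]:
    """Fetch more candidates than requested, re-score them, return the top n.

    `search` and `score` are placeholders (assumptions): plug in the real
    search backend and a real similarity function.
    """
    # Ask the underlying search for more results than the caller requested
    candidates = search(query, n * overfetch)

    # Re-score every candidate against the query
    scored = [(score(query, text), text) for text in candidates]

    # Highest score first, then keep only the n results originally requested
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]
```

The multiplier trades latency for recall: a larger `overfetch` gives the re-ranker more candidates to promote, at the cost of scoring more text per query.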

txtai is designed to work at the sentence/paragraph level rather than the document level. The embeddings.similarity method could be called between each document section and the query to get a list of scores for a document. Mean or max pooling of those scores can then build a single document score, and results are returned ordered by that score.
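The pooling idea can be sketched as follows. This is only an illustration under stated assumptions: `score` is a hypothetical stand-in for a per-section similarity call, and `documents` is assumed to map a document id to its list of sections.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple


def rank_documents(
    query: str,
    documents: Dict[str, List[str]],
    score: Callable[[str, str], float],
    pooling: str = "mean",
    size: int = 10,
) -> List[Tuple[str, float]]:
    """Score each section against the query, pool per document, sort descending."""
    if pooling not in ("mean", "max"):
        raise ValueError("pooling must be 'mean' or 'max'")
    pool = mean if pooling == "mean" else max

    ranked = []
    for doc_id, sections in documents.items():
        if not sections:
            # A document without sections cannot be scored
            continue
        section_scores = [score(query, section) for section in sections]
        ranked.append((doc_id, pool(section_scores)))

    # Most relevant documents first
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:size]
```

Max pooling favors documents with one highly relevant section, while mean pooling favors documents that are relevant throughout, so the two can order the same documents differently.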

Issue #12 covers adding a model serving capability via FastAPI. Consider an additional API call that looks something like:

/$model/rank

Post Params:

  • data: List of sections grouped by document
  • query: query text
  • size: number of documents to return
  • pooling: mean or max
  • tokenize: if data should be split into sentences

Returns:
data sorted by relevance to the query, most relevant first

Would that be beneficial? It could eventually grow into calling the search system's HTTP APIs directly, but I would probably start with this to support any platform that can pass data this way.

Alternatively, or possibly in addition to this, if you're willing to index your data in both a search system and txtai, a much more efficient search-time ranking could work like this:

/$model/rank

Post Params:

  • ids: list of document ids
  • query: query text
  • size: number of document ids to return
  • pooling: mean or max

Returns:
ids sorted by relevance to the query, most relevant first

This approach would pull all the sections for those ids from the txtai index and run a batch similarity query against the already indexed embeddings. Whether it makes sense depends on how much extra index time and data storage you are willing to add as overhead. Having the data indexed in txtai could also allow a hybrid query approach where both systems are queried and the results are joined in some way.
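A minimal sketch of the id-based variant is below. It does not use txtai itself: `index` is a hypothetical in-memory mapping from document id to the precomputed section vectors, and `query_vector` is assumed to have been embedded with the same model.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors, 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_ids(
    query_vector: Sequence[float],
    ids: List[str],
    index: Dict[str, List[Sequence[float]]],
    pooling: str = "mean",
    size: int = 10,
) -> List[str]:
    """Rank document ids using section vectors that were stored at index time."""
    if pooling not in ("mean", "max"):
        raise ValueError("pooling must be 'mean' or 'max'")

    ranked = []
    for doc_id in ids:
        # Look up precomputed vectors; ids missing from the index are skipped
        vectors = index.get(doc_id, [])
        if not vectors:
            continue
        scores = [cosine(query_vector, vector) for vector in vectors]
        pooled = max(scores) if pooling == "max" else sum(scores) / len(scores)
        ranked.append((doc_id, pooled))

    # Most relevant ids first
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:size]]
```

Because the section vectors already exist, the only model call at search time is embedding the query, which is where the efficiency gain over re-encoding every section would come from.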

There are a lot of different ideas here, but this is a list of initial thoughts to see which path(s) sound best to explore.

from txtai.

commented on July 29, 2024

If you want a dataset for testing, you can use the latest arXiv dataset available here with txtai.

To get the dump:

gsutil ls gs://arxiv-dataset/metadata-v5/
gsutil cp gs://arxiv-dataset/metadata-v5/arxiv-metadata-oai.json .

Is it possible to index this dataset quickly, given that it is updated every week?


davidmezzetti commented on July 29, 2024

The changes necessary to accomplish what you need via txtai are finally in place. txtai has undergone a lot of changes since this issue was created.

The best approach is indeed to use txtai as a proxy.

#49 added a Similarity pipeline that can be used directly from Python. An example of ranking documents with this is below.

import numpy as np

from txtai.pipeline import Similarity

# Run existing search, assume results has a "text" field
results = existingsearch(query, n=100)

# Use default model, can be any MNLI model on the Hugging Face model hub
similarity = Similarity()
scores = similarity(query, [r["text"] for r in results])

# Use scores to re-sort existing results, highest score first
reranked = [(scores[x], results[x]) for x in reversed(np.argsort(scores))]
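The re-sort step can also be written without NumPy. The sketch below assumes `scores` is a flat list of floats aligned index-by-index with `results`; `sort_by_score` is a hypothetical helper name, not part of txtai.

```python
from typing import Any, Dict, List, Sequence, Tuple


def sort_by_score(
    scores: Sequence[float], results: Sequence[Dict[str, Any]]
) -> List[Tuple[float, Dict[str, Any]]]:
    """Pair each result with its score and order the pairs by descending score."""
    if len(scores) != len(results):
        raise ValueError("scores and results must have the same length")

    # Sort indices rather than (score, result) pairs so that tied scores
    # never fall back to comparing the result dicts themselves
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(scores[i], results[i]) for i in order]
```

Sorting `zip(scores, results)` directly would raise a `TypeError` on tied scores, since Python would then try to compare two dicts.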

Alternatively, Embeddings similarity methods can still be used.

# Replace lines from above
similarity = Similarity()
scores = similarity(query, [r["text"] for r in results])

# With (Embeddings is imported via: from txtai.embeddings import Embeddings)
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})
scores = embeddings.similarity(query, [r["text"] for r in results])

The same tasks can be performed with the API with the following minimal configuration:

similarity:

An example using txtai.js:

import {Embeddings} from "txtai";

let embeddings = new Embeddings("http://localhost:8000");

// Store the scores under a new name and pass the text of each existing result
let scores = await embeddings.similarity(query, results.map(r => r.text));

There are a number of language bindings for the API that can be used to perform the same logic.

I'm going to go ahead and close this issue but please feel free to re-open or open a new issue if you are still interested in using txtai for this functionality.

