Hi, Hope you are all well ! I was wondering if we

[Feature] txtai as a proxy like nboost (CLOSED)

neuml commented on July 29, 2024

[Feature] txtai as a proxy like nboost

Comments (3)

davidmezzetti commented on July 29, 2024

Thank you for the feature suggestion. That is an interesting concept, basically looking for a search result re-ranker.

In reviewing how nboost works, it looks like they stand in the middle, make a query modification to return more than the requested number of results, re-rank and return the Top N results.
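The over-fetch and re-rank flow described above can be sketched roughly as follows. This is only an illustration, not how nboost is implemented: `rerank_proxy`, the `search` and `score` callables, and the `multiplier` parameter are all hypothetical names, and results are assumed to be dicts with a "text" field.

```python
from typing import Callable, Dict, List


def rerank_proxy(
    query: str,
    size: int,
    search: Callable[[str, int], List[Dict]],
    score: Callable[[str, List[str]], List[float]],
    multiplier: int = 10,
) -> List[Dict]:
    """Over-fetch from an existing search backend, re-rank, return the top `size`."""
    # Ask the backend for more candidates than the caller requested
    candidates = search(query, size * multiplier)

    # Score every candidate text against the query (hypothetical scorer)
    scores = score(query, [c["text"] for c in candidates])

    # Order candidate positions by score, highest first, and keep the top `size`
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:size]]
```

Passing the backend and the scorer in as callables keeps the sketch independent of any particular search system or model.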

txtai is designed to work at the sentence/paragraph level rather than the document level. The embeddings.similarity method could be called between the query and each document's sections to get a list of scores per document. Mean or max pooling of those scores then builds a document score, and results are returned based on that.
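As a minimal sketch of the pooling idea, assuming documents arrive as a dict of section lists and that some `score(query, sections)` callable returns one score per section (both are assumptions for illustration, not part of txtai):

```python
from statistics import mean
from typing import Callable, Dict, List, Optional, Tuple


def rank_documents(
    query: str,
    documents: Dict[str, List[str]],
    score: Callable[[str, List[str]], List[float]],
    pooling: str = "mean",
    size: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Pool per-section scores into one score per document and sort."""
    if pooling not in ("mean", "max"):
        raise ValueError("pooling must be 'mean' or 'max'")
    pool = max if pooling == "max" else mean

    ranked = []
    for doc_id, sections in documents.items():
        if not sections:
            continue  # nothing to score for an empty document
        # One similarity score per section, pooled into a document score
        ranked.append((doc_id, pool(score(query, sections))))

    # Highest document score first; size=None returns every document
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:size]
```

Mean pooling favors documents that are relevant throughout, while max pooling favors documents with at least one strongly matching section.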

Issue #12 covers adding a model serving capability via FastAPI. Suppose an additional API call looked something like:

/$model/rank

Post Params:

data: List of sections grouped by document
query: query text
size: number of documents to return
pooling: mean or max
tokenize: if data should be split into sentences

Returns:
data sorted by relevance to the query

Would it be beneficial? Could eventually grow into calling the HTTP APIs directly but I would probably start with this to support any platform that could pass data this way.

Alternatively, or possibly in addition to this, if you'd be willing to index your data in both a search system and txtai, a much more efficient search-time ranking could work as below:

/$model/rank

Post Params:

ids: list of document ids
query: query text
size: number of document ids to return
pooling: mean or max

Returns:
ids sorted by relevance to the query

This approach would pull all the sections from the txtai index and run a batch similarity query against the already indexed embeddings. Whether it is worthwhile depends on how much index time and data storage overhead you are willing to add. Having the data indexed in txtai could also allow a hybrid query approach where both systems are queried and the results joined in some way.
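A rough sketch of the id-based variant, under the assumption that section embeddings were computed at index time. The in-memory `index` dict below is a hypothetical stand-in for the txtai index, and `rank_ids` and `cosine` are illustrative names rather than txtai functions:

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors, 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_ids(
    query_vector: Vector,
    ids: List[str],
    index: Dict[str, List[Vector]],
    size: int,
    pooling: str = "mean",
) -> List[str]:
    """Rank document ids using section embeddings that are already stored."""
    scored = []
    for doc_id in ids:
        vectors = index.get(doc_id, [])
        if not vectors:
            continue  # id is not present in the (hypothetical) index
        # Compare the query against each stored section embedding
        scores = [cosine(query_vector, v) for v in vectors]
        pooled = max(scores) if pooling == "max" else sum(scores) / len(scores)
        scored.append((doc_id, pooled))

    # Highest pooled score first, then keep only the ids
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:size]]
```

Only the query needs to be embedded at search time here, which is where the efficiency gain over re-encoding every section comes from.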

A lot of different ideas here, but this is a list of initial thoughts to see which path(s) sound best to explore.

commented on July 29, 2024

If you want a dataset for testing, you can use the latest arXiv dataset available here with txtai.

To get the dump:

gsutil ls gs://arxiv-dataset/metadata-v5/
gsutil cp gs://arxiv-dataset/metadata-v5/arxiv-metadata-oai.json .

Is it possible to index this dataset quickly, as there is an update every week?

davidmezzetti commented on July 29, 2024

txtai finally has the changes necessary to accomplish what you need. txtai has undergone a lot of changes since this issue was created.

The best approach is indeed to use txtai as a proxy.

#49 added a Similarity pipeline that can be used directly from Python. An example of ranking documents with this is below.

import numpy as np

from txtai.pipeline import Similarity

# Run existing search, assume results has a "text" field
# (existingsearch is a placeholder for your current search call)
results = existingsearch(query, n=100)

# Use default model, can be any MNLI model on the Hugging Face model hub
similarity = Similarity()
scores = similarity(query, [r["text"] for r in results])

# Use scores to re-sort existing results, highest score first
reranked = [(scores[x], results[x]) for x in reversed(np.argsort(scores))]
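The snippet above assumes `similarity` returns a flat list of scores aligned with the input texts. If your installed txtai version instead returns `(id, score)` tuples, where the id is the position in the input list, a small helper along these lines could handle both shapes. `apply_scores` is a hypothetical name, and the return format is worth checking against your version:

```python
from typing import Dict, List, Sequence, Tuple, Union

Score = Union[float, Tuple[int, float]]


def apply_scores(results: List[Dict], scores: Sequence[Score]) -> List[Tuple[float, Dict]]:
    """Pair each result with its score and sort, highest score first."""
    pairs = []
    for position, item in enumerate(scores):
        if isinstance(item, (tuple, list)):
            # (id, score) form: id is the index into the original results
            index, score = item
        else:
            # Flat form: the score at position i belongs to results[i]
            index, score = position, item
        pairs.append((float(score), results[index]))

    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs
```

With this in place, `reranked = apply_scores(results, scores)` yields the same `(score, result)` pairs as the list comprehension above.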

Alternatively, Embeddings similarity methods can still be used.

# Replace lines from above
similarity = Similarity()
scores = similarity(query, [r["text"] for r in results])

# With
from txtai.embeddings import Embeddings

embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})
scores = embeddings.similarity(query, [r["text"] for r in results])

The same tasks can be performed with the API with the following minimal configuration:

similarity:

An example using txtai.js

import {Embeddings} from "txtai";

let embeddings = new Embeddings("http://localhost:8000");
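// Score each result's text against the query (assumes each result has a "text" field)
let scores = await embeddings.similarity(query, results.map(r => r.text));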

There are a number of language bindings for the API that can be used to perform the same logic.
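For platforms without a binding, the API can also be called over plain HTTP. The sketch below makes several assumptions that should be checked against your txtai version: that the service exposes a POST `/similarity` endpoint, that it accepts a JSON body with `query` and `texts`, and that it replies with a list of `{"id": ..., "score": ...}` objects. The function names are hypothetical.

```python
import json
from typing import Dict, List
from urllib.request import Request, urlopen


def build_similarity_request(base_url: str, query: str, texts: List[str]) -> Request:
    """Build a POST request for the (assumed) /similarity endpoint."""
    body = json.dumps({"query": query, "texts": texts}).encode("utf-8")
    return Request(
        f"{base_url.rstrip('/')}/similarity",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reorder(results: List[Dict], response: List[Dict]) -> List[Dict]:
    """Re-sort results using the assumed [{"id": ..., "score": ...}] reply."""
    rows = sorted(response, key=lambda row: row["score"], reverse=True)
    return [results[row["id"]] for row in rows]


def rerank_via_api(base_url: str, query: str, results: List[Dict]) -> List[Dict]:
    """Send result texts to the API and return results in ranked order."""
    request = build_similarity_request(base_url, query, [r["text"] for r in results])
    with urlopen(request) as reply:
        return reorder(results, json.load(reply))
```

For example, `rerank_via_api("http://localhost:8000", query, results)` would target the same local service as the txtai.js example.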

I'm going to go ahead and close this issue but please feel free to re-open or open a new issue if you are still interested in using txtai for this functionality.
