Comments (3)
Thank you for the feature suggestion. That is an interesting concept: essentially a search result re-ranker.
In reviewing how nboost works, it looks like it stands in the middle, modifies the query to return more than the requested number of results, re-ranks them and returns the top N results.
txtai is designed to work at the sentence/paragraph level vs the document level. The embeddings.similarity method could be called between each document section and the query to get a list of scores for a document. Mean or max pooling of those scores would then produce a document score, and results would be returned based on that.
Issue #12 covers adding a model serving capability via FastAPI. An additional API call could look something like:
/$model/rank
Post Params:
- data: List of sections grouped by document
- query: query text
- size: number of documents to return
- pooling: mean or max
- tokenize: if data should be split into sentences
Returns:
data sorted by relevance to the query
Would that be beneficial? This could eventually grow into calling the HTTP APIs directly, but I would probably start with this to support any platform that can pass data this way.
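To make the contract concrete, here is a toy sketch of the proposed rank call. The word-overlap scorer is a stand-in for a real embeddings similarity call, and all names are illustrative, not an existing txtai API:

```python
# Toy sketch of the proposed rank call: sections grouped by document,
# per-section scores pooled (mean or max) into a document score.
def rank(data, query, size=10, pooling="mean"):
    """data: list of documents, each a list of section texts."""
    qwords = set(query.lower().split())

    def score(section):
        # Stand-in scorer: fraction of query words found in the section
        words = set(section.lower().split())
        return len(qwords & words) / max(len(qwords), 1)

    pool = max if pooling == "max" else (lambda s: sum(s) / len(s))

    # Pool section scores per document, sort documents best-first
    scored = [(pool([score(s) for s in doc]), doc) for doc in data]
    return [doc for _, doc in sorted(scored, key=lambda x: x[0], reverse=True)][:size]

docs = [["cooking recipes daily"], ["machine learning text", "neural search"]]
print(rank(docs, "machine learning", size=2, pooling="max"))
```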
Alternatively, or possibly in addition, if you'd be willing to index your data in both a search system and txtai, a much more efficient search-time ranking could work as below:
/$model/rank
Post Params:
- ids: list of document ids
- query: query text
- size: number of document ids to return
- pooling: mean or max
Returns:
ids sorted by relevance to the query
This approach would pull all the sections from the txtai index and run a batch similarity query against the already-indexed embeddings. But it depends on how much index time and data storage overhead you are willing to take on. Having the data indexed in txtai would also allow a hybrid approach where both systems are queried and the results joined in some way.
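The id-based path can be sketched with NumPy. Random vectors stand in for the stored section embeddings, and the function name and layout are illustrative, not txtai internals:

```python
# Sketch of id-based ranking against pre-indexed section embeddings.
import numpy as np

rng = np.random.default_rng(0)
# Stand-in "index": 4 documents, each with 3 section embeddings of dim 8
index = {f"doc{i}": rng.normal(size=(3, 8)) for i in range(4)}

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def rank(ids, query_vector, size=2, pooling="mean"):
    """Batch cosine similarity per section, pooled into a document score."""
    q = normalize(query_vector)
    scores = {}
    for uid in ids:
        sims = normalize(index[uid]) @ q  # cosine similarity per section
        scores[uid] = sims.max() if pooling == "max" else sims.mean()
    return sorted(scores, key=scores.get, reverse=True)[:size]

print(rank(list(index), rng.normal(size=8)))
```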
Lots of different ideas here, but these are initial thoughts to see which path(s) sound best to explore.
from txtai.
If you want a dataset for testing, you can use the latest arXiv dataset available here with txtai.
To get the dump:
gsutil ls gs://arxiv-dataset/metadata-v5/
gsutil cp gs://arxiv-dataset/metadata-v5/arxiv-metadata-oai.json .
Is it possible to index this dataset quickly, as there is an update every week?
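One way to feed the dump into txtai is to stream it. A minimal sketch, assuming the metadata file is JSON lines with "id", "title" and "abstract" keys per record (the field names and the indexing calls in the comments are assumptions, not verified against this exact dump):

```python
# Stream the arXiv metadata dump into (id, text, tags) tuples
import json

def stream(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            text = f"{record['title']} {record['abstract']}".strip()
            yield (record["id"], text, None)

# With txtai (not run here); a weekly update would re-run index on the new dump:
# from txtai.embeddings import Embeddings
# embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})
# embeddings.index(stream("arxiv-metadata-oai.json"))
```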
Finally have the changes necessary to accomplish what you need via txtai. txtai has undergone a lot of changes since this issue was created.
The best approach is indeed to use txtai as a proxy.
#49 added a Similarity pipeline that can be used directly from Python. An example of ranking documents with it is below.
import numpy as np

from txtai.pipeline import Similarity

# Run existing search, assume each result has a "text" field
results = existingsearch(query, n=100)

# Use default model, can be any MNLI model on the Hugging Face model hub
similarity = Similarity()
scores = similarity(query, [r["text"] for r in results])

# Use scores to re-sort existing results, best match first
reranked = [(scores[x], results[x]) for x in reversed(np.argsort(scores))]
Alternatively, the Embeddings similarity method can still be used.
# Replace these lines from above
similarity = Similarity()
scores = similarity(query, [r["text"] for r in results])

# With
from txtai.embeddings import Embeddings

embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})
scores = embeddings.similarity(query, [r["text"] for r in results])
The same tasks can be performed with the API with the following minimal configuration:
similarity:
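Once the service is running, the endpoint can be called over HTTP. A minimal sketch, assuming the txtai FastAPI service exposes a /similarity route that takes a query and a list of texts (the start command and response shape in the comments are assumptions; the request itself is commented out so this runs without a server):

```python
import json

# Assumed start command for the service: CONFIG=config.yml uvicorn "txtai.api:app"
payload = {
    "query": "feel good story",
    "texts": ["Maine man wins lottery", "Severe weather closes schools"],
}
body = json.dumps(payload)

# import requests
# response = requests.post("http://localhost:8000/similarity", json=payload)
# print(response.json())  # assumed: list of {"id": index, "score": value}, best first

print(body)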
An example using txtai.js:
import {Embeddings} from "txtai";

let embeddings = new Embeddings("http://localhost:8000");

// Inside an async function
let scores = await embeddings.similarity(query, texts);
There are a number of language bindings for the API that can be used to perform the same logic.
I'm going to go ahead and close this issue but please feel free to re-open or open a new issue if you are still interested in using txtai for this functionality.