Comments (1)
Thank you for using txtai and trying it out on your data. It is common with similarity search to run into accuracy issues when the length of the content varies widely. Without knowing the exact data you're working with, here are some general ideas to try:
- Try different Sentence Transformers models: https://huggingface.co/models?search=sentence-transformers
- For example, try bert-base-nli-stsb-mean-tokens
- Train a custom Sentence Transformers model against your data: https://github.com/UKPLab/sentence-transformers#model-training-from-scratch
- Try word embeddings instead of transformer models. Word embedding models have different ways to weight the word embeddings when averaging them together, such as BM25. BM25 factors in the length of the content as part of its scoring algorithm.
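
To make the last point more concrete, here is a minimal sketch of how BM25 accounts for content length. It is a plain-Python version of the standard Okapi BM25 formula, not txtai's own implementation; the function name `bm25_scores`, the whitespace tokenization, the sample documents, and the defaults `k1=1.2` and `b=0.75` are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25 (illustrative sketch)."""
    # Naive whitespace tokenization, assumed here for simplicity
    tokenized = [doc.lower().split() for doc in documents]
    n = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n

    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Length normalization: grows with document length relative to the average
        norm = k1 * (1 - b + b * len(tokens) / avgdl)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


# Hypothetical sample data: the same match inside a short and a long document
documents = [
    "txtai builds a semantic search index",
    "txtai builds a semantic search index over text and this longer passage "
    "adds many extra words that dilute the match",
    "completely unrelated content about cooking pasta",
]
scores = bm25_scores("semantic search", documents)
```

With these inputs the short document scores higher than the long one even though both contain the query terms once each. Setting `b=0` turns off length normalization, which makes the two scores equal.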