Comments (4)
Thank you for reporting this issue. Can you show how you are creating the Embeddings object? Are you using transformer-backed or word-backed vectors?
This is a good find with the default tokenizer. Another issue (#39) is most likely running into this same problem.
As a workaround, can you try passing lists to the index/search/similarity calls? This will skip tokenization.
In other words:
embeddings.index([(uid, text.split(), None) for uid, text in enumerate(sections)])
embeddings.search(query.split())
or
embeddings.similarity(query, [x.split() for x in sections])
Thank you for your reply. I solved this problem by replacing
return [token for token in tokens if re.match(r"^\d*[a-z][-.0-9:_a-z]{1,}$", token) and token not in Tokenizer.STOP_WORDS]
with
return tokens
The original line returns [] if the string doesn't contain any English letters, because every token has to match r"^\d*[a-z][-.0-9:_a-z]{1,}$". SentenceTransformer.encode() accepts strings as input, so in my case I just pass the strings directly to SentenceTransformer.encode(), which solves the problem.
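For anyone who wants to see the behavior in isolation, here is a minimal sketch of that filter. The regular expression is the one quoted above; `filter_tokens`, the lowercase/whitespace split, and the small `STOP_WORDS` set are simplified stand-ins I'm assuming for illustration, not the actual txtai Tokenizer code.

```python
import re

# Pattern quoted in the comment above: optional leading digits, then a
# lowercase ASCII letter, then at least one more allowed character.
TOKEN_PATTERN = re.compile(r"^\d*[a-z][-.0-9:_a-z]{1,}$")

# Hypothetical stand-in for Tokenizer.STOP_WORDS (the real set is larger).
STOP_WORDS = {"the", "is", "a"}


def filter_tokens(text):
    """Simplified stand-in: lowercase, split on whitespace, then filter."""
    tokens = text.lower().split()
    return [t for t in tokens if TOKEN_PATTERN.match(t) and t not in STOP_WORDS]


# English text keeps its non-stop-word tokens.
print(filter_tokens("the quick brown fox"))  # ['quick', 'brown', 'fox']

# Text with no ASCII letters loses every token.
print(filter_tokens("今天 天气 很好"))  # []
```

With an empty token list there is nothing left to embed, which is why returning the unfiltered tokens (or skipping tokenization entirely) avoids the problem for non-English text.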
Thanks for the update. We'll keep this issue open and address it in the next release.
Issue should now be resolved. Tokenization can be disabled by setting the config option:
Embeddings({"method": "transformers", "path": "/path/to/model", "tokenize": False})