Hi. For English, it is the right tokenizer, such as tokens

tokenizer.py about txtai HOT 4 CLOSED

neuml commented on July 29, 2024

tokenizer.py

Comments (4)

davidmezzetti commented on July 29, 2024

Thank you for reporting this issue. Can you show how you are creating the Embeddings object? Are you using transformer or word backed vectors?

This is a good find with the default tokenizer. Another issue (#39) is most likely running into this same problem.

As a workaround, can you try passing lists to the index/search/similarity calls? This will skip tokenization.

In other words:

embeddings.index([(uid, text.split(), None) for uid, text in enumerate(sections)])
embeddings.search(query.split())

or

embeddings.similarity(query, [x.split() for x in sections])

zhaogangthu commented on July 29, 2024

Thank you for your reply. I solved this problem by replacing
return [token for token in tokens if re.match(r"^\d*[a-z][-.0-9:_a-z]{1,}$", token) and token not in Tokenizer.STOP_WORDS]
with
return tokens
The original filter returns [] when the string contains no English letters, because no token matches r"^\d*[a-z][-.0-9:_a-z]{1,}$". SentenceTransformer.encode() accepts strings as input, so in my case I just pass the strings directly to SentenceTransformer.encode(), which solves the problem.
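
To illustrate the behavior described above, here is a minimal standalone sketch. It is not txtai's code. The pattern is copied from the filter quoted above, but the helper name filter_tokens, the small STOP_WORDS set, and the lowercase-and-split step are assumptions made for the example.

```python
import re

# Pattern copied from the filter quoted above.
TOKEN_PATTERN = re.compile(r"^\d*[a-z][-.0-9:_a-z]{1,}$")

# Stand-in stop word set (assumption; the real Tokenizer.STOP_WORDS differs).
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to"}


def filter_tokens(text):
    """Lowercase, split on whitespace, then keep tokens that match the
    pattern and are not stop words (hypothetical helper)."""
    tokens = text.lower().split()
    return [t for t in tokens if TOKEN_PATTERN.match(t) and t not in STOP_WORDS]


english = filter_tokens("The tokenizer keeps English words")
chinese = filter_tokens("今天 天气 很 好")

print(english)  # English tokens survive, minus the stop word "the"
print(chinese)  # no token starts with optional digits followed by a-z, so nothing survives
```

Every token must contain an ASCII letter to match, so text written entirely in another script is filtered down to an empty list before it reaches the model.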

davidmezzetti commented on July 29, 2024

Thanks for the update. We'll keep this issue open and address it in the next release.

davidmezzetti commented on July 29, 2024

This issue should now be resolved. Tokenization can be disabled by setting the "tokenize" config option to False:

Embeddings({"method": "transformers", "path": "/path/to/model", "tokenize": False})
