neuml / txtai
💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
Home Page: https://neuml.github.io/txtai
License: Apache License 2.0
Hi,
I'm trying to use either the gpt2 or t5-3b models with txtai (one of the notebooks mentions that any model listed on the Hugging Face model hub should work), but I receive several errors:
ERROR:transformers.tokenization_utils_base:Using pad_token, but it is not set yet.
Traceback (most recent call last):
File "./text-ai.py", line 24, in
similarities = embeddings.similarity(s, queries)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 227, in similarity
query = self.transform((None, query, None)).reshape(1, -1)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 178, in transform
embedding = self.model.transform(document)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/vectors.py", line 257, in transform
return self.model.encode([" ".join(document[1])], show_progress_bar=False)[0]
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 176, in encode
sentence_features = self.get_sentence_features(text, longest_seq)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 219, in get_sentence_features
return self._first_module().get_sentence_features(*features)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/models/Transformer.py", line 61, in get_sentence_features
return self.tokenizer.prepare_for_model(tokens, max_length=pad_seq_length, pad_to_max_length=True, return_tensors='pt', truncation=True)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2021, in prepare_for_model
padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1529, in _get_padding_truncation_strategies
raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token
(tokenizer.pad_token = tokenizer.eos_token e.g.)
or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).
or for T5:
File "./text-ai.py", line 24, in
similarities = embeddings.similarity(s, queries)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 227, in similarity
query = self.transform((None, query, None)).reshape(1, -1)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 178, in transform
embedding = self.model.transform(document)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/vectors.py", line 257, in transform
return self.model.encode([" ".join(document[1])], show_progress_bar=False)[0]
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 187, in encode
out_features = self.forward(features)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/models/Transformer.py", line 25, in forward
output_states = self.auto_model(**features)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/modeling_t5.py", line 965, in forward
decoder_outputs = self.decoder(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/modeling_t5.py", line 684, in forward
raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")
What am I missing?
Thanks!
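For reference, a minimal sketch of the workaround the GPT-2 error message itself suggests: reuse the EOS token as the padding token. Note this only clears the tokenizer error; gpt2 and t5-3b are generative models and aren't trained to produce sentence embeddings, so similarity results may be poor regardless.

```python
# Workaround suggested by the error message: GPT-2 has no pad token,
# so reuse the EOS token for padding. Tokenizer-level fix only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(["sample text"], padding=True, truncation=True, return_tensors="pt")
print(batch["input_ids"].shape)
```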
Review the methods here: https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/distillation
See if this can be integrated to reduce the storage necessary for embeddings indices.
With the additional functionality added to txtai over the last few releases, the API definitions have gotten somewhat inconsistent. This issue will address that and make many of the return types across modules consistent. The changes are breaking in many cases and will require a bump of the major version of txtai to v2.
The current Python API definitions for v1 are:
embeddings.search("query text")
return [(id, score)] sort score desc
embeddings.similarity("query text", documents)
return [score]
embeddings.add(documents)
embeddings.index()
embeddings.transform("text")
return [float]
extractor(sections, queue)
return [(name, answer)]
labels("text", ["label1"])
return [(label, score)] sort score desc
The new method templates and return types are below (a usage sketch follows the list).
embeddings.search("query text")
return [(id, score)] sort score desc
embeddings.batchsearch(["query text1", "query text2"])
return [[(id, score)] sort score desc]
embeddings.add(documents)
embeddings.index()
embeddings.similarity("query text", texts)
return [(id, score)] sort score desc
embeddings.batchsimilarity(["query text1", "query text2"], texts)
return [[(id, score)] sort score desc]
embeddings.transform("text")
return [float]
embeddings.batchtransform(["text1", "text2"])
return [[float]]
extractor(queue, texts)
return [(name, answer)]
labels("text", ["label1"])
return [(id, score)] sort score desc
labels(["text1", "text2"], ["label1"])
return [[(id, score)] sort score desc]
similarity("query text", texts)
return [(id, score)] sort score desc
batchsimilarity(["query text1", "query text2"], texts)
return [[(id, score)] sort score desc]
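As an illustration, a hedged sketch of how a caller might use the proposed v2 methods (method names come from the templates above; this is not a released API):

```python
from txtai.embeddings import Embeddings

# Sketch of the proposed v2 surface (method names from the templates above)
embeddings = Embeddings({"method": "transformers",
                         "path": "sentence-transformers/bert-base-nli-mean-tokens"})

texts = ["text1", "text2"]
embeddings.add([(uid, text, None) for uid, text in enumerate(texts)])
embeddings.index()

print(embeddings.search("query text"))                         # [(id, score)] sorted desc
print(embeddings.batchsearch(["query text1", "query text2"]))  # [[(id, score)]]
print(embeddings.similarity("query text", texts))              # [(id, score)] sorted desc
print(embeddings.batchtransform(texts))                        # [[float]]
```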
The API methods also need to have corresponding changes.
Given that JSON doesn't support tuples and some languages can't easily map arrays/tuples to objects, the return types are mapped from tuples to JSON objects. For example, instead of (id, score) the API will return {"id": value, "score": value}.
The API also has a few differences from the native Python API.
The following list shows how the API methods will look through language-binding libraries.
embeddings.search("query text")
embeddings.batchsearch(["query text1", "query text2"])
embeddings.add(documents)
embeddings.index()
embeddings.similarity("query text", texts)
embeddings.batchsimilarity(["query text1", "query text2"], texts)
embeddings.transform("text")
embeddings.batchtransform(["text1", "text2"])
extractor.extract(questions, texts)
labels.label("text", ["label1"])
labels.batchlabel(["text1", "text2"], ["label1"])
similarity.similarity("query text", texts)
similarity.batchsimilarity(["query text1", "query text2"], texts)
Re-run and save the example notebooks to ensure they all work with the latest libraries.
Also add a notebook on labeling.
Currently, the pipeline component has logic to work around a performance issue in Transformers < 4.0. This performance issue has been resolved. Refactor this component to use the Hugging Face pipeline directly.
Also consolidate labels methods into the pipeline module.
Faiss 1.6.4 supports Windows. This upgrade will help simplify the code base on all platforms.
travis-ci.org builds are frequently backlogged more than an hour, which doesn't work for continuous development. Migrate to GitHub actions.
Once successful, revoke all third-party app access.
For search, similarity, extractive qa and labels, all methods should operate on batches for the best performance.
Currently, sentence-transformer based indices are indexing documents one at a time. Calls to sentence-transformers should be batched together to decrease indexing time.
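A minimal sketch of the batching idea, assuming direct access to the underlying sentence-transformers model (model name and batch size are illustrative):

```python
from sentence_transformers import SentenceTransformer

# Encode documents in batches instead of one call per document;
# encode accepts a list and batches internally via batch_size.
model = SentenceTransformer("sentence-transformers/bert-base-nli-mean-tokens")
documents = ["document %d" % x for x in range(1000)]

vectors = model.encode(documents, batch_size=32, show_progress_bar=False)
```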
For word embedding models, add an option to include the word vectors model in the models path.
The functionality provided via the txtai API has increased significantly. Improve test coverage in that area.
hnswlib requires numpy and is failing to build a wheel. The build process falls back to the legacy build, which will be removed in pip 21.x.
The combination of pip 20.3.x, transformers 4.x and sentence-transformers 0.3.9 has caused build errors not related to txtai.
Remove the following lines from build.yml once they are resolved upstream.
# Remove hardcoding to 20.2.4
pip install -U pip==20.2.4 wheel coverage coveralls
Add a pattern for serving models. The API should be driven by a configuration YAML file listing the model name and path.
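A hypothetical configuration file for this pattern might look like the sketch below (all keys are illustrative assumptions, not a final design):

```yaml
# Hypothetical serving configuration (all keys illustrative)
path: /models/txtai/index            # saved embeddings index to load
embeddings:
  method: transformers
  path: sentence-transformers/bert-base-nli-mean-tokens
```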
API endpoints:
Currently, extractor.py has a single method to run an embeddings query and then run extractive QA over those results.
This should be split into two separate methods, which allows external callers to just run the embeddings search without executing the QA extraction. This will allow downstream systems more flexibility in working with the extractor process.
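A rough sketch of the proposed split; the method names query/extract are illustrative assumptions, and qa is assumed to be a Hugging Face question-answering pipeline:

```python
# Rough sketch of splitting extractor.py into two methods (names hypothetical)
class Extractor:
    def __init__(self, embeddings, qa):
        self.embeddings = embeddings
        self.qa = qa

    def query(self, query, texts):
        # Step 1: embeddings search only; external callers can stop here
        return self.embeddings.similarity(query, texts)

    def extract(self, questions, context):
        # Step 2: extractive QA over previously retrieved text
        return [(question, self.qa(question=question, context=context)["answer"])
                for question in questions]
```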
Currently, transformers is fixed to 3.0.2 due to an issue with sentence-transformers.
Once sentence-transformers v0.3.6 is released, which will support 3.1.x, update setup.py accordingly.
Transformers 4.x has numerous performance improvements. Upgrade dependency to require at least 4.0.0.
I am trying to run Example 1 with word embeddings using the following code:

```python
import numpy as np

from txtai.embeddings import Embeddings

embeddings = Embeddings({"path": "word-vectors/GoogleNews-vectors-negative300.magnitude",
                         "storevectors": True,
                         "scoring": "bm25",
                         "pca": 3,
                         "quantize": True})

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Get index of section that best matches query
    candidates = embeddings.similarity(query, sections)
    uid = np.argmax(candidates)

    print("%-20s %s" % (query, sections[uid]))
```
Traceback (most recent call last):
File "2.py", line 24, in
candidates = embeddings.similarity(query, sections)
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/embeddings.py", line 228, in similarity
query = self.transform((None, query, None)).reshape(1, -1)
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/embeddings.py", line 179, in transform
embedding = self.model.transform(document)
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/vectors.py", line 155, in transform
weights = self.scoring.weights(document) if self.scoring else None
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/scoring.py", line 133, in weights
weights.append(self.score(freq, idf, length))
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/scoring.py", line 217, in score
k = self.k1 * ((1 - self.b) + self.b * length / self.avgdl)
ZeroDivisionError: float division by zero
Do I need to do something with the word embeddings before I can use them for similarity search?
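One likely cause, judging from the word embeddings notebook: the BM25 scoring index must be built over the documents before similarity is called, otherwise avgdl stays 0 and the score formula divides by zero. A hedged sketch, continuing from the snippet above:

```python
# Possible fix (based on the word embeddings notebook): build the scoring
# index first so avgdl is non-zero when BM25 weights are computed.
embeddings.score([(uid, section, None) for uid, section in enumerate(sections)])

candidates = embeddings.similarity(query, sections)
```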
I have run the following sample code for the extractor to perform Q&A on OS X but the results return None:
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})
extractor = Extractor(embeddings, "distilbert-base-cased-distilled-squad")
sections = ["Giants hit 3 HRs to down Dodgers",
"Giants 5 Dodgers 4 final",
"Dodgers drop Game 2 against the Giants, 5-4",
"Blue Jays 2 Red Sox 1 final",
"Red Sox lost to the Blue Jays, 2-1",
"Blue Jays at Red Sox is over. Score: 2-1",
"Phillies win over the Braves, 5-0",
"Phillies 5 Braves 0 final",
"Final: Braves lose to the Phillies in the series opener, 5-0",
"Final score: Flyers 4 Lightning 1",
"Flyers 4 Lightning 1 final",
"Flyers win 4-1"]
sections = [(uid, section) for uid, section in enumerate(sections)]
questions = ["What team won the game?", "What was score?"]
execute = lambda query: extractor(sections, [(question, query, question, False) for question in questions])
for query in ["Red Sox - Blue Jays", "Phillies - Braves", "Dodgers - Giants", "Flyers - Lightning"]:
print("----", query, "----")
for answer in execute(query):
print(answer)
print()
Results:
---- Red Sox - Blue Jays ----
('What team won the game?', None)
('What was score?', None)
---- Phillies - Braves ----
('What team won the game?', None)
('What was score?', None)
---- Dodgers - Giants ----
('What team won the game?', None)
('What was score?', None)
---- Flyers - Lightning ----
('What team won the game?', None)
('What was score?', None)
Hi,
Hope you are all well !
I was wondering if we can use txtai like nboost, as a proxy for Elasticsearch or Manticore Search?
I am really interested in a Manticore Search integration, as I built https://paper2code.com around this full-text search engine.
Thanks for your insights and inputs about this question.
Cheers,
X
Add testing framework and integrate Travis CI
Oh, my...
Also, a 403 response code is blocking the download...
Next time, please save the file locally so it can be downloaded again.
Please.
First of all, thanks a lot for your work, it's really great!
In order to use the search on lots of documents, it would be great if we could:
Keep up the good work!
It can have sections such as:
P.S. I am new to NeuML and found this an interesting initiative. Would love to contribute!
Hi, I have successfully installed txtai on my Linux server. When I run Python and do from txtai.embeddings import Embeddings,
it terminates the Python process with an Illegal instruction (core dumped) error. Following are the details of my Linux server; can anybody help me figure out the problem and fix it? Thanks.
Linux-4.15.0-45-generic-x86_64-with-Ubuntu-18.04-bionic
Number of cores: 40
RAM: 126GB
Python 3.6.9 (default, Nov 7 2019, 10:44:02)
[GCC 8.3.0]
This is the original code from Introducing txtai.py:

```python
import numpy as np

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Get index of section that best matches query
    uid = np.argmax(embeddings.similarity(query, sections))

    print("%-20s %s" % (query, sections[uid]))
```
A problem occurs in Colab when executing the following line of code:
uid = np.argmax(embeddings.similarity(query, sections))
It shows: "ValueError: Wrong shape for input_ids (shape torch.Size([6])) or attention_mask (shape torch.Size([6]))"
The problem didn't occur a few days ago.
Currently, Faiss add_with_ids is used to store ids. This assumes that ids are 64-bit ints. With the addition of Annoy, which only supports sequential ids, and hnswlib, an id map should be created in the Embeddings instance.
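A minimal sketch of the id map idea (structure illustrative, not a final design):

```python
# Illustrative id map: ANN backends index sequential positions, while the
# Embeddings instance translates positions back to external ids.
class IdMap:
    def __init__(self):
        self.ids = []  # position -> external id

    def add(self, uid):
        self.ids.append(uid)
        return len(self.ids) - 1  # sequential position for the ANN backend

    def lookup(self, position):
        return self.ids[position]

ids = IdMap()
position = ids.add("doc-42")
print(ids.lookup(position))  # doc-42
```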
Hi there, this is very beautiful work. I want to use this API for languages other than English. How can I use models for other languages from https://huggingface.co/models?search=turkish or other sources?
Can anyone help me with this one?
Dear committers,
I would like to use txtai for search queries, but currently my content is not in English. Are there parameters that can be provided to improve the results based on language and locale?
Thanks,
So, I'm very new to this but I have been able to put something together with txtai. Thanks for that! Very interesting stuff.
I built a new index and saved it, then worked up a simple flask app to load it and interface to it.
However, the initial embeddings line in there has to download the models from the net, and this causes a timeout when trying to fire up the resulting container in Docker. Is there a way to pre-download these files and then point to those rather than having it try to load them? It seems to do this on my local machine, but I cannot find where they are or how to reference them.
app.py

```python
import os, json, requests
import urllib.request

from flask import Flask, abort, request, jsonify
from flask import Response
from flask_cors import CORS
from flask_restful import Resource, Api

from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})
embeddings.load("index")

app = Flask(__name__)
cors = CORS(app, resources={r"/*": {"origins": "*"}})
app.config['CORS_HEADERS'] = 'Content-Type'

@app.route("/q1", methods=['GET'])
def search():
    q = request.args.get('q')
    results = embeddings.search(q, 10)

    data = {}  # build json from the result set
    for r in results:
        uid = r[0]
        score = r[1]
        data[str(uid)] = score
        # print('score:{} -- {}'.format(score, text_corpus[uid]))
        print('score:{} -- {}'.format(score, uid))

    j = json.dumps(data)
    return j

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))
```
Dockerfile

```dockerfile
# Use the official Python image.
# https://hub.docker.com/_/python
FROM python:3.7

# Copy local code to the container image.
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./

# Install production dependencies.
RUN pip install Flask gunicorn
RUN pip install flask_restful
RUN pip install flask-cors
RUN pip install numpy
RUN pip install txtai

# Run the web service on container startup. Here we use the gunicorn
# webserver, with one worker process and 8 threads.
# For environments with multiple CPU cores, increase the number of workers
# to be equal to the cores available.
#CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 app:app
CMD exec gunicorn --bind :8080 --workers 1 --threads 8 app:app
```
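One way to avoid the startup download, sketched under the assumption that sentence-transformers models can be saved to and loaded from a local path: cache the model at image build time, then point Embeddings at that directory.

```python
# Run at image build time (e.g. from a RUN step) to cache the model locally,
# so container startup needs no network access. Paths are illustrative.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/bert-base-nli-mean-tokens")
model.save("/app/models/bert-base-nli-mean-tokens")
```

At runtime, `Embeddings({"method": "transformers", "path": "/app/models/bert-base-nli-mean-tokens"})` should then load from disk (the path handling here is an assumption).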
I am trying to load the pre-trained model below from Sentence-Transformers in the Embeddings function of txtai:
xlm-r-100langs-bert-base-nli-stsb-mean-tokens
But I am getting a not found error.
Regards,
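One possible cause, assuming this model is published under the sentence-transformers namespace on the Hugging Face Hub: the fully qualified name may be required.

```python
from txtai.embeddings import Embeddings

# Assumption: the model must be referenced with its full hub namespace
embeddings = Embeddings({"method": "transformers",
                         "path": "sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens"})
```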
Currently, embeddings indices only support storing data in Faiss. Given that Faiss isn't supported on Windows, refactor to allow pluggable ANN backends.
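A rough sketch of what a pluggable backend interface could look like (class and method names are illustrative, not a final design):

```python
# Illustrative pluggable ANN interface; concrete classes wrap each library.
class ANN:
    def index(self, embeddings):
        raise NotImplementedError

    def search(self, queries, limit):
        raise NotImplementedError

class Faiss(ANN):
    def __init__(self, dimensions):
        import faiss
        self.backend = faiss.IndexFlatIP(dimensions)

    def index(self, embeddings):
        # embeddings: float32 numpy array of shape (count, dimensions)
        self.backend.add(embeddings)

    def search(self, queries, limit):
        # Returns (scores, ids) arrays, one row per query
        return self.backend.search(queries, limit)
```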
Skip installing Faiss on Windows
File "c:\users\gaussfer\anaconda3\lib\distutils\command\build_ext.py", line 340, in run
self.build_extensions()
File "C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py", line 50, in build_extensions
self._remove_flag('-Wstrict-prototypes')
File "C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py", line 58, in _remove_flag
compiler = self.compiler.compiler
AttributeError: 'MSVCCompiler' object has no attribute 'compiler'
----------------------------------------
ERROR: Command errored out with exit status 1: 'c:\users\gaussfer\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py'"'"'; __file__='"'"'C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\Gaussfer\AppData\Local\Temp\pip-record-wnt_ovu1\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\gaussfer\anaconda3\Include\faiss-gpu' Check the logs for full command output.
Add a component to wrap Hugging Face's zero shot classifier pipeline.
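A minimal sketch of such a wrapper; the class shape is an assumption, while the pipeline call is the standard Transformers API:

```python
from transformers import pipeline

# Minimal wrapper over Hugging Face's zero-shot classification pipeline
class Labels:
    def __init__(self, model=None):
        self.pipeline = pipeline("zero-shot-classification", model=model)

    def __call__(self, text, labels):
        result = self.pipeline(text, candidate_labels=labels)
        # (label, score) pairs, already sorted by score descending
        return list(zip(result["labels"], result["scores"]))

labels = Labels()
print(labels("Maine man wins $1M from $25 lottery ticket", ["positive", "negative"]))
```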
hi,
For English, the tokenizer works correctly, for example:
tokens = [token.strip(string.punctuation) for token in text.lower().split()]
and
return [token for token in tokens if re.match(r"^\d*[a-z][-.0-9:_a-z]{1,}$", token) and token not in Tokenizer.STOP_WORDS]
But for other languages, for example Chinese, it does the wrong thing. Could you please revise this for different languages, or give me some advice?
Thanks!
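For reference, one possible approach for Chinese, using the third-party jieba segmenter (an assumption for illustration; txtai does not ship this):

```python
# Hypothetical language-aware tokenization using jieba, a Chinese word
# segmenter, instead of whitespace splitting.
import jieba

def tokenize_zh(text):
    # jieba.lcut returns a list of segmented words
    return [token for token in jieba.lcut(text) if token.strip()]

print(tokenize_zh("自然语言处理很有趣"))
```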
I implemented a similar customizable indexing + retrieval pipeline. Hugging Face's datasets library (previously named nlp) allows one to vectorize and index huge datasets without having to worry about RAM. It uses Apache Arrow for memory-mapped, zero-deserialization-cost dataframes. It also supports easy integration with Faiss and Elasticsearch.
Key advantages of making this a core part of the pipeline are as follows.
The datasets library already provides access to tons of datasets. Refer to https://huggingface.co/datasets/viewer/. It allows adding new datasets, making it a good choice for distributing the datasets that txtai users rely upon.

Currently hardcoded to SQ8. Annoy/hnswlib only support float32, so quantization will be ignored for those backends.
txtai performs less accurately when the input texts being matched are too long.
Currently, the API supports a subset of functionality in the embeddings module. Fully support embeddings and add methods qa extraction and labeling.
This will enable network-based implementations of txtai in other programming languages.
Given that GPU builds aren't being used and there are reported issues with macOS, switch to the faiss-cpu package.
Hi,
The faiss-cpu>=1.6.3; os_name != "nt" requirement creates an error during installation in a GPU environment.
You may want to distinguish packages for different environments.
Thank you.
Do a refresh and reorganization of the example notebooks.
Hi David.
I can't download articles.sqlite in the Google Colab examples.
The return code from www.kaggleusercontent.com is 403.
Am I doing something wrong?
Currently, the embeddings model is used for calculating similarity. The Labels model backed by Hugging Face's zero shot classifier has shown an impressive level of accuracy in labelling text.
Evaluate if this pipeline can be used to perform similarity comparisons. In this case, the input sections would be a list of documents and candidate labels would be the query.
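A hedged sketch of the evaluation idea: pass each document as the sequence and the query as the single candidate label, then rank documents by score.

```python
from transformers import pipeline

# Sketch: use the zero-shot classifier as a similarity function
classifier = pipeline("zero-shot-classification")

def similarity(query, documents):
    scores = [(uid, classifier(document, candidate_labels=[query])["scores"][0])
              for uid, document in enumerate(documents)]
    return sorted(scores, key=lambda x: x[1], reverse=True)

print(similarity("feel good story", ["Maine man wins $1M from $25 lottery ticket",
                                     "US tops 5 million confirmed virus cases"]))
```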
Hi,
In the tutorial Part 3: Build an Embeddings index from a data source, at the part where the word vectors are built, I checked the txt file that was generated. I realized that the vector representations are for letters, not words. Is this on purpose? Correct me if I'm wrong, but I think the list of words should be there with the 300-dimension vectors.
Kind regards,
mrJezy