neuml / txtai
💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
Home Page: https://neuml.github.io/txtai
License: Apache License 2.0
Hi,
I'm trying to use either the gpt2 or t5-3b models with txtai (one of the notebooks mentions that any model listed on the Hugging Face model hub should work), but I receive several errors:
ERROR:transformers.tokenization_utils_base:Using pad_token, but it is not set yet.
Traceback (most recent call last):
File "./text-ai.py", line 24, in
similarities = embeddings.similarity(s, queries)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 227, in similarity
query = self.transform((None, query, None)).reshape(1, -1)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 178, in transform
embedding = self.model.transform(document)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/vectors.py", line 257, in transform
return self.model.encode([" ".join(document[1])], show_progress_bar=False)[0]
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 176, in encode
sentence_features = self.get_sentence_features(text, longest_seq)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 219, in get_sentence_features
return self._first_module().get_sentence_features(*features)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/models/Transformer.py", line 61, in get_sentence_features
return self.tokenizer.prepare_for_model(tokens, max_length=pad_seq_length, pad_to_max_length=True, return_tensors='pt', truncation=True)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2021, in prepare_for_model
padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1529, in _get_padding_truncation_strategies
raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token
(tokenizer.pad_token = tokenizer.eos_token e.g.)
or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).
or for T5:
File "./text-ai.py", line 24, in
similarities = embeddings.similarity(s, queries)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 227, in similarity
query = self.transform((None, query, None)).reshape(1, -1)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 178, in transform
embedding = self.model.transform(document)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/vectors.py", line 257, in transform
return self.model.encode([" ".join(document[1])], show_progress_bar=False)[0]
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 187, in encode
out_features = self.forward(features)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/models/Transformer.py", line 25, in forward
output_states = self.auto_model(**features)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/modeling_t5.py", line 965, in forward
decoder_outputs = self.decoder(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/modeling_t5.py", line 684, in forward
raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")
What am I missing?
Thanks!
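For reference, a minimal sketch of the workaround the GPT-2 error message itself suggests: reuse the EOS token as the padding token. Note this only clears the tokenizer error; gpt2 and t5-3b are generative models and aren't trained to produce sentence embeddings, so similarity results may be poor regardless.

```python
# Workaround suggested by the error message: GPT-2 has no pad token,
# so reuse the EOS token for padding. Tokenizer-level fix only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(["sample text"], padding=True, truncation=True, return_tensors="pt")
print(batch["input_ids"].shape)
```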
Review the methods here: https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/distillation
See if this can be integrated to reduce the storage necessary for embeddings indices.
With the additional functionality added to txtai over the last few releases, the API definitions have gotten somewhat inconsistent. This issue will address that and make many of the return types across modules consistent. The changes are breaking in many cases and will require a bump of the major version of txtai to v2.
The current Python API definitions for v1 are:
embeddings.search("query text")
return [(id, score)] sort score desc
embeddings.similarity("query text", documents)
return [score]
embeddings.add(documents)
embeddings.index()
embeddings.transform("text")
return [float]
extractor(sections, queue)
return [(name, answer)]
labels("text", ["label1"])
return [(label, score)] sort score desc
The new method templates and return types are below (a usage sketch follows the list).
embeddings.search("query text")
return [(id, score)] sort score desc
embeddings.batchsearch(["query text1", "query text2"])
return [[(id, score)] sort score desc]
embeddings.add(documents)
embeddings.index()
embeddings.similarity("query text", texts)
return [(id, score)] sort score desc
embeddings.batchsimilarity(["query text1", "query text2"], texts)
return [[(id, score)] sort score desc]
embeddings.transform("text")
return [float]
embeddings.batchtransform(["text1", "text2"])
return [[float]]
extractor(queue, texts)
return [(name, answer)]
labels("text", ["label1"])
return [(id, score)] sort score desc
labels(["text1", "text2"], ["label1"])
return [[(id, score)] sort score desc]
similarity("query text", texts)
return [(id, score)] sort score desc
batchsimilarity(["query text1", "query text2"], texts)
return [[(id, score)] sort score desc]
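As an illustration, a hedged sketch of how a caller might use the proposed v2 methods (method names come from the templates above; this is not a released API):

```python
from txtai.embeddings import Embeddings

# Sketch of the proposed v2 surface (method names from the templates above)
embeddings = Embeddings({"method": "transformers",
                         "path": "sentence-transformers/bert-base-nli-mean-tokens"})

texts = ["text1", "text2"]
embeddings.add([(uid, text, None) for uid, text in enumerate(texts)])
embeddings.index()

print(embeddings.search("query text"))                         # [(id, score)] sorted desc
print(embeddings.batchsearch(["query text1", "query text2"]))  # [[(id, score)]]
print(embeddings.similarity("query text", texts))              # [(id, score)] sorted desc
print(embeddings.batchtransform(texts))                        # [[float]]
```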
The API methods also need to have corresponding changes.
Given that JSON doesn't support tuples and some languages can't easily map arrays/tuples to objects, the return types are mapped from tuples to JSON objects. For example, instead of (id, score) the API will return {"id": value, "score": value}.
The API also has a few differences from the native Python API.
The following list shows how the API methods will look through language-binding libraries.
embeddings.search("query text")
embeddings.batchsearch(["query text1", "query text2"])
embeddings.add(documents)
embeddings.index()
embeddings.similarity("query text", texts)
embeddings.batchsimilarity(["query text1", "query text2"], texts)
embeddings.transform("text")
embeddings.batchtransform(["text1", "text2"])
extractor.extract(questions, texts)
labels.label("text", ["label1"])
labels.batchlabel(["text1", "text2"], ["label1"])
similarity.similarity("query text", texts)
similarity.batchsimilarity(["query text1", "query text2"], texts)
Re-run and save the example notebooks to ensure they all work with the latest libraries.
Also add a notebook on labeling.
Currently, the pipeline component has logic to work around a performance issue in Transformers < 4.0. This performance issue has been resolved. Refactor this component to use the Hugging Face pipeline directly.
Also consolidate labels methods into the pipeline module.
Faiss 1.6.4 supports Windows. This upgrade will help simplify the code base on all platforms.
travis-ci.org builds are frequently backlogged more than an hour, which doesn't work for continuous development. Migrate to GitHub actions.
Once successful, revoke all third-party app access.
For search, similarity, extractive qa and labels, all methods should operate on batches for the best performance.
Currently, sentence-transformer based indices are indexing documents one at a time. Calls to sentence-transformers should be batched together to decrease indexing time.
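A minimal sketch of the batching idea, assuming direct access to the underlying sentence-transformers model (model name and batch size are illustrative):

```python
from sentence_transformers import SentenceTransformer

# Encode documents in batches instead of one call per document;
# encode accepts a list and batches internally via batch_size.
model = SentenceTransformer("sentence-transformers/bert-base-nli-mean-tokens")
documents = ["document %d" % x for x in range(1000)]

vectors = model.encode(documents, batch_size=32, show_progress_bar=False)
```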
For word embedding models, add an option to include the word vectors model in the models path.
The functionality provided via the txtai API has increased significantly. Improve test coverage in that area.
hnswlib requires numpy and is failing to build a wheel. The build process falls back to the legacy build, which will be removed in pip 21.x.
The combination of pip 20.3.x, transformers 4.x and sentence-transformers 0.3.9 has caused build errors not related to txtai.
Remove the following lines from build.yml once they are resolved upstream.
# Remove hardcoding to 20.2.4
pip install -U pip==20.2.4 wheel coverage coveralls
Add a pattern for serving models. The API should be driven by a configuration YAML file listing the model name and path.
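A hypothetical configuration file for this pattern might look like the sketch below (all keys are illustrative assumptions, not a final design):

```yaml
# Hypothetical serving configuration (all keys illustrative)
path: /models/txtai/index            # saved embeddings index to load
embeddings:
  method: transformers
  path: sentence-transformers/bert-base-nli-mean-tokens
```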
API endpoints:
Currently, extractor.py has a single method to run an embeddings query and then run extractive QA over those results.
This should be split into two separate methods, which allows external callers to just run the embeddings search without executing the QA extraction. This will allow downstream systems more flexibility in working with the extractor process.
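A rough sketch of the proposed split; the method names query/extract are illustrative assumptions, and qa is assumed to be a Hugging Face question-answering pipeline:

```python
# Rough sketch of splitting extractor.py into two methods (names hypothetical)
class Extractor:
    def __init__(self, embeddings, qa):
        self.embeddings = embeddings
        self.qa = qa

    def query(self, query, texts):
        # Step 1: embeddings search only; external callers can stop here
        return self.embeddings.similarity(query, texts)

    def extract(self, questions, context):
        # Step 2: extractive QA over previously retrieved text
        return [(question, self.qa(question=question, context=context)["answer"])
                for question in questions]
```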
Currently, transformers is fixed to 3.0.2 due to an issue with sentence-transformers.
Once sentence-transformers v0.3.6 is released, which will support 3.1.x, update setup.py accordingly.
Transformers 4.x has numerous performance improvements. Upgrade dependency to require at least 4.0.0.
I am trying to run Example 1 with word embeddings using the following code:

```python
import numpy as np

from txtai.embeddings import Embeddings

embeddings = Embeddings({"path": "word-vectors/GoogleNews-vectors-negative300.magnitude",
                         "storevectors": True,
                         "scoring": "bm25",
                         "pca": 3,
                         "quantize": True})

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Get index of section that best matches query
    candidates = embeddings.similarity(query, sections)
    uid = np.argmax(candidates)

    print("%-20s %s" % (query, sections[uid]))
```
Traceback (most recent call last):
File "2.py", line 24, in
candidates = embeddings.similarity(query, sections)
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/embeddings.py", line 228, in similarity
query = self.transform((None, query, None)).reshape(1, -1)
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/embeddings.py", line 179, in transform
embedding = self.model.transform(document)
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/vectors.py", line 155, in transform
weights = self.scoring.weights(document) if self.scoring else None
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/scoring.py", line 133, in weights
weights.append(self.score(freq, idf, length))
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/scoring.py", line 217, in score
k = self.k1 * ((1 - self.b) + self.b * length / self.avgdl)
ZeroDivisionError: float division by zero
Do I need to do something with the word embeddings before I can use them for similarity search?
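One likely cause, judging from the word embeddings notebook: the BM25 scoring index must be built over the documents before similarity is called, otherwise avgdl stays 0 and the score formula divides by zero. A hedged sketch, continuing from the snippet above:

```python
# Possible fix (based on the word embeddings notebook): build the scoring
# index first so avgdl is non-zero when BM25 weights are computed.
embeddings.score([(uid, section, None) for uid, section in enumerate(sections)])

candidates = embeddings.similarity(query, sections)
```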
I have run the following sample code for the extractor to perform Q&A on OS X but the results return None:
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})
extractor = Extractor(embeddings, "distilbert-base-cased-distilled-squad")
sections = ["Giants hit 3 HRs to down Dodgers",
"Giants 5 Dodgers 4 final",
"Dodgers drop Game 2 against the Giants, 5-4",
"Blue Jays 2 Red Sox 1 final",
"Red Sox lost to the Blue Jays, 2-1",
"Blue Jays at Red Sox is over. Score: 2-1",
"Phillies win over the Braves, 5-0",
"Phillies 5 Braves 0 final",
"Final: Braves lose to the Phillies in the series opener, 5-0",
"Final score: Flyers 4 Lightning 1",
"Flyers 4 Lightning 1 final",
"Flyers win 4-1"]
sections = [(uid, section) for uid, section in enumerate(sections)]
questions = ["What team won the game?", "What was score?"]
execute = lambda query: extractor(sections, [(question, query, question, False) for question in questions])
for query in ["Red Sox - Blue Jays", "Phillies - Braves", "Dodgers - Giants", "Flyers - Lightning"]:
print("----", query, "----")
for answer in execute(query):
print(answer)
print()
Results:
---- Red Sox - Blue Jays ----
('What team won the game?', None)
('What was score?', None)
---- Phillies - Braves ----
('What team won the game?', None)
('What was score?', None)
---- Dodgers - Giants ----
('What team won the game?', None)
('What was score?', None)
---- Flyers - Lightning ----
('What team won the game?', None)
('What was score?', None)
Hi,
Hope you are all well !
I was wondering if we can use txtai like nboost, as a proxy for Elasticsearch or Manticore Search?
I am really interested in a Manticore Search integration, as I built https://paper2code.com around this full-text search engine.
Thanks for your insights and inputs about this question.
Cheers,
X
Add testing framework and integrate Travis CI
Oh, my...
Also, a 403 response code is blocking the download...
Next time, please save the file locally so it can be downloaded again.
Please.
First of all, thanks a lot for your work, it's really great!
In order to use the search on lots of documents, it would be great if we could:
Keep up the good work!
It can have sections such as:
P.S. I am new to NeuML and found this an interesting initiative. Would love to contribute!
Hi, I have successfully installed txtai on my Linux server. When I run Python and do from txtai.embeddings import Embeddings,
it terminates the Python process with an Illegal instruction (core dumped) error. Following are the details of my Linux server; can anybody help me figure out the problem and fix it? Thanks.
Linux-4.15.0-45-generic-x86_64-with-Ubuntu-18.04-bionic
Number of cores: 40
RAM: 126GB
Python 3.6.9 (default, Nov 7 2019, 10:44:02)
[GCC 8.3.0]
This is the original code from Introducing txtai.py:

```python
import numpy as np

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Get index of section that best matches query
    uid = np.argmax(embeddings.similarity(query, sections))

    print("%-20s %s" % (query, sections[uid]))
```
A problem occurs in Colab when executing the following line of code:
uid = np.argmax(embeddings.similarity(query, sections))
It shows: "ValueError: Wrong shape for input_ids (shape torch.Size([6])) or attention_mask (shape torch.Size([6]))"
The problem didn't occur a few days ago.
Currently, Faiss add_with_ids is used to store ids. This assumes that ids are 64-bit ints. With the addition of Annoy, which only supports sequential ids, and hnswlib, an id map should be created in the Embeddings instance.
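A minimal sketch of the id map idea (structure illustrative, not a final design):

```python
# Illustrative id map: ANN backends index sequential positions, while the
# Embeddings instance translates positions back to external ids.
class IdMap:
    def __init__(self):
        self.ids = []  # position -> external id

    def add(self, uid):
        self.ids.append(uid)
        return len(self.ids) - 1  # sequential position for the ANN backend

    def lookup(self, position):
        return self.ids[position]

ids = IdMap()
position = ids.add("doc-42")
print(ids.lookup(position))  # doc-42
```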
Hi there, this is very beautiful work. I want to use this API for languages other than English. How can I use models for other languages from https://huggingface.co/models?search=turkish or other sources?
Can anyone help me with this one?
Dear committers,
I would like to use txtai for search queries, but currently my content is not in English. Are there parameters that can be provided to improve the results based on language and locale?
Thanks,
So, I'm very new to this but I have been able to put something together with txtai. Thanks for that! Very interesting stuff.
I built a new index and saved it, then worked up a simple flask app to load it and interface to it.
However, the initial embeddings line in there has to download the models from the net, and this causes a timeout when trying to fire up the resulting container in Docker. Is there a way to pre-download these files and then point to those rather than having it try to load them? It seems to do this on my local machine, but I cannot find where they are or how to reference them.
app.py

```python
import os, json, requests
import urllib.request

from flask import Flask, abort, request, jsonify
from flask import Response
from flask_cors import CORS
from flask_restful import Resource, Api

from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})
embeddings.load("index")

app = Flask(__name__)
cors = CORS(app, resources={r"/*": {"origins": "*"}})
app.config['CORS_HEADERS'] = 'Content-Type'

@app.route("/q1", methods=['GET'])
def search():
    q = request.args.get('q')
    results = embeddings.search(q, 10)

    data = {}  # build json from the result set
    for r in results:
        uid = r[0]
        score = r[1]
        data[str(uid)] = score
        # print('score:{} -- {}'.format(score, text_corpus[uid]))
        print('score:{} -- {}'.format(score, uid))

    j = json.dumps(data)
    return j

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))
```
Dockerfile

```dockerfile
# Use the official Python image.
# https://hub.docker.com/_/python
FROM python:3.7

# Copy local code to the container image.
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./

# Install production dependencies.
RUN pip install Flask gunicorn
RUN pip install flask_restful
RUN pip install flask-cors
RUN pip install numpy
RUN pip install txtai

# Run the web service on container startup. Here we use the gunicorn
# webserver, with one worker process and 8 threads.
# For environments with multiple CPU cores, increase the number of workers
# to be equal to the cores available.
#CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 app:app
CMD exec gunicorn --bind :8080 --workers 1 --threads 8 app:app
```
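One way to avoid the startup download, sketched under the assumption that sentence-transformers models can be saved to and loaded from a local path: cache the model at image build time, then point Embeddings at that directory.

```python
# Run at image build time (e.g. from a RUN step) to cache the model locally,
# so container startup needs no network access. Paths are illustrative.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/bert-base-nli-mean-tokens")
model.save("/app/models/bert-base-nli-mean-tokens")
```

At runtime, `Embeddings({"method": "transformers", "path": "/app/models/bert-base-nli-mean-tokens"})` should then load from disk (the path handling here is an assumption).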
I am trying to load the pre-trained model below from Sentence-Transformers in the Embeddings function of txtai:
xlm-r-100langs-bert-base-nli-stsb-mean-tokens
But I am getting a not found error.
Regards,
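One possible cause, assuming this model is published under the sentence-transformers namespace on the Hugging Face Hub: the fully qualified name may be required.

```python
from txtai.embeddings import Embeddings

# Assumption: the model must be referenced with its full hub namespace
embeddings = Embeddings({"method": "transformers",
                         "path": "sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens"})
```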
Currently, embeddings indices only support storing data in Faiss. Given that Faiss isn't supported on Windows, refactor to allow pluggable ANN backends.
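A rough sketch of what a pluggable backend interface could look like (class and method names are illustrative, not a final design):

```python
# Illustrative pluggable ANN interface; concrete classes wrap each library.
class ANN:
    def index(self, embeddings):
        raise NotImplementedError

    def search(self, queries, limit):
        raise NotImplementedError

class Faiss(ANN):
    def __init__(self, dimensions):
        import faiss
        self.backend = faiss.IndexFlatIP(dimensions)

    def index(self, embeddings):
        # embeddings: float32 numpy array of shape (count, dimensions)
        self.backend.add(embeddings)

    def search(self, queries, limit):
        # Returns (scores, ids) arrays, one row per query
        return self.backend.search(queries, limit)
```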
Skip installing Faiss on Windows
File "c:\users\gaussfer\anaconda3\lib\distutils\command\build_ext.py", line 340, in run
self.build_extensions()
File "C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py", line 50, in build_extensions
self._remove_flag('-Wstrict-prototypes')
File "C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py", line 58, in _remove_flag
compiler = self.compiler.compiler
AttributeError: 'MSVCCompiler' object has no attribute 'compiler'
----------------------------------------
ERROR: Command errored out with exit status 1: 'c:\users\gaussfer\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py'"'"'; __file__='"'"'C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\Gaussfer\AppData\Local\Temp\pip-record-wnt_ovu1\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\gaussfer\anaconda3\Include\faiss-gpu' Check the logs for full command output.
Add a component to wrap Hugging Face's zero shot classifier pipeline.
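A minimal sketch of such a wrapper; the class shape is an assumption, while the pipeline call is the standard Transformers API:

```python
from transformers import pipeline

# Minimal wrapper over Hugging Face's zero-shot classification pipeline
class Labels:
    def __init__(self, model=None):
        self.pipeline = pipeline("zero-shot-classification", model=model)

    def __call__(self, text, labels):
        result = self.pipeline(text, candidate_labels=labels)
        # (label, score) pairs, already sorted by score descending
        return list(zip(result["labels"], result["scores"]))

labels = Labels()
print(labels("Maine man wins $1M from $25 lottery ticket", ["positive", "negative"]))
```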
hi,
For English, the tokenizer works correctly, for example:
tokens = [token.strip(string.punctuation) for token in text.lower().split()]
and
return [token for token in tokens if re.match(r"^\d*[a-z][-.0-9:_a-z]{1,}$", token) and token not in Tokenizer.STOP_WORDS]
But for other languages, for example Chinese, it does the wrong thing. Could you please revise this for different languages, or give me some advice?
Thanks!
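For reference, one possible approach for Chinese, using the third-party jieba segmenter (an assumption for illustration; txtai does not ship this):

```python
# Hypothetical language-aware tokenization using jieba, a Chinese word
# segmenter, instead of whitespace splitting.
import jieba

def tokenize_zh(text):
    # jieba.lcut returns a list of segmented words
    return [token for token in jieba.lcut(text) if token.strip()]

print(tokenize_zh("自然语言处理很有趣"))
```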
I implemented a similar customizable indexing + retrieval pipeline. Hugging Face's datasets library (previously named nlp) allows one to vectorize and index huge datasets without having to worry about RAM. It uses Apache Arrow for memory-mapped, zero-deserialization-cost dataframes. It also supports easy integration with Faiss and Elasticsearch.
Key advantages of making this a core part of the pipeline are as follows.
The datasets library already provides access to tons of datasets. Refer to https://huggingface.co/datasets/viewer/. It allows adding new datasets, making it a good choice for distributing the datasets that txtai users rely upon.

Currently hardcoded to SQ8. Annoy/hnswlib only support float32, so quantization will be ignored for those backends.
txtai performs less accurately when the input texts being matched are too long.
Currently, the API supports a subset of functionality in the embeddings module. Fully support embeddings and add methods qa extraction and labeling.
This will enable network-based implementations of txtai in other programming languages.
Given that GPU builds aren't being used and there are reported issues with macOS, switch to the faiss-cpu package.
Hi,
The faiss-cpu>=1.6.3; os_name != "nt" requirement creates an error during installation in a GPU environment.
You may want to distinguish packages for different environments.
Thank you.
Do a refresh and reorganization of the example notebooks.
Hi David.
I can't download articles.sqlite in the Google Colab examples.
The return code from www.kaggleusercontent.com is 403.
Am I doing something wrong?
Currently, the embeddings model is used for calculating similarity. The Labels model backed by Hugging Face's zero shot classifier has shown an impressive level of accuracy in labelling text.
Evaluate if this pipeline can be used to perform similarity comparisons. In this case, the input sections would be a list of documents and candidate labels would be the query.
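A hedged sketch of the evaluation idea: pass each document as the sequence and the query as the single candidate label, then rank documents by score.

```python
from transformers import pipeline

# Sketch: use the zero-shot classifier as a similarity function
classifier = pipeline("zero-shot-classification")

def similarity(query, documents):
    scores = [(uid, classifier(document, candidate_labels=[query])["scores"][0])
              for uid, document in enumerate(documents)]
    return sorted(scores, key=lambda x: x[1], reverse=True)

print(similarity("feel good story", ["Maine man wins $1M from $25 lottery ticket",
                                     "US tops 5 million confirmed virus cases"]))
```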
Hi,
In the tutorial Part 3: Build an Embeddings index from a data source, at the part where the word vectors are built, I checked the txt file that was generated. I realized that the vector representations are for letters, not words. Is this on purpose? Correct me if I'm wrong, but I think the list of words should be there with the 300-dimension vectors.
Kind regards,
mrJezy