neuml / txtai

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

Home Page: https://neuml.github.io/txtai

License: Apache License 2.0

Python 99.42% Makefile 0.23% Dockerfile 0.35%
python search machine-learning nlp semantic-search neural-search vector-search txtai llm vector-database

txtai's Issues

tokenizer.py

hi,

For English, the tokenizer works correctly, for example:
tokens = [token.strip(string.punctuation) for token in text.lower().split()]
and
return [token for token in tokens if re.match(r"^\d*[a-z][-.0-9:_a-z]{1,}$", token) and token not in Tokenizer.STOP_WORDS]
But for other languages, for example Chinese, it produces incorrect results. Could you please revise this to support different languages, or give me some advice?

Thanks!
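
One possible direction, sketched below under the assumption that CJK text needs a dedicated word segmenter; jieba is used purely as an example and is not a txtai dependency:

# Hypothetical language-aware wrapper around the default tokenizer (not txtai code)
import re

import jieba  # example third-party Chinese segmenter

from txtai.tokenizer import Tokenizer

CJK = re.compile(r"[\u4e00-\u9fff]")

def tokenize(text):
    if CJK.search(text):
        # Segment Chinese text instead of splitting on whitespace and punctuation
        return [token for token in jieba.cut(text) if token.strip()]

    # Default txtai behavior for English-like text
    return Tokenizer.tokenize(text)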

Can we have a CONTRIBUTING.md for a quick guide?

It can have sections such as:

  • To contribute a feature/fix
  • How can you help
  • Getting Started
  • Formatting and Linting rules
  • How to connect on Slack etc. for help or issues

P.S. I am new to NeuML and found this an interesting initiative. Would love to contribute!

Migrate from Travis CI to GitHub Actions

travis-ci.org builds are frequently backlogged more than an hour, which doesn't work for continuous development. Migrate to GitHub Actions.

Once successful, revoke all third-party app access.

Integrate FastAPI for model serving

Add a pattern for serving models. The API should be driven by a YAML configuration file listing the model name and path.

API endpoints:

  • /$model/search?q=value
    • Runs a search against model for query q
  • /$model/similarity?t1=text&t2=text
    • Compares t1 and t2 for similarity using model
  • /$model/embedding?t=text
    • Builds a sentence embeddings vector for text stored in t
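
A minimal sketch of one way this could look, assuming a models.yml file mapping each model name to a model path and index location; the file name, keys and handler logic below are illustrative assumptions, not the final design:

# Illustrative sketch only; config file name, keys and loading logic are assumptions
import yaml
from fastapi import FastAPI
from txtai.embeddings import Embeddings

app = FastAPI()
models = {}

# models.yml (assumed layout): one entry per model name
# news:
#   path: sentence-transformers/bert-base-nli-mean-tokens
#   index: index/news
with open("models.yml") as f:
    for name, settings in yaml.safe_load(f).items():
        embeddings = Embeddings({"method": "transformers", "path": settings["path"]})
        embeddings.load(settings["index"])
        models[name] = embeddings

@app.get("/{model}/search")
def search(model: str, q: str):
    # Runs a search against the named model for query q
    return [{"id": uid, "score": float(score)} for uid, score in models[model].search(q, 10)]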

Make API definitions consistent

With the additional functionality added to txtai over the last few releases, the API definitions have gotten somewhat inconsistent. This issue will address that and make many of the return types across modules consistent. The changes are breaking in many cases and will require a bump of the major version of txtai to v2.

The current Python API definitions for v1 are:

Current Python API v1

  • embeddings.search("query text")
    return [(id, score)] sort score desc

  • embeddings.similarity("query text", documents)
    return [score]

  • embeddings.add(documents)
    embeddings.index()

  • embeddings.transform("text")
    return [float]

  • extractor(sections, queue)
    return [(name, answer)]

  • labels("text", ["label1"])
    return [(label, score)] sort score desc

The new method templates and return types are below.

New Python API v2

  • embeddings.search("query text")
    return [(id, score)] sort score desc

  • embeddings.batchsearch(["query text1", "query text2"])
    return [[(id, score)] sort score desc]

  • embeddings.add(documents)
    embeddings.index()

  • embeddings.similarity("query text", texts)
    return [(id, score)] sort score desc

  • embeddings.batchsimilarity(["query text1", "query text2"], texts)
    return [[(id, score)] sort score desc]

  • embeddings.transform("text")
    return [float]

  • embeddings.batchtransform(["text1", "text2"])
    return [[float]]

  • extractor(queue, texts)
    return [(name, answer)]

  • labels("text", ["label1"])
    return [(id, score)] sort score desc

  • labels(["text1", "text2"], ["label1"])
    return [[(id, score)] sort score desc]

  • similarity("query text", texts)
    return [(id, score)] sort score desc

  • batchsimilarity(["query text1", "query text2"], texts)
    return [[(id, score)] sort score desc]
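
For illustration, a short sketch of how the new batch variants would be consumed; the queries and the embeddings index are hypothetical:

# Hypothetical usage of the v2 batch templates described above
# (embeddings is an existing Embeddings index)
queries = ["feel good story", "climate change"]
results = embeddings.batchsearch(queries)

# One result list per query, each sorted by score descending
for query, result in zip(queries, results):
    uid, score = result[0]
    print(query, uid, score)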

External v2 API Calls

The API methods also need to have corresponding changes.

Given that JSON doesn't support tuples and some languages can't easily map arrays/tuples to objects, the return types are mapped from tuples to JSON objects. For example, instead of (id, score) the API will return {"id": value, "score": value}.
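
For instance, a Python-side result list and the corresponding API response (values hypothetical):

# Python tuples returned by the library
results = [(4, 0.74), (5, 0.64)]

# Equivalent JSON returned by the API
# [{"id": 4, "score": 0.74}, {"id": 5, "score": 0.64}]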

The API also has the following differences with the native Python API.

  • extract uses the Extractor pipeline which is a callable object in Python.
  • label/batchlabel uses the Labels pipeline which is a callable object in Python that supports both string and list input.
  • similarity/batchsimilarity uses the Similarity pipeline which is a callable object in Python that supports both string and list input.

The following list shows how the API methods will look through language binding libraries.

  • embeddings.search("query text")
    embeddings.batchsearch(["query text1", "query text2"])

  • embeddings.add(documents)
    embeddings.index()

  • embeddings.similarity("query text", texts)
    embeddings.batchsimilarity(["query text1", "query text2"], texts)

  • embeddings.transform("text")
    embeddings.batchtransform(["text1", "text2"])

  • extractor.extract(questions, texts)

  • labels.label("text", ["label1"])
    labels.batchlabel(["text1", "text2"], ["label1"])

  • similarity.similarity("query text", texts)
    similarity.batchsimilarity(["query text1", "query text2"], texts)

Update transformers requirement to latest

Currently, transformers is fixed to 3.0.2 due to an issue with sentence-transformers.

Once sentence-transformers v0.3.6 is released, which will support 3.1.x, update setup.py accordingly.
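
A sketch of the expected setup.py change once that release lands; the exact version pins below are assumptions:

# setup.py (illustrative excerpt, versions assumed)
install_requires=[
    "sentence-transformers>=0.3.6",
    "transformers>=3.1.0",
]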

Enhance API to fully support all txtai functionality

Currently, the API supports a subset of the functionality in the embeddings module. Fully support embeddings and add methods for QA extraction and labeling.

This will enable network-based implementations of txtai in other programming languages.

Refactor pipeline component

Currently, the pipeline component has logic to work around a performance issue in Transformers < 4.0. This performance issue has been resolved. Refactor this component to use the Transformers pipeline directly.

Also consolidate labels methods into the pipeline module.

Add batch indexing for transformer indices

Currently, sentence-transformer based indices are indexing documents one at a time. Calls to sentence-transformers should be batched together to decrease indexing time.
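
A minimal sketch of the intended change, assuming the underlying model is a sentence-transformers model; the document tuple format mirrors usage elsewhere in txtai and the surrounding names are assumptions:

from sentence_transformers import SentenceTransformer

# documents assumed to be (id, text, tags) tuples, as used elsewhere in txtai
documents = [(0, "US tops 5 million confirmed virus cases", None),
             (1, "Maine man wins $1M from $25 lottery ticket", None)]

model = SentenceTransformer("sentence-transformers/bert-base-nli-mean-tokens")

# Encode all pending documents in a single batched call instead of one call per document
texts = [text for _, text, _ in documents]
vectors = model.encode(texts, batch_size=32, show_progress_bar=False)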

Remove build script workaround

The combination of pip 20.3.x, transformers 4.x and sentence-transformers 0.3.9 has caused build errors not related to txtai.

Remove the following lines from build.yml once they are resolved upstream.

# Remove hardcoding to 20.2.4
pip install -U pip==20.2.4 wheel coverage coveralls

Word Embedding Question

I am trying to run Example 1 with word embeddings using the following code:

import numpy as np
from txtai.embeddings import Embeddings

embeddings = Embeddings({"path": "word-vectors/GoogleNews-vectors-negative300.magnitude",
                         "storevectors": True,
                         "scoring": "bm25",
                         "pca": 3,
                         "quantize": True})

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Get index of the section that best matches the query
    candidates = embeddings.similarity(query, sections)
    uid = np.argmax(candidates)

    print("%-20s %s" % (query, sections[uid]))

But I am getting the following error:

Traceback (most recent call last):
File "2.py", line 24, in
candidates = embeddings.similarity(query, sections)
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/embeddings.py", line 228, in similarity
query = self.transform((None, query, None)).reshape(1, -1)
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/embeddings.py", line 179, in transform
embedding = self.model.transform(document)
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/vectors.py", line 155, in transform
weights = self.scoring.weights(document) if self.scoring else None
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/scoring.py", line 133, in weights
weights.append(self.score(freq, idf, length))
File "/home/vyasa/.virtualenvs/txtai/lib/python3.6/site-packages/txtai/scoring.py", line 217, in score
k = self.k1 * ((1 - self.b) + self.b * length / self.avgdl)
ZeroDivisionError: float division by zero

Do I need to do something with the word embeddings before I can use them for similarity search?
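
The traceback ends with BM25 dividing by avgdl, which is zero when no scoring statistics exist, so a likely cause is that the scoring index was never built over the documents. A minimal sketch of the workaround, assuming the Embeddings.score method of that txtai version builds those statistics (verify the method name against your installed version):

# Hypothetical fix: build BM25 scoring statistics before running similarity queries
documents = [(uid, text, None) for uid, text in enumerate(sections)]
embeddings.score(documents)  # assumed API: computes term statistics (avgdl, idf) for BM25

candidates = embeddings.similarity("feel good story", sections)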

Q&A Extractor Sample Code Not Functioning As Expected

I have run the following sample code for the extractor to perform Q&A on OS X but the results return None:

from txtai.embeddings import Embeddings
from txtai.extractor import Extractor

embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})
extractor = Extractor(embeddings, "distilbert-base-cased-distilled-squad")
sections = ["Giants hit 3 HRs to down Dodgers",
            "Giants 5 Dodgers 4 final",
            "Dodgers drop Game 2 against the Giants, 5-4",
            "Blue Jays 2 Red Sox 1 final",
            "Red Sox lost to the Blue Jays, 2-1",
            "Blue Jays at Red Sox is over. Score: 2-1",
            "Phillies win over the Braves, 5-0",
            "Phillies 5 Braves 0 final",
            "Final: Braves lose to the Phillies in the series opener, 5-0",
            "Final score: Flyers 4 Lightning 1",
            "Flyers 4 Lightning 1 final",
            "Flyers win 4-1"]
sections = [(uid, section) for uid, section in enumerate(sections)]
questions = ["What team won the game?", "What was score?"]
execute = lambda query: extractor(sections, [(question, query, question, False) for question in questions])
for query in ["Red Sox - Blue Jays", "Phillies - Braves", "Dodgers - Giants", "Flyers - Lightning"]:
    print("----", query, "----")
    for answer in execute(query):
        print(answer)
    print()

Results:

---- Red Sox - Blue Jays ----
('What team won the game?', None)
('What was score?', None)

---- Phillies - Braves ----
('What team won the game?', None)
('What was score?', None)

---- Dodgers - Giants ----
('What team won the game?', None)
('What was score?', None)

---- Flyers - Lightning ----
('What team won the game?', None)
('What was score?', None)

Add batch search

First of all, thanks a lot for your work, it's really great!
In order to use search on lots of documents, it would be great if we could:

  • Search batches of elements: embeddings.search(queries, top_k), which would return a list of top_k results for each query in queries
  • Use the fastest library to do so (I use the HNSW implementation in nmslib, which provides batch search and a great implementation of HNSW); a batched-query sketch follows below

Keep up the good work!
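
For reference, a minimal sketch of batched querying with nmslib's HNSW index, as mentioned above; the toy vectors and parameters are assumptions:

import numpy as np
import nmslib

# Toy vectors standing in for precomputed document and query embeddings
vectors = np.random.rand(1000, 300).astype("float32")
query_vectors = np.random.rand(5, 300).astype("float32")

# Build an HNSW index over the document vectors (cosine similarity space)
index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(vectors)
index.createIndex({"post": 2})

# Batched search: returns one (ids, distances) pair per query vector
results = index.knnQueryBatch(query_vectors, k=10, num_threads=4)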

Using huggingface's datasets library as key part of the pipeline

I implemented a similar customizable indexing + retrieval pipeline. Hugging Face's datasets library (previously named nlp) allows one to vectorize and index huge datasets without having to worry about RAM. It uses Apache Arrow for memory-mapped dataframes with zero deserialization cost to do this. It also supports easy integration with FAISS and Elasticsearch.

Key advantages of making this the key part of the pipeline are as follows.

  1. An interface to a memory-mapped dataframe, which is fast. This makes running a neural model on the data, then saving and caching the results, very easy.
  2. The datasets library already provides access to tonnes of datasets. Refer to https://huggingface.co/datasets/viewer/. It allows adding new datasets, making it a good choice for distributing datasets that users of txtai would rely upon.
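
A brief sketch of the integration being described, using the datasets FAISS helpers; the toy corpus and placeholder embedding function are assumptions:

import numpy as np
from datasets import Dataset

# Placeholder embedding function; substitute any sentence embedding model here
def embed(text):
    return np.random.rand(8).astype("float32")

# Memory-mapped dataset with an embeddings column
data = Dataset.from_dict({"text": ["doc one", "doc two"]})
data = data.map(lambda row: {"embeddings": embed(row["text"])})

# Attach a FAISS index and run a nearest-neighbor query
data.add_faiss_index(column="embeddings")
scores, examples = data.get_nearest_examples("embeddings", embed("query text"), k=2)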

Add zero-shot based similarity pipeline

Currently, the embeddings model is used for calculating similarity. The Labels model, backed by Hugging Face's zero-shot classifier, has shown an impressive level of accuracy in labeling text.

Evaluate if this pipeline can be used to perform similarity comparisons. In this case, the input sections would be a list of documents and candidate labels would be the query.
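
A rough sketch of the idea, using the Hugging Face zero-shot classification pipeline directly; the default model and the score handling are assumptions:

from transformers import pipeline

classifier = pipeline("zero-shot-classification")

documents = ["Maine man wins $1M from $25 lottery ticket",
             "Beijing mobilises invasion craft along coast as Taiwan tensions escalate"]

# Treat the query as the candidate label and score each document against it
query = "feel good story"
scores = [classifier(text, [query])["scores"][0] for text in documents]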

Add API tests

The functionality provided via the txtai API has increased significantly. Improve test coverage in that area.

Upgrade to Faiss 1.6.4

Faiss 1.6.4 supports Windows. This upgrade will help simplify the code base on all platforms.

Support string ids

Currently, Faiss add_with_ids is used to store ids, which assumes ids are 64-bit integers. With the addition of Annoy (which only supports sequential ids) and hnswlib, an id map should be created in the Embeddings instance.
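
A minimal sketch of the id map idea; the class and method names are chosen here for illustration only:

# Hypothetical id map kept by the Embeddings instance: the ANN backend stores
# sequential integer positions, the map recovers the original (string or int) ids
class IdMap:
    def __init__(self):
        self.ids = []

    def add(self, uid):
        # Position assigned by the ANN backend -> original id
        self.ids.append(uid)
        return len(self.ids) - 1

    def lookup(self, position):
        return self.ids[position]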

txtai gives Illegal instruction (core dumped)

Hi, I have successfully installed txtai on my Linux server. When I run python and do from txtai.embeddings import Embeddings, the Python process terminates with an Illegal instruction (core dumped) error. Following are the details of my Linux server. Can anybody help me figure out the problem and fix it? Thanks.

Linux-4.15.0-45-generic-x86_64-with-Ubuntu-18.04-bionic
Number of cores: 40
RAM: 126GB
Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0]

ValueError: Wrong shape for input_ids (shape torch.Size([6])) or attention_mask (shape torch.Size([6]))

This is the original code from Introducing txtai.py:

import numpy as np

sections = ["US tops 5 million confirmed virus cases",
"Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
"Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
"The National Park Service warns against sacrificing slower friends in a bear attack",
"Maine man wins $1M from $25 lottery ticket",
"Make huge profits without work, earn up to $100,000 a day"]

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Get index of the section that best matches the query
    uid = np.argmax(embeddings.similarity(query, sections))

    print("%-20s %s" % (query, sections[uid]))

A problem occurs in Colab when executing the following line of code:
uid = np.argmax(embeddings.similarity(query, sections))

It shows : "ValueError: Wrong shape for input_ids (shape torch.Size([6])) or attention_mask (shape torch.Size([6]))"

The problem did not occur a few days ago.

Word Embedding question 2

Hi,

In the tutorial Part 3: Build an Embeddings index from a data source, at the part where the word vectors are built, I checked the txt file that was generated. I realized that vector representations of the letters are there and not the words. Is this on purpose? Correct me if I'm wrong, but I think the list of words should be there with the 300-dimension vectors.

Kind regards,
mrJezy

GPT2 and T5 model

Hi,
I'm trying to use either the gpt2 or t5-3b models with txtai (as it is mentioned in one of the notebooks that any model listed on the Hugging Face model hub would work), but I receive several errors:

ERROR:transformers.tokenization_utils_base:Using pad_token, but it is not set yet.
Traceback (most recent call last):
File "./text-ai.py", line 24, in
similarities = embeddings.similarity(s, queries)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 227, in similarity
query = self.transform((None, query, None)).reshape(1, -1)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 178, in transform
embedding = self.model.transform(document)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/vectors.py", line 257, in transform
return self.model.encode([" ".join(document[1])], show_progress_bar=False)[0]
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 176, in encode
sentence_features = self.get_sentence_features(text, longest_seq)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 219, in get_sentence_features
return self._first_module().get_sentence_features(*features)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/models/Transformer.py", line 61, in get_sentence_features
return self.tokenizer.prepare_for_model(tokens, max_length=pad_seq_length, pad_to_max_length=True, return_tensors='pt', truncation=True)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2021, in prepare_for_model
padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1529, in _get_padding_truncation_strategies
raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).

or for T5:

File "./text-ai.py", line 24, in
similarities = embeddings.similarity(s, queries)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 227, in similarity
query = self.transform((None, query, None)).reshape(1, -1)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/embeddings.py", line 178, in transform
embedding = self.model.transform(document)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/txtai/vectors.py", line 257, in transform
return self.model.encode([" ".join(document[1])], show_progress_bar=False)[0]
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 187, in encode
out_features = self.forward(features)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sentence_transformers/models/Transformer.py", line 25, in forward
output_states = self.auto_model(**features)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/modeling_t5.py", line 965, in forward
decoder_outputs = self.decoder(
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in call
result = self.forward(*input, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/transformers/modeling_t5.py", line 684, in forward
raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")

What am I missing?

Thanks!

Language and Locale

Dear committers,

I would like to use txtai for search queries, but currently my content is not in English. Are there parameters that can be provided to improve the results based on language and locale?

Thanks,

Split extractor embedding query and QA calls

Currently, extractor.py has a single method to run an embeddings query and then run extractive QA over those results.

This should be split into two separate methods, which allows external callers to just run the embeddings search without executing the QA extraction. This will allow downstream systems more flexibility in working with the extractor process.
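
A sketch of the proposed split; the method names below are illustrative assumptions, not the final API:

# Illustrative only: split the single extractor call into two steps
class Extractor:
    def query(self, queue, texts):
        # Step 1: run the embeddings query and return the best matching text sections
        ...

    def __call__(self, queue, texts):
        # Step 2: run extractive QA over the sections returned by the embeddings query
        sections = self.query(queue, texts)
        ...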

Unable to install txtai; below is the error. I have installed the C++ build tools.

File "c:\users\gaussfer\anaconda3\lib\distutils\command\build_ext.py", line 340, in run
    self.build_extensions()
  File "C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py", line 50, in build_extensions
    self._remove_flag('-Wstrict-prototypes')
  File "C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py", line 58, in _remove_flag
    compiler = self.compiler.compiler
AttributeError: 'MSVCCompiler' object has no attribute 'compiler'
----------------------------------------

ERROR: Command errored out with exit status 1: 'c:\users\gaussfer\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py'"'"'; file='"'"'C:\Users\Gaussfer\AppData\Local\Temp\pip-install-1drqoyp0\faiss-gpu\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\Gaussfer\AppData\Local\Temp\pip-record-wnt_ovu1\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\gaussfer\anaconda3\Include\faiss-gpu' Check the logs for full command output.

Docker run timeout when downloading embeddings files

So, I'm very new to this but I have been able to put something together with txtai. Thanks for that! Very interesting stuff.

I built a new index and saved it, then worked up a simple flask app to load it and interface to it.

However, the initial embeddings line in there has to download the models from the net, and this causes a timeout when trying to fire up the resulting container in Docker. Is there a way to pre-download these files and then point to those rather than having it try to load them? It seems to do this on my local machine, but I cannot find where they are or how to reference them.

app.py

import os, json, requests
import urllib.request
from flask import Flask, abort, request, jsonify
from flask import Response
from flask_cors import CORS
from flask_restful import Resource, Api
from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})
embeddings.load("index")

app = Flask(__name__)

cors = CORS(app, resources={r"/*": {"origins": "*"}})
app.config['CORS_HEADERS'] = 'Content-Type'

@app.route("/q1",  methods=['GET'])
def search():
    q = request.args.get('q')
    results = embeddings.search(q, 10)
    data = {}  # build json from the set..
    for r in results:
        uid = r[0]
        score = r[1]
        data[str(uid)] = score
        #   print('score:{} -- {}'.format(score, text_corpus[uid]))
        print('score:{} -- {}'.format(score, uid))
    j = json.dumps(data)
    return  j

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))

Dockerfile

# Use the official Python image.
# https://hub.docker.com/_/python
FROM python:3.7

# Copy local code to the container image.
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./

# Install production dependencies.
RUN pip install Flask gunicorn
RUN pip install flask_restful
RUN pip install flask-cors
RUN pip install numpy
RUN pip install txtai

# Run the web service on container startup. Here we use the gunicorn
# webserver, with one worker process and 8 threads.
# For environments with multiple CPU cores, increase the number of workers
# to be equal to the cores available.
#CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 app:app
CMD exec gunicorn --bind :8080 --workers 1 --threads 8 app:app

All methods should operate on batches

For search, similarity, extractive QA and labels, all methods should operate on batches for the best performance.

  • Extractive QA already supports this.
  • Search, similarity and labels should work with batches. Separate methods (if necessary) can be retained to provide existing functionality for a single record.

Fix build warnings with hnswlib

hnswlib requires numpy and is failing to build a wheel. The build process is falling back to the legacy build, which will be removed in pip 21.x.
