kagisearch / vectordb
A minimal Python package for storing and retrieving text using chunking, embeddings, and vector search.
Home Page: https://vectordb.com
License: MIT License
I ran into issues when trying to install vectordb on an M1 Mac. Here is my solution in case future humans run into something similar.
# dependencies require Python 3.9
conda create -n myenv python=3.9
conda activate myenv
# tensorflow_text is not officially built for m1 Macs yet: https://github.com/tensorflow/text/issues/89
pip install https://github.com/sun1638650145/Libraries-and-Extensions-for-TensorFlow-for-Apple-Silicon/releases/download/v2.12/tensorflow_text-2.12.0-cp39-cp39-macosx_11_0_arm64.whl
pip install vectordb2
In cell 1 of the notebook at
https://colab.research.google.com/drive/1pecKGCCru_Jvx7v0WRNrW441EBlcS5qS#scrollTo=N6R_EZEC51ri
I get the following error:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tf-keras 2.15.1 requires tensorflow<2.16,>=2.15, but you have tensorflow 2.16.1 which is incompatible.
Awesome little library, so much more robust and easy to work with compared to alternatives.
If I may suggest, it would be great to add `update` and `delete` functions, as well as expanding `search` to allow searching by metadata.
The use case I have in mind is for it to periodically (or before processing each request to the AI) check the files in a documents folder and update itself. The logic I have in mind: I could use metadata to store filepaths and hashsums, then walk the directory and first check whether the filepath is already in Memory. If it's not, add it. If it is, check the hashsum; if the hashsum does not match, update it. Something like:
for filepath in filepaths:
    # Pass a dict to look for entries where key == value for each pair (AND operator)
    existing_entry = memory.search_by_metadata({'filepath': filepath})
    if existing_entry is None:
        # text = read file
        memory.save([text], [meta])
    else:
        # text = read file
        # Uses the same logic to find the entry to update; if multiple entries
        # are found, maybe error out. We need to assume the key is a unique identifier.
        memory.update({'filepath': filepath}, [text], [meta])
Something like this would be sufficient and should be easy to implement. Alternatively, `update` could use IDs; that would also work, but in that case `search` should also return the entry ID.
Also, here I use `search_by_metadata`, but the existing `search` function could instead be extended to accept a metadata dict. In that case it could also be used to filter results by metadata, for example to create categories. It would first perform the metadata search and then the embeddings search, which I suppose would also speed up the process. Or the other way around, if embeddings search is faster than metadata search.
And if the `query` is empty, `search` could perform only the metadata search. For example:
memory.search(None, metadata={"key": value})
memory.search('', metadata={"key": value})
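The AND-semantics metadata filter proposed above could look something like this. A minimal sketch assuming metadata is stored as a list of dicts parallel to the chunks; the function name and shape are hypothetical, not vectordb's actual internals.

```python
def filter_by_metadata(entries, criteria):
    """Return indices of entries whose metadata matches every key=value
    pair in criteria (AND semantics)."""
    return [
        i for i, meta in enumerate(entries)
        if all(meta.get(k) == v for k, v in criteria.items())
    ]
```

An extended `search` could run this first and restrict the embeddings search to the surviving indices.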
I'm using vectordb to index data documentation from different sources, but sometimes I get these backtraces: `x.shape` only contains one element, `(0,)`.
This issue happens only when there's nothing saved (or the data saved is too small).
146$ ./venv/bin/python
Python 3.11.6 (main, Oct 2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import vectordb
Warning: mprt could not be imported. Install with 'pip install git+https://github.com/vioshyvo/mrpt/'. Falling back to Faiss.
>>> a=vectordb.Memory()
>>> a.search("riscv", top_n=5)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/pancake/prg/r2ai/venv/lib/python3.11/site-packages/vectordb/memory.py", line 68, in search
indices = self.vector_search.search_vectors(query_embedding, embeddings, top_n)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/pancake/prg/r2ai/venv/lib/python3.11/site-packages/vectordb/vector_search.py", line 60, in search_vectors
indices = call_search(query_embedding, embeddings, top_n)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/pancake/prg/r2ai/venv/lib/python3.11/site-packages/vectordb/vector_search.py", line 26, in run_faiss
index.add(vectors)
File "/Users/pancake/prg/r2ai/venv/lib/python3.11/site-packages/faiss/class_wrappers.py", line 226, in replacement_add
n, d = x.shape
^^^^
ValueError: not enough values to unpack (expected 2, got 1)
>>>
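The crash above comes from handing Faiss's `index.add` an empty or 1-D array where it expects shape `(n, d)`. A hedged sketch of a guard that could sit in front of the Faiss call, using only NumPy (this is not vectordb's actual code; the function name is made up):

```python
import numpy as np

def faiss_ready(embeddings):
    """Return a 2-D float32 array Faiss can accept, or None when there
    is nothing to search."""
    arr = np.asarray(embeddings, dtype="float32")
    if arr.size == 0:
        return None               # empty store: caller should return [] instead of searching
    if arr.ndim == 1:
        arr = arr.reshape(1, -1)  # a single stored vector arrives as 1-D
    return arr
```

With a check like this, `Memory.search` on an empty store could return an empty result list instead of raising `ValueError` inside Faiss.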
(venv) 0$ pip install .
Processing /Users/pancake/prg/vectordb
Installing build dependencies ... done
Getting requirements to build wheel ... done
Installing backend dependencies ... done
Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: torch>=1.9.0 in /Users/pancake/prg/r2ai/venv/lib/python3.12/site-packages (from vectordb2==0.1.9) (2.2.2)
Requirement already satisfied: transformers>=4.10.0 in /Users/pancake/prg/r2ai/venv/lib/python3.12/site-packages (from vectordb2==0.1.9) (4.39.3)
Requirement already satisfied: numpy>=1.21.0 in /Users/pancake/prg/r2ai/venv/lib/python3.12/site-packages (from vectordb2==0.1.9) (1.26.4)
Requirement already satisfied: scikit-learn>=0.24.0 in /Users/pancake/prg/r2ai/venv/lib/python3.12/site-packages (from vectordb2==0.1.9) (1.4.2)
Requirement already satisfied: scipy>=1.7.0 in /Users/pancake/prg/r2ai/venv/lib/python3.12/site-packages (from vectordb2==0.1.9) (1.13.0)
Requirement already satisfied: sentence-transformers in /Users/pancake/prg/r2ai/venv/lib/python3.12/site-packages (from vectordb2==0.1.9) (2.6.1)
Requirement already satisfied: faiss-cpu in /Users/pancake/prg/r2ai/venv/lib/python3.12/site-packages (from vectordb2==0.1.9) (1.8.0)
INFO: pip is looking at multiple versions of vectordb2 to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement tensorflow-text (from vectordb2) (from versions: none)
ERROR: No matching distribution found for tensorflow-text
(venv) 1$
While executing:
memory = Memory(chunking_strategy={"mode": "sliding_window", "window_size": 128, "overlap": 16}, embeddings='TaylorAI/bge-micro-v2')
Got:
TypeError: Memory.__init__() got an unexpected keyword argument 'embeddings'
Ever since the change that split the metadata from the embeddings, it can't be loaded from disk anymore, as only the embeddings get saved and not the metadata.
Everything works just fine when using it from RAM, but it completely breaks as soon as you try to load from storage.
Not sure what a clean solution is here, as saving two files would also be a hassle...
We could make a single dictionary out of the two of them when saving and split it again once loaded?
Is there a way that the query could return the distances of the vectors from the search, maybe even with a minimum threshold?
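A search returning scores with an optional cutoff could look like the sketch below, using cosine similarity over NumPy arrays. The function and its `min_similarity` parameter are hypothetical, not an existing vectordb option.

```python
import numpy as np

def search_with_scores(query, embeddings, top_n=5, min_similarity=None):
    """Rank stored vectors by cosine similarity to the query and
    optionally drop results below a threshold.

    Returns a list of (index, similarity) pairs, best first.
    """
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q                          # cosine similarity per stored vector
    order = np.argsort(-sims)[:top_n]     # best-first
    results = [(int(i), float(sims[i])) for i in order]
    if min_similarity is not None:
        results = [(i, s) for i, s in results if s >= min_similarity]
    return results
```

Exposing the score alongside each hit also lets callers implement their own thresholding without a library change.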
Is there a way to just embed my text chunks and recall them, without using a chunking method?