
vecs's Introduction

vecs



Documentation: https://supabase.github.io/vecs/latest/

Source Code: https://github.com/supabase/vecs


vecs is a Python client for managing and querying vector stores in PostgreSQL with the pgvector extension. This guide will help you get started with using vecs.

If you don't have a Postgres database with the pgvector extension ready, see hosting for easy options.

Installation

Requires:

  • Python 3.7+

You can install vecs using pip:

pip install vecs

Usage

Visit the quickstart guide for more complete info.

import vecs

DB_CONNECTION = "postgresql://<user>:<password>@<host>:<port>/<db_name>"

# create vector store client
vx = vecs.create_client(DB_CONNECTION)

# create a collection of vectors with 3 dimensions
docs = vx.get_or_create_collection(name="docs", dimension=3)

# add records to the *docs* collection
docs.upsert(
    records=[
        (
         "vec0",           # the vector's identifier
         [0.1, 0.2, 0.3],  # the vector. list or np.array
         {"year": 1973}    # associated  metadata
        ),
        (
         "vec1",
         [0.7, 0.8, 0.9],
         {"year": 2012}
        )
    ]
)

# index the collection for fast search performance
docs.create_index()

# query the collection filtering metadata for "year" = 2012
docs.query(
    data=[0.4,0.5,0.6],              # required
    limit=1,                         # number of records to return
    filters={"year": {"$eq": 2012}}, # metadata filters
)

# Returns: ["vec1"]

vecs's People

Contributors

beerose, ilaffey2, jbritain, jeanmaried, kiwicopple, leothomas, lovaarutinovi, olirice, ooiyeefei, trancethehuman


vecs's Issues

Efficient Access Patterns for Databases with Many Collections

Currently the way vecs retrieves collections by name does not seem to scale well when working with a database with many collections (i.e. 1000+).

Looking at the code, it seems that vecs fetches ALL the collections and then loops through them to find one with a matching name. It would be more efficient if _list_collections accepted a where clause so the collection could be specified by name and only that one fetched.

Maybe I'm missing something?

    def get_collection(self, name: str) -> Collection:
        """Get an existing collection"""
        from vecs.collection import Collection

        collections = Collection._list_collections(self)
        for collection in collections:
            if collection.name == name:
                return collection
        raise CollectionNotFound("No collection found with requested name")

    # ...

    @classmethod
    def _list_collections(cls, client: "Client") -> List["Collection"]:
        query = text(
            """
        select
            relname as table_name,
            atttypmod as embedding_dim
        from
            pg_class pc
            join pg_attribute pa
                on pc.oid = pa.attrelid
        where
            pc.relnamespace = 'vecs'::regnamespace
            and pc.relkind = 'r'
            and pa.attname = 'vec'
            and not pc.relname ^@ '_'
        """
        )
        xc = []
        with client.Session() as sess:
            for name, dimension in sess.execute(query):
                existing_collection = cls(name, dimension, client)
                xc.append(existing_collection)
        return xc
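
A minimal sketch of the kind of lookup this is asking for (not the current implementation): reuse the same catalog query but scope it to a single name with a bind parameter.

from sqlalchemy import text

# sketch only: a name-scoped variant of the query used by _list_collections
query = text(
    """
    select
        relname as table_name,
        atttypmod as embedding_dim
    from
        pg_class pc
        join pg_attribute pa
            on pc.oid = pa.attrelid
    where
        pc.relnamespace = 'vecs'::regnamespace
        and pc.relkind = 'r'
        and pa.attname = 'vec'
        and pc.relname = :name
    """
)
with client.Session() as sess:
    row = sess.execute(query, {"name": name}).first()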

No function to close the connection

Currently there is no function to close the connection with Postgres.
I have tried accessing the Session object and closing it, but that didn't work.
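
A workaround sketch, assuming the client exposes its SQLAlchemy engine as vx.engine (as the tracebacks elsewhere on this page suggest): disposing of the engine closes the pooled connections.

# close all connections held by the client's SQLAlchemy connection pool
vx.engine.dispose()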

Feature Request: $in operator

Right now, there is no support for the $in operator in metadata filtering, which is limiting. There are many use cases where a field in metadata is an array. It is supported by Pinecone, and I would love to see it supported by vecs.

Feature request: allow query offset

Currently, only limit is supported, and the maximum result count is 1000.
What if we want to query the 1001st result?
Without changing the limit, we could fetch the first 1000, then execute a second query specifying the offset as 1000.
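
A hypothetical illustration of the request (offset is not a current query parameter):

# hypothetical: fetch results 1001-1100 without raising the limit
docs.query(
    data=[0.4, 0.5, 0.6],
    limit=100,
    offset=1000,  # hypothetical keyword, not in the current API
)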

NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:postgres

Traceback (most recent call last):
  File "main.py", line 6, in <module>
    vx = vecs.create_client(DB_CONNECTION)
  File "/home/runner/StarryKindSphere/venv/lib/python3.10/site-packages/vecs/__init__.py", line 14, in create_client
    return Client(connection_string)
  File "/home/runner/StarryKindSphere/venv/lib/python3.10/site-packages/vecs/client.py", line 16, in __init__
    self.engine = create_engine(connection_string)
  File "<string>", line 2, in create_engine
  File "/home/runner/StarryKindSphere/venv/lib/python3.10/site-packages/sqlalchemy/util/deprecations.py", line 281, in warned
    return fn(*args, **kwargs)  # type: ignore[no-any-return]
  File "/home/runner/StarryKindSphere/venv/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 552, in create_engine
    entrypoint = u._get_entrypoint()
  File "/home/runner/StarryKindSphere/venv/lib/python3.10/site-packages/sqlalchemy/engine/url.py", line 754, in _get_entrypoint
    cls = registry.load(name)
  File "/home/runner/StarryKindSphere/venv/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 368, in load
    raise exc.NoSuchModuleError(
sqlalchemy.exc.NoSuchModuleError: Can't load plugin: sqlalchemy.dialects:postgres
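
For context: SQLAlchemy 1.4+ removed the postgres:// dialect alias, and this NoSuchModuleError is what it raises when a connection string uses the old scheme. A likely fix, using the connection string format from the README above:

# use the postgresql:// scheme rather than postgres://
DB_CONNECTION = "postgresql://<user>:<password>@<host>:<port>/<db_name>"
vx = vecs.create_client(DB_CONNECTION)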

Option to load private models from Hugging Face

Describe the bug
Private models cannot be loaded from Hugging Face: no **kwargs are passed through to SentenceTransformer() (ST) in the TextEmbedding(AdapterStep) class.

Expected behavior
use_auth_token=None could be accepted by __init__() and passed through to SentenceTransformer().


Versions:

  • PostgreSQL: [e.g. 14.1]
  • vecs version: e.g. 0.2.6

Additional context
I did add it manually, but that becomes a problem when my application is deployed in the cloud and the package is installed fresh.
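
A hypothetical usage sketch illustrating the request (use_auth_token is not currently accepted; the idea is to forward extra keyword arguments from TextEmbedding through to SentenceTransformer):

from vecs.adapters import Adapter, TextEmbedding

adapter = Adapter([
    TextEmbedding(model="my-org/private-model", use_auth_token="hf_...")  # hypothetical kwarg
])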

Explain Query Method

I would like to be able to run EXPLAIN on the query method to ensure that the query is actually leveraging the indexes.

collection.explain(<query params>)

FATAL: Max client connections reached

Describe the bug
I am intermittently experiencing the error shown below:

(psycopg2.OperationalError) connection to server at "aws-0-us-west-1.pooler.supabase.com" (54.177.55.191), port 5432 failed: FATAL: Max client connections reached (Background on this error at: https://sqlalche.me/e/20/e3q8)

To Reproduce
I am ingesting a reasonable number of requests with a FastAPI implementation in Python. The issue could be due to an incorrect implementation on my side, but I could not work out from the documentation whether that is the case and, if so, why.

The snippet below shows how I have implemented the PGVectorDB

...
from vecs.client import Client
from vecs.collection import Collection
...

class PGVectorDB(VectorDBProvider):
    def __init__(self) -> None:
        super().__init__()
        try:
            import vecs
        except ImportError:
            raise ValueError(
                f"Error, PGVectorDB requires the vecs library. Please run `poetry add vecs`."
            )
        try:
            user = os.getenv("PGVECTOR_USER")
            password = os.getenv("PGVECTOR_PASSWORD")
            host = os.getenv("PGVECTOR_HOST")
            port = os.getenv("PGVECTOR_PORT")
            db_name = os.getenv("PGVECTOR_DBNAME")

            DB_CONNECTION = (
                f"postgresql://{user}:{password}@{host}:{port}/{db_name}"
            )
            self.vx: Client = vecs.create_client(DB_CONNECTION)
        except Exception as e:
            raise ValueError(
                f"Error {e} occurred while attempting to connect to the pgvector provider."
            )
        self.collection: Optional[Collection] = None

    def initialize_collection(
        self, collection_name: str, dimension: float
    ) -> None:
        self.collection = self.vx.get_or_create_collection(
            name=collection_name, dimension=dimension
        )

    def upsert(self, entry: VectorEntry, commit=True) -> None:
        if self.collection is None:
            raise ValueError(
                "Please call `initialize_collection` before attempting to run `upsert`."
            )

        self.collection.upsert(
            records=[(entry.id, entry.vector, entry.metadata)]
        )

The database client is being created in multiple threads due to the nature of how the FastAPI app is configured.

Expected behavior
I do not expect to encounter this error. I am only communicating with the database with four worker threads.

Versions:

  • Supabase defaults

Additional context
It would be great if there was more documentation around how to use the vecs library in production.
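
One pattern that usually avoids this error (a sketch reusing the class from the snippet above, and assuming VectorEntry works as a FastAPI request model): construct a single PGVectorDB, and therefore a single vecs Client and connection pool, at application startup and share it across requests, rather than creating a new client per request or per worker thread.

from fastapi import FastAPI

app = FastAPI()

# one shared PGVectorDB -> one vecs Client -> one SQLAlchemy connection pool
db = PGVectorDB()
db.initialize_collection(collection_name="docs", dimension=1536)

@app.post("/entries")
def add_entry(entry: VectorEntry) -> dict:
    db.upsert(entry)
    return {"status": "ok"}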

Set better default for lists and probes

The number of index lists and the ivfflat.probes parameter are key to balancing accuracy and performance.

  • from pgvector docs:

Choose an appropriate number of lists - a good place to start is rows / 1000 for up to 1M rows and sqrt(rows) for over 1M rows

When querying, specify an appropriate number of probes (higher is better for recall, lower is better for speed) - a good place to start is sqrt(lists)

  • in the worst case (random vectors) the number of probes required for a given precision level is dependent on the vector dimension

  • when vectors are embeddings (vs the random worst case) a far smaller number of probes are needed to achieve high precision


Currently

n_lists = (
    int(max(n_records / 1000, 30))
    if n_records < 1_000_000
    else int(math.sqrt(n_records))
)

We do not currently record the number of lists when creating an index, so that value is not known at query time


This task is to make decisions about how defaults should be set, update the defaults, and document the expected behavior.
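
A sketch of one possible probes default, following the pgvector guidance quoted above (not current vecs behavior):

import math

# start probes near sqrt(lists), clamped to at least 1
n_probes = max(1, int(math.sqrt(n_lists)))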

Remove limit of 1 entry per filter

Summary

Hello, in the build_filters function, there is currently a limit that allows only 1 entry per filter.
Code snippet below.

def build_filters(json_col: Column, filters: Dict):
    """
    PRIVATE

    Builds filters for SQL query based on provided dictionary.

    Args:
        json_col (Column): The column in the database table.
        filters (Dict): The dictionary specifying filter conditions.

    Raises:
        FilterError: If filter conditions are not correctly formatted.

    Returns:
        The filter clause for the SQL query.
    """

    if not isinstance(filters, dict):
        raise FilterError("filters must be a dict")

    if len(filters) > 1:
        raise FilterError("max 1 entry per filter")

In my example, my filters are

{'agent_id': {'$eq': 139}, 'thread_id': {'$eq': 139}}

and filtering is done through llama-index.

This should be removed.

Rationale

The limit makes it significantly harder to build useful applications that leverage vecs/Supabase as the storage of choice for the RAG datastore. This limitation is not documented anywhere; if I had read that you could only filter by one field in your JSON metadata, I would have gone with another option. Being able to filter by multiple metadata fields when doing a vector search is something I and many others need for complex RAG systems.


Trying to use filters, getting "Only 1 Filter per entry"

Describe the bug

I have a table with vectors and metadata, I use the following filter:

{'$and': [{'user_id': {'$eq': 'test_user1'}, 'bot_id': {'$eq': 'test_bot1'}}]}

the data is:

{
  "content": "input: I like the color red\noutput: ok",
  "metadata": {
    "bot_id": "test_bot1",
    "user_id": "test_user1"
  }
}

There's a check in the code that, if the filter length is > 1, it raises this error:

if len(filters) > 1:
        raise FilterError("max 1 entry per filter")

I tried removing the check but it ended up not returning results.

How am I supposed to check for multiple $eq conditions in the filters then?
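
For what it's worth, the $and operator expects each condition to be its own single-key dict, so a reshaped version of the filter above should pass the check (the same reshaping applies to the top-level two-key filter in the previous issue):

filters = {
    "$and": [
        {"user_id": {"$eq": "test_user1"}},
        {"bot_id": {"$eq": "test_bot1"}},
    ]
}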

Versions:

  • PostgreSQL: [e.g. 14.1]
  • vecs version: e.g. 0.2.6


Export SQL for a migration

Vecs does not currently interop well with migrations.

Provide a function that exports a SQL snippet that can be used to set up a migration

Example

docs = vx.create_collection(...)

docs.export_migration()
# Returns: "create table vecs.docs ......"

Support for Metadata Projection

It would be great if we can select specific fields to be queried rather than ship the entire metadata column over the wire. This is helpful if there is a lot of metadata.
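
A hypothetical illustration of the request (include_metadata currently only accepts a boolean, per the query signature shown elsewhere on this page):

docs.query(
    data=[0.1, 0.2, 0.3],
    limit=5,
    include_metadata=["title", "url"],  # hypothetical: return only these metadata keys
)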

`$in` operator doesn't seem to work

I am trying to use the $in operator to query metadata which contains a list of integers: "tags": [20]. Using the filter always returns an empty list even though there is a match.

Steps to reproduce the behavior:

  1. have a vecs entry with metadata like so: {"tags": [20]}
  2. query with that filter:
collection.query(
            data=[...],
            include_metadata=True,
            include_value=True,
            filters={"tags": {"$in": [20]}})

I tried with:

collection.query(
            data=[...],
            include_metadata=True,
            include_value=True,
            filters={"tags": {"$eq": [20]}})

Which works so it seems specific to $in.

Expected behavior
Should return matching entry

Versions:

  • PostgreSQL: 15.1
  • vecs version: 0.4.2
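
A note on semantics, consistent with the behavior reported above: $in appears to test whether a scalar metadata value is one of the listed values, not whether an array-valued field contains them, so {"tags": {"$in": [20]}} compares the whole array [20] against the scalar 20 and never matches. With a scalar field it behaves as expected:

# matches records whose scalar "year" value is 2012 or 1973
docs.query(
    data=[0.4, 0.5, 0.6],
    limit=10,
    filters={"year": {"$in": [2012, 1973]}},
)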

Empty Description on PyPI

The package description on pypi for vecs doesn't have a lot of detail and might imply inactivity. Would be great if it could be updated.

Pretty sure it just needs appropriate fields in the pyproject.toml to publish from the readme.

Screenshot 2023-09-14 at 16 05 04

Feature Request: Progress Bars

The lack of feedback during Collection.upsert and Collection.create_index is a bad DX.

It would be great to get some progress bars but I haven't been able to get them working properly in notebooks and shell environments.

If anyone has experience adding them this would be a great community contribution

upsert

for chunk in flu(vectors).chunk(chunk_size):

but it can't always assume that *vectors* is sized. A runtime check to see whether *vectors* has a known length, providing it to the progress bar when it does, would probably be best.
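
A caller-side workaround sketch in the meantime, assuming tqdm is installed: wrap the records iterable so progress is displayed as upsert consumes it, passing a total only when the input is sized.

from tqdm.auto import tqdm

records = [...]  # (id, vector, metadata) tuples
total = len(records) if hasattr(records, "__len__") else None

# upsert accepts any iterable, so the tqdm wrapper is consumed lazily
docs.upsert(records=tqdm(records, total=total))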

create_index

vecs/src/vecs/collection.py

Lines 346 to 350 in 87ed2d3

stmt = postgresql.insert(clone_table).from_select(
    self.table.c, select(self.table)
)
stmt = stmt.on_conflict_do_nothing()
sess.execute(stmt)

this one is a little more involved, as it'll have to introduce client-side keyset pagination on the id primary key to get feedback back to Python

Deleting with filters not working (Package mismatch with source code?)

Describe the bug
Trying to use the optional filters parameter of collection.delete() to delete vectors with specific metadata.
The source code has it:

def delete(

As well as official documentation https://supabase.com/docs/guides/ai/python/api#deleting-vectors

When I download the latest vecs package (0.4.1) I am NOT seeing these changes locally. My hunch is that there's a package mismatch between what's deployed and the source code.

To Reproduce
1. Start a Python virtual environment.
2. pip install vecs
3. Run the following Python code (substitute connection strings)

import vecs
vx = vecs.create_client(<connection>)
collection_name = 'collection_name'
db = vx.get_or_create_collection(name=collection_name, dimension=1536)
db.delete(filters={"foo": {"$eq": "bar"}})

you will get: TypeError: Collection.delete() got an unexpected keyword argument 'filters'

Expected behavior
Expecting the code to behave like the source code on GitHub and actually delete the vectors associated with particular metadata.

Do I need to store all filters in the "metadata" column?

Summary

I am trying to create a personalized chatbot. Currently, according to the vecs docs, I should be storing things like "user_id" in the metadata JSON; however, I'm not sure how scalable that is.

Unresolved Questions

Can I for example, add a column to the table called "user_id" and "bot_id", and modify the query to select the vectors where user_id = X and bot_id = Y?

Wouldn't this result in much faster queries?

What would be the best way to approach this?

Thank you!

version 0.3 breaking upsert vectors

Describe the bug
Following the bump to 0.3, I am no longer able to upsert to a vecs collection the same way I did before.

To Reproduce

vx = vecs.create_client(self.connection_string)
emails_collection = vx.get_collection(name="emails")

emails_collection.upsert(vectors=[(str(email.metadata.id), response["data"][0]["embedding"], email_json)]) 

Gives Error:

 File "/usr/local/lib/python3.10/site-packages/embedding/email_embeddings.py", line 66, in embed_single_email
    emails_collection.upsert(vectors=[(str(email.metadata.id), response["data"][0]["embedding"], email_json)])  
TypeError: Collection.upsert() got an unexpected keyword argument 'vectors'

Expected behavior
Update docs or support vectors keyword arg again

Versions:

  • vecs version: e.g. 0.3

Additional context
Downgraded my version to vecs==0.2.6, since I didn't feel like looking into the error.
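
For reference, the call above works on 0.3+ when the argument is passed as records=, matching the README example at the top of this page:

emails_collection.upsert(
    records=[(str(email.metadata.id), response["data"][0]["embedding"], email_json)]
)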

Adding another column in collections to store raw text

As the title suggests, some other vector databases often have 'id', 'metadata', and 'embedding' columns, which vecs supports here. However, they also support a raw text field (documents in Chroma) for storing raw text. Should there be an optional argument to add a 'text' field to the collection table? Or do you have any advice on how we should handle such a case with pgvector?
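
In the meantime, one common workaround is to keep the raw text inside the metadata JSON, since metadata is schemaless; a sketch using the record shape from the README:

docs.upsert(
    records=[
        ("doc_0", [0.1, 0.2, 0.3], {"text": "the original raw text", "year": 2012}),
    ]
)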

Feature Request: Preprocessing Transform

Feature Request: Preprocessor Step

Most of the time, users are working with text in a handful of known media types (markdown, html, etc) or images.

This proposal is to add a generic preprocessing step on collection creation for preprocessing media into vectors so users don't have to operate on vectors directly

Something similar to

docs: Collection = vx.create_collection(
    name='docs',
    dimension=512,
    transform = TextPreprocessor(          # this is new
        chunker= markdown_chunker,
        model= "sentence-transformers/all-MiniLM-L6-v2"
    )
)

docs.upsert([
    ("id_0", "# Some markdown", {}),
    ("id_1", "# Some markdown", {}),
])

Open questions:

  • Should chunking be supported? If so, separately or with the preprocessor?
  • Which media types are required?

Optional Arguments for useful postgres settings during create_index

It could be useful to optionally allow users to provide a statement_timeout during create_index, as it can be difficult to interrupt once started and could potentially take a long time if the table contains a lot of data.

For IVFFlat indexes, insufficient maintenance_work_mem can cause the index creation to fail at the last step (ref). This should also be an optional argument to create_index to make it possible to adjust it when the Postgres role has sufficient permission.
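
A hypothetical sketch of what the request could look like (neither keyword exists in the current API; both names are standard Postgres settings):

docs.create_index(
    statement_timeout="30min",      # hypothetical: bound how long the index build may run
    maintenance_work_mem="512MB",   # hypothetical: avoid IVFFlat failing at the last step
)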

Clarify Request for Support Policy

Hey @olirice - this is a follow up to this thread on Reddit.

Thanks for making the fixes in #37. Those are helpful and I appreciate you being responsive.

To follow up with more detail on a support policy - I actually think https://github.com/supabase-community/supabase-py does a great job of this. At the very top of the Readme it says what level of support the library has.

image

This sets a healthy expectation with the user about what stage the "product" (loosely defined) is at and what users should expect. Right now, reading the docs, I have no idea if this is a hackathon project or something that I could run in production.

Hope that helps.

Check for a covering index using pg_catalog params, not ix name/string matching

Currently vecs uses string pattern matching on the index name to determine if an index exists and supports the correct vector ops

Logic is here

vecs/src/vecs/collection.py

Lines 623 to 677 in cc412ce

    def index(self) -> Optional[str]:
        """
        PRIVATE

        Note:
            The `index` property is private and expected to undergo refactoring.
            Do not rely on it's output.

        Retrieves the SQL name of the collection's vector index, if it exists.

        Returns:
            Optional[str]: The name of the index, or None if no index exists.
        """
        if self._index is None:
            query = text(
                """
            select
                relname as table_name
            from
                pg_class pc
            where
                pc.relnamespace = 'vecs'::regnamespace
                and relname ilike 'ix_vector%'
                and pc.relkind = 'i'
            """
            )
            with self.client.Session() as sess:
                ix_name = sess.execute(query).scalar()
            self._index = ix_name
        return self._index

    def is_indexed_for_measure(self, measure: IndexMeasure):
        """
        Checks if the collection is indexed for a specific measure.

        Args:
            measure (IndexMeasure): The measure to check for.

        Returns:
            bool: True if the collection is indexed for the measure, False otherwise.
        """
        index_name = self.index
        if index_name is None:
            return False
        ops = INDEX_MEASURE_TO_OPS.get(measure)
        if ops is None:
            return False
        if ops in index_name:
            return True
        return False

It would be preferable to lookup the supported ops in pg_catalog such that it is more intuitive
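
A sketch of what a catalog-based check could query instead (not the current implementation; the joins may need adjustment): resolve the access method and operator class attached to the collection's indexes via pg_index, pg_am, and pg_opclass, rather than parsing the index name.

from sqlalchemy import text

query = text(
    """
    select
        irel.relname as index_name,
        am.amname    as index_method,    -- e.g. ivfflat, hnsw
        opc.opcname  as operator_class   -- e.g. vector_cosine_ops
    from
        pg_index i
        join pg_class irel  on irel.oid = i.indexrelid
        join pg_class trel  on trel.oid = i.indrelid
        join pg_am am       on am.oid   = irel.relam
        join pg_opclass opc on opc.oid  = i.indclass[0]
    where
        trel.relnamespace = 'vecs'::regnamespace
        and trel.relname = :table_name
    """
)
with self.client.Session() as sess:
    # note: the id primary key's btree index is also returned; filter on
    # amname/opcname for the vector index of interest
    rows = sess.execute(query, {"table_name": self.table.name}).all()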

Upsert returns 500/502/504 if given too many rows/characters ~2000 trips it

Describe the bug

upsert returns 500/502/504 if given too many rows/characters ~2000 trips it

https://github.com/hwchase17/langchainjs/blob/6a40c13c52c6efe70a44070f31b458abc26876a8/langchain/src/vectorstores/supabase.ts#L74

I want to know if this is still an issue. If we are to use supabase for storing embeddings, I need to know that it can handle ~500 token chunks, for our use case. Counting characters is a no go, as I support multiple languages and tokenizers have varying character:token ratios depending on the language.

Expected behavior
Chunks of any reasonable size work well with upserts/inserts


@olirice Appreciate any support :)

Additional context
I'm unsure if this is the right repo to discuss this, but since pgvector-related work seems to be happening here, I thought this would be the right place to ask - if not, let me know where would be better and I'll repost the issue there.

User-definable ivfflat.probes

When an index is present on a collection, the ivfflat.probes parameter controls the balance between speed of results and precision of results.

Currently, ivfflat.probes is fixed at 10 so the user does not have control over the precision of results

sess.execute(text("set local ivfflat.probes = 10"))

There should be a keyword-only argument in the query method to allow that to be user-adjustable

vecs/src/vecs/collection.py

Lines 291 to 299 in ffeab7f

    def query(
        self,
        query_vector: Iterable[Numeric],
        limit: int = 10,
        filters: Optional[Dict] = None,
        measure: Union[IndexMeasure, str] = IndexMeasure.cosine_distance,
        include_value: bool = False,
        include_metadata: bool = False,
    ) -> Union[List[Record], List[str]]:

note, I'm not sure if settings can use bind params like

text("set ivfflat.probes = :probes").bindparams(probes=probes)

so it might need to use an f-string where the user's parameter has been confirmed to be an int

if not isinstance(probes, int):
  raise ...
text(f"set ivfflat.probes = {probes}"

can't add to an existing table

When trying to add another batch of vectors to an existing table, I get this error:

from flupy import flu
import numpy as np
from tqdm import tqdm
from typing import Dict, List, Tuple

batch_size = 50

# Convert DataFrame to list of tuples
data = list(new_df[['id', 'content', 'metadata']].itertuples(index=False, name=None))

# Define the starting record index
start_index = 0

# Adjust the data list to start from the desired record
data = data[start_index:]

uploaded_rows = 0

# Iterate over the dataset in chunks
for chunk_ix, chunk in enumerate(tqdm(flu(data).chunk(batch_size))):


    records: List[Tuple[str, np.ndarray, Dict]] = []

    # Extract ids, contents, and metadata from the chunk
    ids, contents, metadatas = zip(*chunk)

    # Create embeddings for the current chunk
    embedding_chunk = co.embed(texts=contents, input_type="search_document", model="embed-english-v3.0").embeddings
    #embedding_chunk = embeddings.embed_documents(contents)
    # Enumerate the embeddings and create a record to insert into the database
    for row_ix, (id, content, embedding, metadata) in enumerate(zip(ids, contents, embedding_chunk, metadatas)):

        # Prepare the record with id, embedding, and metadata
        records.append((id, embedding, metadata))

    # Perform upsert operation for the current batch
    docs.upsert(records)

    # Update the count of uploaded rows
    uploaded_rows += len(records)

print(f"Total uploaded rows: {uploaded_rows}")
InvalidColumnReference: there is no unique or exclusion constraint matching the ON CONFLICT specification


The above exception was the direct cause of the following exception:

ProgrammingError                          Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/sqlalchemy/engine/default.py](https://localhost:8080/#) in do_execute(self, cursor, statement, parameters, context)
    920 
    921     def do_execute(self, cursor, statement, parameters, context=None):
--> 922         cursor.execute(statement, parameters)
    923 
    924     def do_execute_no_params(self, cursor, statement, context=None):

ProgrammingError: (psycopg2.errors.InvalidColumnReference) there is no unique or exclusion constraint matching the ON CONFLICT specification

It works fine if I send it into a new table. I'd appreciate any advice on how to resolve it. Thanks!
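
The error itself indicates that the target table has no unique or primary-key constraint on the column used by ON CONFLICT (vecs upserts appear to conflict on id). If the existing table was created outside vecs (e.g. via a migration or the SQL editor) without a primary key on id, a sketch of a fix, assuming the collection's table is vecs."my_table":

from sqlalchemy import text

with vx.Session() as sess:
    # add the uniqueness constraint that ON CONFLICT (id) relies on
    sess.execute(text('alter table vecs."my_table" add primary key (id)'))
    sess.commit()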

str object is not callable

After I added my items to the table, I ran indexing and got this error:

docs.index()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[<ipython-input-3-d199780f7211>](https://localhost:8080/#) in <cell line: 1>()
----> 1 docs.index()

TypeError: 'str' object is not callable
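
A likely explanation: index is a (private) property that returns the index's name as a string (see the snippet quoted in the pg_catalog issue above), so docs.index evaluates to that string and calling it raises this TypeError. Index creation goes through create_index, as in the README example at the top of this page:

docs.create_index()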

Can't connect to Supabase

Describe the bug
I'm trying to connect vecs to my Supabase db but it's throwing a SQL error. I tried to install psycopg2-binary and sqlalchemy but that didn't help either.

To Reproduce
Steps to reproduce the behavior:

  1. Try connecting vecs to your supabase db in a google colab

Expected behavior
I'd expect it to connect

Screenshots
Screen Shot 2023-06-07 at 3 40 50 PM

Versions:

  • PostgreSQL: [e.g. 14.1]
  • vecs version: e.g. 0.2.6


OpenAI example in the documentation doesn't seem to work well.

Hi,

I am trying to run this example, however it does not seem to work as expected.

import openai

openai.api_key = '<OPENAI-API-KEY>'

dataset = [
    "The cat sat on the mat.",
    "The quick brown fox jumps over the lazy dog.",
    "Friends, Romans, countrymen, lend me your ears",
    "To be or not to be, that is the question.",
]

embeddings = []

for sentence in dataset:
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=[sentence]
    )
    embeddings.append((sentence, response["data"][0]["embedding"]))

I am getting an error on dimensions.

image

When I try fixing it by adding an id, it then gives another unexpected error of this format.
image

How could this be resolved?
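
A sketch of upserting the embeddings built above (assumptions: the collection is named "sentences", and the dimension error comes from the collection not being created with ada-002's 1536 dimensions). Each record needs an (id, vector, metadata) shape, as in the README example:

# text-embedding-ada-002 vectors are 1536-dimensional
sentences = vx.get_or_create_collection(name="sentences", dimension=1536)

records = [
    (f"sentence_{i}", embedding, {"text": sentence})
    for i, (sentence, embedding) in enumerate(embeddings)
]
sentences.upsert(records=records)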

Suggestion: Use the containment operator to speed up metadata equality

Hi @olirice, thanks for creating this library!

Using the containment operator (@>) for equality should speed up metadata filtering, as it'll allow the GIN index to be used. From the docs: "The default GIN operator class for jsonb supports queries with the key-exists operators ?, ?| and ?&, the containment operator @>, and the jsonpath match operators @? and @@."

You can also use jsonb_path_ops for a smaller index size. "The non-default GIN operator class jsonb_path_ops does not support the key-exists operators, but it does support @>, @? and @@."

import json
import psycopg
import random

conn = psycopg.connect(dbname='vecs', autocommit=True)
conn.execute('CREATE TABLE docs (id bigserial PRIMARY KEY, metadata jsonb)')

for _ in range(100):
    metadata = []
    for _ in range(10000):
        md = {
            'a': random.randint(1, 10000),
            'b': random.choice(['value' + str(i) for i in range(1, 100)])
        }
        metadata.append(json.dumps(md))

    sql = 'INSERT INTO docs (metadata) VALUES ' + ','.join(['(%s)' for _ in metadata])
    conn.execute(sql, metadata)

conn.execute('CREATE INDEX ON docs USING gin (metadata jsonb_path_ops)')

# no index
res = conn.execute("EXPLAIN ANALYZE SELECT * FROM docs WHERE metadata -> 'a' = '1'").fetchall()
print('\n'.join([r[0] for r in res]))

# index
res = conn.execute("""EXPLAIN ANALYZE SELECT * FROM docs WHERE metadata @> '{"a": 1}'""").fetchall()
print('\n'.join([r[0] for r in res]))

# no index
res = conn.execute("""EXPLAIN ANALYZE SELECT * FROM docs WHERE metadata -> 'b' = '"value1"'""").fetchall()
print('\n'.join([r[0] for r in res]))

# index
res = conn.execute("""EXPLAIN ANALYZE SELECT * FROM docs WHERE metadata @> '{"b": "value1"}'""").fetchall()
print('\n'.join([r[0] for r in res]))

Edit: More from the docs on jsonb_path_ops

Although the jsonb_path_ops operator class supports only queries with the @>, @? and @@ operators, it has notable performance advantages over the default operator class jsonb_ops. A jsonb_path_ops index is usually much smaller than a jsonb_ops index over the same data, and the specificity of searches is better, particularly when queries contain keys that appear frequently in the data. Therefore search operations typically perform better than with the default operator class.

Feature Request: Async Client

Feature Request: Async Client

Most of what vecs manages involves interacting with a database over a network. SQLAlchemy supports async engines (e.g. via asyncpg or psycopg), but vecs does not. Creating an async client would avoid blocking the Python interpreter while waiting on those external systems and would enable much better performance in async environments, e.g. an async-enabled web server like FastAPI. That use case will be particularly important if vecs users deploy the Collection.query or Collection.upsert methods behind an API.

Potential Usage:

vx = await vecs.create_async_client(DB_CONNECTION)
...
docs = await vx.create_collection(name="docs", dimension=3)
...
# same for `upsert`, `create_index`, and `query`

Add `ef_construction` and `m` as optional parameters when building HNSW index

Summary

Currently the HNSW index creation does not specify (or allow specifying) the parameters m (max number of connections per layer) and ef_construction (size of the dynamic candidate list when constructing the graph) when creating the index. This means that the HNSW index is built with the default pgvector values of m=16 and ef_construction=64. It would be great to be able to set these when instantiating the HNSW index using the vecs collection.

Rationale

These parameters can have important effects on the index performance.

Design

One possible implementation would be to create an IndexParameters class, which would be accepted by the create_index() function (alongside the IndexMeasure and IndexMethod) which would contain optional parameters m:int=16 and ef_construction:int=64 (pgvector defaults).

And then inserted into the SQL command sent to pgvector as:

f"""
create index ix_{ops}_hnsw_{unique_string}
on vecs."{self.table.name}"
using hnsw (vec {ops}) with (m = {m}, ef_construction = {ef_construction});
"""

The IndexParameters class could also be used to give users fine-grained control over the number of lists used when creating the IVFFlat index. An n_lists: Optional[int] = None parameter would be added to the IndexParameters class and, in the absence of a value supplied by the user, the n_lists value would be calculated as it currently is:

n_lists = (
    int(max(n_records / 1000, 30))
    if n_records < 1_000_000
    else int(math.sqrt(n_records))
)

Lastly, a warning can be raised if the user supplies m and ef_construction but specifies an ivfflat index or, inversely, if the user supplies a value for n_lists when specifying the hnsw index type. Only a warning is needed since the default values would be applied in this case.
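
A rough sketch of the proposed container (names follow the issue text and the defaults are the pgvector defaults; this is not part of the current vecs API):

from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexParameters:
    m: int = 16                     # HNSW: max connections per layer
    ef_construction: int = 64       # HNSW: candidate list size while building the graph
    n_lists: Optional[int] = None   # IVFFlat: number of lists; None keeps the current heuristic

A call like create_index(parameters=IndexParameters(m=32, ef_construction=128)) (hypothetical) would then feed the values into the SQL shown above.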

It's a fairly small/self-contained feature so I'm happy to submit a PR!

Cheers!

Markdown `AdapterStep` for chunking by heading

Context

Markdown is a common format for documents ingested into vector systems and has more exploitable structure than simple text.

This task is to create a vecs.adapters.base.AdapterStep that handles chunking markdown by heading.

Ideally it would also accept parameters for

  • The maximum number of words in each chunk

e.g.

from vecs.adapters import MarkdownChunker

MarkdownChunker(
  max_tokens=512
)
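
A rough sketch of the chunking logic only, splitting at markdown headings and re-splitting oversized sections by word count (the AdapterStep wiring, exact parameter names, and token-based sizing are left out):

import re
from typing import Iterator


def chunk_markdown_by_heading(text: str, max_words: int = 512) -> Iterator[str]:
    # split at lines that start a heading, keeping each heading with its section
    sections = re.split(r"(?m)^(?=#{1,6}\s)", text)
    for section in sections:
        words = section.split()
        if not words:
            continue
        # re-split any section that exceeds the word budget
        # (note: this simplification normalizes whitespace within a chunk)
        for i in range(0, len(words), max_words):
            yield " ".join(words[i : i + max_words])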
