
pinecone-datasets's Introduction

Pinecone Datasets

Install

pip install pinecone-datasets

Usage - Loading

You can use Pinecone Datasets to load our public datasets or to work with your own datasets. The library can be used in two main ways: ad-hoc loading of a dataset from a path, or as a catalog loader for datasets.

Loading Pinecone Public Datasets (catalog)

Pinecone hosts a public datasets catalog. You can load a dataset by name using the list_datasets and load_dataset functions, which use the default catalog endpoint (currently GCS) to list and load datasets.

from pinecone_datasets import list_datasets, load_dataset

list_datasets()
# ["quora_all-MiniLM-L6-bm25", ... ]

dataset = load_dataset("quora_all-MiniLM-L6-bm25")

dataset.head()

# Prints
# ┌─────┬───────────────────────────┬─────────────────────────────────────┬───────────────────┬──────┐
# │ id  ┆ values                    ┆ sparse_values                       ┆ metadata          ┆ blob │
# │     ┆                           ┆                                     ┆                   ┆      │
# │ str ┆ list[f32]                 ┆ struct[2]                           ┆ struct[3]         ┆      │
# ╞═════╪═══════════════════════════╪═════════════════════════════════════╪═══════════════════╪══════╡
# │ 0   ┆ [0.118014, -0.069717, ... ┆ {[470065541, 52922727, ... 22364... ┆ {2017,12,"other"} ┆ .... │
# │     ┆ 0.0060...                 ┆                                     ┆                   ┆      │
# └─────┴───────────────────────────┴─────────────────────────────────────┴───────────────────┴──────┘

Expected dataset structure

Pinecone Datasets can load a dataset from any storage it has access to (using the default credentials for S3, GCS, or local permissions).

Data is expected to be uploaded with the following directory structure:

├── my-subdir                     # base path containing all datasets
│   ├── my-dataset                # name of dataset
│   │   ├── metadata.json         # dataset metadata (optional, needed only for catalog listing)
│   │   ├── documents             # datasets documents
│   │   │   ├── file1.parquet      
│   │   │   └── file2.parquet      
│   │   ├── queries               # dataset queries
│   │   │   ├── file1.parquet  
│   │   │   └── file2.parquet   
└── ...

The data schema is expected to be as follows:

  • The documents directory contains parquet files with the following schema (a minimal writing sketch is shown after this list):
    • Mandatory: id: str, values: list[float]
    • Optional: sparse_values: Dict: indices: List[int], values: List[float], metadata: Dict, blob: dict
      • Note: blob is a dict that can contain any data. It is not returned when iterating over the dataset and is intended for storing additional data that is not part of the dataset schema, for example a document's text. In a future version this may become a first-class citizen in the dataset schema.
  • The queries directory contains parquet files with the following schema:
    • Mandatory: vector: list[float], top_k: int
    • Optional: sparse_vector: Dict: indices: List[int], values: List[float], filter: Dict
      • Note: filter is a dict that contains Pinecone metadata filters; see the Pinecone documentation for more information.
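A hedged sketch of producing documents and queries parquet files that follow this layout locally, using pandas (directory and file names are placeholders):

from pathlib import Path

import pandas as pd

# Create the expected directory layout locally.
Path("my-dataset/documents").mkdir(parents=True, exist_ok=True)
Path("my-dataset/queries").mkdir(parents=True, exist_ok=True)

# Documents: "id" and "values" are mandatory; the other columns are optional.
documents = pd.DataFrame(
    {
        "id": ["doc-0", "doc-1"],
        "values": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
        "sparse_values": [
            {"indices": [470065541, 52922727], "values": [0.5, 0.5]},
            {"indices": [22364], "values": [1.0]},
        ],
        "metadata": [{"year": 2017}, {"year": 2018}],
        "blob": [{"text": "first document"}, {"text": "second document"}],
    }
)
documents.to_parquet("my-dataset/documents/file1.parquet")

# Queries: "vector" and "top_k" are mandatory; "sparse_vector" and "filter" are optional.
queries = pd.DataFrame(
    {
        "vector": [[0.1, 0.2, 0.3]],
        "top_k": [5],
        "filter": [{"year": {"$eq": 2017}}],
    }
)
queries.to_parquet("my-dataset/queries/file1.parquet")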

In addition, a metadata file is expected to be present in the dataset directory, for example: s3://my-bucket/my-dataset/metadata.json

from pinecone_datasets.catalog import DatasetMetadata

meta = DatasetMetadata(
    name="test_dataset",
    created_at="2023-02-17 14:17:01.481785",
    documents=2,
    queries=2,
    source="manual",
    bucket="LOCAL",
    task="unittests",
    dense_model={"name": "bert", "dimension": 3},
    sparse_model={"name": "bm25"},
)

The full metadata schema can be found in pinecone_datasets.catalog.DatasetMetadata.schema.
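Assuming DatasetMetadata is a pydantic-style model (as the .schema reference above suggests), the full schema can be printed like this:

import json

from pinecone_datasets.catalog import DatasetMetadata

# Print the JSON schema of the metadata model.
print(json.dumps(DatasetMetadata.schema(), indent=2))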

Loading your own dataset from catalog

To set your own catalog endpoint, set the environment variable DATASETS_CATALOG_BASEPATH to your bucket. Note that pinecone uses the default authentication method for the storage type (gcsfs for GCS and s3fs for S3).

export DATASETS_CATALOG_BASEPATH="s3://my-bucket/my-subdir"
from pinecone_datasets import list_datasets, load_dataset

list_datasets()

# ["my-dataset", ... ]

dataset = load_dataset("my-dataset")

Additionally, you can load a dataset using the Dataset class directly:

from pinecone_datasets import Dataset

dataset = Dataset.from_catalog("my-dataset")

Loading your own dataset from path

You can load your own dataset from a local path or a remote path (GCS or S3). Note that pinecone uses the default authentication method for the storage type (gcsfs for GCS and s3fs for S3).

from pinecone_datasets import Dataset

dataset = Dataset.from_path("s3://my-bucket/my-subdir/my-dataset")

This assumes that the path is structured as described in the Expected dataset structure section.

Loading from a pandas dataframe

Pinecone Datasets enables you to load a dataset from a pandas dataframe. This is useful for loading a dataset from a local file and saving it to a remote storage. The minimal required data is a documents dataset, and the minimal required columns are id and values. The id column is a unique identifier for the document, and the values column is a list of floats representing the document vector.

import pandas as pd

from pinecone_datasets import Dataset
from pinecone_datasets.catalog import DatasetMetadata

df = pd.read_parquet("my-dataset.parquet")

metadata = DatasetMetadata(**metadata_dict)

dataset = Dataset.from_pandas(documents=df, queries=None, metadata=metadata)

Please check the documentation for more information on the expected dataframe schema. There's also a column mapping variable that can be used to map the dataframe columns to the expected schema.
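If your dataframe uses different column names, a simple alternative (independent of the library's own column-mapping support) is to rename the columns with pandas before loading; doc_id and embedding below are hypothetical source names:

import pandas as pd

# Rename hypothetical source columns to the expected "id" and "values" schema.
raw = pd.read_parquet("my-dataset.parquet")
df = raw.rename(columns={"doc_id": "id", "embedding": "values"})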

Usage - Accessing data

Pinecone Datasets is built on top of pandas, so you can use the full pandas API to access the data. In addition, we provide some helper functions for accessing the data in a more convenient way.

Accessing documents and queries dataframes

The documents and queries dataframes are accessed using the documents and queries properties. These properties are lazy and only load the data when accessed.

document_df: pd.DataFrame = dataset.documents

query_df: pd.DataFrame = dataset.queries
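Because documents and queries are plain pandas dataframes, the usual pandas operations apply, for example:

# Number of documents and the dimension of the first vector.
print(len(dataset.documents))
print(len(dataset.documents["values"].iloc[0]))

# Preview a subset of columns.
preview = dataset.documents[["id", "values"]].head(10)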

Usage - Iterating

One of the main use cases for Pinecone Datasets is iterating over a dataset. This is useful for upserting a dataset to an index, or for benchmarking. It is also useful for iterating over large datasets - as of today, datasets are not yet lazy, however we are working on it.

# Batch iterator: yields lists of batch_size dicts with keys ("id", "values", "sparse_values", "metadata")
dataset.iter_documents(batch_size=n)

# Dict iterator: yields dicts with keys ("vector", "sparse_vector", "filter", "top_k")
dataset.iter_queries()
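For example, iter_documents can drive a manual upsert loop with the Pinecone client. A hedged sketch, assuming a v3+ pinecone client, an existing index named "my-index", and that each batch dict carries the fields the index expects (the to_pinecone_index helper below does this for you):

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("my-index")

# Each batch is a list of dicts with "id", "values", "sparse_values" and "metadata" keys.
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(vectors=batch)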

The 'blob' column

Pinecone datasets ship with a blob column, which is intended for storing additional data that is not part of the dataset schema, for example a document's text. We added a utility function to move data from the blob column to the metadata column. This is useful, for example, when upserting a dataset to an index and you want to use the metadata to store text data.

from pinecone_datasets import import_documents_keys_from_blob_to_metadata

new_dataset = import_documents_keys_from_blob_to_metadata(dataset, keys=["text"])

Usage - Saving

You can save your dataset to a catalog managed by you, or to a local or remote path (GCS or S3).

Saving to Catalog

To set your own catalog endpoint, set the environment variable DATASETS_CATALOG_BASEPATH to your bucket. Note that pinecone uses the default authentication method for the storage type (gcsfs for GCS and s3fs for S3).

After this environment variable is set, you can save your dataset to the catalog using the to_catalog function:

from pinecone_datasets import Dataset

metadata = DatasetMetadata(**{"name": "my-dataset", ...})

🚨 NOTE Dataset name in the metadata must match the dataset_id parameter you pass to the catalog, in this example 'my-dataset'


dataset = Dataset.from_pandas(documents, queries, metadata)
dataset.to_catalog("my-dataset")

Saving to Path

You can save your dataset to a local path or a remote path (GCS or S3). Note that pinecone uses the default authentication method for the storage type (gcsfs for GCS and s3fs for S3).

dataset = Dataset.from_pandas(documents, queries, metadata)
dataset.to_path("s3://my-bucket/my-subdir/my-dataset")

Upserting to an Index

When upserting a Dataset to an Index, only the document data will be upserted to the index. The queries data will be ignored.

TODO: add example for API Key and Environment Variables

ds = load_dataset("dataset_name")

ds.to_pinecone_index("index_name")

# or, if you run in a notebook environment

await ds.to_pinecone_index_async("index_name")

The to_pinecone_index function also accepts additional parameters (see the sketch after this list):

  • batch_size - for controlling the upsert process
  • api_key - for passing your API key; otherwise you can set the PINECONE_API_KEY environment variable
  • kwargs - for passing additional parameters to the index creation process
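A hedged usage sketch (the index name is a placeholder; passing the key explicitly is optional if PINECONE_API_KEY is set):

from pinecone_datasets import load_dataset

ds = load_dataset("quora_all-MiniLM-L6-bm25")

# Upsert the documents in batches of 100, passing the API key explicitly.
ds.to_pinecone_index("my-index", batch_size=100, api_key="YOUR_API_KEY")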

For developers

This project uses poetry for dependency management. Supported Python versions are 3.8+. To start developing, run the following in the project root directory:

poetry install --with dev

To run the tests locally, run:

poetry run pytest --cov pinecone_datasets

pinecone-datasets's People

Contributors

daverigby, igiloh-pinecone, jamescalam, miararoy


pinecone-datasets's Issues

[Bug] CI workflow for PR branches always fails

Is this a new bug?

  • I believe this is a new bug
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

PR #23 introduced a new CI workflow that runs on PR branches, but never ran this workflow itself.
Now that this code is on main, any new PR branch automatically fails - see example

Expected Behavior

PR CI workflow passes if code is correct

Steps To Reproduce

Open a PR

[Feature] Add asyncio support

Is this your first time submitting a feature request?

  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing functionality

Describe the feature

Currently, load_dataset, list_datasets and to_pinecone_index functions are not async. These are potentially long running functions that might block the main thread for most asyncio applications. The goal is to add support for async equivalent of these functions.

list_datasets: gcsfs and s3fs are async compatible so it should be relatively easy to add async equivalents.

to_pinecone_index: Might require Pinecone Client 3.0 so we might need to wait until it is stable.

load_dataset: We need to improve the functionality here. Currently load_dataset does not actually load the dataset but just creates a Dataset object, which might be confusing for users. Long-running tasks should be clear to the user, and downloads should be explicit. (Currently the download happens on property access to queries/documents or by calling the head function.)
See https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris as an example.

In order to change this functionality I suggest changing load_dataset to get_dataset_loader (or another name) and creating two functions to fetch queries or documents such as dataset_loader.load_documents (async) and dataset_loader.load_queries (async). In that case we might need to deprecate load_dataset but keep several versions with a DeprecationWarning. We might also need some refactor.
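As an interim workaround (not part of the proposal above), blocking calls can already be offloaded to a worker thread on Python 3.9+; a minimal sketch:

import asyncio

from pinecone_datasets import list_datasets, load_dataset

async def main() -> None:
    # Run the blocking catalog calls in a worker thread so the event loop stays responsive.
    names = await asyncio.to_thread(list_datasets)
    dataset = await asyncio.to_thread(load_dataset, names[0])
    # Note: per the issue above, the actual download still happens lazily on property access.
    print(dataset.head())

asyncio.run(main())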

Describe alternatives you've considered

We can keep the API as is but as asyncio is becoming more and more popular I think it is a good idea to catch up.

Who will this benefit?

to_pinecone_index_async will be especially useful for big bulk upserts. The other changes will improve the user experience.

Are you interested in contributing this feature?

Sure, I think we need to have a discussion first and plan the changes properly.

Anything else?

No response

[Feature] support for dataset creation with sentence transformers

Is this your first time submitting a feature request?

  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing functionality

Describe the feature

@miararoy @igiloh-pinecone

It would be nice to do something like:

sentences = [
   "How do I get a replacement Medicare card?",
   "What is the monthly premium for Medicare Part B?"
]

dataset = Dataset.from_sentence_transformers(
    'sentence-transformers/all-MiniLM-L6-v2',
    sentences
)

There are more than 124 embedding models that work with the Sentence Transformers library.

I created a PR here to start some discussion.
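Until such an API exists, a similar result can be assembled today from Sentence Transformers and Dataset.from_pandas. A hedged sketch, reusing the sentences and model name from the example above and the metadata fields from the README example (dataset name, timestamp and metadata values are placeholders):

import pandas as pd
from sentence_transformers import SentenceTransformer

from pinecone_datasets import Dataset
from pinecone_datasets.catalog import DatasetMetadata

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(sentences)  # shape: (len(sentences), 384)

documents = pd.DataFrame(
    {
        "id": [str(i) for i in range(len(sentences))],
        "values": [e.tolist() for e in embeddings],
        "blob": [{"text": s} for s in sentences],
    }
)

metadata = DatasetMetadata(
    name="sentences-all-MiniLM-L6-v2",
    created_at="2023-02-17 14:17:01.481785",  # placeholder timestamp
    documents=len(documents),
    queries=0,
    source="manual",
    bucket="LOCAL",
    task="demo",
    dense_model={"name": "sentence-transformers/all-MiniLM-L6-v2", "dimension": 384},
    sparse_model=None,
)

dataset = Dataset.from_pandas(documents=documents, queries=None, metadata=metadata)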

Describe alternatives you've considered

No response

Who will this benefit?

I believe, that developers who would like to create sentences databases.

Are you interested in contributing this feature?

yes

Anything else?

I created a PR here

[Feature] Add other compatible S3 providers (Cloudflare R2, minIO, localhost etc..)

Is this your first time submitting a feature request?

  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing functionality

Describe the feature

Hi,
my name is Tomer Shalev and I would like to propose a feature which I believe
can bring a lot of value for our users.

It is highly desirable for organizations to use:

  • A more cost effective S3 provider such as Cloudflare R2
  • Or a self hosted S3 (such as minIO)

While I was playing with this library, I wanted to use my Cloudflare R2 buckets, but I couldn't because the current code does not support general HTTP endpoints.

I created a PR for this in order to engage and propose a simple solution:
#30
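For context (and not the PR's implementation), the underlying s3fs filesystem already supports pointing at an S3-compatible endpoint, so such support could plausibly build on something like the following hedged sketch; the endpoint URL, credentials, and bucket name are placeholders:

import s3fs

# S3-compatible endpoint (e.g. Cloudflare R2 or a local MinIO instance).
fs = s3fs.S3FileSystem(
    key="ACCESS_KEY_ID",
    secret="SECRET_ACCESS_KEY",
    client_kwargs={"endpoint_url": "https://<account-id>.r2.cloudflarestorage.com"},
)

# List a bucket on the alternative provider.
print(fs.ls("my-bucket"))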

Describe alternatives you've considered

Unfortunately, there are no alternative solutions for this.

Who will this benefit?

Anyone who is:

  1. Trying to be more cost effective with storage solutions.
  2. Under regulation and cannot use Amazon S3 or Google Storage.
  3. Obliged to self host data.

Are you interested in contributing this feature?

#30

Anything else?

No response

[Bug] Dependency issues with Python 3.12

Is this a new bug?

  • I believe this is a new bug
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When installing pinecone-datasets via pip with Python 3.12 on MacOS 14, I run into errors around the installation of pyarrow. While researching the problem I found this issue on the PyArrow Github. It was closed last year as they expected to build wheels for PyArrow on Python 3.12 but it does not appear these have been published.

Switching to Python 3.11 allows the package to be installed normally.

Expected Behavior

Expect to install pinecone-datasets successfully.

Steps To Reproduce

  1. On MacOS, ensure that Python 3.12 is installed.
  2. Run python3.12 -m venv .venv3.12 to create a virtual environment specifically for 3.12.
  3. Source the venv source .venv3.12/bin/activate.
  4. Attempt to install pip install pinecone-datasets.

Relevant log output

Building wheels for collected packages: pyarrow
  Building wheel for pyarrow (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for pyarrow (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [434 lines of output]
      <string>:34: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
      WARNING setuptools_scm.pyproject_reading toml section missing 'pyproject.toml does not contain a tool.setuptools_scm section'
      Traceback (most recent call last):
        File "/private/var/folders/lw/66k9ftls4cs_4vtpblj6bkr00000gn/T/pip-build-env-yda_nrue/overlay/lib/python3.12/site-packages/setuptools_scm/_integration/pyproject_reading.py", line 36, in read_pyproject
          section = defn.get("tool", {})[tool_name]
                    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
...
pyarrow/error.pxi:82:60: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      
      Error compiling Cython file:
      ------------------------------------------------------------
      ...
                  raise ArrowException(message)
      
      
      # This is an API function for C++ PyArrow
      cdef api int pyarrow_internal_check_status(const CStatus& status) \
              nogil except -1:
                             ^
      ------------------------------------------------------------
      
      pyarrow/error.pxi:143:23: The keyword 'nogil' should appear at the end of the function signature line. Placing it before 'except' or 'noexcept' will be disallowed in a future version of Cython.
      
      gmake[2]: *** [CMakeFiles/lib_pyx.dir/build.make:71: CMakeFiles/lib_pyx] Error 1
      gmake[1]: *** [CMakeFiles/Makefile2:137: CMakeFiles/lib_pyx.dir/all] Error 2
      gmake: *** [Makefile:136: all] Error 2
      error: command '/opt/homebrew/bin/cmake' failed with exit code 2
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for pyarrow
Failed to build pyarrow
ERROR: Could not build wheels for pyarrow, which is required to install pyproject.toml-based projects

Environment

- **OS**: MacOS 14
- **Language version**: Python 3.12.4
- **Pinecone client version**: N/A

Additional Context

No response

[Bug] HttpError : Invalid bucket name: 'wikipedia-simple-text-embedding-ada-002-100K', 400

Is this a new bug?

  • I believe this is a new bug
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Hi,

I have used code from one of the example colab Notebook on RAG with langchain to make a lab for students on vector databases.

A minority of the students encountered the following error when importing the wikipedia-simple-text-embedding-ada-002-100K dataset from pinecone_datasets:
[screenshots: HttpError: Invalid bucket name: 'wikipedia-simple-text-embedding-ada-002-100K', 400]

Expected Behavior

This cell is supposed to run and import the dataset (it works on my laptop and for most of the students).

Steps To Reproduce

In python 3.11 with the packages versions described later run pinecone_datasets.load_dataset('wikipedia-simple-text-embedding-ada-002-100K ')

Relevant log output

No response

Environment

- **OS**: multiple (Windows and MacOS)
- **Language version**: python 3.11
- **Pinecone client version**: pinecone_datasets==0.6.2

Additional Context

None of our troubleshooting attempts worked, and we have not identified the common denominator that leads to this error. When using the list_datasets() method, wikipedia-simple-text-embedding-ada-002-100K appears in the list, so we suspect it might be a server-side error.

[Bug] Unable to load yfcc-10M-filter-euclidean dataset

Is this a new bug?

  • I believe this is a new bug
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

I get the error FileNotFoundError: Dataset does not exist. Please check the path or dataset_id when trying to load the yfcc-10M-filter-euclidean dataset.

Expected Behavior

The dataset should load, as it's available within list_datasets().

Steps To Reproduce

from pinecone_datasets import list_datasets, load_dataset

datasets = list_datasets()
dataset_name =  "yfcc-10M-filter-euclidean"
assert dataset_name in datasets, "Dataset does not exists!"
dataset = load_dataset(dataset_name)

Relevant log output

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[6], line 1
----> 1 load_dataset('yfcc-10M-filter-euclidean')

File ~/vector_db_benchmark/venv/lib/python3.10/site-packages/pinecone_datasets/public.py:59, in load_dataset(dataset_id, **kwargs)
     57     raise FileNotFoundError(f"Dataset {dataset_id} not found in catalog")
     58 else:
---> 59     return Dataset.from_catalog(dataset_id, **kwargs)

File ~/vector_db_benchmark/venv/lib/python3.10/site-packages/pinecone_datasets/dataset.py:89, in Dataset.from_catalog(cls, dataset_id, catalog_base_path, **kwargs)
     83 catalog_base_path = (
     84     catalog_base_path
     85     if catalog_base_path
     86     else os.environ.get("DATASETS_CATALOG_BASEPATH", cfg.Storage.endpoint)
     87 )
     88 dataset_path = os.path.join(catalog_base_path, f"{dataset_id}")
---> 89 return cls(dataset_path=dataset_path, **kwargs)

File ~/vector_db_benchmark/venv/lib/python3.10/site-packages/pinecone_datasets/dataset.py:190, in Dataset.__init__(self, dataset_path, **kwargs)
    188     self._dataset_path = dataset_path
    189     if not self._fs.exists(self._dataset_path):
--> 190         raise FileNotFoundError(
    191             "Dataset does not exist. Please check the path or dataset_id"
    192         )
    193 else:
    194     self._fs = None

FileNotFoundError: Dataset does not exist. Please check the path or dataset_id

Environment

- **OS**: macOS 14.4.1
- **Language version**: Python 3.10.10
- **Pinecone client version**: 0.7.0

Additional Context

Looking at the metadata for the dataset:

from pinecone_datasets import list_datasets, load_dataset

datasets = list_datasets(as_df=True)
dataset_name =  "yfcc-10M-filter-euclidean"
datasets.query('name == @dataset_name').to_dict()

The results show that no bucket is recorded for this dataset:

{'name': {27: 'yfcc-10M-filter-euclidean'},
 'created_at': {27: '2023-08-24 13:51:29.136759'},
 'documents': {27: 10000000},
 'queries': {27: 100000},
 'source': {27: 'big-ann-challenge 2023'},
 'license': {27: None},
 'bucket': {27: None},
 'task': {27: None},
 'dense_model': {27: {'name': 'yfcc', 'tokenizer': None, 'dimension': 192}},
 'sparse_model': {27: None},
 'description': {27: 'Dataset from the 2023 big ann challenge - filter track. Distance: Euclidean. see https://big-ann-benchmarks.com/neurips23.html'},
 'tags': {27: None},
 'args': {27: None}}
