
embedding_reader


Embedding reader is a module that makes it easy to efficiently read a large collection of embeddings stored in any file system.

  • 400GB of embeddings read in 8min using an NVMe drive
  • 400GB of embeddings read in 40min using an HDD drive
  • 400GB of embeddings read in 1.3h from AWS S3

Install

pip install embedding_reader

Python examples

Check out these examples of calling this as a library:

Simple example

from embedding_reader import EmbeddingReader

embedding_reader = EmbeddingReader(embeddings_folder="embedding_folder", file_format="npy")

print("embedding count", embedding_reader.count)
print("dimension", embedding_reader.dimension)
print("total size", embedding_reader.total_size)
print("byte per item", embedding_reader.byte_per_item)

for emb, meta in embedding_reader(batch_size=10 ** 6, start=0, end=embedding_reader.count):
    print(emb.shape)

Laion5B example

Laion5B contains 5B ViT-L/14 image embeddings; you can read them with this code:

from embedding_reader import EmbeddingReader

embedding_reader = EmbeddingReader(embeddings_folder="https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/img_emb/", file_format="npy")

print("embedding count", embedding_reader.count)
print("dimension", embedding_reader.dimension)
print("total size", embedding_reader.total_size)
print("byte per item", embedding_reader.byte_per_item)

for emb, meta in embedding_reader(batch_size=10 ** 6, start=0, end=embedding_reader.count):
    print(emb.shape)

Reading the laion2B-en embeddings takes about 3h at 300MB/s.
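As a rough sanity check on that figure (assuming ~2.3B embeddings of dimension 768 stored in float16):

# Back-of-the-envelope check; count, dimension and dtype are assumptions.
count, dim, itemsize = 2_300_000_000, 768, 2
total_bytes = count * dim * itemsize          # ≈ 3.5TB
hours = total_bytes / (300 * 10**6) / 3600    # ≈ 3.3h at 300MB/s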

Numpy & Parquet Metadata Example

The parquet_npy format supports reading from both a .npy collection and a .parquet collection that are in the same order. Here is an example of usage:

from embedding_reader import EmbeddingReader

embedding_reader = EmbeddingReader(
    embeddings_folder="embedding_folder",
    metadata_folder="metadata_folder",
    meta_columns=['image_path', 'caption'],
    file_format="parquet_npy"
)

for emb, meta in embedding_reader(batch_size=10 ** 6, start=0, end=embedding_reader.count):
    print(emb.shape)
    print(meta["image_path"], meta["caption"])

emb is a numpy array, as in the previous examples, while meta is a pandas DataFrame with the columns requested in meta_columns.
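Since the rows of meta are positionally aligned with the rows of emb, a boolean mask computed on the metadata can index the embeddings directly. A minimal sketch (the caption filter is a hypothetical example):

# Hypothetical per-batch filtering; assumes meta rows align with emb rows.
mask = meta["caption"].str.contains("cat", na=False).to_numpy()
cat_embeddings = emb[mask]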

Who is using embedding reader?

Some use cases of embedding reader include:

  • building knn indices in autofaiss
  • computing zero shot attributes using clip
  • running training or inferences of linear layer models on top of embeddings

Embeddings are a powerful concept: they turn highly complex data into points in a linearly separable space. Embeddings are also much smaller and more efficient to manipulate than the raw data (images, audio, video, text, interaction items, ...).

To learn more about embeddings, read the Semantic search blog post.

File system support

Thanks to fsspec, embedding_reader supports reading and writing files in many file systems. To use it, simply use the prefix of your filesystem before the path. For example hdfs://, s3://, http://, or gcs://. Some of these file systems require installing an additional package (for example s3fs for s3, gcsfs for gcs). See fsspec doc for all the details.
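For example, reading from S3 might look like this (hypothetical bucket and prefix; requires pip install s3fs):

from embedding_reader import EmbeddingReader

# "s3://my-bucket/embeddings/img_emb/" is a placeholder path.
embedding_reader = EmbeddingReader(
    embeddings_folder="s3://my-bucket/embeddings/img_emb/",
    file_format="npy",
)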

API

This module exposes one class:

EmbeddingReader(embeddings_folder, file_format, embedding_column="embedding", meta_columns=None, metadata_folder=None)

Initializes the reader by listing all files and retrieving their metadata.

  • embeddings_folder the embeddings folder. Can also be a list of folders. (required)
  • file_format parquet, npy or parquet_npy. (required)
  • embedding_column embedding column in parquet. (default embedding)
  • meta_columns meta columns in parquet. (default None)
  • metadata_folder metadata folder, used by the parquet_npy reader (default None)
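For example, since embeddings_folder can be a list, several folders can be read as a single collection (the paths below are placeholders):

from embedding_reader import EmbeddingReader

# Presumably both folders must contain files of the same format and dimension.
reader = EmbeddingReader(
    embeddings_folder=["s3://my-bucket/part1/", "s3://my-bucket/part2/"],
    file_format="npy",
)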

.embeddings_folder

the embedding folder

.count

total number of embeddings in this folder

.dimension

dimension of one embedding

.byte_per_item

size of one embedding in bytes

.total_size

size in bytes of the collection

.nb_files

total number of embedding files in this folder

.max_file_size

max size in bytes of the embedding files of the collection

__call__(batch_size, start=0, end=None, max_piece_size=None, parallel_pieces=None, show_progress=True, max_ram_usage_in_bytes=2**32)

Produces an iterator that yields (data, meta) tuples of the given batch_size.

  • batch_size number of embeddings in one batch. (required)
  • start start of the subset of the collection to read. (default 0)
  • end end of the subset of the collection to read. (default end of collection)
  • max_piece_size maximum size of a piece, in number of embeddings. The default value works for most cases; increase or decrease it based on your file system's performance. (default max(number of embeddings that fit in 50MB, batch size))
  • parallel_pieces number of pieces to read in parallel. Increase or decrease depending on your filesystem. (default min(round(max_ram_usage_in_bytes / (max_piece_size * byte_per_item)), 50))
  • show_progress display a tqdm bar with the number of pieces done. (default True)
  • max_ram_usage_in_bytes constrains the RAM usage of embedding reader. The exact max RAM usage is min(max_ram_usage_in_bytes, size of a batch in bytes). (default 4GB, i.e. 2**32)
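For example, a read tuned for a slow filesystem and limited RAM might look like this (the values are illustrative, not recommendations):

from embedding_reader import EmbeddingReader

# "embedding_folder" as in the Simple example above.
embedding_reader = EmbeddingReader(embeddings_folder="embedding_folder", file_format="npy")
for emb, meta in embedding_reader(
    batch_size=10**5,
    max_piece_size=10_000,          # assumed to be counted in embeddings
    parallel_pieces=10,
    max_ram_usage_in_bytes=2**30,   # 1GB cap
):
    print(emb.shape)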

Architecture notes and benchmark

The main architectural choice of this lib is the build_pieces function, which initially builds decently sized pieces (typically 50MB) from the embedding files. The metadata of these pieces can then be used to fetch them in parallel; the pieces are then assembled into embedding batches and handed to the user. To reach maximal speed, it is best to read files of equal size. The number of threads used is constrained by the maximum size of your embedding files: the smaller the files, the more threads are used (you can also set a custom number of threads, but RAM consumption will be higher).
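To illustrate the idea (a minimal sketch, not the library's actual build_pieces), splitting a file of count embeddings into bounded pieces could look like:

# Sketch only: compute (start, end) row ranges of at most max_piece_size embeddings.
def build_pieces_sketch(count, max_piece_size):
    return [(start, min(start + max_piece_size, count))
            for start in range(0, count, max_piece_size)]

# build_pieces_sketch(1_000_000, 300_000)
# [(0, 300000), (300000, 600000), (600000, 900000), (900000, 1000000)]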

In practice, speeds of up to 100MB/s have been observed when fetching embeddings from S3, and 1GB/s when fetching from an NVMe drive. That means reading 400GB of embeddings (400M embeddings in float16 with dimension 512) in 8 minutes. Memory usage stays low and flat thanks to the absence of copies. Decreasing the batch size decreases the amount of memory consumed; you can also set max_ram_usage_in_bytes for finer control over RAM usage.
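The 8-minute figure checks out arithmetically:

# 400M float16 embeddings of dimension 512, read at 1GB/s.
total_bytes = 400_000_000 * 512 * 2     # ≈ 409.6GB
minutes = total_bytes / 10**9 / 60      # ≈ 6.8 min of pure I/O, i.e. ~8 min in practice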

For development

Either locally, or in gitpod (do export PIP_USER=false there)

Setup a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

To run the tests:

pip install -r requirements-test.txt

then

make lint
make test

You can use make black to reformat the code.

Use python -m pytest -x -s -v tests -k "dummy" to run a specific test.


Contributors

bencwallace, dotlambda, frikallo, hitchhicker, rom1504, veldrovive, victor-paltz


embedding-reader's Issues

str() causes .npy file header to fail regex

I'm using the attached npy embedding. This npy file has shape (768,), was computed using CLIP ViT-L/14, and was saved using the function below:

  import numpy as np

  path = some_name  # destination .npy path; some_name is elided in the original report
  emb = np.frombuffer(vector_data, dtype='float16')  # vector_data: raw float16 bytes
  np.save(path, emb)

When I use embedding_reader to load this, the first file fails due to a header parsing error.

python --version
Python 3.10.8

...
tracing your code using fsspec and reading the failing file manually
...

>>> f.seek(0)
0
>>> f
<fsspec.implementations.local.LocalFileOpener object at 0x1049fb940>
>>> f.size
1664
>>> isinstance(f.size, int)
True
>>> file_size = f.size
>>> file_size
1664
>>> first_line = f.read(min(file_size, 300)).split(b"\n")[0]
>>> first_line
b"\x93NUMPY\x01\x00v\x00{'descr': '<f2', 'fortran_order': False, 'shape': (768,), }                                                          "
>>> result = re.search(r"'shape': \(([0-9]+), ([0-9]+)\)", str(first_line))
>>> result
>>> str(first_line)
'b"\\x93NUMPY\\x01\\x00v\\x00{\'descr\': \'<f2\', \'fortran_order\': False, \'shape\': (768,), }

It seems that when the bytes are cast with str(), escape characters are added, which causes the header parsing to fail? Is this not the expected result?

This is causing autofaiss to break when building an index. I'm able to use autofaiss when loading the embeddings from memory, so it's likely not an autofaiss issue.
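For what it's worth, the regex in the transcript above expects a two-dimensional shape, so a 1-D header like (768,) can never match, regardless of str(). A minimal sketch of a more permissive parse (not the library's code):

import re

# Sketch only: also matches 1-D shapes such as (768,).
first_line = b"\x93NUMPY\x01\x00v\x00{'descr': '<f2', 'fortran_order': False, 'shape': (768,), }"
result = re.search(r"'shape': \((\d+)(?:, ?(\d+))?,?\)", first_line.decode("latin-1"))
shape = tuple(int(g) for g in result.groups() if g is not None)
print(shape)  # (768,)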

Would there be a way to allow for more recent pyarrow versions?

Hi, would there be a way to allow more recent pyarrow versions?
The current pyarrow>=6.0.1,<8 requirement pins us to an old version and makes it difficult to install libraries that require more recent ones. Is there a specific reason this library uses pyarrow>=6.0.1,<8?

I'm happy to contribute to any implementation changes if needed.

AttributeError: 'NoneType' object has no attribute 'group'

from embedding_reader import EmbeddingReader

embedding_reader = EmbeddingReader(embeddings_folder="./data/test1/imgs", file_format="npy")

print("embedding count", embedding_reader.count)
print("dimension", embedding_reader.dimension)
print("total size", embedding_reader.total_size)
print("byte per item", embedding_reader.byte_per_item)

for emb, meta in embedding_reader(batch_size=10 ** 6, start=0, end=embedding_reader.count):
    print(emb.shape)

error:

Traceback (most recent call last):
  File "/home/kemove/anaconda3/envs/dalle2/lib/python3.9/site-packages/embedding_reader/numpy_reader.py", line 39, in file_to_header
    return (None, [filename, *read_numpy_header(f)])
  File "/home/kemove/anaconda3/envs/dalle2/lib/python3.9/site-packages/embedding_reader/numpy_reader.py", line 28, in read_numpy_header
    shape = (int(result.group(1)), int(result.group(2)))
AttributeError: 'NoneType' object has no attribute 'group'
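This looks like the same failure mode as the header-parsing issue above: per the traceback, the shape regex in read_numpy_header returns None for 1-D .npy files, and calling .group() on None raises this AttributeError.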

Slow and incorrect exploration of embedding files with fs.glob()

When looking for the list of files with the requested file_format, this code is not optimal: fsspec will explore all the files in the parent folder, and it can even match unwanted files.

glob_pattern = path.rstrip("/") + f"**/*.{file_format}"

For example, if we want to find all the files ending with .npy in /tmp/tmpeejv3hoh, fs.glob("/tmp/tmpeejv3hoh**/*.npy") will explore all the files in /tmp and could even match wrong files like /tmp/tmpeejv3hoh_2/toto.npy.

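A minimal sketch of a fix (an assumption, not necessarily the merged patch): anchoring the recursive wildcard below the folder with an explicit separator keeps the glob inside the target directory.

# Sketch: note the "/" before "**", which prevents matching sibling directories.
glob_pattern = path.rstrip("/") + f"/**/*.{file_format}"
files = fs.glob(glob_pattern)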

EmbeddingReader.__init__() got an unexpected keyword argument 'metadata_folder'

Hello,

I am trying to use the library to read some embeddings but I get this error

Here is how to reproduce it:

from embedding_reader import EmbeddingReader

embedding_reader = EmbeddingReader(embeddings_folder="https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/img_emb/",
                                   metadata_folder="https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-en/laion2B-en-metadata/", 
                                   file_format="parquet_npy",    
                                   meta_columns=["url", "caption"],
)

print("embedding count", embedding_reader.count)
print("dimension", embedding_reader.dimension)
print("total size", embedding_reader.total_size)
print("byte per item", embedding_reader.byte_per_item)

for emb, meta in embedding_reader(batch_size=10 ** 6, start=0, end=embedding_reader.count, max_ram_usage_in_bytes=2**30):
    print(emb.shape)
    print(meta["url"], meta["caption"])
    break

Error :
TypeError: EmbeddingReader.__init__() got an unexpected keyword argument 'metadata_folder'
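A likely cause (an assumption, since the report doesn't state the installed version) is a version of embedding_reader that predates the parquet_npy reader; upgrading with pip install -U embedding_reader should make the metadata_folder argument available.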

Add a sequential reader

Useful:

  • as a benchmark
  • when data is local and doesn't benefit from parallelization
  • when the bottleneck is memory rather than time
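A minimal sketch of what such a reader could look like for .npy folders (an illustration of the request, not a proposed implementation):

import os
import numpy as np

# Stream batches file by file, in order, with no thread pool or piece planning.
def sequential_batches(folder, batch_size):
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".npy"):
            continue
        arr = np.load(os.path.join(folder, name), mmap_mode="r")
        for i in range(0, arr.shape[0], batch_size):
            yield np.array(arr[i : i + batch_size])  # copy the slice out of the memmap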
