reimplement parquet reader using ParquetFile (embedding-reader issue, closed)

rom1504 commented on July 27, 2024

reimplement parquet reader using ParquetFile

from embedding-reader.

Comments (20)

rom1504 commented on July 27, 2024

maybe just try to use the current code but cache usage of ParquetFile and close the file when all pieces have been read
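
A minimal sketch of that idea, assuming the number of pieces per file is known up front: keep one open handle per file, share it between pieces, and close it once the last piece is done. `PieceFileCache` and the `opener` callable are hypothetical names, not part of embedding-reader; `opener` could be something like `lambda name: pq.ParquetFile(fs.open(name, "rb"))`.

```python
import threading


class PieceFileCache:
    """Share one open handle per file and close it after its last piece."""

    def __init__(self, opener):
        self._opener = opener   # hypothetical callable: filename -> object with .close()
        self._lock = threading.Lock()
        self._entries = {}      # filename -> [handle, pieces_left]

    def acquire(self, filename, piece_count):
        """Return the shared handle, opening the file on first use."""
        with self._lock:
            if filename not in self._entries:
                self._entries[filename] = [self._opener(filename), piece_count]
            return self._entries[filename][0]

    def release(self, filename):
        """Mark one piece as done; close the file when none are left."""
        with self._lock:
            entry = self._entries[filename]
            entry[1] -= 1
            if entry[1] == 0:
                entry[0].close()
                del self._entries[filename]
```

Each piece would call `acquire` before reading and `release` afterwards, so a file stays open only while some of its pieces are still pending.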

rom1504 commented on July 27, 2024

consider using https://arrow.apache.org/docs/python/dataset.html , it's really good

rom1504 commented on July 27, 2024

along with https://arrow.apache.org/docs/python/filesystems.html#filesystem-fsspec

rom1504 commented on July 27, 2024

import fsspec
import pyarrow.dataset as ds
from tqdm import tqdm
fs, p = fsspec.core.url_to_fs("https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion1B-nolang/laion1B-nolang-metadata")
files = fs.ls(p, detail=False)
d = ds.dataset(files, filesystem=fs)
b = d.to_batches()
for _ in tqdm(b):
    pass

about 1M samples/s, i.e. 200MB/s, which saturates the external server connection

rom1504 commented on July 27, 2024

idea:

use numpy read with the current implementation for numpy
use pyarrow dataset for parquet
zip the 2 for the numpy parquet reader

this should fix the parquet speed and will solve the memory issue

rom1504 commented on July 27, 2024

doesn't work: pyarrow.dataset has no support for reading a start/end row range
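
For reference, a start/end read can be approximated on top of `pq.ParquetFile` by picking the row groups that overlap the range and slicing the result. This is only a sketch under that assumption: `row_groups_for_range` is a hypothetical helper, and the row-group sizes would come from `ParquetFile.metadata.row_group(i).num_rows`.

```python
def row_groups_for_range(row_group_sizes, start, end):
    """Find the row groups covering rows [start, end).

    Returns (indices, offset, length): the row-group indices to read, then
    the slice to apply to the concatenated rows of those groups.
    """
    indices = []
    offset = 0       # position of `start` inside the first selected group
    first_row = 0    # global index of the first row of the current group
    for i, size in enumerate(row_group_sizes):
        last_row = first_row + size
        if last_row > start and first_row < end:
            if not indices:
                offset = start - first_row
            indices.append(i)
        first_row = last_row
    return indices, offset, end - start
```

With pyarrow the result would be used roughly as `pf.read_row_groups(indices).slice(offset, length)`, so only the overlapping row groups are fetched.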

rom1504 commented on July 27, 2024

numpy reader doing 300MB/s from https fsspec
numpy parquet is at 30MB/s ...
this needs to be improved

rom1504 commented on July 27, 2024

maybe https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner can be used to support start stop

rom1504 commented on July 27, 2024

no, can't

rom1504 commented on July 27, 2024

ok best idea is to cache the files

rom1504 commented on July 27, 2024

seems a little bit better but not much

from functools import lru_cache
import pyarrow.parquet as pq

@lru_cache(maxsize=None)
def open_parquet_file(fs, filename):
    return pq.read_table(fs.open(filename, "rb"), use_threads=False)

rom1504 commented on July 27, 2024

actually lru cache doesn't work with multiple threads...

rom1504 commented on July 27, 2024

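from threading import Lock
import pyarrow.parquet as pq

lock = Lock()
cache = {}  # filename -> pyarrow Table

def open_parquet_file(fs, filename):
    # hold the lock so each file is read only once, even across threads
    with lock:
        if filename not in cache:
            print(filename)
            cache[filename] = pq.read_table(fs.open(filename, "rb"), use_threads=False)
        return cache[filename]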

makes things much much faster
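
One possible refinement, not something that was benchmarked here: a single global lock serializes every read, including reads of different files. The sketch below uses one lock per key, so distinct files can load concurrently while each file is still read only once. `KeyedCache` and `loader` are hypothetical names; `loader` stands in for the `pq.read_table(...)` call.

```python
import threading


class KeyedCache:
    """Load each key at most once; different keys may load in parallel."""

    def __init__(self, loader):
        self._loader = loader            # hypothetical callable: key -> value
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()   # protects the two dicts above

    def get(self, key):
        with self._guard:
            if key in self._values:
                return self._values[key]
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        # Only threads asking for the same key wait on each other here.
        with key_lock:
            with self._guard:
                if key in self._values:
                    return self._values[key]
            value = self._loader(key)
            with self._guard:
                self._values[key] = value
            return value
```

Whether this helps depends on where the time goes: if the server connection is already saturated, parallel loads of different files may not change much.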

rom1504 commented on July 27, 2024

estimated 325min total (laion1B-nolang) with use_threads=True; probably the same with use_threads=False

rom1504 commented on July 27, 2024

150min for numpy alone

rom1504 commented on July 27, 2024

1012min without this change

rom1504 commented on July 27, 2024

seems to have solved the memleak too

rom1504 commented on July 27, 2024

https://github.com/rom1504/embedding-reader/pull/21/files

faster but didn't solve memleak

rom1504 commented on July 27, 2024

actually looks like it did. memory usage is a bit high (15GB with these settings), but memleak seems gone

rom1504 commented on July 27, 2024

did a fix here
