Comments (20)
Maybe just keep the current code, but cache the ParquetFile objects and close each file once all of its pieces have been read.
from embedding-reader.
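One way to read this suggestion, as a minimal sketch: keep each opened file in a cache together with a count of the pieces still to be read from it, and close the file when that count reaches zero. The names `PieceCountingCache`, `opener` and `pieces_per_file` are hypothetical and not part of embedding-reader; the sketch assumes every piece calls `acquire`, reads, then calls `release`.

```python
import threading


class PieceCountingCache:
    """Open each file once and close it after its last piece has been read."""

    def __init__(self, opener, pieces_per_file):
        # opener: hypothetical callable(filename) -> handle with a .close() method
        self._opener = opener
        # filename -> number of pieces that have not been read yet
        self._remaining = dict(pieces_per_file)
        self._handles = {}
        self._lock = threading.Lock()

    def acquire(self, filename):
        """Return the cached handle, opening the file on first use."""
        with self._lock:
            if filename not in self._handles:
                self._handles[filename] = self._opener(filename)
            return self._handles[filename]

    def release(self, filename):
        """Mark one piece as read; close and drop the file after the last one."""
        with self._lock:
            self._remaining[filename] -= 1
            if self._remaining[filename] == 0:
                handle = self._handles.pop(filename, None)
                if handle is not None:
                    handle.close()
```

With fsspec, `opener` could be something like `lambda name: fs.open(name, "rb")`, with the ParquetFile built on top of the returned handle. Closing the file as soon as its last piece is done also bounds how many files stay in memory at once.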
consider using https://arrow.apache.org/docs/python/dataset.html , it's really good
along with https://arrow.apache.org/docs/python/filesystems.html#filesystem-fsspec
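import fsspec
import pyarrow.dataset as ds
from tqdm import tqdm

# Resolve the URL into an fsspec filesystem and a path, then list the metadata files.
fs, p = fsspec.core.url_to_fs("https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion1B-nolang/laion1B-nolang-metadata")
files = fs.ls(p, detail=False)
# Build one dataset over all files and stream it as record batches.
d = ds.dataset(files, filesystem=fs)
b = d.to_batches()
for _ in tqdm(b):  # tqdm counts batches here, not rows
    pass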
About 1M samples/s, i.e. roughly 200MB/s; this saturates the external server connection.
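For reference, a rough sketch of how such rows/s and MB/s figures could be measured over a batch iterator. `measure_throughput` is a hypothetical helper; it relies only on the `num_rows` and `nbytes` attributes that pyarrow record batches expose. Note that `nbytes` is the in-memory Arrow size, which is an assumption here and can differ from the bytes actually transferred for compressed Parquet.

```python
import time


def measure_throughput(batches):
    """Consume `batches` and return (rows_per_second, megabytes_per_second)."""
    rows = 0
    nbytes = 0
    start = time.perf_counter()
    for batch in batches:
        rows += batch.num_rows  # rows in this batch
        nbytes += batch.nbytes  # in-memory size of this batch
    # Guard against a zero elapsed time on very short runs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return rows / elapsed, nbytes / elapsed / 1e6
```

It would be called as `measure_throughput(d.to_batches())` on the dataset built above.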
idea:
- use numpy read with the current implementation for numpy
- use pyarrow dataset for parquet
- zip the 2 for the numpy parquet reader
this should fix the parquet speed and will solve the memory issue
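Zipping the two readers only works if both yield batches with the same row boundaries, which two independent readers generally won't. A minimal sketch of that step, using plain Python lists as stand-ins for numpy arrays and Arrow record batches; `rebatch` and `zip_readers` are hypothetical helper names, and real code would slice arrays rather than iterate row by row.

```python
from itertools import chain, islice


def rebatch(batches, size):
    """Re-chunk an iterable of row sequences into lists of `size` rows.

    The last chunk may be shorter.
    """
    rows = chain.from_iterable(batches)
    while True:
        chunk = list(islice(rows, size))
        if not chunk:
            return
        yield chunk


def zip_readers(embedding_batches, metadata_batches, size):
    """Yield (embeddings, metadata) pairs covering the same rows."""
    yield from zip(rebatch(embedding_batches, size), rebatch(metadata_batches, size))
```

Both streams are assumed to contain the same rows in the same order; `zip` would silently stop at the shorter one otherwise.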
Doesn't work: pyarrow.dataset has no support for reading an explicit start/end row range.
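A sketch of the arithmetic needed to support start/end on top of per-file reads, assuming the row count of each file is known up front (for Parquet it is available from the file metadata). `split_range` is a hypothetical helper, not embedding-reader's actual code.

```python
def split_range(row_counts, start, end):
    """Map the global row range [start, end) onto per-file ranges.

    Returns a list of (file_index, local_start, local_end) tuples.
    """
    pieces = []
    offset = 0  # global index of the first row of the current file
    for i, n in enumerate(row_counts):
        lo = max(start, offset)
        hi = min(end, offset + n)
        if lo < hi:  # the requested range overlaps this file
            pieces.append((i, lo - offset, hi - offset))
        offset += n
    return pieces
```

For files of 100, 50 and 200 rows, `split_range([100, 50, 200], 80, 180)` covers the last 20 rows of the first file, all of the second, and the first 30 rows of the third.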
The numpy reader does 300MB/s over https with fsspec.
The numpy parquet reader is at 30MB/s, ten times slower.
This needs to be improved.
Maybe https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner can be used to support start/stop.
No, it can't.
OK, the best idea is to cache the files.
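Seems a little bit better, but not much:
from functools import lru_cache
import pyarrow.parquet as pq
@lru_cache(maxsize=None)
def open_parquet_file(fs, filename):
    # Read the whole file into memory once; later calls with the same arguments hit the cache.
    return pq.read_table(fs.open(filename, "rb"), use_threads=False)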
Actually, lru_cache doesn't work well with multiple threads: concurrent calls that miss on the same key each run the function, so the same file can get read several times.
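from threading import Semaphore
import pyarrow.parquet as pq

r = Semaphore(1)  # binary semaphore guarding the cache
d = {}  # filename -> pyarrow.Table

def open_parquet_file(fs, filename):
    # Hold the semaphore for the whole lookup-or-load so each file is read only once.
    with r:
        if filename in d:
            return d[filename]
        print(filename)  # log each file the first time it is loaded
        t = pq.read_table(fs.open(filename, "rb"), use_threads=False)
        d[filename] = t
        return t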
makes things much much faster
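The single semaphore above serializes every call, including loads of different files. A possible refinement, sketched under the assumption that loads of different files are independent of each other: use one lock per filename, so only callers waiting for the same file block. `OnceCache` and `loader` are hypothetical names.

```python
import threading


class OnceCache:
    """Thread-safe cache that calls `loader(key)` at most once per key."""

    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()  # protects the _locks dict only

    def get(self, key):
        # Fetch or create the lock for this key under the short global guard.
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        # Only callers asking for the same key wait on each other here.
        with lock:
            if key not in self._values:
                self._values[key] = self._loader(key)
            return self._values[key]
```

Here `loader` would be a function of the filename that wraps the `pq.read_table(...)` call. Like the dict above, this cache never evicts anything, so memory grows with the number of distinct files loaded.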
Estimated 325min total (laion1B-nolang) with use_threads=True; probably the same with use_threads=False.
150min for numpy alone.
1012 without this change.
Seems to have solved the memory leak too.
https://github.com/rom1504/embedding-reader/pull/21/files
Faster, but didn't solve the memory leak.
Actually, it looks like it did: memory usage is a bit high (15GB with these settings), but the memory leak seems gone.
did a fix here