Comments (7)
Is there a memory leak for the parquet/numpy reader as well?
I don't know. I would think not, or we would have seen it in autofaiss.
I didn't actually run out of memory, so maybe it's just that the GC is not running for some reason until it's needed (but it did use more than 200 GB of RAM).
I am not sure if this is the same as what I have faced.
I am running into OOM when running the NSFW model example from PR #16.
I ran it with all the defaults (10 parallel processes), except that the embeddings are already on the file system (not on cloud storage).
Memory keeps increasing as more batches are processed, and it eventually leads to OOM.
The VM has 128 GB of RAM, all of it is available, and no other process is contending for it.
After doing some debugging with a profiler, the root cause seems to be somewhere in the read_piece function from https://github.com/rom1504/embedding-reader/blob/main/embedding_reader/parquet_numpy_reader.py
More specifically, it seems to come from reading the metadata.
I made a small modification to the read_piece method to move the metadata part out so that I could apply @profile to it.
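For illustration, a minimal sketch of that kind of instrumentation, assuming memory_profiler's @profile decorator; read_metadata_piece and its arguments are hypothetical, not the actual code:

```python
# Sketch: isolate the metadata read so memory_profiler can report
# per-line memory usage. read_metadata_piece is a hypothetical helper,
# not the library's actual API.
from memory_profiler import profile
import pyarrow.parquet as pq

@profile
def read_metadata_piece(metadata_path, columns, start, end):
    # read the whole metadata table, then slice out the requested piece
    table = pq.read_table(metadata_path, columns=columns)
    return table.slice(start, end - start).to_pandas()
```

Running the script with python -m memory_profiler then prints the memory increment of each line inside the decorated function.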
I tried a bunch of things like setting parallel_pieces to 1, deleting the pandas objects, and calling the GC explicitly. None of them changed the behavior.
One thing to note: we are reading all the metadata using pyarrow and then picking the relevant columns. pandas has a read_parquet method where we can specify the columns we want to read. Maybe using pandas read_parquet would be more elegant? Note that pandas uses pyarrow as its default engine internally.
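A minimal sketch of what that column-restricted read could look like (the path and column names are placeholders):

```python
# Sketch: read only the needed metadata columns instead of the whole table.
import pandas as pd
import pyarrow.parquet as pq

path = "metadata_0000.parquet"   # placeholder file name
wanted = ["url", "caption"]      # placeholder column names

# current approach: read everything, then pick the columns afterwards
df_all = pq.read_table(path).to_pandas()[wanted]

# column-restricted read via pandas (pyarrow is the default engine)
df_pd = pd.read_parquet(path, columns=wanted)

# the same restriction is available directly in pyarrow
df_pa = pq.read_table(path, columns=wanted).to_pandas()
```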
I have tried that change as well and didn't see any difference; it still leads to OOM eventually.
Probably the same problem.
The embeddings are available now at https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion1B-nolang/
fsspec supports HTTPS as well, so I'm going to use that to reproduce and fix.
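For context, a minimal sketch of loading one of those files over HTTPS with fsspec; the img_emb shard name below is a placeholder, not a confirmed path:

```python
# Sketch: load a remote .npy embedding shard over HTTPS via fsspec.
import io
import fsspec
import numpy as np

# placeholder shard path under the published directory
url = "https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion1B-nolang/img_emb/img_emb_0000.npy"

with fsspec.open(url, "rb") as f:
    embeddings = np.load(io.BytesIO(f.read()))
print(embeddings.shape, embeddings.dtype)
```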
OK, after testing some more, it seems the problem is closely related to the fact that it doesn't make much sense to read parquet files in arbitrary pieces, because the on-disk serialization doesn't allow retrieving only those pieces. At best, splitting by row groups could make sense.
I will test some more and maybe simply change the default so that the parquet+numpy reader and the parquet reader read the whole file at once, or whole row groups at once.
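A minimal sketch of the row-group variant, assuming pyarrow's ParquetFile and placeholder file/column names:

```python
# Sketch: iterate over a parquet file one row group at a time instead of
# slicing it into arbitrary pieces.
import pyarrow.parquet as pq

path = "metadata_0000.parquet"   # placeholder file name
wanted = ["url", "caption"]      # placeholder column names

pf = pq.ParquetFile(path)
for i in range(pf.num_row_groups):
    df = pf.read_row_group(i, columns=wanted).to_pandas()
    print(f"row group {i}: {len(df)} rows")
```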
Another, more general way to solve the problem could be to keep the piece-fetching logic, but instead of using it to load pieces of embedding files, use it to prepare pieces of files at the byte level on disk (or in memory), and then load the batch all at once using numpy/pyarrow.
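A rough sketch of that byte-level idea for a single file, assuming fsspec for the range reads; this is only an illustration, not the eventual implementation:

```python
# Sketch: fetch a file as raw byte ranges, reassemble it in memory,
# then parse it once with pyarrow.
import io
import fsspec
import pyarrow.parquet as pq

def load_file_in_byte_pieces(path, piece_size=32 * 1024 * 1024):
    fs, fs_path = fsspec.core.url_to_fs(path)
    size = fs.size(fs_path)
    buf = io.BytesIO()
    with fs.open(fs_path, "rb") as f:
        for start in range(0, size, piece_size):
            f.seek(start)
            buf.write(f.read(min(piece_size, size - start)))
    buf.seek(0)
    return pq.read_table(buf)
```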
Solved now.
Related Issues (20)
- Slow and incorrect exploration of embedding files with fs.glob()
- add numpy + parquet reader
- use a local cache to make parquet reading faster
- reimplement parquet reader using ParquetFile
- Add a sequential reader
- improve parquet reader in the same way as parquet numpy reader
- build dedicated package, depending on this to create clip subset
- consider moving most examples to dedicated packages
- Missing format string specifier
- try out new fsspec feature to speed up parquet read
- AttributeError: 'NoneType' object has no attribute 'group'
- How do you prepare the local dataset?
- str() causes .npy file header to fail regex
- [question] Multiple vs single parquet/np files for embeddings
- Would there be a way to allow for more recent pyarrow versions?
- EmbeddingReader.__init__() got an unexpected keyword argument 'metadata_folder'
- 1.7.0 published tarball does not include requirements.txt
- add example for some classical datasets
- Add example of predicting clip text from clip image