Comments (20)

rom1504 avatar rom1504 commented on July 27, 2024

maybe just try to use the current code but cache usage of ParquetFile and close the file when all pieces have been read
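
A minimal sketch of that idea, assuming the number of pieces per file is known up front. `RefCountedFileCache`, `opener` and `pieces_per_file` are hypothetical names for illustration, not part of embedding-reader; in practice the opener would wrap something like `pyarrow.parquet.ParquetFile`.

```python
import threading


class RefCountedFileCache:
    """Keep one open handle per file and close it once all its pieces are read."""

    def __init__(self, opener, pieces_per_file):
        self._opener = opener  # hypothetical callable: filename -> object with .close()
        self._remaining = dict(pieces_per_file)  # filename -> pieces left to read
        self._handles = {}
        self._lock = threading.Lock()

    def get(self, filename):
        # Open the file on first use, then hand back the cached handle.
        with self._lock:
            if filename not in self._handles:
                self._handles[filename] = self._opener(filename)
            return self._handles[filename]

    def piece_done(self, filename):
        # Call once per piece; the last piece closes and evicts the handle.
        with self._lock:
            self._remaining[filename] -= 1
            if self._remaining[filename] == 0:
                handle = self._handles.pop(filename, None)
                if handle is not None:
                    handle.close()
```

Each reader thread would call `get(filename)` before reading a piece and `piece_done(filename)` afterwards, so the handle lives only as long as pieces of that file are outstanding.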

from embedding-reader.

rom1504 avatar rom1504 commented on July 27, 2024

consider using https://arrow.apache.org/docs/python/dataset.html , it's really good

rom1504 avatar rom1504 commented on July 27, 2024

along with https://arrow.apache.org/docs/python/filesystems.html#filesystem-fsspec

rom1504 avatar rom1504 commented on July 27, 2024
import fsspec
import pyarrow.dataset as ds
from tqdm import tqdm

# resolve the https URL to an fsspec filesystem and a path on it
fs, p = fsspec.core.url_to_fs("https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion1B-nolang/laion1B-nolang-metadata")
files = fs.ls(p, detail=False)
# treat all the parquet files as one dataset and stream it batch by batch
d = ds.dataset(files, filesystem=fs)
b = d.to_batches()
for _ in tqdm(b):
    pass

about 1M samples/s, i.e. roughly 200MB/s, which saturates the external server connection

rom1504 avatar rom1504 commented on July 27, 2024

idea:

  • use numpy read with the current implementation for numpy
  • use pyarrow dataset for parquet
  • zip the 2 for the numpy parquet reader

this should fix the parquet speed and will solve the memory issue
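
The "zip the 2" step could look roughly like the sketch below, assuming both readers can be made to yield batches over the same row ranges. `zip_batches` is a hypothetical helper, and plain lists stand in for the numpy arrays and parquet batches.

```python
def zip_batches(embedding_batches, metadata_batches):
    """Yield (embeddings, metadata) pairs, checking that the batches line up."""
    emb_iter, meta_iter = iter(embedding_batches), iter(metadata_batches)
    sentinel = object()
    while True:
        emb = next(emb_iter, sentinel)
        meta = next(meta_iter, sentinel)
        if emb is sentinel and meta is sentinel:
            return  # both readers are exhausted at the same time
        if emb is sentinel or meta is sentinel:
            raise ValueError("readers produced a different number of batches")
        if len(emb) != len(meta):
            raise ValueError(f"batch size mismatch: {len(emb)} vs {len(meta)}")
        yield emb, meta
```

The explicit length checks matter here: a silent misalignment between embeddings and metadata would be much harder to notice than a slow read.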

rom1504 avatar rom1504 commented on July 27, 2024

doesn't work because pyarrow.dataset has no support for start/end offsets
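
For context, start/end support means turning a global row range into per-file slices. A minimal sketch of that mapping, assuming the row count of each file is already known (for example from the parquet metadata). `pieces_for_range` is a hypothetical name, not the embedding-reader implementation.

```python
from bisect import bisect_right
from itertools import accumulate


def pieces_for_range(row_counts, start, end):
    """Split global rows [start, end) into (file_index, local_start, local_end)."""
    offsets = [0] + list(accumulate(row_counts))  # offsets[i] = first global row of file i
    end = min(end, offsets[-1])  # clamp to the total number of rows
    pieces = []
    i = bisect_right(offsets, start) - 1  # index of the file containing `start`
    while start < end:
        stop = min(end, offsets[i + 1])  # stop at the range end or the file end
        if stop > start:
            pieces.append((i, start - offsets[i], stop - offsets[i]))
        start = stop
        i += 1
    return pieces


print(pieces_for_range([100, 50, 200], 80, 170))
```

With files of 100, 50 and 200 rows, the range 80 to 170 touches the tail of the first file, all of the second and the head of the third.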

rom1504 avatar rom1504 commented on July 27, 2024

numpy reader doing 300MB/s from https fsspec
numpy parquet is at 30MB/s ...
this needs to be improved

rom1504 avatar rom1504 commented on July 27, 2024

maybe https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner can be used to support start stop

rom1504 avatar rom1504 commented on July 27, 2024

no, can't

rom1504 avatar rom1504 commented on July 27, 2024

ok best idea is to cache the files

rom1504 avatar rom1504 commented on July 27, 2024

seems a little bit better but not much

from functools import lru_cache
import pyarrow.parquet as pq

@lru_cache(maxsize=None)
def open_parquet_file(fs, filename):
    return pq.read_table(fs.open(filename, "rb"), use_threads=False)

rom1504 avatar rom1504 commented on July 27, 2024

actually lru cache doesn't work with multiple threads...
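
One plausible reading of this: `functools.lru_cache` is safe to call from several threads, but it does not stop two threads that miss on the same key from both running the wrapped function. The sketch below forces that situation with a barrier; `slow_load` and the filename are made up for the demonstration.

```python
import threading
from functools import lru_cache

calls = []
both_inside = threading.Barrier(2, timeout=5)


@lru_cache(maxsize=None)
def slow_load(filename):
    calls.append(filename)
    both_inside.wait()  # stand-in for a slow read: both threads are inside the loader at once
    return f"table for {filename}"


threads = [threading.Thread(target=slow_load, args=("part-0.parquet",)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(calls))  # number of times the loader actually ran for the one file
```

With a real parquet read in place of the barrier, every thread that arrives before the first read finishes would download and parse the same file again.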

rom1504 avatar rom1504 commented on July 27, 2024
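from threading import Semaphore

import pyarrow.parquet as pq

r = Semaphore(1)  # guards the cache so each file is read only once

d = {}  # filename -> pyarrow Table
def open_parquet_file(fs, filename):
    with r:  # released even if the read raises
        if filename in d:
            return d[filename]
        print(filename)
        t = pq.read_table(fs.open(filename, "rb"), use_threads=False)
        d[filename] = t
        return t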

makes things much much faster
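
The single semaphore in the code above also serializes reads of different files. A possible variation, not taken from the thread, is one lock per file so that only threads waiting for the same file block each other. `PerFileCache` and `loader` are hypothetical names.

```python
import threading


class PerFileCache:
    """Load each file once, while letting different files load in parallel."""

    def __init__(self, loader):
        self._loader = loader  # hypothetical: e.g. a function wrapping pq.read_table
        self._tables = {}
        self._locks = {}
        self._guard = threading.Lock()  # protects the two dicts only, never held during a load

    def get(self, filename):
        with self._guard:
            if filename in self._tables:
                return self._tables[filename]
            lock = self._locks.setdefault(filename, threading.Lock())
        with lock:  # only threads wanting this same file wait here
            with self._guard:
                if filename in self._tables:  # another thread finished the load meanwhile
                    return self._tables[filename]
            table = self._loader(filename)
            with self._guard:
                self._tables[filename] = table
            return table
```

Whether this helps depends on where the time goes; if the connection is already saturated by one read, parallel loads may not be any faster.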

rom1504 avatar rom1504 commented on July 27, 2024

estimated 325min total (laion1B-nolang) with use_threads=True; probably the same with False

rom1504 avatar rom1504 commented on July 27, 2024

150min for numpy alone

rom1504 avatar rom1504 commented on July 27, 2024

1012min without this change
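
Putting the three figures side by side, assuming they are all estimates in minutes for the same laion1B-nolang read:

```python
# Figures quoted in the thread; the shared unit (minutes) is an assumption.
without_cache = 1012
with_cache = 325
numpy_only = 150

speedup = without_cache / with_cache
overhead_vs_numpy = with_cache / numpy_only
print(f"cache speedup: {speedup:.1f}x, still {overhead_vs_numpy:.1f}x the numpy-only time")
```

So the cache gives roughly a 3x improvement, while the numpy parquet reader remains about twice as slow as numpy alone.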

rom1504 avatar rom1504 commented on July 27, 2024

seems to have solved the memleak too

rom1504 avatar rom1504 commented on July 27, 2024

https://github.com/rom1504/embedding-reader/pull/21/files

faster but didn't solve memleak

rom1504 avatar rom1504 commented on July 27, 2024

actually looks like it did. memory usage is a bit high (15GB with these settings), but memleak seems gone

rom1504 avatar rom1504 commented on July 27, 2024

did a fix here
