Comments (9)
I don't know what hashing method is running under the hood, but these were the questions I had when contemplating it.
Normally the entire file is passed through in chunks; each chunk is processed serially and depends on the state left by the previous chunk, which means parallelizing this will not work. Alternatively, it should be possible to hash chunks in parallel, and then compute a sha256 over the hashes of those parallel chunks.
Additionally, I know that there is hardware acceleration and various hardware instructions for sha256; however, I don't know how much you will really enjoy supporting different hardware, or what sort of applications will automatically handle the accelerators and what level of support they have.
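As an illustration of the parallel alternative, a minimal sketch (this is not how kubo computes CIDs; the 1 MiB chunk size and the two-level sha256-of-digests scheme are assumptions for illustration only):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # 1 MiB; chunk size is an arbitrary choice for this sketch

def sha256_chunk(data: bytes) -> bytes:
    # hashlib releases the GIL while hashing large buffers,
    # so threads genuinely hash chunks in parallel
    return hashlib.sha256(data).digest()

def parallel_digest(path: str) -> str:
    """Hash fixed-size chunks in parallel, then sha256 the digest list."""
    with open(path, "rb") as f:
        chunks = iter(lambda: f.read(CHUNK), b"")
        with ThreadPoolExecutor() as pool:
            digests = pool.map(sha256_chunk, chunks)
            combined = b"".join(digests)
    return hashlib.sha256(combined).hexdigest()
```

Note that the resulting digest differs from a plain `sha256sum` of the whole file; it only identifies the file under this specific chunking scheme, which is essentially why a parallel scheme can't reproduce a serial hash.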
I seem to remember it taking around 2 days to index ~4TB of models on a 2TB NVMe-cached ZFS 8-drive SAS array.
from kubo.
@endomorphosis mind providing some answers to the questions below? These will let us narrow down the areas that require optimization to better serve your use case (AI model data?):
- data itself: what does 4TB of your model data look like? (e.g. average file sizes, number of files in a directory, directory tree depth)
- data onboarding: is it done with plain `ipfs add` or a different command? Do you use any custom parameters?
- storage backend: are you using flatfs+leveldb (default) or something else (badgerds)?
- is `ipfs daemon` running while you perform the import? (or did you pass the `--offline` flag to skip announcements?)
it looks like:
Mostly large language models, mostly between 7GB and 70GB, using badgerds, from huggingface repositories. The IPFS daemon is running online and I use -r to archive the entire folder, e.g. https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/tree/main
https://github.com/endomorphosis/ipfs_transformers/blob/main/ipfs_transformers/ipfs_kit_lib/ipfs.py#L73
/usr/local/bin/ipfs daemon --enable-gc --enable-pubsub-experiment
ipfs@workstation:/storage/cloudkit-models/Mixtral-8x7B-v0.1-GGUF-Q8_0$ time ipfs add mixtral-8x7b-v0.1.Q8_0.gguf
added QmXVg3Ae6wRwbvkVqMwySyx6qdcVdjEy1iu8xHnwT9dAoB mixtral-8x7b-v0.1.Q8_0.gguf
46.22 GiB / 46.22 GiB [=============================================================================================================================================================================] 100.00%
real 15m32.782s
user 0m22.393s
sys 1m54.226s
(copy from zfs to ssd, then zfs to zfs)
barberb@workstation:/storage/cloudkit-models/Mixtral-8x7B-v0.1-GGUF-Q8_0$ time cp mixtral-8x7b-v0.1.Q8_0.gguf /tmp/mixtral.bin
real 1m59.033s
user 0m0.681s
sys 1m17.771s
barberb@workstation:/storage/cloudkit-models/Mixtral-8x7B-v0.1-GGUF-Q8_0$ time cp mixtral-8x7b-v0.1.Q8_0.gguf ../
real 2m28.763s
user 0m0.624s
sys 0m45.349s
Dual Xeon E5 v4 CPUs with 8x WD Gold drives and 1x Samsung 3x8-PCIe-lane enterprise NVMe on ZFS.
I will try it on my Windows laptop workstation (NVMe only) using the desktop client in a second and will update this, but it looks like ~50MB/s, and CPU utilization is always very low.
Windows NVMe -> NVMe (different devices), Intel(R) Xeon(R) CPU E3-1535M v6, using the desktop client:
12m 30s
80% total cpu util (4 cores)
@endomorphosis Triage questions:
- do you mind sharing your `ipfs config show`?
- is `time ipfs add mixtral-8x7b-v0.1.Q8_0.gguf` taking the same amount of time if you do it in offline mode (daemon shut down, or running in `--offline` mode)? In the past, this helped with importing things like 350GiB of wikipedia (by skipping online tasks such as announcements, gc, etc.); performing the expensive import this way and then restarting the node might unblock you for now.
- otherwise, we need to wait for improvements / research tracked in:
  - #6523 (comment) (this is pretty old, but setting `syncwrite` to `false` will speed things up significantly)
  - #9678
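For context on the `syncwrite` knob: with the badgerds backend it lives in the datastore spec inside `$IPFS_PATH/config`. A fragment might look like the following (field names taken from kubo's badgerds profile; verify against your own `ipfs config show` before editing):

```json
"Datastore": {
  "Spec": {
    "type": "measure",
    "prefix": "badger.datastore",
    "child": {
      "type": "badgerds",
      "path": "badgerds",
      "syncWrites": false,
      "truncate": true
    }
  }
}
```

kubo validates `Datastore.Spec` against the repo's `datastore_spec` file, so tread carefully when hand-editing the config of an existing repo.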
devel@workstation:/tmp$ time ipfs add mixtral-8x7b-v0.1.Q8_0.gguf
added QmXVg3Ae6wRwbvkVqMwySyx6qdcVdjEy1iu8xHnwT9dAoB mixtral-8x7b-v0.1.Q8_0.gguf
46.22 GiB / 46.22 GiB [===========================================================================================================================================================================================================] 100.00%
real 14m8.824s
user 12m19.404s
sys 1m31.919s
running offline
fregg@workstation:/tmp$ time ipfs add mixtral-8x7b-v0.1.Q8_0.gguf
added QmXVg3Ae6wRwbvkVqMwySyx6qdcVdjEy1iu8xHnwT9dAoB mixtral-8x7b-v0.1.Q8_0.gguf
46.22 GiB / 46.22 GiB [===========================================================================================================================================================================================================] 100.00%
real 9m1.508s
user 12m46.422s
sys 2m11.760s
syncwrite to false
I do want to mention that I am also writing a wrapper to import datasets into IPFS, where the number of files will be on the order of 7 million legal caselaw documents. Do you have any feedback about what optimizations ought to be made for that case?
Per #9678 I have an old branch with a bunch of optimizations for data import that I could rescue/clean up (edit: not for merging, though, as it hardcodes some stuff; just for using).
One low-hanging fruit would be to add support for badger4 and pebble backends too. Badger v1 is ages old.
> Per #9678 I have an old branch with a bunch of optimizations for data import that I could rescue/cleanup. (edit: not for merging though as it hardcodes some stuff, just for using).
> One low hanging fruit would be to add support for badger4 and pebble backends too. Badgerv1 is ages old.
I will look at that for my repositories:
https://github.com/endomorphosis/ipfs_transformers
https://github.com/endomorphosis/ipfs_datasets
I would like to impress on your org (Protocol Labs) that I am trying to make this ergonomic for machine learning developers, and from what I have seen, other projects (e.g. Bacalhau, Iroh) are migrating away from libp2p / IPFS because the project needs to be performance-optimized.
I only have two more weeks I can spend on this Hugging Face bridge, and right now it's more of a life raft in case the government decides to overregulate machine learning than a viable solution that ML devs would turn to; but if something reasonably effective for decentralized low-latency MLOps existed, I would probably stop development and just buy Filecoin.