Giter VIP home page Giter VIP logo

Comments (8)

Cydhra avatar Cydhra commented on August 22, 2024 1

Yeah feel free to. Zero copy deserializability is a feature that I'd like, so I don't think there is a reason to keep the two issues disjoint.

from vers.

Cydhra avatar Cydhra commented on August 22, 2024

I never used or worked with those, but from skimming the paper, it seems a lot less painful than my attempts at implementing succinct trees. Adding a feature that actually has use cases and is easier to implement definitely sounds more fun.
I am pretty tied up right now, but when I eventually have some free time again, I'll give it a shot.
Do sucds and cseq implement Wavelet Trees or Matrices?

from vers.

somethingelseentirely avatar somethingelseentirely commented on August 22, 2024

Yeah, they are both implementing WaveletMatrices. I think the main insight for wavelet matrices comes from the fact that the total number of bits in each level stays constant for a balances wavelet tree (this invariant doesn't hold for compressed versions like huffman trees), which makes them also a lot easier to implement.

I'd be happy to also give an implementation a shot, but I'd probably aready try to build it with zerocopy deserializability in mind (#5) if that's ok with you. Because I need this for an on-disk db storage format.

from vers.

Cydhra avatar Cydhra commented on August 22, 2024

Implementing it with zero-copy deserializability in mind is obviously possible, but not really necessary. While it is true, that the data structures are immutable, they do not freely lend themselves to zero-copy (de)serialization.
Zero-copy deserialization has to mind the endianess of the backing system. That is, each zero copy data structure needs to have multiple versions which handle either endianness, with different numeric types. Mapping, for example, a big-endian encoded file into a little-endian machine works, but the data types must be handled differently because of the inverted byte order. In contrast, converting the big-endian structure into little-endian on the fly is also possible, but destroys the zero-copy property. We cannot just write a data structure that is already the archived version, which rkyv usually provides, because of that.

I started implementing a WaveletMatrix on the branch dev_wavelet. The data structure, just like the other data structures in this library is agnostic to zero-copy deserialization. If zero-copy serialization were to be implemented for it, more data types that represent either endianness (among other things like #[repr(C)]) need to be added. The only thing that is possible to already do here, is to add the functionality of the WaveletMatrix via a trait, to later reuse code for the zero copy versions.

Zero copy versions of other data structures, as outlined in #5, need these traits too, inducing potentially breaking changes.

from vers.

somethingelseentirely avatar somethingelseentirely commented on August 22, 2024

Thanks for pushing this! I carry the wavelet matrix paper around with me in my backpack, but there were other fires that popped up that need to be put out first. 😓

I've also recently encountered the endianness issue while working with candle and model weight storage formats like SafeTensor. The approach of all Rust libraries and storage formats I've encountered seems to be to just assume that all of the relevant platforms are/support little-endian these days.

I would still want to make sure that that is the case and add something like:

#[cfg(not(target_endian = "little"))]
compile_error!("Zero copy is only supported on LE architectures.)

But after some philosophical wrangling with my inner perfectionist, I've concluded that this is a valid approach 😅. There are few BE-only systems supported by Rust, and they are all embedded systems that don't have SIMD and maybe don't even have a file system let alone the ability to mmap files.

Edit:
To expand a bit on how this should make zero copy pretty easy to implement. Given that the data-structures are already immutable, they would only need to be repr(C) and derive(FromBytes, FromZeroes, AsBytes) from the zerocopy crate. Due to native number types already being the correct endian thanks to the above check, they don't need to use the byteorder types, and can just use rusts native number types.

from vers.

Cydhra avatar Cydhra commented on August 22, 2024

To expand a bit on how this should make zero copy pretty easy to implement. Given that the data-structures are already immutable, they would only need to be repr(C) and derive(FromBytes, FromZeroes, AsBytes) from the zerocopy crate.

Maybe it is easier when using zerocopy rather than rkyv. However, there is unsafe code in vers that may or may not play nicely with the safety assumptions of zerocopy.

I still strongly prefer to have this issue decoupled from zero-copying (despite what I said two months ago), but I'd be happy to see a proof of concept with zerocopy.

from vers.

somethingelseentirely avatar somethingelseentirely commented on August 22, 2024

Definitely fair. With zerocopy there should be virtually no considerations to make while implementing this, and the struct can be made zero-copy later on. I think the only limitation is that once zero copy is implemented the struct's can't be changed anymore/need to be versioned, but that's not really relevant for implementing the wavelet matrix itself.

from vers.

Cydhra avatar Cydhra commented on August 22, 2024

That is unfortunately already true because of serde (I think). Changes that need to be made are collected in #9

from vers.

Related Issues (6)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.