
Comments (3)

fd0 commented on August 31, 2024

FYI, here's a library which contains my implementation of the rolling hash algorithm alone: https://github.com/chmduquesne/rollinghash/tree/master/rabinkarp64

It'll provide the raw hash value so the caller can decide what to do and where to split. I'm closing this (rather old) issue for now, please feel free to add further comments. Thanks!
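
For illustration, here is a minimal sketch of how a caller might drive that library and pick the cut points itself. The window size and the split mask below are arbitrary assumptions for the example, not values prescribed by the library:

```go
package main

import (
	"fmt"
	"math/rand"

	"github.com/chmduquesne/rollinghash/rabinkarp64"
)

func main() {
	data := make([]byte, 1<<20)
	rand.Read(data) // stand-in for real file content

	const window = 64              // sliding window size (assumption)
	const mask = uint64(1)<<13 - 1 // cut when the low 13 bits are zero (assumption)

	h := rabinkarp64.New()
	h.Write(data[:window]) // prime the rolling window

	last := 0
	for i := window; i < len(data); i++ {
		h.Roll(data[i])
		// The library only yields the raw hash; the split policy is ours.
		if h.Sum64()&mask == 0 {
			fmt.Printf("chunk [%d, %d)\n", last, i+1)
			last = i + 1
			// A real chunker would also reset the hash state here.
		}
	}
	fmt.Printf("chunk [%d, %d)\n", last, len(data))
}
```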


fd0 commented on August 31, 2024

Hey, thanks for raising this issue. The main application for this package is the restic backup program, and blocks of zeroes are often found in binary files like VM images. During development I thought about how to handle blocks of zeroes and decided to implement it this way for the following reasons:

  • For backups, I estimated that it is a good idea to split a file as soon as a block of zero bytes (at least 64 bytes) is found and the chunk is already large enough, i.e. at least MinSize bytes.
  • The chunker can be seeded by specifying a randomly generated polynomial, but this does not change the behavior for blocks of null bytes. I think this is good for a backup program.
  • In my early benchmarks, testing the digest for zero was much faster than testing it against any other value.

When files contain a large number of zeroes (think sparse files or VM images), this results in many small chunks of MinSize bytes each. With your proposal, for a long run of null bytes the chunker won't find any split mark at all and will return chunks of MaxSize bytes. While this is a bit better in one respect (after all, we have fewer blocks to care about), I like the behavior for smaller blocks of zero bytes (as outlined above) better.
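
To make the current behavior easy to observe, here is a small sketch that runs the chunker over a buffer with a large zero run in the middle (the buffer layout is made up for the example). The zero region comes out as a series of MinSize chunks; under the proposal in this issue it would instead come out as MaxSize chunks:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"math/rand"

	"github.com/restic/chunker"
)

func main() {
	// 4 MiB of random data, 8 MiB of zeroes, 4 MiB of random data:
	// a crude stand-in for a sparse VM image.
	buf := make([]byte, 16<<20)
	rand.Read(buf[:4<<20])
	rand.Read(buf[12<<20:])

	pol, err := chunker.RandomPolynomial()
	if err != nil {
		panic(err)
	}

	c := chunker.New(bytes.NewReader(buf), pol)
	for {
		chunk, err := c.Next(make([]byte, chunker.MaxSize))
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		fmt.Printf("chunk at %8d, length %8d\n", chunk.Start, chunk.Length)
	}
}
```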

As already described in #9, I won't change the behavior lightly, as the deduplication in restic depends on the chunker.

What do you think?


green-coder commented on August 31, 2024

I totally understand that you won't change how the chunks are chosen, since you already have an app in production that uses it. But the topic is interesting, so let's keep discussing it.

Large series of zeroes (or other constant bytes) are the problem, not how we cut them: in CDC, a lack of variance in the data means a lack of possible markers. Creating small chunks (via minSize) or large chunks (via maxSize) from that kind of data is not about how well we cut; it is about what we choose to do with those embarrassing blocks (fewer, bigger chunks vs. more, smaller ones).

IMHO, a good marker is one whose probability of appearing is as close to a uniform distribution as possible. The purpose is to obtain a low variance in chunk size, as well as an average chunk size as close as possible to what we expect. That's especially useful if we choose the minSize and maxSize parameters based on those expectations.

Each time we run into either minSize or maxSize while chunking, because of too many markers or not enough of them, we are no longer doing CDC and we lose the advantages associated with it.
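
To put rough numbers on that, assume a uniform hash and a marker defined as "the low k bits of the hash are zero"; the k, MinSize, and MaxSize values below are assumptions for the example. The distance to the next marker is then geometrically distributed, which gives both the expected chunk size and the probability of being forced into a MaxSize cut:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const (
		minSize = 512 << 10 // assumed MinSize: 512 KiB
		maxSize = 8 << 20   // assumed MaxSize: 8 MiB
		k       = 20        // marker: low 20 bits of the hash are zero
	)

	p := 1.0 / float64(uint64(1)<<k) // per-position marker probability

	// With a uniform hash, the distance to the next marker is geometric,
	// so the expected chunk size is MinSize + 1/p.
	fmt.Printf("expected chunk size: %.0f bytes\n", minSize+1/p)

	// Probability that no marker shows up before MaxSize, i.e. that the
	// chunker is forced to cut at MaxSize and CDC degrades to fixed-size:
	forced := math.Pow(1-p, maxSize-minSize)
	fmt.Printf("forced MaxSize cut probability: %.5f\n", forced)
}
```

For constant data, the per-position marker probability stops being uniform altogether (every window hashes to the same value), which is exactly the lack-of-variance problem above.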

In practice, I have no idea whether your approach is better or not; unfortunately, the problem is really complex.

