
Comments (3)

fd0 commented on August 31, 2024

FYI, here's a library which contains my implementation of the rolling hash algorithm alone: https://github.com/chmduquesne/rollinghash/tree/master/rabinkarp64

It'll provide the raw hash value so the caller can decide what to do and where to split. I'm closing this (rather old) issue for now, please feel free to add further comments. Thanks!
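
For illustration, here is a minimal sketch of how a caller might drive that library and pick the cut points itself. The window size and the split mask below are arbitrary assumptions for the example, not values prescribed by the library:

```go
package main

import (
	"fmt"
	"math/rand"

	"github.com/chmduquesne/rollinghash/rabinkarp64"
)

func main() {
	data := make([]byte, 1<<20)
	rand.Read(data) // stand-in for real file content

	const window = 64              // sliding window size (assumption)
	const mask = uint64(1)<<13 - 1 // cut when the low 13 bits are zero (assumption)

	h := rabinkarp64.New()
	h.Write(data[:window]) // prime the rolling window

	last := 0
	for i := window; i < len(data); i++ {
		h.Roll(data[i])
		// The library only yields the raw hash; the split policy is ours.
		if h.Sum64()&mask == 0 {
			fmt.Printf("chunk [%d, %d)\n", last, i+1)
			last = i + 1
			// A real chunker would also reset the hash state here.
		}
	}
	fmt.Printf("chunk [%d, %d)\n", last, len(data))
}
```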


fd0 commented on August 31, 2024

Hey, thanks for raising this issue. The main application for this package is the restic backup program, and blocks of zeroes are often found in binary files like VM images. During development I thought about how to handle blocks of zeroes and decided to implement it this way for the following reasons:

  • For backups, I estimated that it is a good idea to split a file as soon as a block of zero bytes (at least 64 bytes) is found and the chunk is already large enough, i.e. at least MinSize bytes.
  • The chunker can be seeded by specifying a randomly generated polynomial, but this does not change the behavior for blocks of null bytes. I think this is good for a backup program.
  • In my early benchmarks, testing the digest for zero was much faster than testing it against any other value.

When files contain a large number of zeroes (think sparse files or VM images), this results in many small chunks of MinSize bytes each. With your proposal, for a long run of null bytes the chunker won't find any split mark at all and will return chunks of MaxSize bytes. While this is a bit better in one respect (after all, we have fewer blocks to care about), I like the behavior for smaller blocks of zero bytes (as outlined above) better.
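
To make the current behavior easy to observe, here is a small sketch that runs the chunker over a buffer with a large zero run in the middle (the buffer layout is made up for the example). The zero region comes out as a series of MinSize chunks; under the proposal in this issue it would instead come out as MaxSize chunks:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"math/rand"

	"github.com/restic/chunker"
)

func main() {
	// 4 MiB of random data, 8 MiB of zeroes, 4 MiB of random data:
	// a crude stand-in for a sparse VM image.
	buf := make([]byte, 16<<20)
	rand.Read(buf[:4<<20])
	rand.Read(buf[12<<20:])

	pol, err := chunker.RandomPolynomial()
	if err != nil {
		panic(err)
	}

	c := chunker.New(bytes.NewReader(buf), pol)
	for {
		chunk, err := c.Next(make([]byte, chunker.MaxSize))
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		fmt.Printf("chunk at %8d, length %8d\n", chunk.Start, chunk.Length)
	}
}
```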

As already described in #9, I won't change the behavior lightly, as the deduplication in restic depends on the chunker.

What do you think?


green-coder commented on August 31, 2024

I totally understand that you won't change how the chunks are chosen, since you already have an app in production that uses it. But the topic is interesting, so let's keep discussing it.

Large series of zeroes (or other constant bytes) are the problem, not how we cut them: in CDC, a lack of variance in the data means a lack of possible markers. Creating small chunks (via minSize) or large chunks (via maxSize) from that kind of data is not about how well we cut; it is about what we choose to do with those embarrassing blocks (fewer, bigger chunks vs. more, smaller ones).

IMHO, a good marker is one whose probability of appearing is as close to a uniform distribution as possible. The purpose is to obtain a low variance in chunk size, as well as an average chunk size as close as possible to what we expect. That's especially useful if we choose the minSize and maxSize parameters based on those expectations.

Each time we run into either minSize or maxSize while chunking, because of too many markers or not enough of them, we are no longer doing CDC and we lose the advantages associated with it.
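
To put rough numbers on that, assume a uniform hash and a marker defined as "the low k bits of the hash are zero"; the k, MinSize, and MaxSize values below are assumptions for the example. The distance to the next marker is then geometrically distributed, which gives both the expected chunk size and the probability of being forced into a MaxSize cut:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const (
		minSize = 512 << 10 // assumed MinSize: 512 KiB
		maxSize = 8 << 20   // assumed MaxSize: 8 MiB
		k       = 20        // marker: low 20 bits of the hash are zero
	)

	p := 1.0 / float64(uint64(1)<<k) // per-position marker probability

	// With a uniform hash, the distance to the next marker is geometric,
	// so the expected chunk size is MinSize + 1/p.
	fmt.Printf("expected chunk size: %.0f bytes\n", minSize+1/p)

	// Probability that no marker shows up before MaxSize, i.e. that the
	// chunker is forced to cut at MaxSize and CDC degrades to fixed-size:
	forced := math.Pow(1-p, maxSize-minSize)
	fmt.Printf("forced MaxSize cut probability: %.5f\n", forced)
}
```

For constant data, the per-position marker probability stops being uniform altogether (every window hashes to the same value), which is exactly the lack-of-variance problem above.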

In practice, I have no idea whether your approach is better or not; unfortunately, the problem is really complex.

