Giter VIP home page Giter VIP logo

Comments (7)

fd0 avatar fd0 commented on August 31, 2024 1

I'm fine with proposed changes to this library, but the changes must stay backwards compatible. We need to do any changes so that (without opting in) the behavior must not change (e.g. yield different chunks).

from chunker.

SaveTheRbtz avatar SaveTheRbtz commented on August 31, 2024

Hm... never mind -- seems like I've misused the API: using NewWithBoundaries does not automatically adjusts the splitmask to be an average of the two (which is not even possible when there there is no power of two between them). Calling SetAverageBits would likely fix my problem of skewed chunk sizes.

from chunker.

SaveTheRbtz avatar SaveTheRbtz commented on August 31, 2024

Actually, it is not that simple. On a random input it is indeed working fine but if I use a skewed input then output chunk sizes are heavily skewed towards minSize (regardless of the polynomyal).

If I change (digest&c.splitmask) == 0 to (digest&c.splitmask) == 1 then chunk size distribution becomes way better.

Problem also goes away if we match on exact number of zeroes instead of greater or equal, e.g.:

bits.TrailingZeros64(digest) == c.averageBits

Looking closer at values of digest on split, seems like values of 0, 10000, 1000000, 100000000, 10000000000, and 1000000000000 dominate the splits when input is a tar file (regardless of the polynomial). This is likely due to .tar format having runs of zeros that are longer than rolling hash's windowSize (bumping it to >=512 also fixes the problem):

00000000: 7061 785f 676c 6f62 616c 5f68 6561 6465  pax_global_heade
00000010: 7200 0000 0000 0000 0000 0000 0000 0000  r...............
00000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................

[1] i'm using first 10 MiB of an uncompressed linux kernel tar as an input:

$ git archive --format tar -o ~/tmp/linux-v5.15.tar v5.15

from chunker.

fd0 avatar fd0 commented on August 31, 2024

Hm, interesting observation! I don't have much time right now, but I remember that somewhere in the original paper an attribute of the digest algorithm was described: if you feed it blocks of zeroes, the digest stays at zero. That leads to blocks of zeroes yielding always the smalles block size possible. For our use case (restic backup program) with rather large min block sizes that's good, because it deduplicates the "zero block" and just stores it once. Changing that to 1 changes the behavior significantly, if that's good or bad depends on your use case I guess.

from chunker.

SaveTheRbtz avatar SaveTheRbtz commented on August 31, 2024

Is it possible to change it to either:

bits.TrailingZeros64(digest) == c.averageBits

or a more gentle:

(digest != 0 && (digest&c.splitmask) == 0)

this should not invalidate all the existing chunks but only ones that end with a long run of zeros and hence were erroneously aggregated. IIUC restic should not be too affected given that its default min/max sizes are enormous and therefore "zero-only" segments would not be de-duplicated only if all of them are between min and max: ones smaller than min are already skipped, and the ones bigger than max would be forcefully cut.

from chunker.

SaveTheRbtz avatar SaveTheRbtz commented on August 31, 2024

What do you think about the following (API-compatible) change:

func New(rd io.Reader, pol Pol, opts ...Option) *Chunker

With options defined as following:

type Option func(*Chunker)

func WithBuffer(buf []byte) Option {
	return func(c *Chunker)  {
		c.buf = buf
	}
}

func WithBoundaries(min, max uint) Option {
	return func(c *Chunker) {
		c.MinSize = min
		c.MaxSize = max
	}
}

func WithAverageBits(averageBits int) Option {
	return func(c *Chunker) {
		c.splitmask = (1 << uint64(averageBits)) - 1
	}
}

This would allow to define more options in the future (including one for skipping 0 digests) without the need for more NewWith* constructors.

As a side benefit some of the methods can be simplified, e.g.:
NewWithBoundaries:

func NewWithBoundaries(rd io.Reader, pol Pol, min, max uint) *Chunker {
	return New(rd, pol, WithBoundaries(min, max))
}

Reset:

func (c *Chunker) Reset(rd io.Reader, pol Pol, opts ...Option) {
	opts = append(opts, WithBuffer(c.buf))
	*c = *New(rd, pol, opts...)
}

ResetWithBoundaries:

func (c *Chunker) ResetWithBoundaries(rd io.Reader, pol Pol, min, max uint) {
	c.Reset(rd, pol, WithBoundaries(min, max))
}

A draft: #37 (all existing tests pass w/o modification)

from chunker.

fd0 avatar fd0 commented on August 31, 2024

Yeah, I'm familiar with this options pattern, and I like it :)

Thanks for working on this!

from chunker.

Related Issues (15)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.