
deduplication's People

Contributors

jorangreef

deduplication's Issues

Shift-Right in Gear rollsum messes with resynchronisation after changes.

This implementation of FastCDC has changed the Gear rollsum to use a right-shift instead of a left-shift. This is done so that the least-significant bits of the hash include data from all 32 previous bytes in the rolling window, so that "zero-padding" of the hash judgement mask is not required.

However, this is problematic because it means bytes don't fully "expire" from the hash: the hash no longer depends on only the last 32 bytes, and bytes from the very start of the block can still modify it. This is because the addition in the Gear rollsum propagates some bit changes upwards, so the right-shift might never fully expire a byte out of the hash. In some ways this improves the entropy/distribution of the hash, but it also means a change early in a chunk can change the break point at the end of the chunk, messing with resynchronization.
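To make the expiry difference concrete, here is a small standalone sketch (not this library's actual code; the GEAR table, window size and trial count are all illustrative) that counts how often flipping a byte more than 32 positions before the end of a window still changes the final hash:

  const crypto = require('crypto')

  // Hypothetical 256-entry table of random uint32 values standing in for the Gear table:
  const GEAR = new Uint32Array(256)
  for (let i = 0; i < 256; i++) GEAR[i] = crypto.randomBytes(4).readUInt32LE(0)

  // Classic Gear: hash = (hash << 1) + GEAR[byte], modulo 2^32.
  function gearLeft (bytes) {
    let hash = 0
    for (const b of bytes) hash = ((hash << 1) + GEAR[b]) >>> 0
    return hash
  }

  // The right-shift variant: hash = (hash >>> 1) + GEAR[byte], modulo 2^32.
  function gearRight (bytes) {
    let hash = 0
    for (const b of bytes) hash = ((hash >>> 1) + GEAR[b]) >>> 0
    return hash
  }

  let leftDiffers = 0
  let rightDiffers = 0
  for (let trial = 0; trial < 1000; trial++) {
    const a = crypto.randomBytes(100)
    const b = Buffer.from(a)
    b[0] ^= 0xff // flip a byte 99 positions before the end, far outside a 32-byte window
    if (gearLeft(a) !== gearLeft(b)) leftDiffers++
    if (gearRight(a) !== gearRight(b)) rightDiffers++
  }
  console.log('left-shift:  old byte still matters in', leftDiffers, 'of 1000 trials')  // always 0
  console.log('right-shift: old byte still matters in', rightDiffers, 'of 1000 trials') // typically > 0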

One way to address this would be to replace the addition with an xor, so that bit changes don't propagate upwards. This would make it like a simplified buzzhash. However, this would further weaken the hash, and Gear is already very weak.
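As a sketch of that xor-based alternative (reusing the hypothetical GEAR table from the sketch above; this is not something this implementation does):

  // Xor never generates carries, so an old byte's bits can only move down and fall
  // off the bottom of the hash after 32 shifts, at the cost of further weakening it:
  function gearRightXor (bytes) {
    let hash = 0
    for (const b of bytes) hash = ((hash >>> 1) ^ GEAR[b]) >>> 0
    return hash
  }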

The number of bits in the Gear rollsum directly limits how many bytes are included in the sliding window, so reducing the number of bits from 32 to 31 directly reduces the sliding window size. The Gear rollsum is already not a particularly strong rollsum, so this could matter. In languages that don't have uint32 types this might be unavoidable, but copying it in languages without that limitation unnecessarily limits the rollsum strength.

This implementation has been widely copied, so this problem affects at least the following implementations as well:

For the record, FastCDC's zero-padding solution when using a left-shift is also IMHO sub-optimal. We know that the most significant bits include the most information about the past 32 bytes, so the best solution is simply to use the N most significant bits for a 2^N expected size. This can be done with a mask that checks the upper bits are zero, as in the FastCDC paper, like this:

  MASK = (-1 << (32-N))

  if (!(hash & MASK))
    <it's a chunk boundary>

But a shift-right is probably just as fast and an even simpler way to do the same thing:

  SHIFT = 32 - N

  if !(hash >> SHIFT):
     <it's a chunk boundary>

However, checking whether the hash value is smaller than a threshold is just as fast and even more flexible. It allows arbitrary expected size values, not just powers of 2. It uses all the bits in the hash and treats the most significant bits as most important, which matches nicely with the characteristics of the Gear rollsum:

  THRESHOLD = 2^32 / tgt_size

  if (hash < THRESHOLD):
     <it's a chunk boundary>

Note that the above suggestions assume uint32 support. They will need obvious modifications if using a 31-bit hash.
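Putting the three judgement tests side by side as plain uint32 JavaScript (a sketch; N, tgtSize and hash are illustrative names, not this library's API):

  const N = 16                                    // expected chunk size 2^N for the first two tests
  const MASK = (-1 << (32 - N)) >>> 0             // top N bits set
  const SHIFT = 32 - N
  const tgtSize = 65536                           // any expected size, not just a power of 2
  const THRESHOLD = Math.floor(2 ** 32 / tgtSize)

  const maskBoundary = (hash) => ((hash & MASK) >>> 0) === 0
  const shiftBoundary = (hash) => (hash >>> SHIFT) === 0
  const thresholdBoundary = (hash) => (hash >>> 0) < THRESHOLD

  // For tgtSize = 2^N all three tests accept exactly the same hash values.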

Observations and Recommendations using this chunker.

This is not so much a bug as general observations and recommendations about using this chunker. I'm putting this here since this chunker is widely copied by implementations in other languages, and people copying it probably also want to know this. I'm not sure where else this kind of info should go, so apologies if this is annoying. This comes out of some testing and analysis I did of chunking algorithms here:

https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst

Observations related to this FastCDC implementation are:

  1. The average argument is not really the average block length. The actual average is very close to minimum + average/2 (for reasons explained below).
  2. This implementation changes the FastCDC "transition point" to max(0, average - 1.5*minimum). This means the transition point is less than minimum unless minimum < 0.4*average.
  3. FastCDC gets most of its speed advantage from "cut-point-skipping" past minimum, and recommends minimum = 0.5*average or even minimum = average. So for the FastCDC-recommended minimum settings that give good speed, the first mask is never used and only the second mask is used.
  4. When only the second mask is used, FastCDC's "normalized chunking" feature is not used. Since this implementation uses "NC1" with only one bit of normalization, it behaves like a normal exponential chunker with average set to half the value. An exponential chunker's actual average block length (excluding truncation from maximum) is minimum + average, hence this chunker's actual average of minimum + average/2 (provided maximum is set large enough to minimize truncations).
  5. Not using "normalized chunking" is probably a good thing anyway, since my testing suggests it is worse than simple exponential chunking when compared at the same actual average and minimum chunk sizes.

With all of that in mind, based on my testing, the best minimum/average/maximum settings for this chunker for optimal deduplication with good speed are minimum = average/2 and maximum > 2.5*average, giving an actual average chunk length of average. For faster chunking with slightly reduced deduplication, use minimum = average and maximum > 3*average, giving an actual average chunk length of 1.5*average. Unfortunately this chunker throws an error if minimum >= average, so it doesn't let you do that; however, you can get away with setting minimum = average - 1.
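As a rough worked example of those numbers (a sketch only; it ignores any additional bounds the library places on its arguments, and assumes the observation above that the actual average is about minimum + average/2 when maximum is large enough to make truncation rare):

// Derive settings for this chunker from a desired actual average chunk size:
function settingsForActualAverage (actual) {
  const average = actual                        // the "average" argument to pass in
  const minimum = average / 2                   // cut-point-skipping for speed
  const maximum = Math.ceil(2.5 * average) + 1  // keep truncation at maximum rare
  const predicted = minimum + average / 2       // predicted actual average = actual
  return { average, minimum, maximum, predicted }
}

console.log(settingsForActualAverage(65536))
// { average: 65536, minimum: 32768, maximum: 163841, predicted: 65536 }
// For the faster variant, minimum = average - 1 with maximum > 3*average gives an
// actual average of roughly 1.5*average.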

Would a Promise-based example be useful?

The demo is all well and good, but for an application, something based on Promises seems more suitable. I offer this bit of code that is working for me:

const fs = require('fs')
// Assuming the library is installed and required as @ronomon/deduplication:
const dedupe = require('@ronomon/deduplication')

function findFileChunks (infile, average) {
  const fd = fs.openSync(infile, 'r')
  const minimum = Math.round(average / 2)
  const maximum = average * 2
  const source = Buffer.alloc(maximum * 2)
  const target = Buffer.alloc(dedupe.targetSize(minimum, source.length))

  return new Promise((resolve, reject) => {
    let flags = 0
    const close = (error) => {
      fs.closeSync(fd)
      if (error) {
        // force the loop to exit
        flags = 1
        reject(error)
      }
    }

    let chunks = []
    let fileOffset = 0
    let chunkOffset = 0
    let sourceStart = 0

    // Read the file in large blocks, chunking each block and carrying any
    // unconsumed trailing bytes forward into the next read:
    while (flags === 0) {
      const length = source.length - sourceStart
      const bytesRead = fs.readSync(fd, source, sourceStart, length, fileOffset)
      fileOffset += bytesRead
      // a short read means we have reached the end of the file; flags = 1 marks
      // this as the final block for the library
      flags = (bytesRead < length) ? 1 : 0
      const sourceSize = sourceStart + bytesRead
      try {
        dedupe.deduplicate(average, minimum, maximum, source, 0, sourceSize, target, 0, flags,
          (error, sourceOffset, targetOffset) => {
            // n.b. the library throws the error, so this is always undefined
            if (error) {
              close(error)
              return
            }
            let offset = 0
            while (offset < targetOffset) {
              const hash = target.slice(offset, offset + 32)
              offset += 32
              const size = target.readUInt32BE(offset)
              offset += 4
              chunks.push({ hash, offset: chunkOffset, size })
              chunkOffset += size
            }
            // Anything remaining in the source buffer should be moved to the
            // beginning of the source buffer, and become the sourceStart for the
            // next read so that we do not read data we have already read:
            sourceStart = sourceSize - sourceOffset
            if (sourceStart > 0) {
              source.copy(source, 0, sourceOffset, sourceOffset + sourceStart)
            }
            if (flags !== 0) {
              // the last block has finished processing
              close()
              resolve(chunks)
            }
          }
        )
      } catch (err) {
        close(err)
      }
    }
  })
}

For demo purposes, this would need a .then() and some console logging to match the current demo. Let me know if this is useful and I can create a proper pull request. Or not, and feel free to close this.
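Along those lines, a minimal caller might look something like this (a sketch; the file path and average are illustrative):

findFileChunks('/path/to/file', 65536)
  .then((chunks) => {
    for (const { hash, offset, size } of chunks) {
      console.log(hash.toString('hex'), offset, size)
    }
  })
  .catch((error) => {
    console.error(error)
  })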
