
deduplication's People

Contributors

jorangreef

deduplication's Issues

Shift-Right in Gear rollsum messes with resynchronisation after changes.

This implementation of FastCDC has changed the Gear rollsum to use a right-shift instead of a left-shift. This is done so that the least-significant bits of the hash include data from all 32 previous bytes in the rolling window, so that "zero-padding" of the hash judgement mask is not required.

However, this is problematic because it means bytes don't fully "expire" from the hash: the hash no longer depends on only the last 32 bytes, and bytes from the very start of the block can still modify it. This is because the addition in the Gear rollsum propagates some bit changes upwards, so the right-shift might never fully expire a byte out of the hash. In some ways this improves the entropy/distribution of the hash, but it also means a change early in a chunk can change the break point at the end of the chunk, messing with resynchronization.
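To make the expiry difference concrete, here is a small standalone sketch (not this library's actual code; the GEAR table, window size and trial count are all illustrative) that counts how often flipping a byte more than 32 positions before the end of a window still changes the final hash:

  const crypto = require('crypto')

  // Hypothetical 256-entry table of random uint32 values standing in for the Gear table:
  const GEAR = new Uint32Array(256)
  for (let i = 0; i < 256; i++) GEAR[i] = crypto.randomBytes(4).readUInt32LE(0)

  // Classic Gear: hash = (hash << 1) + GEAR[byte], modulo 2^32.
  function gearLeft (bytes) {
    let hash = 0
    for (const b of bytes) hash = ((hash << 1) + GEAR[b]) >>> 0
    return hash
  }

  // The right-shift variant: hash = (hash >>> 1) + GEAR[byte], modulo 2^32.
  function gearRight (bytes) {
    let hash = 0
    for (const b of bytes) hash = ((hash >>> 1) + GEAR[b]) >>> 0
    return hash
  }

  let leftDiffers = 0
  let rightDiffers = 0
  for (let trial = 0; trial < 1000; trial++) {
    const a = crypto.randomBytes(100)
    const b = Buffer.from(a)
    b[0] ^= 0xff // flip a byte 99 positions before the end, far outside a 32-byte window
    if (gearLeft(a) !== gearLeft(b)) leftDiffers++
    if (gearRight(a) !== gearRight(b)) rightDiffers++
  }
  console.log('left-shift:  old byte still matters in', leftDiffers, 'of 1000 trials')  // always 0
  console.log('right-shift: old byte still matters in', rightDiffers, 'of 1000 trials') // typically > 0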

One way to address this would be to replace the addition with an xor, so that bit changes don't propagate upwards. This would make it like a simplified buzzhash. However, this would further weaken the hash, and Gear is already very weak.
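As a sketch of that xor-based alternative (reusing the hypothetical GEAR table from the sketch above; this is not something this implementation does):

  // Xor never generates carries, so an old byte's bits can only move down and fall
  // off the bottom of the hash after 32 shifts, at the cost of further weakening it:
  function gearRightXor (bytes) {
    let hash = 0
    for (const b of bytes) hash = ((hash >>> 1) ^ GEAR[b]) >>> 0
    return hash
  }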

The number of bits in the Gear rollsum directly limits how many bytes are included in the sliding window, so reducing the number of bits from 32 to 31 directly reduces the sliding window size. The Gear rollsum is already not a particularly strong rollsum, so this could matter. In languages that don't have uint32 types this might be unavoidable, but copying it in languages without that limitation unnecessarily limits the rollsum strength.

This implementation has been widely copied, so this problem affects at least the following implementations as well:

For the record, FastCDC's zero-padding solution when using a left-shift is also IMHO sub-optimal. We know that the most significant bits include the most information about the past 32 bytes, so the best solution is simply to use the N most significant bits for a 2^N expected size. This can be done with a mask that checks the upper bits are zero, as in the FastCDC paper, like this:

  MASK = (-1 << (32-N))

  if (!(hash & MASK))
    <it's a chunk boundary>

But a shift-right is probably just as fast and an even simpler way to do the same thing:

  SHIFT = 32 - N

  if !(hash >> SHIFT):
     <it's a chunk boundary>

However, checking whether the hash value is smaller than a threshold is just as fast and even more flexible. It allows arbitrary expected size values, not just powers of 2. It uses all the bits in the hash and treats the most significant bits as most important, which matches nicely with the characteristics of the Gear rollsum:

  THRESHOLD = 2^32 / tgt_size

  if (hash < THRESHOLD):
     <it's a chunk boundary>

Note that the above suggestions assume uint32 support. They will need obvious modifications if using a 31-bit hash.
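Putting the three judgement tests side by side as plain uint32 JavaScript (a sketch; N, tgtSize and hash are illustrative names, not this library's API):

  const N = 16                                    // expected chunk size 2^N for the first two tests
  const MASK = (-1 << (32 - N)) >>> 0             // top N bits set
  const SHIFT = 32 - N
  const tgtSize = 65536                           // any expected size, not just a power of 2
  const THRESHOLD = Math.floor(2 ** 32 / tgtSize)

  const maskBoundary = (hash) => ((hash & MASK) >>> 0) === 0
  const shiftBoundary = (hash) => (hash >>> SHIFT) === 0
  const thresholdBoundary = (hash) => (hash >>> 0) < THRESHOLD

  // For tgtSize = 2^N all three tests accept exactly the same hash values.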

Observations and Recommendations using this chunker.

This is not so much a bug as general observations and recommendations about using this chunker. I'm putting this here since this chunker is widely copied by implementations in other languages, and people copying it probably also want to know this. I'm not sure where else this kind of info should go, so apologies if this is annoying. This comes out of some testing and analysis I did of chunking algorithms here:

https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst

Observations related to this FastCDC implementation are:

  1. The average argument is not really the average block length. The actual average is very close to minimum + average/2 (for reasons explained below).
  2. This implementation changes the FastCDC "transition point" to max(0, average - 1.5*minimum). This means the transition point is less than minimum unless minimum < 0.4*average.
  3. FastCDC gets most of its speed advantage from "cut-point-skipping" past minimum, and recommends minimum = 0.5*average or even minimum = average. So for the FastCDC-recommended minimum settings that give good speed, the first mask is never used and only the second mask is used.
  4. When only the second mask is used, FastCDC's "normalized chunking" feature is not used. Since this implementation uses "NC1" with only one bit of normalization, it behaves like a normal exponential chunker with average set to half the value. An exponential chunker's actual average block length (excluding truncation from maximum) is minimum + average, hence this chunker's actual average of minimum + average/2 (provided maximum is set large enough to minimize truncations).
  5. Not using "normalized chunking" is probably a good thing anyway, since my testing suggests it is worse than simple exponential chunking when compared at the same actual average and minimum chunk sizes.

With all of that in mind, based on my testing, the best minimum/average/maximum settings for this chunker for optimal deduplication with good speed are minimum = average/2 and maximum > 2.5*average, giving an actual average chunk length of average. For faster chunking with slightly reduced deduplication, use minimum = average and maximum > 3*average, giving an actual average chunk length of 1.5*average. Unfortunately this chunker throws an error if minimum >= average, so it doesn't let you do that; however, you can get away with setting minimum = average - 1.
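As a rough worked example of those numbers (a sketch only; it ignores any additional bounds the library places on its arguments, and assumes the observation above that the actual average is about minimum + average/2 when maximum is large enough to make truncation rare):

// Derive settings for this chunker from a desired actual average chunk size:
function settingsForActualAverage (actual) {
  const average = actual                        // the "average" argument to pass in
  const minimum = average / 2                   // cut-point-skipping for speed
  const maximum = Math.ceil(2.5 * average) + 1  // keep truncation at maximum rare
  const predicted = minimum + average / 2       // predicted actual average = actual
  return { average, minimum, maximum, predicted }
}

console.log(settingsForActualAverage(65536))
// { average: 65536, minimum: 32768, maximum: 163841, predicted: 65536 }
// For the faster variant, minimum = average - 1 with maximum > 3*average gives an
// actual average of roughly 1.5*average.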

Would a Promise-based example be useful?

The demo is all well and good, but for an application, something based on Promises seems more suitable. I offer this bit of code that is working for me:

const fs = require('fs')
// Assuming the library is installed and required as @ronomon/deduplication:
const dedupe = require('@ronomon/deduplication')

function findFileChunks (infile, average) {
  const fd = fs.openSync(infile, 'r')
  const minimum = Math.round(average / 2)
  const maximum = average * 2
  const source = Buffer.alloc(maximum * 2)
  const target = Buffer.alloc(dedupe.targetSize(minimum, source.length))

  return new Promise((resolve, reject) => {
    let flags = 0
    const close = (error) => {
      fs.closeSync(fd)
      if (error) {
        // force the loop to exit
        flags = 1
        reject(error)
      }
    }

    let chunks = []
    let fileOffset = 0
    let chunkOffset = 0
    let sourceStart = 0

    // Read the file in large blocks, chunking each block and carrying any
    // unconsumed trailing bytes forward into the next read:
    while (flags === 0) {
      const length = source.length - sourceStart
      const bytesRead = fs.readSync(fd, source, sourceStart, length, fileOffset)
      fileOffset += bytesRead
      // a short read means we have reached the end of the file; flags = 1 marks
      // this as the final block for the library
      flags = (bytesRead < length) ? 1 : 0
      const sourceSize = sourceStart + bytesRead
      try {
        dedupe.deduplicate(average, minimum, maximum, source, 0, sourceSize, target, 0, flags,
          (error, sourceOffset, targetOffset) => {
            // n.b. the library throws the error, so this is always undefined
            if (error) {
              close(error)
              return
            }
            let offset = 0
            while (offset < targetOffset) {
              const hash = target.slice(offset, offset + 32)
              offset += 32
              const size = target.readUInt32BE(offset)
              offset += 4
              chunks.push({ hash, offset: chunkOffset, size })
              chunkOffset += size
            }
            // Anything remaining in the source buffer should be moved to the
            // beginning of the source buffer, and become the sourceStart for the
            // next read so that we do not read data we have already read:
            sourceStart = sourceSize - sourceOffset
            if (sourceStart > 0) {
              source.copy(source, 0, sourceOffset, sourceOffset + sourceStart)
            }
            if (flags !== 0) {
              // the last block has finished processing
              close()
              resolve(chunks)
            }
          }
        )
      } catch (err) {
        close(err)
      }
    }
  })
}

For demo purposes, this would need a .then() and some console logging to match the current demo. Let me know if this is useful and I can create a proper pull request. Or not, and feel free to close this.
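Along those lines, a minimal caller might look something like this (a sketch; the file path and average are illustrative):

findFileChunks('/path/to/file', 65536)
  .then((chunks) => {
    for (const { hash, offset, size } of chunks) {
      console.log(hash.toString('hex'), offset, size)
    }
  })
  .catch((error) => {
    console.error(error)
  })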
