g414-hash's People

Contributors

stevenschlansker, sunnygleason

g414-hash's Issues

Confusing int vs long with regards to FileOperations2.advanceBytes

Hi,

It seems that the size of a HashFile2 is limited by Integer.MAX_VALUE:

/**
 * Advances the file pointer by <code>count</code> bytes, throwing an
 * exception if the postion has exhausted a long (hopefully not likely).
 */
private static long advanceBytes(long pos, long count, boolean isLongPos)
        throws IOException {
    long newpos = pos + count;
    if (newpos < count || newpos > Integer.MAX_VALUE)
        throw new IOException("HashFile is too big.");
    return newpos;
}

All of the arguments are lovingly created as longs, and the isLongPos flag is never consulted, but if the position ever grows past Integer.MAX_VALUE (which is only ~2GB) then a "HashFile is too big." exception is thrown.
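
One possible shape for a fix, offered as a sketch rather than the project's actual change: it assumes the `isLongPos` flag is meant to say whether positions are stored as 64-bit values, which is a guess at the intent. The class name `AdvanceBytesSketch` is hypothetical.

```java
import java.io.IOException;

// Hypothetical sketch, not the library's code. The class name and the
// interpretation of isLongPos are assumptions made for illustration.
final class AdvanceBytesSketch {

    /**
     * Advances pos by count bytes. Overflow of the long itself is always
     * rejected; the Integer.MAX_VALUE ceiling applies only when the caller
     * says positions are NOT stored as longs.
     */
    static long advanceBytes(long pos, long count, boolean isLongPos)
            throws IOException {
        if (pos < 0 || count < 0) {
            throw new IOException("Negative position or count.");
        }
        long newpos;
        try {
            newpos = Math.addExact(pos, count); // throws on long overflow
        } catch (ArithmeticException e) {
            throw new IOException("HashFile is too big.", e);
        }
        if (!isLongPos && newpos > Integer.MAX_VALUE) {
            throw new IOException("HashFile is too big.");
        }
        return newpos;
    }
}
```

With this version a position just past 2GB is accepted when `isLongPos` is true and still rejected when it is false.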

Specify encoding in String.getBytes() method (for hashing)

There are a couple of places where Strings are converted to byte[] without specifying an encoding.
This is risky, since the default is to use the platform-specific default encoding, which varies from system to system (typically it's either Latin-1 or UTF-8; nowadays more commonly UTF-8).
It would probably make the most sense to just use UTF-8; if the caller wants another encoding, she can get the bytes explicitly and calculate the hash on the byte array.
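
To illustrate why this matters for hashing, here is a small sketch showing that the same String yields different byte sequences under Latin-1 and UTF-8, so a hash computed over those bytes would differ too. The class and method names are made up for the example and are not part of g414-hash.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only; EncodingSketch and utf8Bytes are hypothetical names.
final class EncodingSketch {

    /** Encodes with an explicit charset so the result is the same on every platform. */
    static byte[] utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String key = "caf\u00e9"; // "cafe" with an acute accent on the e
        byte[] latin1 = key.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = utf8Bytes(key);
        System.out.println(latin1.length);               // prints 4
        System.out.println(utf8.length);                 // prints 5
        System.out.println(Arrays.equals(latin1, utf8)); // prints false
    }
}
```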

Add 32-bit hash versions?

Given that the time to produce one 64-bit hash is probably not much less than the time to produce two 32-bit hashes, it would be useful to also have 32-bit versions. The specific use case is building Bloom filters, where the number of bits needed is variable: when, say, 80 bits are needed (8 hash functions over a 1024-bit mask, at 10 bits per index), it would be enough to calculate three 32-bit hashes instead of two 64-bit hashes.
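
As a rough sketch of the arithmetic: a 1024-bit mask needs 10 bits per index, so 8 indices need 80 bits, which fit in the 96 bits of three 32-bit hashes. The code below slices eight 10-bit indices out of three ints. The three input values are assumed to come from whatever 32-bit hash functions end up being available; the class and method names are hypothetical.

```java
// Hypothetical helper for illustration; not part of g414-hash.
final class BloomIndexSketch {

    static final int BITS_PER_INDEX = 10; // log2(1024)
    static final int NUM_INDICES = 8;     // 8 hash functions -> 80 bits total

    /**
     * Treats h0, h1, h2 as one 96-bit string (h0 holds the lowest bits)
     * and returns eight consecutive 10-bit slices, each in [0, 1023].
     */
    static int[] indices(int h0, int h1, int h2) {
        int[] words = {h0, h1, h2};
        int[] out = new int[NUM_INDICES];
        for (int i = 0; i < NUM_INDICES; i++) {
            int bitPos = i * BITS_PER_INDEX;
            int word = bitPos >>> 5;  // which 32-bit word the slice starts in
            int offset = bitPos & 31; // bit offset inside that word
            // Build a 64-bit window so a slice may straddle two words.
            long lo = words[word] & 0xFFFFFFFFL;
            long hi = (word + 1 < words.length)
                    ? (words[word + 1] & 0xFFFFFFFFL) : 0L;
            long window = lo | (hi << 32);
            out[i] = (int) ((window >>> offset) & 0x3FF);
        }
        return out;
    }
}
```

Each returned value can be used directly as a bit position in the 1024-bit mask; the last 16 bits of the third hash go unused.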

HashFile fails to get() some keys that were added and are contained in the HashFile.elements() set

Hi again,
I've come to bother you with what I presume is another bug :-)

https://github.com/stevenschlansker/g414-stress-test

I did not submit this one as a patch since the test dataset is 15MB compressed. Feel free to take any or all of the repo and use it as you please (including integrating it into the core if you think it is a good regression test).

Short version:
For 600K keys, the HashFile contains some data that cannot be fetched yet shows up via the iterator. About 40k keys are affected with the test data set.

I've included a bonus test showing that HashFile2 does not have this problem.

The test data is a scrubbed version of my production data, which is how I noticed this problem.

use string.getBytes("UTF-8") instead of string.getBytes()

From Tatu...

tatu: instead of getBytes(); latter uses whatever platform default is, which is system-dependent
[6:04pm] tatu: could also use ISO-8859-1 or such. but utf-8 is probably better. or utf-16
[6:04pm] tatu: also I think it'd be good to expose variant that takes byte[]... or maybe super-class has that?
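
A minimal sketch of the overload pair suggested in the chat: the byte[] variant is the primary entry point, and the String variant always encodes as UTF-8 before delegating. `java.util.zip.CRC32` is used purely as a stand-in hash so the example is self-contained; it is not the hash this library implements, and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustration of the API shape only; CRC32 is a placeholder hash.
final class HashOverloadSketch {

    /** Primary entry point: hashes raw bytes, so callers control the encoding. */
    static long hash(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** Convenience overload: always encodes as UTF-8, then delegates. */
    static long hash(String s) {
        return hash(s.getBytes(StandardCharsets.UTF_8));
    }
}
```

A caller who needs ISO-8859-1 or UTF-16 can call `getBytes` with that charset and pass the result to the byte[] variant.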
