fake-name / intraarchivededuplicator

Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation for fuzzy image searching.

License: BSD 3-Clause "New" or "Revised" License

Python 85.00% Shell 1.45% C++ 3.98% CSS 4.89% HTML 2.52% Cython 2.17%
postgresql image-search deduplication bk-tree python cython

intraarchivededuplicator's People

Contributors

fake-name, johannesbuchner

intraarchivededuplicator's Issues

getWithinDistance returns [0] in some cases

Hello, I've been using your C++ BK-tree implementation and just repackaged it as a Python package (https://github.com/gpip/cBKTree). One thing I noticed is that in some cases getWithinDistance returns [0] as the result, which seems to be some sort of sentinel node. This happens both when there are no matches within the search distance and when there are, but it doesn't happen in every situation.

Here's an example where that happens:

from cbktree import BkHammingTree, explicitSignCast

DATA = {
    # Format: id -> bitstring
    1: '1011010010010110110111111000001000001000100011110001010110111011',
    2: '1011010010010110110111111000001000000001100011110001010110111011',
    3: '1101011110100100001011001101001110010011100010011101001000110101',
}

SEARCH_DIST = 2  # 2 out of 64 bits

int_bits = lambda b: explicitSignCast(int(b, 2))


tree = BkHammingTree()
descriptor = {}
for node_id, bits in DATA.items():
    ib = int_bits(bits)
    descriptor[node_id] = ib
    tree.insert(ib, node_id)

# Find near matches for each node that was inserted.
for node_id, ib in descriptor.items():
    res = tree.getWithinDistance(ib, SEARCH_DIST)
    print("{}: {}".format(node_id, res))

# Find near matches for items that were not inserted.

new = '1101011110100100001011001101001110010011100010011101001000110101'
print("new: {}".format(tree.getWithinDistance(int_bits(new), SEARCH_DIST)))

ones = '1' * 64
print("111..: {}".format(tree.getWithinDistance(int_bits(ones), SEARCH_DIST)))

# XXX Should return empty, returns [0] instead.
zeroes = '0' * 64
print("000..: {}".format(tree.getWithinDistance(int_bits(zeroes), SEARCH_DIST)))

My understanding is that the last call should return an empty set, just like the call before that one does. Have you hit this before?
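For comparison, a brute-force linear scan gives the answers I would expect for the queries above. This is only a minimal sketch that does not use cbktree at all: `hamming` and `within_distance` are hypothetical helper names, and it works on the raw bit strings rather than on the sign-cast integers.

```python
# Brute-force reference: compare the query against every stored descriptor.
DATA = {
    # Format: id -> bitstring (same values as in the example above)
    1: '1011010010010110110111111000001000001000100011110001010110111011',
    2: '1011010010010110110111111000001000000001100011110001010110111011',
    3: '1101011110100100001011001101001110010011100010011101001000110101',
}

SEARCH_DIST = 2  # 2 out of 64 bits


def hamming(a, b):
    """Number of differing bits between two non-negative integers."""
    return bin(a ^ b).count("1")


def within_distance(items, query_bits, max_dist):
    """Return the sorted ids whose descriptor is within max_dist of the query."""
    query = int(query_bits, 2)
    return sorted(
        node_id
        for node_id, bits in items.items()
        if hamming(int(bits, 2), query) <= max_dist
    )


print(within_distance(DATA, DATA[1], SEARCH_DIST))   # [1, 2]
print(within_distance(DATA, '1' * 64, SEARCH_DIST))  # []
print(within_distance(DATA, '0' * 64, SEARCH_DIST))  # []
```

With this reference, the all-zeroes query has no match within 2 bits, so an empty result is what the tree should report as well.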

Initial Update

Hi 👊

This is my first visit to this fine repo, but it seems you have been working hard to keep all dependencies updated so far.

Once you have closed this issue, I'll create separate pull requests for every update as soon as I find one.

That's it for now!

Happy merging! 🤖

Usage/Example of Bk Tree

I am developing software that finds and removes duplicate or similar images.

I did some research and found that BK trees should be a good data structure for this: I am going to create perceptual hashes of the images, and I need to compare the distances between hashes in the BK tree, since the distance between two hashes reflects how similar the corresponding images are.

I looked at the file https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/deduplicator/bktree.hpp
but it is unclear to me how I am supposed to create hashes and compare them.

I think adding some examples of BK tree usage to the README would be helpful.
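To make the question concrete, here is a minimal pure-Python sketch of the general BK-tree technique with Hamming distance. It is not the API of bktree.hpp: the `BKTree` class, its `insert` and `within` methods, and the tiny 4-bit "hashes" are all assumptions made for illustration, and the perceptual hashes are assumed to have been computed elsewhere as plain integers.

```python
def hamming(a, b):
    """Number of differing bits between two non-negative integers."""
    return bin(a ^ b).count("1")


class BKTree:
    """Illustrative BK tree; each node is [value, ids, {edge_distance: child}]."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def insert(self, value, item_id):
        if self.root is None:
            self.root = [value, {item_id}, {}]
            return
        node = self.root
        while True:
            d = self.distance(value, node[0])
            if d == 0:
                # Same hash already stored: remember the extra id.
                node[1].add(item_id)
                return
            child = node[2].get(d)
            if child is None:
                node[2][d] = [value, {item_id}, {}]
                return
            node = child

    def within(self, value, max_dist):
        """Return the ids of all stored values within max_dist of value."""
        found = set()
        stack = [self.root] if self.root is not None else []
        while stack:
            node_value, ids, children = stack.pop()
            d = self.distance(value, node_value)
            if d <= max_dist:
                found |= ids
            # Triangle inequality: only subtrees whose edge distance lies
            # in [d - max_dist, d + max_dist] can contain matches.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return found


tree = BKTree(hamming)
for item_id, image_hash in [("a", 0b0000), ("b", 0b0001), ("c", 0b0011), ("d", 0b1111)]:
    tree.insert(image_hash, item_id)

print(sorted(tree.within(0b0000, 1)))  # ['a', 'b']
print(sorted(tree.within(0b0111, 1)))  # ['c', 'd']
```

In a deduplication workflow, the ids would be image paths or database keys, and every image whose hash falls within the chosen distance of another would be reported as a likely duplicate.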

Tree of uint64 instead of int64

Hi,

Given a 64-bit descriptor like the following:

descr = "1000000000000000000000000000000000000000000000000000000000000000"
assert len(descr) == 64

it seems there's no way to insert it correctly:

>>> int(descr, 2)
9223372036854775808L
>>> explicitSignCast(int(descr, 2))
0

That is, both 000...000 and 100...000 would be inserted as the number 0, both 000...001 and 100...001 as the number 1, and so on. Does this mean that the tree, as it stands right now, should be used with 63-bit descriptors instead?

Thank you
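For reference, this is the behaviour I expected from the cast: a two's-complement reinterpretation that maps every 64-bit pattern to a distinct int64. The sketch below is not the library's explicitSignCast; `to_signed64`, `to_unsigned64`, and `hamming64` are hypothetical helpers written only to illustrate the idea.

```python
MASK64 = (1 << 64) - 1


def to_signed64(value):
    """Reinterpret an unsigned 64-bit integer as a signed one (two's complement)."""
    if not 0 <= value <= MASK64:
        raise ValueError("value does not fit in 64 bits")
    return value - (1 << 64) if value >= (1 << 63) else value


def to_unsigned64(value):
    """Inverse of to_signed64: recover the original 64-bit pattern."""
    return value & MASK64


def hamming64(a, b):
    """Hamming distance between two signed 64-bit values."""
    # Masking keeps the XOR non-negative so that bin() shows all 64 bits.
    return bin((a ^ b) & MASK64).count("1")


descr = "1000000000000000000000000000000000000000000000000000000000000000"
signed = to_signed64(int(descr, 2))
print(signed)                # -9223372036854775808
print(hamming64(signed, 0))  # 1
```

With a cast like this, 100...000 and 000...000 stay distinct (they differ in exactly one bit), so all 64 bits of the descriptor would remain usable.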
