duplicateimages's People

Contributors

lene, rwxguo

duplicateimages's Issues

Reliability and configuration

I am using it on around 10,000 images and finding a lot of duplicates. I don't want to make a mistake here and filter out images that aren't actually duplicates. How would you recommend configuring --max-distance and --hash-size so that an image is only sorted out if it is an exact duplicate?

My image set consists of PNG images with an average size of 128x128.
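
For the strictest reading of "exact duplicate", one option independent of --max-distance and --hash-size is to compare file contents directly. The following is a minimal sketch of that swapped-in technique, not part of find-dups: the function names `file_digest` and `exact_duplicates` are hypothetical, and it only catches byte-identical files (two PNGs with the same pixels but different metadata would not match).

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    sha = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def exact_duplicates(root: Path) -> List[List[Path]]:
    """Group PNG files under root whose bytes are identical (hypothetical helper)."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*.png")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    # Only digests shared by two or more files are duplicates.
    return [paths for paths in groups.values() if len(paths) > 1]
```

Candidates reported by a perceptual hash could be cross-checked against such groups before anything is deleted.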

unable to change hash algorithm or hash size!

Whenever I run:
find-dups "out_2k1" --algorithm ahash --parallel --progress --hash-db hashes.json --slow

I get the error: ValueError: Algorithm mismatch: phash != ahash

A similar thing happens when I change --hash-size.
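
One plausible reading, which is an assumption and not confirmed in this thread, is that hashes.json was written by an earlier run using phash and cannot be reused with a different algorithm or hash size. Under that assumption, keeping one cache file per combination avoids the mismatch. The helper `hash_db_path` and its naming scheme are hypothetical.

```python
from pathlib import Path


def hash_db_path(cache_dir: Path, algorithm: str, hash_size: int) -> Path:
    """Build a cache file name that encodes algorithm and hash size.

    Hypothetical naming scheme: one JSON file per (algorithm, hash_size) pair,
    so hashes computed with different settings never share a file.
    """
    return cache_dir / f"hashes_{algorithm}_{hash_size}.json"
```

The resulting path could then be passed to --hash-db in place of a single shared hashes.json.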

Feat. Request: Exclude Folders

I would love an option to exclude some directories or patterns under the root dir, such as caching directories.

My current workaround is to temporarily move those directories somewhere else.
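
A minimal sketch of how such an exclusion could work, assuming directory names are matched against shell-style patterns. The function `walk_excluding` is hypothetical and is not the tool's actual scanning code.

```python
import os
from fnmatch import fnmatch
from pathlib import Path
from typing import Iterable, Iterator


def walk_excluding(root: Path, exclude_patterns: Iterable[str]) -> Iterator[Path]:
    """Yield files under root, skipping directories whose name matches a pattern."""
    patterns = list(exclude_patterns)
    for current, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = [
            name for name in dirnames
            if not any(fnmatch(name, pattern) for pattern in patterns)
        ]
        for filename in filenames:
            yield Path(current) / filename
```

Pruning `dirnames` in place means excluded trees are never read at all, which matters for large cache directories.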

Exclude dir does not recognize whitespace

My photo archive is structured this way: yyyy - mm - TAG, e.g. "2012 - 12 - new zealand", and I have some decade folders to sort it a bit more: yyyy - yyyy, e.g. "2010 - 2019". This gives the path "./2010 - 2019/2012 - 12 - new zealand".

find-dups .  --exclude-dir "./2010 - 2019/2012 - 12 - new zealand"
INFO: Scanning . (excluding "./2010, -, 2019, 2012, -, 12, New zealand"")

maybe we could have --exclude-dir "path" "path" "path"
and --exclude-regex "regex"

By the way: THANK YOU!
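
The proposed option shape can be sketched with argparse. This is an illustration under assumptions, not the tool's real parser: the program name `find-dups-sketch` is made up, and `--exclude-regex` is the option suggested above rather than an existing flag. With `nargs="+"`, each quoted path arrives as one list element, so whitespace inside a path is preserved.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser showing --exclude-dir with several quoted paths."""
    parser = argparse.ArgumentParser(prog="find-dups-sketch")
    parser.add_argument("root")
    # nargs="+" collects every following path as a separate list element.
    parser.add_argument("--exclude-dir", nargs="+", default=[], metavar="PATH")
    parser.add_argument("--exclude-regex", default=None, metavar="REGEX")
    return parser


# The shell passes a quoted path as a single argument, spaces included.
args = build_parser().parse_args(
    [".", "--exclude-dir", "./2010 - 2019/2012 - 12 - new zealand", "./cache"]
)
```

The important part is that the path is never split on spaces or joined with commas after parsing.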

big fileset doesn't work

When working on a large number of files (in my case around 111,524 files with long file names in deeply nested subdirectories), the files_in_dirs function fails.

I wrote a very preliminary function to get the file list and found that using temp files to store the file names works well: it is highly inefficient, but it always produces the list of files without heavy memory overhead.

I would recommend converting the existing in-memory list to a file-based structure.

I am not a Python expert, but I will try to work on the code and post it here.
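
If memory is indeed the limiting factor, which the report suggests but does not confirm, an alternative to temp files is to yield paths lazily instead of collecting them in a list. A minimal sketch; `iter_files` is a hypothetical name and not a replacement for the project's files_in_dirs.

```python
import os
from typing import Iterator


def iter_files(root: str) -> Iterator[str]:
    """Yield file paths under root one at a time, without building a full list."""
    pending = [root]  # directories still to visit; avoids deep recursion
    while pending:
        directory = pending.pop()
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    pending.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path
```

Only the stack of unvisited directories is held in memory; the caller decides whether to consume paths one by one or materialize them.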

Bug

Command: find-dups -h

Gives an error:
Traceback (most recent call last):
  File "/usr/local/bin/find-dups", line 5, in <module>
    from duplicate_images.duplicate import main
  File "/usr/local/lib/python3.9/site-packages/duplicate_images/duplicate.py", line 26, in <module>
    register_heif_opener()
  File "/usr/local/lib/python3.9/site-packages/pillow_heif/as_plugin.py", line 178, in register_heif_opener
    if _pillow_heif.get_lib_info()["HEIF"]:
  File "/usr/local/lib/python3.9/site-packages/pillow_heif/_deffered_error.py", line 11, in __getattr__
    raise self.ex
  File "/usr/local/lib/python3.9/site-packages/pillow_heif/as_plugin.py", line 25, in <module>
    import _pillow_heif
ImportError: dlopen(/usr/local/lib/python3.9/site-packages/_pillow_heif.cpython-39-darwin.so, 2): Symbol not found: __ZTTNSt3__118basic_stringstreamIcNS_11char_traitsIcEENS_9allocatorIcEEEE
  Referenced from: /usr/local/lib/python3.9/site-packages/pillow_heif/.dylibs/libde265.0.dylib (which was built for Mac OS X 12.0)
  Expected in: /usr/lib/libc++.1.dylib
 in /usr/local/lib/python3.9/site-packages/pillow_heif/.dylibs/libde265.0.dylib
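
The traceback shows the failure coming from the module-level register_heif_opener() call, so even -h cannot run when the HEIF library fails to load. One possible mitigation is to treat HEIF support as optional. This is a sketch only, not the project's code; the wrapper `try_register_heif` is hypothetical, while `register_heif_opener` is the pillow_heif function named in the traceback.

```python
def try_register_heif() -> bool:
    """Register the HEIF opener if pillow_heif loads; return False otherwise."""
    try:
        from pillow_heif import register_heif_opener

        # The deferred ImportError seen above surfaces on this call,
        # so the call must sit inside the try block as well.
        register_heif_opener()
    except ImportError:
        return False
    return True


HEIF_AVAILABLE = try_register_heif()
```

With such a guard the tool would still start without HEIF support, and could report that HEIF files are skipped.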

Implementing the library with a 'growing' image set (runtime)

What I would like to achieve is roughly the following:

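EXISTING_HASHES: set = set()

def is_duplicate(img_bytes: bytes) -> bool:
    # get_hash() is a placeholder for whatever image hash function is used
    return get_hash(img_bytes) in EXISTING_HASHES

def main():
    image_bytes = get_new_image()  # placeholder: fetch the next incoming image
    if is_duplicate(image_bytes):
        return

    # New image: remember its hash, then store it (file is the target path)
    EXISTING_HASHES.add(get_hash(image_bytes))
    with open(file, "wb") as f:
        f.write(image_bytes)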
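
Set membership only catches hashes that are exactly equal. If near-duplicates should count too, the stored hashes can be compared by Hamming distance instead. The following is a self-contained sketch under assumptions: `GrowingHashIndex` and `toy_hash` are hypothetical names, hashes are assumed to be integers, and `toy_hash` hashes raw bytes, so it is only a stand-in for a real perceptual hash.

```python
import hashlib
from typing import Callable, List


def toy_hash(img_bytes: bytes) -> int:
    """Stand-in 64-bit hash of the raw bytes; NOT a perceptual hash."""
    return int.from_bytes(hashlib.sha256(img_bytes).digest()[:8], "big")


class GrowingHashIndex:
    """Keeps integer hashes of images seen so far (hypothetical helper)."""

    def __init__(self, hash_func: Callable[[bytes], int], max_distance: int = 0) -> None:
        self.hash_func = hash_func
        self.max_distance = max_distance
        self.hashes: List[int] = []

    def add_if_new(self, img_bytes: bytes) -> bool:
        """Return True and remember the image if no stored hash is close enough."""
        new_hash = self.hash_func(img_bytes)
        for known in self.hashes:
            # Hamming distance: number of differing bits between two hashes.
            if bin(known ^ new_hash).count("1") <= self.max_distance:
                return False
        self.hashes.append(new_hash)
        return True
```

The linear scan is fine for small sets; with max_distance set to 0 it behaves like the set-based version above.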
        
