jhnc / findimagedupes
Finds visually similar or duplicate images
License: GNU General Public License v3.0
Given:
db1 = { img1 => fp1a, img2 => fp2 }
db2 = { img1 => fp1b, img2 => fp2 }
it looks like:
findimagedupes -f db1 -f db2 -M db3
→ db3 = { img1 => fp1c, img2 => fp2 }
and
findimagedupes -f db1 -f db2 -M db3 -- img3
→ db3 = { img2 => fp2, img3 => fp3 }
However, the latter should give: db3 = { img1 => fp1c, img2 => fp2, img3 => fp3 }
The @regen array is ignored when an explicit file list is provided on the command line.
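Until this is fixed, a two-step workaround may help (a sketch, untested; it relies on --merge honouring @regen when no file list is given, as in the first example above):

    # step 1: merge db1 and db2 into db3, regenerating stale entries
    findimagedupes -f db1 -f db2 -M db3 --no-compare
    # step 2: add img3 to the merged database in a separate run
    findimagedupes -f db3 --no-compare -- img3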
I have tried multiple installation schemes, but I cannot get it installed on macOS. How can I install it correctly?
Hey,
is it possible to easily add HEIC support for comparison? If I have time I can also look into it, but right now I have to study for my exams.
Does ImageMagick even support HEIC?
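Whether HEIC works mostly depends on how the local decoder was built; a quick check (assumes the gm and magick binaries are installed):

    # list the formats the local GraphicsMagick build knows about
    gm convert -list format | grep -i heic
    # same check for ImageMagick 7
    magick identify -list format | grep -i heic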
I see the output puts similar files on one line.
Is it possible to also include the % similarity for each of them?
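The reported output doesn't include it, but a percentage can be recomputed from two fingerprints shown by --verbosity=fp; a sketch, assuming the stored fingerprints are 256 bits and similarity is the fraction of matching bits (FP1/FP2 are hypothetical variables holding the two base64 strings):

    perl -MMIME::Base64 -e '
        my ($x, $y) = map { decode_base64($_) } @ARGV;
        my $diff = unpack("%32b*", $x ^ $y);   # count of differing bits
        printf("%.1f%% similar\n", 100 * (1 - $diff / 256));
    ' "$FP1" "$FP2"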
With some types of WMF files, findimagedupes gets a SIGABRT via GraphicsMagick at this line:
Line 491 in a787e23
For example, if you run findimagedupes on the directory containing all the files downloaded from here, you get a core dump:
https://telparia.com/fileFormatSamples/image/wmf/
and this is caused specifically by this file:
https://telparia.com/fileFormatSamples/image/wmf/MINN.XK4
with the description of:
MINN.XK4: Windows metafile, size 23398 words, 5 objects, largest record size 0x12
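Until the crash is handled, a pre-filter keeps the offending samples out of the run; a sketch (the excluded extensions are assumptions based on the samples above; note that xargs may split a large list across several findimagedupes invocations, limiting comparisons to within each batch):

    find . -type f ! -iname '*.wmf' ! -iname '*.xk4' -print0 \
      | xargs -0 findimagedupes --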
Is it safe to run multiple invocations of findimagedupes with each accessing a single fingerprint DB file?
The context is an image store of just over 1TB of images, using parallel to generate the hashes across all CPU cores first. For example, something like this:
find /path/to/files/{InstantUpload,Media/Photos} -maxdepth 3 -type d | \
  nice -n 15 \
  parallel -X --max-args 1 --jobs 8 -l 12 \
    -u --tmpdir /path/to/file/tmp \
    findimagedupes -R -f '/path/to/files/.findimagedupes.db' --no-compare '{}'
Is this safe, or should each job slot be using a separate DB file, then merge all the files at the end?
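One conservative pattern is a separate DB per job slot, merged once at the end; a sketch (untested; {%} is GNU parallel's job-slot number, and the -f list must name every slot DB that was actually created):

    find /path/to/files -maxdepth 3 -type d \
      | parallel --jobs 8 \
          findimagedupes -R --no-compare -f /tmp/fp.{%}.db '{}'
    findimagedupes --no-compare -M /path/to/files/.findimagedupes.db \
      -f /tmp/fp.1.db -f /tmp/fp.2.db -f /tmp/fp.3.db -f /tmp/fp.4.db \
      -f /tmp/fp.5.db -f /tmp/fp.6.db -f /tmp/fp.7.db -f /tmp/fp.8.db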
I would like to know how I can extract the results into a CSV or text file, with each group of similar files separated by a new line.
I know that a1 and a2 are exact matches, as are b1 and b2. How can I recognize this, or group files by similarity, using findimagedupes?
And how can fp_data be opened? Is it an SQL database?
findimagedupes -v=md5 -R -q -f=fp_data -t 70% .
e9550f2c38e5584022b6cac469777c55 /a2.jpg
aff489c1d36c9625f4e48b4e6223548f /b1.jpg
19b6b4df7cc0ad09397089b8bcfd2714 /a1.jpg
aff489c1d36c9625f4e48b4e6223548f /b2.jpg
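Two shell-level tricks may help with the export and the grouping; sketches, assuming filenames contain no whitespace:

    # one filename per line, with a blank line between match groups
    findimagedupes -R . \
      | awk '{ for (i = 1; i <= NF; i++) print $i; print "" }' > groups.txt

    # exact duplicates (identical checksums, like b1/b2 above) line up
    # next to each other when the md5 output is sorted
    findimagedupes -v=md5 -R -q -f=fp_data -t 70% . | sort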
Great tool!
When running the script, a worker for each CPU core seems to be spawned, but all the work then happens in just one of the workers.
It seems that creating the fingerprints takes most of the time, at least for small collections (20k images).
Fingerprint creation could probably be parallelized very well. Or would merging the individual thread/process results be a hassle?
Even on a 6+ year old system, the CPU + SSD load was around 20%, so on current systems a speedup of up to 10x could probably be achieved.
I'm now thinking about hacking this together by launching parallel runs with separate fingerprint databases and then merging them. I'm afraid stuff is going to break, given my skills...
Do you have plans to implement parallelism?
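The DIY route described above can be sketched without touching the script itself (untested; breaks on filenames containing whitespace):

    # fingerprint 4 chunks of the file list in parallel, one DB each
    find ~/pics -type f > all.lst
    split -n l/4 all.lst chunk.          # GNU split: chunk.aa .. chunk.ad
    for c in chunk.*; do
      findimagedupes --no-compare -f "db.$c" -- $(cat "$c") &
    done
    wait
    # merge the part-databases, then run the comparison once
    findimagedupes -M merged.db --no-compare \
      -f db.chunk.aa -f db.chunk.ab -f db.chunk.ac -f db.chunk.ad
    findimagedupes -f merged.db -R ~/pics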
Request: findimagedupes could also be used to compare the similarity of videos.
By reading this: https://unix.stackexchange.com/questions/503060/evaluate-the-similarity-between-two-video-files
I can imagine that using ffmpeg you could extract the first/last frame of two video files and then get a similarity score between those two frames.
I think this is an incredible idea! No such tool exists anywhere on the web!
What do you say?
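Something along those lines can already be prototyped with the existing tool; a sketch (assumes ffmpeg is installed and uses only the first frame of each video):

    mkdir -p frames
    for v in *.mp4; do
      # grab the first decoded frame of each video
      ffmpeg -loglevel error -i "$v" -frames:v 1 "frames/$v.png"
    done
    findimagedupes -R frames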
Run findimagedupes --help.
The last line documents version instead of --version:
-v, --verbosity=LIST
    Enable display of informational messages to stdout, where LIST
    is a comma-delimited list of:

    md5     Display the checksum for each file, as per md5sum(1).

    fingerprint | fp
            Display the base64-encoded fingerprint of each file.

    Alternatively, --verbosity may be given multiple times, and
    accumulates. Note that this may not be sensible. For example, to
    be useful, md5 output probably should not be merged with
    fingerprint data.

    version Display the program version, then exit.
Moreover, the exit code of findimagedupes --help is 1; is this expected?
Finally, findimagedupes --version doesn't print the version, just the name.
OS: macOS, v2.20.1
After finding duplicate images, I would like to use an image viewer (such as phototonic) to manage them, for example:
findimagedupes -R ~/picture | xargs phototonic
If the filenames have whitespace, I can use xargs -d '\n' to separate the filenames. However, findimagedupes prints the duplicate images on the same line:
1.jpg 1-1.jpg
2.jpg 2-1.jpg
which makes it difficult to separate the filenames when they contain whitespace.
If each filename were printed on its own line:
1.jpg
1-1.jpg
2.jpg
2-1.jpg
I could easily separate the filenames and pass them down the pipeline:
findimagedupes -R ~/picture | xargs -d '\n' phototonic
PS: I tried the -p option (findimagedupes -R ~/picture -p `which phototonic`), but it does not work.
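A stopgap that works as long as the filenames themselves contain no whitespace is to open one viewer instance per match group; a sketch (relies on deliberate word-splitting):

    findimagedupes -R ~/picture | while read -r group; do
      # $group is one space-separated match group; leave it unquoted
      # so the shell splits it into individual filenames
      phototonic $group
    done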
Fingerprint generation should take account of EXIF orientation metadata.
GraphicsMagick supports auto-orient, but not via its Perl interface. I have requested an enhancement: https://sourceforge.net/p/graphicsmagick/feature-requests/57/
If this is not implemented, a user-contributed patch is available that could be used.
Alternatively, ImageMagick version 7 provides auto-orient. Given #2, and that ImageMagick development seems more active, it may be cleaner to switch back to using ImageMagick and drop GraphicsMagick once distributions start offering IM7 (currently most offer IM6).
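In the meantime, the orientation can be baked into the pixel data before fingerprinting; a sketch using ImageMagick's mogrify (destructive - run it on copies):

    # rewrite files so the pixels match the EXIF orientation tag
    mogrify -auto-orient ./*.jpg
    findimagedupes ./*.jpg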
For some tasks it is necessary to search for similar images that also have the same dimensions. Is such an optional modification possible? That is, comparing not the phash strings, but the strings (Width)x(Height):phash?
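Pending such an option, the same effect can be approximated by bucketing files by dimensions first and comparing within each bucket; a sketch (assumes GraphicsMagick; paths are illustrative):

    for f in ./*.jpg; do
      dim=$(gm identify -format '%wx%h' "$f")   # e.g. 1920x1080
      mkdir -p "by-size/$dim"
      ln -s "$(realpath "$f")" "by-size/$dim/"
    done
    for d in by-size/*/; do
      findimagedupes -R "$d"
    done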
I am going through a huge collection of images (currently around 1M files, totaling 1TB) that is not static. Images are removed and added; in particular, images that are similar to images already in the collection get added. That means it's not enough to run this program once on the collection.
As calculating fingerprints for that many images takes a very long time, it seems wise to cache fingerprints; and as the answer to #4 was negative, there are clear advantages to maintaining many small databases, each covering its own small part of the collection.
But when I then want to search for similar images across those parts, I want to use a command like findimagedupes -f db-part1 -f db-part2, but that results in "Error: Require --merge if using multiple fingerprint databases". The man page says --merge takes a filename as an argument, and while I see the need to merge the databases, I see no need to write the result to a file - that'll just be a file taking up disk space and getting out of date.
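If writing a merged file is unavoidable, it can at least be made throwaway; a sketch (the flag interaction is untested):

    # merge into a temporary DB that exists only for this query
    tmp=$(mktemp)
    findimagedupes -f db-part1 -f db-part2 -M "$tmp" -R /images
    rm -f "$tmp"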
Does the software sort the output (the list of duplicate file paths, separated by spaces) by size within each line?