jhnc / findimagedupes
Finds visually similar or duplicate images
License: GNU General Public License v3.0
Given:
db1 = { img1 => fp1a, img2 => fp2 }
db2 = { img1 => fp1b, img2 => fp2 }
it looks like:
findimagedupes -f db1 -f db2 -M db3
→ db3 = { img1 => fp1c, img2 => fp2 }
and
findimagedupes -f db1 -f db2 -M db3 -- img3
→ db3 = { img2 => fp2, img3 => fp3 }
However, the latter should give: db3 = { img1 => fp1c, img2 => fp2, img3 => fp3 }
The @regen array is ignored when an explicit file list is provided on the command line.
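Until this is fixed, a two-step workaround may help (a sketch, untested; it relies on --merge honouring @regen when no file list is given, as in the first example above):

    # step 1: merge db1 and db2 into db3, regenerating stale entries
    findimagedupes -f db1 -f db2 -M db3 --no-compare
    # step 2: add img3 to the merged database in a separate run
    findimagedupes -f db3 --no-compare -- img3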
I have tried multiple installation schemes, but I cannot get it installed on macOS. How can I install it correctly?
Hey,
is it possible to easily add HEIC support for comparison? If I have time I can also look into it, but right now I have to study for my exams.
Does ImageMagick even support HEIC?
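Whether HEIC works mostly depends on how the local decoder was built; a quick check (assumes the gm and magick binaries are installed):

    # list the formats the local GraphicsMagick build knows about
    gm convert -list format | grep -i heic
    # same check for ImageMagick 7
    magick identify -list format | grep -i heic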
I see the output puts similar files on one line.
Is it possible to also include the % similarity for each of them?
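The reported output doesn't include it, but a percentage can be recomputed from two fingerprints shown by --verbosity=fp; a sketch, assuming the stored fingerprints are 256 bits and similarity is the fraction of matching bits (FP1/FP2 are hypothetical variables holding the two base64 strings):

    perl -MMIME::Base64 -e '
        my ($x, $y) = map { decode_base64($_) } @ARGV;
        my $diff = unpack("%32b*", $x ^ $y);   # count of differing bits
        printf("%.1f%% similar\n", 100 * (1 - $diff / 256));
    ' "$FP1" "$FP2"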
With some types of WMF files, findimagedupes gets a SIGABRT via GraphicsMagick at this line:
Line 491 in a787e23
For example, if you run findimagedupes on the directory containing all the files downloaded from here, you get a core dump:
https://telparia.com/fileFormatSamples/image/wmf/
and this is caused specifically by this file:
https://telparia.com/fileFormatSamples/image/wmf/MINN.XK4
with the description of:
MINN.XK4: Windows metafile, size 23398 words, 5 objects, largest record size 0x12
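Until the crash is handled, a pre-filter keeps the offending samples out of the run; a sketch (the excluded extensions are assumptions based on the samples above; note that xargs may split a large list across several findimagedupes invocations, limiting comparisons to within each batch):

    find . -type f ! -iname '*.wmf' ! -iname '*.xk4' -print0 \
      | xargs -0 findimagedupes --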
Is it safe to run multiple invocations of findimagedupes with each accessing a single fingerprint DB file?
The context is an image store of just over 1TB of images, using parallel to generate the hashes across all CPU cores first. For example, something like this:
find /path/to/files/{InstantUpload,Media/Photos} -maxdepth 3 -type d | \
  nice -n 15 \
  parallel -X --max-args 1 --jobs 8 -l 12 \
    -u --tmpdir /path/to/file/tmp \
    findimagedupes -R -f '/path/to/files/.findimagedupes.db' --no-compare '{}'
Is this safe, or should each job slot be using a separate DB file, then merge all the files at the end?
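One conservative pattern is a separate DB per job slot, merged once at the end; a sketch (untested; {%} is GNU parallel's job-slot number, and the -f list must name every slot DB that was actually created):

    find /path/to/files -maxdepth 3 -type d \
      | parallel --jobs 8 \
          findimagedupes -R --no-compare -f /tmp/fp.{%}.db '{}'
    findimagedupes --no-compare -M /path/to/files/.findimagedupes.db \
      -f /tmp/fp.1.db -f /tmp/fp.2.db -f /tmp/fp.3.db -f /tmp/fp.4.db \
      -f /tmp/fp.5.db -f /tmp/fp.6.db -f /tmp/fp.7.db -f /tmp/fp.8.db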
I would like to know how I can extract the results into a CSV or text file, with each group of similar files separated by a new line.
I know that a1 and a2 are exact matches, as are b1 and b2. How can I recognize this, or group files by similarity, using findimagedupes?
And how can fp_data be opened? Is it an SQL database?
findimagedupes -v=md5 -R -q -f=fp_data -t 70% .
e9550f2c38e5584022b6cac469777c55 /a2.jpg
aff489c1d36c9625f4e48b4e6223548f /b1.jpg
19b6b4df7cc0ad09397089b8bcfd2714 /a1.jpg
aff489c1d36c9625f4e48b4e6223548f /b2.jpg
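Two shell-level tricks may help with the export and the grouping; sketches, assuming filenames contain no whitespace:

    # one filename per line, with a blank line between match groups
    findimagedupes -R . \
      | awk '{ for (i = 1; i <= NF; i++) print $i; print "" }' > groups.txt

    # exact duplicates (identical checksums, like b1/b2 above) line up
    # next to each other when the md5 output is sorted
    findimagedupes -v=md5 -R -q -f=fp_data -t 70% . | sort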
Great tool!
When running the script, a worker for each CPU core seems to be spawned, but all the work then happens in just one of the workers.
It seems that creating the fingerprints takes most of the time, at least for small collections (20k images).
Fingerprint creation could probably be parallelized very well. Or would merging the individual thread/process results be a hassle?
Even on a 6+ year old system, the CPU + SSD load was around 20%, so on current systems a speedup of up to 10x could probably be achieved.
I'm now thinking about hacking this together by launching parallel runs with separate fingerprint databases and then merging them. I'm afraid stuff is going to break, given my skills...
Do you have plans to implement parallelism?
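The DIY route described above can be sketched without touching the script itself (untested; breaks on filenames containing whitespace):

    # fingerprint 4 chunks of the file list in parallel, one DB each
    find ~/pics -type f > all.lst
    split -n l/4 all.lst chunk.          # GNU split: chunk.aa .. chunk.ad
    for c in chunk.*; do
      findimagedupes --no-compare -f "db.$c" -- $(cat "$c") &
    done
    wait
    # merge the part-databases, then run the comparison once
    findimagedupes -M merged.db --no-compare \
      -f db.chunk.aa -f db.chunk.ab -f db.chunk.ac -f db.chunk.ad
    findimagedupes -f merged.db -R ~/pics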
Request: findimagedupes could also be used to compare the similarity of videos.
By reading this: https://unix.stackexchange.com/questions/503060/evaluate-the-similarity-between-two-video-files
I can imagine that using ffmpeg you could extract the first/last frame of two video files and then get a similarity score between those two frames.
I think this is an incredible idea! No such tool exists anywhere on the web!
What do you say?
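Something along those lines can already be prototyped with the existing tool; a sketch (assumes ffmpeg is installed and uses only the first frame of each video):

    mkdir -p frames
    for v in *.mp4; do
      # grab the first decoded frame of each video
      ffmpeg -loglevel error -i "$v" -frames:v 1 "frames/$v.png"
    done
    findimagedupes -R frames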
Run findimagedupes --help.
The last line documents version instead of --version:
-v, --verbosity=LIST
    Enable display of informational messages to stdout, where LIST
    is a comma-delimited list of:

    md5     Display the checksum for each file, as per md5sum(1).

    fingerprint | fp
            Display the base64-encoded fingerprint of each file.

    Alternatively, --verbosity may be given multiple times, and
    accumulates. Note that this may not be sensible. For example, to
    be useful, md5 output probably should not be merged with
    fingerprint data.

    version Display the program version, then exit.
Moreover, the exit code of findimagedupes --help is 1; is this expected?
Finally, findimagedupes --version doesn't print the version, just the name.
OS: macOS, v2.20.1
After finding duplicate images, I would like to use an image viewer (such as phototonic) to manage them, for example:
findimagedupes -R ~/picture | xargs phototonic
If the filenames have whitespace, I can use xargs -d '\n' to separate the filenames. However, findimagedupes prints the duplicate images on the same line:
1.jpg 1-1.jpg
2.jpg 2-1.jpg
which makes it difficult to separate the filenames when they contain whitespace.
If each filename were printed on its own line:
1.jpg
1-1.jpg
2.jpg
2-1.jpg
I could easily separate the filenames and pass them down the pipeline:
findimagedupes -R ~/picture | xargs -d '\n' phototonic
PS: I tried the -p option (findimagedupes -R ~/picture -p `which phototonic`), but it does not work.
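A stopgap that works as long as the filenames themselves contain no whitespace is to open one viewer instance per match group; a sketch (relies on deliberate word-splitting):

    findimagedupes -R ~/picture | while read -r group; do
      # $group is one space-separated match group; leave it unquoted
      # so the shell splits it into individual filenames
      phototonic $group
    done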
Fingerprint generation should take account of EXIF orientation metadata.
GraphicsMagick supports auto-orient, but not via its Perl interface. I have requested an enhancement: https://sourceforge.net/p/graphicsmagick/feature-requests/57/
If this is not implemented, a user-contributed patch is available that could be used.
Alternatively, ImageMagick version 7 provides auto-orient. Given #2, and that ImageMagick development seems more active, it may be cleaner to switch back to using ImageMagick and drop GraphicsMagick once distributions start offering IM7 (currently most offer IM6).
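In the meantime, the orientation can be baked into the pixel data before fingerprinting; a sketch using ImageMagick's mogrify (destructive - run it on copies):

    # rewrite files so the pixels match the EXIF orientation tag
    mogrify -auto-orient ./*.jpg
    findimagedupes ./*.jpg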
For some tasks it is necessary to search for similar images that also have the same dimensions. Is such an optional modification possible? That is, comparing not the phash strings, but the strings (Width)x(Height):phash?
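Pending such an option, the same effect can be approximated by bucketing files by dimensions first and comparing within each bucket; a sketch (assumes GraphicsMagick; paths are illustrative):

    for f in ./*.jpg; do
      dim=$(gm identify -format '%wx%h' "$f")   # e.g. 1920x1080
      mkdir -p "by-size/$dim"
      ln -s "$(realpath "$f")" "by-size/$dim/"
    done
    for d in by-size/*/; do
      findimagedupes -R "$d"
    done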
I am going through a huge collection of images (currently around 1M files, totaling 1TB) that is not static. Images are removed and added; in particular, images that are similar to images already in the collection get added. That means it's not enough to run this program once on the collection.
As calculating fingerprints for that many images takes a very long time, it seems wise to cache fingerprints; and as the answer to #4 was negative, there are clear advantages to maintaining many small databases, each covering its own small part of the collection.
But when I then want to search for similar images across those parts, I want to use a command like findimagedupes -f db-part1 -f db-part2, but that results in "Error: Require --merge if using multiple fingerprint databases". The man page says --merge takes a filename as an argument, and while I see the need to merge the databases, I see no need to write the result to a file - that'll just be a file taking up disk space and getting out of date.
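If writing a merged file is unavoidable, it can at least be made throwaway; a sketch (the flag interaction is untested):

    # merge into a temporary DB that exists only for this query
    tmp=$(mktemp)
    findimagedupes -f db-part1 -f db-part2 -M "$tmp" -R /images
    rm -f "$tmp"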
Does the software sort the output (the list of duplicate file paths, separated by spaces) by size within each line?