ssokolow / fastdupes
High-efficiency script for finding sets of duplicate files
Home Page: http://fastdupes.readthedocs.org/en/latest/
License: GNU General Public License v2.0
Issue #5 proposes a --noninteractive option which causes --delete to fall back to keeping all entries if no --prefer option matches.
There should also be some similar mechanism for falling back to hardlinking but I'll need to think about how to cleanly factor the API so everything's nice and orthogonal without constraining the functionality.
Most other duplicate-removal tools allow combining options like delete and noprompt to preserve the first file in each set of duplicates and delete the rest without prompting the user, so I think this is a very common use case.
Currently, combining --delete with --noninteractive causes ALL files to be kept if there is no preferred path match. Something like --keep-only-one, used along with both --delete and --noninteractive, should delete all but the first match if the preferred list is empty.
EDIT: I'm trying to work on this and will post here if I make some progress. In the meanwhile, thoughts would be appreciated. Created a pull request.
Having a clickable link to the readthedocs page from the readme and perhaps as project homepage would make things easier :)
Also, awesome project... I've been pondering writing something similar, since I've been looking for something like this for so long.
Just a note: to make installing it easier, I've added it to PyPI. Let me know which username I should transfer the ownership to (if you're interested in that): https://pypi.python.org/pypi/fastdupes
Another addition to the docs: installation is now as easy as pip install fastdupes
Showing duplicated directory trees as single entries, rather than as massive lists of file-oriented dupe groups, would be much more useful but will require a non-trivial increase in CPU or memory requirements. (So it should be made optional.)
Given that this is similar to what diff -qr does, I'll want to investigate existing tools, libraries, and literature before implementing this.
It would be great to be able to import fastdupes and use it in our own Python projects, but the documentation is lacking clear guidance.
I can imagine something as friendly as this would become quite a popular package:
import fastdupes
removed = fastdupes.remove(path="/some/path", recursive=True)
print(f"Removed {len(removed)} duplicate files!")
Given how much one needs to trust something that either informs or, via --delete, causes the deletion of files, fastdupes needs a unit test suite.
My goal is a test suite which both has 100% branch coverage as judged by coverage.py and is satisfactory when evaluated based on a manual inspection of which algorithm and implementation corner cases it tests.
Note to self: Make sure it has a test to confirm that the byte-for-byte comparison in -E only reads enough to determine that a file is unique and then stops examining it.
Once it's complete enough to be useful, tox, Travis-CI, and Coveralls.io will be used to ensure effectiveness. (And to begin supporting Python 3.x too)
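A minimal sketch of that -E early-exit test idea: a read-counting wrapper around a byte buffer plus a stand-in early-exit comparison function. The names here (CountingFile, files_identical) are hypothetical illustrations, not fastdupes' actual internals.

```python
import io

class CountingFile(io.BytesIO):
    """Wraps a byte buffer and records how many bytes were read from it."""
    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, n=-1):
        chunk = super().read(n)
        self.bytes_read += len(chunk)
        return chunk

def files_identical(a, b, chunk=4096):
    """Stand-in for an early-exit byte-for-byte comparison (the -E path)."""
    while True:
        ca, cb = a.read(chunk), b.read(chunk)
        if ca != cb:
            return False
        if not ca:  # both streams exhausted without a mismatch
            return True

def test_early_exit():
    # The files differ in the very first chunk, so the comparison
    # must stop there instead of reading the remaining megabyte.
    a = CountingFile(b'x' * 4096 + b'a' * 1_000_000)
    b = CountingFile(b'y' * 4096 + b'a' * 1_000_000)
    assert not files_identical(a, b)
    assert a.bytes_read <= 4096
```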
Tested on Ubuntu 16.04.
#!/bin/sh
git clone https://github.com/ssokolow/fastdupes
cd fastdupes
mkdir files
seq 100000 > files/file1; echo "1" >> files/file1
seq 100000 > files/file2; echo "2" >> files/file2
cmp files/file1 files/file2
python fastdupes.py files
Cloning into 'fastdupes'...
remote: Counting objects: 279, done.
remote: Total 279 (delta 0), reused 0 (delta 0), pack-reused 279
Receiving objects: 100% (279/279), 93.39 KiB | 0 bytes/s, done.
Resolving deltas: 100% (116/116), done.
Checking connectivity... done.
files/file1 files/file2 differ: byte 588896, line 100001
Found 2 files to be compared for duplication.
Found 1 sets of files with identical sizes. (2 files examined)
Found 1 sets of files with identical header hashes. (2 files examined)
Found 1 sets of files with identical hashes. (2 files examined)
/tmp/fastdupes/files/file2
/tmp/fastdupes/files/file1
If I'm going to micro-optimize cold-cache access, I'll need to have some way to flush or bypass all of the caches between fastdupes and the drive platters. /proc/sys/vm/drop_caches would cover all of the in-kernel caches.
Given that I'm quickly running out of clever tricks to minimize the amount of disk I/O I do, I'm going to need to start working on micro-optimizations.
The most obvious one is tuning the CHUNK_SIZE value used for reading data to be hashed.
According to the hard drive data sheets I examined, the average latency to
acquire a specific block (seek time, rotational latency, etc.) ranges from
roughly 14ms to 3ms.
Assuming that the average uncached, seekless throughput
for a modern disk drive ranges from 60MB/s (as Google and hdparm
seem to
agree on for 7200 RPM drives) to 73MB/s (lower bound for 15K RPM drives
according to manufacturer data sheets), then the point where read time
overtakes seek time in best-case scenarios for pseudo-parallel reads is at:
73 MB/s / 1000 ms/s * 3.0ms = 0.219MB = 219KB
219KB * (1000/1024 KB/KiB) = 213.8672KiB
As such, 216KiB (rounded up to a multiple of 4KiB) should be a good rule-of-thumb lower bound for chunk sizes. (Actual chunk size must take available RAM into account since, theoretically, a user may use -E on a system with tons of dupes of a single file.)
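The arithmetic above can be double-checked in a few lines of Python, using the figures from this issue (73 MB/s throughput, 3.0 ms access latency):

```python
import math

throughput_kb_per_ms = 73_000 / 1000  # 73 MB/s = 73,000 KB/s, over 1000 ms/s
latency_ms = 3.0                      # best-case access time from data sheets

kb = throughput_kb_per_ms * latency_ms   # data readable in one access time: 219.0 KB
kib = kb * 1000 / 1024                   # convert KB to KiB: ~213.87 KiB
chunk_kib = math.ceil(kib / 4) * 4       # round up to a multiple of 4 KiB: 216 KiB

print(kb, kib, chunk_kib)
```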
The next step I'll want to take will be doing some actual experimentation and statistical analysis on commonly duplicated files.
I'll also want to consider whether dynamically tuning the read increment size based on the number of
files being compared and possibly the available RAM (To minimize seeking) is both feasible and worthwhile.
Possibly something like this:
block_size = min(max_block_size, max_consumption / file_count)
TODO: Is there such a thing as a disk access profiler I could use with this?
Once I've implemented certain other optimizations which involve caching file metadata between passes, I'll want to make sure I haven't introduced any unnecessary memory bloat in the process.
...of course, I'll need to look up the best option for profiling memory consumption in a Python program first.
Given that the algorithm will only ever subdivide groups and the generated hashes are never used outside fastdupes, there's no harm in seek()ing past the first HEAD_SIZE bytes when doing the final full-content comparison, and it might provide a tiny bit of speed-up in some situations.
More importantly, for files HEAD_SIZE or smaller, it means that we should be able to skip the final pass completely if the internal data structures preserve the file size read by the first pass, which eliminates the overhead of at least one syscall.
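A hedged sketch of what the skip-ahead hashing might look like. The function name hash_tail and the 16KiB value are illustrative assumptions, not fastdupes' actual API:

```python
import hashlib

HEAD_SIZE = 16 * 1024  # assumed; the header was already compared in an earlier pass

def hash_tail(path, head_size=HEAD_SIZE, chunk=64 * 1024):
    """Hash only the bytes after the already-compared header.

    Safe because the hashes are only compared within one fastdupes run,
    where every file in a group already shares an identical header."""
    h = hashlib.sha1()
    with open(path, 'rb') as handle:
        handle.seek(head_size)  # skip straight past the compared header
        while True:
            block = handle.read(chunk)
            if not block:
                break
            h.update(block)
    return h.digest()
```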
When the paths to scan contain a file with a Unicode filename, fastdupes will throw an error at line 306, in sizeClassifier:
filestat = _stat(path)
WindowsError: [Error 123]
I am using Python 2.7 on Windows to run fastdupes.
The syscall and I/O overhead of checking whether inputs are files or folders may become significant with long lists (especially once issue #9 makes them more common).
I'll want to investigate how much overhead there actually is and whether there's any lighter way to do it. (e.g. Some variation on my algorithm that could replace a two-syscall "check, then open" approach with an "open and catch IOError" workflow that would make only one syscall in the more common case.)
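A sketch of that one-syscall "open and catch the error" approach (EAFP). The function name open_if_file is hypothetical, and this uses Python 3 exception classes:

```python
def open_if_file(path):
    """Single-syscall alternative to a "stat, then open" check:
    just try to open the path and let the raised error classify it.

    Returns an open binary handle for a regular file, or None for
    directories, missing paths, and unreadable files."""
    try:
        return open(path, 'rb')
    except (IsADirectoryError, FileNotFoundError, PermissionError):
        return None
```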
Phases 1 and 2 of fastdupes's algorithm (group by size, group by headers) are essentially an intentional seek storm. The algorithm reads a block or two for each file, then moves on to the next file. As such, the effects of throughput are dwarfed by the effects of access time. (That is, seek time plus time spent waiting for the desired data to rotate into place under the head.)
This is what Tagged Command Queueing (SCSI) and Native Command Queueing (SATA) are intended to optimize in multitasking workloads by allowing the drive's firmware to reorder a stream of requests to minimize the amount of physical travel needed to service them.
Unfortunately, fastdupes is currently written as a single-threaded, single-process program which blocks on every request before sending the next one, which prevents TCQ/NCQ reordering from being applied.
(It also can't take advantage of the parallelism available when it's actually been asked to compare files on multiple physical devices)
The simplest, most portable way to solve this would probably be to initialize a bidirectional, threadsafe queue, spin off a bunch of worker threads, and use a scatter-gather approach so fastdupes can be blocking on several reads simultaneously and the kernel's TCQ/NCQ support will have something to work with.
For now, this bug will track:
Depending on the filesystem block size and how much fragmentation there is, this may also depend significantly on the results of issue #2.
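As a rough illustration of the scatter-gather idea, a thread pool can keep several header reads in flight at once so the kernel's TCQ/NCQ support has something to reorder. The names and the 16KiB size here are assumptions, not fastdupes code:

```python
from concurrent.futures import ThreadPoolExecutor

def read_head(path, size=16 * 1024):
    """Read just the header bytes used by the phase-2 grouping pass."""
    with open(path, 'rb') as handle:
        return path, handle.read(size)

def read_heads_parallel(paths, workers=8):
    """Issue many blocking reads concurrently; each thread blocks on its
    own request, so the drive firmware sees a queue it can reorder."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(read_head, paths))
```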
One of the big flaws that I dislike about fdupes is the irrevocable nature of its deletion. It would be nice to send files to the trash instead. Fortunately, there is already a nice library for this purpose: send2trash.
By all accounts, it would take just a few short lines to make this happen. I have already done so with my own copy of fastdupes, but I figured that you would want to do it your way, if you agree to its inclusion.
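For reference, a hedged sketch of how the integration might look. remove_path is a hypothetical helper; send2trash is the third-party library's real entry point, with permanent deletion as a fallback when it isn't installed or isn't wanted:

```python
import os

try:
    from send2trash import send2trash  # third-party: pip install Send2Trash
except ImportError:
    send2trash = None

def remove_path(path, use_trash=True):
    """Prefer the recoverable trash; fall back to permanent deletion."""
    if use_trash and send2trash is not None:
        send2trash(path)
    else:
        os.remove(path)
```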
In order to streamline the "find dupes, do something manual, re-determine dupes" workflow that often emerges in one-off situations, there should be some way to cache data between runs.
(Something like a --hash-cache=/path/to/file option.)
I'm thinking probably a key-value store which maps inode values to (size, ctime, sha1) triples so that fastdupes can take an rsync-inspired "assume that inode+size+ctime = sha1" shortcut.
(And probably at least starting out SQLite-backed, given my bad experiences with shelve corruption, the need for incremental update support, and my aversion to premature optimization.)
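A minimal sketch of such an SQLite-backed cache, assuming the inode -> (size, ctime, sha1) schema described above (all names here are illustrative, not existing fastdupes code):

```python
import hashlib
import os
import sqlite3

def open_cache(db_path):
    """Open (or create) the hash cache database."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS hashes ("
               "inode INTEGER PRIMARY KEY, size INTEGER, "
               "ctime REAL, sha1 BLOB)")
    return db

def cached_sha1(db, path):
    """rsync-inspired shortcut: if inode+size+ctime all match the cached
    row, trust the cached hash; otherwise rehash and update the cache."""
    st = os.stat(path)
    row = db.execute("SELECT size, ctime, sha1 FROM hashes WHERE inode = ?",
                     (st.st_ino,)).fetchone()
    if row and row[0] == st.st_size and row[1] == st.st_ctime:
        return row[2]
    with open(path, 'rb') as handle:
        digest = hashlib.sha1(handle.read()).digest()
    db.execute("INSERT OR REPLACE INTO hashes VALUES (?, ?, ?, ?)",
               (st.st_ino, st.st_size, st.st_ctime, digest))
    return digest
```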
Reducing HEAD_SIZE from 64KiB to 16KiB produced excellent performance gains at effectively no cost.
However, there's still potentially some performance gain to be had (at the very least, in warm-cache situations) if reducing it to 8KiB or even 4KiB is feasible without allowing too many false positives to leak over into the final "read and hash everything" phase.
I'll need to either dig up some research or do my own statistical analysis on where "the first divergent byte" tends to appear in sets of identically-sized but non-identical files.
Sphinx can generate man pages and has markup roles for the stuff optparse generates for --help.
So far, I haven't been able to find a --help-to-Sphinx plugin so, when time permits, I need to take the --help-to-manpage script I found a while ago and adapt it into a Sphinx plugin.
In proper "UNIX® Philosophy" fashion, fastdupes should be more open to being part of a shell pipeline.
As such, it should be possible for it to receive a list of \n- or NULL-separated filenames via sys.stdin.
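A sketch of how that input might be parsed, written against an arbitrary binary stream so it's testable. paths_from_stream is a hypothetical name; the real code would pass sys.stdin.buffer (Python 3) or sys.stdin (Python 2):

```python
def paths_from_stream(stream, null_separated=False):
    """Split a pipe's contents into filenames.

    NUL separation (like find -print0 / xargs -0) is the only form that
    is unambiguous, since \n is legal inside UNIX filenames."""
    sep = b'\0' if null_separated else b'\n'
    return [p.decode('utf-8') for p in stream.read().split(sep) if p]

# usage sketch:
#   paths = paths_from_stream(sys.stdin.buffer, null_separated=True)
```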
Unlike the other steps, I've done practically nothing to optimize the initial recursive tree-traversal phase.
I'll want to do some cost-benefit research on the following, as well as identifying other potential improvements:
- fnmatch.fnmatch uses regexes internally and doesn't cache them. Given how many times it gets called, I should try using re.compile with fnmatch.translate instead.
- Combining the regexes fnmatch.translate outputs so the ignore check can be handled in a single pass.
- Sharing cached stat() results between sizeClassifier and things like inode-based hardlink detection.

After screwing up a "move to DVD+R" operation, I realized that it would be very helpful to have a way to partially (or fully) automate the process of answering --delete prompts.
The current proposal is:
- --prefer=/path/to/something would add the given path to a list of preferred paths.
- If nothing in a group of duplicates matches that list, the user is prompted as usual, unless --noninteractive is also passed, in which case the system will act as if the prompt had been displayed and "all" typed. (Ensuring that the principle of least surprise is followed.)
It'd probably also be a good idea to add a "prefer " command to the dynamic prompts so that the settings can be altered mid-run without having to restart the whole process. (Though that will require issue #6 to be resolved first.)
I'll need to come up with some proposals, then implement a groupBy classifier that operates on inodes, ideally sharing cached stat() results with the size classifier.
(Though its output won't behave quite the same way as the others since, instead of determining what should be compared to what in the final full-content examination, it'll be used to determine which paths don't need to be compared at all.)
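A rough sketch of what such a classifier could look like. Keying on (st_dev, st_ino) rather than the inode alone avoids merging unrelated files whose inode numbers collide across devices; this is an assumption, not existing fastdupes code:

```python
import os
from collections import defaultdict

def inodeClassifier(paths):
    """Group paths by (device, inode) so hardlinked copies are detected
    without reading any file content.

    Groups with more than one member are already the same file on disk
    and never need a content comparison."""
    groups = defaultdict(set)
    for path in paths:
        st = os.stat(path)
        groups[(st.st_dev, st.st_ino)].add(path)
    return groups
```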
I think the title says it all.
Once issue #19 (unit test suite) is done, I'll want to track down as many Google-visible places as possible where people recommend duplicate-finding tools and add a mention for fastdupes.
Example: http://ubuntu.wordpress.com/2005/10/08/find-duplicate-copies-of-files/
As far as I can see, it would only take changing a few print statements and a str/basestring change. The patch, as far as I can gather:
@@ -106,7 +106,7 @@
in
"""
fhash, read = hashlib.sha1(), 0
- if isinstance(handle, basestring):
+ if isinstance(handle, str):
handle = file(handle, 'rb')
if limit:
@@ -250,7 +250,7 @@
pos + 1, group_count, fun_desc, count, len(paths)
))
- for key, group in classifier(paths, *args, **kwargs).items():
+ for key, group in list(classifier(paths, *args, **kwargs).items()):
groups.setdefault(key, set()).update(group)
count += len(group)
@@ -439,11 +439,11 @@
:rtype: :class:`~__builtins__.int`
"""
dupeList = sorted(dupeList)
- print
+ print()
for pos, val in enumerate(dupeList):
- print "%d) %s" % (pos + 1, val)
+ print("%d) %s" % (pos + 1, val))
while True:
- choice = raw_input("[%s/%s] Keepers: " % (mainPos, mainLen)).strip()
+ choice = input("[%s/%s] Keepers: " % (mainPos, mainLen)).strip()
if not choice:
print ("Please enter a space/comma-separated list of numbers or "
"'all'.")
@@ -495,7 +495,7 @@
value = DEFAULTS[key]
if isinstance(value, (list, set)):
value = ', '.join(value)
- print "%*s: %s" % (maxlen, key, value)
+ print("%*s: %s" % (maxlen, key, value))
def delete_dupes(groups, prefer_list=None, interactive=True, dry_run=False):
"""Code to handle the :option:`--delete` command-line option.
@@ -530,7 +530,7 @@
assert preferred # Safety check
for path in pruneList:
- print "Removing %s" % path
+ print("Removing %s" % path)
if not dry_run:
os.remove(path)
@@ -598,8 +598,8 @@
delete_dupes(groups, opts.prefer, not opts.noninteractive,
opts.dry_run)
else:
- for dupeSet in groups.values():
- print '\n'.join(dupeSet) + '\n'
+ for dupeSet in list(groups.values()):
+ print('\n'.join(dupeSet) + '\n')
if __name__ == '__main__':
main()
Once issue #6 has been resolved, it should be used to add a link command to the --delete prompt which is similar to the existing all command but, instead, hardlinks all of the files together.
I'll need to decide on how the UI should handle the case where full deduplication was not achieved because the matches were spread across more than one partition.
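A hedged sketch of the hardlink-all operation, including the cross-device case mentioned above (hardlink_group is a hypothetical name; hardlinks cannot span filesystems, so st_dev must match):

```python
import os

def hardlink_group(paths):
    """Replace every duplicate with a hardlink to the first path.

    Returns the paths that could NOT be linked because they live on a
    different filesystem, so the UI can report partial deduplication."""
    keeper, rest = paths[0], paths[1:]
    keeper_dev = os.stat(keeper).st_dev
    skipped = []
    for path in rest:
        if os.stat(path).st_dev != keeper_dev:
            skipped.append(path)  # hardlinks can't cross partitions
            continue
        os.remove(path)
        os.link(keeper, path)
    return skipped
```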
Both to allow reliable comparison of the output (either between regular and --exact mode or between two different code revisions) and for user comfort, there should at least be an option to sort the output.
I'll need to do some experimenting with python -m timeit to determine what kind of sorting to use and what to make optional.
For example:
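One option worth timing: sort the paths within each group, then sort the groups themselves, so output is fully deterministic. sorted_groups is a hypothetical helper, not fastdupes code:

```python
def sorted_groups(groups):
    """Deterministic ordering for dupe output: sort paths within each
    group, then order the groups by their (sorted) contents."""
    return sorted(sorted(group) for group in groups.values())
```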
First and most importantly, I'll want to investigate rmlint since a brief examination of the README suggests that it's doing the same kinds of optimizations I'm doing but has had a lot more effort put into it.
Depending on how much faster rmlint is, I'll probably want to focus more on the extensibility and hackability I gain from using Python as opposed to C.
Second, I'll also want to investigate schweikh3.c and samefile:
<mauke> feature request: if you could make it output compatible with
http://www.ioccc.org/1998/schweikh3.hint , that would be sweet
(http://www.ioccc.org/1998/schweikh3.c)
<mauke> I don't like the way fdupes works. samefile's interface is superior
<mauke> it says if you specify a directory twice, it will list files as
their own duplicates.
<mauke> wtf was the author thinking?
<deitarion> mauke: Lazy, I guess. I believe I fixed that in fastdupes.
Once issues #5 and #7 are resolved, I'll want to add a command-line switch with a name like --autolink or --link-if which is used similarly to --prefer.
The purpose would be to allow behaviours like "If you see a group of duplicates and one is named Folder.jpg or AlbumArtSmall.jpg, then hardlink them all together without asking."
I'll still need to finish the design work on it, though.
Feature request: allow entering "none" in response to the Keepers: prompt, forcing deletion of all duplicates.
The more automation there is for deleting files, the more likely it is that empty folders will get left lying around.
I'll want to also have some kind of support for automating those away too.
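A bottom-up sweep is probably the simplest way to automate that. A hedged sketch (prune_empty_dirs is a hypothetical helper; walking bottom-up lets parents that become empty get removed in the same pass):

```python
import os

def prune_empty_dirs(root):
    """Remove empty directories under root, bottom-up, so that parents
    emptied by removing their children are pruned too. Returns the list
    of removed directories; root itself is never removed."""
    removed = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        # Re-check with listdir: the walk's cached dirnames may include
        # children we've already removed on this pass.
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed
```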
Various ideas for the --delete prompt require it to behave more like a command line which fills in a default command if given only numeric arguments.
As such, it will need to be re-architected.