ssokolow / fastdupes
High-efficiency script for finding sets of duplicate files
Home Page: http://fastdupes.readthedocs.org/en/latest/
License: GNU General Public License v2.0
Issue #5 proposes a --noninteractive option which causes --delete to fall back to keeping all entries if no --prefer option matches.
There should also be some similar mechanism for falling back to hardlinking but I'll need to think about how to cleanly factor the API so everything's nice and orthogonal without constraining the functionality.
Most other duplicate-removal tools allow combining options like delete and noprompt to preserve the first file in each set of duplicates and delete the rest without prompting the user, so I think this is a very common use case.
Currently, combining --delete with --noninteractive causes ALL files to be kept if there is no preferred path match. Something like --keep-only-one, used along with both --delete and --noninteractive, should delete all but the first match if the preferred list is empty.
EDIT: I'm trying to work on this and will post here if I make some progress. In the meanwhile, thoughts would be appreciated. Created a pull request.
Having a clickable link to the readthedocs page from the readme and perhaps as project homepage would make things easier :)
Also, awesome project... I've been pondering writing something similar, since I've been looking for something like this for so long.
Just a note: to make installing it easier, I've added it to PyPI. Let me know which username I should transfer the ownership to (if you're interested in that): https://pypi.python.org/pypi/fastdupes
Another addition to the docs: installation is now as easy as pip install fastdupes
Showing duplicated directory trees as single entries, rather than as massive lists of file-oriented dupe groups, would be much more useful but will require a non-trivial increase in CPU or memory requirements. (So it should be made optional.)
Given that this is similar to what diff -qr does, I'll want to investigate existing tools, libraries, and literature before implementing this.
It would be great to be able to import fastdupes and use it in our own Python projects, but the documentation is lacking clear guidance.
I can imagine something as friendly as this would become quite a popular package:
import fastdupes
removed = fastdupes.remove(path="/some/path", recursive=True)
print(f"Removed {len(removed)} duplicate files!")
Given how much one needs to trust something that either informs or, via --delete, causes the deletion of files, fastdupes needs a unit test suite.
My goal is a test suite which both has 100% branch coverage as judged by coverage.py and is satisfactory when evaluated based on a manual inspection of which algorithm and implementation corner cases it tests.
Note to self: Make sure it has a test to confirm that the byte-for-byte comparison in -E only reads enough to determine that a file is unique and then stops examining it.
Once it's complete enough to be useful, tox, Travis-CI, and Coveralls.io will be used to ensure effectiveness. (And to begin supporting Python 3.x too)
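A minimal sketch of that -E early-exit test idea: a read-counting wrapper around a byte buffer plus a stand-in early-exit comparison function. The names here (CountingFile, files_identical) are hypothetical illustrations, not fastdupes' actual internals.

```python
import io

class CountingFile(io.BytesIO):
    """Wraps a byte buffer and records how many bytes were read from it."""
    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, n=-1):
        chunk = super().read(n)
        self.bytes_read += len(chunk)
        return chunk

def files_identical(a, b, chunk=4096):
    """Stand-in for an early-exit byte-for-byte comparison (the -E path)."""
    while True:
        ca, cb = a.read(chunk), b.read(chunk)
        if ca != cb:
            return False
        if not ca:  # both streams exhausted without a mismatch
            return True

def test_early_exit():
    # The files differ in the very first chunk, so the comparison
    # must stop there instead of reading the remaining megabyte.
    a = CountingFile(b'x' * 4096 + b'a' * 1_000_000)
    b = CountingFile(b'y' * 4096 + b'a' * 1_000_000)
    assert not files_identical(a, b)
    assert a.bytes_read <= 4096
```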
Tested on Ubuntu 16.04.
#!/bin/sh
git clone https://github.com/ssokolow/fastdupes
cd fastdupes
mkdir files
seq 100000 > files/file1; echo "1" >> files/file1
seq 100000 > files/file2; echo "2" >> files/file2
cmp files/file1 files/file2
python fastdupes.py files
Cloning into 'fastdupes'...
remote: Counting objects: 279, done.
remote: Total 279 (delta 0), reused 0 (delta 0), pack-reused 279
Receiving objects: 100% (279/279), 93.39 KiB | 0 bytes/s, done.
Resolving deltas: 100% (116/116), done.
Checking connectivity... done.
files/file1 files/file2 differ: byte 588896, line 100001
Found 2 files to be compared for duplication.
Found 1 sets of files with identical sizes. (2 files examined)
Found 1 sets of files with identical header hashes. (2 files examined)
Found 1 sets of files with identical hashes. (2 files examined)
/tmp/fastdupes/files/file2
/tmp/fastdupes/files/file1
If I'm going to micro-optimize cold-cache access, I'll need to have some way to flush or bypass all of the caches between fastdupes and the drive platters. /proc/sys/vm/drop_caches would cover all of the in-kernel caches.
Given that I'm quickly running out of clever tricks to minimize the amount of disk I/O I do, I'm going to need to start working on micro-optimizations.
The most obvious one is tuning the CHUNK_SIZE value used for reading data to be hashed.
According to the hard drive data sheets I examined, the average latency to
acquire a specific block (seek time, rotational latency, etc.) ranges from
roughly 14ms to 3ms.
Assuming that the average uncached, seekless throughput
for a modern disk drive ranges from 60MB/s (as Google and hdparm
seem to
agree on for 7200 RPM drives) to 73MB/s (lower bound for 15K RPM drives
according to manufacturer data sheets), then the point where read time
overtakes seek time in best-case scenarios for pseudo-parallel reads is at:
73 MB/s / 1000 ms/s * 3.0ms = 0.219MB = 219KB
219KB * (1000/1024 KB/KiB) = 213.8672KiB
As such, 216KiB (rounded up to a multiple of 4KiB) should be a good rule-of-thumb lower bound for chunk sizes. (Actual chunk size must take available RAM into account since, theoretically, a user may use -E on a system with tons of dupes of a single file.)
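The arithmetic above can be double-checked in a few lines of Python, using the figures from this issue (73 MB/s throughput, 3.0 ms access latency):

```python
import math

throughput_kb_per_ms = 73_000 / 1000  # 73 MB/s = 73,000 KB/s, over 1000 ms/s
latency_ms = 3.0                      # best-case access time from data sheets

kb = throughput_kb_per_ms * latency_ms   # data readable in one access time: 219.0 KB
kib = kb * 1000 / 1024                   # convert KB to KiB: ~213.87 KiB
chunk_kib = math.ceil(kib / 4) * 4       # round up to a multiple of 4 KiB: 216 KiB

print(kb, kib, chunk_kib)
```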
The next step I'll want to take will be doing some actual experimentation and statistical analysis on commonly duplicated files.
I'll also want to consider whether dynamically tuning the read increment size based on the number of
files being compared and possibly the available RAM (To minimize seeking) is both feasible and worthwhile.
Possibly something like this:
block_size = min(max_block_size, max_consumption / file_count)
TODO: Is there such a thing as a disk access profiler I could use with this?
Once I've implemented certain other optimizations which involve caching file metadata between passes, I'll want to make sure I haven't introduced any unnecessary memory bloat in the process.
...of course, I'll need to look up the best option for profiling memory consumption in a Python program first.
Given that the algorithm will only ever subdivide groups and the generated hashes are never used outside fastdupes, there's no harm in seek()ing past the first HEAD_SIZE bytes when doing the final full-content comparison, and it might provide a tiny bit of speed-up in some situations.
More importantly, for files HEAD_SIZE or smaller, it means that we should be able to skip the final pass completely if the internal data structures preserve the file size read by the first pass, which eliminates the overhead of at least one syscall.
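A hedged sketch of what the skip-ahead hashing might look like. The function name hash_tail and the 16KiB value are illustrative assumptions, not fastdupes' actual API:

```python
import hashlib

HEAD_SIZE = 16 * 1024  # assumed; the header was already compared in an earlier pass

def hash_tail(path, head_size=HEAD_SIZE, chunk=64 * 1024):
    """Hash only the bytes after the already-compared header.

    Safe because the hashes are only compared within one fastdupes run,
    where every file in a group already shares an identical header."""
    h = hashlib.sha1()
    with open(path, 'rb') as handle:
        handle.seek(head_size)  # skip straight past the compared header
        while True:
            block = handle.read(chunk)
            if not block:
                break
            h.update(block)
    return h.digest()
```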
When the paths to scan contain a file with a Unicode filename, fastdupes will throw an error at line 306, in sizeClassifier:
filestat = _stat(path)
WindowsError: [Error 123]
I am using Python 2.7 on Windows to run fastdupes.
The syscall and I/O overhead of checking whether inputs are files or folders may become significant with long lists (especially once issue #9 makes them more common).
I'll want to investigate how much overhead there actually is and whether there's any lighter way to do it. (e.g. Some variation on my algorithm that could replace a two-syscall "check, then open" approach with an "open and catch IOError" workflow that would make only one syscall in the more common case.)
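A sketch of that one-syscall "open and catch the error" approach (EAFP). The function name open_if_file is hypothetical, and this uses Python 3 exception classes:

```python
def open_if_file(path):
    """Single-syscall alternative to a "stat, then open" check:
    just try to open the path and let the raised error classify it.

    Returns an open binary handle for a regular file, or None for
    directories, missing paths, and unreadable files."""
    try:
        return open(path, 'rb')
    except (IsADirectoryError, FileNotFoundError, PermissionError):
        return None
```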
Phases 1 and 2 of fastdupes's algorithm (group by size, group by headers) are essentially an intentional seek storm. The algorithm reads a block or two for each file, then moves on to the next file. As such, the effects of throughput are dwarfed by the effects of access time. (That is, seek time plus time spent waiting for the desired data to rotate into place under the head.)
This is what Tagged Command Queueing (SCSI) and Native Command Queueing (SATA) are intended to optimize in multitasking workloads by allowing the drive's firmware to reorder a stream of requests to minimize the amount of physical travel needed to service them.
Unfortunately, fastdupes is currently written as a single-threaded, single-process program which blocks on every request before sending the next one, which prevents TCQ/NCQ reordering from being applied.
(It also can't take advantage of the parallelism available when it's actually been asked to compare files on multiple physical devices)
The simplest, most portable way to solve this would probably be to initialize a bidirectional, threadsafe queue, spin off a bunch of worker threads, and use a scatter-gather approach so fastdupes can be blocking on several reads simultaneously and the kernel's TCQ/NCQ support will have something to work with.
For now, this bug will track:
Depending on the filesystem block size and how much fragmentation there is, this may also depend significantly on the results of issue #2.
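As a rough illustration of the scatter-gather idea, a thread pool can keep several header reads in flight at once so the kernel's TCQ/NCQ support has something to reorder. The names and the 16KiB size here are assumptions, not fastdupes code:

```python
from concurrent.futures import ThreadPoolExecutor

def read_head(path, size=16 * 1024):
    """Read just the header bytes used by the phase-2 grouping pass."""
    with open(path, 'rb') as handle:
        return path, handle.read(size)

def read_heads_parallel(paths, workers=8):
    """Issue many blocking reads concurrently; each thread blocks on its
    own request, so the drive firmware sees a queue it can reorder."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(read_head, paths))
```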
One of the big flaws that I dislike about fdupes is the irrevocable nature of its deletion. It would be nice to send files to the trash instead. Fortunately, there is already a nice library for this purpose: send2trash.
By all accounts, it would take just a few short lines to make this happen. I have already done so with my own copy of fastdupes, but I figured that you would want to do it your way, if you agree to its inclusion.
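For reference, a hedged sketch of how the integration might look. remove_path is a hypothetical helper; send2trash is the third-party library's real entry point, with permanent deletion as a fallback when it isn't installed or isn't wanted:

```python
import os

try:
    from send2trash import send2trash  # third-party: pip install Send2Trash
except ImportError:
    send2trash = None

def remove_path(path, use_trash=True):
    """Prefer the recoverable trash; fall back to permanent deletion."""
    if use_trash and send2trash is not None:
        send2trash(path)
    else:
        os.remove(path)
```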
In order to streamline the "find dupes, do something manual, re-determine dupes" workflow that often emerges in one-off situations, there should be some way to cache data between runs.
(Something like a --hash-cache=/path/to/file option.)
I'm thinking probably a key-value store which maps inode values to (size, ctime, sha1) triples so that fastdupes can take an rsync-inspired "assume that inode+size+ctime = sha1" shortcut.
(And probably at least starting out SQLite-backed, given my bad experiences with shelve corruption, the need for incremental update support, and my aversion to premature optimization.)
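A minimal sketch of such an SQLite-backed cache, assuming the inode -> (size, ctime, sha1) schema described above (all names here are illustrative, not existing fastdupes code):

```python
import hashlib
import os
import sqlite3

def open_cache(db_path):
    """Open (or create) the hash cache database."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS hashes ("
               "inode INTEGER PRIMARY KEY, size INTEGER, "
               "ctime REAL, sha1 BLOB)")
    return db

def cached_sha1(db, path):
    """rsync-inspired shortcut: if inode+size+ctime all match the cached
    row, trust the cached hash; otherwise rehash and update the cache."""
    st = os.stat(path)
    row = db.execute("SELECT size, ctime, sha1 FROM hashes WHERE inode = ?",
                     (st.st_ino,)).fetchone()
    if row and row[0] == st.st_size and row[1] == st.st_ctime:
        return row[2]
    with open(path, 'rb') as handle:
        digest = hashlib.sha1(handle.read()).digest()
    db.execute("INSERT OR REPLACE INTO hashes VALUES (?, ?, ?, ?)",
               (st.st_ino, st.st_size, st.st_ctime, digest))
    return digest
```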
Reducing HEAD_SIZE from 64KiB to 16KiB produced excellent performance gains at effectively no cost.
However, there's still potentially some performance gain to be had (at the very least, in warm-cache situations) if reducing it to 8KiB or even 4KiB is feasible without allowing too many false positives to leak over into the final "read and hash everything" phase.
I'll need to either dig up some research or do my own statistical analysis on where "the first divergent byte" tends to appear in sets of identically-sized but non-identical files.
Sphinx can generate man pages and has markup roles for the stuff optparse generates for --help.
So far, I haven't been able to find a --help-to-Sphinx plugin so, when time permits, I need to take the --help-to-manpage script I found a while ago and adapt it into a Sphinx plugin.
In proper "UNIX® Philosophy" fashion, fastdupes should be more open to being part of a shell pipeline.
As such, it should be possible for it to receive a list of \n- or NULL-separated filenames via sys.stdin.
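A sketch of how that input might be parsed, written against an arbitrary binary stream so it's testable. paths_from_stream is a hypothetical name; the real code would pass sys.stdin.buffer (Python 3) or sys.stdin (Python 2):

```python
def paths_from_stream(stream, null_separated=False):
    """Split a pipe's contents into filenames.

    NUL separation (like find -print0 / xargs -0) is the only form that
    is unambiguous, since \n is legal inside UNIX filenames."""
    sep = b'\0' if null_separated else b'\n'
    return [p.decode('utf-8') for p in stream.read().split(sep) if p]

# usage sketch:
#   paths = paths_from_stream(sys.stdin.buffer, null_separated=True)
```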
Unlike the other steps, I've done practically nothing to optimize the initial recursive tree-traversal phase.
I'll want to do some cost-benefit research on the following, as well as identifying other potential improvements:
- fnmatch.fnmatch uses regexes internally and doesn't cache them. Given how many times it gets called, I should try using re.compile with fnmatch.translate instead.
- Combining the regexes fnmatch.translate outputs so the ignore check can be handled in a single pass.
- Sharing cached stat() results between sizeClassifier and things like inode-based hardlink detection.

After screwing up a "move to DVD+R" operation, I realized that it would be very helpful to have a way to partially (or fully) automate the process of answering --delete prompts.
The current proposal is:
- --prefer=/path/to/something would add the given path to a list of preferred paths.
- If nothing in a group of duplicates matches that list, the user is prompted as usual, unless --noninteractive is also passed, in which case the system will act as if the prompt had been displayed and "all" typed. (Ensuring that the principle of least surprise is followed.)
It'd probably also be a good idea to add a "prefer " command to the dynamic prompts so that the settings can be altered mid-run without having to restart the whole process. (Though that will require issue #6 to be resolved first.)
I'll need to come up with some proposals, then implement a groupBy classifier that operates on inodes, ideally sharing cached stat() results with the size classifier.
(Though its output won't behave quite the same way as the others since, instead of determining what should be compared to what in the final full-content examination, it'll be used to determine which paths don't need to be compared at all.)
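A rough sketch of what such a classifier could look like. Keying on (st_dev, st_ino) rather than the inode alone avoids merging unrelated files whose inode numbers collide across devices; this is an assumption, not existing fastdupes code:

```python
import os
from collections import defaultdict

def inodeClassifier(paths):
    """Group paths by (device, inode) so hardlinked copies are detected
    without reading any file content.

    Groups with more than one member are already the same file on disk
    and never need a content comparison."""
    groups = defaultdict(set)
    for path in paths:
        st = os.stat(path)
        groups[(st.st_dev, st.st_ino)].add(path)
    return groups
```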
I think the title says it all.
Once issue #19 (unit test suite) is done, I'll want to track down as many Google-visible places as possible where people recommend duplicate-finding tools and add a mention for fastdupes.
Example: http://ubuntu.wordpress.com/2005/10/08/find-duplicate-copies-of-files/
As far as I can see, it would only take changing a few print statements and a str/basestring change. The patch, as far as I can gather:
@@ -106,7 +106,7 @@
in
"""
fhash, read = hashlib.sha1(), 0
- if isinstance(handle, basestring):
+ if isinstance(handle, str):
handle = file(handle, 'rb')
if limit:
@@ -250,7 +250,7 @@
pos + 1, group_count, fun_desc, count, len(paths)
))
- for key, group in classifier(paths, *args, **kwargs).items():
+ for key, group in list(classifier(paths, *args, **kwargs).items()):
groups.setdefault(key, set()).update(group)
count += len(group)
@@ -439,11 +439,11 @@
:rtype: :class:`~__builtins__.int`
"""
dupeList = sorted(dupeList)
- print
+ print()
for pos, val in enumerate(dupeList):
- print "%d) %s" % (pos + 1, val)
+ print("%d) %s" % (pos + 1, val))
while True:
- choice = raw_input("[%s/%s] Keepers: " % (mainPos, mainLen)).strip()
+ choice = input("[%s/%s] Keepers: " % (mainPos, mainLen)).strip()
if not choice:
print ("Please enter a space/comma-separated list of numbers or "
"'all'.")
@@ -495,7 +495,7 @@
value = DEFAULTS[key]
if isinstance(value, (list, set)):
value = ', '.join(value)
- print "%*s: %s" % (maxlen, key, value)
+ print("%*s: %s" % (maxlen, key, value))
def delete_dupes(groups, prefer_list=None, interactive=True, dry_run=False):
"""Code to handle the :option:`--delete` command-line option.
@@ -530,7 +530,7 @@
assert preferred # Safety check
for path in pruneList:
- print "Removing %s" % path
+ print("Removing %s" % path)
if not dry_run:
os.remove(path)
@@ -598,8 +598,8 @@
delete_dupes(groups, opts.prefer, not opts.noninteractive,
opts.dry_run)
else:
- for dupeSet in groups.values():
- print '\n'.join(dupeSet) + '\n'
+ for dupeSet in list(groups.values()):
+ print('\n'.join(dupeSet) + '\n')
if __name__ == '__main__':
main()
Once issue #6 has been resolved, it should be used to add a link command to the --delete prompt which is similar to the existing all command but, instead, hardlinks all of the files together.
I'll need to decide on how the UI should handle the case where full deduplication was not achieved because the matches were spread across more than one partition.
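A hedged sketch of the hardlink-all operation, including the cross-device case mentioned above (hardlink_group is a hypothetical name; hardlinks cannot span filesystems, so st_dev must match):

```python
import os

def hardlink_group(paths):
    """Replace every duplicate with a hardlink to the first path.

    Returns the paths that could NOT be linked because they live on a
    different filesystem, so the UI can report partial deduplication."""
    keeper, rest = paths[0], paths[1:]
    keeper_dev = os.stat(keeper).st_dev
    skipped = []
    for path in rest:
        if os.stat(path).st_dev != keeper_dev:
            skipped.append(path)  # hardlinks can't cross partitions
            continue
        os.remove(path)
        os.link(keeper, path)
    return skipped
```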
Both to allow reliable comparison of the output (either between regular and --exact mode or between two different code revisions) and for user comfort, there should at least be an option to sort the output.
I'll need to do some experimenting with python -m timeit to determine what kind of sorting to use and what to make optional.
For example:
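One option worth timing: sort the paths within each group, then sort the groups themselves, so output is fully deterministic. sorted_groups is a hypothetical helper, not fastdupes code:

```python
def sorted_groups(groups):
    """Deterministic ordering for dupe output: sort paths within each
    group, then order the groups by their (sorted) contents."""
    return sorted(sorted(group) for group in groups.values())
```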
First and most importantly, I'll want to investigate rmlint since a brief examination of the README suggests that it's doing the same kinds of optimizations I'm doing but has had a lot more effort put into it.
Depending on how much faster rmlint is, I'll probably want to focus more on the extensibility and hackability I gain from using Python as opposed to C.
Second, I'll also want to investigate schweikh3.c and samefile:
<mauke> feature request: if you could make it output compatible with
http://www.ioccc.org/1998/schweikh3.hint , that would be sweet
(http://www.ioccc.org/1998/schweikh3.c)
<mauke> I don't like the way fdupes works. samefile's interface is superior
<mauke> it says if you specify a directory twice, it will list files as
their own duplicates.
<mauke> wtf was the author thinking?
<deitarion> mauke: Lazy, I guess. I believe I fixed that in fastdupes.
Once issues #5 and #7 are resolved, I'll want to add a command-line switch with a name like --autolink or --link-if which is used similarly to --prefer.
The purpose would be to allow behaviours like "If you see a group of duplicates and one is named Folder.jpg or AlbumArtSmall.jpg, then hardlink them all together without asking."
I'll still need to finish the design work on it, though.
Feature request: allow entering "none" in response to the Keepers: prompt, forcing deletion of all duplicates.
The more automation there is for deleting files, the more likely it is that empty folders will get left lying around.
I'll want to also have some kind of support for automating those away too.
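A bottom-up sweep is probably the simplest way to automate that. A hedged sketch (prune_empty_dirs is a hypothetical helper; walking bottom-up lets parents that become empty get removed in the same pass):

```python
import os

def prune_empty_dirs(root):
    """Remove empty directories under root, bottom-up, so that parents
    emptied by removing their children are pruned too. Returns the list
    of removed directories; root itself is never removed."""
    removed = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        # Re-check with listdir: the walk's cached dirnames may include
        # children we've already removed on this pass.
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed
```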
Various ideas for the --delete prompt require it to behave more like a command line which fills in a default command if given only numeric arguments.
As such, it will need to be re-architected.