
capidup's Introduction

CapiDup


Quickly find duplicate files in directories.

CapiDup recursively crawls through all the files in a list of directories and identifies duplicate files. Duplicate files are files with the exact same content, regardless of their name, location or timestamp.

This program is designed to be quite fast. It uses a smart algorithm to detect and group duplicate files using a single pass on each file (that is, CapiDup doesn't need to compare each file to every other).

CapiDup fully supports both Python 2 and Python 3.

The capidup package is a library that implements the functionality and exports an API. There is a separate capidup-cli package that provides a command-line utility.

Usage

Using CapiDup is quite simple:

>>> import capidup.finddups
>>> duplicate_groups, errors = capidup.finddups.find_duplicates_in_dirs(
...     ["/media/sdcard/DCIM", "/home/user/photos"]
... )
>>> for duplicates in duplicate_groups:
...   print(duplicates)
...
['/media/sdcard/DCIM/DSC_1137.JPG', '/home/user/photos/Lake001.jpg']
['/media/sdcard/DCIM/DSC_1138.JPG', '/home/user/photos/Lake002.jpg']
['/home/user/photos/Woman.jpg', '/home/user/photos/portraits/Janet.jpg']
>>> errors
[]

Here we find out that /media/sdcard/DCIM/DSC_1137.JPG is a duplicate of ~/photos/Lake001.jpg, DSC_1138.JPG is a duplicate of Lake002.jpg, and ~/photos/Woman.jpg is a duplicate of photos/portraits/Janet.jpg.

Algorithm

CapiDup crawls the directories and gathers the list of files. Then, it takes a 3-step approach:

  1. Files are grouped by size (files of different sizes are obviously different).
  2. Files of the same size are further grouped by the MD5 of their first few KB. Naturally, if the first few KB differ, the files differ.
  3. Files with the same initial MD5 are finally grouped by the MD5 of the entire file. Files with the same full MD5 are considered duplicates. A minimal sketch of this grouping follows the list.
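
The sketch below is illustrative only, not capidup's actual implementation; PARTIAL_READ_SIZE and the function names are made up for the example:

import collections
import hashlib
import os

PARTIAL_READ_SIZE = 8 * 1024  # hypothetical "first few KB" threshold

def md5_of(path, max_bytes=None):
    """MD5 hex digest of a file, or of its first max_bytes bytes."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        md5.update(f.read(max_bytes) if max_bytes else f.read())
    return md5.hexdigest()

def find_duplicates(paths):
    # Step 1: group files by size.
    by_size = collections.defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    duplicate_groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        # Step 2: group by MD5 of the first few KB.
        by_partial = collections.defaultdict(list)
        for path in same_size:
            by_partial[md5_of(path, PARTIAL_READ_SIZE)].append(path)
        for same_partial in by_partial.values():
            if len(same_partial) < 2:
                continue
            # Step 3: group by MD5 of the whole file.
            by_full = collections.defaultdict(list)
            for path in same_partial:
                by_full[md5_of(path)].append(path)
            duplicate_groups.extend(
                group for group in by_full.values() if len(group) > 1)
    return duplicate_groups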

Considerations

There is a very small possibility of false positives. For any given file, there is a 1 in 2^64 (1:18,446,744,073,709,551,616) chance of some other random file being detected as its duplicate by mistake.

The reason for this is that two different files may have the same hash: this is called a collision. CapiDup uses MD5 (which generates 128 bit hashes) for detecting whether the files are equal. It cannot distinguish between a case where both files are equal and a case where they just happen to generate the same MD5 hash.

The odds of this happening by accident, for two files of the same size, are therefore extremely low. For normal home use, dealing with movies, music, source code or other documents, this concern can be disregarded.

Security

There is one case when care should be taken: when comparing files which might have been intentionally manipulated by a malicious attacker.

While the chance of two random files having the same MD5 hash is really very low (as stated above), it is possible for a malicious attacker to purposely manipulate a file to have the same MD5 as another. The MD5 algorithm is not secure against intentional deception.

This may be of concern for example when comparing things such as program installers. A malicious attacker could infect an installer with malware, and manipulate the rest of the file in such a way that it still has the same MD5 as the original. Comparing the two files, CapiDup would show them as duplicates when they are not.

Future plans

Future plans for CapiDup include having a configurable option to use a different hashing algorithm, such as SHA1 which has a larger hash size of 160 bits, or SHA2 which allows hashes up to 512 bits and has no publicly known collision attacks. SHA2 is currently used for most cryptographic purposes, where security is essential. False positives, random or maliciously provoked, would be practically impossible. Duplicate detection will of course be slower, depending on the chosen algorithm.

For the extremely paranoid case, there could be an additional setting which would check files with two different hashing algorithms. The tradeoff in speed would not be worthwhile for any normal use case, but the possibility could be there.

capidup's Issues

Detect directory loops

Now that find_duplicates_in_dirs has the follow_dirlinks parameter (see #16), we need a way to detect symlink loops. If there is a symlink pointing to ., or to a parent directory, we will go into a loop.

Fortunately, os.walk() seems to stop after several levels of recursion. But still, it's probably undefined behavior.

See what commands like find or rsync do.
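
One possible approach, sketched below: remember the (st_dev, st_ino) pair of every directory already visited and prune any directory that shows up again. This is only an illustration of the idea, not existing capidup code:

import os

def walk_without_loops(top, follow_dirlinks=True):
    seen = set()
    for dirpath, dirnames, filenames in os.walk(top, followlinks=follow_dirlinks):
        st = os.stat(dirpath)
        key = (st.st_dev, st.st_ino)
        if key in seen:
            # We have already visited this physical directory: a symlink
            # loop.  Prune the subtree and skip it.
            dirnames[:] = []
            continue
        seen.add(key)
        yield dirpath, dirnames, filenames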

Breaks in index_files_by_size when files disappear while indexing (race condition)

index_files_by_size isn't catching some errors.

It is possible for os.lstat() to fail inside the loop that iterates filenames, and that exception will not be caught.

This is a corner case from a race condition. The file could exist when os.walk() lists the directory, but already be removed when we execute os.lstat(). To trigger this, we can scan /proc:

>>> capidup.finddups.find_duplicates_in_dirs(['/proc'])
...
error listing '/proc/14690/fd': Permission denied
error listing '/proc/14690/fdinfo': Permission denied
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/virtenv/local/lib/python2.7/site-packages/capidup/finddups.py", line 270, in find_duplicates_in_dirs
    sub_errors = index_files_by_size(directory, files_by_size)
  File "/tmp/virtenv/local/lib/python2.7/site-packages/capidup/finddups.py", line 121, in index_files_by_size
    file_info = os.lstat(full_path)
OSError: [Errno 2] No such file or directory: '/proc/14823/task/14823/fd/3'
>>>

The PID in question is the Python interpreter itself.

Solution: wrap os.lstat() with a try: ... except and call _print_error() for consistency.
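
Something along these lines (a sketch; the surrounding variable names and _print_error's exact signature are assumptions, not the real code):

try:
    file_info = os.lstat(full_path)
except OSError as e:
    # The file vanished (or became unreadable) between os.walk() listing
    # it and us calling os.lstat() on it.  Report it the same way other
    # errors are reported, and skip the file.
    _print_error("error examining '%s': %s" % (full_path, e.strerror))  # signature assumed
    continue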

Must create a test case (how?).

Exclude directories

Hi,
Just a suggestion:
It would be useful to have an option to exclude directories. This would make it even more flexible.
Thanks a lot for this nice tool !!

Pip install doesn't include the documentation

Installing with pip install capidup doesn't include the documentation files. We should make sure to include the documentation, so our users can know how to use the package...

Files included:

$ pip show -f capidup

---
Metadata-Version: 2.0
Name: capidup
Version: 1.0.1
Summary: Quickly find duplicate files in directories
Home-page: https://github.com/israel-lugo/capidup
Author: Israel G. Lugo
Author-email: [email protected]
License: GPLv3+
Location: /tmp/asdf/lib/python2.7/site-packages
Requires: 
Files:
  capidup-1.0.1.dist-info/DESCRIPTION.rst
  capidup-1.0.1.dist-info/METADATA
  capidup-1.0.1.dist-info/RECORD
  capidup-1.0.1.dist-info/WHEEL
  capidup-1.0.1.dist-info/metadata.json
  capidup-1.0.1.dist-info/top_level.txt
  capidup/__init__.py
  capidup/__init__.pyc
  capidup/finddups.py
  capidup/finddups.pyc
  capidup/py3compat.py
  capidup/py3compat.pyc
  capidup/version.py
  capidup/version.pyc

test_find_dups_in_dirs gives false negatives on some systems

test_dups_full.py:test_find_dups_in_dirs() (introduced initially in b415c70) can fail with the following error:

tmpdir = local('/tmp/pytest-of-capi/pytest-28/test_find_dups_in_dirs_False_02'), file_groups = (('', '', ''), ('a', 'a', 'a', 'a', 'a', 'a', ...)), num_index_errors = 0, num_read_errors = 0
flat = False

    @pytest.mark.parametrize("file_groups", file_groups_data)
    @pytest.mark.parametrize("num_index_errors", index_errors_data)
    @pytest.mark.parametrize("num_read_errors", read_errors_data)
    @pytest.mark.parametrize("flat", [True, False])
    def test_find_dups_in_dirs(tmpdir, file_groups, num_index_errors,
                               num_read_errors, flat):
        """Test find_duplicates_in_dirs with multiple files.
[...]
        # Check that duplicate groups match. The files may have been traversed
        # in a different order from how we created them; sort both lists.
        dup_groups.sort()
        expected_dup_groups.sort()
        for i in range(len(dup_groups)):
>           assert len(dup_groups[i]) == len(expected_dup_groups[i])
E           AssertionError: assert 200 == 3
E            +  where 200 = len(['/tmp/pytest-of-capi/pytest-28/test_find_dups_in_dirs_False_02/file140/g1', '/tmp/pytest-of-capi/pytest-28/test_find_..._find_dups_in_dirs_False_02/file40/g1', '/tmp/pytest-of-capi/pytest-28/test_find_dups_in_dirs_False_02/file21/g1', ...])
E            +  and   3 = len(['/tmp/pytest-of-capi/pytest-28/test_find_dups_in_dirs_False_02/file0/g0', '/tmp/pytest-of-capi/pytest-28/test_find_dups_in_dirs_False_02/file1/g0', '/tmp/pytest-of-capi/pytest-28/test_find_dups_in_dirs_False_02/file2/g0'])

capidup/tests/test_dups_full.py:208: AssertionError

This is seen on a Debian 9.1 system, kernel 4.9.0-3-amd64, Python versions 2.7.13 and 3.5.3, pytest-3.2.2. It is not however seen on the Travis build server, or my own home PC running Gentoo Linux.

Configurable hashing algorithm

Create a command-line option to let the user select between MD5 and other (more secure) hashing algorithms. E.g. SHA-1, SHA-256, SHA-512.

This way, the user can select a more collision-resistant hash for when security is a greater concern (e.g. comparing software installers which might have been tampered with, and so on).
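
A rough sketch of how this could look on the library side, using hashlib.new() to pick the algorithm by name (the hash_algorithm parameter is a proposal, not an existing capidup option):

import hashlib

def make_hasher(hash_algorithm="md5"):
    # 'md5', 'sha1', 'sha256' and 'sha512' are all valid names here.
    return hashlib.new(hash_algorithm)

The CLI would then only need to map something like a --hash sha256 option to hash_algorithm="sha256".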

Generate plaintext README

README.md looks nice in Github, but it's not practical for users who download our code, or install the package.

A README.txt can be generated using pandoc:
pandoc -t plain -f markdown README.md > README.txt

But this doesn't support all formatting. Namely, it doesn't support the HTML <sup> tags which we use to represent 2^64, leaving us with "1 in 264" instead of "1 in 2^64". And that's just not the same thing.

We could change to AsciiDoc. That supports superscripting natively.

Partial test could include beginning and end of file

Some files may be very similar at the start, but different at the end. E.g. large ISO images of two similar operating system versions, and so on. Also, some media formats add their metadata at the end of the file, instead of the start (cf. ID3v1 tags for MP3).

The partial test could read X/2 KB from the start and X/2 from the end. This would also help against collisions, by mixing it up.

Should performance test this. It will cause more seeking in mechanical disks. Will it make a noticeable difference?
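
A sketch of what the head-and-tail partial digest could look like (illustrative only; total_bytes stands in for the "X" above):

import hashlib
import os

def partial_digest(path, total_bytes=8 * 1024):
    half = total_bytes // 2
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        md5.update(f.read(half))        # first X/2
        if os.path.getsize(path) > total_bytes:
            # Only read the tail when it can't overlap the head read.
            f.seek(-half, os.SEEK_END)
            md5.update(f.read(half))    # last X/2
    return md5.hexdigest()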

Look into parallelization

It might be advantageous to use multiprocessing, to calculate multiple hashes at the same time. The bottleneck will probably be storage performance, but we might be comparing directories on different drives, so it may pay off.
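
A rough sketch of the idea (an experiment to measure, not a committed design):

import hashlib
import multiprocessing

def md5_file(path, chunk_size=1024 * 1024):
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return path, md5.hexdigest()

def hash_all(paths, processes=4):
    # Hash several files concurrently; worthwhile mainly when the files
    # live on different physical drives.
    pool = multiprocessing.Pool(processes)
    try:
        return dict(pool.map(md5_file, paths))
    finally:
        pool.close()
        pool.join()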

Exclude files by glob pattern

Similarly to the directory exclusion in issue #10 suggested by @ocumo, it would be nice to be able to exclude files as well. We can do this with an additional parameter to find_duplicates_in_dirs.
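
For example, something along these lines (exclude_files is a proposed parameter name, nothing more):

import fnmatch

def is_excluded(filename, exclude_files):
    """True if filename matches any of the given glob patterns."""
    return any(fnmatch.fnmatch(filename, pattern) for pattern in exclude_files)

# During indexing, files could then be skipped with something like:
#     if is_excluded(filename, ["*.tmp", "Thumbs.db"]):
#         continue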

Optionally follow symlinks to directories

It would be nice to have the possibility of following symbolic links to (sub)directories. Currently, we don't follow them.

find_duplicates_in_dirs could have a new follow_links parameter, defaulting to False for compatibility.
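
Sketch of how the parameter could be threaded down to os.walk() (follow_links is the proposed name, not an existing parameter):

import os

def index_directory(root, follow_links=False):
    # os.walk() already knows how to descend into symlinked directories;
    # we just have to pass the flag through.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=follow_links):
        pass  # index each file as before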

Paranoid setting to hash with two algorithms?

Maybe we could create a paranoid setting, where files are hashed with two algorithms. Is this even worthwhile? Not like we're expecting forced collisions in SHA-512 anytime soon.
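
If it were ever added, the core of it could be as simple as concatenating two independent digests (toy sketch, not a committed design):

import hashlib

def paranoid_digest(data):
    # A forged collision would have to break both algorithms at once.
    return hashlib.md5(data).hexdigest() + hashlib.sha512(data).hexdigest()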

Be compatible with Python 3

We should work both in Python 2 and Python 3. Most of the code should be rather version agnostic, except perhaps for the file interaction (bytes vs str) and things like dict.iterkeys.

Any portions of code that can't work with both Python versions can be moved out to version-specific modules, to be imported accordingly.
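
Sketch of the version-specific module approach (the module names are illustrative, not the ones actually used):

import sys

if sys.version_info[0] >= 3:
    from capidup import compat_py3 as compat
else:
    from capidup import compat_py2 as compat

# The rest of the code then only calls thin wrappers such as
# compat.iterkeys(d), regardless of the interpreter version.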
