elmindreda/duff: Command-line utility for finding duplicate files
License: Other
duff - Duplicate file finder
============================

0. Introduction
===============

Duff is a command-line utility for identifying duplicates in a given set
of files. It attempts to be usably fast and uses the SHA family of
message digests as a part of the comparisons.

Duff resides in a public Git repository on GitHub:

  https://github.com/elmindreda/duff

The version numbering scheme for duff is as follows:

* The first number is the major version. This will be updated upon what
  the author considers a round of feature completion.
* The second number is the minor version number. This is updated for
  releases that include minor new features, or features that do not
  change the functionality of the program.
* The third number, if present, is the bugfix release number. This
  indicates a release which only fixes bugs present in a previous major
  or minor release.

1. License and copyright
========================

Duff is copyright (c) 2005 Camilla Löwy <[email protected]>

Duff is licensed under the zlib/libpng license. See the file `COPYING'
for license details. The license is also included at the top of each
source file.

Duff contains shaX-asaddi.
Copyright (c) 2001-2003 Allan Saddi <[email protected]>
See the files `src/sha*.c' and `src/sha*.h' for license details.

Duff uses the gettext.h convenience header from GNU gettext.
Copyright (C) 1995-1998, 2000-2002, 2004-2006, 2009 Free Software
Foundation, Inc.
See the file `lib/gettext.h' for license details.

Duff comes with a number of files provided by the GNU autoconf, automake
and gettext packages. See the individual files in question for license
details.

2. Project news
===============

See the file `NEWS'.

3. Building Duff
================

If you got this source tree from a Git repository then you will need to
bootstrap the build environment using first `gettextize --no-changelog'
and then `autoreconf -i'. Note that this requires that GNU autoconf,
automake and the gettext development tools are installed.
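As a sketch, the bootstrap sequence described above amounts to the
following commands, run from the top of the source tree (it assumes GNU
autoconf, automake and the gettext tools are already installed):

```shell
gettextize --no-changelog  # add the gettext build infrastructure
autoreconf -i              # generate the configure script and friends
./configure                # then the usual autotools build
make
```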
If (or once) you have a `configure' script, go ahead and run it. No
additional magic should be required. If it is, then that's a bug and
should be reported.

This release of duff has been successfully built on the following
systems:

  Ubuntu Natty x86_64

Earlier releases have been successfully built on the following systems:

  Arch Linux x86
  Cygwin 1.7 i686
  Darwin 7.9.0 powerpc
  Debian Etch powerpc
  Debian Etch x86
  Debian Lenny x86
  Debian Sarge alpha
  Debian Wheezy amd64
  FreeBSD 4.11 x86
  FreeBSD 5.4 x86
  FreeBSD 8.2 i386
  Mac OS X 10.3 powerpc
  Mac OS X 10.4 powerpc
  Mac OS X 10.6 i386
  Mac OS X 10.6 x86_64
  Mac OS X 10.6 x86_64 (with MacPorts gettext)
  Mac OS X 10.7 x86_64
  NetBSD 1.6.1 sparc
  Red Hat Enterprise 4.0 x86
  SunOS 5.9 sparc64
  Ubuntu Breezy x86
  Ubuntu Jaunty x86
  Ubuntu Lucid amd64
  Ubuntu Maverick amd64

The tools used were GCC and GNU or BSD make. However, it should build on
most Unix systems without modifications.

4. Installing Duff
==================

See the file `INSTALL'.

5. Using Duff
=============

See the accompanying man page duff(1). To read the man page before
installation, use the following command:

  groff -mdoc -Tascii man/duff.1 | less -R

On GNU/Linux systems, however, the following command may suffice:

  man -l man/duff.1

6. Hacking Duff
===============

See the file `HACKING'.

7. Bugs, feedback and patches
=============================

Please send bug reports, feedback, patches and cookies to:

  Camilla Löwy <[email protected]>

8. Credits and thanks
=====================

The following (alphabetically listed) people have contributed to duff,
either by reporting bugs, suggesting new features or submitting patches:

  Harald Barth
  Alexander Bostrom
  Magnus Danielsson
  Stephan Hegel
  Patrik Jarnefelt
  Rasmus Kaj
  Mika Kuoppala
  Richard Levitte
  Fernando Lopez
  Clemens Lucas Fries
  Kamal Mostafa
  Ross Newell
  Allan Saddi <[email protected]>

...and everyone I forgot. Did I forget you? Drop me an email.

9. Disambiguation
=================

This is duff the Unix command-line utility, not DUFF the Windows
program. If you wish to find duplicate files on Windows, use DUFF. DUFF
also has a SourceForge.net URL:

  http://dff.sourceforge.net/

10. Release history
===================

Version 0.1 was named `duplicate' and was never released anywhere.

Version 0.2 was the first release named duff. It lacked a real
checksumming algorithm, and was thus only released to a few individuals,
during the first half of 2005.

Version 0.3 was the first official release, on November 22, 2005, after
a long search for a suitably licensed implementation of SHA1.

Version 0.3.1 was a bugfix release, on November 27, 2005, adding a
single feature (-z), which just happened to get included.

Version 0.4 was the second feature release, on January 13, 2006, adding
a number of missing and/or requested features as well as bug fixes. It
was the first release to be considered stable and safe enough for
everyday use.

Version 0.5 was the third feature release, on April 11, 2011, adding a
number of minor features and fixing a number of bugs. It was mostly
intended to get the ball rolling again and was thus low on features.

Version 0.5.1 was a bugfix release, on January 17, 2012, adding a single
bugfix and a new default cluster header for thorough mode.

Version 0.5.2 was a minor release, on January 29, 2012, adding a number
of optimizations, prefixing error and warning messages with the program
name and modifying the default sampling limit.
Tried to build from source by following the instructions in the README:
first `gettextize --no-changelog', then `autoreconf -i', and got this
error:
configure.ac:47: error: `po/Makefile.in' is already registered with AC_CONFIG_FILES.
../../lib/autoconf/status.m4:288: AC_CONFIG_FILES is expanded from...
configure.ac:47: the top level
autom4te: /usr/bin/m4 failed with exit status: 1
aclocal: error: echo failed with exit status: 1
autoreconf: aclocal failed with exit status: 1
The README refers to an INSTALL file with installation details, but the
file is in neither the tarball nor the git repo.
Hi
I'm doing some work on duff because I found it useful when fixing broken
rsnapshot repositories (I will make some pull requests in a few days).
Unfortunately such repositories are a bit unusual (millions of files,
mostly hardlinked in groups of 30-50).

It seems that I'm having a problem with large buckets (long lists):
because each sampled file allocates 4KB of data that is only freed at
the end of bucket processing, I'm getting "out of memory" errors at
around 3GB of allocated memory (the box is a light 32-bit Atom-based
system).
As sizeof(FileList) == 12, I see no problem increasing HASH_BITS to 16
(~800KB) or even 20 (~13MB).
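Assuming, as stated above, that sizeof(FileList) is 12 bytes, the quoted
table sizes can be sanity-checked with shell arithmetic: a table of
2^HASH_BITS buckets with one FileList head each costs
(1 << HASH_BITS) * 12 bytes.

```shell
# Rough memory cost of the bucket table, assuming sizeof(FileList) == 12
# as stated above; one FileList head per bucket, 2^HASH_BITS buckets.
echo $(( (1 << 16) * 12 ))   # 786432 bytes, roughly the ~800KB quoted
echo $(( (1 << 20) * 12 ))   # 12582912 bytes, roughly the ~13MB quoted
```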
I wonder what you think: would it be a good idea to add an option making
this runtime-configurable?

Another idea is to (optionally?) replace the sample with some simple,
fast running checksum (crc64?).
It appears that in 2021 this domain was registered after having lapsed.
Currently, duff.dreda.org seems to redirect to malvertising of some
sort. ("Your computer is infected with a virus!" type stuff.)
For checking that a backup is complete, or checking that I have all the files from a camera SD card (before I wipe the card) it would be useful to be able to run duff in a "find unique" mode that lists files which don't have duplicates.
As ever, this functionality can be constructed with an appropriate pipeline of find/sha1sum/sort/uniq or similar, but perhaps it's close enough to what duff does to be worth including?
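For what it's worth, here is a rough sketch of such a pipeline, assuming
GNU coreutils and findutils; it prints the files under a (hypothetical)
directory whose SHA-1 digest occurs exactly once, and does not handle
filenames containing newlines:

```shell
# List files that have no content-duplicate under the given directory.
# sha1sum prints a 40-char digest, two separator chars, then the path,
# so uniq -u -w 40 keeps digests seen exactly once and cut strips them.
find /some/dir -type f -exec sha1sum {} + | sort | uniq -u -w 40 | cut -c 43-
```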
Hi,

Would it be possible for you to use or introduce xxhash as a hash
function?

Thanks for duff 👍
I feel like this should be obvious, but maybe I'm missing something.
duff marks hardlinked files as "duplicates", which means that doing the
obvious thing (using duff to reduce clutter and delete duplicate files)
will result in deleting files with hardlinks (multiple filenames for the
same data). I can't think of any reason why this should be the default
rather than the opposite. Basically, -p should be the default, right?
  -p  Physical mode. Make duff consider physical files instead of hard
      links. If specified, multiple hard links to the same physical file
      will not be reported as duplicates.
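To illustrate the hard-link case described above (duff itself is not
needed for this part), two directory entries created with ln share a
single inode, so "deleting the duplicate" just removes one name for the
same data. A small demonstration, assuming GNU stat:

```shell
# Two hard links are one physical file: same inode, same data blocks.
d=$(mktemp -d)
echo data > "$d/original"
ln "$d/original" "$d/copy"          # second name for the same inode
stat -c %i "$d/original" "$d/copy"  # prints the same inode number twice
```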
With `find` you can ignore files that are of little consequence, such as
files that are really small or really big:

  # Files more than 1 gigabyte
  find -size +1G

  # Files less than 1 megabyte
  find -size -1M
This would be great for duff, because when trying to free up disk space
one wants to find the big files (e.g. videos) without the output being
flooded by static web content (e.g. jquery-1.9.2.js, bootstrap.css).

Hopefully that isn't too tough to implement, unlike sorting, which would
be great but probably algorithmically prohibitive. Being able to use
duff with pipes to do things like filtering would be even smarter, but I
don't see a way to do it with the way duff reports (except with the -e
option, which is a bit risky).
As one of the reasons to find duplicate files is to recover precious
disk space, it would be great if the default sort order of duff were
file size: deleting a single huge duplicated file is much more useful
than deleting lots of tiny ones. Or, at least, provide a command-line
option for sorting by size.
Other than that, duff is a great utility, thanks so much!
Similar to `du -h`, would it be possible to support an option that
presents file sizes in megabytes/gigabytes instead of bytes?
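As a stopgap under GNU coreutils, raw byte counts can be converted with
numfmt, which is roughly what such an option would do internally:

```shell
# Convert raw byte counts to human-readable IEC units, as du -h does.
numfmt --to=iec 1048576     # prints 1.0M
numfmt --to=iec 5368709120  # prints 5.0G
```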