elmindreda/duff: Command-line utility for finding duplicate files
License: Other
duff - Duplicate file finder
============================

0. Introduction
===============

Duff is a command-line utility for identifying duplicates in a given set
of files. It attempts to be usably fast and uses the SHA family of
message digests as a part of the comparisons.

Duff resides in a public Git repository on GitHub:

  https://github.com/elmindreda/duff

The version numbering scheme for duff is as follows:

* The first number is the major version. This will be updated upon what
  the author considers a round of feature completion.
* The second number is the minor version number. This is updated for
  releases that include minor new features, or features that do not
  change the functionality of the program.
* The third number, if present, is the bugfix release number. This
  indicates a release which only fixes bugs present in a previous major
  or minor release.

1. License and copyright
========================

Duff is copyright (c) 2005 Camilla Löwy <[email protected]>

Duff is licensed under the zlib/libpng license. See the file `COPYING'
for license details. The license is also included at the top of each
source file.

Duff contains shaX-asaddi.
Copyright (c) 2001-2003 Allan Saddi <[email protected]>
See the files `src/sha*.c' and `src/sha*.h' for license details.

Duff uses the gettext.h convenience header from GNU gettext.
Copyright (C) 1995-1998, 2000-2002, 2004-2006, 2009 Free Software
Foundation, Inc.
See the file `lib/gettext.h' for license details.

Duff comes with a number of files provided by the GNU autoconf, automake
and gettext packages. See the individual files in question for license
details.

2. Project news
===============

See the file `NEWS'.

3. Building Duff
================

If you got this source tree from a Git repository then you will need to
bootstrap the build environment using first `gettextize --no-changelog'
and then `autoreconf -i'. Note that this requires that GNU autoconf,
automake and the gettext development tools are installed.
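As a sketch, the bootstrap sequence described above amounts to the
following commands, run from the top of the source tree (it assumes GNU
autoconf, automake and the gettext tools are already installed):

```shell
gettextize --no-changelog  # add the gettext build infrastructure
autoreconf -i              # generate the configure script and friends
./configure                # then the usual autotools build
make
```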
If (or once) you have a `configure' script, go ahead and run it. No
additional magic should be required. If it is, then that's a bug and
should be reported.

This release of duff has been successfully built on the following
systems:

  Ubuntu Natty x86_64

Earlier releases have been successfully built on the following systems:

  Arch Linux x86
  Cygwin 1.7 i686
  Darwin 7.9.0 powerpc
  Debian Etch powerpc
  Debian Etch x86
  Debian Lenny x86
  Debian Sarge alpha
  Debian Wheezy amd64
  FreeBSD 4.11 x86
  FreeBSD 5.4 x86
  FreeBSD 8.2 i386
  Mac OS X 10.3 powerpc
  Mac OS X 10.4 powerpc
  Mac OS X 10.6 i386
  Mac OS X 10.6 x86_64
  Mac OS X 10.6 x86_64 (with MacPorts gettext)
  Mac OS X 10.7 x86_64
  NetBSD 1.6.1 sparc
  Red Hat Enterprise 4.0 x86
  SunOS 5.9 sparc64
  Ubuntu Breezy x86
  Ubuntu Jaunty x86
  Ubuntu Lucid amd64
  Ubuntu Maverick amd64

The tools used were GCC and GNU or BSD make. However, it should build on
most Unix systems without modifications.

4. Installing Duff
==================

See the file `INSTALL'.

5. Using Duff
=============

See the accompanying man page duff(1). To read the man page before
installation, use the following command:

  groff -mdoc -Tascii man/duff.1 | less -R

On GNU/Linux systems, however, the following command may suffice:

  man -l man/duff.1

6. Hacking Duff
===============

See the file `HACKING'.

7. Bugs, feedback and patches
=============================

Please send bug reports, feedback, patches and cookies to:

  Camilla Löwy <[email protected]>

8. Credits and thanks
=====================

The following (alphabetically listed) people have contributed to duff,
either by reporting bugs, suggesting new features or submitting patches:

  Harald Barth
  Alexander Bostrom
  Magnus Danielsson
  Stephan Hegel
  Patrik Jarnefelt
  Rasmus Kaj
  Mika Kuoppala
  Richard Levitte
  Fernando Lopez
  Clemens Lucas Fries
  Kamal Mostafa
  Ross Newell
  Allan Saddi <[email protected]>

...and everyone I forgot. Did I forget you? Drop me an email.

9. Disambiguation
=================

This is duff the Unix command-line utility, not DUFF the Windows
program. If you wish to find duplicate files on Windows, use DUFF. DUFF
also has a SourceForge.net URL:

  http://dff.sourceforge.net/

10. Release history
===================

Version 0.1 was named `duplicate' and was never released anywhere.

Version 0.2 was the first release named duff. It lacked a real
checksumming algorithm, and was thus only released to a few individuals,
during the first half of 2005.

Version 0.3 was the first official release, on November 22, 2005, after
a long search for a suitably licensed implementation of SHA1.

Version 0.3.1 was a bugfix release, on November 27, 2005, adding a
single feature (-z), which just happened to get included.

Version 0.4 was the second feature release, on January 13, 2006, adding
a number of missing and/or requested features as well as bug fixes. It
was the first release to be considered stable and safe enough for
everyday use.

Version 0.5 was the third feature release, on April 11, 2011, adding a
number of minor features and fixing a number of bugs. It was mostly
intended to get the ball rolling again and was thus low on features.

Version 0.5.1 was a bugfix release, on January 17, 2012, adding a single
bugfix and a new default cluster header for thorough mode.

Version 0.5.2 was a minor release, on January 29, 2012, adding a number
of optimizations, prefixing error and warning messages with the program
name and modifying the default sampling limit.
Tried to build from source by following the instructions in the README:
first `gettextize --no-changelog', then `autoreconf -i', and got this
error:
configure.ac:47: error: `po/Makefile.in' is already registered with AC_CONFIG_FILES.
../../lib/autoconf/status.m4:288: AC_CONFIG_FILES is expanded from...
configure.ac:47: the top level
autom4te: /usr/bin/m4 failed with exit status: 1
aclocal: error: echo failed with exit status: 1
autoreconf: aclocal failed with exit status: 1
The README refers to an INSTALL file with installation details, but the
file is in neither the tarball nor the git repo.
Hi
I'm doing some work on duff because I found it useful when fixing broken
rsnapshot repositories (I will make some pull requests in a few days).
Unfortunately such repositories are a bit unusual (millions of files,
mostly hardlinked in groups of 30-50).

It seems that I'm having a problem with large buckets (long lists):
because each sampled file allocates 4KB of data that is only freed at
the end of bucket processing, I'm getting "out of memory" errors at
around 3GB of allocated memory (the box is a light 32-bit Atom-based
system).
As sizeof(FileList) == 12, I see no problem increasing HASH_BITS to 16
(~800KB) or even 20 (~13MB).
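Assuming, as stated above, that sizeof(FileList) is 12 bytes, the quoted
table sizes can be sanity-checked with shell arithmetic: a table of
2^HASH_BITS buckets with one FileList head each costs
(1 << HASH_BITS) * 12 bytes.

```shell
# Rough memory cost of the bucket table, assuming sizeof(FileList) == 12
# as stated above; one FileList head per bucket, 2^HASH_BITS buckets.
echo $(( (1 << 16) * 12 ))   # 786432 bytes, roughly the ~800KB quoted
echo $(( (1 << 20) * 12 ))   # 12582912 bytes, roughly the ~13MB quoted
```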
I wonder what you think: would it be a good idea to add an option making
this runtime-configurable?

Another idea is to (optionally?) replace the sample with some simple,
fast running checksum (crc64?).
It appears that in 2021 this domain was registered after having lapsed.
Currently, duff.dreda.org seems to redirect to malvertising of some
sort. ("Your computer is infected with a virus!" type stuff.)
For checking that a backup is complete, or checking that I have all the files from a camera SD card (before I wipe the card) it would be useful to be able to run duff in a "find unique" mode that lists files which don't have duplicates.
As ever, this functionality can be constructed with an appropriate pipeline of find/sha1sum/sort/uniq or similar, but perhaps it's close enough to what duff does to be worth including?
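For what it's worth, here is a rough sketch of such a pipeline, assuming
GNU coreutils and findutils; it prints the files under a (hypothetical)
directory whose SHA-1 digest occurs exactly once, and does not handle
filenames containing newlines:

```shell
# List files that have no content-duplicate under the given directory.
# sha1sum prints a 40-char digest, two separator chars, then the path,
# so uniq -u -w 40 keeps digests seen exactly once and cut strips them.
find /some/dir -type f -exec sha1sum {} + | sort | uniq -u -w 40 | cut -c 43-
```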
Hi,

Would it be possible for you to use or introduce xxhash as a hash
function?

Thanks for duff 👍
I feel like this should be obvious, but maybe I'm missing something.
duff marks hardlinked files as "duplicates", which means that doing the
obvious thing (using duff to reduce clutter and delete duplicate files)
will result in deleting files with hardlinks (multiple filenames for the
same data). I can't think of any reason why this should be the default
rather than the opposite. Basically, -p should be the default, right?
  -p  Physical mode. Make duff consider physical files instead of hard
      links. If specified, multiple hard links to the same physical file
      will not be reported as duplicates.
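To illustrate the hard-link case described above (duff itself is not
needed for this part), two directory entries created with ln share a
single inode, so "deleting the duplicate" just removes one name for the
same data. A small demonstration, assuming GNU stat:

```shell
# Two hard links are one physical file: same inode, same data blocks.
d=$(mktemp -d)
echo data > "$d/original"
ln "$d/original" "$d/copy"          # second name for the same inode
stat -c %i "$d/original" "$d/copy"  # prints the same inode number twice
```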
With `find` you can ignore files that are of little consequence, such as
files that are really small or really big:

  # Files more than 1 gigabyte
  find -size +1G

  # Files less than 1 megabyte
  find -size -1M
This would be great for duff, because when trying to free up disk space
one wants to find the big files (e.g. videos) without the output being
flooded by static web content (e.g. jquery-1.9.2.js, bootstrap.css).

Hopefully that isn't too tough to implement, unlike sorting, which would
be great but probably algorithmically prohibitive. Being able to use
duff with pipes to do things like filtering would be even smarter, but I
don't see a way to do it with the way duff reports (except with the -e
option, which is a bit risky).
As one of the reasons to find duplicate files is to recover precious
disk space, it would be great if the default sort order of duff were
file size: deleting a single huge duplicated file is much more useful
than deleting lots of tiny ones. Or, at least, provide a command-line
option for sorting by size.
Other than that, duff is a great utility, thanks so much!
Similar to `du -h`, would it be possible to support an option that
presents file sizes in megabytes/gigabytes instead of bytes?
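As a stopgap under GNU coreutils, raw byte counts can be converted with
numfmt, which is roughly what such an option would do internally:

```shell
# Convert raw byte counts to human-readable IEC units, as du -h does.
numfmt --to=iec 1048576     # prints 1.0M
numfmt --to=iec 5368709120  # prints 5.0G
```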