Mem: 3087880k total, 3066536k used, 21344k free, 83468k buffers
Swap: 2104476k total, 29812k used, 2074664k free, 147640k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
10518 xuefer    20   0 2387m 2.3g 2052 D    8 77.9   2:47.64 hardlink
CPU is not the problem, as you can see: hardlink is at 8% CPU but 77.9% memory, and the memory use keeps going up.
Do you know why it uses so much memory? I'm running it against 679433 files. Maybe the filecmp module is caching the files being read?
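For reference, filecmp.cmp() does memoize comparison results in a module-level dict, filecmp._cache, and on the Python 2 version of the module nothing ever prunes it, so with this many files that alone could account for steady growth. A minimal workaround sketch (note _cache is a private detail; Python 3.4+ has a public filecmp.clear_cache()):

    import filecmp

    # each deep comparison outcome is memoized under a key of
    # (path1, path2, stat signature 1, stat signature 2)
    same = filecmp.cmp('/path/a', '/path/b', shallow=False)
    filecmp._cache.clear()  # private, but empties the memo between batches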
In any case it takes very long to complete, and I don't think it's optimal. Sorting all the files together by content is not a good idea, because one file may be read multiple times due to the sorting algorithm, and the system-level file cache may get flushed when the data set is bigger than the cache.
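A rough back-of-the-envelope on why hashing should win, assuming a plain comparison sort over the whole set (my numbers, not measured):

    import math

    n = 679433
    # a comparison sort over file contents does O(n log n) comparisons,
    # each of which may re-read file data from disk
    print(int(n * math.log(n, 2)))  # ~13.2 million content comparisons
    # hashing reads each file exactly once, then compares tiny digests
    print(n)                        # 679433 reads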
I would suggest that hardlink compute MD5 or SHA-1 hashes, like other de-duplication tools do. For hashing, an MD5 digest takes 32 bytes as a hex string, or 16 bytes in binary:
679433 * 32 / 1024 / 1024 ≈ 20 MB (and somewhat more due to the dictionary and Python object overhead)
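A quick sanity check on that arithmetic, with the digest sizes taken straight from hashlib:

    import hashlib

    n = 679433
    print(len(hashlib.md5(b'x').hexdigest()))  # 32: hex string form
    print(len(hashlib.md5(b'x').digest()))     # 16: raw binary form
    print(n * 32 / 1024.0 / 1024.0)            # ~20.7 MB of hex digests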
Just before it compares two files, it would compare the files' MD5 hashes first:
    sizehash = {}  # size -> files already seen with that size

    for file in regularFiles:
        if file.size in sizehash:
            compare(file, sizehash[file.size][0])
        sizehash.setdefault(file.size, []).append(file)

    def compare(file1, file2):
        if md5(file1) != md5(file2):  # only calc the MD5 JIT, then cache it
            return False
        if filecmp.cmp(file1, file2):
            hardlink(file1, file2)
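For what it's worth, here is a self-contained version of that sketch (helper names like try_link and dedup are mine; the real hardlink tool also checks owners, modes, mtimes and same-inode cases, which this skips):

    import filecmp, hashlib, os, sys

    md5cache = {}  # path -> 16-byte digest, computed JIT

    def md5(path):
        # hash a file only the first time it is involved in a comparison
        if path not in md5cache:
            h = hashlib.md5()
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(1 << 20), b''):
                    h.update(chunk)
            md5cache[path] = h.digest()
        return md5cache[path]

    def try_link(keep, dup):
        # cheap digest reject first; full byte compare only on a hash match
        if md5(keep) != md5(dup):
            return False
        if not filecmp.cmp(keep, dup, shallow=False):
            return False
        os.unlink(dup)
        os.link(keep, dup)  # replace the duplicate with a hard link
        return True

    def dedup(root):
        sizehash = {}  # size -> first regular file seen with that size
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                size = os.path.getsize(path)
                if size in sizehash:
                    try_link(sizehash[size], path)
                else:
                    sizehash[size] = path

    dedup(sys.argv[1])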
Let's see if that is faster, given that being disk-bound is the real problem.