Mem: 3087880k total, 3066536k used, 21344k free, 83468k buffers
Swap: 2104476k total, 29812k used, 2074664k free, 147640k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
10518 xuefer    20   0 2387m 2.3g 2052 D    8 77.9   2:47.64 hardlink
CPU is not the problem, as you can see: hardlink is at 8% CPU but 77.9% memory, and the memory use keeps going up.
Do you know why it uses so much memory? I'm running it against 679433 files. Maybe the filecmp module is caching the files being read?
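For reference, filecmp.cmp() does memoize comparison results in a module-level dict, filecmp._cache, and on the Python 2 version of the module nothing ever prunes it, so with this many files that alone could account for steady growth. A minimal workaround sketch (note _cache is a private detail; Python 3.4+ has a public filecmp.clear_cache()):

    import filecmp

    # each deep comparison outcome is memoized under a key of
    # (path1, path2, stat signature 1, stat signature 2)
    same = filecmp.cmp('/path/a', '/path/b', shallow=False)
    filecmp._cache.clear()  # private, but empties the memo between batches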
In any case it takes very long to complete, and I don't think it's optimal. Sorting all the files together by content is not a good idea, because one file may be read multiple times due to the sorting algorithm, and the system-level file cache may get flushed when the data set is bigger than the cache.
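A rough back-of-the-envelope on why hashing should win, assuming a plain comparison sort over the whole set (my numbers, not measured):

    import math

    n = 679433
    # a comparison sort over file contents does O(n log n) comparisons,
    # each of which may re-read file data from disk
    print(int(n * math.log(n, 2)))  # ~13.2 million content comparisons
    # hashing reads each file exactly once, then compares tiny digests
    print(n)                        # 679433 reads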
I would suggest that hardlink compute MD5 or SHA-1 hashes, like other de-duplication tools do. For hashing, an MD5 digest takes 32 bytes as a hex string, or 16 bytes in binary:
679433 * 32 / 1024 / 1024 ≈ 20 MB (and somewhat more due to the dictionary and Python object overhead)
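A quick sanity check on that arithmetic, with the digest sizes taken straight from hashlib:

    import hashlib

    n = 679433
    print(len(hashlib.md5(b'x').hexdigest()))  # 32: hex string form
    print(len(hashlib.md5(b'x').digest()))     # 16: raw binary form
    print(n * 32 / 1024.0 / 1024.0)            # ~20.7 MB of hex digests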
Just before it compares two files, it would compare the files' MD5 hashes first:
    sizehash = {}  # size -> files already seen with that size

    for file in regularFiles:
        if file.size in sizehash:
            compare(file, sizehash[file.size][0])
        sizehash.setdefault(file.size, []).append(file)

    def compare(file1, file2):
        if md5(file1) != md5(file2):  # only calc the MD5 JIT, then cache it
            return False
        if filecmp.cmp(file1, file2):
            hardlink(file1, file2)
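For what it's worth, here is a self-contained version of that sketch (helper names like try_link and dedup are mine; the real hardlink tool also checks owners, modes, mtimes and same-inode cases, which this skips):

    import filecmp, hashlib, os, sys

    md5cache = {}  # path -> 16-byte digest, computed JIT

    def md5(path):
        # hash a file only the first time it is involved in a comparison
        if path not in md5cache:
            h = hashlib.md5()
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(1 << 20), b''):
                    h.update(chunk)
            md5cache[path] = h.digest()
        return md5cache[path]

    def try_link(keep, dup):
        # cheap digest reject first; full byte compare only on a hash match
        if md5(keep) != md5(dup):
            return False
        if not filecmp.cmp(keep, dup, shallow=False):
            return False
        os.unlink(dup)
        os.link(keep, dup)  # replace the duplicate with a hard link
        return True

    def dedup(root):
        sizehash = {}  # size -> first regular file seen with that size
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                size = os.path.getsize(path)
                if size in sizehash:
                    try_link(sizehash[size], path)
                else:
                    sizehash[size] = path

    dedup(sys.argv[1])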
Let's see if that is faster, given that being disk-bound is the real problem.