Comments (5)

Krzysiu commented on June 2, 2024

I like that. I recently had to write a Python script to do this. Alas, I had no idea how to implement a hashing system for "close, but not exact" copies.

from czkawka.

avibathula commented on June 2, 2024

There are hashing techniques that can tell not only whether two pieces of data are identical, but also how much they differ from each other.

Examples: SimHash (similarity hashing) and MinHash, which estimates Jaccard similarity (a mathematical measure that quantifies the similarity between two sets of elements).
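To make the idea concrete, here is a minimal Python sketch of exact Jaccard similarity and a MinHash estimate of it. The function names, the use of `hashlib.blake2b` with a per-seed salt, and the default of 128 hash functions are all assumptions chosen for illustration, not something taken from czkawka or any particular library.

```python
import hashlib


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity |A & B| / |A | B| (1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items: set, num_hashes: int = 128) -> list:
    """Keep the minimum hash value per seeded hash function.

    `items` must be a non-empty set of strings (illustrative assumption).
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(x.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for x in items
        ))
    return signature


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

The signatures are small and fixed in size, so they can be stored and compared without keeping the full sets around. The estimate gets closer to the exact value as `num_hashes` grows.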

Krzysiu commented on June 2, 2024

Yeah, I know, thanks, but my problem was implementing it specifically for directories. I think I'd have to perceptually hash the contents of both directories and then compare them somehow, keeping in mind that some elements don't fit the set.
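One simple way to approach the directory case is to treat each directory as a set of file-content hashes and take the Jaccard similarity of the two sets. The sketch below is only an illustration under that assumption: the function names are hypothetical, it uses SHA-256 of whole files (so it only matches byte-identical files, ignoring names), and files present in just one directory simply lower the score.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file's contents, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def directory_fingerprint(root) -> set:
    """Set of content hashes for every regular file under `root` (recursive)."""
    return {file_digest(p) for p in Path(root).rglob("*") if p.is_file()}


def directory_similarity(dir_a, dir_b) -> float:
    """Jaccard similarity of the two fingerprints, between 0.0 and 1.0."""
    a = directory_fingerprint(dir_a)
    b = directory_fingerprint(dir_b)
    if not a and not b:
        return 1.0  # two empty directories are treated as identical
    return len(a & b) / len(a | b)
```

Because the fingerprint is a set, several copies of the same file inside one directory count once. Catching files that are close but not identical would need a per-file similarity hash in place of SHA-256, which this sketch does not attempt.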

wwcanoer commented on June 2, 2024

Your request is implemented in JAM Software's SpaceObServer, in the "Similar Folders" tab. It lists pairs of folders and their "% similar". When you click a pair, the bottom pane compares the two directories in a way similar to Beyond Compare, with different colors depending on whether the files are identical or have the same name but a different size, date or MD5.

Unfortunately, it is expensive (over $200), since it is designed for corporate servers, and it is excruciatingly slow. I have installed the 30-day trial and am now waiting a couple of days for results on 1.7 TB of data that has a lot of duplicates.

Also, it can only look at one drive. It can't compare drives. When I tried it before, my duplicate files were spread across many backup drives, so it was not useful. Now I have consolidated all similar folders on one drive. I am waiting to see whether it will be worth the wait and let me deduplicate that drive faster than Duplicate Cleaner.

wwcanoer commented on June 2, 2024

Duplicate Cleaner has a good Duplicate Folders feature. It will identify duplicate folders that are several layers deep. A duplicate folder has a number of duplicate files but may have additional non-duplicate folders. When a folder is selected, the right pane shows its contents, but I think only the duplicate files, not the non-duplicates, so it is not ideal for choosing which to keep.

The default is to sort the largest folder first, so if you select one of those for deletion, it will automatically select every instance of its subfolders in the long duplicate folders list.

It works great when there are only a few pages of duplicate folders, but when I get hundreds of pages, it is tough to decide which to delete. I need to use other programs, like Beyond Compare, to actually compare the file trees and see which one I want to keep or delete.

To compare folders of the same name, I search for a name (e.g. My Documents) or substring in Everything (search), then right-click, copy that list and paste it into Duplicate Cleaner, which then finds duplicates in all of those folders at once.

I have periodically searched for good software for finding duplicate or similar folders, but Duplicate Cleaner and SpaceObServer are the only two real options that I have found so far.

I use Everything, WinCatalog, Duplicate Cleaner, Beyond Compare, XYplorer, TreeSize Free and, periodically, the SpaceObServer trial to dedup. I also have an Excel VBA program that finds similar folders on different drives and moves them to one drive. For example, it finds every "My Documents" folder and moves it to a folder whose name is the parent path concatenated into a single string, so that I know where it came from. I then have a list of folders like "Backup Drive 03 - Backup 2010-01-01 - Drive C - My Documents", all on one NVMe drive, so that I can dedup and consolidate them in one place with SpaceObServer (instead of having many slow backup USB drives connected at the same time).
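The naming step of that consolidation could be sketched roughly as follows, in Python rather than VBA. The function name `flattened_name`, the separate drive-label argument and the example `E:` path are assumptions made for illustration; only the resulting folder name matches the example given above.

```python
from pathlib import PureWindowsPath


def flattened_name(drive_label: str, source: str, separator: str = " - ") -> str:
    """Join a drive label and every path component into one folder name.

    The drive anchor (for example "E:\\") is dropped, because the
    human-readable label replaces it.
    """
    path = PureWindowsPath(source)
    parts = [p for p in path.parts if p != path.anchor]
    return separator.join([drive_label, *parts])


# Hypothetical source folder on a backup drive labelled "Backup Drive 03".
name = flattened_name("Backup Drive 03", r"E:\Backup 2010-01-01\Drive C\My Documents")
print(name)
```

The actual move could then be done with `shutil.move(source, destination_root / name)`, after checking that the destination name does not already exist.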
