
Comments (9)

SeeSpotRun commented on August 24, 2024

Ok good feedback.

I moved to a new branch for this: https://github.com/SeeSpotRun/rmlint/tree/mem_miser which is branched from sahib's sqlite-hash-table branch.

I just ran a synthetic test case with about 5M (admittedly very small) files, using the same command options as yours (except -vvv instead of -VVV), on 64-bit Fedora 21. It bogs down at the start of traversal but doesn't stop completely. It seems that reconstructing the file path each time we need it is quite CPU-intensive. I'll have a look at whether we can reduce the number of reconstructions required.
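To illustrate why repeated path reconstruction can get expensive, here is a minimal Python sketch. It assumes each file stores only its basename plus a pointer to a directory node, so the full path has to be rebuilt by walking up to the root. The names `DirNode`, `build_path` and `build_path_cached` are hypothetical and are not taken from the rmlint source; caching the directory part is just one possible way to cut down the number of reconstructions.

```python
from functools import lru_cache


class DirNode:
    """One directory in a parent-pointer tree (hypothetical structure)."""
    __slots__ = ("name", "parent")

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def build_path(node, basename):
    """Rebuild the full path by walking up to the root on every call."""
    parts = [basename]
    while node is not None:
        parts.append(node.name)
        node = node.parent
    return "/".join(reversed(parts))


@lru_cache(maxsize=4096)
def dir_path(node):
    """Cache the directory part so it is rebuilt once per directory,
    not once per file (assumption: many files share a directory)."""
    if node.parent is None:
        return node.name
    return dir_path(node.parent) + "/" + node.name


def build_path_cached(node, basename):
    """Same result as build_path, but reuses the cached directory prefix."""
    return dir_path(node) + "/" + basename


# The root directory has an empty name, so paths start with "/".
root = DirNode("")
home = DirNode("home", root)
user = DirNode("user", home)
print(build_path(user, "a.txt"))
print(build_path_cached(user, "a.txt"))
```

The cache trades a little memory for CPU time, which matters here because the whole point of the branch is to save memory; the cache size would need tuning against that goal.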

from rmlint.

SeeSpotRun commented on August 24, 2024

Seems to work reasonably well now. The test case above had all 5M files in one folder, all the same size, which appears to have slowed things down. The new test case has 8M files spread across 1024 folders, with only 8000 files of each size.

Some stats on RAM usage and runtime:

  • Regular rmlint 4.21 GB 236 seconds
  • mem_miser 3.43 GB 1020 seconds
  • mem_miser with option --without-fiemap 3.19 GB 1093 seconds
  • sqlite swap table 3.42 GB 1619 seconds
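For a quick read of the trade-off, the figures above can be turned into relative numbers. The short Python sketch below uses only the four measurements listed and takes regular rmlint as the baseline; the function name `compare` and the dictionary layout are just for illustration.

```python
# (RAM in GB, runtime in seconds), copied from the list above
runs = {
    "regular rmlint": (4.21, 236),
    "mem_miser": (3.43, 1020),
    "mem_miser --without-fiemap": (3.19, 1093),
    "sqlite swap table": (3.42, 1619),
}


def compare(runs, baseline="regular rmlint"):
    """Return {name: (percent RAM saved, slowdown factor)} vs. the baseline."""
    base_gb, base_secs = runs[baseline]
    result = {}
    for name, (gb, secs) in runs.items():
        ram_saved_pct = round(100 * (base_gb - gb) / base_gb, 1)
        slowdown = round(secs / base_secs, 2)
        result[name] = (ram_saved_pct, slowdown)
    return result


for name, (saved, slowdown) in compare(runs).items():
    print(f"{name}: {saved}% less RAM, {slowdown}x runtime")
```

On these numbers, mem_miser saves roughly a fifth of the RAM at a bit over four times the runtime, and the sqlite swap table saves about the same amount of RAM while running close to seven times longer.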


SeeSpotRun commented on August 24, 2024

Hey @vvs- are you able to test out the mem_miser branch on your system to see how it performs?


vvs- commented on August 24, 2024

After finding 1582327 duplicates in 655938 groups, it stalled with 100% CPU utilization. I waited for two hours and then killed it. No visible progress was reported during that time, and there was no disk activity either. The command was:

rmlint new -T df -pp --without-fiemap -o progressbar -o summary -VVV

Attached is a profile report. The report was created after running at 100% CPU for about two hours. It seems that the CPU utilization might be related to the --without-fiemap option.


SeeSpotRun commented on August 24, 2024

Ok, thanks for the update. Out of interest, do you recall how many files/GB it had left to scan? I mean, was it close to finished or still a long way to go?
I don't think --without-fiemap is related, but I suggest you drop it anyway; that should reduce disk head seek load.
To help us rule out anything not related to -pp, would you mind doing a run without -pp?
Edit: actually, hold off on that; I managed to reproduce the problem and should have a fix soon.


SeeSpotRun commented on August 24, 2024

Please try again with https://github.com/SeeSpotRun/rmlint/tree/paranoid_mem_debug (with -pp)


vvs- commented on August 24, 2024

Amazing! It worked like a charm. It used all available memory, but didn't crash and worked very fast. Some stats:

5098832 files
3045531 duplicates using 491.39 GB
1300870 originals
9h48m total run time

Yes, that's not a mistake: less than ten hours! Congratulations!

BTW, with-metadata-cache is broken in this version; it's a no-op.


SeeSpotRun commented on August 24, 2024

Great to hear.
Actually, by default it uses 256MB for buffering file comparisons, plus the overhead for file metadata and other internal structures (which with 5 million files is quite a lot). If you have extra RAM available, you can speed up the file scanning process slightly by allocating more memory to this buffer (e.g. --max-paranoid-mem=512MB).
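As a rough illustration of how a fixed buffer budget bounds the memory used by byte-by-byte comparison, here is a minimal Python sketch. It is not rmlint's implementation: the function name `files_identical` and the `buffer_budget` parameter are hypothetical, and the only figure taken from the comment above is the 256MB default. A larger budget means larger chunks and therefore fewer reads per file pair.

```python
import os


def files_identical(path_a, path_b, buffer_budget=256 * 1024 * 1024):
    """Compare two files byte by byte while holding at most
    `buffer_budget` bytes of file data in memory at once."""
    # Files of different sizes can never be identical; skip reading them.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False

    # Two chunks are held at the same time (one per file), so each
    # chunk gets half of the budget.
    chunk_size = max(1, buffer_budget // 2)

    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files reached EOF with no mismatch
                return True
```

Calling `files_identical(a, b, buffer_budget=512 * 1024 * 1024)` would double the chunk size relative to the default, which mirrors the idea of raising the buffer limit when spare RAM is available.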


SeeSpotRun commented on August 24, 2024

The changes have been merged into the sqlite-hash-table branch of the main repo (https://github.com/sahib/rmlint/tree/sqlite-hash-table), so I'll close this issue and we can continue the discussion over there if necessary.
