
Comments (7)

goblin commented on June 14, 2024

By the way, here's the current output from -stats (I'm not exactly sure how to interpret the rates and multipliers):

           PROC INST            RATE            USE          QUOTA         FILL
           zero x1         6.8 MiB/s        2.4 TiB              ∞            ∞
            one x3         5.9 MiB/s        2.4 TiB              ∞            ∞
            two x0          11 MiB/s        2.4 TiB              ∞            ∞
          three x0         8.1 MiB/s        2.4 TiB              ∞            ∞
          split x1          12 MiB/s                                           
        backlog x24         15 MiB/s                                           
       checksum x0          26 MiB/s                                           
          index x0          11 MiB/s                                           
           gzip x0          12 MiB/s                                           
         parity x0          16 MiB/s                                           
            cmd x4          14 MiB/s                                           
          group x0             0 B/s                                           
         concur x21         15 MiB/s                                           
   (goroutines) x296

from scat.

Roman2K commented on June 14, 2024

Thanks for the detailed report!

I too noticed the ever increasing memory usage which definitely looks like a memory leak.

Could you try recompiling with a new version of Go?


goblin commented on June 14, 2024

Thanks for the quick reply :-)

This was freshly compiled using Go 1.10.3, which I believe is the latest.

However, I made one important mistake: the other memory increase I mentioned was tested with restic, not zbackup (I bailed on zbackup early because it seemed too slow). So it's entirely possible that it's due to restic's chunker (and it was restic that died after 6 TiB, not zbackup).

Later I'll try to run it with perf and pprof and see if I can figure out where the leak is coming from. I'm a Go newbie though, so it might be hard ;-)


goblin commented on June 14, 2024

OK, so I ran some initial tests with pprof.

I started with a simple proc of split | { checksum | index - }, and the memory was increasing, although not as fast as in the original post. I fed it totally random data so there were no duplicate chunks.

I discovered it seems to be leaky by design: in procs/index.go:62, it's assigning the chunk-hash to an in-memory map. It later uses that map to see if the chunk was already processed. I originally imagined it wouldn't do that, as it can check that by seeing if the appropriate filename exists in the output directory.

So I don't see a simple way of fixing that, short of changing how it works and possibly making it slower in the process (although the filesystem checks can perhaps be cached by the OS).

I then ran it again with the original proc and some real data from tar, and there were many more places where large chunks of data were allocated. Some of them shrank during the run, but overall the memory consumption grew, of course. Most notable were scat/split (*splitter) Next, scat/stores/copies (*Reg) List and scat/stores/copies (*List) Add.

My plan is to rewrite it so that it uses an on-disk database of chunks. I'll also need this for other features, such as being able to restore only particular files rather than an entire backup, or being able to keep track of tape/disk changes (i.e. backing up a huge filesystem to many smaller BluRays, tapes, or USB HDDs, only a few of which are connected at a given time). This should also help with #23, as it'll be easier to rename the output chunks then (and group them into bigger ones, to also hide the individual chunks' sizes).


Roman2K commented on June 14, 2024

Hi @goblin - glad you're still active on this project and thanks for having investigated the leak. I must admit though, I'm not using scat at the moment and have forgotten most of the internals, nor do I have much incentive to look at them in detail. However, from what I understand, your idea of rewriting procs/index to use an on-disk database seems sensible. Index history would have to be stored within that database instead of git (since it wouldn't be a simple text file anymore), but other than that, why not. Good luck! I'd be curious to see if this fixes the leak. Hopefully it will 🍀

May I add, I still believe in the idea behind the project and still need such a tool. I've since fallen back to cleartext syncing to Google Drive 😫 to at least have some kind of backup despite the privacy issues and risks of loss. It's just that some open issues were preventing me from using scat as I initially envisioned it and I didn't have the guts to address them head on. For the past few years I've been mulling over either having another go at it in the current code base or rewriting the whole thing in Ruby. Yes, single-threaded, slow Matz Ruby - so enjoyable to code in that everything feels possible: easy to experiment, tinker with, tear apart and rewrite, or even... make performant, paradoxically. Should that last point prove infeasible, there's Crystal, hehe.


goblin commented on June 14, 2024

I tried Ruby a few years ago, and I'm much more fond of learning Go at the moment ;-) Especially given that you've done so much work on it in Go.


goblin commented on June 14, 2024

> It's just that some open issues were preventing me from using scat as I initially envisioned it and I didn't have the guts to address them head on.

Which issues, specifically?

