Comments (7)
By the way, here's the current output from -stats
(I'm not exactly sure how to interpret the rates and multipliers):
PROC          INST  RATE       USE      QUOTA  FILL
zero          x1    6.8 MiB/s  2.4 TiB  ∞      ∞
one           x3    5.9 MiB/s  2.4 TiB  ∞      ∞
two           x0    11 MiB/s   2.4 TiB  ∞      ∞
three         x0    8.1 MiB/s  2.4 TiB  ∞      ∞
split         x1    12 MiB/s
backlog       x24   15 MiB/s
checksum      x0    26 MiB/s
index         x0    11 MiB/s
gzip          x0    12 MiB/s
parity        x0    16 MiB/s
cmd           x4    14 MiB/s
group         x0    0 B/s
concur        x21   15 MiB/s
(goroutines)  x296
from scat.
Thanks for the detailed report!
I too noticed the ever-increasing memory usage, which definitely looks like a memory leak.
Could you try recompiling with a newer version of Go?
Thanks for the quick reply :-)
This was freshly compiled with Go 1.10.3, which I believe is the latest.
However, there's one important mistake I made: I had tested that other memory increase with restic, not zbackup (I bailed on zbackup early because it seemed too slow). So it's entirely possible it's due to restic's chunker (and it was restic that died after 6 TiB, not zbackup).
Later I'll try to run it with perf and pprof and see if I can figure out where the leak is coming from. I'm a Go newbie though, so it might be hard ;-)
OK, so I ran some initial tests with pprof.
I started with a simple proc of split | { checksum | index - }, and the memory was increasing, although not as fast as in the original post. I fed it totally random data, so there were no duplicate chunks.
I discovered it seems to be leaky by design: at procs/index.go:62, it assigns the chunk hash into an in-memory map, which it later consults to see whether the chunk was already processed. I originally imagined it wouldn't do that, since it could check the same thing by seeing whether the corresponding filename exists in the output directory.
So I don't see a simple way of fixing that, short of changing how it works and possibly making it slower in the process (although the filesystem checks could well be cached by the OS).
I then ran it again with the original proc and some real data from tar, and there were many more places where large chunks of data were allocated. Some of them shrank along the way, but overall memory consumption grew, of course. Most notable were scat/split.(*splitter).Next, scat/stores/copies.(*Reg).List and scat/stores/copies.(*List).Add.
My plan is to rewrite it so that it uses an on-disk database of chunks. I'll need this for other features anyway, such as being able to restore only particular files rather than an entire backup, or being able to keep track of tape/disk changes (i.e. backing up a huge filesystem to many smaller BluRays, tapes, or USB HDDs, only a few of which are connected at a given time). This should also help with #23, as it'll then be easier to rename the output chunks (and group them into bigger ones, to also hide the individual chunks' sizes).
Hi @goblin - glad you're still active on this project, and thanks for having investigated the leak. I must admit, though, that I'm not using scat at the moment and have forgotten most of the internals, nor would I have the incentive to look at them in detail. However, from what I understand, your idea of rewriting procs/index to use an on-disk database seems sensible. Index history would have to be stored within that database instead of git (since it wouldn't be a simple text file anymore), but other than that, why not. Good luck! I'd be curious to see if this fixes the leak. Hopefully it will 🍀
May I add, I still believe in the idea behind the project and still need such a tool. I've since fallen back to cleartext syncing to Google Drive 😫, to at least have some kind of backup despite the privacy issues and risk of loss. It's just that some open issues were preventing me from using scat as I initially envisioned it, and I didn't have the guts to address them head-on. For the past few years I've had it brewing in my mind to either give it another go in the current code base, or rewrite the whole thing in Ruby. Yes, single-threaded, slow Matz Ruby - so enjoyable to code in that everything feels possible: easy to experiment with, tinker with, tear apart and rewrite, or even... make performant, paradoxically. Should that last point prove infeasible, there's Crystal, hehe.
I tried Ruby a few years ago, and I'm much more fond of learning Go at the moment ;-) Especially given that you've done so much work on it in Go.
> It's just that some open issues were preventing me from using scat as I initially envisioned it and I didn't have the guts to address them head on.
Which issues, specifically?