
scat's People

Contributors

roman2k

scat's Issues

Threshold secret sharing for data (or metadata)

Divide the data into n shares, any k of which suffice to recover the information (with fewer than k shares, nothing about the original data can be discerned). An example is Shamir's secret sharing.

This could be combined with other processors. The data payload could be split across a few cloud providers. By metadata, I mean for example an index or an ephemeral symmetric key. But the more common case could be to apply threshold secret sharing to the stream itself.
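
For illustration, here is a minimal, self-contained Go sketch of Shamir-style threshold sharing over a prime field. The field prime and the idea of sharing an ephemeral key are assumptions for the example, not anything scat currently provides.

// Minimal sketch of Shamir threshold sharing over a prime field, assuming the
// secret (e.g. an ephemeral symmetric key) fits below the prime. Names are
// illustrative only, not part of scat's API.
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// A 256-bit prime; any prime larger than the secret works.
var prime, _ = new(big.Int).SetString(
	"fffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f", 16)

type share struct{ x, y *big.Int }

// split evaluates a random degree-(k-1) polynomial with constant term = secret
// at x = 1..n, yielding n shares of which any k reconstruct the secret.
func split(secret *big.Int, n, k int) ([]share, error) {
	coeffs := []*big.Int{secret}
	for i := 1; i < k; i++ {
		c, err := rand.Int(rand.Reader, prime)
		if err != nil {
			return nil, err
		}
		coeffs = append(coeffs, c)
	}
	shares := make([]share, n)
	for i := 1; i <= n; i++ {
		x := big.NewInt(int64(i))
		y := big.NewInt(0)
		xp := big.NewInt(1)
		for _, c := range coeffs {
			term := new(big.Int).Mul(c, xp)
			y.Mod(y.Add(y, term), prime)
			xp.Mod(xp.Mul(xp, x), prime)
		}
		shares[i-1] = share{x: x, y: y}
	}
	return shares, nil
}

// combine reconstructs the secret from any k shares via Lagrange interpolation at x=0.
func combine(shares []share) *big.Int {
	secret := big.NewInt(0)
	for i, si := range shares {
		num, den := big.NewInt(1), big.NewInt(1)
		for j, sj := range shares {
			if i == j {
				continue
			}
			num.Mod(num.Mul(num, new(big.Int).Neg(sj.x)), prime)
			den.Mod(den.Mul(den, new(big.Int).Sub(si.x, sj.x)), prime)
		}
		li := new(big.Int).Mul(num, new(big.Int).ModInverse(den, prime))
		secret.Mod(secret.Add(secret, new(big.Int).Mul(si.y, li)), prime)
	}
	return secret
}

func main() {
	key, _ := rand.Int(rand.Reader, prime)         // stand-in for an ephemeral key
	shares, _ := split(key, 5, 3)                  // 5 shares, any 3 recover
	fmt.Println(combine(shares[:3]).Cmp(key) == 0) // true
}

Any 3 of the 5 shares reconstruct the key; 2 or fewer reveal nothing, which is the property that would let shares be spread across cloud providers.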

Add missing unit tests

The parts that most need unit tests should already be covered; add tests for the remaining uncovered areas. Notably:

  • stores.Mem
  • argparse.ArgFilter
  • argparse.ArgOr
  • procs.Chain: test err in ProcessFinal()
  • procs.Chain: test err in ProcessEnd()
  • procs/backlog
  • procs/concur_test: avoid relying on durations, to prevent race conditions

Is there a sensible way of only processing modified files?

Since this is a stream-based backup solution, is there any sensible way of only processing modified files, i.e. files whose size or timestamp has changed?

The issue as it currently stands is that because everything is based on deduplication, and a computationally expensive deduplication at that (before a block can be checksummed to see whether it's a duplicate, it has to be read off disk, compressed, split, paritied, etc.), it can take quite a lot of time just to process a large backup array locally and find that nothing has changed. A 20 GB set on HDDs can take around three and a half minutes; scaled to TB-scale backups that becomes hours, which is getting to an impractical length for continuous, hell, even daily backups.

One could obviously do some scripting trickery to store the timestamp of the last backup and then only back up files with a newer modification date, but then restoration would be an absolute pain in the backside (you'd basically have to restore every backup in chronological order to get a full restoration).
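
For illustration, a rough Go sketch of that timestamp trick, with a hypothetical marker file; it only selects the files to feed to tar, and the restore-ordering drawback described above still applies.

// Sketch of the "last backup timestamp" trick: remember when the last backup
// ran and only feed newer files to tar/scat. The marker-file path is
// illustrative; scat itself knows nothing about it.
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

func main() {
	const marker = "/var/backups/scat.last" // hypothetical marker file
	var since time.Time
	if st, err := os.Stat(marker); err == nil {
		since = st.ModTime()
	}
	// Print files modified since the last run; pipe this list to tar (e.g. tar c -T -).
	root := os.Args[1]
	filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		info, err := d.Info()
		if err == nil && info.ModTime().After(since) {
			fmt.Println(path)
		}
		return nil
	})
	os.WriteFile(marker, nil, 0o644) // touch the marker for next time
}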

I'm simply wondering:

  1. What was the reasoning behind making this software stream-based?
  2. What is the best way to avoid having to read, process, compress, checksum, split, and parity every chunk just to discard it because it's not new?

Thanks.

rclone: avoid temp files

The rclone store is the only part we need temp files for. The rest works with stdin+stdout.

Waiting for rclone rcat:

Then get rid of all the temp-file-related code (unless we add a pathcmd proc: a command proc that takes path(s) as input).
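
Once rcat is available, the store can stream a chunk straight into rclone over stdin instead of going through a temp file. A hedged Go sketch of that call (the remote path and helper name are placeholders):

// Sketch: stream a chunk into `rclone rcat remote:path` so no temp file is needed.
package main

import (
	"fmt"
	"io"
	"os"
	"os/exec"
)

// rcat uploads data read from r to the given rclone remote path via stdin.
func rcat(remotePath string, r io.Reader) error {
	cmd := exec.Command("rclone", "rcat", remotePath)
	cmd.Stdin = r // rclone reads the chunk from stdin and uploads it
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("rclone rcat: %v: %s", err, out)
	}
	return nil
}

func main() {
	// Example: cat chunk | thisprog drive:scat/deadbeef (remote path is a placeholder).
	if err := rcat(os.Args[1], os.Stdin); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}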

New proc for rebuilding missing data/parity shards from old snapshots on new stores

Currently, uparity recovers errors at restore time (failed integrity check, missing data) but in read-only mode: restored data is intact, but stores still contain bad or missing data.

On a subsequent backup, parity will create missing data and/or parity shards on new stores for the newest data being piped in. But bad/lost shards from previous backups aren't recovered (rebuilt and stored) on the new stores.

Add a new proc, parscrub, that rebuilds missing shards like uparity does, but also writes them to the specified stores, reusing data from previous stores and writing to new stores so as to meet the min and excl requirements.


From question/request/suggestion by @Technifocal on reddit ([comment][redditcmt]):
[redditcmt]:https://www.reddit.com/r/golang/comments/5uthji/scat_decentralized_trustless_backup_tool/ddy25k8/

@Technifocal:

Third question: say I lose a store (provider/hard-drive/whatever), how do I reshard/rebalance my data, either to the remaining stores or by replacing the store? I understand new backups going forward will be correctly balanced, but what about my backlog of backups?

@Roman2K:

When you lose a store, you would do the equivalent of replacing a disk in a RAID array: in the stripe() proc of the backup script, replace the line of the defective store with a new one and re-run the backup. Chunks will be written to the new store in such a way as to satisfy the min and excl requirements, reusing chunks from the old stores.

@Technifocal:

I haven't done any testing yet (sorry!), but I think you missed the point of my question; I'll try to explain below:

  1. I have a directory, foobar, with two files in it, foo and bar. They are unique; no deduplication can occur.
  2. I back up foobar with parity 2 1 (2 data shards and 1 parity shard), so that I can lose one store without issue.
  3. I now delete foo from the directory and add baz; now I have backed up foo and bar, but locally only have bar and baz.
  4. I lose one of my three stores; no problem, parity exists.
  5. I add a new store, which has no data.
  6. I back up foobar, which includes bar and baz, but not foo anymore. Any lost data of bar will be reproduced on the new store, and baz (being backed up for the first time) will be uploaded.
  7. I lose another store (I'm terribly unlucky/clumsy).

At this point, unless I'm mistaken, I've now lost 2/3 stores for my original backup (of foo and bar) and 1/3 stores for my new backup (bar and baz). This surely means that I can still recover bar and baz, but foo is completely lost. Am I mistaken?

@Roman2K:

That's exactly right. You have lost 2/3 stores of the original backup and foo is now unrecoverable.

@Technifocal:

My question was: is there any way, after losing the first store, to retroactively go back, repair, and re-upload old backups without requiring the files locally? I understand this would use a lot of I/O (either network I/O in the case of a cloud provider (downloading, repairing, uploading), or disk I/O in the case of local disks (reading, processing, writing)), but I feel like it'd increase the longevity of backups, unless I am missing something.

@Roman2K:

There currently isn't a way to do the recovery retroactively without re-running the backup with the original data, which is lost. But all the components are there; they just need to be assembled into a new proc, which I propose calling parscrub. I totally agree this is needed to increase the longevity of old snapshots. You would need to run a new scrub script, on a regular basis, for the index file of each snapshot whose longevity you want to ensure.

@Technifocal:

I had an idea while I was sitting on the tube, would running a restore piped directly back into a backup solve this issue?

Something like:

for i in *.index; do
   scat -stats "unindex unparity unmake unstuff uncompress unencrypt unmagic unupload undone" < ${i} | scat -stats "index parity make stuff compress encrypt magic upload done";
done

That would go back, download all the old backups, and re-upload them with the new parity shards, and shouldn't(?) upload anything that wasn't corrupted because it was already all uploaded. Is that correct?

If so, very elegant, just went a bit over my head.

@Roman2K:

That would work too, though it's a bit convoluted: it's a shame to have to join and split right back afterwards 🙈 But yes, the end result would be the same as a parscrub in a single scat run, unless I'm missing something.

Faster checksumming

This would encourage higher backup frequency, especially on Android. I'd be very excited to try something GPU-accelerated, because the computations seem highly parallelizable.
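
Even without a GPU, the work parallelizes across chunks. A rough Go sketch of hashing chunks concurrently, one worker per CPU, with SHA-256 standing in for whatever hash scat actually uses:

// Sketch: checksum chunks concurrently. Each chunk hashes independently, so the
// work spreads over all CPUs (and could, in principle, move to a GPU).
package main

import (
	"crypto/sha256"
	"fmt"
	"runtime"
	"sync"
)

func checksums(chunks [][]byte) [][32]byte {
	sums := make([][32]byte, len(chunks))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				sums[i] = sha256.Sum256(chunks[i])
			}
		}()
	}
	for i := range chunks {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return sums
}

func main() {
	chunks := [][]byte{[]byte("foo"), []byte("bar"), []byte("baz")}
	for _, s := range checksums(chunks) {
		fmt.Printf("%x\n", s)
	}
}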

Announcements

Subscribe to this issue for notifications about future developments. New version announcements will be made in the comments section.

(Closed on purpose so only I can post comments.)

Validating mode of operation for compression, dedup, encryption, ECC/parity, and storage

For the most common proc strings, make it possible to run a simultaneous "counterpart" process that "proves", inline, that the operation's reverse will succeed.

Taking one operation in isolation to demonstrate (compression), the idea is to run the reverse process as the data is still being ingested to confirm that there won't be any issues with a future restore. Suppose a compression algorithm has a bug where input can't be decompressed properly (maybe it leads to a crash), and this isn't known until decompression time. This would catch that; also, in the worst case, one could keep around a copy of the scat binary and supporting libraries/programs to ensure that reconstruction is possible even amidst potential changes and updates of underlying libraries.

The compression validation would involve decompression and confirmation that the result equals the original input (after validation succeeds, the decompressed block can be discarded and the next block of the stream processed). The validation phase might "lag" behind the input, but arbitrarily sized input streams can still be processed practically, without the entire stream having to be written out before validation starts. For example, to confirm that terabytes of input can be successfully decompressed, one should not need to first write the entire result before validating.
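
As a concrete illustration of the compression case, here is a minimal Go sketch (using gzip as a stand-in for scat's actual compression proc) that compresses a block, immediately decompresses it, and compares against the original before the next block is processed:

// Sketch of the inline "counterpart" check for compression: each block is
// round-tripped and compared before moving on, so validation never needs the
// whole stream at once.
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

func compressAndVerify(block []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(block); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	// Counterpart: prove the reverse operation succeeds right now.
	zr, err := gzip.NewReader(bytes.NewReader(buf.Bytes()))
	if err != nil {
		return nil, err
	}
	back, err := io.ReadAll(zr)
	if err != nil {
		return nil, err
	}
	if !bytes.Equal(back, block) {
		return nil, fmt.Errorf("round-trip mismatch: restore would fail")
	}
	return buf.Bytes(), nil
}

func main() {
	out, err := compressAndVerify([]byte("some chunk of the stream"))
	fmt.Println(len(out), err)
}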

Similar validation can be put in place for encryption (the reverse being decryption, whether a symmetric or a hybrid cryptosystem is in use). Likewise, ECC/parity can confirm that the redundantly expanded data restores to the original when run in reverse. Deduplication can also be validated.

I especially want to highlight storage/rclone validation. When I write to a remote cloud provider, I'd like a separate thread to reconfirm that the stored data is exactly the same as the blocks that have just been written. One method of validating is to "trust" a query to the cloud provider for the checksum of the block (if this information is available through their API); another is to fetch the payload in another thread and validate that it comes back as expected. Making it a separate thread and an independent API query could avoid overly "trusting" a local cache. Without validation of this kind, there could be silent information loss.
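
A hedged sketch of that readback check: after a chunk is written, a separate goroutine fetches it back from the store (here via rclone cat, purely as a placeholder) and compares SHA-256 checksums instead of trusting any local cache:

// Sketch of the post-upload check: re-download the chunk and compare checksums.
// fetchFromStore is a placeholder, not part of scat.
package main

import (
	"crypto/sha256"
	"fmt"
	"os/exec"
)

func fetchFromStore(remotePath string) ([]byte, error) {
	return exec.Command("rclone", "cat", remotePath).Output()
}

// verifyAsync re-downloads the chunk and reports whether it matches what was written.
func verifyAsync(remotePath string, written []byte, result chan<- error) {
	go func() {
		want := sha256.Sum256(written)
		got, err := fetchFromStore(remotePath)
		if err != nil {
			result <- err
			return
		}
		if sha256.Sum256(got) != want {
			result <- fmt.Errorf("%s: stored data differs from what was uploaded", remotePath)
			return
		}
		result <- nil
	}()
}

func main() {
	res := make(chan error, 1)
	verifyAsync("drive:scat/deadbeef", []byte("chunk payload"), res) // placeholder path
	fmt.Println(<-res)
}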

At the top I said "make it possible", and by that I mean even if it requires the user to manually tweak the chain construction to include this validation. In principle, the validating mode of operation could be constructed automatically from common pipeline usage patterns. If implemented, this would offer a unique assurance that the data can be reconstructed.

Logging

Currently, there's next to no logging:

  • errors are the main way information is reported back to the user, being finally written to stderr in main
  • some messages may be printed to stderr when there's no caller to return an error to (e.g. cascade in multireader, error correction in parity, write errors in ansirefresh)

Add a proper logger (a rough sketch follows this list) for:

  • those errors
  • tracing of normal activity
    • begin/end of chunk processing
      • size before, after
      • elapsed time
    • passing from one proc to the next
    • within procs
      • reads, writes, downloads, uploads
      • error detection and correction
    • etc.
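
As a minimal sketch of the chunk-level tracing above, using Go's standard log package (the wrapper and its types are illustrative, not scat's actual ones):

// Minimal sketch of per-chunk tracing: begin/end, sizes before and after,
// elapsed time, and errors, all on stderr.
package main

import (
	"log"
	"os"
	"time"
)

var logger = log.New(os.Stderr, "scat: ", log.LstdFlags|log.Lmicroseconds)

// traceProcess wraps one chunk-processing step with begin/end log lines.
func traceProcess(name string, in []byte, process func([]byte) ([]byte, error)) ([]byte, error) {
	logger.Printf("%s: begin chunk, %d bytes in", name, len(in))
	start := time.Now()
	out, err := process(in)
	if err != nil {
		logger.Printf("%s: error after %v: %v", name, time.Since(start), err)
		return nil, err
	}
	logger.Printf("%s: end chunk, %d -> %d bytes in %v", name, len(in), len(out), time.Since(start))
	return out, nil
}

func main() {
	traceProcess("gzip", []byte("example"), func(b []byte) ([]byte, error) { return b, nil })
}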

godoc

Add source comments for godoc once the internal API has stabilized, documenting the public APIs for use as a library.

Streaming file listing

Lists of existing files are currently buffered as slices due to a bad initial decision. This shouldn't have too much of an impact on memory usage below the terabyte range, but it still feels wrong.

Switch to channel-based streaming.
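
A sketch of what the channel-based version could look like (names illustrative): the walk runs in its own goroutine, entries are consumed as they are produced, and memory stays flat regardless of how many files exist. A real version would also propagate the walk error, for example over a second channel.

// Sketch of a channel-based listing: entries are sent as they are found
// instead of being accumulated in a slice.
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// listFiles streams paths under root; the channel is closed when the walk ends.
func listFiles(root string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
			if err == nil && !d.IsDir() {
				out <- path
			}
			return nil
		})
	}()
	return out
}

func main() {
	for path := range listFiles(os.Args[1]) {
		fmt.Println(path)
	}
}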

Increasing memory usage

Hi, thanks for this great backup solution! :-)

I'm wondering why it's consuming so much RAM during backup. I'm using this proc:

tar c somefiles | pv | scat -stats "split | 
  backlog 24 { 
    checksum | 
    index - | 
    gzip | 
    parity 3 1 | 
    checksum | 
    cmd gpg --batch -e -r ABCDEF01 -z 0 | 
    group 4 | 
    concur 4 stripe(1 3 zero=cp(/mnt/caddy/0 3) one=cp(/mnt/caddy/1 3) two=cp(/mnt/caddy/2 3) three=cp(/mnt/caddy/3 3)) 
}"

I'm currently at about 10 TiB of data from tar, and scat is now using 57 GiB of RAM. This amount keeps increasing; at about 5 TiB it was around 30 GiB. I've noticed zbackup behaved similarly (but it died with a stack trace after about 6 TiB of input).

What's this RAM needed for?

The index produced on stdout is currently 2.5 GiB in size, so even if scat is storing all the checksums of all the chunks produced so far, it's still using over 40x as much memory as it should :-S But storing all the chunk checksums in RAM shouldn't be necessary, because the filesystem could be queried to see if they exist already...
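
For what it's worth, a minimal sketch of the existence check suggested above, assuming (purely for illustration) that a local store keeps one file per chunk, named after its checksum:

// Minimal sketch: ask the destination store whether a chunk already exists
// instead of keeping every checksum in memory. The path layout is a placeholder,
// not scat's actual one.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// chunkExists reports whether a chunk with this hex checksum is already stored.
func chunkExists(storeDir, hexSum string) bool {
	_, err := os.Stat(filepath.Join(storeDir, hexSum))
	return err == nil
}

func main() {
	fmt.Println(chunkExists("/mnt/caddy/0", "deadbeef")) // placeholder store and checksum
}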

Thanks :-)

Purge

Add a command to free up space on stores by garbage-collecting chunks unreachable from given snapshot indexes. This is the equivalent of deleting a snapshot in restic or in COW filesystems.
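
Conceptually this is a mark-and-sweep over chunk files. A rough Go sketch, assuming (for illustration only) a local store with one file per chunk and an index format of one checksum per line; scat's real index format may differ:

// Mark-and-sweep sketch for purge: mark every chunk referenced by the given
// snapshot indexes, then sweep (delete) unreferenced chunk files from a local store.
package main

import (
	"bufio"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	storeDir := os.Args[1]
	reachable := map[string]bool{}

	// Mark: read every index file passed on the command line.
	for _, idx := range os.Args[2:] {
		f, err := os.Open(idx)
		if err != nil {
			panic(err)
		}
		sc := bufio.NewScanner(f)
		for sc.Scan() {
			if fields := strings.Fields(sc.Text()); len(fields) > 0 {
				reachable[fields[0]] = true
			}
		}
		f.Close()
	}

	// Sweep: remove chunk files no index references.
	entries, _ := os.ReadDir(storeDir)
	for _, e := range entries {
		if !e.IsDir() && !reachable[e.Name()] {
			os.Remove(filepath.Join(storeDir, e.Name()))
		}
	}
}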

Dockerfile

Suggestion by @Technifocal originally posted in the defunct GitLab repo (old issue):

Adding a Dockerfile to this repo would allow this software to be run in a container, providing benefits such as easy deployment to multiple servers, easily manageable repos using volume mounting in Docker, and more.

If at all possible (although I am not sure whether Docker Hub supports GitLab), produce automatic builds on Docker Hub using webhooks from this repo.

This dockerfile should also include any dependencies scat might use, for example:

  • tar (? -- should this be done in-container or passed in through another means, to support cases where the user doesn't want to tar their files and instead uses something else)
  • git (? -- should this be done in-container or dealt with outside the container, to support cases where the user does not wish to use git versioning)
  • ssh
  • rclone
  • gpg
  • pv(? Is this ever used by scat? Or is this just a method of piping in?)
  • gzip(? Is this ever used by scat? Or is this done programmatically?)

This dockerfile should also have proper volumes defined, such as:

  • Source data, if being done in-container
  • Destination local stores
  • rclone configuration data
  • gpg keys for signing and encrypting

Along with accepting defined environment variables, such as:

  • Proc string (or possibly not, and just have the user define their own command using the Docker run 'COMMAND' argument?)

I can look at attempting to make this Dockerfile myself if preferred. I am not experienced in the field, but the documentation doesn't look too hard, at which point it could possibly be merged into this repo.

index file with a space in the name?

It doesn't seem possible to pass a filename including a space to the index proc, or at least quoting it doesn't work. (I would never choose to name the file with a space but my script is using the name of the directory it's backing up, and naturally one of these has spaces in the name...)

Treat decryption failure as a read error?

Question by @lavalamp originally posted in the defunct GitLab repo (old issue):

I noticed while reading https://github.com/klauspost/reedsolomon:

The final (and important) part is to be able to reconstruct missing shards. For this to work, you need to know which parts of your data is missing. The encoder does not know which parts are invalid, so if data corruption is a likely scenario, you need to implement a hash check for each shard. If a byte has changed in your set, and you don't know which it is, there is no way to reconstruct the data set.

So I thought I'd give it a try and deliberately changed a bit in a test backup. Unfortunately, this resulted in the restore not working (removing the file entirely allowed the restore to succeed, as expected). It seems like the problem is that if the decrypt step fails, the entire restore is aborted. I guess ideally, the decryption failure ought to be treated the same as if the remote shard was missing. Maybe there's a way to fix my restore script?

    uindex | backlog 8 {
      backlog 4 multireader(
        a=cp(/path/to/a)
        b=cp(/path/to/b)
        c=cp(/path/to/c)
      ) |
      cmd gpg --args --to --decode |
      uchecksum |
      group 3 |
      uparity 2 1 |
      cmd unxz
    } |
    join -

(sorry I keep filing issues, I think the concept is pretty cool and I'm attempting to use scat to backup my own files...)

Checksum before encryption breaks deniability

A slight cryptographic problem: checksumming the chunk before encryption means that someone who has access to the chunk filenames can prove that you backed up some sensitive data. This does not provide deniability.

If the chunks could be checksummed after encryption (or renamed to something else), then the backup scheme would correctly hide whatever data we're encrypting.

In other words, naming the chunks with the checksum of the data before encryption allows Google/Amazon/Dropbox/wherever-you-store-your-encrypted-backup to ban you on the grounds of supposed copyright infringement. That's because if they know the copyrighted data, they can chunk it themselves and they will know by the name of your file and its size that you're potentially holding the same (copyrighted) data on their machines.

I'm thinking the index could be extended to also hold the checksum of the chunk after encryption, so that it'd provide a 1-1 mapping to the hidden filename. But this would require you to have all the previous index files when adding data to a backup :-(
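
To make the proposal concrete, a hedged Go sketch: the chunk stored remotely is named after the checksum of the ciphertext, while the index records the plaintext-checksum to ciphertext-checksum mapping. AES-GCM is used here only as a stand-in for the gpg cmd proc.

// Sketch: the store only ever sees the ciphertext checksum; the index keeps the
// mapping from plaintext checksum to ciphertext checksum.
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

func main() {
	key := make([]byte, 32)
	rand.Read(key)
	chunk := []byte("some chunk of backed-up data")

	plainSum := sha256.Sum256(chunk) // today's chunk name: identifies the plaintext

	block, _ := aes.NewCipher(key)
	gcm, _ := cipher.NewGCM(block)
	nonce := make([]byte, gcm.NonceSize())
	rand.Read(nonce)
	ciphertext := gcm.Seal(nonce, nonce, chunk, nil)

	cipherSum := sha256.Sum256(ciphertext) // proposed chunk name: reveals nothing about the plaintext

	// The index would record both, giving a 1-1 mapping without exposing plainSum to the store.
	fmt.Printf("index entry: %x -> %x\n", plainSum[:8], cipherSum[:8])
}

Because encryption with a fresh nonce isn't deterministic, the ciphertext checksum can't be recomputed from the plaintext alone, which is exactly why adding data to an existing backup would require the previous index files.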
