
zrepl

zrepl is a one-stop ZFS backup & replication solution.

User Documentation

User Documentation can be found at zrepl.github.io.

Bug Reports

  1. If the issue is reproducible, enable debug logging, reproduce and capture the log.
  2. Open an issue on GitHub, with logs pasted as GitHub gists / inline.

Feature Requests

  1. Does your feature request require default values / some kind of configuration? If so, think of an expressive configuration example.
  2. Think of at least one use case that generalizes from your concrete application.
  3. Open an issue on GitHub with example conf & use case attached.
  4. Optional: Post a bounty on the issue, or contact Christian Schwarz for contract work.

The above does not apply if you already implemented everything. Check out the Coding Workflow section below for details.

Building, Releasing, Downstream-Packaging

This section provides an overview of the zrepl build & release process. Check out docs/installation/compile-from-source.rst for build-from-source instructions.

Overview

zrepl is written in Go and uses Go modules to manage dependencies. The documentation is written in reStructuredText using the Sphinx framework.

Install build dependencies using ./lazy.sh devsetup. lazy.sh uses python3-pip to fetch the build dependencies for the docs; you might want to use a venv. If you just want to install the Go dependencies, run ./lazy.sh godep.
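For example, a throwaway venv can be set up in one line before running the script (paths are illustrative):

python3 -m venv .venv && . .venv/bin/activate && ./lazy.sh devsetup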

The test suite is split into pure Go tests (make test-go) and platform tests that interact with ZFS and thus generally require root privileges (sudo make test-platform). Platform tests run on their own pool named zreplplatformtest, which is created on a file vdev in /tmp.

For a full code coverage profile, run make test-go COVER=1 && sudo make test-platform && make cover-merge. An HTML report can be generated using make cover-html.

Code generation is triggered by make generate. Generated code is committed to the source tree.

Build & Release Process

The Makefile caters to the needs of developers & CI, not distro packagers. It provides phony targets for

  • local development (building, running tests, etc)
  • building a release in Docker (used by the CI & release management)
  • building .deb and .rpm packages out of the release artifacts.

Build tooling & dependencies are documented as code in lazy.sh. Go dependencies are then fetched by the go command and pip dependencies are pinned through a requirements.txt.

We use CircleCI for continuous integration. There are two workflows:

  • ci runs for every commit / branch / tag pushed to GitHub. It is supposed to run fast (<5 min) and provide quick feedback to developers. It runs formatting checks, lints and tests on the most important OSes / architectures. Artifacts are published to minio.cschwarz.com (see GitHub Commit Status).

  • release runs

    • on manual triggers through the CircleCI API (in order to produce a release)
    • periodically on master

    Artifacts are published to minio.cschwarz.com (see GitHub Commit Status).

Releases are issued via Git tags + the GitHub Releases feature. The procedure to issue a release is as follows:

  • Issue the source release:
    • Git tag the release on the master branch.
    • Push the tag.
    • Run ./docs/publish.sh to re-build & push zrepl.github.io.
  • Issue the official binary release:
    • Run the release pipeline (triggered via CircleCI API)
    • Download the artifacts to the release manager's machine.
    • Create a GitHub release, edit the changelog, upload all the release artifacts, including .rpm and .deb files.
    • Issue the GitHub release.
    • Add the .rpm and .deb files to the official zrepl repos, publish those.

Official binary releases are not re-built when Go receives an update. If the Go update is critical to zrepl (e.g. a Go security fix that affects zrepl), we'd issue a new source release. The rationale: distros provide a mechanism for such rebuilds ($zrepl_source_release-$distro_package_revision), whereas GitHub Releases doesn't, which means we'd have to update the existing GitHub release's assets, and nobody would notice that (no RSS feed updates, etc.). Downstream packagers can read the changelog to determine whether they want to push that minor release into their distro or simply skip it.

Additional Notes to Distro Package Maintainers

  • Run the platform tests (Docs -> Usage -> Platform Tests) on a test system to validate that zrepl's abstractions on top of ZFS work with the system ZFS.
  • Ship a default config that adheres to your distro's hier and logging system.
  • Ship a service manager file and please try to upstream it to this repository.
    • dist/systemd contains a Systemd unit template.
  • Ship other material provided in ./dist, e.g. in /usr/share/zrepl/.
  • Have a look at the Makefile's ZREPL_VERSION variable and how it is passed to Go's ldflags. This is how zrepl version knows what version number to show. Your build system should set the ldflags appropriately and add a prefix or suffix that indicates that the given zrepl binary is a distro build, not an official one (see the sketch after this list).
  • Make sure you are informed about new zrepl versions, e.g. by subscribing to GitHub's release RSS feed.
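A minimal sketch of the ldflags mechanism, with hypothetical package and variable names (check the Makefile for the real ones):

package version

// zreplVersion is overridden at link time, e.g.:
//   go build -ldflags "-X github.com/zrepl/zrepl/version.zreplVersion=v0.6.1+mydistro1"
var zreplVersion = "unknown"

// Version is what the zrepl version subcommand would print.
func Version() string { return zreplVersion }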

Contributing Code

  • Open an issue when starting to hack on a new feature
  • Commits should reference the issue they are related to
  • Docs improvements not documenting new features do not require an issue.

Breaking Changes

Backward-incompatible changes must be documented in the git commit message and are listed in docs/changelog.rst.

Glossary & Naming Inconsistencies

In ZFS, dataset refers collectively to the objects filesystem, ZVOL, and snapshot.
However, we need a word that covers filesystems & ZVOLs but excludes snapshots, bookmarks, etc.

Toward the user, the following terminology is used:

  • filesystem: a ZFS filesystem or a ZVOL
  • filesystem version: a ZFS snapshot or a bookmark
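Purely illustrative Go declarations of that terminology (these are not zrepl's actual types):

package zfs

// Filesystem is a ZFS filesystem or a ZVOL, e.g. "zroot/var".
type Filesystem struct{ Path string }

// FilesystemVersion is a snapshot ("@name") or bookmark ("#name") of a Filesystem.
type FilesystemVersion struct {
    FS   Filesystem
    Name string // includes the "@" or "#" prefix
}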

Sadly, the zrepl implementation is inconsistent in its use of these words: variables and types are often named dataset when they in fact refer to a filesystem.

There will not be a big refactoring (an attempt was made, but it destroyed too much history without much gain).

However, new contributions & patches should fix naming without further notice in the commit message.

zrepl's Issues

Ensure in-memory log in cmd.Daemon is bounded

Practically, it probably won't be a problem. But still, it would be nice to assert that the in-memory buffered log entries per task do not exceed a threshold.

Search for this issue in the codebase.

ref #10

Ideas

  • Find sensible max size for in-memory logs
  • Serialize logs to JSON []byte, keep level in separate value
  • Keep sum of len([]byte) arrays until max size is reached
  • Discard log messages at the lowest level first (debug before info), then the oldest
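A minimal Go sketch of that eviction policy, with hypothetical types (this is not zrepl's actual logger API):

package logbuffer

type entry struct {
    level int    // numeric level; debug is the lowest
    msg   []byte // pre-serialized JSON
}

type boundedLog struct {
    maxBytes, curBytes int
    entries            []entry
}

func (b *boundedLog) add(e entry) {
    b.entries = append(b.entries, e)
    b.curBytes += len(e.msg)
    for b.curBytes > b.maxBytes && len(b.entries) > 0 {
        b.evict()
    }
}

// evict drops the oldest entry among those with the lowest level.
func (b *boundedLog) evict() {
    idx := 0
    for i, e := range b.entries {
        if e.level < b.entries[idx].level {
            idx = i
        }
    }
    b.curBytes -= len(b.entries[idx].msg)
    b.entries = append(b.entries[:idx], b.entries[idx+1:]...)
}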

Docs - detail any dataset properties which are overridden

The zrepl documentation should have a section which details any dataset properties that are overridden.

There appears to be at least one, as I noticed the replicated datasets are not mounted.

This will be important for disaster recovery if it is ever needed - what do I need to put back to normal?

include version information in build artifacts

Version information a la git describe --dirty

  • version subcommand
  • control version subcommand?
  • in documentation -> figure out how to do multi-version docs?
  • in an INFO log message on startup

formatting of non-string values in `human` log format

Example output (mind the total_rx field at EOL)

[INFO][hn1][pull][storage/backups/zrepl/pull/hn1/zroot/ROOT/default][storage/backups/zrepl/pull/hn1/zroot/ROOT/default => zroot/ROOT/default]: progress on receive operation total_rx="%!s(uint64=22101986)"
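The %!s(...) notation is fmt's error marker for a string verb applied to a non-string value. A minimal Go reproduction and the usual fix (this is not the actual formatter code):

package main

import "fmt"

func main() {
    var totalRx uint64 = 22101986
    fmt.Printf("total_rx=%s\n", totalRx) // prints total_rx=%!s(uint64=22101986)
    fmt.Printf("total_rx=%v\n", totalRx) // prints total_rx=22101986; %v handles any type
}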

Docs: describe PermitRootLogin in sshd_config(5)

The current recommendation is for zrepl to run as root (until the finer details of what needs to be set to allow it to run as an unprivileged user are discovered and clearly documented).

For remote replication, this is hindered by the fact that sshd does not allow root login by default. To work around this without opening a large security hole, the following should be added to /etc/ssh/sshd_config:
PermitRootLogin forced-commands-only

For more info, users should be directed to the sshd_config(5) man page:
https://man.freebsd.org/sshd_config

Suggest this should be added to both the installation doc page and the tutorial.

Tutorial - error in authorized_keys examples

The tutorial example for the authorized_keys file appears to be wrong. Rather than "zrepl stdinserver backups.example.com" I think it should be "zrepl stdinserver prod1.example.com"
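For context, the forced-command authorized_keys entry under discussion looks roughly like this (options abbreviated, key elided; the tutorial has the authoritative version):

command="zrepl stdinserver prod1.example.com",no-port-forwarding,no-pty ssh-ed25519 AAAA... prod1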

Cannot compile from source

From GNU/Linux Ubuntu 17.04, following instructions:

./lazy.sh devsetup

returns

Collecting pygobject==3.22.0 (from -r /home/stephane/code/golang/gopath/src/github.com/zrepl/zrepl/docs/requirements.txt (line 17))
  Could not find a version that satisfies the requirement pygobject==3.22.0 (from -r /home/stephane/code/golang/gopath/src/github.com/zrepl/zrepl/docs/requirements.txt (line 17)) (from versions: )
No matching distribution found for pygobject==3.22.0 (from -r /home/stephane/code/golang/gopath/src/github.com/zrepl/zrepl/docs/requirements.txt (line 17))

and pip3 search pygobject says:

pygi-treeview-dnd (0.1.0)  - Workaround that allows PyGObject programs to use the high level TreeView DnD API

Second replication begins if first replication is not finished

During the first replication of many gigabytes of data, I initially had the interval of the pull job set to 10m, and the first replication would not be finished by the time the second one was called to start. I checked the status many hours later and could see numerous ssh sessions running, which led me to believe multiple replication jobs were now running at once (which I don't think should ever happen). I expected that if another replication job was called to start before the previous had finished, the new job would just be cancelled entirely.

I did not look into the state of my replicated data, or if the replications were proceeding ok. It was purely the fact that multiple zrepl ssh sessions were running that led me to believe this was the behaviour.

create bookmarks when snapshotting

If source and puller diverge (replication lag) and the source still has a bookmark of the latest state at the puller, we don't have a conflict, just a gap in replication. It is better to resume replication in that case instead of just throwing Diverged errors.

  • Diffing & Replication logic support (was always supported)
  • Create Bookmarks
  • update docs warning about replication lag (still a thing, but mitigated by this feature)
  • Find sane default for pruning bookmarks (basically they cost nothing, just keep them around?)
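For reference, a bookmark is created from an existing snapshot like so (hypothetical dataset and snapshot names):

zfs bookmark zroot/var@zrepl_20170815_120000 zroot/var#zrepl_20170815_120000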

Property Replication

Right now, all zfs send invocations on the sending side are without -p, meaning we do not replicate any properties.

On the receiving side, as a safeguard, we override mountpoint in order to protect ourselves from a malicious sender that is trying to mount over some filesystem on the backup server.

This guards against a mostly theoretical threat, and I'm investigating better solutions, e.g. zfs receive -x all, see https://github.com/zrepl/zrepl/wiki/ZFS-Feature-Support-&-Wishlist#zfs-receive--x-all
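On OpenZFS versions that support zfs receive -o, such a receive-side override can be expressed roughly as follows (hypothetical dataset names; zrepl's actual mechanism may differ):

zfs send pool/fs@zrepl_snap | zfs receive -o mountpoint=none backuppool/zrepl/pool/fs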

Agenda

  • Find a way how property replication should work
    • Where to store properties on receiver?
    • Protect against malicious sender (zfs receive -x all)
    • Think about restore procedures
  • Implement it
  • Document it (refs #23 )

Tutorial - initial confusion if config examples are for prod1 or backups

When reading the tutorial, I was initially confused about which PC should have the pull_prod1 job defined. It took a while for me to realize the tutorial section title "Configure backups" was referring to the "backups" server (even though it has a box around it, it wasn't immediately obvious). Perhaps this PC could be re-titled "backup_server" to make it more obvious?

It probably wasn't helped by the typo in "Analysis" section, which says the pull job is defined on prod1 (I believe this should have been "backups").

logging: make tcp outlet fully asynchronous

currently, a slow TCP connection will block the log call for retry_interval

additionally, the dialing / name resolution timeout is not bounded by retry_interval -> if name resolution hangs, a log call can block for ~30s?
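A Go sketch of the fully asynchronous design, with hypothetical types (not the actual outlet code): a bounded channel decouples the log call from the network, and DialTimeout bounds both name resolution and connect:

package outlet

import (
    "net"
    "time"
)

type tcpOutlet struct {
    queue chan []byte // bounded; WriteEntry never blocks on the network
}

// WriteEntry enqueues a formatted entry, dropping it if the queue is full,
// so a slow or dead TCP peer can never stall the logging call site.
func (o *tcpOutlet) WriteEntry(entry []byte) {
    select {
    case o.queue <- entry:
    default: // queue full: drop instead of blocking
    }
}

func (o *tcpOutlet) loop(addr string, retryInterval time.Duration) {
    for entry := range o.queue {
        // DialTimeout bounds name resolution + connect, unlike a bare Dial.
        conn, err := net.DialTimeout("tcp", addr, retryInterval)
        if err != nil {
            time.Sleep(retryInterval) // the entry is dropped in this sketch
            continue
        }
        conn.Write(entry) // write errors also drop the entry here
        conn.Close()
    }
}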

DOCS - describe "interval" and "grid" parameters

I was initially confused about the interaction between the "interval" parameters and the "grid" pruning parameters. I think I won't be the only one asking these questions, so it should be added to your documentation site. Some questions I had:

  • What happens if I define an source interval of 10m, a source grid with 4x15m, and a pull grid with 3x20m?
  • What is the difference between the source interval and pull interval? What happens if the source interval is different from the pull interval? I assume it would be normal to have the source interval quite frequent (e.g. 10m), and much less frequent on the pull job (e.g. 24h)?
  • For the grid parameters, what is the difference between 1x24d vs 24x1d
  • What does the (keep=all) setting do in the grid parameter?
  • For the interval parameter, I don't seem to be able to set "1d"?
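For readers who land on this issue: a grid is a sequence of time buckets, each keeping a bounded number of snapshots. An illustrative spec (values made up; the docs have the authoritative syntax):

grid: 1x1h(keep=all) | 24x1h | 35x1d

This keeps every snapshot from the last hour, then one snapshot per hour for a day, then one per day for 35 days. By the same logic, 24x1d is 24 one-day buckets (one survivor per day), whereas 1x24d is a single 24-day bucket with a single survivor.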

zrepl status subcommand

It can be annoying trying to read verbose log files. It would be nice to have a command "zrepl status" which outputs the status of any currently running jobs (including progress details?) to the command line, and also a "zrepl list" which outputs the full suite of snapshots (and size details) available on the PC it is run on (whether they are source snapshots from this machine or pull snapshots from other machines).

Safeguard against misconfigured system time

  • Check if zrepl was down for more than X times the snapshot interval length
  • -> disable pruning in such a case (switch it to dry run?)
  • -> could detect such a case by searching for gaps in the snapshot list -> problem: gotta understand the pruning policy (fading induces gaps)
  • -> should work independently of pruning strategy
  • Optionally do a time check in zrepl, compare it to the system time, and see how far off we are?
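A tiny Go sketch of the proposed safeguard (hypothetical function, illustrative only):

package pruning

import "time"

// pruningSafe: if the newest snapshot is much older than the snapshot
// interval suggests, zrepl was down or the clock jumped; in that case the
// caller should switch pruning to a dry run.
func pruningSafe(newestSnap time.Time, interval time.Duration, slackFactor int) bool {
    return time.Since(newestSnap) < time.Duration(slackFactor)*interval
}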

docs: evaluate sphinx as alternative

  • readthedocs theme is better than customized docdock theme
  • evaluate cross referencing, but it can't be worse than hugo's
  • expressive, unified admonitions (docdock has a plethora of notice, panels, alert, etc.)
  • could reference go code easily using third party go domain

test connect subcommand

for a job, should test if it can connect to the other side, maybe show the remote version

  • API
  • subcommand

evaluate `zfs list -o createtxg,guid` availability and stability

FreeBSD

  • 11.X
  • 10.3
  • 9.X ? (not supported anymore)

ZoL

  • 0.6.5.9_4.10.9_1-1 (Arch Linux, ZFS released Feb 3 2017)
  • 0.6.4.2_3.16.39-1+deb8u2 (Debian Jessie, ZFS released June 26 2015)
  • 0.6.3_3.16.39-1+deb8u2 (Debian Jessie, ZFS released June 12 2014)
    ZFS: Loaded module v0.6.3-1.3, ZFS pool version 5000, ZFS filesystem version 5
  • 0.6.x ?
  • 0.5 ?

OS X

  • 1.6.1, Sierra (2017-02-10)

illumos based distros

  • ?

DOCS - does second replication begin if first replication is not finished?

I believe the documentation should explain the zrepl behaviour for what happens if the first replication is not finished at the time that the second replication is called to begin. This needs to cover the pull, push and local scenarios.

This is likely to occur during the first replication of many gigabytes of data, if the interval of the pull job is set to 10m: the first replication would not be finished by the time the second one is called to start. Given that this will happen to new users, it is important they are clear on the behaviour they can expect during first-time use.

ZFS channel program support

  • Feature detection
  • Use it in autosnapper (queue up snaps + bookmarks, then do all at once)
  • Use it in pruner (queue up destroys, fall back to individual destroys if atomically destroying all of them fails)
  • Use as replacement for complicated ZFS lists?

Error output when stopping zrepl 0.0.1

After updating to the 0.0.1 release, I am getting errors upon stopping zrepl:

[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x72a326]

goroutine 37 [running]:
github.com/zrepl/zrepl/logger.(*Logger).WithError(0xc4201aa080, 0x0, 0x0, 0x0)
        /wrkdirs/usr/ports/sysutils/zrepl/work/src/github.com/zrepl/zrepl/logger/logger.go:105 +0x26
github.com/zrepl/zrepl/cmd.(*ControlJob).JobStart(0xc4200f7060, 0xad5ec0, 0xc4201381b0)
        /wrkdirs/usr/ports/sysutils/zrepl/work/src/github.com/zrepl/zrepl/cmd/config_job_control.go:62 +0x3fc
github.com/zrepl/zrepl/cmd.(*Daemon).Loop.func1(0xad5ec0, 0xc4201381b0, 0xc4201b4000, 0xad2b80, 0xc4200f7060)
        /wrkdirs/usr/ports/sysutils/zrepl/work/src/github.com/zrepl/zrepl/cmd/daemon.go:82 +0x45
created by github.com/zrepl/zrepl/cmd.(*Daemon).Loop
        /wrkdirs/usr/ports/sysutils/zrepl/work/src/github.com/zrepl/zrepl/cmd/daemon.go:81 +0x367
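The trace suggests (*Logger).WithError was called with a nil error which the logger then dereferences. A hypothetical reconstruction of that failure mode (not the actual logger code):

package logger

type Logger struct{ fields map[string]interface{} }

func (l *Logger) WithField(field string, val interface{}) *Logger {
    // ...returns a child logger carrying the field
    return l
}

func (l *Logger) WithError(err error) *Logger {
    // err.Error() dereferences err; a nil error here yields exactly the
    // nil-pointer SIGSEGV shown in the trace above.
    return l.WithField("err", err.Error())
}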
