Comments (16)

ppbrown commented on July 24, 2024

This sounds like a nasty OS bug.
Either the command succeeds, or it fails.

If failure is signaled by the OS/ssh/zfs send, then shouldn't it trust that?
It seems a bit crazy to double-check every time, "okay, did you really fail?"

And how would it differentiate between a "successful fail" and a "true fail"?

Maybe this is more appropriate as a bug filed with your OS vendor? Not sure.

On Thu, Aug 20, 2015 at 2:52 PM, gpothier [email protected] wrote:

Sometimes, syncing fails in a way that prevents further syncing: as far as
I understand, at some point a snapshot is correctly received by the
destination machine, but the reception is not acknowledged by the source
(probably because the connection dropped after sending the data but before
the diff is applied on the destination). In that case, during the next sync
zrep tries to upload the already applied snapshot, and thus fails (and if
the sync runs from a cron job, the system starts to accumulate lots of
snapshots).

In this case, in order to recover manually I have to roll back the
destination filesystem to a previous snapshot and sync again. It would be
nice if zrep could detect that condition and recover automatically.



gpothier commented on July 24, 2024

I don't think it is an OS issue. If the connection drops just after zfs recv receives all the data but before it can respond, the command will succeed on the recv side and fail with a network error on the send side. There is no way I can think of to, as you say, differentiate between "successful fail" or "true fail", so we cannot avoid getting into this situation where a snapshot has been received by the destination, but still marked as not sent on the source.

What could be done, though, is to recover from this situation during the next sync attempt. I guess you would have to list the destination's snapshots and compare with the local ones. If the destination has more recent snapshots than the last one that is marked sent on the source, then roll back the destination to the last known sent snapshot. At least this is what I have been doing when manually recovering, but now that I think of it, it might be just a matter of marking it sent on the source?
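
For instance, the manual recovery might look roughly like this (a sketch only; desthost, destpool/fs, srcpool/fs and zrep_000004 are placeholder names, and this is not something zrep does today):

# the last snapshot the source believes was successfully sent
LAST_SENT=zrep_000004

# list the destination's snapshots, oldest first, and check whether
# anything newer than $LAST_SENT already exists there
ssh desthost zfs list -H -t snapshot -o name -s creation -r destpool/fs

# if so, roll the destination back to the last known-sent snapshot
# (-r destroys the newer snapshots on the destination)
ssh desthost zfs rollback -r destpool/fs@$LAST_SENT

# then run the normal sync again
zrep sync srcpool/fs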


ppbrown commented on July 24, 2024

So, two suggestions here for you:

  1. I suggest you do more rigorous checking of your cron jobs for error outputs ;-)
  2. It might save you a little work to know that you can force the slave side to sync to a particular snapshot from the master side.

So, once you know there has been a problem, you can then use the (new in zrep 1.3.1! to be released in about 10 minutes :) )
zrep list -v fs/name/here
output to easily see the exact name of the last snapshot synced successfully. And then force sync.

so:

sx86test$ ./zrep list -v rpool/scratch1
rpool/scratch1:
zrep:savecount 5
zrep:dest-fs rpool/scratch2
zrep:src-host sx86test
zrep:master yes
zrep:src-fs rpool/scratch1
zrep:dest-host localhost
last snapshot synced: rpool/scratch1@zrep_000004

sx86test$ ./zrep sync rpool/scratch1@zrep_000004
Validating remote snap
WARNING: We will be rolling back rpool/scratch2, on localhost
to zrep_000004, made at: Mon Aug 24 13:38 2015

All newer snapshots on remote side will be destroyed
You should have paused ongoing sync jobs for rpool/scratch2 before
continuing
Continuing in 20 seconds....
Continuing in 10 seconds....
localhost:rpool/scratch2 rolled back successfully to zrep_000004
Now cleaning up local snapshots


ghormoon commented on July 24, 2024

hi, I did not try zrep (yet) to confirm it's the same case, but my experience with zfs send -I is that if you send multiple snaps at once and it fails in the middle, some might actually get applied. E.g. you send two 50MB snaps, the send fails at 70MB, and the first snapshot is applied correctly on the destination. That might cause the inconsistency.


ppbrown commented on July 24, 2024

Errr.. are you saying that you try sending two snapshots of the same filesystem, at the same time??


ghormoon commented on July 24, 2024

If you have 3 snaps, snap1 snap2 snap3, and do zfs send -i snap1 snap3, you get only snap1 and snap3 on the remote site. If you do zfs send -I snap1 snap3, you get snap1, snap2, snap3 on the remote site. It has already happened to me that the send died and snap2 was already on the remote site. I'm on my phone now, I'll provide an example at home.
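
For illustration, the difference in a nutshell (pool/fs, backup/fs and the snapshot names are hypothetical):

# -i sends a single incremental from snap1 to snap3; snap2 never appears on the destination
zfs send -i pool/fs@snap1 pool/fs@snap3 | ssh dest zfs recv backup/fs

# -I sends all intermediary snapshots as well, so snap2 also arrives
zfs send -I pool/fs@snap1 pool/fs@snap3 | ssh dest zfs recv backup/fs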


ghormoon commented on July 24, 2024

how to reproduce this:
(just noting again, I've only had a quick peek at the code, so I can't say whether zrep sends each snapshot separately or not, but I assume there are cases where it sends multiple in one -I stream)

##create test dataset
root@A0-debian_hypervisor:~# zfs create A-ssd/test

##empty snapshot
root@A0-debian_hypervisor:~# zfs snapshot A-ssd/test@snap1

##fill with some data
root@A0-debian_hypervisor:~# dd if=/dev/urandom of=/mnt/A-ssd/test/random.file1 bs=1024 count=$((50*1024))
51200+0 records in
51200+0 records out
52428800 bytes (52 MB) copied, 4.47516 s, 11.7 MB/s

##snap it
root@A0-debian_hypervisor:~# zfs snapshot A-ssd/test@snap2

##fill with even more. the more data, the more time to interrupt it :)
root@A0-debian_hypervisor:~# dd if=/dev/urandom of=/mnt/A-ssd/test/random.file2 bs=1024 count=$((250*1024))
256000+0 records in
256000+0 records out
262144000 bytes (262 MB) copied, 20.3357 s, 12.9 MB/s

##snap it
root@A0-debian_hypervisor:~# zfs snapshot A-ssd/test@snap3

##initial full send of snap1
root@A0-debian_hypervisor:~# zfs send A-ssd/test@snap1 | zfs recv B-hdd/test

##incremental send with -I; notice it basically does the same as doing -i twice: snap1->snap2 and snap2->snap3
root@A0-debian_hypervisor:~# zfs send -v -I A-ssd/test@snap1 A-ssd/test@snap3 | zfs recv B-hdd/test
send from @snap1 to A-ssd/test@snap2 estimated size is 50.1M
send from @snap2 to A-ssd/test@snap3 estimated size is 251M
total estimated size is 301M
TIME SENT SNAPSHOT
17:21:18 905K A-ssd/test@snap2
TIME SENT SNAPSHOT
17:21:20 906K A-ssd/test@snap3
17:21:21 906K A-ssd/test@snap3
17:21:22 906K A-ssd/test@snap3
17:21:23 906K A-ssd/test@snap3
^C
cannot receive incremental stream: invalid backup stream

##see the result
root@A0-debian_hypervisor:~# zfs list -t snap | grep test
A-ssd/test@snap1 64K - 96K -
A-ssd/test@snap2 64K - 50.2M -
A-ssd/test@snap3 0 - 300M -
B-hdd/test@snap1 64K - 96K -
B-hdd/test@snap2 0 - 50.2M -


ppbrown commented on July 24, 2024

You need to give a full example, showing which filesystems are involved.


ppbrown commented on July 24, 2024

ah, email overlap.

The lesson here is: "don't do this".

Don't do overlapping snapshots. As you note, this is a low-level fs behaviour, not a zrep thing.

zrep itself has locking that usually stops it from doing snapshots while sending.
However, if you are perhaps using multiple zrep 'tags' to handle multiple destinations, I could imagine some problems may arise.
To which I would say, "don't overlap zrep usage on the same filesystem at the same time, even when using separate tags".

Does that clear things up?


ghormoon commented on July 24, 2024

I don't have this issue with zrep, I was just considering it, looked at the issues, and this one sounded familiar. The point is, if you do a snap every 1 minute but send after 5 minutes, you have 5 snapshots to send. Can this happen in zrep?
If yes, this is the case I've shown, and if that send fails, it might apply a few of the snapshots that managed to get through the network before the failure.


ghormoon commented on July 24, 2024

Maybe a real-case example: if I set zrep to snap every 1 minute, but the remote server is unavailable for an hour, what will happen? Will I have only 1 snapshot or 60?
And if 60, will there be one zfs send -I fs@snap1 fs@snap60, or sixty sends like zfs send -i fs@snap1 fs@snap2 ... zfs send -i fs@snap59 fs@snap60?


ghormoon commented on July 24, 2024

Also, what's the reason to stop doing snapshots when sending? I don't see a reason; you'll just send them next time ...


ppbrown commented on July 24, 2024

In that case, zrep will keep creating snapshots, and rename unsent ones with _unsent.

The reason for this is for people who also use the zrep snapshots as a local "oops" repository.

(it makes more sense in the context of syncing only once an hour)

If you notice the pileup before restoring connectivity, you always have the option of (see the sketch after this list):

  1. stop zrep sync job
  2. zfs list -t snapshot|grep unsent
  3. zfs destroy list from step 2.
  4. continue zrep sync
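
For example, a rough sketch of steps 2 and 3, assuming the filesystem is pool/fs and that the piled-up _unsent snapshots really are safe to discard (names here are hypothetical):

# list the unsent snapshots that piled up
zfs list -H -t snapshot -o name | grep 'pool/fs@.*unsent'

# destroy them (review the output of the previous command first!)
zfs list -H -t snapshot -o name | grep 'pool/fs@.*unsent' | xargs -n 1 zfs destroy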

However, your email suggests a misconception.
zrep does not do multiple sends if there are multiple unsent snapshots. It simply uses the -I option, which "sends all intermediary snapshots from the first snapshot to the [last one]".

Among other reasons, that is in order to transfer any non-zrep snapshots that admins may have chosen to make.


ghormoon commented on July 24, 2024

that is what I expected and wanted to confirm. So in my case of a snap every 1 min and a send after an hour, it does one send with all 60 snapshots included, via -I.
Therefore, if this send fails halfway, it applies 30 snapshots on the remote (and these are valid and correct) and fails to handle this case if it tries to send again from snap 1 instead of snap 30.


ghormoon commented on July 24, 2024

In the example I've given above, if you kill the send after one of the snapshots, sending again from snap1 to snap3 will complain that there's already a snap2. But sending snap2 to snap3 is a valid solution.
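
To illustrate with the datasets from the reproduction above, the manual recovery would be something like this (a sketch; zrep does not do this automatically):

# snap2 already made it to B-hdd/test, so resume the incremental send from there
zfs send -v -I A-ssd/test@snap2 A-ssd/test@snap3 | zfs recv B-hdd/test

# or, equivalently, roll the destination back to the last snapshot the source
# believes was sent, then retry the original send (this destroys snap2 on B-hdd)
zfs rollback -r B-hdd/test@snap1
zfs send -v -I A-ssd/test@snap1 A-ssd/test@snap3 | zfs recv B-hdd/test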


ppbrown commented on July 24, 2024

I think you are just talking through a bunch of theories you came up with.
However, that is not what the GitHub issue tracker is for.
(plus, they aren't valid theories :-/ )

Please do not reply to this CLOSED issue any more (issue #5).

If you find an actual problem with zrep, rather than a theoretical issue, please open up a NEW GitHub issue (with specific zrep output!)

or send me a NEW email, with a fresh subject line.

Thanks.

