Comments (7)

ahrtr commented on August 27, 2024

we tried but restoration failed with fatal error:

INFO[0002] successfully fetched data of base snapshot in 1.5047977750000001 seconds [CompressionPolicy:gzip]  actor=restorer
unexpected fault address 0x776bdb053000
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x776bdb053000 pc=0xf19fd6]

goroutine 1 [running]:
runtime.throw({0x207bb9d?, 0x0?})

It means the snapshot operation actually failed. So the received snapshot isn't a completed snapshot.

from etcd.

ahrtr commented on August 27, 2024

There are two possible reasons:

  • The "snapshot" wasn't actually a snapshot, it might be just copied from the db file directly. Please run etcdutl snapshot restore path-2-snapshot --skip-hash-check=true to double check;
  • Some errors happened when generating the snapshot. The command mentioned above should fail. Did you see any error on either the client side or the server side when generating the snapshot?

ishan16696 commented on August 27, 2024

The "snapshot" wasn't actually a snapshot, it might be just copied from the db file directly.

It was a real snapshot: we call the snapshot API to take it (and name it full-snapshot).

Please run etcdutl snapshot restore path-2-snapshot --skip-hash-check=true to double check;

we tried but restoration failed with fatal error:

INFO[0002] successfully fetched data of base snapshot in 1.5047977750000001 seconds [CompressionPolicy:gzip]  actor=restorer
unexpected fault address 0x776bdb053000
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x776bdb053000 pc=0xf19fd6]

goroutine 1 [running]:
runtime.throw({0x207bb9d?, 0x0?})

Did you see any error on either the client side or the server side when generating the snapshot?

Unfortunately I don't have logs. I have seen this happen twice. The first time was in one of our test clusters, which doesn't have an observability stack, so I'm unable to get logs. The other occurrence was reported by one of our community users: gardener/etcd-backup-restore#749

ishan16696 commented on August 27, 2024

So the received snapshot isn't a completed snapshot.

Yes, it seems so. Is there a way to verify the integrity of the snapshot, either on the etcd side before sending it or on the etcd client side?

For verifying the integrity of the snapshot on the etcd client side, I thought of the following. It is similar to how restoration verifies the snapshot before restoring.

Calculate the hash of the snapshot (excluding the appended hash) and compare it with the hash value that etcd appends to the snapshot. If they match, the snapshot is intact; otherwise retry taking the snapshot until the hash matches.

What do you think?
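A minimal sketch of the client-side check proposed above, assuming the received snapshot is the database bytes followed by a 32-byte SHA-256 digest as the final bytes. The names `verifySnapshot`, `makeSnapshot`, and `errBadChecksum` are hypothetical and are not part of any etcd API; this only illustrates the idea.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"fmt"
)

// errBadChecksum is a hypothetical sentinel error for this sketch.
var errBadChecksum = errors.New("snapshot checksum mismatch")

// verifySnapshot assumes snap is everything received from the snapshot
// stream: the database bytes followed by a trailing SHA-256 digest.
func verifySnapshot(snap []byte) error {
	if len(snap) < sha256.Size {
		return fmt.Errorf("snapshot too short: %d bytes", len(snap))
	}
	split := len(snap) - sha256.Size
	data, want := snap[:split], snap[split:]
	got := sha256.Sum256(data)
	if !bytes.Equal(got[:], want) {
		return errBadChecksum
	}
	return nil
}

// makeSnapshot builds a stand-in snapshot (data plus digest) for the demo.
func makeSnapshot(db []byte) []byte {
	sum := sha256.Sum256(db)
	return append(append([]byte{}, db...), sum[:]...)
}

func main() {
	good := makeSnapshot(bytes.Repeat([]byte{0xAB}, 4096))

	fmt.Println("complete snapshot:", verifySnapshot(good))
	// Dropping the last bytes simulates a stream that ended early.
	fmt.Println("truncated snapshot:", verifySnapshot(good[:len(good)-10]))
}
```

For a real snapshot file it would be preferable to stream the file through the hash rather than hold it in memory; the in-memory slice here only keeps the sketch short.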

ahrtr commented on August 27, 2024

You need to use the client side error to detect such failure.

resp, err := ss.Recv()
if err != nil {
	switch err {
	case io.EOF:
		m.lg.Info("completed snapshot read; closing")
	default:
		m.lg.Warn("failed to receive from snapshot stream; closing", zap.Error(err))
	}
	pw.CloseWithError(err)
	return
}

Also, I just had a quick read of the server-side implementation; it seems there is a minor issue with pb.SnapshotResponse.RemainingBytes. The value of total doesn't include the sha256 checksum, so when the RemainingBytes == 0, the server side may not have sent out the checksum yet. But your issue isn't caused by this minor issue.

total := snap.Size()
size := humanize.Bytes(uint64(total))
start := time.Now()
ms.lg.Info("sending database snapshot to client",
	zap.Int64("total-bytes", total),
	zap.String("size", size),
	zap.String("storage-version", storageVersion),
)
for total-sent > 0 {
	// buffer just holds read bytes from stream
	// response size is multiple of OS page size, fetched in boltdb
	// e.g. 4*1024
	// NOTE: srv.Send does not wait until the message is received by the client.
	// Therefore the buffer can not be safely reused between Send operations
	buf := make([]byte, snapshotSendBufferSize)
	n, err := io.ReadFull(pr, buf)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return togRPCError(err)
	}
	sent += int64(n)
	// if total is x * snapshotSendBufferSize. it is possible that
	// resp.RemainingBytes == 0
	// resp.Blob == zero byte but not nil
	// does this make server response sent to client nil in proto
	// and client stops receiving from snapshot stream before
	// server sends snapshot SHA?
	// No, the client will still receive non-nil response
	// until server closes the stream with EOF
	resp := &pb.SnapshotResponse{
		RemainingBytes: uint64(total - sent),
		Blob:           buf[:n],
		Version:        storageVersion,
	}
	if err = srv.Send(resp); err != nil {
		return togRPCError(err)
	}
	h.Write(buf[:n])
}
// send SHA digest for integrity checks
// during snapshot restore operation
sha := h.Sum(nil)
ms.lg.Info("sending database sha256 checksum to client",
	zap.Int64("total-bytes", total),
	zap.Int("checksum-size", len(sha)),
)
hresp := &pb.SnapshotResponse{RemainingBytes: 0, Blob: sha, Version: storageVersion}
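To illustrate the RemainingBytes point, here is a hedged, self-contained simulation. It does not use the real etcd client: `snapshotResponse` and `fakeStream` are hypothetical stand-ins that mimic only the message order seen in the server loop above (data chunks, then a final checksum message that also carries RemainingBytes == 0). It contrasts a reader that drains the stream until io.EOF with one that stops at RemainingBytes == 0.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// snapshotResponse mirrors only the two fields of pb.SnapshotResponse
// used here; it is a stand-in, not the generated etcd type.
type snapshotResponse struct {
	RemainingBytes uint64
	Blob           []byte
}

// fakeStream is a hypothetical in-memory stream: it returns its queued
// messages in order and then err (io.EOF for a clean close).
type fakeStream struct {
	msgs []snapshotResponse
	err  error
}

func (s *fakeStream) Recv() (*snapshotResponse, error) {
	if len(s.msgs) == 0 {
		return nil, s.err
	}
	m := s.msgs[0]
	s.msgs = s.msgs[1:]
	return &m, nil
}

// readAll drains the stream until io.EOF. Any other error means the
// snapshot is incomplete, so the bytes read so far are discarded.
func readAll(s *fakeStream) ([]byte, error) {
	var out []byte
	for {
		resp, err := s.Recv()
		if errors.Is(err, io.EOF) {
			return out, nil
		}
		if err != nil {
			return nil, fmt.Errorf("snapshot stream failed: %w", err)
		}
		out = append(out, resp.Blob...)
	}
}

// readUntilZeroRemaining shows the pitfall: it stops as soon as
// RemainingBytes reaches 0 and so misses the checksum message.
func readUntilZeroRemaining(s *fakeStream) []byte {
	var out []byte
	for {
		resp, err := s.Recv()
		if err != nil {
			return out
		}
		out = append(out, resp.Blob...)
		if resp.RemainingBytes == 0 {
			return out
		}
	}
}

// newStream queues two data chunks and a placeholder checksum message.
func newStream() *fakeStream {
	return &fakeStream{
		msgs: []snapshotResponse{
			{RemainingBytes: 4, Blob: []byte("data")},
			{RemainingBytes: 0, Blob: []byte("more")},
			{RemainingBytes: 0, Blob: []byte("<sha256>")},
		},
		err: io.EOF,
	}
}

// newBrokenStream fails after the first chunk instead of closing cleanly.
func newBrokenStream() *fakeStream {
	return &fakeStream{
		msgs: []snapshotResponse{{RemainingBytes: 4, Blob: []byte("data")}},
		err:  io.ErrUnexpectedEOF,
	}
}

func main() {
	full, err := readAll(newStream())
	fmt.Printf("readAll: %q err=%v\n", full, err)
	fmt.Printf("readUntilZeroRemaining: %q\n", readUntilZeroRemaining(newStream()))
	_, err = readAll(newBrokenStream())
	fmt.Println("broken stream:", err)
}
```

In this simulation only the reader that waits for io.EOF ends up with the checksum bytes, and a stream that ends with any other error is reported as a failure rather than treated as a finished snapshot.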

ishan16696 commented on August 27, 2024

You need to use the client side error to detect such failure.

We do handle the error on the client side while taking the etcd snapshot, but I guess the client side didn't return any error.

ishan16696 commented on August 27, 2024

The value of total doesn't include the sha256 checksum, so when the RemainingBytes == 0, the server side may not have sent out the checksum yet. But your issue isn't caused by this minor issue.

Why is this issue not caused by that? TBH, to me it feels like it is: the server sent the snapshot but failed to send the sha256 checksum, so no client-side error was detected and the snapshot appeared to have been taken successfully, but restoration then fails because the hash check/validation fails.
