Giter VIP home page Giter VIP logo

Comments (14)

petermattis avatar petermattis commented on August 20, 2024 1

Calling r.Recover unconditionally seems to be a bug. Elsewhere in this function it is only called if r.recovering == true.

diff --git a/internal/record/record.go b/internal/record/record.go
index 1a7d962..c6b4c0d 100644
--- a/internal/record/record.go
+++ b/internal/record/record.go
@@ -166,6 +166,9 @@ func (r *Reader) nextChunk(wantFirst bool) error {
                                        //
                                        // Set r.err to be an error so r.Recover actually recovers.
                                        r.err = errors.New("pebble/record: block appears to be zeroed")
+                                       if !r.recovering {
+                                               return r.err
+                                       }
                                        r.Recover()
                                        continue
                                }

This causes opening the database to fail with: pebble/record: block appears to be zeroed. I think RocksDB has logic to truncate the log in this case.

from pebble.

tbg avatar tbg commented on August 20, 2024

Repros easily with fresh directory:

tschottdorf@gceworker-tschottdorf:~/go/src/github.com/petermattis/pebble$ pebble sync -c 24 -d 0 bench
dir bench
concurrency 24
_elapsed____ops/sec__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
      1s     5452.7      4.5      5.0      5.2      7.9
      2s     5712.9      4.2      5.0      5.2      7.1
^C
_elapsed_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
    2.5s          14081         5605.4      4.3      4.5      5.0      5.5      7.9

tschottdorf@gceworker-tschottdorf:~/go/src/github.com/petermattis/pebble$ pebble sync -c 24 -d 0 bench | x
x: command not found
panic: pebble: inconsistent batch count

goroutine 1 [running]:
github.com/petermattis/pebble.(*memTable).apply(0xc0001120f0, 0xc0000f97a8, 0x124a3, 0xc0081a6fd0, 0x166)
        /home/tschottdorf/go/src/github.com/petermattis/pebble/mem_table.go:147 +0x2d1
github.com/petermattis/pebble.(*DB).replayWAL(0xc0000e8800, 0xc0000f9a58, 0x659b20, 0x7ea320, 0xc00413e070, 0x10, 0x124a8, 0x0, 0x0)
        /home/tschottdorf/go/src/github.com/petermattis/pebble/open.go:266 +0x3bd
github.com/petermattis/pebble.Open(0x7fffb5159606, 0x5, 0xc0000f2210, 0x0, 0x0, 0x0)
        /home/tschottdorf/go/src/github.com/petermattis/pebble/open.go:158 +0xedd
main.runTest(0x7fffb5159606, 0x5, 0xc0000f9c70, 0xc0000f9c90, 0xc0000f9c80)
        /home/tschottdorf/go/src/github.com/petermattis/pebble/cmd/pebble/test.go:193 +0x2a2
main.runSync(0x7c9800, 0xc0000e00a0, 0x1, 0x5)
        /home/tschottdorf/go/src/github.com/petermattis/pebble/cmd/pebble/sync.go:30 +0xb1
github.com/petermattis/pebble/vendor/github.com/spf13/cobra.(*Command).execute(0x7c9800, 0xc0000e0050, 0x5, 0x5, 0x7c9800, 0xc0000e0050)
        /home/tschottdorf/go/src/github.com/petermattis/pebble/vendor/github.com/spf13/cobra/command.go:766 +0x2cc
github.com/petermattis/pebble/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x7c9a60, 0x0, 0x625ce8, 0x16)
        /home/tschottdorf/go/src/github.com/petermattis/pebble/vendor/github.com/spf13/cobra/command.go:852 +0x2fd
github.com/petermattis/pebble/vendor/github.com/spf13/cobra.(*Command).Execute(0x7c9a60, 0x7b2360, 0x6220cc)
        /home/tschottdorf/go/src/github.com/petermattis/pebble/vendor/github.com/spf13/cobra/command.go:800 +0x2b
main.main()
        /home/tschottdorf/go/src/github.com/petermattis/pebble/cmd/pebble/main.go:49 +0x34a

from pebble.

petermattis avatar petermattis commented on August 20, 2024

The bench directory is not cleared between runs. A bad batch was persisted to the WAL and now it is causing a failure on every run (you can see this from the replayWAL). Can you do a little sleuthing of this batch to print out more info. What is the seqNum vs the expected seqNum? What is the count of items in the batch?

from pebble.

tbg avatar tbg commented on August 20, 2024

The batch at the offending entry is inconsistent in that it claims to contain 5 ops (b.count() == 5) but it actually only contains three:

panic: pebble: inconsistent batch count: last seqNum is 83033, expected 83035
history:
add at seqNum=83030
add at seqNum=83031
add at seqNum=83032
break

This seems reminiscient of examples discussed in https://blog.dgraph.io/post/alice/. I'm running with db.NoSync, so it looks like I'm just corrupting the WAL. This is expected, right? And couldn't this also happen if I decided to sync, though presumably more rarely?

from pebble.

petermattis avatar petermattis commented on August 20, 2024

The WAL format (internal/record) has a checksum. While the WAL can be truncated to record boundaries, I don't think a batch stored within the WAL can be truncated. Note that a WAL entry is composed of multiple records. Perhaps the error handling on checksum mismatch is faulty, so that a batch spanning multiple records has one of those records corrupt, but we don't mark the entire batch as bad. That's just speculation. Regardless, what you're seeing is unexpected.

from pebble.

tbg avatar tbg commented on August 20, 2024

Thanks. I'm going to dig some more.

from pebble.

tbg avatar tbg commented on August 20, 2024

We're hitting this code path:

https://github.com/petermattis/pebble/blob/841785a2b7ee00982eb7f2e8ac89587db9f64bee/internal/record/record.go#L162-L170

My reading of this is that we have a WAL entry that, say, spans two blocks and should contain five records. The first block is a-ok and contains three entries. The second block is all zeroes and this code allows it to be skipped without failing the operation, so now we've seen a WAL entry with five records but three actual records read.

from pebble.

petermattis avatar petermattis commented on August 20, 2024

Hmm, that doesn't seem right. Note that this code was taken from go-leveldb. Figuring out what RocksDB does in this instance will be fun. Are there existing tests in internal/record which exercise this code path?

from pebble.

tbg avatar tbg commented on August 20, 2024

I'll be away for a bit, but planning to take a closer look.

from pebble.

tbg avatar tbg commented on August 20, 2024

(Also the Recover path shouldn't be hit in this case, it's predicated on wantFirst which wouldn't be true for the second block)

from pebble.

petermattis avatar petermattis commented on August 20, 2024

RocksDB has knobs for tuning the recovery behavior:

enum class WALRecoveryMode : char {
  // Original levelDB recovery
  // We tolerate incomplete record in trailing data on all logs
  // Use case : This is legacy behavior
  kTolerateCorruptedTailRecords = 0x00,
  // Recover from clean shutdown
  // We don't expect to find any corruption in the WAL
  // Use case : This is ideal for unit tests and rare applications that
  // can require high consistency guarantee
  kAbsoluteConsistency = 0x01,
  // Recover to point-in-time consistency (default)
  // We stop the WAL playback on discovering WAL inconsistency
  // Use case : Ideal for systems that have disk controller cache like
  // hard disk, SSD without super capacitor that store related data
  kPointInTimeRecovery = 0x02,
  // Recovery after a disaster
  // We ignore any corruption in the WAL and try to salvage as much data as
  // possible
  // Use case : Ideal for last ditch effort to recover data or systems that
  // operate with low grade unrelated data
  kSkipAnyCorruptedRecords = 0x03,
};

We use the default in Cockroach: kPointInTimeRecovery.

from pebble.

petermattis avatar petermattis commented on August 20, 2024

Here is the RocksDB code to decide what to do when an error is encountered while replaying the WAL:

      if (immutable_db_options_.wal_recovery_mode ==
          WALRecoveryMode::kSkipAnyCorruptedRecords) {
        // We should ignore all errors unconditionally
        status = Status::OK();
      } else if (immutable_db_options_.wal_recovery_mode ==
                 WALRecoveryMode::kPointInTimeRecovery) {
        // We should ignore the error but not continue replaying
        status = Status::OK();
        stop_replay_for_corruption = true; 
        corrupted_log_number = log_number;
        ROCKS_LOG_INFO(immutable_db_options_.info_log,
                       "Point in time recovered to log #%" PRIu64
                       " seq #%" PRIu64,
                       log_number, *next_sequence);
      } else {
        assert(immutable_db_options_.wal_recovery_mode ==
                   WALRecoveryMode::kTolerateCorruptedTailRecords ||
               immutable_db_options_.wal_recovery_mode ==
                   WALRecoveryMode::kAbsoluteConsistency);
        return status;
      }

pebble/open.go always returns an error. That is, it seems to behave with kAbsoluteConsistency. Should be easy to fix. Testing will be the hardest part.

from pebble.

justinj avatar justinj commented on August 20, 2024

Looking into this, doing the same repro steps I ended up with a different panic message:

panic: allocation failed because arena is full

goroutine 1 [running]:
github.com/petermattis/pebble.(*DB).replayWAL(0xc000088c00, 0xc0000e9a58, 0x125d300, 0x13f1e80, 0xc0000a1380, 0x10, 0x706412, 0x0, 0x0)
        /Users/justin/go/src/github.com/petermattis/pebble/open.go:258 +0xa70
github.com/petermattis/pebble.Open(0x7ffeefbff8dc, 0x5, 0xc0000e0210, 0x0, 0x0, 0x0)
        /Users/justin/go/src/github.com/petermattis/pebble/open.go:158 +0xedd
main.runTest(0x7ffeefbff8dc, 0x5, 0xc0000e9c70, 0xc0000e9c90, 0xc0000e9c80)
        /Users/justin/go/src/github.com/petermattis/pebble/cmd/pebble/test.go:193 +0x2a2
main.runSync(0x13d17a0, 0xc0000d00a0, 0x1, 0x5)
        /Users/justin/go/src/github.com/petermattis/pebble/cmd/pebble/sync.go:30 +0xb1
github.com/petermattis/pebble/vendor/github.com/spf13/cobra.(*Command).execute(0x13d17a0, 0xc0000d0050, 0x5, 0x5, 0x13d17a0, 0xc0000d0050)
        /Users/justin/go/src/github.com/petermattis/pebble/vendor/github.com/spf13/cobra/command.go:766 +0x2cc
github.com/petermattis/pebble/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x13d1a00, 0x0, 0x1228dc9, 0x16)
        /Users/justin/go/src/github.com/petermattis/pebble/vendor/github.com/spf13/cobra/command.go:852 +0x2fd
github.com/petermattis/pebble/vendor/github.com/spf13/cobra.(*Command).Execute(0x13d1a00, 0x13ba300, 0x1225335)
        /Users/justin/go/src/github.com/petermattis/pebble/vendor/github.com/spf13/cobra/command.go:800 +0x2b
main.main()
        /Users/justin/go/src/github.com/petermattis/pebble/cmd/pebble/main.go:49 +0x34a

Presumably related but perhaps different?

edit:
Ah, I see:

			if err == arenaskl.ErrArenaFull {
				// TODO(peter): write the memtable to disk.
				panic(err)
			}

from pebble.

petermattis avatar petermattis commented on August 20, 2024

It's possibly different, but might have the same root cause of trying to interpret data in the WAL inappropriately. In Pebble, the memtable has a single fixed size and effort is taken to make sure we never try to insert an update that will not fit. In RocksDB, the memtable is backed by a linked list of arenas (this approach isn't possible in Go because we can't put pointers in the arena).

It is possible there is missing logic in replayWAL to "rotate" the memtable when it is full.

from pebble.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.