
Comments (43)

otoolep avatar otoolep commented on May 27, 2024 1

Not sure; perhaps blow away the snapshots in every directory except one node's. Not advocating this right now; I need to look into the code more.

from rqlite.

otoolep avatar otoolep commented on May 27, 2024 1

Sync simply changes the timing of the whole test substantially; that is my suspicion for the difference. Sync absolutely is a read-only operation, and does not change the operation of a node.

from rqlite.

otoolep avatar otoolep commented on May 27, 2024 1

Can you try bumping the timeout to 30s on two of the nodes? Run the test again, and send me the logs if you hit it.

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

@dwco-z - this may not be solvable through code, and may require some changes to the Recovery process steps.

from rqlite.

dwco-z avatar dwco-z commented on May 27, 2024

@otoolep do you mean maybe I should adjust the way I'm doing the recovery? e.g., adding some random timeout during each node's startup to reduce the chances of the race condition?

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

@dwco-z -- maybe you can run the latest code, which will be available once this build completes: https://ci.appveyor.com/project/otoolep/rqlite/builds/49333481

I couldn't reproduce the issue locally, so your help would be good. Re-run your test, but with -raft-log-level=INFO, get the issue to occur, and then attach the logs to this bug.

from rqlite.

dwco-z avatar dwco-z commented on May 27, 2024

Sure, will do

from rqlite.

dwco-z avatar dwco-z commented on May 27, 2024

Here it is. I changed from -raft-log-level=DEBUG to -raft-log-level=INFO and captured the issue again:
node2.log
node3.log
node1.log

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

OK, as I suspected, more snapshot reaping is going on.

It's not immediately clear to me how to solve this issue. I'll need to think about it.

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

Couple of ideas, for my own reference:

  • dig into why the Leader is sending an old snapshot to the follower
  • consider reaping the snapshot in the future, if a node gets an older snapshot. Need to be cautious, as this could be a sign that something has gone wrong.

from rqlite.

dwco-z avatar dwco-z commented on May 27, 2024

> OK, as I suspected, more snapshot reaping is going on.
>
> It's not immediately clear to me how to solve this issue. I'll need to think about it.

Ok, anytime you need me to test changes just let me know. I believe you could not reproduce the bug because of some timing difference as you mentioned before.

from rqlite.

dwco-z avatar dwco-z commented on May 27, 2024

Just in case you need logs from the last created PR:
node2.log
node3.log
node1.log

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

Thanks. No surprises; what I expected is happening.

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

@dwco-z -- can you run again, using the latest build from the same PR (I've updated the code a little) and run with -raft-log-level=DEBUG? Then send me the logs again -- thanks.

#1719

from rqlite.

dwco-z avatar dwco-z commented on May 27, 2024

Sure

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

I've actually merged the PR to master, because until we get to the bottom of this, getting an older snapshot breaks rqlite. The Raft paper does reference this possibility (and suggests ways to deal with it), but either the Raft layer itself is sending it incorrectly, or rqlite needs to handle it better. I want to know which it is before removing this check.

from rqlite.

dwco-z avatar dwco-z commented on May 27, 2024

> I've actually merged the PR to master, because until we get to the bottom of this, getting an older snapshot breaks rqlite. The Raft paper does reference this possibility (and suggests ways to deal with it), but either the Raft layer itself is sending it incorrectly, or rqlite needs to handle it better. I want to know which it is before removing this check.

That makes sense. As soon as AppVeyor completes I'll send you the logs.

from rqlite.

dwco-z avatar dwco-z commented on May 27, 2024

New logs
node2.log
node3.log
node1.log

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

OK, thanks, logs are good to see. However, this issue is quite complex, with a lot of moving parts. Unless I can get it to happen locally, it's going to be difficult to solve. It's very timing-dependent and I don't know if it's a bug in the Raft code, a bug in rqlite code, or just that recovering a cluster like this is at the edge of what Raft and rqlite support (Recovery is not something that is described in the Raft paper).

I'll continue to see if I can reproduce this locally, but unless I do, I do not expect to solve this anytime soon. A more robust Recovery process may simply be to always recover one node first, and then join new nodes to it.
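
For illustration, a minimal sketch of that "recover one node first, then join the others" ordering (flag names as I understand rqlited's CLI; ports, data directories, and the wait_ready helper are hypothetical, and node 1's data directory is assumed to already contain the recovery peers.json):

    # Sketch only: bring up the recovered node on its own, then join fresh nodes to it.
    import subprocess, time, urllib.request

    def wait_ready(http_addr, timeout=60):
        # Poll the node's /readyz endpoint until it reports ready, or give up.
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                with urllib.request.urlopen(f"http://{http_addr}/readyz") as resp:
                    if resp.status == 200:
                        return True
            except OSError:
                pass
            time.sleep(1)
        return False

    # 1. Start only the recovered node and wait until it is ready (i.e. has a leader: itself).
    subprocess.Popen(["rqlited", "-node-id", "1",
                      "-http-addr", "localhost:4001",
                      "-raft-addr", "localhost:4002", "data1"])
    assert wait_ready("localhost:4001"), "recovered node never became ready"

    # 2. Only then start brand-new nodes and join them to the recovered node.
    for node_id, http_addr, raft_addr, data_dir in [("2", "localhost:4003", "localhost:4004", "data2"),
                                                    ("3", "localhost:4005", "localhost:4006", "data3")]:
        subprocess.Popen(["rqlited", "-node-id", node_id,
                          "-http-addr", http_addr,
                          "-raft-addr", raft_addr,
                          "-join", "localhost:4002", data_dir])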

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

@dwco-z -- when you run your script, does it exit automatically when the issue happens? Or are you just watching the logs?

from rqlite.

dwco-z avatar dwco-z commented on May 27, 2024

> @dwco-z -- when you run your script, does it exit automatically when the issue happens? Or are you just watching the logs?

It exits automatically:

leader still not ready...
leader still not ready...
leader still not ready...
leader still not ready...
leader still not ready...
leader still not ready...
leader still not ready...
leader still not ready...
stopping rqlited...
stopped!
stopping rqlited...
stopped!
stopping rqlited...
stopped!
Traceback (most recent call last):
  File "C:\Projects\rqlite-throubleshoot\main.py", line 85, in <module>
    start_all_instances(all_instances)
  File "C:\Projects\rqlite-throubleshoot\main.py", line 55, in start_all_instances
    future.result()
  File "C:\Users\DayvidCastro\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\DayvidCastro\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
  File "C:\Users\DayvidCastro\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Projects\rqlite-throubleshoot\main.py", line 49, in start_manager
    manager.start_rqlited(joinAddresses)
  File "C:\Projects\rqlite-throubleshoot\rqlite_manager.py", line 58, in start_rqlited
    raise Exception('something is wrong, leader is not ready...')
Exception: something is wrong, leader is not ready...
→ C:\Projects\rqlite-throubleshoot [rqlite-8.22.1-issue]›

from rqlite.

dwco-z avatar dwco-z commented on May 27, 2024

I'll try running on my Linux machine with Python 3.8 to see if I can reproduce it there.

from rqlite.

dwco-z avatar dwco-z commented on May 27, 2024

I pushed some changes to rqlite-throubleshooting to make it compatible with Linux, and I was able to hit the issue on Linux Mint using the latest release, rqlite-v8.22.2-linux-amd64.tar.gz.

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

Great, it's happening for me now with your latest script.

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

Took about 1-2 minutes.

from rqlite.

dwco-z avatar dwco-z commented on May 27, 2024

Nice!

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

Sometimes the script is exiting, but the logs don't show the issue. Just an FYI.

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

node1.log
node2.log
node3.log

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

The logs are helping to understand this issue.

I'm not sure yet, but the RecoverNode documentation (which I got from Consul) does state that uncommitted entries and data loss can result from this process. When the issue happens I am seeing confusing elections taking place, and the logs are not entirely clear -- it could be the side-effects of this process. It may be that the RecoverNode process is not entirely bulletproof, and is a best-effort thing. Even the Raft docs say they've had issues in the past with this area of the code.

I will continue to think about this, and have some ideas for a fix, but I'm reluctant to make changes at this time that affect the core of rqlite for an edge case (which may even be a best effort thing). The only truly robust way to recover a cluster that has suffered a big loss is to take the SQLite database out from under a node (or use a backup) and rebuild the cluster.
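
As a rough illustration of that rebuild approach, assuming rqlite's /db/backup and /db/load HTTP endpoints behave as documented (hosts and ports here are hypothetical):

    # Sketch only: pull a backup from a node that still has the data, then load it
    # into a freshly started single-node instance and rebuild the cluster around that.
    import urllib.request

    # 1. Back up the SQLite database from the surviving node.
    with urllib.request.urlopen("http://localhost:4001/db/backup") as resp, \
            open("backup.sqlite", "wb") as out:
        out.write(resp.read())

    # 2. Load the backup into a fresh, single-node rqlite instance; join new
    #    nodes to it only after the load completes.
    with open("backup.sqlite", "rb") as f:
        req = urllib.request.Request("http://localhost:5001/db/load",
                                     data=f.read(),
                                     headers={"Content-Type": "application/octet-stream"})
        urllib.request.urlopen(req)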

from rqlite.

dwco-z avatar dwco-z commented on May 27, 2024

> Sometimes the script is exiting, but the logs don't show the issue. Just an FYI.

Judging from node3.log, it seems node3 received an interrupt signal ([rqlited] 2024/03/06 15:30:07 received signal "interrupt"), but for some reason it cannot stop, because it appears to be trying to contact the other two nodes that have already stopped. So 30 seconds later, at 15:30:37, RqliteManager.stop_rqlited times out and falls back to process.terminate(), which for some reason didn't stop everything before the next start, so the port was still being held. I guess changing to process.kill() might avoid this.
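
Something like this is what I mean for stop_rqlited (a sketch only; the 30-second value just mirrors the current timeout):

    # Sketch of the fallback: ask nicely first, then kill if the process won't exit.
    import subprocess

    def stop_rqlited(process: subprocess.Popen, timeout: float = 30.0) -> None:
        process.terminate()                 # request a graceful shutdown
        try:
            process.wait(timeout=timeout)   # give rqlited time to exit cleanly
        except subprocess.TimeoutExpired:
            process.kill()                  # force it, so the port is released
            process.wait()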

from rqlite.

dwco-z avatar dwco-z commented on May 27, 2024

One thing I find odd is that although snapshotting and readyz?sync are two completely different concepts, as you already mentioned, I cannot reproduce without the sync flag (I tried 260 times). If I make other changes instead, like inserting a random delay before each node's start, I can still hit the issue eventually.

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

So this is what is happening, according to logs from my latest run of this:

  • All 3 nodes start fresh.
  • Node 1 wins the initial election, increasing its term to 2.
  • Node 2 starts an election too, increases its term to 3, and wins the election.
  • Node 3 next starts an election, increases its term to 4, and wins the election.

So basically all 3 nodes decide to start an election fairly quickly due to -raft-timeout being 2 seconds.

Then the nodes are stopped, and a recovery process is triggered. As part of the Recovery process, each node creates a snapshot.

  • Node 1 creates 2-2-TTTTT as its snapshot where TTTTT is a timestamp. 2-2 comes from the last log index and term in the node's log.
  • Node 2 creates 3-3-TTTTT as its snapshot.
  • Node 3 creates 4-4-TTTTT as its snapshot.

Node 2 then triggers an election, bumps its term to 5, and wins with votes from itself and Node 3 (Node 1 hasn't finished its recovery so doesn't vote). Node 2 starts replication of logs to other nodes (via AppendEntries requests). Node 2 checks if Node 3 has the log at Index 3, which Node 3 doesn't, as it has compacted its log during the Recovery process. This triggers Node 2 to send its snapshot 3-3-TTTTT to Node 3, which Node 3 rejects as it has a snapshot from a later term, 4-4-TTTTT.

And then we're stuck. This is because rqlite assumes that any snapshot received from another node, with instructions to install that snapshot, will be from index and term at least equal to that of the latest snapshot the receiving node has (this was made explicit on master recently, but was implicit in the Snapshot reaping logic). This is not true on Node 3.

One solution to this issue is to remove this assumption, and allow "older" snapshots to be installed. To allow this the Snapshot-reaping process would need to be changed so that it can reap snapshots both older and newer (in terms of index and term) than the most recently installed snapshot (today it just reaps everything but the "newest in index and term", which wipes out the most recently installed snapshot). The Raft paper indicates that this should be OK because it is Node 2 that is sending the snapshot, Node 2 is leader, and the Raft paper says that Node 2 couldn't become Leader in the first place if it didn't have all the log entries that every other node had too (section 5.4.1). So in that sense it should be "safe" and no data should be lost.

I'll think about this more, but I think Snapshot Reaping needs to be changed from "reap all but the latest snapshot" to "reap all but the most recently installed snapshot". This should solve this problem.
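
To make that idea concrete, here is a rough sketch in Python (the real code is Go, and "most recently installed" is approximated here by directory mtime, which an actual implementation would need to track more robustly):

    # Sketch only: snapshots are named INDEX-TERM-TIMESTAMP. Reap everything
    # except the most recently installed snapshot, rather than everything
    # except the one with the highest INDEX-TERM.
    import os, shutil

    def reap_snapshots(snap_dir: str) -> None:
        snaps = [d for d in os.listdir(snap_dir)
                 if os.path.isdir(os.path.join(snap_dir, d))]
        if not snaps:
            return
        # "Most recently installed" approximated by directory mtime.
        keep = max(snaps, key=lambda d: os.path.getmtime(os.path.join(snap_dir, d)))
        for d in snaps:
            if d != keep:
                shutil.rmtree(os.path.join(snap_dir, d))

    # The current behaviour, for contrast, keeps the highest index/term instead:
    #   keep = max(snaps, key=lambda d: tuple(int(x) for x in d.split("-")[:2]))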

from rqlite.

dwco-z avatar dwco-z commented on May 27, 2024

Oh, now I can see why the timing is relevant in leading to this series of events. Yeah, that makes sense.

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

Yeah, very timing dependent, that's why it takes a few attempts.

I can't rush into fixing this however. While in principle it's fairly easy to see a possible fix, the hard parts are dealing with crash scenarios. rqlite must handle crashing at anytime, particularly while taking a snapshot. There is important logic that rqlite runs at start-up, checking for inconsistencies. That logic today assumes that the most recent snapshot it finds is the desired one. That assumption could be changed by any fix for this particular issue -- and that might involve even more complicated consistency checking code, which can result in new bugs. Since crashing during snapshotting is much more likely to happen than this issue (rqlited can be killed at anytime by an operator), I have to consider changes very carefully.

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

By crashing I also mean "exiting hard unexpectedly", perhaps due to SIGKILL being sent to rqlited.

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

I've updated the node-recovery docs due to this issue.

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

@dwco-z -- if my reasoning is correct here, I suspect you'll see much more robust behavior if you increase the election timeout; maybe try 5 seconds. We want to reduce the rate of elections during the recovery process.

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

@dwco-z -- correction, increase the election timeout for 2 out of the 3 nodes, thereby ensuring 1 node will be elected much faster than the other two nodes. Make two of the nodes, say, 10 seconds, and leave the other at 2 seconds.

Yes, this is lots of special casing, but Node Recovery, by its nature, is meant to be something done under very controlled conditions.
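
As a sketch of the launch sequence I mean (flag name as used earlier in this thread -- check rqlited -h for the exact spelling -- and the ports and data directories are hypothetical):

    # Sketch only: give one node a short election timeout and the other two a
    # much longer one, so exactly one node races to become leader on recovery.
    import subprocess

    nodes = [
        ("1", "localhost:4001", "localhost:4002", "data1", "2s"),
        ("2", "localhost:4003", "localhost:4004", "data2", "10s"),
        ("3", "localhost:4005", "localhost:4006", "data3", "10s"),
    ]

    for node_id, http_addr, raft_addr, data_dir, timeout in nodes:
        subprocess.Popen(["rqlited", "-node-id", node_id,
                          "-http-addr", http_addr,
                          "-raft-addr", raft_addr,
                          "-raft-timeout", timeout, data_dir])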

from rqlite.

dwco-z avatar dwco-z commented on May 27, 2024

I understand your point of view. I updated the rqlite-throubleshooting code with your recommendation of having one of the nodes use a smaller raft-timeout (2s) while the others use a higher timeout (10s), but I couldn't notice any difference in stability; it seems to hit the issue with roughly the same frequency. I'll try different settings to see if it improves.

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

> ...having one of the nodes use a smaller raft-timeout (2s) while the others use a higher timeout (10s), but I couldn't notice any difference in stability...

That's interesting. I would definitely be interested in seeing the logs from such a setup, where the problem happens. The logs should clearly show that the other 2 nodes take much longer to trigger an election.

from rqlite.

dwco-z avatar dwco-z commented on May 27, 2024

Sure, I just hit a snapshot issue in the second run:

node1.log

node3.log

node2.log

from rqlite.

dwco-z avatar dwco-z commented on May 27, 2024

node2.log
node3.log
node1.log

from rqlite.

otoolep avatar otoolep commented on May 27, 2024

OK, thanks. Well, clearly I don't fully understand the innards of some of the Raft code, because that flag is not having the effect I want, at least not for the very first election. The nodes appear to be going straight to candidate.

I'll just need to keep digging into the Raft code.

from rqlite.
