Comments (16)
How many nodes are you running in your cluster?
It takes a long time to select a new leader
Define long. How long is it taking? Please tell me in seconds.
Screenshots of logs are quite awkward. Can't you paste the logs into a gist?
from rqlite.
@otoolep Thank for your reply. I apologize for not providing detailed information to you.
I run 3 nodes in the cluster.
As can be seen from the screenshot of the log, it took 50s to elect the leader node. (09:46:32 received signal "terminated" - 09:47:22 eletion won term=1 tally=2)
from rqlite.
TBH your network looks like it's not working properly. All those IO timeouts imply the nodes can't connect within a few seconds to other nodes. I'm not sure I can do much about this. The logs show the nodes can't connect, and this doesn't look like any specific issue with rqlite. There have been zero other reports of anything like this with rqlite, so at this time I can only assume it's specific to your network.
You could try increasing your heartbeat and election timeouts (check the options you can pass to rqlited
) to see if that helps reduce churn and elect a leader faster.
from rqlite.
The reason for the IO timeout is that the existing nodes(node1 and node2) cannot connect to the original leader node(node3) because node3 cannot find a new leader in the cluster after restart, so it exit. But this is not the key point.
rqlite-node3-1 | [cluster-join] 2024/03/01 16:51:09 failed to join via node at 172.16.237.100:2380: no leader
rqlite-node3-1 | [cluster-join] 2024/03/01 16:51:09 failed to join cluster at [172.16.237.100:2380], after 5 attempt(s)
rqlite-node3-1 | [rqlited] 2024/03/01 16:51:09 clustering failure: failed to join cluster: failed to join cluster
rqlite-node1-1 | 2024-03-01T16:51:09.733Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-01T16:51:09.733Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-01T16:51:09.747Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | 2024-03-01T16:51:09.760Z [ERROR] raft: failed to make requestVote RPC: target="{Voter 3 172.16.237.102:2380}" error="msgpack decode error [pos 632]: read tcp 172.16.237.100:43106->172.16.237.102:2380: read: connection reset by peer" term=0
rqlite-node3-1 exited with code 1
The confusing point is that after node3 restarts, the other two nodes cannot elect become leader node and instead print the following information repeatedly for 50s.
rqlite-node1-1 | 2024-03-01T16:51:57.389Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-01T16:51:57.389Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-01T16:51:57.402Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node2-1 | 2024-03-01T16:51:58.172Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node2-1 | 2024-03-01T16:51:58.172Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-01T16:51:58.188Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=
from rqlite.
That's a fair point. I still suggest you play around with your heartbeat and election timeouts, see if that helps your cluster settle faster. What you are seeing is just the Raft code in action, rqlite doesn't really do anything special.
from rqlite.
I just ran a test, 3 nodes running on a local machine. Kill the leader, new leader was elected in under a second.
Not saying you're not seeing long times, but without more information, and a way to reproduce what you're seeing, it's close to impossible to debug. So many things could be different about your network, resources, etc relative to mine.
That's why I suggest tuning the heartbeat and election timeout. Decreasing the heartbeat timeout will mean nodes will detect Leader failure faster (but may also result in false detection). Increasing the election timeout means that a node will wait longer for another node to become Leader before it starts an election (but may increase the time before a new Leader is actually elected). Try increasing the election timeout by 50%, see if that fixes the issue for you. Giving the cluster a couple more seconds to elect a new Leader may be all you need.
from rqlite.
I just ran a test, 3 nodes running on a local machine. Kill the leader, new leader was elected in under a second.
Thanks for your trying. Do you confirm that term is max uint64 before restarting the leader?
from rqlite.
Do you confirm that term is max uint64 before restarting the leader?
I don't understand what you're asking.
from rqlite.
This issue is reproduced as follows:
What did you do?
When the current term is 18446744073709551615 (max uint64) , restart current leader node to start a new election.
from rqlite.
How did you get your system into that state? This is why screenshots of logs are so difficult to work with. I cannot search the contents of those logs.
from rqlite.
OK, I now see it in the logs (but again, non-text logs are quite inconvenient). A "current term" value that high certainly doesn't look right, looks almost like a type issue: -1 as a signed 64 bit int, being interpreted as an unsigned int.
What you are seeing is in the Raft code that rqlite builds on, this does not appear to be a rqlite-specific issue. How did your system get into this state?
from rqlite.
I'll re-open this until we finish our discussion but I suspect this is a Raft issue, not really a rqlite-specific issue.
from rqlite.
How did your system get into this state?
I use a fault injection tool to modify the term field of AppendEntriesRequest
to max uint64.
It is indeed very likely an issue with Raft.
Here is some crucial text logs.
Initial state
rqlite-node2-1 | 2024-03-04T02:07:01.324Z [WARN] raft: failed to get previous log: previous-index=3 last-index=0 error="log not found"
rqlite-node1-1 | 2024-03-04T02:07:01.325Z [WARN] raft: appendEntries rejected, sending older logs: peer="{Voter 2 172.16.237.101:2380}" next=1
rqlite-node1-1 | 2024-03-04T02:07:01.329Z [INFO] raft: pipelining replication: peer="{Voter 2 172.16.237.101:2380}"
rqlite-node1-1 | 2024-03-04T02:07:01.666Z [INFO] raft: updating configuration: command=AddVoter server-id=3 server-addr=172.16.237.102:2380 servers="[{Suffrage:Voter ID:1 Address:172.16.237.100:2380} {Suffrage:Voter ID:2 Address:172.16.237.101:2380} {Suffrage:Voter ID:3 Address:172.16.237.102:2380}]"
rqlite-node1-1 | 2024-03-04T02:07:01.669Z [INFO] raft: added peer, starting replication: peer=3
rqlite-node3-1 | 2024-03-04T02:07:01.675Z [WARN] raft: failed to get previous log: previous-index=4 last-index=0 error="log not found"
rqlite-node1-1 | 2024-03-04T02:07:01.676Z [WARN] raft: appendEntries rejected, sending older logs: peer="{Voter 3 172.16.237.102:2380}" next=1
rqlite-node1-1 | [store] 2024/03/04 02:07:01 node with ID 3, at 172.16.237.102:2380, joined successfully as voter
rqlite-node3-1 | [rqlited] 2024/03/04 02:07:01 successfully joined cluster at 172.16.237.100:2380
rqlite-node3-1 | [rqlited] 2024/03/04 02:07:01 node HTTP API available at http://172.16.237.102:2379/
rqlite-node3-1 | [rqlited] 2024/03/04 02:07:01 connect using the command-line tool via 'rqlite -H 172.16.237.102 -p 2379'
rqlite-node1-1 | 2024-03-04T02:07:01.680Z [INFO] raft: pipelining replication: peer="{Voter 3 172.16.237.102:2380}"
Modify the term of AppendEntriesRequest
to max uint64 to make current term max
rqlite-node1-1 | 2024-03-04T02:07:16.536Z [ERROR] raft: peer has newer term, stopping replication: peer="{Voter 3 172.16.237.102:2380}"
rqlite-node1-1 | 2024-03-04T02:07:16.536Z [INFO] raft: aborting pipeline replication: peer="{Voter 3 172.16.237.102:2380}"
rqlite-node1-1 | 2024-03-04T02:07:16.536Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | 2024-03-04T02:07:16.536Z [INFO] raft: aborting pipeline replication: peer="{Voter 2 172.16.237.101:2380}"
rqlite-node1-1 | 2024-03-04T02:07:18.186Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-04T02:07:18.186Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=3
rqlite-node1-1 | 2024-03-04T02:07:18.199Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | 2024-03-04T02:07:18.504Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr=172.16.237.100:2380 last-leader-id=1
rqlite-node3-1 | 2024-03-04T02:07:18.504Z [INFO] raft: entering candidate state: node="Node at 172.16.237.102:2380 [Candidate]" term=18446744073709551615
rqlite-node3-1 | 2024-03-04T02:07:18.518Z [INFO] raft: election won: term=18446744073709551615 tally=2
rqlite-node3-1 | 2024-03-04T02:07:18.518Z [INFO] raft: entering leader state: leader="Node at 172.16.237.102:2380 [Leader]"
rqlite-node3-1 | 2024-03-04T02:07:18.518Z [INFO] raft: added peer, starting replication: peer=1
rqlite-node3-1 | 2024-03-04T02:07:18.518Z [INFO] raft: added peer, starting replication: peer=2
rqlite-node3-1 | 2024-03-04T02:07:18.520Z [INFO] raft: pipelining replication: peer="{Voter 2 172.16.237.101:2380}"
rqlite-node3-1 | 2024-03-04T02:07:18.521Z [INFO] raft: pipelining replication: peer="{Voter 1 172.16.237.100:2380}"
Restart new leader(node3)
rqlite-node3-1 | [rqlited] 2024/03/04 02:08:46 received signal "terminated", shutting down
rqlite-node3-1 | [http] 2024/03/04 02:08:46 closing HTTP service on 172.16.237.102:2379
rqlite-node3-1 | [http] 2024/03/04 02:08:46 HTTP service on 172.16.237.102:2379 stopped: http: Server closed
rqlite-node3-1 | [rqlited] 2024/03/04 02:08:46 stepping down as Leader before shutdown
rqlite-node1-1 | 2024-03-04T02:08:46.741Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node3-1 | 2024-03-04T02:08:46.742Z [INFO] raft: aborting pipeline replication: peer="{Voter 2 172.16.237.101:2380}"
rqlite-node3-1 | 2024-03-04T02:08:46.742Z [INFO] raft: aborting pipeline replication: peer="{Voter 1 172.16.237.100:2380}"
rqlite-node3-1 | [store] 2024/03/04 02:08:46 store closed with node ID 3, listening on 172.16.237.102:2380
rqlite-node3-1 | [rqlited] 2024/03/04 02:08:46 rqlite server stopped
rqlite-node1-1 | 2024-03-04T02:08:46.755Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | 2024-03-04T02:08:46.788Z [ERROR] raft: failed to make requestVote RPC: target="{Voter 3 172.16.237.102:2380}" error="msgpack decode error [pos 8979]: read tcp 172.16.237.100:51420->172.16.237.102:2380: read: connection reset by peer" term=0
rqlite-node2-1 | 2024-03-04T02:08:47.986Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr=172.16.237.102:2380 last-leader-id=3
rqlite-node2-1 | 2024-03-04T02:08:47.986Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-04T02:08:47.998Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 exited with code 0
rqlite-node3-1 exited with code 0
rqlite-node3-1 | [store] 2024/03/04 02:08:48 first log index: 1, last log index: 5, last applied index: 5, last command log index: 0:
rqlite-node3-1 | [store] 2024/03/04 02:08:48 created on-disk database at open
rqlite-node3-1 | 2024-03-04T02:08:48.273Z [INFO] raft: initial configuration: index=4 servers="[{Suffrage:Voter ID:1 Address:172.16.237.100:2380} {Suffrage:Voter ID:2 Address:172.16.237.101:2380} {Suffrage:Voter ID:3 Address:172.16.237.102:2380}]"
rqlite-node3-1 | [rqlited] 2024/03/04 02:08:48 checking that any supplied join addresses don't serve HTTP(S)
rqlite-node3-1 | 2024-03-04T02:08:48.273Z [INFO] raft: entering follower state: follower="Node at 172.16.237.102:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | [mux] 2024/03/04 02:08:48 handler not registered for request from 172.16.237.102:57278: 71 (unsupported protocol?)
rqlite-node1-1 | [mux] 2024/03/04 02:08:48 handler not registered for request from 172.16.237.102:57280: 22 (unsupported protocol?)
rqlite-node3-1 | [cluster-join] 2024/03/04 02:08:48 failed to join via node at 172.16.237.100:2380: no leader
rqlite-node3-1 | [cluster-join] 2024/03/04 02:08:48 failed to join cluster at [172.16.237.100:2380], sleeping 3s before retry
rqlite-node1-1 | 2024-03-04T02:08:48.607Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-04T02:08:48.607Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-04T02:08:48.619Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node2-1 | 2024-03-04T02:08:49.085Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node2-1 | 2024-03-04T02:08:49.085Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-04T02:08:49.096Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | 2024-03-04T02:08:49.591Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node3-1 | 2024-03-04T02:08:49.591Z [INFO] raft: entering candidate state: node="Node at 172.16.237.102:2380 [Candidate]" term=0
rqlite-node3-1 | 2024-03-04T02:08:49.603Z [INFO] raft: entering follower state: follower="Node at 172.16.237.102:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | 2024-03-04T02:08:49.779Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-04T02:08:49.779Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-04T02:08:49.791Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node2-1 | 2024-03-04T02:08:50.351Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node2-1 | 2024-03-04T02:08:50.351Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-04T02:08:50.364Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | 2024-03-04T02:08:50.691Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node3-1 | 2024-03-04T02:08:50.691Z [INFO] raft: entering candidate state: node="Node at 172.16.237.102:2380 [Candidate]" term=0
rqlite-node3-1 | 2024-03-04T02:08:50.702Z [INFO] raft: entering follower state: follower="Node at 172.16.237.102:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | [cluster-join] 2024/03/04 02:08:51 failed to join via node at 172.16.237.100:2380: no leader
rqlite-node3-1 | [cluster-join] 2024/03/04 02:08:51 failed to join cluster at [172.16.237.100:2380], sleeping 3s before retry
rqlite-node1-1 | 2024-03-04T02:08:51.567Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-04T02:08:51.567Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-04T02:08:51.579Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node2-1 | 2024-03-04T02:08:51.661Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node2-1 | 2024-03-04T02:08:51.662Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-04T02:08:51.676Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | 2024-03-04T02:08:52.668Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node3-1 | 2024-03-04T02:08:52.668Z [INFO] raft: entering candidate state: node="Node at 172.16.237.102:2380 [Candidate]" term=0
rqlite-node3-1 | 2024-03-04T02:08:52.680Z [INFO] raft: entering follower state: follower="Node at 172.16.237.102:2380 [Follower]" leader-address= leader-id=
rqlite-node2-1 | 2024-03-04T02:08:52.892Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node2-1 | 2024-03-04T02:08:52.892Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-04T02:08:52.909Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | 2024-03-04T02:08:53.517Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-04T02:08:53.517Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-04T02:08:53.529Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | 2024-03-04T02:08:53.716Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node3-1 | 2024-03-04T02:08:53.716Z [INFO] raft: entering candidate state: node="Node at 172.16.237.102:2380 [Candidate]" term=0
rqlite-node3-1 | 2024-03-04T02:08:53.728Z [INFO] raft: entering follower state: follower="Node at 172.16.237.102:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | [cluster-join] 2024/03/04 02:08:54 failed to join via node at 172.16.237.100:2380: no leader
rqlite-node3-1 | [cluster-join] 2024/03/04 02:08:54 failed to join cluster at [172.16.237.100:2380], sleeping 3s before retry
rqlite-node2-1 | 2024-03-04T02:08:54.605Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node2-1 | 2024-03-04T02:08:54.605Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-04T02:08:54.618Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | 2024-03-04T02:08:54.811Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-04T02:08:54.811Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-04T02:08:54.824Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | 2024-03-04T02:08:55.428Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node3-1 | 2024-03-04T02:08:55.428Z [INFO] raft: entering candidate state: node="Node at 172.16.237.102:2380 [Candidate]" term=0
rqlite-node3-1 | 2024-03-04T02:08:55.440Z [INFO] raft: entering follower state: follower="Node at 172.16.237.102:2380 [Follower]" leader-address= leader-id=
rqlite-node2-1 | 2024-03-04T02:08:55.659Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node2-1 | 2024-03-04T02:08:55.659Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-04T02:08:55.671Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | 2024-03-04T02:08:56.369Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-04T02:08:56.370Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-04T02:08:56.382Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | 2024-03-04T02:08:56.846Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node3-1 | 2024-03-04T02:08:56.847Z [INFO] raft: entering candidate state: node="Node at 172.16.237.102:2380 [Candidate]" term=0
rqlite-node3-1 | 2024-03-04T02:08:56.859Z [INFO] raft: entering follower state: follower="Node at 172.16.237.102:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | [cluster-join] 2024/03/04 02:08:57 failed to join via node at 172.16.237.100:2380: no leader
rqlite-node3-1 | [cluster-join] 2024/03/04 02:08:57 failed to join cluster at [172.16.237.100:2380], sleeping 3s before retry
rqlite-node3-1 | [cluster-join] 2024/03/04 02:09:00 failed to join via node at 172.16.237.100:2380: no leader
rqlite-node3-1 | [cluster-join] 2024/03/04 02:09:00 failed to join cluster at [172.16.237.100:2380], after 5 attempt(s)
rqlite-node3-1 | [rqlited] 2024/03/04 02:09:00 clustering failure: failed to join cluster: failed to join cluster
rqlite-node3-1 exited with code 1
node1 and node2 keep changing between candidate and follower states for a long time until the following logs
rqlite-node1-1 | 2024-03-04T02:12:13.701Z [INFO] raft: duplicate requestVote for same term: term=0
rqlite-node2-1 | 2024-03-04T02:12:13.703Z [INFO] raft: duplicate requestVote for same term: term=0
rqlite-node2-1 | 2024-03-04T02:12:11.707Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node2-1 | 2024-03-04T02:12:11.707Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-04T02:12:11.720Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=
rqlite-node2-1 | 2024-03-04T02:12:12.167Z [ERROR] raft: failed to make requestVote RPC: target="{Voter 3 172.16.237.102:2380}" error="dial tcp 172.16.237.102:2380: connect: no route to host" term=0
rqlite-node2-1 | 2024-03-04T02:12:12.167Z [ERROR] raft: failed to make requestVote RPC: target="{Voter 3 172.16.237.102:2380}" error="dial tcp 172.16.237.102:2380: connect: no route to host" term=0
rqlite-node1-1 | 2024-03-04T02:12:12.572Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-04T02:12:12.572Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-04T02:12:12.584Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | 2024-03-04T02:12:13.691Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-04T02:12:13.691Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-04T02:12:13.693Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node2-1 | 2024-03-04T02:12:13.693Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-04T02:12:13.701Z [INFO] raft: duplicate requestVote for same term: term=0
rqlite-node2-1 | 2024-03-04T02:12:13.703Z [INFO] raft: duplicate requestVote for same term: term=0
rqlite-node1-1 | 2024-03-04T02:12:13.863Z [ERROR] raft: failed to make requestVote RPC: target="{Voter 3 172.16.237.102:2380}" error="dial tcp 172.16.237.102:2380: connect: no route to host" term=0
rqlite-node1-1 | 2024-03-04T02:12:13.863Z [ERROR] raft: failed to make requestVote RPC: target="{Voter 3 172.16.237.102:2380}" error="dial tcp 172.16.237.102:2380: connect: no route to host" term=0
rqlite-node1-1 | 2024-03-04T02:12:13.863Z [ERROR] raft: failed to make requestVote RPC: target="{Voter 3 172.16.237.102:2380}" error="dial tcp 172.16.237.102:2380: connect: no route to host" term=0
rqlite-node1-1 | 2024-03-04T02:12:14.781Z [WARN] raft: Election timeout reached, restarting election
rqlite-node1-1 | 2024-03-04T02:12:14.781Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=1
rqlite-node2-1 | 2024-03-04T02:12:14.800Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | 2024-03-04T02:12:14.800Z [INFO] raft: election won: term=1 tally=2
rqlite-node1-1 | 2024-03-04T02:12:14.800Z [INFO] raft: entering leader state: leader="Node at 172.16.237.100:2380 [Leader]"
from rqlite.
I use a fault injection tool to modify the term field of AppendEntriesRequest to max uint64.
You should have said so earlier. :-)
It is indeed very likely an issue with Raft.
In that case you could file an issue in the Raft repo: https://github.com/hashicorp/raft -- it might be an edge case they wish to handle better. These systems should be robust in the face of faults -- but they might believe it is, since leader election does succeed -- but perhaps they would like it to elect faster. I think closing the issue now is OK, but you can tell me if you disagree.
from rqlite.
If you're interested in digging into this issue further you may wish to see if it happens with https://github.com/otoolep/hraftd
hraftd is a much simpler system I built, which also uses the Hashicorp Raft system. If you can bring out the same behaviour with hraftd it's going to be much easier for the Hashicorp team to investigate (if they choose do so).
from rqlite.
Thank you! I will try it.
from rqlite.
Related Issues (20)
- Add "vtypes" field to API response to indicate actual value types
- Read-only node showing failed to open store: set log info: failed to get last command index HOT 4
- Can't create a empty database in rqlite shell HOT 3
- failed to install snapshot HOT 26
- Recovery process can result in old snapshot getting sent to node HOT 43
- Rust sqlx-rqlite HOT 1
- CLI support Home and End key
- Build and upload binaries automatically HOT 3
- Which version of Sqlite dialect is supported? HOT 4
- PRAGMA foreign_keys are turning them off even we turn them on. HOT 4
- connect rqlite over unix sockets HOT 2
- how to debug random "database disk image is malformed" error? HOT 10
- rqlited + DNS client 100% CPU usage after network disconnecting (windows) HOT 21
- CTRL-C should stop the process HOT 1
- [FeatureRequest] Make number of retries configurable for /nodes and potentially other relevant http calls HOT 7
- Build from source fails in Windows VM HOT 7
- Synchronisation bug related to http.Server.AllowedOrigin HOT 1
- Dynamic backup file naming HOT 9
- Multiple issues with snapshot process HOT 5
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from rqlite.