Giter VIP home page Giter VIP logo

Comments (16)

otoolep avatar otoolep commented on May 24, 2024

How many nodes are you running in your cluster?

It takes a long time to select a new leader

Define long. How long is it taking? Please tell me in seconds.

Screenshots of logs are quite awkward. Can't you paste the logs into a gist?

from rqlite.

huanghj78 avatar huanghj78 commented on May 24, 2024

@otoolep Thank for your reply. I apologize for not providing detailed information to you.
I run 3 nodes in the cluster.
As can be seen from the screenshot of the log, it took 50s to elect the leader node. (09:46:32 received signal "terminated" - 09:47:22 eletion won term=1 tally=2)

from rqlite.

otoolep avatar otoolep commented on May 24, 2024

TBH your network looks like it's not working properly. All those IO timeouts imply the nodes can't connect within a few seconds to other nodes. I'm not sure I can do much about this. The logs show the nodes can't connect, and this doesn't look like any specific issue with rqlite. There have been zero other reports of anything like this with rqlite, so at this time I can only assume it's specific to your network.

You could try increasing your heartbeat and election timeouts (check the options you can pass to rqlited) to see if that helps reduce churn and elect a leader faster.

from rqlite.

huanghj78 avatar huanghj78 commented on May 24, 2024

The reason for the IO timeout is that the existing nodes(node1 and node2) cannot connect to the original leader node(node3) because node3 cannot find a new leader in the cluster after restart, so it exit. But this is not the key point.

rqlite-node3-1 | [cluster-join] 2024/03/01 16:51:09 failed to join via node at 172.16.237.100:2380: no leader
rqlite-node3-1 | [cluster-join] 2024/03/01 16:51:09 failed to join cluster at [172.16.237.100:2380], after 5 attempt(s)
rqlite-node3-1 | [rqlited] 2024/03/01 16:51:09 clustering failure: failed to join cluster: failed to join cluster
rqlite-node1-1 | 2024-03-01T16:51:09.733Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-01T16:51:09.733Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-01T16:51:09.747Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | 2024-03-01T16:51:09.760Z [ERROR] raft: failed to make requestVote RPC: target="{Voter 3 172.16.237.102:2380}" error="msgpack decode error [pos 632]: read tcp 172.16.237.100:43106->172.16.237.102:2380: read: connection reset by peer" term=0
rqlite-node3-1 exited with code 1

The confusing point is that after node3 restarts, the other two nodes cannot elect become leader node and instead print the following information repeatedly for 50s.

rqlite-node1-1 | 2024-03-01T16:51:57.389Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-01T16:51:57.389Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-01T16:51:57.402Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node2-1 | 2024-03-01T16:51:58.172Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node2-1 | 2024-03-01T16:51:58.172Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-01T16:51:58.188Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=

from rqlite.

otoolep avatar otoolep commented on May 24, 2024

That's a fair point. I still suggest you play around with your heartbeat and election timeouts, see if that helps your cluster settle faster. What you are seeing is just the Raft code in action, rqlite doesn't really do anything special.

from rqlite.

otoolep avatar otoolep commented on May 24, 2024

I just ran a test, 3 nodes running on a local machine. Kill the leader, new leader was elected in under a second.

Not saying you're not seeing long times, but without more information, and a way to reproduce what you're seeing, it's close to impossible to debug. So many things could be different about your network, resources, etc relative to mine.

That's why I suggest tuning the heartbeat and election timeout. Decreasing the heartbeat timeout will mean nodes will detect Leader failure faster (but may also result in false detection). Increasing the election timeout means that a node will wait longer for another node to become Leader before it starts an election (but may increase the time before a new Leader is actually elected). Try increasing the election timeout by 50%, see if that fixes the issue for you. Giving the cluster a couple more seconds to elect a new Leader may be all you need.

from rqlite.

huanghj78 avatar huanghj78 commented on May 24, 2024

@otoolep

I just ran a test, 3 nodes running on a local machine. Kill the leader, new leader was elected in under a second.

Thanks for your trying. Do you confirm that term is max uint64 before restarting the leader?

from rqlite.

otoolep avatar otoolep commented on May 24, 2024

Do you confirm that term is max uint64 before restarting the leader?

I don't understand what you're asking.

from rqlite.

huanghj78 avatar huanghj78 commented on May 24, 2024

This issue is reproduced as follows:

What did you do?
When the current term is 18446744073709551615 (max uint64) , restart current leader node to start a new election.

from rqlite.

otoolep avatar otoolep commented on May 24, 2024

How did you get your system into that state? This is why screenshots of logs are so difficult to work with. I cannot search the contents of those logs.

from rqlite.

otoolep avatar otoolep commented on May 24, 2024

OK, I now see it in the logs (but again, non-text logs are quite inconvenient). A "current term" value that high certainly doesn't look right, looks almost like a type issue: -1 as a signed 64 bit int, being interpreted as an unsigned int.

What you are seeing is in the Raft code that rqlite builds on, this does not appear to be a rqlite-specific issue. How did your system get into this state?

from rqlite.

otoolep avatar otoolep commented on May 24, 2024

I'll re-open this until we finish our discussion but I suspect this is a Raft issue, not really a rqlite-specific issue.

from rqlite.

huanghj78 avatar huanghj78 commented on May 24, 2024

@otoolep

How did your system get into this state?

I use a fault injection tool to modify the term field of AppendEntriesRequest to max uint64.
It is indeed very likely an issue with Raft.
Here is some crucial text logs.
Initial state

rqlite-node2-1 | 2024-03-04T02:07:01.324Z [WARN] raft: failed to get previous log: previous-index=3 last-index=0 error="log not found"
rqlite-node1-1 | 2024-03-04T02:07:01.325Z [WARN] raft: appendEntries rejected, sending older logs: peer="{Voter 2 172.16.237.101:2380}" next=1
rqlite-node1-1 | 2024-03-04T02:07:01.329Z [INFO] raft: pipelining replication: peer="{Voter 2 172.16.237.101:2380}"
rqlite-node1-1 | 2024-03-04T02:07:01.666Z [INFO] raft: updating configuration: command=AddVoter server-id=3 server-addr=172.16.237.102:2380 servers="[{Suffrage:Voter ID:1 Address:172.16.237.100:2380} {Suffrage:Voter ID:2 Address:172.16.237.101:2380} {Suffrage:Voter ID:3 Address:172.16.237.102:2380}]"
rqlite-node1-1 | 2024-03-04T02:07:01.669Z [INFO] raft: added peer, starting replication: peer=3
rqlite-node3-1 | 2024-03-04T02:07:01.675Z [WARN] raft: failed to get previous log: previous-index=4 last-index=0 error="log not found"
rqlite-node1-1 | 2024-03-04T02:07:01.676Z [WARN] raft: appendEntries rejected, sending older logs: peer="{Voter 3 172.16.237.102:2380}" next=1
rqlite-node1-1 | [store] 2024/03/04 02:07:01 node with ID 3, at 172.16.237.102:2380, joined successfully as voter
rqlite-node3-1 | [rqlited] 2024/03/04 02:07:01 successfully joined cluster at 172.16.237.100:2380
rqlite-node3-1 | [rqlited] 2024/03/04 02:07:01 node HTTP API available at http://172.16.237.102:2379/
rqlite-node3-1 | [rqlited] 2024/03/04 02:07:01 connect using the command-line tool via 'rqlite -H 172.16.237.102 -p 2379'
rqlite-node1-1 | 2024-03-04T02:07:01.680Z [INFO] raft: pipelining replication: peer="{Voter 3 172.16.237.102:2380}"

Modify the term of AppendEntriesRequest to max uint64 to make current term max

rqlite-node1-1 | 2024-03-04T02:07:16.536Z [ERROR] raft: peer has newer term, stopping replication: peer="{Voter 3 172.16.237.102:2380}"
rqlite-node1-1 | 2024-03-04T02:07:16.536Z [INFO] raft: aborting pipeline replication: peer="{Voter 3 172.16.237.102:2380}"
rqlite-node1-1 | 2024-03-04T02:07:16.536Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | 2024-03-04T02:07:16.536Z [INFO] raft: aborting pipeline replication: peer="{Voter 2 172.16.237.101:2380}"
rqlite-node1-1 | 2024-03-04T02:07:18.186Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-04T02:07:18.186Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=3
rqlite-node1-1 | 2024-03-04T02:07:18.199Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | 2024-03-04T02:07:18.504Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr=172.16.237.100:2380 last-leader-id=1
rqlite-node3-1 | 2024-03-04T02:07:18.504Z [INFO] raft: entering candidate state: node="Node at 172.16.237.102:2380 [Candidate]" term=18446744073709551615
rqlite-node3-1 | 2024-03-04T02:07:18.518Z [INFO] raft: election won: term=18446744073709551615 tally=2
rqlite-node3-1 | 2024-03-04T02:07:18.518Z [INFO] raft: entering leader state: leader="Node at 172.16.237.102:2380 [Leader]"
rqlite-node3-1 | 2024-03-04T02:07:18.518Z [INFO] raft: added peer, starting replication: peer=1
rqlite-node3-1 | 2024-03-04T02:07:18.518Z [INFO] raft: added peer, starting replication: peer=2
rqlite-node3-1 | 2024-03-04T02:07:18.520Z [INFO] raft: pipelining replication: peer="{Voter 2 172.16.237.101:2380}"
rqlite-node3-1 | 2024-03-04T02:07:18.521Z [INFO] raft: pipelining replication: peer="{Voter 1 172.16.237.100:2380}"

Restart new leader(node3)

rqlite-node3-1 | [rqlited] 2024/03/04 02:08:46 received signal "terminated", shutting down
rqlite-node3-1 | [http] 2024/03/04 02:08:46 closing HTTP service on 172.16.237.102:2379
rqlite-node3-1 | [http] 2024/03/04 02:08:46 HTTP service on 172.16.237.102:2379 stopped: http: Server closed
rqlite-node3-1 | [rqlited] 2024/03/04 02:08:46 stepping down as Leader before shutdown
rqlite-node1-1 | 2024-03-04T02:08:46.741Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node3-1 | 2024-03-04T02:08:46.742Z [INFO] raft: aborting pipeline replication: peer="{Voter 2 172.16.237.101:2380}"
rqlite-node3-1 | 2024-03-04T02:08:46.742Z [INFO] raft: aborting pipeline replication: peer="{Voter 1 172.16.237.100:2380}"
rqlite-node3-1 | [store] 2024/03/04 02:08:46 store closed with node ID 3, listening on 172.16.237.102:2380
rqlite-node3-1 | [rqlited] 2024/03/04 02:08:46 rqlite server stopped
rqlite-node1-1 | 2024-03-04T02:08:46.755Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | 2024-03-04T02:08:46.788Z [ERROR] raft: failed to make requestVote RPC: target="{Voter 3 172.16.237.102:2380}" error="msgpack decode error [pos 8979]: read tcp 172.16.237.100:51420->172.16.237.102:2380: read: connection reset by peer" term=0
rqlite-node2-1 | 2024-03-04T02:08:47.986Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr=172.16.237.102:2380 last-leader-id=3
rqlite-node2-1 | 2024-03-04T02:08:47.986Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-04T02:08:47.998Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 exited with code 0
rqlite-node3-1 exited with code 0
rqlite-node3-1 | [store] 2024/03/04 02:08:48 first log index: 1, last log index: 5, last applied index: 5, last command log index: 0:
rqlite-node3-1 | [store] 2024/03/04 02:08:48 created on-disk database at open
rqlite-node3-1 | 2024-03-04T02:08:48.273Z [INFO] raft: initial configuration: index=4 servers="[{Suffrage:Voter ID:1 Address:172.16.237.100:2380} {Suffrage:Voter ID:2 Address:172.16.237.101:2380} {Suffrage:Voter ID:3 Address:172.16.237.102:2380}]"
rqlite-node3-1 | [rqlited] 2024/03/04 02:08:48 checking that any supplied join addresses don't serve HTTP(S)
rqlite-node3-1 | 2024-03-04T02:08:48.273Z [INFO] raft: entering follower state: follower="Node at 172.16.237.102:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | [mux] 2024/03/04 02:08:48 handler not registered for request from 172.16.237.102:57278: 71 (unsupported protocol?)
rqlite-node1-1 | [mux] 2024/03/04 02:08:48 handler not registered for request from 172.16.237.102:57280: 22 (unsupported protocol?)
rqlite-node3-1 | [cluster-join] 2024/03/04 02:08:48 failed to join via node at 172.16.237.100:2380: no leader
rqlite-node3-1 | [cluster-join] 2024/03/04 02:08:48 failed to join cluster at [172.16.237.100:2380], sleeping 3s before retry
rqlite-node1-1 | 2024-03-04T02:08:48.607Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-04T02:08:48.607Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-04T02:08:48.619Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node2-1 | 2024-03-04T02:08:49.085Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node2-1 | 2024-03-04T02:08:49.085Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-04T02:08:49.096Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | 2024-03-04T02:08:49.591Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node3-1 | 2024-03-04T02:08:49.591Z [INFO] raft: entering candidate state: node="Node at 172.16.237.102:2380 [Candidate]" term=0
rqlite-node3-1 | 2024-03-04T02:08:49.603Z [INFO] raft: entering follower state: follower="Node at 172.16.237.102:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | 2024-03-04T02:08:49.779Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-04T02:08:49.779Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-04T02:08:49.791Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node2-1 | 2024-03-04T02:08:50.351Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node2-1 | 2024-03-04T02:08:50.351Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-04T02:08:50.364Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | 2024-03-04T02:08:50.691Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node3-1 | 2024-03-04T02:08:50.691Z [INFO] raft: entering candidate state: node="Node at 172.16.237.102:2380 [Candidate]" term=0
rqlite-node3-1 | 2024-03-04T02:08:50.702Z [INFO] raft: entering follower state: follower="Node at 172.16.237.102:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | [cluster-join] 2024/03/04 02:08:51 failed to join via node at 172.16.237.100:2380: no leader
rqlite-node3-1 | [cluster-join] 2024/03/04 02:08:51 failed to join cluster at [172.16.237.100:2380], sleeping 3s before retry
rqlite-node1-1 | 2024-03-04T02:08:51.567Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-04T02:08:51.567Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-04T02:08:51.579Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node2-1 | 2024-03-04T02:08:51.661Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node2-1 | 2024-03-04T02:08:51.662Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-04T02:08:51.676Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | 2024-03-04T02:08:52.668Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node3-1 | 2024-03-04T02:08:52.668Z [INFO] raft: entering candidate state: node="Node at 172.16.237.102:2380 [Candidate]" term=0
rqlite-node3-1 | 2024-03-04T02:08:52.680Z [INFO] raft: entering follower state: follower="Node at 172.16.237.102:2380 [Follower]" leader-address= leader-id=
rqlite-node2-1 | 2024-03-04T02:08:52.892Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node2-1 | 2024-03-04T02:08:52.892Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-04T02:08:52.909Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | 2024-03-04T02:08:53.517Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-04T02:08:53.517Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-04T02:08:53.529Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | 2024-03-04T02:08:53.716Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node3-1 | 2024-03-04T02:08:53.716Z [INFO] raft: entering candidate state: node="Node at 172.16.237.102:2380 [Candidate]" term=0
rqlite-node3-1 | 2024-03-04T02:08:53.728Z [INFO] raft: entering follower state: follower="Node at 172.16.237.102:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | [cluster-join] 2024/03/04 02:08:54 failed to join via node at 172.16.237.100:2380: no leader
rqlite-node3-1 | [cluster-join] 2024/03/04 02:08:54 failed to join cluster at [172.16.237.100:2380], sleeping 3s before retry
rqlite-node2-1 | 2024-03-04T02:08:54.605Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node2-1 | 2024-03-04T02:08:54.605Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-04T02:08:54.618Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | 2024-03-04T02:08:54.811Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-04T02:08:54.811Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-04T02:08:54.824Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | 2024-03-04T02:08:55.428Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node3-1 | 2024-03-04T02:08:55.428Z [INFO] raft: entering candidate state: node="Node at 172.16.237.102:2380 [Candidate]" term=0
rqlite-node3-1 | 2024-03-04T02:08:55.440Z [INFO] raft: entering follower state: follower="Node at 172.16.237.102:2380 [Follower]" leader-address= leader-id=
rqlite-node2-1 | 2024-03-04T02:08:55.659Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node2-1 | 2024-03-04T02:08:55.659Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-04T02:08:55.671Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | 2024-03-04T02:08:56.369Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-04T02:08:56.370Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-04T02:08:56.382Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | 2024-03-04T02:08:56.846Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node3-1 | 2024-03-04T02:08:56.847Z [INFO] raft: entering candidate state: node="Node at 172.16.237.102:2380 [Candidate]" term=0
rqlite-node3-1 | 2024-03-04T02:08:56.859Z [INFO] raft: entering follower state: follower="Node at 172.16.237.102:2380 [Follower]" leader-address= leader-id=
rqlite-node3-1 | [cluster-join] 2024/03/04 02:08:57 failed to join via node at 172.16.237.100:2380: no leader
rqlite-node3-1 | [cluster-join] 2024/03/04 02:08:57 failed to join cluster at [172.16.237.100:2380], sleeping 3s before retry
rqlite-node3-1 | [cluster-join] 2024/03/04 02:09:00 failed to join via node at 172.16.237.100:2380: no leader
rqlite-node3-1 | [cluster-join] 2024/03/04 02:09:00 failed to join cluster at [172.16.237.100:2380], after 5 attempt(s)
rqlite-node3-1 | [rqlited] 2024/03/04 02:09:00 clustering failure: failed to join cluster: failed to join cluster
rqlite-node3-1 exited with code 1

node1 and node2 keep changing between candidate and follower states for a long time until the following logs
rqlite-node1-1 | 2024-03-04T02:12:13.701Z [INFO] raft: duplicate requestVote for same term: term=0
rqlite-node2-1 | 2024-03-04T02:12:13.703Z [INFO] raft: duplicate requestVote for same term: term=0

rqlite-node2-1 | 2024-03-04T02:12:11.707Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node2-1 | 2024-03-04T02:12:11.707Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-04T02:12:11.720Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=
rqlite-node2-1 | 2024-03-04T02:12:12.167Z [ERROR] raft: failed to make requestVote RPC: target="{Voter 3 172.16.237.102:2380}" error="dial tcp 172.16.237.102:2380: connect: no route to host" term=0
rqlite-node2-1 | 2024-03-04T02:12:12.167Z [ERROR] raft: failed to make requestVote RPC: target="{Voter 3 172.16.237.102:2380}" error="dial tcp 172.16.237.102:2380: connect: no route to host" term=0
rqlite-node1-1 | 2024-03-04T02:12:12.572Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-04T02:12:12.572Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-04T02:12:12.584Z [INFO] raft: entering follower state: follower="Node at 172.16.237.100:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | 2024-03-04T02:12:13.691Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node1-1 | 2024-03-04T02:12:13.691Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=0
rqlite-node2-1 | 2024-03-04T02:12:13.693Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
rqlite-node2-1 | 2024-03-04T02:12:13.693Z [INFO] raft: entering candidate state: node="Node at 172.16.237.101:2380 [Candidate]" term=0
rqlite-node1-1 | 2024-03-04T02:12:13.701Z [INFO] raft: duplicate requestVote for same term: term=0
rqlite-node2-1 | 2024-03-04T02:12:13.703Z [INFO] raft: duplicate requestVote for same term: term=0
rqlite-node1-1 | 2024-03-04T02:12:13.863Z [ERROR] raft: failed to make requestVote RPC: target="{Voter 3 172.16.237.102:2380}" error="dial tcp 172.16.237.102:2380: connect: no route to host" term=0
rqlite-node1-1 | 2024-03-04T02:12:13.863Z [ERROR] raft: failed to make requestVote RPC: target="{Voter 3 172.16.237.102:2380}" error="dial tcp 172.16.237.102:2380: connect: no route to host" term=0
rqlite-node1-1 | 2024-03-04T02:12:13.863Z [ERROR] raft: failed to make requestVote RPC: target="{Voter 3 172.16.237.102:2380}" error="dial tcp 172.16.237.102:2380: connect: no route to host" term=0
rqlite-node1-1 | 2024-03-04T02:12:14.781Z [WARN] raft: Election timeout reached, restarting election
rqlite-node1-1 | 2024-03-04T02:12:14.781Z [INFO] raft: entering candidate state: node="Node at 172.16.237.100:2380 [Candidate]" term=1
rqlite-node2-1 | 2024-03-04T02:12:14.800Z [INFO] raft: entering follower state: follower="Node at 172.16.237.101:2380 [Follower]" leader-address= leader-id=
rqlite-node1-1 | 2024-03-04T02:12:14.800Z [INFO] raft: election won: term=1 tally=2
rqlite-node1-1 | 2024-03-04T02:12:14.800Z [INFO] raft: entering leader state: leader="Node at 172.16.237.100:2380 [Leader]"

from rqlite.

otoolep avatar otoolep commented on May 24, 2024

I use a fault injection tool to modify the term field of AppendEntriesRequest to max uint64.

You should have said so earlier. :-)

It is indeed very likely an issue with Raft.

In that case you could file an issue in the Raft repo: https://github.com/hashicorp/raft -- it might be an edge case they wish to handle better. These systems should be robust in the face of faults -- but they might believe it is, since leader election does succeed -- but perhaps they would like it to elect faster. I think closing the issue now is OK, but you can tell me if you disagree.

from rqlite.

otoolep avatar otoolep commented on May 24, 2024

If you're interested in digging into this issue further you may wish to see if it happens with https://github.com/otoolep/hraftd

hraftd is a much simpler system I built, which also uses the Hashicorp Raft system. If you can bring out the same behaviour with hraftd it's going to be much easier for the Hashicorp team to investigate (if they choose do so).

from rqlite.

huanghj78 avatar huanghj78 commented on May 24, 2024

Thank you! I will try it.

from rqlite.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.