Hi Tobias,
There are three consistency levels that a server can operate in: low, medium, or high.
- In low, all reads are served regardless of leadership status.
- In medium, only the leader serves reads, but there is only a soft leader check. This can cause a read to be stale in rare cases, such as during a leadership change.
- In high, the leader will apply the read command through the raft log just like a write, except the command will contain only enough data to increment the log index. This ensures that the read happens in sync with any write command and that there's a consensus prior to responding to the client.
Writes are always consistent, but read consistency is controlled by the --consistency flag. By default SummitDB sets this flag to high.
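To make the three levels concrete, here's a minimal sketch (illustrative names, not SummitDB's actual code) of how a node might decide what to do with a read:

```go
package main

import "fmt"

// ConsistencyLevel mirrors the three --consistency settings described above.
type ConsistencyLevel int

const (
	Low ConsistencyLevel = iota
	Medium
	High
)

// serveRead sketches a node's read decision. isLeader is the node's
// local (possibly stale) view of its own role.
func serveRead(level ConsistencyLevel, isLeader bool) string {
	switch level {
	case Low:
		return "serve locally" // any node answers; may be stale
	case Medium:
		if isLeader {
			return "serve locally" // soft check; stale during leadership change
		}
		return "redirect to leader"
	default: // High
		if isLeader {
			return "send through raft log" // read is sequenced with writes
		}
		return "redirect to leader"
	}
}

func main() {
	fmt.Println(serveRead(Low, false))
	fmt.Println(serveRead(Medium, true))
	fmt.Println(serveRead(High, true))
}
```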
SummitDB uses Finn to handle all Raft logic; Finn in turn relies on Hashicorp Raft.
The primary code that applies commands is nodeApplier.Apply. This function handles both read and write commands. Write commands are always sent through the Raft log. For read commands the mutate param is nil, and the logic is then handled by Node.raftLevelGuard. This function determines how to handle the read based on the required consistency level. When the level is High the read is sent through the log. Not the entire read command is sent through the log, though; only enough data to increment the log index.
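The "only enough data to increment the log index" idea can be sketched like this, with a toy in-memory log standing in for Raft (Finn's real API will differ):

```go
package main

import "fmt"

// A sketch (not SummitDB's code) of a linearizable read at consistency
// level high: the leader appends a near-empty entry whose only purpose is
// to claim a log index, waits for it to apply, then performs the read.

type fakeLog struct {
	entries []string
	applied int
}

func (l *fakeLog) append(e string) int {
	l.entries = append(l.entries, e)
	return len(l.entries) - 1
}

func (l *fakeLog) waitApplied(idx int) {
	// real code would block on a future; here everything commits instantly
	if l.applied < idx {
		l.applied = idx
	}
}

func linearizableRead(l *fakeLog, read func() string) string {
	idx := l.append("noop") // just enough data to increment the log index
	l.waitApplied(idx)      // the read is now ordered after all prior writes
	return read()
}

func main() {
	log := &fakeLog{}
	log.append("SET x y")
	fmt.Println(linearizableRead(log, func() string { return "y" }))
}
```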
The leadership changes section made me nervous because it looks like there is some reliance on the node's perceived Raft status (which may not be the actual Raft status).
When a server receives a command that it cannot apply because it is not the leader, that server will suggest that the client try it on the leader node:
> SET x y
-TRY 127.0.0.1:7481
This is only a suggestion based on what is known about the cluster at the time the command was attempted to be applied.
When the client retries the command on the suggested server, and for whatever reason that server is no longer the leader, then another TRY response will occur. This could go on until the correct leader handles the command. In most cases a TRY response would happen infrequently, typically following leadership changes.
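A client-side retry loop for this could look roughly like the sketch below; send stands in for a real network round trip, and the reply strings match the example above:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// execute follows -TRY redirects until a node accepts the command or a
// hop limit is reached. This is an illustrative sketch, not a SummitDB
// client library.
func execute(send func(addr, cmd string) string, addr, cmd string, maxHops int) (string, error) {
	for i := 0; i < maxHops; i++ {
		reply := send(addr, cmd)
		if strings.HasPrefix(reply, "-TRY ") {
			addr = strings.TrimPrefix(reply, "-TRY ") // follow the suggestion
			continue
		}
		return reply, nil
	}
	return "", errors.New("too many leadership redirects")
}

func main() {
	leader := "127.0.0.1:7481"
	// fake transport: only the leader accepts, everyone else redirects
	send := func(addr, cmd string) string {
		if addr != leader {
			return "-TRY " + leader
		}
		return "+OK"
	}
	reply, _ := execute(send, "127.0.0.1:7480", "SET x y", 5)
	fmt.Println(reply)
}
```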
I feel the solution I have in place is working well, but if you see a fault in my logic please let me know right away. I spent quite a bit of time around this specific problem.
Thanks for your feedback.
from summitdb.
I'm closing this issue for now. Please feel free to reopen this issue if you run into any problems regarding this topic.
Thanks @tidwall and apologies for not getting back to you earlier. I didn't realize you were sending reads through the Raft log in high, but that makes sense.
When a server receives a command that it cannot apply because it is not the leader, that server will suggest that the client try it on the leader node:
The interesting bit here is that a node may think it's the leader but isn't in fact the leader. I assume that this is the case with medium for which you may serve a stale read. For high this shouldn't matter, correct? Even if you propose to Raft from the non-leader (who thinks it's the leader) the command will be assigned a log index. I'm asking because that case is sometimes subtle depending on how the implementation works. Nodes need to hold on to pending proposals and periodically resubmit them until they show up in a log entry, which then necessitates handling of entry duplication (only relevant for non-idempotent writes). If that doesn't happen, proposals could be dropped (i.e. they never show up in the Raft log) or apply multiple times (giving unexpected results that the client didn't ask for). Anyway, you've addressed my fundamental concerns, thanks for that.
I assume that this is the case with medium for which you may serve a stale read. For high this shouldn't matter, correct?
That's right. When set to high, the leader increments the log index for all read commands, ensuring no stale reads.
Even in the case when a client sends a read to a node that is perceived as the leader by both the node and the client (but is in fact in the middle of a leadership change), the node will attempt to apply the command to Raft, which in turn will fail with a "not the leader" error. The node will then discover the new leader and notify the client with a -TRY response.
In the case with medium, being perceived as the leader by both the node and client is enough to process the read.
You ask very good questions. I should probably create a wiki page describing the process in more detail. Until then I think I'll keep this issue open in case others might find it useful.
I forgot to respond to this:
Nodes need to hold on to pending proposals and periodically resubmit them until they show up in a log entry, which then necessitates handling of entry duplication (only relevant for non-idempotent writes). If that doesn't happen, proposals could be dropped (i.e. they never show up in the Raft log) or apply multiple times (giving unexpected results that the client didn't ask for).
In both cases, "dropped" and "apply multiple times" should be covered because high requires a new log index for all read/write commands prior to responding to the client.
In both cases "dropped" and "apply multiple times" should be covered because high requires a new log index for all read/write commands prior to responding to the client.
I'm not sure I follow. More concretely, assume the following:
- node 1 thinks it's the leader, but it isn't (i.e. a majority of the cluster is at a higher term and has elected a different leader)
- a client wants to execute Increment(key1) on node 1 at consistency level high.
- node 1 proposes the command. What that means exactly depends on the implementation. I would assume it adds it to its local log and tries to send appends to what it still thinks are its followers.
- its appends are denied, node 1 learns of the real leader, and it then discards its uncommitted log (including the proposal our client is waiting for).
- the client never hears back about its proposal, unless there's some clever mechanism that notifies the client when the proposal is dropped (i.e. it would receive a TRY message).
If one instead allows all nodes to propose commands (by relaying to the leader) there are various situations in which the Increment could end up in the logs multiple times.
These are hopefully concerns which are handled by the underlying Raft implementation (i.e. not in your code), which I haven't looked at.
node 1 proposes the command. What that means exactly depends on the implementation. I would assume it adds it to its local log and tries to send appends to what it still thinks are its followers.
The underlying implementation is Hashicorp Raft. SummitDB sends commands using Raft.Apply, which the Hashicorp docs describe as follows:
"Apply is used to apply a command to the FSM in a highly consistent manner. This returns a future that can be used to wait on the application. An optional timeout can be provided to limit the amount of time we wait for the command to be started. This must be run on the leader or it will fail."
This is the basic flow that summitdb uses for handling a client command:
future := raft.Apply(cmd, timeout)
if err := future.Error(); err != nil {
	if isNotLeaderError(err) {
		// figure out who the leader is
		respond("-TRY leader_addr")
	} else {
		respond("-ERR " + err.Error())
	}
} else {
	respond("+OK")
}
client never hears back about its proposal, unless there's some clever mechanism that notifies the client when the proposal is dropped (i.e. it would receive a TRY message).
As I understand it, Raft.Apply should gracefully handle the dropped command scenario that you describe. ApplyFuture.Error() is then immediately called, and it waits for Raft to fully process the command. The godoc states:
"ApplyFuture is used for Apply() and can returns the FSM response."
I'm placing a fair amount of trust in the Apply function as it specifically states: "Apply is used to apply a command to the FSM in a highly consistent manner" and "This must be run on the leader or it will fail".
Hopefully I'm not misinterpreting their documentation. I haven't run into any consistency issues... yet.
Ah, I see. One caveat here is that when you receive an error, the proposal could still have applied. That is, the burden of figuring out what happened is shifted to the client. That's reasonable if that situation is rare enough. However, I think even in the case in which -TRY is returned the command could have applied (and the client would follow the advice to retry and apply it again). This (or something like this) would happen if the leader manages to append the command on some followers, but then steps down and can't complete the proposal, unless the Raft implementation gets very fancy (I don't think so, but haven't checked).
This (or something like this) would happen if the leader manages to append the command on some followers, but then steps down and can't complete the proposal
Correct me if I'm wrong, but wouldn't the log entry for that command be uncommitted at the point that the leader steps down? Then when a new leader goes online, the term is incremented. The followers join the new leader and any uncommitted log entries are rolled back to match the new leader's committed log. In the meantime the client would have received an error?
Consider the following:
- Node 1 is leader in Term 1 and successfully sends appends to the other two nodes (but does not hear back)
- Node 2 campaigns and wins the election for Term 2 (all nodes have the same log, in particular containing the just-received entry). In particular, Node 1 votes for Node 2 (and will presumably tell the client "I'm not the leader")
- Node 3 commits a new command (which in particular also commits the first entry by leader completeness property)
The key here is that the node which is elected leader doesn't purge its log (it can only commit previous entries when it manages to commit an entry in its own term, but that's a technicality that doesn't touch this argument at all).
Got it, and I'm in full agreement. After perusing the Raft paper and the Hashicorp implementation it's clear that there can be fringe cases where a client sends a command to a server, receives an error, and yet the command is fully applied to the logs. While rare, it's possible that a log entry may be duplicated if the client retries the command.
A retry is quite often not a problem for most SummitDB commands such as SET, DEL, GET. But there are commands such as APPEND and BITOP for which it would be very bad if they were applied more than once.
the only outcomes that clients see are "don't know" and "completed"
What is certain is that when a client sends a command to a leader and the leader provides a successful response, that command is fully replicated to the cluster. So false-positives don't seem to be possible.
False-negatives on the other hand... The Raft paper (pg. 13, sec. 8) suggests serializing every command, which may be the way to go. I'll need to investigate the various options.
SummitDB has the FENCE command which generates a unique token for distributed tasks. It's sorta like a log index, but for the client application. It's intended to help solve the Redis distributed locking problem. Yet I can see it as being helpful for generating a unique serial number for commands too.
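For reference, the Raft paper's dedup scheme can be sketched like this: each client tags its commands with a monotonically increasing serial number, and the state machine remembers the last serial it applied per client, so a retried duplicate is ignored. The names here are illustrative, and using FENCE tokens as the serial is speculation on my part, not current behavior:

```go
package main

import "fmt"

// command carries a client-assigned serial so the FSM can detect retries.
type command struct {
	client string
	serial int // monotonically increasing per client
	op     string
}

// fsm is a toy state machine that applies each (client, serial) at most once.
type fsm struct {
	lastSerial map[string]int
	applied    []string
}

func (f *fsm) apply(c command) bool {
	if c.serial <= f.lastSerial[c.client] {
		return false // duplicate delivery: ignore, making retries safe
	}
	f.lastSerial[c.client] = c.serial
	f.applied = append(f.applied, c.op)
	return true
}

func main() {
	f := &fsm{lastSerial: map[string]int{}}
	cmd := command{client: "c1", serial: 1, op: "APPEND k v"}
	fmt.Println(f.apply(cmd)) // first delivery applies
	fmt.Println(f.apply(cmd)) // retried duplicate is ignored
}
```

This makes non-idempotent commands like APPEND safe to retry at the cost of tracking one integer per client.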
Anyhow... there's lots to think about here.
Thanks a ton for your insights and time on the matter.