Comments (6)

antirez commented on June 16, 2024

Hello @vaizki, I find this feature very interesting; it could be one of the first additions I make once the original tests all pass and persistence is implemented. Thanks.

antirez commented on June 16, 2024

@vaizki please also check #5

vaizki commented on June 16, 2024

Your ideas in #5 look good for synchronous multi-AZ replication but my patch was trying to resolve a different set of problems:

  • When doing a synchronous REPLICATE across AZs, the RTT between them (over the Atlantic for example) was too much for the original project's replication timers. I explained some of it in antirez/disque#189 (comment)
  • Even with the timers adjusted to prevent retries etc., a 100ms RTT caps a synchronous client connection at a maximum of 10 ADDJOBs per second if confirmation is needed from both AZs/DCs.
  • Sometimes you only have 2x AZ (my usual use case) and one of them is not available. In this case you cannot use sync replication that needs confirmation from both AZs.

So in essence, I am proposing separating desired replication and write concern in ADDJOB (a rough client-side sketch follows the list below).

  • Desired replication defines the number of replicas / AZs that the cluster will attempt to achieve as quickly as possible
  • Write concern defines the replication that needs to be confirmed before the command returns success to the client
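A minimal client-side sketch of what this two-parameter ADDJOB could look like, assuming a redis-py client talking to the disque-module port; `WRITECONCERN` is a hypothetical option name used only for illustration, while `REPLICATE` is part of the real ADDJOB syntax:

```python
# Hedged sketch: the proposed split between desired replication and write concern,
# seen from a client. REPLICATE exists today; WRITECONCERN is hypothetical.
import redis

r = redis.Redis(host="localhost", port=7711)  # assumed disque-module endpoint

# Ask the cluster to reach 3 copies as soon as it can (desired replication),
# but return OK to the client once 1 local copy is confirmed (write concern),
# keeping the cross-AZ round trip off the critical path.
job_id = r.execute_command(
    "ADDJOB", "myqueue", "payload", 1000,  # queue, body, ms-timeout
    "REPLICATE", 3,                        # best-effort replication target
    "WRITECONCERN", 1,                     # hypothetical: copies confirmed before OK
)
print(job_id)
```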

I also considered a model where the ADDJOB timeout would tie into this: the client waits for full replication to happen, and if the timeout is reached with only partial replication achieved, a "semi-success" is returned and the replication is completed asynchronously as soon as possible. But these ideas did not really work with the original project and I needed to move on.

antirez commented on June 16, 2024

@vaizki thanks, I understood the problem, but what is not completely clear to me is how useful a desired replication level that is sometimes not reached really is. It does not offer any true guarantee; it can only improve the rate of jobs that survive failures. It may still be interesting, and I didn't mean for my issue to replace the functionality of the one that you proposed. #5 is more like a step towards providing more guarantees. #4 instead attempts to reduce the percentage of jobs that get lost when they are replicated with weaker guarantees. Both have their place I guess, but #4 is more like a "best effort safety" feature.

vaizki commented on June 16, 2024

At least for me, doing synchronous writes to two AZs with 100ms+ RTT between them is just not an option. It kills performance and occasionally kills the whole application if the other AZ is not responding. It would be like writing to a Redis key and waiting for a slave (which may or may not be running somewhere far away) to ack the replication. Of course I would prefer 3 AZs with minimal latency between them, but that is not an available option.

An availability + performance comparison can also be drawn with RAID1 disk mirroring: if one side of the mirror is down, the remaining side still accepts writes just as quickly as before, and as soon as the faulty side recovers it is synced and N=2 redundancy is achieved again. So for replication, it is about being able to operate in a degraded state while requesting that the desired replication eventually be achieved.

Because Disque is a message passing system which by nature decouples producers and consumers, there is not as much worry about split-brain or diverging data among cluster members as there is with RAID1, Redis master-slave or an ACID RDBMS cluster. Working at the message level, with messages being immutable, guarantees that IF a message exists on a cluster member, then it has the correct content, so I can process it.

And that is exactly why I want to use Disque and need this tunable "write concern": I need availability and partition tolerance above all else, and to get them I can tolerate, for example, the same job being processed more than once.
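Since at-least-once delivery is the trade-off here, the consumer has to be idempotent. A minimal sketch, assuming a redis-py client and the original Disque GETJOB/ACKJOB reply shapes; the in-memory dedup set and the `process()` handler are hypothetical stand-ins:

```python
# Minimal idempotent consumer sketch: duplicate deliveries are acked and skipped.
import redis

r = redis.Redis(host="localhost", port=7711)  # assumed disque-module endpoint
seen = set()  # in production this would be a persistent deduplication store


def process(body):
    # hypothetical application handler
    print("processing", body)


while True:
    # Each reply element is [queue, job_id, body] as in the original Disque.
    jobs = r.execute_command("GETJOB", "TIMEOUT", 1000, "FROM", "myqueue")
    if not jobs:
        continue
    for queue, job_id, body in jobs:
        if job_id not in seen:  # first delivery: do the real work
            process(body)
            seen.add(job_id)
        r.execute_command("ACKJOB", job_id)  # ack duplicates too
```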

antirez commented on June 16, 2024

It's not a matter of consistency indeed; the replication factor in Disque tunes the ability to survive failures. In that regard, saying "give me the green light when the message has been replicated 2 times, but actually try to replicate it 5 times" can be seen as trading a minimum durability guarantee against optimal durability and latency concerns. So indeed it makes sense, but the messages are still only guaranteed to survive a single node failure. However, with this option, if there is a two-node failure the percentage of messages that get lost will be smaller.
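To put rough numbers on that last point, here is an illustrative calculation under an assumed scenario (not from the thread): a 10-node cluster, a job acknowledged after 2 confirmed copies, a best-effort target of 5 copies, and exactly 2 nodes failing at once.

```python
# Back-of-the-envelope loss probability under the assumed scenario above.
from math import comb

nodes, failed = 10, 2

# Job still has only its 2 confirmed copies: it is lost iff both copies
# happen to sit on the 2 failed nodes.
p_lost_confirmed_only = comb(failed, 2) / comb(nodes, 2)  # 1/45, about 2.2%

# Best-effort replication to 5 nodes already completed asynchronously:
# 2 failures can never remove all 5 copies.
p_lost_fully_replicated = 0.0

print(f"{p_lost_confirmed_only:.1%} vs {p_lost_fully_replicated:.1%}")
```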
