I have been studying the recluster source code recently. It's a great module to control

Backoff configuration may not guarantee the max time between respawns when workers die? (recluster, closed)

frankLife commented on June 28, 2024

Backoff configuration may not guarantee the max time between respawns when workers die?


Comments (4)

spion commented on June 28, 2024

recluster maintains optrespawn, a badly named internal variable that keeps the current respawn delay.

When a worker dies, we run workerReplace to replace the worker. At the beginning of workerReplace, we first cap the respawn delay so that it is not greater than the backoff option:

    optrespawn = Math.min(optrespawn, opt.backoff);

Then we calculate the next moment that a worker is allowed to respawn, based on the time of the previous respawn plus the respawn delay.

Of course, that moment cannot be in the past. If the last respawn was hours ago, we need to make sure that the next one will be at least now via Math.max. Combining the two, we get:

    nextSpawn = Math.max(now, lastSpawn + optrespawn * 1000);

Then we can calculate the delay of the timer that will respawn the process. If the last respawn was indeed ages ago, the next one will happen immediately (now - now = 0):

    time = nextSpawn - now;

Then we can update the moment of the last spawn for next time:

    lastSpawn = nextSpawn;

Once everything is calculated, we multiply the current delay by two (it will be capped the next time a respawn happens, so we need not worry about it overshooting) and run a timer to reset it back via delayedDecreaseBackoff.

Finally, we proceed with logging the delay and running the respawn timer.
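The steps above can be put together in a small sketch. This is not recluster's actual source: createRespawnScheduler and schedule are hypothetical names, and lastSpawn starting at 0 is an assumption made only to keep the example self-contained. The variable names optrespawn, lastSpawn, nextSpawn and time follow the explanation above.

```javascript
// Hypothetical sketch of the scheduling arithmetic described in this thread.
// It is NOT recluster's real code; only the variable names are borrowed.
function createRespawnScheduler(opt) {
  let optrespawn = opt.respawn; // current respawn delay, in seconds
  let lastSpawn = 0;            // timestamp (ms) of the last scheduled spawn (assumed initial value)

  // `now` is passed in (ms) so the arithmetic can be checked without a real clock.
  return function schedule(now) {
    // 1. Cap the current delay at the configured maximum.
    optrespawn = Math.min(optrespawn, opt.backoff);
    // 2. The next spawn is one delay after the previous one, but never in the past.
    const nextSpawn = Math.max(now, lastSpawn + optrespawn * 1000);
    // 3. Delay for the respawn timer, in ms.
    const time = nextSpawn - now;
    // 4. Remember the scheduled moment for the next call.
    lastSpawn = nextSpawn;
    // 5. Double the delay; it is capped again at step 1 of the next call.
    optrespawn *= 2;
    return { time, nextSpawn };
  };
}
```

With opt.respawn = 1 and opt.backoff = 10, three calls with the same value of now would yield timer delays of 0 ms, 2000 ms and 6000 ms, because each scheduled spawn is placed after the previous scheduled one rather than after now.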

delayedDecreaseBackoff (better name: debouncedDecreaseBackoff) is a debounced function in the sense of lodash's _.debounce. Its effect is applied opt.backoff seconds after it was last called, but if it is called again sooner than that, it will reset the timer.

The timer is set to the maximum possible delay (opt.backoff) to ensure that the backoff is not decreased if respawns keep happening at intervals shorter than opt.backoff. Once the delay hits the maximum, it will be kept at that maximum (1) as long as worker respawns keep being requested within opt.backoff seconds or less of each other.

If respawn requests stop happening, delayedDecreaseBackoff will have a chance to execute and decrease the current delay by half. It ensures that the delay is not smaller than the minimum. Otherwise, if it's larger, it schedules itself again (it needs to be decreased further).
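A minimal sketch of such a debounced decrease, under the assumptions stated here: createDelayedDecreaseBackoff is a hypothetical factory, the current delay lives in a state object, and the timers argument exists only so the behavior can be exercised without waiting. recluster's real implementation may differ.

```javascript
// Hypothetical sketch of the debounced "decrease backoff" behavior.
// `state.optrespawn` holds the current delay in seconds; `opt.respawn` is the
// minimum and `opt.backoff` the maximum. `timers` is an assumption added for
// testability and defaults to the global timer functions.
function createDelayedDecreaseBackoff(state, opt, timers = { setTimeout, clearTimeout }) {
  let handle = null;

  return function delayedDecreaseBackoff() {
    // Debounce: a new call cancels the pending decrease and restarts the wait.
    if (handle !== null) timers.clearTimeout(handle);

    handle = timers.setTimeout(function decrease() {
      handle = null;
      // Halve the delay, but never go below the minimum.
      state.optrespawn = Math.max(state.optrespawn / 2, opt.respawn);
      // Still above the minimum: schedule another decrease.
      if (state.optrespawn > opt.respawn) delayedDecreaseBackoff();
    }, opt.backoff * 1000);
  };
}
```

Calling the returned function on every respawn request keeps postponing the decrease, which is why the delay only starts to shrink once workers stop dying.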

Example: Given opt.respawn = 1s, opt.backoff = 10s, we should get something like this if workers keep dying:

1 -> 2 -> 4 -> 8 -> 10 -> 10 -> 10 -> ...

Once workers stop dying, the delay will start decreasing gradually by 1/2 every opt.backoff = 10 seconds:

10 -> 5 -> 2.5 -> 1.25 -> 1 (timer stops)
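Both sequences can be reproduced with a few lines of arithmetic. growthSequence and decaySequence are hypothetical helpers written for this illustration; they model only the doubling-with-cap and halving-with-floor rules described above, not the timers.

```javascript
// Hypothetical helpers that reproduce the two example sequences.

// Delay used for each consecutive worker death: capped, then doubled.
function growthSequence(respawn, backoff, deaths) {
  const seq = [];
  let delay = respawn;
  for (let i = 0; i < deaths; i++) {
    delay = Math.min(delay, backoff); // cap before use
    seq.push(delay);
    delay *= 2;                       // double for next time
  }
  return seq;
}

// Delay after each quiet period: halved until it reaches the minimum.
function decaySequence(current, respawn) {
  const seq = [current];
  while (current > respawn) {
    current = Math.max(current / 2, respawn);
    seq.push(current);
  }
  return seq;
}

console.log(growthSequence(1, 10, 7).join(" -> ")); // 1 -> 2 -> 4 -> 8 -> 10 -> 10 -> 10
console.log(decaySequence(10, 1).join(" -> "));     // 10 -> 5 -> 2.5 -> 1.25 -> 1
```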

I'm not sure what's going on in your case. Since the current delay is always capped before being used and displayed, I don't have an idea where the bug might be.

(1): Well, it will oscillate between max and 1/2 max, given that workers don't die immediately but only after running for at least a while. So when the delay is max, the next worker death happens at max+someDelta and the timer gets a chance to run and decrease the backoff by 1/2. If we added a couple of seconds to opt.backoff for this case, then we could ensure that the delay stays at max.


frankLife commented on June 28, 2024

It's a great detailed explanation. I totally understand the debouncedDecreaseBackoff (delayedDecreaseBackoff) function.

In this case, I found the reason why the respawn time exceeds backoff: when workers die immediately, lastSpawn is assigned the calculated value and optrespawn is multiplied by 2. The same process runs many times within a short period. Finally, it results in:

optrespawn is set to the max value (backoff).
lastSpawn is based on the previous lastSpawn, but the respawn scheduled for the previous lastSpawn may not have happened yet, so the interval accumulates more and more.

When I set the server error timeout to 5000 instead of 500 and the number of workers to 1 instead of 4 (or an error timeout of 10000 with 2 workers), the result is correct.

Maybe this particular condition (the server dies too fast) triggers this result.

Can we add

    if (opt.backoff) {
        nextSpawn = Math.min(opt.backoff * 1000 + now, nextSpawn);
    }

after

    var nextSpawn = Math.max(now, lastSpawn + optrespawn * 1000);

to ensure that the time between respawns does not exceed backoff?


spion commented on June 28, 2024

I think I see what you mean. Your issue happens with more workers, after a few of them die. Since the death occurred between now and lastSpawn, the next spawn is scheduled even later in the future.

But that's the correct behavior, isn't it?

Let's say recluster is already at maximum delay and two workers die at almost the same time. Since a respawn is already scheduled in (e.g.) 10 seconds, we must schedule the next one in 20 seconds to ensure 10s of breathing space in between. Otherwise, if 16 workers die in a row, there will be a flood of 16 respawns in a row (all happening at once after 10 seconds), which is precisely what we want to avoid. This way (at maximum delay), they will be evenly spaced opt.backoff seconds apart.

The delay shown by the logger isn't the time between two respawns; it's the time between now and the currently scheduled respawn. That value may get bigger than opt.backoff, but the time between respawns doesn't.
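To make the difference between the logged delay and the gap between respawns concrete, here is a small simulation. simulateSimultaneousDeaths is a hypothetical helper, and it assumes the delay is already at its maximum, all workers die at the same instant (now = 0), and the previous spawn happened at that same instant. The clamp flag applies the Math.min cap proposed earlier in the thread, so the two behaviors can be compared.

```javascript
// Hypothetical simulation: several workers die at the same instant while the
// respawn delay is already at its maximum (backoffSeconds).
// With clamp = true, the Math.min cap proposed earlier in the thread is applied.
function simulateSimultaneousDeaths(deaths, backoffSeconds, clamp = false) {
  const now = 0;
  let lastSpawn = now; // assumption: the previous spawn happened just now
  const results = [];

  for (let i = 0; i < deaths; i++) {
    let nextSpawn = Math.max(now, lastSpawn + backoffSeconds * 1000);
    if (clamp) {
      nextSpawn = Math.min(backoffSeconds * 1000 + now, nextSpawn);
    }
    results.push({
      loggedDelay: nextSpawn - now,   // what the logger reports (ms)
      gap: nextSpawn - lastSpawn      // spacing from the previously scheduled spawn (ms)
    });
    lastSpawn = nextSpawn;
  }
  return results;
}

console.log(simulateSimultaneousDeaths(4, 10));
console.log(simulateSimultaneousDeaths(4, 10, true));
```

Without the clamp, the logged delays grow (10 s, 20 s, 30 s, 40 s) while every gap stays at 10 s. With the clamp, every logged delay is 10 s, but the gaps after the first drop to 0, which is the flood of simultaneous respawns described above.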


frankLife commented on June 28, 2024

Yes, I think I now understand what happens. I misread the meaning of the time in the log. You really helped me understand the problem.

Thank you ;)
