doxout / recluster
Node clustering library with support for zero downtime reloading
If I add two instances of recluster and run both of them, respawning will work incorrectly.
$ git clone https://github.com/tarmolov/recluster-test-app.git
$ cd recluster-test-app
$ npm install
$ npm start
> [email protected] start /Users/hevil/Yandex/recluster-test-app
> node server.js
app1
app2
/Users/tarmolov/recluster-test-app/app1.js:4
throw new Error('Error in app1');
^
Error: Error in app1
at Timeout.setTimeout (/Users/tarmolov/recluster-test-app/app1.js:4:11)
at tryOnTimeout (timers.js:232:11)
at Timer.listOnTimeout (timers.js:202:5)
[71627] worker (0:1) must be replaced, respawning in 0
app2
The synthetic error is thrown in app1, and I expect app1 to be respawned. However, it respawns app2 instead.
Is there a working example of using domains with recluster to gracefully restart on unhandled exceptions?
Hi,
It's not possible to do logging correctly without this.
The problem is that each worker trying to log to the same file will fail because the file is locked.
The solution is to let the cluster master do all the logging, but for that, it needs to listen on the workers' stdout and stderr.
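A minimal sketch of the idea using Node's own cluster module (an illustration, not recluster's API), assuming workers are forked with silent: true so their output is piped to the master instead of inherited:

var cluster = require('cluster');
var fs = require('fs');
if (cluster.isMaster) {
    // silent: true gives the master access to each worker's stdio streams
    cluster.setupMaster({ silent: true });
    var log = fs.createWriteStream('cluster.log', { flags: 'a' });
    var worker = cluster.fork();
    // only the master writes to the file, so there is no lock contention
    worker.process.stdout.pipe(log);
    worker.process.stderr.pipe(log);
} else {
    console.log('worker ' + process.pid + ' logging via master');
}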
hi,
It throws the following since Node v0.12.0:
assert.js:86
throw new assert.AssertionError({
^
AssertionError: false == true
at SharedHandle.add (cluster.js:97:3)
at queryServer (cluster.js:480:12)
at Worker.onmessage (cluster.js:438:7)
at ChildProcess.<anonymous> (cluster.js:692:8)
at ChildProcess.emit (events.js:129:20)
at handleMessage (child_process.js:324:10)
at Pipe.channel.onread (child_process.js:352:11)
Now README.md says that the cluster.activeWorkers() function:
Returns a hash of all worker slots (0 <= WORKER_ID < N).
But actually it returns an array-like object, which has a length property and an index signature:
interface ActiveWorkers {
    length: number;
    [index: number]: Worker | null | undefined;
}
So if we use Object.entries(activeWorkers) (or a similar way to iterate keys), we get length as a key, which is unexpected behavior, because length is not a worker id.
I think the documentation should be clearer on this point.
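Given that shape, a sketch of iterating only the numeric worker slots (the variable names and the logging are illustrative):

var active = cluster.activeWorkers();
for (var i = 0; i < active.length; i++) {
    var worker = active[i];
    if (worker) {
        // only numeric slots reach this point; length is never treated as an id
        console.log('slot ' + i + ' -> pid ' + worker.process.pid);
    }
}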
We can listen for ready events on the cluster object, but these fire once for each worker.
How can we trigger some handler once all the workers are ready? Is there any such event or hook, public or private?
My use case is notifying PM2 that the cluster is ready, for graceful reloading. So this is somewhat related to #35 which is concerned about how to gracefully shut down a cluster.
I have this naive custom function to return a promise that resolves when all workers are ready, but I think with this there is the possibility that, before all workers are ready, a worker dies and gets replaced, and the promise will resolve too early:
function allWorkersReady (cluster) {
    return new Promise((resolve, reject) => {
        let pendingWorkersCount = cluster.workers().length
        cluster.on('ready', onReady)
        function onReady (worker) {
            if (--pendingWorkersCount === 0) {
                cluster.removeListener('ready', onReady)
                resolve()
            }
        }
    })
}
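One way to make this more robust, as a sketch under the assumption that cluster.workers() returns a plain array of the currently live workers: resolve only once every currently live worker has reported ready, so a replaced worker resets the requirement for its slot:

function allWorkersReady (cluster) {
    return new Promise((resolve) => {
        const readyPids = new Set()
        cluster.on('ready', function onReady (worker) {
            readyPids.add(worker.process.pid)
            // a replacement worker has a new pid, so it must report ready itself
            const allReady = cluster.workers().every(
                (w) => readyPids.has(w.process.pid)
            )
            if (allReady) {
                cluster.removeListener('ready', onReady)
                resolve()
            }
        })
    })
}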
https://github.com/superjoe30/naught
Looks like we should merge projects, yes?
I have a lot of such messages in stderr.log:
(node) warning: possible EventEmitter memory leak detected. 11 listeners added. Use emitter.setMaxListeners() to increase limit.
Trace
at EventEmitter.addListener (events.js:175:15)
at module.exports.self.reload.i (/usr/lib/frontend/server/node_modules/recluster/index.js:230:21)
at Array.forEach (native)
at EventEmitter.module.exports.self.reload (/usr/lib/frontend/server/node_modules/recluster/index.js:211:22)
at process.<anonymous> (/usr/lib/frontend/server/index.js:53:13)
at process.EventEmitter.emit (events.js:93:17)
at SignalWatcher.startup.processSignalHandlers.process.on.process.addListener.w.callback (node.js:487:45)
/usr/lib/frontend/server/index.js:53:13
51 process.on('SIGUSR2', function() {
52 console.log('%s [master] Got SIGUSR2, reloading cluster...', prefix);
53 cluster.reload();
54 });
Possibly it will be enough to set emitter.setMaxListeners(options.workers).
This is more of a conceptual question than an issue with recluster per se, so sorry about that, but any guidance would be appreciated. If I am managing deployment by directories that contain releases, then when I do a new release, how do I tell cluster to fork the new workers from the path to the new release code?
In other words:
Say release A is in /var/deploy/app/A, and the symlink /var/deploy/app/current/ points to it. My app lives in app.js, so my file argument to recluster is a relative path: var cluster = recluster("app.js");
My simple cluster master script lives in cluster.js. I start up a master process via cd /var/deploy/app/A ; node cluster.js, which forks a few app.js workers using the code in /var/deploy/app/A. The cwd of the master and the workers is /var/deploy/app/A.
Now I deploy release B to /var/deploy/app/B and update /var/deploy/app/current to point there. I send SIGUSR2 to the master. The master's cwd is still /var/deploy/app/A, so it is going to fork new workers from there rather than from /var/deploy/app/B.
One solution is to start the master via the symlink, cd /var/deploy/app/current ; node /var/deploy/app/current/cluster.js, and derive the file argument to recluster from the master process's symlinked path:
var workerPath = process.argv[1].replace("cluster", "app");
var cluster = recluster(workerPath);
Because the symlink is updated to the new release before SIGUSR2 is sent, the new workers will be forked from the new release.
In practice, I am using Chef and Upstart, which complicates things somewhat. But ultimately I want to send a signal to the master process (via service <app name> reload) that will cause it to fork new workers from the new code release that Chef just deposited onto the server. Without the symlink, I haven't thought of a way to keep the same master process around but get it to fork workers from the new code. Do you have any thoughts on this?
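For reference, a commented sketch consolidating the symlink approach described above (my consolidation, assuming the master is launched through the current symlink):

// cluster.js, launched as:
//   cd /var/deploy/app/current ; node /var/deploy/app/current/cluster.js
// process.argv[1] then holds the symlinked path, so replacing "cluster"
// with "app" yields /var/deploy/app/current/app.js, which the OS resolves
// against the newest release on every fork.
var recluster = require('recluster');
var workerPath = process.argv[1].replace('cluster', 'app');
var cluster = recluster(workerPath);
cluster.run();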
Is there a way on cluster.reload() to check if there is an error and then revert to the original code?
Document and expose the eventemitter API. To do this, all usages of self.on will need to be replaced with something else.
I migrated to the new version of this library (1.0.0) and noticed that by default execArgv is an empty array, which causes errors when doing a "simple" replacement and is unintuitive.
The current tests are flaky; they rely on timings and sometimes randomly pass or fail. We need to identify which timeouts can be replaced with events on the eventemitter, implement those events, and update the tests.
In a project I need to call recluster twice because there are two separate servers. After calling recluster twice, I found the first call's workers can't respawn correctly because the exec file was changed by the second call. It looks easy to allow this (see 7nights@909f71d). Can this feature be supported?
Does it support Windows, or only POSIX operating systems? I would like to use this for a product which should run cross-platform. I used naught and it does not seem to work (as clearly mentioned in their documentation).
Thanks !
During testing, backoff seems to work properly with 1 worker, but with 4 workers, the backoff (max respawn time) is exceeding the set time.
For example:
var opt = {
    workers: 4,
    /* seconds */
    timeout: 300,
    respawn: 2,
    backoff: 10
};
If the app crashes repeatedly (for testing), the respawn time ends up being over 30 seconds, exceeding the 10-second backoff setting. I'll be happy to review the code to track down the problem, but I wanted to first make sure this is in fact a bug and not a misunderstanding on my part.
Thank you for a great module.
Hi,
When I signal the master using SIGUSR2, new workers are started but the old ones never die.
var timeout = setTimeout(worker.kill.bind(worker), opt.timeout * 1000);
worker.on('disconnect', clearTimeout.bind(this, timeout));
// a possible leftover worker that has no channel established will throw
try { worker.disconnect(); } catch (e) { }
cluster.removeListener('listening', stopOld);
What I think is happening is that the worker.disconnect() call generates a disconnect event that clears the timeout before it has a chance to execute. Am I reading this correctly?
I am running node 0.10.0, so it could be that this event wasn't generated in your version of node?
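A sketch of a possible fix along those lines (the idea only, not a tested patch): clear the kill timer on exit instead of disconnect, so the deliberate disconnect() call cannot cancel it:

var timeout = setTimeout(worker.kill.bind(worker), opt.timeout * 1000);
// clear the kill timer only once the worker has actually exited;
// 'disconnect' fires right after worker.disconnect() below and would
// otherwise cancel the timer before it can run
worker.on('exit', clearTimeout.bind(this, timeout));
try { worker.disconnect(); } catch (e) { }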
Hi,
I am using recluster and nodemon on Mac OS X High Sierra.
Recluster seems to work fine when sending the kill signal from the CLI.
Now I am combining recluster with nodemon, so nodemon is responsible for restarting as soon as a file changes. This worked fine on my old Windows machine, but on Mac it seems to have difficulties with this setup.
I run the following command: nodemon -w server --exec node server/main.js
nodemon -w server --exec node server/main.js
[nodemon] 1.14.12
[nodemon] to restart at any time, enter rs
[nodemon] watching: /.../server/**/*
[nodemon] starting node server/main.js
spawned cluster, kill -s SIGUSR2 10479 to reload
CONSUMER 10482
server listening on 3000 (NODE_ENV=development)
[nodemon] restarting due to changes...
Got SIGUSR2, reloading cluster...
[10480] worker (0:1) must be replaced, respawning in 0
[10481] worker (0:2) must be replaced, respawning in 1996
[10482] worker (0:3) must be replaced, respawning in 5994
CONSUMER 10490
CONSUMER 10491
CONSUMER 10488
CONSUMER 10489
CONSUMER 10493
CONSUMER 10494
Any ideas on this?
If I send SIGUSR2 and wait, my 2 processes come back up as 2 processes every time. Using siege, availability stays at 100%.
If I send SIGUSR2 twice rapidly, my 2 processes turn into 5 processes and I get an error in the console: Error [ERR_IPC_DISCONNECTED]: IPC channel is already disconnected.
Using siege, availability drops from 100% to around 90%. Eventually the number of running processes decreases back down to 2.
I would have expected it to kill a worker before spawning a replacement; instead it appears to spawn replacements before killing the worker being replaced, which means the total number of processes sometimes exceeds the number of CPUs (undesired). Also, 90% availability instead of 100% when issuing restarts in quick succession is an issue. A quick fix could be to toggle a flag and block restarts if there's already one pending.
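A minimal sketch of that guard in the master script (the reloading flag is illustrative; cluster.reload's completion callback appears elsewhere in these reports):

var reloading = false;
process.on('SIGUSR2', function () {
    if (reloading) {
        console.log('reload already in progress, ignoring SIGUSR2');
        return;
    }
    reloading = true;
    cluster.reload(function () {
        reloading = false;
    });
});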
Hi, I made a comparison of several similar modules such as [1] [2] [3] [4] and I liked yours the most because it comes without bloat!
I think there are two more important use-cases that should be described in README.md:
"To gracefully shutdown the whole cluster (workers and master)"
kill -s TERM <cluster_pid>
"To forcefully kill the whole cluster (workers and master)"
kill -s KILL <cluster_pid>
What do you think about adding it?
[1] https://github.com/strongloop/strong-cluster-control
[2] https://github.com/andrewrk/naught
[3] https://github.com/ql-io/cluster2
[4] https://github.com/Unitech/pm2
Hi, I see there is a problem on node 9.2.1 when sending kill -s SIGUSR2 51865.
My server.js file is:
const express = require('express')
const app = express()
app.get('/', (req, res) => res.send('Hello World!'))
app.listen(3000, () => console.log('Example app listening on port 3000!'))
and cluster.js is taken from your example.
$ node cluster.js
spawned cluster, kill -s SIGUSR2 51865 to reload
Example app listening on port 3000!
Example app listening on port 3000!
Example app listening on port 3000!
Example app listening on port 3000!
Example app listening on port 3000!
Example app listening on port 3000!
Example app listening on port 3000!
Example app listening on port 3000!
Got SIGUSR2, reloading cluster...
Example app listening on port 3000!
Example app listening on port 3000!
Example app listening on port 3000!
Example app listening on port 3000!
Example app listening on port 3000!
Example app listening on port 3000!
Example app listening on port 3000!
Example app listening on port 3000!
events.js:136
throw er; // Unhandled 'error' event
^
Error [ERR_IPC_CHANNEL_CLOSED]: Channel closed
at ChildProcess.target.send (internal/child_process.js:606:16)
at Worker.send (internal/cluster/worker.js:40:28)
at killTimeout (/Users/jankowalski/3prty/JS/my-recluster/node_modules/recluster/index.js:134:24)
at EventEmitter.workerReplaceTimeoutTerminate (/Users/jankowalski/3prty/JS/my-recluster/node_modules/recluster/index.js:111:9)
at EventEmitter.emit (events.js:159:13)
at emit (/Users/jankowalski/3prty/JS/my-recluster/node_modules/recluster/index.js:53:19)
at EventEmitter.workerDisconnect (/Users/jankowalski/3prty/JS/my-recluster/node_modules/recluster/index.js:148:36)
at EventEmitter.emit (events.js:159:13)
at ChildProcess.worker.process.once (internal/cluster/master.js:207:13)
at Object.onceWrapper (events.js:254:19)
works fine on node 8.2.1
I just want to know whether recluster maintains sticky sessions, i.e., if user 1's first request went to the first worker, does recluster ensure that subsequent requests will also go to that same worker?
The question is about the emit() function.
var self = new EE();
var channel = new EE();
function emit() {
    channel.emit.apply(self, arguments);
    self.emit.apply(channel, arguments);
}
According to the source code, we get two instances of the EventEmitter class and use the emit function to redirect messages.
But why do we use channel's emit to fire events on the self object, while using self's emit to fire events on the channel object? It seems we don't bind any events on the self object.
I changed the function as below:
function emit() {
    channel.emit.apply(channel, arguments);
}
and the module can still invoke the run() and reload() functions.
If there is no other design consideration here, maybe this would be a clearer way.
A nice feature would be to allow using this together with other popular deployment tools such as Capistrano. There is a problem with node cluster itself regarding cwd, documented here: http://clarkdave.net/2013/02/node-js-and-cluster-cwd-dirname-shenanigans/
It would be nice to solve that here in recluster. Apparently doing this when new workers are spawned would solve the issue:
// http://clarkdave.net/2013/02/node-js-and-cluster-cwd-dirname-shenanigans/
if (__dirname !== process.cwd()) {
    process.chdir(__dirname);
}
Also see #8
I am running NODE_ENV=development node backend-cluster.js
backend-cluster.js
var recluster = require('recluster'),
    path = require('path');
process.title = 'backend';
// With a relative path, it will only follow the symlink once.
// Use an absolute path to the symlink, because we want it to follow the symlink on every restart.
var cluster = recluster(path.join('/var/www/html/app/current/repos/backend.js'), {
    timeout: 30
});
cluster.run();
backend.js
setInterval(() => console.log('foo'),1000)
After editing backend.js and removing the code / changing the log message, I send the kill signal. recluster spawns replacement workers, but the old workers do not die and "foo" keeps getting logged to the console. I expect the old workers to be killed after 1m since NODE_ENV is dev. I also added the explicit timeout option of 30 seconds, but "foo" keeps getting logged for much longer (maybe an hour?).
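Worth noting (my reading of Node's semantics, not something recluster documents): a disconnect only closes servers and the IPC channel, while an active setInterval keeps the worker's event loop alive. A sketch of a worker that exits cleanly, assuming the regular 'disconnect' event reaches the child:

var timer = setInterval(function () {
    console.log('foo');
}, 1000);
process.on('disconnect', function () {
    // stop the interval so the event loop can drain and the process can exit
    clearInterval(timer);
});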
If I'm reading the code correctly, it is possible that some requests could fail if they occur between the shutdown and the fork loop at the end.
Douglas
As a user of this module, I'm asking: Is this module still maintained?
thx
Hi!
I'm trying to use recluster to get zero-downtime deploys for my hapi web server. It works most of the time, but sometimes I'm seeing a worker process that continues receiving and responding to requests even after it has been given a disconnect message. As far as I understand, recluster should not be routing requests to the old workers anymore?
In particular, I'm using hapi to proxy a CouchDB _changes feed, which is a longpoll feed that should last for 25 seconds at a shot. It is these longpolls that I'm primarily having an issue with.
I can see that the workers are still getting requests after the disconnect message, as I'm logging the pid when the hapi server receives requests and when it responds.
I'm directly running the following cluster.js:
var recluster = require('recluster'),
    path = require('path');
var cluster = recluster(path.join(__dirname, 'index.js'), {
    timeout: 120, // seconds
    workers: 2,
    readyWhen: 'ready',
    log: {
        respawns: true
    }
});
cluster.run();
process.on('SIGUSR2', function() {
    console.log('Got SIGUSR2, reloading cluster...' + new Date());
    cluster.reload(function() {
        console.log("done reloading cluster");
    });
});
console.log("spawned cluster, kill -s SIGUSR2", process.pid, "to reload");
In my hapi server, I have:
server.start(function () {
    process.send({cmd: 'ready'});
    server.log('info', 'Server running at: ' + server.info.uri + ' with PID #' + process.pid + " at " + new Date());
});
And to handle the disconnect message:
process.on('message', function(m) {
    if (m.cmd == 'disconnect') {
        // disconnected from master, no more clients. clean up.
        console.log("close message for PID#" + process.pid + " at " + new Date());
        server.stop({
            timeout: 60000
        }, function() {
            console.log('hapi stopped with PID#' + process.pid + " at " + new Date());
            process.exit();
        });
    }
});
I'm sorry I don't yet have a reduced test case; I can try work on one if you don't see anything immediately obviously wrong with my configuration. ;-)
Hi, I want to tell you that I really like your module and your work.
I use recluster for my project and I am really impressed :)
I would like to know if it's possible to add an option for launching recluster as a daemon or if I should use a node module for this?
I thought that maybe start-stop-daemon could be a solution. I would love to know your opinion about it.
Thank you
I have been studying the recluster source code recently. It's a great module for controlling Node processes, but I have some questions about the backoff configuration.
There are two files here:
index.js:
var path = require('path');
var recluster = require('recluster');
var opt = {
    respawn: 0.5,
    backoff: 2
};
var cluster = recluster(path.join(__dirname, 'server.js'), opt);
cluster.run();
server.js:
var http = require('http');
var s = http.createServer(function(req, res) {
    var params = req.url.split('/').slice(1);
    setTimeout(function() {
        res.writeHead(200);
        res.end("hello world\n");
    }, params[0] || 1);
});
s.listen(8000);
setTimeout(function() {
    throw new Error("Unclean exit!");
}, 500);
I ran node index.js several times, and the terminal output shows that the respawn time increases from 0 up to 6293 ms (about 6 s).
But the repository README says:
opt.backoff
Maximum respawn time (reached via exponential backoff). Set to 0 or undefined to disable exponential backoff.
I looked up the backoff part of the source code; here, time is the fork timeout. But as mentioned above, with backoff set to 2 s, the actual results (2959, 4854) exceed the backoff configuration.
According to the source code, whenever backoff is set, delayedDecreaseBackoff is invoked and the respawn time is divided by 2 one backoff interval later; before that, the respawn time is multiplied by 2 on each crash.
But this exponential backoff does not seem to guarantee a maximum time between respawns when workers die, or else I can't make sense of the logic. Would you explain how this design is supposed to work? :)
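For reference, the capped form I would have expected from the README wording (my assumption, not recluster's actual code):

// delay after the n-th consecutive crash, in seconds:
// delay(n) = min(respawn * 2^n, backoff)
function respawnDelay(respawn, backoff, crashes) {
    return Math.min(respawn * Math.pow(2, crashes), backoff);
}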
Hi!
I've been using your module for a while and I think it's doing its job very well and it has a small, adequate interface. Kudos!
I was refactoring one of my project's boot.js files (which initiates a recluster instance and fires up the application) and I thought it would be very helpful to be able to execute code when the cluster terminates, something like this. That way you can assign handlers to the standard terminating signals, terminate the cluster in them, and exit the process when it's done. Like so:
var cluster = recluster({....});
function shutdown() {
    cluster.terminate(process.exit);
}
process
    .on('SIGUSR2', cluster.reload)
    .on('SIGINT', shutdown)
    .on('SIGTERM', shutdown);
cluster.run();
I can send a pull request with the changes and tests, if you are interested.
This still stands between me and zero downtime: on a process uncaughtException, I do server.close, which means this worker is no longer accepting new connections. However, it may take some time before exiting to wind down existing connections, clean up, etc. During this time, recluster will not fork a new worker, which means I'm down one worker. On a two-core machine, should another uncaught exception be thrown (perhaps there is a bad request that triggered the first, and now the client is resending it, so the second worker gets it), I will have no workers accepting new connections.
It seems that in this case, the master will queue new connections until a worker accepts them, though I don't know for how long it will hold them, nor how many it will queue. But basically, this is downtime.
On recluster.reload, old workers are disconnected and new workers to replace them are immediately forked, which works great for zero downtime. However, could there be some way for a worker to tell the master "I am no longer listening for new connections" which would be another trigger for the master to fork a new worker? Maybe the master could listen for a "stoppedListening" message from the worker, which I could send in my uncaught exception handler?
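On the worker side, a sketch of what that could look like (the 'stoppedListening' message is the proposal above; recluster would need to learn to react to it):

process.on('uncaughtException', function (err) {
    console.error(err.stack);
    // tell the master this worker no longer accepts connections,
    // so it could fork a replacement immediately (proposed behavior)
    if (process.send) process.send({cmd: 'stoppedListening'});
    server.close(function () {
        process.exit(1);
    });
});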
Just throwing this out there for discussion. I know you said you would be busy with a project so no worries if you don't have time for this now, but interested to hear your thoughts whenever you are able!
As with issue #1, I'm having difficulties getting old workers to die. On my dev machine, everything works fine. But in production when I send a SIGUSR2, new workers are started but the old ones remain. The old ones appear to continue doing work too.
What information/debugging can I provide to help solve this?
I have a shutdown handler in child.js
let makeCloseHandler = (sig) => {
    return () => {
        console.log(`Received signal ${sig}`);
        server.close(() => {
            console.log('Server closed');
        });
    };
};
process.on('SIGINT', makeCloseHandler('SIGINT'));
process.on('SIGTERM', makeCloseHandler('SIGTERM'));
And cluster.js
const recluster = require('recluster');
const path = require('path');
const cluster = recluster(path.join(__dirname, 'index.js'), {
    timeout: 120
});
const workerEvent = function(ev) {
    cluster.on(ev, function(worker) {
        console.log('Worker ' + worker.id + ' [' + worker.process.pid + '] ' + ' ' + ev + '.');
    });
};
['online', 'listening', 'disconnect', 'exit'].forEach(function(ev) {
    workerEvent(ev);
});
cluster.run();
console.log('Master ' + process.pid + ' started.');
let makeCloseHandler = (sig) => {
    return () => {
        console.log(`Cluster received signal ${sig}`);
        cluster.terminate(() => {
            console.log('Cluster closed');
        });
    };
};
process.on('SIGINT', makeCloseHandler('SIGINT'));
process.on('SIGTERM', makeCloseHandler('SIGTERM'));
But I see these logs when I stop node cluster.js:
^CReceived signal SIGINT
Received signal SIGINT
Cluster received signal SIGINT
Received signal SIGINT
Server closed
Server closed
Server closed
Received signal SIGINT
Server closed
Cluster closed
It seems like the children are receiving SIGINT before the master does, so I'm confused about how graceful shutdown is handled here. What's the best way to ensure we don't drop a connection halfway through a request?
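One pattern that may help (my assumption about the setup, not something recluster prescribes): in a terminal, Ctrl-C sends SIGINT to the entire foreground process group, so the workers see it at the same time as the master. A worker can ignore SIGINT and let the master's terminate sequence drive its shutdown instead:

// worker-side: let the master own the shutdown sequence
process.on('SIGINT', function () {
    // deliberately ignored; the master's terminate() will disconnect
    // and, after the timeout, kill this worker
});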
Another one from me. How about, when timeout > 0, sending a "disconnecting" message to workers so they have a chance to gracefully clean up and close out existing connections. I understand that once disconnected they will not receive new connections from the master, but their servers may continue to serve existing long-lived connections for a while before abruptly exiting when the timeout is over.
Then in my app, I can do process.on("message", function(msg) { if (msg == "disconnecting") cleanUp(); }). As-is, the app knows nothing about what is going on until it receives SIGTERM from worker.kill, and by then existing clients may have been getting served by old workers for as long as the timeout; in production, you default to 1 hour for this.
What do you think? I am opening an issue rather than a PR because I am not sure I am thinking about this the right way and not missing something. But for code, I think it could be as simple as:
diff --git a/index.js b/index.js
index 42c3fbc..bc52213 100644
--- a/index.js
+++ b/index.js
@@ -157,6 +157,7 @@ module.exports = function(file, opt) {
if (opt.timeout > 0) {
var timeout = setTimeout(killfn, opt.timeout * 1000);
worker.on('exit', clearTimeout.bind(this, timeout));
+ worker.send('disconnecting');
} else {
killfn();
}
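And on the worker side, a minimal sketch of the handler this would enable (cleanUp is a placeholder for app-specific teardown; server is assumed to be in scope):

process.on('message', function (msg) {
    if (msg === 'disconnecting') {
        cleanUp(); // hypothetical app-specific teardown
        server.close(function () {
            process.exit(0);
        });
    }
});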
I just ran a test and it seems to demonstrate that reload trashes already established connections.
If I have a worker that is misconfigured (e.g., missing an env setting) and it never gets to the ready state, it would be nice if the entire cluster could shut down, instead of going into an endless loop trying and failing to start the workers. If the cluster master could be told to listen for a worker to reach the ready state at least once, then we'd know it's probably useful to respawn. If, however, we never get to ready, it's probably a good sign that respawning isn't going to help much.
I can see times where you might have workers that you want to keep kicking, so maybe this could be optional, opts.respawnExpectsReady or the like.
My server relies on passportjs session information for some functionality. However, when I run with recluster, req.session.passport is null. Is there any workaround or alternate approach to make this work?