
Comments (13)

jhuckaby commented on May 26, 2024

and totally 308w+ completed Jobs so far (maybe 4w+ per day):

What does "w" mean here? Not sure what that letter means in this context. "k" usually means thousand, "m" means million, but "w" means what????

From Storage.log, it seems that there are a lot of "completed log" items loading to memory at that time (may be in the maintenance operation?),

Yes, 4:00 AM local server time is generally when Cronicle runs its daily maintenance (see maintenance to configure when this happens), where it prunes lists that have grown too big, especially the logs/completed list. I've honestly never had this run out of memory before. It looks like you are running so many jobs per day that this operation has exceeded Node's default 1 GB RAM ceiling.
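
For reference, the maintenance window is just a time-of-day string in config.json. A minimal sketch (04:00 matches the default described above; check your own config for the value actually in effect):

"maintenance": "04:00",

Note that moving this earlier or later only changes when the pruning runs; it doesn't reduce the memory the operation needs.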

There's only one thing I can suggest at this time. Please edit this file on all your Cronicle servers:

/opt/cronicle/bin/control.sh

Locate this line inside the file:

BINARY="node --expose_gc --always_compact $HOMEDIR/lib/main.js"

And change it to this:

BINARY="node --expose_gc --always_compact --max_old_space_size=4031 $HOMEDIR/lib/main.js"

Basically, we're adding in this Node.js command-line argument here: --max_old_space_size=4096

Then restart Cronicle on all your servers. This should allow Node.js to use up to 4 GB of RAM (the default is 1 GB).
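
If you want to confirm the new flag actually took effect after the restart, one quick sanity check (assuming a standard Linux install) is to look at the running daemon's arguments:

# should list --max_old_space_size=4096 among the node arguments
ps -ef | grep "[m]ain.js"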

If this doesn't work, then I have no idea how to help you. You may simply be pushing this system far beyond where it was designed to go.

Good luck!

- Joe

jhuckaby commented on May 26, 2024

Whew, that's a relief! 😌

Let's hope the new code in v0.6.9 keeps it under control from now on. It should prune all the lists down to 10,000 max items (configurable) every night.
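
For reference, a sketch of where that cap lives in config.json (the key name here is from memory, so double-check it against the configuration docs before relying on it):

"list_row_max": 10000,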

Thanks for the issue report, and sorry about the core dumps and corrupted log data.

iambocai commented on May 26, 2024

What does "w" mean here? Not sure what that letter means in this context. "k" usually means thousand, "m" means million, but "w" means what????

Sorry, 1w = 10k; "w" is the abbreviation of wan [万], a unit of magnitude commonly used in China. :)

I have changed the control script and restarted the service, hope it works.

This is the monitoring graph trend at that time:
[screenshot: monitoring graph]

Can I disable the daily maintenance in the configuration and do the maintenance with a separate script in the system crontab? That should avoid the service OOM.

jhuckaby commented on May 26, 2024

Oh my... So if 1w = 10k (10,000), and you're doing 4w per day, then that's 40,000 jobs per day??? That's extraordinarily excessive. My jaw is on the floor right now. Unfortunately this is way, way more than the system was ever designed to handle. That's very likely why you're running out of memory.

Okay, so at this sheer magnitude of jobs, the standard listSplice() call, which is used to prune the lists during the daily maintenance, isn't going to work for you. It loads many pages into memory. That's simply too many items for it to splice at once. You're better off just clearing the lists every night. You'll lose the history of the completed items and activity, but I can't think of anything else, without redesigning the system somehow.

You can try sticking these commands in a shell script, activated by a crontab (or Cronicle I guess). Make sure you only run this on your Cronicle master server (not all the servers).

#!/bin/bash
/opt/cronicle/bin/storage-cli.js list_delete logs/complete
/opt/cronicle/bin/storage-cli.js list_delete logs/activity

These are the two largest lists, and probably the ones causing all the problems for you. If you delete these every night before the maintenance runs (maybe 3 AM?) it might work around your problem.

I'm worried about race conditions, however, because the CLI script will be "fighting" with Cronicle to read/write to these same lists as they are being deleted, especially if you are launching a job every 2 seconds (~40K per day). We may run into data corruption here -- but I don't know what else to try. What you may have to do is stop Cronicle, run the commands, then start it again. Like this:

#!/bin/bash
/opt/cronicle/bin/control.sh stop
/opt/cronicle/bin/storage-cli.js list_delete logs/complete
/opt/cronicle/bin/storage-cli.js list_delete logs/activity
/opt/cronicle/bin/control.sh start

Again, please note that this should only be executed on the Cronicle master server (which is the only server that should ever "write" to the data store).

Good luck, and sorry. Please understand that Cronicle is not designed for this level of scale (not even close), and is still in pre-release beta (not yet even v1.0). I've never run more than 500 jobs in a day before. You're doing 80 times that amount.

- Joe

iambocai commented on May 26, 2024

Thank you for your advice; we have a dozen events running every minute, so... hah~

When we execute "/opt/cronicle/bin/storage-cli.js list_delete logs/complete", we get an exception:

[root@master bin]# ./storage-cli.js list_delete logs/complete
/opt/cronicle/bin/storage-cli.js:342
if (err) throw err;
^

Error: Failed to fetch key: logs/complete: File not found
at ReadFileContext.callback (/opt/cronicle/node_modules/pixl-server-storage/engines/Filesystem.js:255:6)
at FSReqWrap.readFileAfterOpen [as oncomplete] (fs.js:335:13)

and "get" command's output like this:

image

config.json ‘s Storage section:
image

iambocai commented on May 26, 2024

Oh, I see, it's logs/completed, not logs/complete.

Stopping all masters to clear the lists is a bit unfriendly for production users, especially when the user runs in master/backup mode; they have to stop all servers in the master group to ensure no write activity is in progress.
Could you provide a "pause/continue" feature that temporarily pauses/resumes all events' new job scheduling (the pause command could check status every 30s and exit once all previously running jobs have finished)? That would let us do the manual maintenance more safely without stopping all our masters (which produces a lot of job-failure emails).

jhuckaby commented on May 26, 2024

@iambocai I do apologize, that was a typo on my part. It is indeed logs/completed, as you said.

At this point I recommend we try the Node.js memory increase (I think you did this already -- please let me know the results), and if that doesn't work, try the nightly crontab delete.

I understand that stopping and starting the service is not ideal. You can skip this part, and just do the delete "hot" (while the service is running). We're most likely already in a situation where your data is a bit corrupted, because we've core dumped at least once in the middle of a maintenance run. So at this point I'd say, if the Node.js memory increase doesn't work, put in a crontab that runs this every night at 3 AM on your Master server:

#!/bin/bash
/opt/cronicle/bin/storage-cli.js list_delete logs/completed
/opt/cronicle/bin/storage-cli.js list_delete logs/activity
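
A sketch of how that could be wired into crontab, assuming you save the two commands above as a script (the script path and log file below are just placeholders, not anything Cronicle ships with):

# run the prune script nightly at 3 AM, on the master only
0 3 * * * /opt/cronicle/bin/prune-logs.sh >> /var/log/cronicle-prune.log 2>&1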

Also, please allow me to make another recommendation. Since you are running 40K jobs per day, you should probably greatly decrease the job_data_expire_days config setting from 180 days down to maybe 7 days. This is because we're pruning the main lists down to 10,000 items every night, so we're essentially losing the history of all your jobs after one day. So there's no need to keep the job logs around for much longer. However, you can still drill down into the job history for individual events, which are stored as different lists, so a few days is probably still good.
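
A sketch of that change in config.json (7 days is just the example value from above; pick whatever retention actually fits your needs):

"job_data_expire_days": 7,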

I am working on a new version of Cronicle which will attempt to prune lists without blowing out memory. I'll let you know when this is finished and released.

jhuckaby commented on May 26, 2024

Hey @iambocai,

I just released Cronicle v0.6.9, which should fix this bug, if it is what I think it is. The nightly maintenance should now use far less memory when chopping old items off huge lists.

https://github.com/jhuckaby/Cronicle/releases/tag/v0.6.9

Hope this helps.

iambocai commented on May 26, 2024

At this point I recommend we try the Node.js memory increase (I think you did this already -- please let me know the results)

Yes, I changed it yesterday. The good news is it did not core dump last night; the bad news is it also seems it did not delete the completed logs successfully.
[screenshot: completed jobs list]
(Ah, yes, there are 3219k+ completed logs in file storage now.)

And when I try to do the delete "hot", it keeps throwing exceptions; I guess we hit exactly the "CLI script fighting with Cronicle" condition you described.

I know nothing about pixl-server-storage; is there a way to delete these items without suffering from read/write conflicts (by page number / last update time)? For example, would it help to delete only the logs/completed entries older than a few days?

Again, thank you very much for helping me over the last few days, you are so nice! :)

[root@master bin]# node storage-cli.js list_delete logs/completed
[ERROR] Failed to delete object: logs/completed/-62935: File not found
/opt/cronicle/bin/storage-cli.js:342
if (err) throw err;
^

Error: ENOENT: no such file or directory, unlink '/home/homework/data/cronicle/logs/6a/fc/e6/6afce60cf3ff8c95c37153b5f48b43cf.json'

jhuckaby commented on May 26, 2024

Oh wow. That is insane. Okay, the situation has grown beyond my ability to "fix" it, so we have to export all your critical data, delete your entire data directory contents, and then restore the backup. This is actually very easy, but you will have to temporarily stop the Master server. As long as you have Cronicle v0.6.9 or higher, this issue shouldn't happen again, after we restore your data into a fresh directory.

Here is what the backup will include, meaning all the following data will be preserved and restored:

  • All your users
  • All your API keys
  • All your Plugins
  • All your server groups
  • All your event categories
  • All your schedule entries
  • All your servers (cluster info)

Here is what will be lost completely (no choice really):

  • All historical (completed) jobs logs will be lost
  • All event history (stats) will be lost
  • The "Activity" log contents in the Admin tab will be lost
  • All user session cookies (meaning you will have to login with your user/password again)

I am sorry about this, but keep in mind that Cronicle is still in pre-release (v0.6) and is just not designed for your immense scale, or for repairing corrupted data on disk. I will work to make this system more robust before I ever release v1.0.

Here are the steps. Please do this on your master server only (and make sure you don't have any secondary backup servers that will take over as soon as the master goes down). Make sure you are the root user (i.e. superuser).

FYI, this is documented here: Data Import and Export

cd /opt/cronicle
./bin/control.sh stop
./bin/control.sh export data-backup.txt --verbose
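
Before going any further, it can't hurt to sanity-check that the export produced a non-empty file (the exact size will vary with your data):

ls -lh /opt/cronicle/data-backup.txt
head -5 /opt/cronicle/data-backup.txt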

You should now have a backup of all your critical data in /opt/cronicle/data-backup.txt. For safety I don't recommend you hard delete the data directory at this point. Instead, you can rename or tarball it if you want. Renaming is faster, so we minimize your downtime:

mv /home/homework/data /home/homework/data-OLD
mkdir /home/homework/data

So this has moved the old data directory aside, renaming it to data-OLD, and we created a fresh new empty data directory. This way you can still revert if you want (i.e. just stop, swap the directories back, then start).
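
For completeness, a minimal sketch of that revert in case it's ever needed (the data-NEW name is just a placeholder):

./bin/control.sh stop
mv /home/homework/data /home/homework/data-NEW
mv /home/homework/data-OLD /home/homework/data
./bin/control.sh start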

If you are not sure, now is a good time to make sure that Cronicle is on the latest version (i.e. v0.6.9), which has the nightly memory fix in it.

./bin/control.sh upgrade

Now let's restore your backup into the fresh new /home/homework/data directory we just created, and start the service:

./bin/control.sh import data-backup.txt
./bin/control.sh start
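
Once it's back up, a quick check that the daemon is actually running doesn't hurt (assuming control.sh supports the usual status subcommand; otherwise a ps check works just as well):

./bin/control.sh status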

That should be it. Please let me know if any step fails or emits errors. I am hoping and praying that all the critical data contained in the backup doesn't have any corruption. With any luck the damaged records are all within the huge completed logs, which we are not bringing over.

Fingers crossed.

- Joe

iambocai commented on May 26, 2024

Yeah, it works! About 49 GB of history data has gone with the wind. :D

iambocai commented on May 26, 2024

I have kept watching the system for two weeks; the daily maintenance of the logs/completed and activity lists works fine now. But unfortunately, the jobs dir still seems to grow fast:

in config.json:

"job_data_expire_days": 3,

disk usage:
[screenshot: disk usage]

There are 893k files under the jobs dir now, far more than 120k (40k jobs per day × 3 days):
[screenshot: file count under the jobs dir]

jhuckaby commented on May 26, 2024

That's odd, it's supposed to automatically delete those. However, please note that changing job_data_expire_days only takes effect for new jobs, meaning it doesn't retroactively expire older ones, from before you made the change.

However, if this directory keeps growing out of control, you can put in a cronjob to delete old files from it, at least until I can find the bug:

find /home/homework/data/jobs -type f -mtime +3 -exec rm {} \;
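
If you want to preview what would be removed before wiring that into cron (just a cautious sketch), swap the rm for a print first:

find /home/homework/data/jobs -type f -mtime +3 -print | head -20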

Unfortunately your job volume is so extreme that I don't know of any feasible way to examine the nightly maintenance logs on your server. It would just be noise on top of all the other jobs your servers are running. However, when I have some time I will try to recreate this issue on a test server.
