Comments (13)
and a total of 308w+ completed jobs so far (maybe 4w+ per day):
What does "w" mean here? Not sure what that letter means in this context. "k" usually means thousand, "m" means million, but "w" means what????
From Storage.log, it seems that a lot of "completed log" items were being loaded into memory at that time (maybe during the maintenance operation?),
Yes, 4:00 AM local server time is generally when Cronicle runs its daily maintenance (see maintenance to configure when this happens), where it prunes lists that have grown too big, especially the logs/completed list. I've honestly never had this run out of memory before. It looks like you are running so many jobs per day that this operation has exceeded Node's default 1 GB RAM ceiling.
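For reference, the maintenance time lives in the main config file; a minimal sketch, assuming a standard install with the config at /opt/cronicle/conf/config.json:
"maintenance": "04:00",
The value is the local server time at which the daily maintenance run kicks off.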
There's only one thing I can suggest at this time. Please edit this file on all your Cronicle servers:
/opt/cronicle/bin/control.sh
Locate this line inside the file:
BINARY="node --expose_gc --always_compact $HOMEDIR/lib/main.js"
And change it to this:
BINARY="node --expose_gc --always_compact --max_old_space_size=4031 $HOMEDIR/lib/main.js"
Basically, we're adding in this Node.js command-line argument here: --max_old_space_size=4031
Then restart Cronicle on all your servers. This should allow Node.js to use up to 4 GB of RAM (the default is 1 GB).
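If you want to double-check that the new ceiling took effect, one quick sanity check (a sketch, assuming a stock Node.js install) is to ask V8 for its heap limit directly:
# Prints the V8 heap limit in MB; should be roughly 4031 with the flag, and a much smaller default without it:
node --max_old_space_size=4031 -e "console.log(Math.round(require('v8').getHeapStatistics().heap_size_limit / 1024 / 1024))"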
If this doesn't work, then I have no idea how to help you. You may simply be pushing this system far beyond where it was designed to go.
Good luck!
- Joe
Whew, that's a relief! 😌
Let's hope the new code in v0.6.9 keeps it under control from now on. It should prune all the lists down to 10,000 max items (configurable) every night.
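If you ever want to tune that cap, a sketch of the relevant setting in conf/config.json (key name and default are from the Cronicle docs as I remember them, so please verify):
"list_row_max": 10000,
Lowering it makes the nightly maintenance keep even less history.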
Thanks for the issue report, and sorry about the core dumps and corrupted log data.
What does "w" mean here? Not sure what that letter means in this context. "k" usually means thousand, "m" means million, but "w" means what????
Sorry, 1w = 10k; "w" is the abbreviation of wan [万], a unit of magnitude commonly used in China. :)
I have changed the control script and restarted the service; I hope it works.
This is the monitoring graph trend at that time:
Can I disable the daily maintenance in the configuration and do the maintenance with a separate script in the system crontab? That should avoid the service OOM.
Oh my... So if 1w = 10k (10,000), and you're doing 4w per day, then that's 40,000 jobs per day??? That's extraordinarily excessive. My jaw is on the floor right now. Unfortunately this is way, way more than the system was ever designed to handle. That's very likely why you're running out of memory.
Okay, so at this sheer magnitude of jobs, the standard listSplice() call, which is used to prune the lists during the daily maintenance, isn't going to work for you. It loads many pages into memory. That's simply too many items for it to splice at once. You're better off just clearing the lists every night. You'll lose the history of the completed items and activity, but I can't think of anything else, without redesigning the system somehow.
You can try sticking these commands in a shell script, activated by a crontab (or Cronicle I guess). Make sure you only run this on your Cronicle master server (not all the servers).
#!/bin/bash
/opt/cronicle/bin/storage-cli.js list_delete logs/complete
/opt/cronicle/bin/storage-cli.js list_delete logs/activity
These are the two largest lists, and probably the ones causing all the problems for you. If you delete these every night before the maintenance runs (maybe 3 AM?) it might work around your problem.
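A sketch of the matching crontab entry, assuming you save the script above as /opt/cronicle/bin/prune-logs.sh (a hypothetical path) and make it executable:
# Run the hypothetical prune script at 3 AM every night, logging its output:
0 3 * * * /opt/cronicle/bin/prune-logs.sh >> /var/log/cronicle-prune.log 2>&1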
I'm worried about race conditions, however, because the CLI script will be "fighting" with Cronicle to read/write to these same lists as they are being deleted, especially if you are launching a job every 2 seconds (~40K per day). We may run into data corruption here -- but I don't know what else to try. What you may have to do is stop Cronicle, run the commands, then start it again. Like this:
#!/bin/bash
/opt/cronicle/bin/control.sh stop
/opt/cronicle/bin/storage-cli.js list_delete logs/complete
/opt/cronicle/bin/storage-cli.js list_delete logs/activity
/opt/cronicle/bin/control.sh start
Again, please note that this should only be executed on the Cronicle master server (which is the only server that should ever "write" to the data store).
Good luck, and sorry. Please understand that Cronicle is not designed for this level of scale (not even close), and is still in pre-release beta (not yet even v1.0). I've never run more than 500 jobs in a day before. You're doing 80 times that amount.
- Joe
Thank you for your advice; we have a dozen events running every minute, so ... hah~
When we execute "/opt/cronicle/bin/storage-cli.js list_delete logs/complete", we get an exception:
[root@master bin]# ./storage-cli.js list_delete logs/complete
/opt/cronicle/bin/storage-cli.js:342
if (err) throw err;
^
Error: Failed to fetch key: logs/complete: File not found
    at ReadFileContext.callback (/opt/cronicle/node_modules/pixl-server-storage/engines/Filesystem.js:255:6)
    at FSReqWrap.readFileAfterOpen [as oncomplete] (fs.js:335:13)
and "get" command's output like this:
config.json ‘s Storage section:
Oh, I see, it's logs/completed, not logs/complete.
Stopping all masters to clear the lists is a bit unfriendly for production users, especially when using master/backup mode: they have to stop all servers in the master group to ensure no write activity is in progress.
Could we have a "pause/continue" feature that temporarily pauses/resumes all events' new job scheduling (the pause command could check status every 30s and exit once all previously running jobs are finished)? Then we could do the manual maintenance more safely, without stopping all our masters (which produces a flood of job-failure emails).
@iambocai I do apologize, that was a typo on my part. It is indeed logs/completed, as you said.
At this point I recommend we try the Node.js memory increase (I think you did this already -- please let me know the results), and if that doesn't work, try the nightly crontab delete.
I understand that stopping and starting the service is not ideal. You can skip this part, and just do the delete "hot" (while the service is running). We're most likely already in a situation where your data is a bit corrupted, because we've core dumped at least once in the middle of a maintenance run. So at this point I'd say, if the Node.js memory increase doesn't work, put in a crontab that runs this every night at 3 AM on your Master server:
#!/bin/bash
/opt/cronicle/bin/storage-cli.js list_delete logs/completed
/opt/cronicle/bin/storage-cli.js list_delete logs/activity
Also, please allow me to make another recommendation. Since you are running 40K jobs per day, you should probably greatly decrease the job_data_expire_days config setting from 180 days down to maybe 7 days. This is because we're pruning the main lists down to 10,000 items every night, so we're essentially losing the history of all your jobs after one day. So there's no need to keep the job logs around for much longer. However, you can still drill down into the job history for individual events, which are stored as different lists, so a few days is probably still good.
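For example, the corresponding line in conf/config.json would look like this (the key name comes from the config itself; the value of 7 is just the suggestion above):
"job_data_expire_days": 7,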
I am working on a new version of Cronicle which will attempt to prune lists without blowing out memory. I'll let you know when this is finished and released.
Hey @iambocai,
I just released Cronicle v0.6.9, which should fix this bug, if it is what I think it is. The nightly maintenance should now use far less memory when chopping old items off huge lists.
https://github.com/jhuckaby/Cronicle/releases/tag/v0.6.9
Hope this helps.
At this point I recommend we try the Node.js memory increase (I think you did this already -- please let me know the results)
Yes, I changed it yesterday. The good news is that it did not core dump last night; the bad news is that it also seems it did not delete the completed logs successfully.
(Ah, yes, there are 3219k+ completed logs in file storage now.)
And when I try to do the delete "hot", it keeps throwing exceptions; I guess we hit exactly the "CLI script fighting with Cronicle" condition you described.
I know nothing about pixl-server-storage; is there any way to delete these entries without suffering read/write conflicts (by page number / last update time)? Would deleting only the logs/completed entries older than a few days help, for example?
Again, thank you very much for helping me over the last few days. You are so nice!~ :)
[root@master bin]# node storage-cli.js list_delete logs/completed
[ERROR] Failed to delete object: logs/completed/-62935: File not found
/opt/cronicle/bin/storage-cli.js:342
if (err) throw err;
^
Error: ENOENT: no such file or directory, unlink '/home/homework/data/cronicle/logs/6a/fc/e6/6afce60cf3ff8c95c37153b5f48b43cf.json'
Oh wow. That is insane. Okay, the situation has grown beyond my ability to "fix" it, so we have to export all your critical data, delete your entire data directory contents, and then restore the backup. This is actually very easy, but you will have to temporarily stop the Master server. As long as you have Cronicle v0.6.9 or higher, this issue shouldn't happen again after we restore your data into a fresh directory.
Here is what the backup will include, meaning all the following data will be preserved and restored:
- All your users
- All your API keys
- All your Plugins
- All your server groups
- All your event categories
- All your schedule entries
- All your servers (cluster info)
Here is what will be lost completely (no choice really):
- All historical (completed) job logs will be lost
- All event history (stats) will be lost
- The "Activity" log contents in the Admin tab will be lost
- All user session cookies (meaning you will have to login with your user/password again)
I am sorry about this, but keep in mind that Cronicle is still in pre-release (v0.6) and is just not designed for your immense scale, or for repairing corrupted data on disk. I will work to make this system more robust before I ever release v1.0.
Here are the steps. Please do this on your master server only (and make sure you don't have any secondary backup servers that will take over as soon as the master goes down). Make sure you are the root user (i.e. superuser).
FYI, this is documented here: Data Import and Export
cd /opt/cronicle
./bin/control.sh stop
./bin/control.sh export data-backup.txt --verbose
You should now have a backup of all your critical data in /opt/cronicle/data-backup.txt. For safety, I don't recommend you hard delete the data directory at this point. Instead, you can rename or tarball it if you want. Renaming is faster, so we minimize your downtime:
mv /home/homework/data /home/homework/data-OLD
mkdir /home/homework/data
So this has moved the old data directory aside, renaming it to data-OLD, and we created a fresh new empty data directory. This way you can still revert if you want (i.e. just stop, swap the directories back, then start).
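For reference, a sketch of that revert, assuming the same paths as above:
cd /opt/cronicle
./bin/control.sh stop
# "data-NEW" is just a placeholder name for stashing the fresh directory:
mv /home/homework/data /home/homework/data-NEW
mv /home/homework/data-OLD /home/homework/data
./bin/control.sh start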
If you are not sure, now is a good time to make sure that Cronicle is on the latest version (i.e. v0.6.9), which has the nightly memory fix in it.
./bin/control.sh upgrade
Now let's restore your backup into the fresh new /home/homework/data directory we just created, and start the service:
./bin/control.sh import data-backup.txt
./bin/control.sh start
That should be it. Please let me know if any step fails or emits errors. I am hoping and praying that all the critical data contained in the backup doesn't have any corruption. With any luck the damaged records are all within the huge completed logs, which we are not bringing over.
Fingers crossed.
- Joe
Yeah, it works! About 49 GB of history data has gone with the wind, :D
I have kept watching the system for two weeks; the daily maintenance of the logs/completed and activity lists works fine now. But unfortunately, the jobs directory still seems to grow fast.
In config.json:
"job_data_expire_days": 3,
There are 893k files under the jobs directory now, far more than the expected 120k (40k jobs per day * 3 days):
That's odd; it's supposed to automatically delete those. However, please note that changing job_data_expire_days only takes effect for new jobs, meaning it doesn't retroactively expire older ones from before you made the change.
In the meantime, if this directory keeps growing out of control, you can put in a cronjob to delete old files from it, at least until I can find the bug:
find /home/homework/data/jobs -type f -mtime +3 -exec rm {} \;
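If you want to be careful, you can preview what that find would match before wiring up the delete (same path assumption as above):
# Dry run: list matching files instead of deleting them
find /home/homework/data/jobs -type f -mtime +3 | head -20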
Unfortunately your job volume is so extreme that I don't know of any feasible way to examine the nightly maintenance logs on your server. It would just be noise on top of all the other jobs your servers are running. However, when I have some time I will try to recreate this issue on a test server.