Dashboard and monitoring system for the UK Web Archive
Note Default organisation name used within Grafana still needs to be set manually via the Grafana UI. See grafana/grafana#2908 for details/progress.
Currently, we only have a functioning H020 scraper/exporter in the form of ukwa/hdfs-exporter.
Needs implementing at ukwa/hdfs-exporter#1 before it can be hooked in here.
I think this is a bug in the Stats Pusher, in that the same metric turns up multiple times with different labels, or maybe I'm using it wrong?
After specifying one additional metric, it turns up twice, once under each top-level key:
trackdb_numFound_rr_logs{instance="solr8",job="cdx_oa_wayback",label="logs"}
trackdb_numFound_rr_logs{instance="solr8",job="trackdb",label="logs"}
We need to be able to tag things as `production`, `beta`, `development` or `offline` services so we can report them separately. Also, only `production` or `beta` services should be actively monitored and cause alerts etc., but I think we can do this in the monitoring layer as long as the services are marked up.
Airflow on Ingest has been integrated with monitoring, in that we are now recording metrics. The `airflow_dag_last_status` metric records the outcome of the most recent run for each workflow (a.k.a. DAG). We have an alert for this, but it doesn't fire because the `for: 2hr` period is too long. Could you tweak it down to `for: 5m` so we know sooner if jobs are failing.
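For reference, a minimal sketch of what the adjusted rule might look like in `alert.rules.yml`; the alert name, severity, and the exact label selector on `airflow_dag_last_status` are assumptions, not the current rule:

```yaml
- alert: airflow_dag_run_failed
  # Assumes the exporter publishes one series per DAG with a status label,
  # where a value of 1 marks the outcome of the most recent run.
  expr: airflow_dag_last_status{status="failed"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Airflow DAG {{ $labels.dag_id }} last run failed"
```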
Having to check Alertmanager separately is a bit of a pain. Recent versions of Grafana include an Alertmanager data source plugin, which could be used to integrate alert inspection and management into a Grafana dashboard. The plugin is in alpha, but when it's ready we should take advantage of it and make a new status dashboard that includes it.
Note also that there's a horizontally-scalable, Prometheus-compatible system that might be worth checking out: https://cortexmetrics.io/
Given it's proven difficult to keep an eye on what's going on on AWS, we should integrate some monitoring into our stack. The best way to do this seems to be to use the cloudwatch_exporter, as that's supported by the core Prometheus team.
There are quite a lot of example configurations that could be tweaked, e.g. there is a fragment that measures EC2 CPU utilisation.
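A sketch of the kind of fragment meant here, assuming the Java `prometheus/cloudwatch_exporter` and an `eu-west-1` region (both assumptions):

```yaml
# Scrape average EC2 CPU utilisation per instance from CloudWatch.
region: eu-west-1
metrics:
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUUtilization
    aws_dimensions: [InstanceId]
    aws_statistics: [Average]
```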
Turns out we can use Prometheus and the Blackbox Exporter to track/look out for expiring SSL certificates. e.g.
```
(probe_ssl_earliest_cert_expiry - time())/(60*60*24)
```
This query reports the number of days until expiration for each HTTPS service probed via the Blackbox Exporter, so we can add an alert that fires when there are fewer than 30 days remaining.
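A sketch of what such a rule could look like in `alert.rules.yml` (the alert name, `for:` period and severity label are illustrative assumptions):

```yaml
- alert: ssl_certificate_expiring_soon
  # Days until the earliest certificate expiry drops below 30.
  expr: (probe_ssl_earliest_cert_expiry - time()) / (60 * 60 * 24) < 30
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "SSL certificate for {{ $labels.instance }} expires in under 30 days"
```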
We should also add the `alpha` and `beta` services to the system, at least to the extent that our production systems can see them. We still need to resolve exactly how to monitor the production systems.
It's often handy to use the same metric with different labels, as it makes breakdown analysis easier. I wanted to do something like this:
```
trackdb_numFound = #######
trackdb_numFound{label='warcs'} = #####
trackdb_numFound{label='cdx'} = ####
```
i.e. use the label to count sub-sets of the total number of files found in TrackDB. However, because the metric name (`trackdb_numFound`) is constructed from the JSON dictionary keys, and we only have one metric definition per key, we can only use each metric name once. This meant that to get it to work I had to use:
```
trackdb_numFound = ######
trackdb_numFound_warcs = #####
trackdb_numFound_cdx = ####
```
Which is not so easy to use.
The DLS relies on the HttpFS service with the domain name `dls.httpfs.wa.bl.uk`, which is hard-coded within DLS to refer to Nellie's 194. IP address. The firewall is set up along this route.
The monitoring system should check that the relevant HttpFS service is running on port 14000. Note also that our patched version of HttpFS has been modified so that the `Content-Length` header is declared. Ideally we should also check for that, in case we 'downgrade' by accident at some point in the future.
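One hedged way to cover both checks is a dedicated `blackbox_exporter` HTTP module (the header check needs a reasonably recent blackbox_exporter). The module name, timeout and probe target are assumptions; the target would need to be a real file read against `dls.httpfs.wa.bl.uk:14000` so that `Content-Length` is actually returned:

```yaml
modules:
  httpfs_content_length:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: [200]
      # Fail the probe if the Content-Length header is absent,
      # i.e. if we have accidentally 'downgraded' HttpFS.
      fail_if_header_not_matches:
        - header: Content-Length
          regexp: ".+"
          allow_missing: false
```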
The current gluster-filling-up alert uses `delta()` and I'm not sure what it's doing because it's not very easy to interpret. See e.g. this comparison of `delta()` and `deriv()`. That link includes this alternative implementation:
```
deriv(node_filesystem_free_bytes{instance="gluster-fuse:9100",mountpoint="/mnt/gluster/fc"}[24h]) > 1e6
```
This seems a bit more stable, so I suggest we switch to that instead of the current `delta()`-based rule.
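Wired into `alert.rules.yml`, that might look something like the sketch below; the alert name, `for:` period and severity are assumptions, and the sign and threshold should be double-checked, since a filling-up alert would normally want the derivative of free bytes to be negative:

```yaml
- alert: gluster_fc_filling_up
  # 24h trend of free space on the Gluster /fc mount, as suggested above.
  expr: deriv(node_filesystem_free_bytes{instance="gluster-fuse:9100",mountpoint="/mnt/gluster/fc"}[24h]) > 1e6
  for: 2h
  labels:
    severity: warning
  annotations:
    summary: "Free-space trend on /mnt/gluster/fc suggests it is filling up"
```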
The current Stats Pusher script `stat_values.py` calls `sys.exit()` if one of the HTTP calls fails. However, this kills the whole script, meaning that all subsequent metrics are no longer collected. This in turn means many false alerts get fired (see ukwa-monitor/stat-pusher/script/stat_values.py, lines 28 to 37 at cc9f9b0).
The loop that goes through the checks should continue on to the next one if there is a problem. There is already a catch-all `Exception` handler at the loop level, so the best plan would seem to be to `raise` the `Exception` up the chain rather than swallow it locally.
Currently, any WARC/log backlog on Gluster is identified by Gluster filling up overall, but Gluster contains other files and this metric turns out to be a poor indicator of issues, leading to a lot of false alarms (see ukwa-monitor/monitor/prometheus/alert.rules.yml, lines 141 to 148 at 5565d0a).
Rather, we should focus on specific alerts for logs and WARCs. There's already a daily log file alert (see ukwa-monitor/monitor/prometheus/alert.rules.yml, lines 96 to 103 at 5565d0a).
So this should be checked and another alert based on newly-established metrics should be used instead:
```
delta(ukwa_files_count{fs="gluster", job="warc_tidy", kind="warcs"}[12h]) > 0
```
This metric detects whether the number of WARCs on Gluster has increased over the last 12 hours. Looking back at this metric, we can tell that if this remains the case for more than 12 hours, there is a problem shifting data off Gluster.
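One reading of the above as a rule sketch (alert name, `for:` duration and severity are assumptions):

```yaml
- alert: warcs_backing_up_on_gluster
  # WARC count on Gluster has been growing, i.e. files are not being moved off to HDFS.
  expr: delta(ukwa_files_count{fs="gluster", job="warc_tidy", kind="warcs"}[12h]) > 0
  for: 12h
  labels:
    severity: warning
  annotations:
    summary: "WARCs appear to be accumulating on Gluster rather than moving to HDFS"
```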
Crawls are now launched with Airflow, so `alertname=crawl_launcher_has_not_run` alerts are no longer necessary. Airflow itself manages jobs and alerts on failures, so no replacement alert is needed in Prometheus itself.
Switching away from the custom-code approach here to configuring off-the-shelf monitoring tools.
Prometheus and Grafana as overall monitoring of statistics and alerts.
See https://github.com/ukwa/ukwa-documentation/blob/master/Monitoring-Services.md for details.
Areas to monitor:
Logs and other events (like crawl events) routed from servers, e.g. using `filebeat`, into a monitoring-events Kafka that `logstash` can consume and push to `elasticsearch`. This acts as a 'debugging console' where the last few days of logs are kept and can be used to debug what's happening.
To Do:

- Make the `logstash` data schema consistent with the Kafka crawl log feed.
- Use `logstash-http-poller` or the Prometheus `blackbox_exporter` to poll HTTP endpoints (see the sketch after this list). There's some useful example Docker stuff here.
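A minimal sketch of how the `blackbox_exporter` could be scraped from Prometheus to poll HTTP endpoints; the job name, module, target list and exporter address are assumptions:

```yaml
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://www.webarchive.org.uk/
    relabel_configs:
      # Pass the target URL to the exporter, keep it as the instance label,
      # then point the actual scrape at the blackbox_exporter itself.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```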
The JSON configuration of the stats pusher does not currently support responses that contain arrays. e.g. we could also query the CDX indexes directly (both crawl-time and access-time), like this:
However, this returns an array, not a dictionary, and `wa_stats` can't be configured to deal with that.
We should have some checks that will throw out alerts if it looks like certain processes haven't run.
To start with, the W3ACT production database should be backed up onto HDFS once a day (by this task). A monitoring task could check if the output is present for yesterday, and raise an exception if not.
Currently, the backup path is:
```
/2_backups/crawler01/pulsefeprod_postgres_1/w3act.pgdump-20170714
```
and you should be able to check that the task is complete using e.g.:

```python
import datetime

# BackupProductionW3ACTPostgres is the task defined in python-shepherd (see below).
yesterday = datetime.date.today() - datetime.timedelta(days=1)
target = BackupProductionW3ACTPostgres(date=yesterday)
if not target.complete():
    raise Exception("BLAH BLAH BLAH")
```
To do this, we need to add `python-shepherd` as a dependency for this project so it can inspect the tasks.
This metric can be used to spot if content is backing-up on Gluster:
```
-delta(node_filesystem_free_bytes{mountpoint='/mnt/gluster/fc',instance='gluster-fuse:9100'}[12h]) > 0
```
e.g. see the production Prometheus chart. This should be monitored, and if it goes on for a day, raise an alert.
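A sketch of a rule along those lines, with the 'goes on for a day' condition expressed via `for: 24h` (alert name and severity are assumptions):

```yaml
- alert: gluster_fc_content_backing_up
  # Free space on the Gluster /fc mount has been shrinking over the last 12h.
  expr: -delta(node_filesystem_free_bytes{mountpoint='/mnt/gluster/fc',instance='gluster-fuse:9100'}[12h]) > 0
  for: 24h
  labels:
    severity: warning
  annotations:
    summary: "Content appears to be backing up on /mnt/gluster/fc"
```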
We need a new alert, alongside this one that is based on what's on HDFS. The new alert should be modelled on that one, but use the following metric, which spots when the `tidy-logs` job has noted that the crawl log is missing or not growing:
```
delta(ukwa_crawler_log_size_bytes{log='crawl.log'}[1h]) == 0 or absent(ukwa_crawler_log_size_bytes{log='crawl.log'})
```
If this condition is active for one hour (`for: 1h`), then an alert should inform us that the `crawl_job_name` crawl is not writing to its `crawl.log`.
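Putting that together, a rule sketch; the alert name, severity, and the assumption that the crawl job name is available as a `crawl_job_name` label are mine:

```yaml
- alert: crawl_log_not_growing
  # Fires if crawl.log has not grown in the last hour, or the metric is missing entirely.
  expr: delta(ukwa_crawler_log_size_bytes{log='crawl.log'}[1h]) == 0 or absent(ukwa_crawler_log_size_bytes{log='crawl.log'})
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: "The {{ $labels.crawl_job_name }} crawl is not writing to its crawl.log"
```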
The current check for whether the CDX is up to date uses the BBC `robots.txt` file as a check, but (I think) Wayback is configured to omit revisits, so this only reports the date correctly if the `robots.txt` is changed. It would make more sense to use https://www.bbc.co.uk/news as the sensor URL.
If we can switch monitoring/Prometheus to use `host` networking mode, this will use the local `/etc/hosts` file and so avoid having to duplicate hosts mappings in the Docker setup.
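In a plain Docker Compose setup this is just a one-line change, sketched below (the service name and image tag are assumptions):

```yaml
services:
  prometheus:
    image: prom/prometheus
    # Share the host's network stack, so name resolution uses the host's /etc/hosts.
    network_mode: host
```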
The crawl logs in ElasticSearch sometimes have gaps, because Logstash gets stuck on some Kafka error that appears to be transient. We need some hook to check that there are recent logs in ElasticSearch, or maybe monitor Logstash itself.
e.g. https://github.com/alxrem/prometheus-logstash-exporter ?
or https://medium.com/@malone.spencer/logstash-events-to-prometheus-912d7ac43a74
Currently the system runs its own simple (if rather dense) dashboard, as a basic Python Flask app and a simple template that formats the results of the monitoring tasks (stored on disk).
Originally, we were planning to stick with this approach, but move towards a schematic representation of our service using CSS to indicate status (see overview.svg). Perhaps enhanced with a few trend plots (e.g. using plot.ly) as required.
An alternative is to just use some off-the-shelf dashboard that's fancier and configurable:
There are various things we'd ideally like to add to our monitoring dashboard, like screenshots of crawled pages, which would lean towards having our own simple dashboard. However, perhaps this is an unrealistic amount of effort? It's difficult to see the pay-off unless these are systems we can use to debug problems, not just be alerted to them.