
ukwa-monitor's Introduction

ukwa-monitor

Dashboard and monitoring system for the UK Web Archive

Note: The default organisation name used within Grafana still needs to be set manually via the Grafana UI. See grafana/grafana#2908 for details/progress.

ukwa-monitor's People

Contributors

gilhoggarth, anjackson, ldbiz

Watchers

James Cloos, Nicola Bingham, Dorota Walker

ukwa-monitor's Issues

Stats Pusher submits multiple metrics

I think this is a bug in the Stats Pusher, in that the same metric turns up multiple times with different labels, or maybe I'm using it wrong?

After specifying one additional metric, it turns up twice, once under each top-level key:

trackdb_numFound_rr_logs{instance="solr8",job="cdx_oa_wayback",label="logs"}
trackdb_numFound_rr_logs{instance="solr8",job="trackdb",label="logs"}

Add alerts for Airflow

Airflow on Ingest has been integrated with monitoring, as in we are recording metrics, e.g.

http://monitor-prometheus.api.wa.bl.uk/graph?g0.expr=airflow_dag_last_status&g0.tab=1&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h

Here, the airflow_dag_last_status metric records the outcome of the most recent run for each workflow (a.k.a. DAG). We have an alert for this, but it doesn't fire because the for: 2h hold period is too long.

Could you tweak it down to for: 5m so we know sooner if jobs are failing?
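Something along these lines might do, mirroring the format of our other rules. To be clear, the alert name, the label structure of airflow_dag_last_status (I'm assuming status and dag_id labels) and the wording are my guesses, not the current config:

- alert: airflow_dag_failed               # assumed alert name
  expr: airflow_dag_last_status{status="failed"} == 1   # assumes a 'status' label; adjust to the exporter's actual labels
  for: 5m
  labels:
    severity: severe
  annotations:
    summary: "An Airflow DAG has failed"
    description: "The most recent run of the {{ $labels.dag_id }} DAG did not succeed."   # assumes a 'dag_id' label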

Move alerts into a Grafana dashboard

Having to check Alertmanager separately is a bit of a pain. Recent versions of Grafana include an Alertmanager data source module, which could be used to integrate alert inspection and management into a Grafana dashboard. The module is in alpha, but when it's ready we should take advantage of it and make a new status dashboard that includes it.
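As a rough sketch, once the module is stable the data source could be provisioned alongside the Prometheus one, something like the following (the data source type name depends on the Grafana version, and the name and URL are assumptions):

apiVersion: 1
datasources:
  - name: Alertmanager                    # assumed name
    type: alertmanager                    # core Alertmanager data source type in recent Grafana; may differ for the plugin
    access: proxy
    url: http://monitor-alertmanager:9093 # assumed service address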

Note also that there's a new Prometheus implementation that might be worth checking out: https://cortexmetrics.io/

Add AWS CloudWatch monitoring

Given it's proven difficult to keep an eye on what's going on on AWS, we should integrate some monitoring into our stack. The best way to do this seems to be to use the cloudwatch_exporter, as that's supported by the core Prometheus team.

There are quite a lot of example configurations that could be tweaked, e.g. this bit, which measures EC2 CPU.
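For example, a minimal cloudwatch_exporter config covering EC2 CPU might look something like this (the region is assumed, and the dimensions/statistics would need adjusting to what we actually want to track):

region: eu-west-2                  # assumed region
metrics:
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUUtilization
    aws_dimensions: [InstanceId]
    aws_statistics: [Average]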

Add LDL services and add a cert expiration alert

It turns out we can use Prometheus and the Blackbox Exporter to keep an eye out for expiring SSL certificates, e.g.

(probe_ssl_earliest_cert_expiry - time())/(60*60*24)

This query reports the days until expiration for each HTTPS service probed by the Blackbox Exporter, so we can add an alert that fires when there are fewer than 30 days remaining.
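A sketch of what such a rule could look like, reusing the expression above (the alert name, hold period and wording are my guesses):

- alert: ssl_certificate_expiring_soon    # assumed alert name
  expr: (probe_ssl_earliest_cert_expiry - time()) / (60 * 60 * 24) < 30
  for: 1h
  labels:
    severity: severe
  annotations:
    summary: "An SSL certificate expires in under 30 days"
    description: "The certificate for {{ $labels.instance }} expires in less than 30 days."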

We should also add the alpha and beta services to the system, at least to the extent that our production systems can see them. We still need to work out exactly how to monitor the production systems.

Stats Pusher to allow the same metric with different labels

It's often handy to use the same metric with different labels, as it makes breakdown analysis easier. I wanted to do something like this:

trackdb_numFound = #######
trackdb_numFound{label='warcs'} = #####
trackdb_numFound{label='cdx'} = ####

i.e. use the label to count subsets of the total number of files found in TrackDB. However, because the metric name (trackdb_numFound) is constructed from the JSON dictionary keys, and we only have one metric definition per key, we can only use each metric name once. That meant, to get it to work, I had to use:

trackdb_numFound = ######
trackdb_numFound_warcs = #####
trackdb_numFound_cdx = ####

This is not so easy to use.

Add a check that the main HttpFS service is running

The DLS relies on the HttpFS service with the domain name dls.httpfs.wa.bl.uk, which is hard-coded within DLS to refer to Nellie's 194. IP address. The firewall is set up along this route.

The monitor system should check that the relevant HttpFS service is running on port 14000. Note also that our patched version of HttpFS has been modified so that the Content-Length header is declared. Ideally we should also check for that, in case we 'downgrade' by accident at some point in the future.
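A minimal sketch of the availability part, assuming the HttpFS endpoint is probed via the Blackbox Exporter (the probe target in the instance label and the alert name are my guesses); the Content-Length check would additionally need a Blackbox HTTP module with header matching configured:

- alert: dls_httpfs_down                  # assumed alert name
  expr: probe_success{instance="http://dls.httpfs.wa.bl.uk:14000/"} == 0   # assumed probe target
  for: 5m
  labels:
    severity: severe
  annotations:
    summary: "The DLS HttpFS service appears to be down"
    description: "The Blackbox probe of the HttpFS service on port 14000 is failing."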

Move Gluster-filling-up metric to something more reliable

The current gluster-filling-up alert uses delta() and I'm not sure what it's doing, because it's not very easy to interpret. See e.g. this comparison of delta() and deriv().

That link includes this alternative implementation:

deriv(node_filesystem_free_bytes{instance="gluster-fuse:9100",mountpoint="/mnt/gluster/fc"}[24h]) > 1e6

This seems a bit more stable, so I suggest we switch to that instead of:

expr: -delta(node_filesystem_free_bytes{mountpoint='/mnt/gluster/fc',instance='gluster-fuse:9100'}[12h]) > 0

Stats Pusher stops gathering metrics if one of them fails

The current Stats Pusher stat_values.py calls sys.exit() if one of the HTTP calls fails. However, this kills the whole script, meaning that all subsequent metrics are no longer collected. This in turn means many false alerts get fired.

try:
    r = requests.get(uri)
    logger.debug(f"Response code [{r.status_code}]")
    r.raise_for_status()
    response = r.json()
except HTTPError as he:
    logger.error(f"HTTP error trying to get [{uri}]\n[{he}]")
    sys.exit()
except Exception as e:
    logger.error(f"Failed to get [{uri}]\n[{e}]")
    sys.exit()

The loop that goes through the checks should continue on to the next one if there is a problem. There is already a catch-all Exception handler at the loop level, so the best plan would seem to be to raise the Exception up the chain rather than swallow it locally.

Improve WARC backlog alerting

Currently, any WARC/log backlog on Gluster is identified by Gluster filling up overall, but Gluster contains other files and this metric turns out to be a poor indicator of issues, leading to a lot of false alarms:

- alert: gluster_fc_filling_up
  expr: deriv(node_filesystem_free_bytes{instance="gluster-fuse:9100",mountpoint="/mnt/gluster/fc"}[24h]) > 1e6
  for: 12h
  labels:
    severity: severe
  annotations:
    summary: "The FC Gluster volume is filling up"
    description: "The Gluster volume that stores FC output is filling up. This likely means there's a problem with the move-to-hdfs process and WARCs are backing up."

Rather, we should focus on specific alerts for logs and WARCs. There's already a daily log file alert:

- alert: trackdb_no_new_npld_crawl_logs
  expr: absent(trackdb_last_timestamp{label="npld.crawl-logs"}) or (time() - trackdb_last_timestamp{label="npld.crawl-logs"}) / (60 * 60) > 24
  for: 8h
  labels:
    severity: severe
  annotations:
    summary: "No new NPLD crawl logs on HDFS!"
    description: "According to TrackDB, no new NPLD crawl logs have turned up on HDFS lately."

So that alert should be checked, and another alert, based on the newly-established metrics, should be used instead:

delta(ukwa_files_count{fs="gluster", job="warc_tidy", kind="warcs"}[12h] ) > 0

This metric detects whether the number of WARCs on Gluster has increased over the last 12 hours. Looking back at this metric, we can tell that if this is the case for more than 12 hours, there is a problem shifting data off Gluster.
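Something along these lines, mirroring the existing rule format (the alert name, hold period and wording are my guesses):

- alert: gluster_warc_backlog             # assumed alert name
  expr: delta(ukwa_files_count{fs="gluster", job="warc_tidy", kind="warcs"}[12h]) > 0
  for: 12h                                # hold period is a guess
  labels:
    severity: severe
  annotations:
    summary: "WARCs are backing up on Gluster"
    description: "The number of WARCs on Gluster has kept increasing, which suggests move-to-hdfs is not shifting data off Gluster."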

Drop old alert related to Luigi crawl launcher

Crawls are now launched with Airflow, so alertname=crawl_launcher_has_not_run alerts are no longer necessary. Airflow itself manages jobs and alerts on failures, so no replacement alert is needed in Prometheus itself.

Document monitoring architecture and plan

Switching away from the custom-code approach here, to configuring off-the-shelf monitoring tools.

Prometheus and Grafana provide the overall monitoring of statistics and alerts.

See https://github.com/ukwa/ukwa-documentation/blob/master/Monitoring-Services.md for details.

Areas to monitor:

  • Check HTTP-based services are up and responsive (in groups)
  • Check HDFS storage status and increase. (Means scraping HTML tags)
  • Check Crawler status (Means scanning Docker containers and re-formatting JSON)
  • Check Crawler local disk space etc.
  • Check AMQP or Kafka queues.

Logs and other events (like crawl events) are routed from servers, e.g. using Filebeat, into a monitoring-events Kafka topic that Logstash can consume and push into Elasticsearch. This acts as a 'debugging console' where the last few days of logs are kept and can be used to debug what's happening.
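As a rough sketch of the first hop, a Filebeat configuration pointing at the monitoring Kafka might look like this (the log paths, broker address and topic name are assumptions):

filebeat.inputs:
  - type: log
    paths:
      - /var/log/*.log                    # assumed log paths
output.kafka:
  hosts: ["monitor-kafka:9092"]           # assumed broker address
  topic: "monitoring-events"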

To Do:

There's some useful example Docker stuff here.

Add a task to check that the W3ACT production database has been backed up

We should have some checks that will throw out alerts if it looks like certain processes haven't run.

To start with, the W3ACT production database should be backed up onto HDFS once a day (by this task). A monitoring task could check whether the output is present for yesterday, and raise an exception if not.

Currently, the backup path is:

/2_backups/crawler01/pulsefeprod_postgres_1/w3act.pgdump-20170714

and you should be able to check the task is complete using e.g.

import datetime

# BackupProductionW3ACTPostgres comes from python-shepherd (see note below)
yesterday = datetime.date.today() - datetime.timedelta(days=1)
target = BackupProductionW3ACTPostgres(date=yesterday)
if not target.complete():
    raise Exception("BLAH BLAH BLAH")

To do this, we need to add python-shepherd as a dependency for this project so it can inspect the tasks.

Additional alert if the crawl log(s) are not being written.

We need a new alert, alongside this one, which is based on what's on HDFS.

The new alert should be based on that, but use this metric, which spots when the tidy-logs job has noted that the crawl log is missing or not growing:

delta(ukwa_crawler_log_size_bytes{log='crawl.log'}[1h]) == 0 or absent(ukwa_crawler_log_size_bytes{log='crawl.log'})

If this condition is active for an hour (for: 1h), then an alert should inform us that the crawl_job_name crawl is not writing to its crawl.log.
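A sketch using the expression above (the alert name, the label used to identify the crawl job, and the wording are my guesses):

- alert: crawl_log_not_being_written      # assumed alert name
  expr: delta(ukwa_crawler_log_size_bytes{log='crawl.log'}[1h]) == 0 or absent(ukwa_crawler_log_size_bytes{log='crawl.log'})
  for: 1h
  labels:
    severity: severe
  annotations:
    summary: "A crawler is not writing to its crawl.log"
    description: "The crawl.log for {{ $labels.job }} has not grown in the last hour."   # assumes the crawl job name is in the 'job' label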

Fix problem with CDX-up-to-date check

The current check for whether the CDX is up to date uses the BBC robots.txt file as the sensor, but (I think) Wayback is configured to omit revisits, so the check only reports the date correctly if the robots.txt has changed. It would make more sense to use https://www.bbc.co.uk/news as the sensor URL.

Decide whether to enhance the dashboard or switch to an off-the-shelf system

Currently the system runs it's own simple (if rather dense) dashboard, as a basic Python Flask app and a simple template that formats the result of the monitoring tasks (stored on disk).

Originally, we were planning to stick with this approach, but move towards a schematic representation of our service, using CSS to indicate status (see overview.svg), perhaps enhanced with a few trend plots (e.g. using plot.ly) as required.

An alternative is to just use some off-the-shelf dashboard that's fancier and configurable.

There are various things we'd ideally like to add to our monitoring dashboard, like screenshots of crawled pages, which would lean towards having our own simple dashboard. However, perhaps this is an unrealistic amount of effort? It's difficult to see the pay-off unless these are systems we can use to debug problems, not just be alerted to them.
