
ukwa-monitor's Introduction

ukwa-monitor

Dashboard and monitoring system for the UK Web Archive

Note: The default organisation name used within Grafana still needs to be set manually via the Grafana UI. See grafana/grafana#2908 for details/progress.

ukwa-monitor's People

Contributors

gilhoggarth, anjackson, ldbiz

Watchers

James Cloos, Nicola Bingham, Dorota Walker

ukwa-monitor's Issues

Stats Pusher submits multiple metrics

I think this is a bug in the Stats Pusher, in that the same metric turns up multiple times with different labels, or maybe I'm using it wrong?

After specifying one additional metric, it turns up twice, once under each top-level key:

trackdb_numFound_rr_logs{instance="solr8",job="cdx_oa_wayback",label="logs"}
trackdb_numFound_rr_logs{instance="solr8",job="trackdb",label="logs"}

Add alerts for Airflow

Airflow on Ingest has been integrated with monitoring, as in we are recording metrics, e.g.

http://monitor-prometheus.api.wa.bl.uk/graph?g0.expr=airflow_dag_last_status&g0.tab=1&g0.stacked=0&g0.show_exemplars=0&g0.range_input=1h

Here, the airflow_dag_last_status metric records the outcome of the most recent run for each workflow (a.k.a. DAG). We have an alert for this, but it doesn't fire because the for: 2h hold period is too long.

Could you tweak it down to for: 5m so we know sooner if jobs are failing?
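Something along these lines might do, mirroring the format of our other rules. To be clear, the alert name, the label structure of airflow_dag_last_status (I'm assuming status and dag_id labels) and the wording are my guesses, not the current config:

- alert: airflow_dag_failed               # assumed alert name
  expr: airflow_dag_last_status{status="failed"} == 1   # assumes a 'status' label; adjust to the exporter's actual labels
  for: 5m
  labels:
    severity: severe
  annotations:
    summary: "An Airflow DAG has failed"
    description: "The most recent run of the {{ $labels.dag_id }} DAG did not succeed."   # assumes a 'dag_id' label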

Move alerts into a Grafana dashboard

Having to check Alertmanager separately is a bit of a pain. Recent versions of Grafana include an Alertmanager data source module, which could be used to integrate alert inspection and management into a Grafana dashboard. The module is in alpha, but when it's ready we should take advantage of it and make a new status dashboard that includes it.
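As a rough sketch, once the module is stable the data source could be provisioned alongside the Prometheus one, something like the following (the data source type name depends on the Grafana version, and the name and URL are assumptions):

apiVersion: 1
datasources:
  - name: Alertmanager                    # assumed name
    type: alertmanager                    # core Alertmanager data source type in recent Grafana; may differ for the plugin
    access: proxy
    url: http://monitor-alertmanager:9093 # assumed service address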

Note also that there's a new Prometheus implementation that might be worth checking out: https://cortexmetrics.io/

Add AWS CloudWatch monitoring

Given it's proven difficult to keep an eye on what's going on on AWS, we should integrate some monitoring into our stack. The best way to do this seems to be to use the cloudwatch_exporter, as that's supported by the core Prometheus team.

There are quite a lot of example configurations that could be tweaked, e.g. this bit, which measures EC2 CPU.
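For example, a minimal cloudwatch_exporter config covering EC2 CPU might look something like this (the region is assumed, and the dimensions/statistics would need adjusting to what we actually want to track):

region: eu-west-2                  # assumed region
metrics:
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUUtilization
    aws_dimensions: [InstanceId]
    aws_statistics: [Average]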

Add LDL services and add a cert expiration alert

It turns out we can use Prometheus and the Blackbox Exporter to keep an eye out for expiring SSL certificates, e.g.

(probe_ssl_earliest_cert_expiry - time())/(60*60*24)

This query reports the days until expiration for each HTTPS service probed by the Blackbox Exporter, so we can add an alert that fires when there are fewer than 30 days remaining.
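A sketch of what such a rule could look like, reusing the expression above (the alert name, hold period and wording are my guesses):

- alert: ssl_certificate_expiring_soon    # assumed alert name
  expr: (probe_ssl_earliest_cert_expiry - time()) / (60 * 60 * 24) < 30
  for: 1h
  labels:
    severity: severe
  annotations:
    summary: "An SSL certificate expires in under 30 days"
    description: "The certificate for {{ $labels.instance }} expires in less than 30 days."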

We should also add the alpha and beta services to the system, at least to the extent that our production systems can see them. We still need to work out exactly how to monitor the production systems.

Stats Pusher to allow the same metric with different labels

It's often handy to use the same metric with different labels, as it makes breakdown analysis easier. I wanted to do something like this:

trackdb_numFound = #######
trackdb_numFound{label='warcs'} = #####
trackdb_numFound{label='cdx'} = ####

i.e. use the label to count subsets of the total number of files found in TrackDB. However, because the metric name (trackdb_numFound) is constructed from the JSON dictionary keys, and we only have one metric definition per key, we can only use each metric name once. That meant, to get it to work, I had to use:

trackdb_numFound = ######
trackdb_numFound_warcs = #####
trackdb_numFound_cdx = ####

This is not so easy to use.

Add a check that the main HttpFS service is running

The DLS relies on the HttpFS service with the domain name dls.httpfs.wa.bl.uk, which is hard-coded within DLS to refer to Nellie's 194. IP address. The firewall is set up along this route.

The monitor system should check that the relevant HttpFS service is running on port 14000. Note also that our patched version of HttpFS has been modified so that the Content-Length header is declared. Ideally we should also check for that, in case we 'downgrade' by accident at some point in the future.
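A minimal sketch of the availability part, assuming the HttpFS endpoint is probed via the Blackbox Exporter (the probe target in the instance label and the alert name are my guesses); the Content-Length check would additionally need a Blackbox HTTP module with header matching configured:

- alert: dls_httpfs_down                  # assumed alert name
  expr: probe_success{instance="http://dls.httpfs.wa.bl.uk:14000/"} == 0   # assumed probe target
  for: 5m
  labels:
    severity: severe
  annotations:
    summary: "The DLS HttpFS service appears to be down"
    description: "The Blackbox probe of the HttpFS service on port 14000 is failing."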

Move Gluster-filling-up metric to something more reliable

The current gluster-filling-up alert uses delta() and I'm not sure what it's doing, because it's not very easy to interpret. See e.g. this comparison of delta() and deriv().

That link includes this alternative implementation:

deriv(node_filesystem_free_bytes{instance="gluster-fuse:9100",mountpoint="/mnt/gluster/fc"}[24h]) > 1e6

This seems a bit more stable, so I suggest we switch to that instead of:

expr: -delta(node_filesystem_free_bytes{mountpoint='/mnt/gluster/fc',instance='gluster-fuse:9100'}[12h]) > 0

Stats Pusher stops gathering metrics if one of them fails

The current Stats Pusher stat_values.py calls sys.exit() if one of the HTTP calls fails. However, this kills the whole script, meaning that all subsequent metrics are no longer collected. This in turn means many false alerts get fired.

try:
    r = requests.get(uri)
    logger.debug(f"Response code [{r.status_code}]")
    r.raise_for_status()
    response = r.json()
except HTTPError as he:
    logger.error(f"HTTP error trying to get [{uri}]\n[{he}]")
    sys.exit()
except Exception as e:
    logger.error(f"Failed to get [{uri}]\n[{e}]")
    sys.exit()

The loop that goes through the checks should continue on to the next one if there is a problem. There is already a catch-all Exception handler at the loop level, so the best plan would seem to be to raise the Exception up the chain rather than swallow it locally.

Improve WARC backlog alerting

Currently, any WARC/log backlog on Gluster is identified by Gluster filling up overall, but Gluster contains other files and this metric turns out to be a poor indicator of issues, leading to a lot of false alarms:

- alert: gluster_fc_filling_up
  expr: deriv(node_filesystem_free_bytes{instance="gluster-fuse:9100",mountpoint="/mnt/gluster/fc"}[24h]) > 1e6
  for: 12h
  labels:
    severity: severe
  annotations:
    summary: "The FC Gluster volume is filling up"
    description: "The Gluster volume that stores FC output is filling up. This likely means there's a problem with the move-to-hdfs process and WARCs are backing up."

Rather, we should focus on specific alerts for logs and WARCs. There's already a daily log file alert:

- alert: trackdb_no_new_npld_crawl_logs
  expr: absent(trackdb_last_timestamp{label="npld.crawl-logs"}) or (time() - trackdb_last_timestamp{label="npld.crawl-logs"}) / (60 * 60) > 24
  for: 8h
  labels:
    severity: severe
  annotations:
    summary: "No new NPLD crawl logs on HDFS!"
    description: "According to TrackDB, no new NPLD crawl logs have turned up on HDFS lately."

So that alert should be checked, and another alert, based on the newly-established metrics, should be used instead:

delta(ukwa_files_count{fs="gluster", job="warc_tidy", kind="warcs"}[12h] ) > 0

This metric detects whether the number of WARCs on Gluster has increased over the last 12 hours. Looking back at this metric, we can tell that if this is the case for more than 12 hours, there is a problem shifting data off Gluster.
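Something along these lines, mirroring the existing rule format (the alert name, hold period and wording are my guesses):

- alert: gluster_warc_backlog             # assumed alert name
  expr: delta(ukwa_files_count{fs="gluster", job="warc_tidy", kind="warcs"}[12h]) > 0
  for: 12h                                # hold period is a guess
  labels:
    severity: severe
  annotations:
    summary: "WARCs are backing up on Gluster"
    description: "The number of WARCs on Gluster has kept increasing, which suggests move-to-hdfs is not shifting data off Gluster."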

Drop old alert related to Luigi crawl launcher

Crawls are now launched with Airflow, so alertname=crawl_launcher_has_not_run alerts are no longer necessary. Airflow itself manages jobs and alerts on failures, so no replacement alert is needed in Prometheus itself.

Document monitoring architecture and plan

Switching away from the custom-code approach here, to configuring off-the-shelf monitoring tools.

Prometheus and Grafana provide the overall monitoring of statistics and alerts.

See https://github.com/ukwa/ukwa-documentation/blob/master/Monitoring-Services.md for details.

Areas to monitor:

  • Check HTTP-based services are up and responsive (in groups)
  • Check HDFS storage status and increase. (Means scraping HTML tags)
  • Check Crawler status (Means scanning Docker containers and re-formatting JSON)
  • Check Crawler local disk space etc.
  • Check AMQP or Kafka queues.

Logs and other events (like crawl events) are routed from servers, e.g. using Filebeat, into a monitoring-events Kafka topic that Logstash can consume and push into Elasticsearch. This acts as a 'debugging console' where the last few days of logs are kept and can be used to debug what's happening.
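As a rough sketch of the first hop, a Filebeat configuration pointing at the monitoring Kafka might look like this (the log paths, broker address and topic name are assumptions):

filebeat.inputs:
  - type: log
    paths:
      - /var/log/*.log                    # assumed log paths
output.kafka:
  hosts: ["monitor-kafka:9092"]           # assumed broker address
  topic: "monitoring-events"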

To Do:

There's some useful example Docker stuff here.

Add a task to check that the W3ACT production database has been backed up

We should have some checks that will throw out alerts if it looks like certain processes haven't run.

To start with, the W3ACT production database should be backed up onto HDFS once a day (by this task). A monitoring task could check whether the output is present for yesterday, and raise an exception if not.

Currently, the backup path is:

/2_backups/crawler01/pulsefeprod_postgres_1/w3act.pgdump-20170714

and you should be able to check the task is complete using e.g.

import datetime

# BackupProductionW3ACTPostgres comes from python-shepherd (see note below)
yesterday = datetime.date.today() - datetime.timedelta(days=1)
target = BackupProductionW3ACTPostgres(date=yesterday)
if not target.complete():
    raise Exception("BLAH BLAH BLAH")

To do this, we need to add python-shepherd as a dependency for this project so it can inspect the tasks.

Additional alert if the crawl log(s) are not being written.

We need a new alert, alongside this one, which is based on what's on HDFS.

The new alert should be based on that, but use this metric, which spots when the tidy-logs job has noted that the crawl log is missing or not growing:

delta(ukwa_crawler_log_size_bytes{log='crawl.log'}[1h]) == 0 or absent(ukwa_crawler_log_size_bytes{log='crawl.log'})

If this condition is active for an hour (for: 1h), then an alert should inform us that the crawl_job_name crawl is not writing to its crawl.log.
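A sketch using the expression above (the alert name, the label used to identify the crawl job, and the wording are my guesses):

- alert: crawl_log_not_being_written      # assumed alert name
  expr: delta(ukwa_crawler_log_size_bytes{log='crawl.log'}[1h]) == 0 or absent(ukwa_crawler_log_size_bytes{log='crawl.log'})
  for: 1h
  labels:
    severity: severe
  annotations:
    summary: "A crawler is not writing to its crawl.log"
    description: "The crawl.log for {{ $labels.job }} has not grown in the last hour."   # assumes the crawl job name is in the 'job' label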

Fix problem with CDX-up-to-date check

The current check for whether the CDX is up to date uses the BBC robots.txt file as the sensor, but (I think) Wayback is configured to omit revisits, so the check only reports the date correctly if the robots.txt has changed. It would make more sense to use https://www.bbc.co.uk/news as the sensor URL.

Decide whether to enhance the dashboard or switch to an off-the-shelf system

Currently the system runs it's own simple (if rather dense) dashboard, as a basic Python Flask app and a simple template that formats the result of the monitoring tasks (stored on disk).

Originally, we were planning to stick with this approach, but move towards a schematic representation of our service, using CSS to indicate status (see overview.svg), perhaps enhanced with a few trend plots (e.g. using plot.ly) as required.

An alternative is to just use some off-the-shelf dashboard that's fancier and configurable.

There are various things we'd ideally like to add to our monitoring dashboard, like screenshots of crawled pages, which would lean towards having our own simple dashboard. However, perhaps this is an unrealistic amount of effort? It's difficult to see the pay-off unless these are systems we can use to debug problems, not just be alerted to them.
