NTPmon

Introduction

NTPmon is a program which is designed to report on essential health metrics for NTP. It provides a Nagios check which can be used with many alerting systems, including support for Nagios performance data. NTPmon can also run as a daemon for sending metrics to collectd, prometheus, or telegraf. It supports both ntpd and chronyd.

NTPmon is designed to encourage the use of robust NTP configurations. The defaults for what is considered healthy and non-healthy are roughly based on RFC8633: NTP Best Current Practices.

Copyright

Copyright (c) 2015-2024 Paul D. Gear https://libertysys.com.au/

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see https://www.gnu.org/licenses/agpl.html.

Gallery

Here are some graphs produced with data gathered by NTPmon using telegraf, InfluxDB, and Grafana.

A system offset graph:

Graph of system offset

A system offset histogram:

Histogram of system offset

A root dispersion graph:

Graph of root dispersion

A frequency error graph showing variation based on temperature due to time of day:

Histogram of frequency error

Or you could try these interactive Grafana dashboard snapshots:

You can find more context for these dashboards in the following blog posts:

Installation

On Ubuntu (and possibly other Debian derivatives) NTPmon and its prerequisites can be installed from its PPA using:

sudo add-apt-repository ppa:paulgear/ntpmon
sudo apt install chrony ntpmon

chrony is the preferred NTP server on Ubuntu; you can also use ntp or ntpsec from the universe pool, although they are not guaranteed to receive security updates unless you use Ubuntu Pro.

If you wish to use something other than the prometheus exporter by default, you must edit /etc/default/ntpmon to configure the command-line options. Run /opt/ntpmon/bin/ntpmon --help for details of all available options.

Prerequisites

NTPmon is written in python, and requires python 3.8 or later. It uses modules from the standard python library, and also requires the psutil library, which is available from pypi or your operating system repositories. It requires ntpq or chronyc to retrieve metrics from the running NTP daemon. If you intend to run the prometheus exporter, the prometheus python client is also required.

On Ubuntu (and probably other Debian-based Linux distributions), you can install all the prerequisites by running:

sudo apt-get install chrony python3-prometheus-client python3-psutil
# or substitute ntp for the traditional NTP server

Usage

To run NTPmon directly from source after manually installing the prerequisites:

cd /opt
git clone https://github.com/paulgear/ntpmon
cd ntpmon
./src/ntpmon.py --help

Metrics

NTPmon alerts on the following metrics of the local NTP server:

Summary metrics

sync

Does NTP have a sync peer? If not, return CRITICAL, otherwise return OK.

peers

Are there more than the minimum number of peers active? The NTP algorithms require a minimum of 3 peers for accurate clock management; to allow for failure or maintenance of one peer at all times, NTPmon returns OK for 4 or more configured peers, CRITICAL for 1 or 0, and WARNING for 2-3.

reach

Are the configured peers reliably reachable on the network? Return CRITICAL for less than 50% total reachability of all configured peers, OK for greater than 75%, and WARNING for anything in between.
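As a rough illustration of where a reachability percentage can come from: both ntpq and chronyc print each peer's reach as an octal 8-bit shift register, one bit per recent poll. The sketch below converts that register to a percentage and averages it across peers. The function names are hypothetical, and averaging per-peer percentages is an assumption about what "total reachability" means, not a description of NTPmon's internals.

```python
def reach_percent(reach_octal: str) -> float:
    """Percentage of the last 8 polls that succeeded, given an octal reach register."""
    register = int(reach_octal, 8) & 0xFF  # keep only the 8 most recent polls
    return bin(register).count("1") / 8 * 100


def total_reachability(reach_values) -> float:
    """Mean reachability across all peers (assumed aggregation); 0.0 with no peers."""
    values = [reach_percent(r) for r in reach_values]
    return sum(values) / len(values) if values else 0.0


# One fully reachable peer, one that answered 4 of 8 polls, one unreachable.
print(total_reachability(["377", "17", "0"]))
```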

offset

Is the clock offset from its sync peer (or other peers, if the sync peer is not available) acceptable? Return CRITICAL for 50 milliseconds or more average difference, WARNING for 10 ms or more average difference, and OK for anything less.
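The four summary checks can be sketched as simple threshold functions. This is a minimal illustration of the rules described above, not NTPmon's own code: the function names are hypothetical, the handling of the exact boundary values (50% and 75% reach) is an assumption, and so is treating the band between the stated reach thresholds as WARNING. A nan offset is mapped to CRITICAL here, which is also an assumption.

```python
import math

OK, WARNING, CRITICAL = "OK", "WARNING", "CRITICAL"


def check_sync(has_sync_peer: bool) -> str:
    return OK if has_sync_peer else CRITICAL


def check_peers(count: int) -> str:
    if count >= 4:
        return OK
    if count <= 1:
        return CRITICAL
    return WARNING  # 2-3 peers


def check_reach(percent: float) -> str:
    if percent < 50:
        return CRITICAL
    if percent > 75:
        return OK
    return WARNING  # assumed: everything from 50% to 75% inclusive


def check_offset(offset_seconds: float) -> str:
    magnitude = abs(offset_seconds)
    if math.isnan(magnitude) or magnitude >= 0.050:  # nan handling is an assumption
        return CRITICAL
    if magnitude >= 0.010:
        return WARNING
    return OK


print(check_sync(True), check_peers(3), check_reach(94.6), check_offset(-0.02))
```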

System metrics

In addition, NTPmon retrieves the following metrics directly from the local NTP server (using ntpq -nc readvar or chronyc -c tracking):

  • offset (as sysoffset, to distinguish it from offset)
  • sys_jitter (as sysjitter, for grouping with sysoffset)
  • frequency
  • stratum
  • rootdelay
  • rootdisp

See the NTP documentation for the meaning of these metrics.
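To show how these variables might be picked out of ntpq -nc readvar output, here is a small sketch that parses the comma-separated key=value text and applies the two renames listed above. The parse_readvar function is hypothetical, and the SAMPLE text is a hand-written illustration of the output format, not a capture from a real server. No unit conversion is attempted.

```python
import re

# Renames described in the list above; other names pass through unchanged.
RENAMES = {"offset": "sysoffset", "sys_jitter": "sysjitter"}
WANTED = ("offset", "sys_jitter", "frequency", "stratum", "rootdelay", "rootdisp")


def parse_readvar(text: str) -> dict:
    """Extract the system metrics from key=value text in the style of `ntpq -nc readvar`."""
    # Values are either double-quoted strings or runs without commas/whitespace.
    pairs = dict(re.findall(r'(\w+)=("[^"]*"|[^,\s]+)', text))
    return {
        RENAMES.get(name, name): float(pairs[name])
        for name in WANTED
        if name in pairs
    }


# Hand-written sample for illustration only.
SAMPLE = """associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.8p15", processor="x86_64", leap=00, stratum=2,
precision=-24, rootdelay=1.234, rootdisp=23.456, refid=192.0.2.1,
offset=0.123456, frequency=-14.939, sys_jitter=0.100000
"""

print(parse_readvar(SAMPLE))
```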

Peer metrics

Counts of each peer type are emitted under the ntpmon_peers metric. The recognised peer types are pps, sync, invalid, false, excess, backup, outlier, survivor, and unknown. (Under normal circumstances, unknown will never appear - its presence indicates a bug in NTPmon.) Note that sync also includes the pps peer (if any), and survivor also includes the sync peer (if any), so they are not strictly mutually exclusive. There should be no overlap in any of the other types.
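The overlap between types can be illustrated with a short counting sketch. It assumes peer types are derived from the tally character ntpq prints in front of each peer, and it reads the overlap rule transitively (a pps peer counts as sync, and therefore also as survivor). Both are assumptions made for illustration; the names below are hypothetical.

```python
from collections import Counter

# Assumed mapping from ntpq tally characters to the peer types named above.
TALLY_TYPES = {
    "o": "pps", "*": "sync", "+": "survivor", "-": "outlier",
    "#": "backup", "x": "false", ".": "excess", " ": "invalid",
}
# A peer of the key type is additionally counted under each listed type.
ALSO_COUNTED_AS = {"pps": ("sync", "survivor"), "sync": ("survivor",)}


def count_peer_types(tally_codes) -> Counter:
    counts = Counter()
    for code in tally_codes:
        peer_type = TALLY_TYPES.get(code, "unknown")
        counts[peer_type] += 1
        for extra in ALSO_COUNTED_AS.get(peer_type, ()):
            counts[extra] += 1
    return counts


# One sync peer, two other survivors, one outlier, one falseticker.
print(dict(count_peer_types(["*", "+", "+", "-", "x"])))
```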

If your chronyd or ntpd is configured to store peer (source) statistics, these will be collected as they appear in the relevant log files (/var/log/chrony/measurements.log and /var/log/ntpstats/peerstats, respectively, by default) and emitted under the ntpmon_peer (singular) metric, in addition to all the above-mentioned metrics. Use the --logfile command line option to monitor a different file if your distribution uses different locations. NTPmon will silently ignore any issues relating to these files in order to continue running, so if you don't notice metrics coming out when you expect them, check permissions on the files and compare their contents to the documented formats. Please submit a bug report if you encounter persistent issues with this.

Collectd doesn't have a really great way to support these individual peer metrics, so each peer is considered to be a collectd "host". This feature should be considered experimental for collectd, and subject to change or deprecation (input on this is welcome).

Prometheus exporter

When run in prometheus mode, NTPmon uses the prometheus python client to expose metrics via the HTTP server built into that library. No security testing or validation has been performed on this library by the NTPmon author; users are suggested not to expose it on untrusted networks, and are reminded that - as stated in the license terms - this software comes with no warranty.

Telegraf integration

When run in telegraf mode, NTPmon requires the telegraf socket listener input plugin to be enabled. Use the --connect command-line option if you configure this to listen on a host and/or port other than the default (127.0.0.1:8094).
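For testing that a socket listener is reachable, a metric can be sent by hand. The sketch below assumes the listener accepts TCP connections and parses InfluxDB line protocol (telegraf's default data format). The measurement name, tag, and helper functions are hypothetical and are not taken from NTPmon itself.

```python
import socket
import time


def format_line(measurement, tags, fields, timestamp_ns):
    """Build one line of InfluxDB line protocol (no escaping of special characters)."""
    tag_text = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_text = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement}{tag_text} {field_text} {timestamp_ns}\n"


def send_line(line, host="127.0.0.1", port=8094):
    """Send one line to the listener over TCP (assumed transport)."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("utf-8"))


# Only formats and prints; call send_line(line) once a listener is running.
line = format_line("ntpmon", {"host": "example"}, {"stratum": 3.0}, time.time_ns())
print(line, end="")
```

The timestamp is taken on the sending host in nanoseconds.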

Startup delay

By default, until the NTP server has been running for 512 seconds (the minimum time for 8 polls at 64-second intervals), check_ntpmon will return OK (zero return code). This is to prevent false positives on startup or for short-lived VMs. To ignore this safety precaution, use --run-time with a low number (e.g. 1 sec).
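The arithmetic and the effect of the grace window can be sketched as follows. This is a minimal illustration with hypothetical names, assuming the check simply substitutes OK while the server's run time is below the threshold.

```python
POLL_INTERVAL_SECONDS = 64
POLLS_REQUIRED = 8
DEFAULT_RUN_TIME = POLL_INTERVAL_SECONDS * POLLS_REQUIRED  # 8 * 64 = 512 seconds


def effective_status(measured_status, runtime_seconds, run_time=DEFAULT_RUN_TIME):
    """Report OK until the NTP server has been running for at least run_time seconds."""
    if runtime_seconds < run_time:
        return "OK"
    return measured_status


print(effective_status("CRITICAL", runtime_seconds=100))              # within the window
print(effective_status("CRITICAL", runtime_seconds=100, run_time=1))  # window shortened
```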

Roadmap

Python version

The current minimum python version targeted is 3.8. This version reaches end of life in October 2024 and will be deprecated in NTPmon sometime between the release of Ubuntu 24.04 ("Noble Numbat") in April 2024 and python 3.8's EOL date.

Output integrations

Telegraf is the preferred output integration for NTPmon (over collectd and prometheus), because it offers higher-resolution timestamps and records each timestamp at the source which generated the measurement rather than at the scraping host. The other integrations (first collectd, then Nagios, then prometheus) may eventually go away if they are not widely used. Please let me know (via an issue) if you have strong feelings about this.

ntpmon's Issues

classifier.worst_metric() has a rudimentary view of "worst"

At the moment it picks the first metric with the highest criticality. So if you have 2 or more metrics in critical, it will only ever pick the first one. It may be that we need to allow the caller to specify which metrics are more important as a tie-breaker.
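One possible shape for such a tie-breaker is sketched below. It is not the existing classifier.worst_metric() implementation; the signature, the priority parameter, and the status names are assumptions. Metrics are ranked first by severity, then by the caller's priority list, then by their original order.

```python
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def worst_metric(results, priority=()):
    """Return (name, status) of the worst metric.

    results: non-empty dict of metric name -> status, in evaluation order.
    priority: metric names, most important first, used to break severity ties.
    """
    rank = {name: index for index, name in enumerate(priority)}

    def sort_key(item):
        position, (name, status) = item
        # Highest severity first, then caller priority, then original order.
        return (-SEVERITY[status], rank.get(name, len(rank)), position)

    _, (name, status) = min(enumerate(results.items()), key=sort_key)
    return name, status


results = {"reach": "CRITICAL", "sync": "CRITICAL", "offset": "WARNING"}
print(worst_metric(results))
print(worst_metric(results, priority=("sync", "reach")))
```

With no priority list this behaves like the first-match approach described above; passing a list lets the caller decide which critical metric is reported.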

Sync peer check should have a configurable grace period

We have an environment running chrony via the ntp charm on bionic that regularly drops and re-selects its sync peer on random hosts. Each episode lasts from 1 to 30 minutes. The hosts use the system default chrony profiles with the default Ubuntu pools and an additional local source IP.

We get alerts for ntpmon showing "no sync peer" but it typically clears within 30 minutes or an hour, if not shorter.

To reduce alerting noise, it would be very helpful to have a configurable length of time that a system may take to find a sync peer among the available sources.
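A grace period of this kind could be sketched as a small state tracker. Everything here is hypothetical: the class name, the choice of WARNING during the grace window, and the assumption that state survives between checks (which would need a long-running process or a state file, since a Nagios check is normally a one-shot command).

```python
class SyncGrace:
    """Downgrade "no sync peer" to WARNING until it has lasted grace_seconds."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self.lost_at = None  # time at which the sync peer was first seen missing

    def status(self, has_sync_peer, now):
        if has_sync_peer:
            self.lost_at = None
            return "OK"
        if self.lost_at is None:
            self.lost_at = now
        if now - self.lost_at < self.grace_seconds:
            return "WARNING"
        return "CRITICAL"


grace = SyncGrace(grace_seconds=1800)
print(grace.status(False, now=0))     # sync peer just lost
print(grace.status(False, now=1900))  # still missing after the grace period
```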

Nagios is triggered with: CRITICAL: offset is out of range (nan) - must be between -0.050000 and 0.050000

On a system running chronyd as a service, we regularly get the error message quoted above.
The errors occur intermittently and all resolve by themselves, usually within 30 minutes. While such a situation is active, it is still possible to log in to the system, and there is no apparent network issue.

Uptime (according to systemctl) is 4 months 17 days

● chrony.service - chrony, an NTP client/server
   Loaded: loaded (/lib/systemd/system/chrony.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2019-04-18 20:10:42 UTC; 4 months 17 days ago
     Docs: man:chronyd(8)
           man:chronyc(1)
           man:chrony.conf(5)
 Main PID: 72447 (chronyd)
    Tasks: 1 (limit: 314572)
   CGroup: /system.slice/chrony.service
           └─72447 /usr/sbin/chronyd

Sep 04 17:34:43 wa1okosl012 chronyd[72447]: Received KoD RATE from 91.189.91.157
Sep 04 18:23:15 wa1okosl012 chronyd[72447]: Received KoD RATE from 91.189.89.199
Sep 04 20:29:00 wa1okosl012 chronyd[72447]: Received KoD RATE from 69.89.207.199
Sep 04 21:32:07 wa1okosl012 chronyd[72447]: Received KoD RATE from 91.189.89.198
Sep 04 23:16:45 wa1okosl012 chronyd[72447]: Selected source 69.41.163.31

The last time this error happened was 01:20 UTC on Sep 05.

Below I attached the log files covering the period.

measurements.log
statistics.log
tracking.log

SERVICEPERFDATA frequency=-14.939000 offset=nan peers=21 reach=94.642857 result=2 rootdelay=0.022047 rootdisp=0.021695 runtime=12035387 stratum=3 sync=0.000000 sysjitter= sysoffset=-0.000142107 tracehosts= traceloops= tracetime=

[Wishlist] Configurable metric thresholds for nagios check_ntpmon.py

In some use cases, the hard-coded thresholds in alert.py _metricdefs lead to unactionable critical alerts when using check_ntpmon.py.

As an example, an all-reach-mean of 20% during an edge cloud upstream BGP storm causing intermittent NTP server access may be considered by that cloud's operators as a warning indicator of upstream connectivity, but is not actionable as a critical failure of the NTP component itself.

It would be useful to have a CLI argument to provide overrides to the _metricdefs in the NTPAlerter for the nagios check.

An example simple cli implementation might look like this if one wanted to override the reach and offset thresholds:

check_ntpmon.py --check peers sync reach offset --metric-overrides 'reach:high:50:10;offset:mid:-0.1:-0.05:0.05:0.1'
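Parsing such an argument could look like the sketch below. The --metric-overrides option does not exist yet, so the format is taken from the example above and the parse_overrides function is hypothetical. It only splits and converts the values; it does not validate the kind or the number of thresholds against _metricdefs.

```python
def parse_overrides(spec):
    """Parse 'name:kind:t1:t2[:...];name:kind:...' into {name: (kind, t1, t2, ...)}."""
    overrides = {}
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing or doubled semicolon
        parts = entry.split(":")
        if len(parts) < 3:
            raise ValueError(
                f"expected name:kind:threshold[:threshold...], got {entry!r}"
            )
        name, kind, *thresholds = parts
        overrides[name] = (kind, *(float(t) for t in thresholds))
    return overrides


print(parse_overrides("reach:high:50:10;offset:mid:-0.1:-0.05:0.05:0.1"))
```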
