
LBNL Node Health Check (NHC)

TORQUE, Slurm, and other schedulers/resource managers provide for a periodic "node health check" to be performed on each compute node to verify that the node is working properly. Nodes which are determined to be "unhealthy" can be marked as down or offline so as to prevent jobs from being scheduled or run on them. This helps increase the reliability and throughput of a cluster by reducing preventable job failures due to misconfiguration, hardware failure, etc.

Though many sites have created their own scripts to serve this function, the vast majority are one-off efforts with little attention paid to extensibility, flexibility, reliability, speed, or reuse. Developers at Lawrence Berkeley National Laboratory created this project in an effort to change that. LBNL Node Health Check (NHC) has several design features that set it apart from most home-grown solutions:

  • Reliable - To prevent single-threaded script execution from causing hangs, execution of subcommands is kept to an absolute minimum, and a watchdog timer is used to terminate the check if it runs for too long.
  • Fast - Implemented almost entirely in native bash (2.x or greater). Reducing pipes and subcommands also cuts down on execution delays and related overhead.
  • Flexible - Anything which can be described in a shell function can be a check. Modules can also populate cache data and reuse it for multiple checks.
  • Extensible - Its modular functional interface makes writing new checks easy. Just drop modules into the scripts directory, then add your checks to the config file!
  • Reusable - Written to be ultra-portable and can be used directly from a resource manager or scheduler, run via cron, or even spawned centrally (e.g., via pdsh). The configuration file syntax allows for all compute nodes to share a single configuration.

In a typical scenario, the NHC driver script is run periodically on each compute node by the resource manager client daemon (e.g., pbs_mom). It loads its configuration file to determine which checks are to be run on the current node (based on its hostname). Each matching check is run, and if a failure is encountered, NHC will exit with an error message describing the problem. It can also be configured to mark nodes offline so that the scheduler will not assign jobs to bad nodes, reducing the risk of system-induced job failures. NHC can also log errors to the syslog (which is often forwarded to the master node). Some resource managers are even able to use NHC as a pre-job validation tool, keeping scheduled jobs from running on a newly-failed node, and/or a post-job cleanup/checkup utility to remove nodes from the scheduler which may have been adversely affected by the just-completed job.

Getting Started

The following instructions will walk you through downloading and installing LBNL NHC, configuring it for your system, testing the configuration, and implementing it for use with the TORQUE resource manager.

Installation

Pre-built RPM packages for Red Hat Enterprise Linux versions 6, 7, and 8 are made available with each release along with the source tarballs. The latest release, as well as prior releases, can be found on GitHub. Simply download the appropriate RPM for your compute nodes' RHEL/OEL/AlmaLinux/Rocky version.

The previous NHC Yum repository was supplied by LBNL, mostly to make the task of reporting download counts up to DOE easier, and is no longer available. If you have a suggestion for an alternative, or could host one yourself, please let the team know!

OpenSUSE/SLES packages are available through the Open Build Service.

If you prefer to install from source, or if you aren't using one of the above platforms, the tarball for the latest release is also available via the NHC Project on GitHub. WARNING: DO NOT use the "Source code (zip)" or "Source code (tar.gz)" links at the bottom of the release's Assets list! Those are archives of the Git repository generated by GitHub, not by the NHC team, and they lack files that are required to build NHC without additional developer tools!

To install NHC from the source tarball linked above, untar it, change into the directory it created, and run:

# ./configure --prefix=/usr --sysconfdir=/etc --libexecdir=/usr/libexec
# make test
# make install

NOTE: The make test step is optional but recommended. This will run NHC's built-in unit test suite to make sure everything is functioning properly!

NOTE: You can also fork and/or clone the whole NHC project on GitHub; this is recommended if you plan to contribute to NHC development as this makes it very easy to submit your changes upstream using GitHub Pull Requests! Visit the NHC Project Page to Watch, Star, or Fork the project!

Whether you use RPMs or install from source, the script will be installed as /usr/sbin/nhc, the configuration file and check scripts in /etc/nhc, and the helper scripts in /usr/libexec/nhc. Once you've completed one of the 3 installation methods above on your compute nodes' root filesystem image, you can proceed with the configuration.

Sample Configuration

The default configuration supplied with LBNL NHC is intended to be more of an overview of available checks than a working configuration. It's essentially impossible to create a default configuration that will work out-of-the-box for any host and still do something useful. But there are some basic checks which are likely to apply, with some modifications of boundary values, to most systems. Here's an example nhc.conf which shouldn't require too many tweaks to be a solid starting point:

# Check that / is mounted read-write.
* || check_fs_mount_rw /


# Check that sshd is running and is owned by root.
* || check_ps_service -u root -S sshd


# Check that there are 2 physical CPUs, 8 actual cores, and 8 virtual cores (i.e., threads)
* || check_hw_cpuinfo 2 8 8


# Check that we have between 1kB and 1TB of physical RAM
* || check_hw_physmem 1k 1TB


# Check that we have between 1B and 1TB of swap
* || check_hw_swap 1b 1TB


# Check that we have at least some swap free
* || check_hw_swap_free 1


# Check that eth0 is available
* || check_hw_eth eth0

Obviously you'll need to adjust the CPU and memory numbers, but this should get you started.

Config File Auto-Generation

Instead of starting with a basic sample configuration and building on it, as of version 1.4.1 you can use the nhc-genconf utility supplied with NHC. It uses the same shell code as NHC itself to query various attributes of your system (CPU socket/core/thread counts, RAM size, swap size, etc.) and automatically generates an initial configuration file based on its scan. Simply invoke nhc-genconf on each system where NHC will be running. By default, this creates the file /etc/nhc/nhc.conf.auto, which can then be renamed (or used directly via NHC's -c option), tweaked, and deployed on your system!

The config file which nhc-genconf creates will normally begin each line with the hostname of the node on which it was run. This allows multiple files to be merged and sorted into a single config that will work across your system. If you'd rather prefix each line with a custom match expression instead, use the -H option (e.g., -H host1 or -H '*').

The scan also includes BIOS information obtained via the dmidecode command. The default behavior only includes lines from the output which match the regular expression /([Ss]peed|[Vv]ersion)/, but this behavior may be altered by supplying an alternative match string via the -b option (e.g., -b '*release*').

Gathering up all the different types of hardware that exist in your system and writing the appropriate NHC config file rules, match expressions, etc. can be incredibly tedious, especially for large, well-established, heterogeneous, or multi-generational clusters. The following commands might come in handy for aggregating the results of nhc-genconf across a large group of nodes:

# wwsh ssh 'n*' "/usr/sbin/nhc-genconf -H '*' -c -" | dshbak -c
 OR
# pdsh -a "/usr/sbin/nhc-genconf -H '*' -c -" | dshbak -c

Testing

As of version 1.2, NHC comes with a fairly extensive set of built-in unit tests. Each of the check functions is tested for proper functionality; even the driver script (/usr/sbin/nhc itself) is tested! To run the unit tests, use the make test command at the top of the source tree. You should see something like this:

# make test
make -C test test
make[1]: Entering directory `/home/mej/svn/lbnl/nhc/test'
Running unit tests for NHC:
nhcmain_init_env...ok 6/6
nhcmain_finalize_env...ok 14/14
nhcmain_check_conffile...ok 1/1
nhcmain_load_scripts...ok 6/6
nhcmain_set_watchdog...ok 1/1
nhcmain_run_checks...ok 2/2
common.nhc...ok 18/18
ww_fs.nhc...ok 61/61
ww_hw.nhc...ok 65/65
ww_job.nhc...ok 2/2
ww_nv.nhc...ok 4/4
ww_ps.nhc...ok 32/32
All 212 tests passed.
make[1]: Leaving directory `/home/mej/svn/lbnl/nhc/test'
#

If everything works properly, all the unit tests should pass. Any failures represent a problem that should be reported to the NHC Users' Mailing List!

Before adding the node health check to your resource manager (RM) configuration, it's usually prudent to do a test run to make sure it's installed/configured/running properly first. To do this, simply run /usr/sbin/nhc with no parameters. Successful execution will result in no output and an exit code of 0. If this is what you get, you're done testing! Skip to the next section.

If you receive an error, it will look similar to the following:

ERROR Health check failed:  Actual CPU core count (2) does not match expected (8).

Depending on which check failed, the message will vary. Hopefully it will be clear what the discrepancy is based on the content of the message. Adjust your configuration file to match your system and try again. If you need help, feel free to post to the NHC Users' Mailing List.

Additional information may be found in /var/log/nhc.log, the runtime logfile for NHC. A successful run based on the configuration above will look something like this:

Node Health Check starting.
Running check:  "check_fs_mount_rw /"
Running check:  "check_ps_service -u root -S sshd"
Running check:  "check_hw_cpuinfo 2 8 8"
Running check:  "check_hw_physmem 1024 1073741824"
Running check:  "check_hw_swap 1 1073741824"
Running check:  "check_hw_swap_free 1"
Running check:  "check_hw_eth eth0"
Node Health Check completed successfully (1s).

A failure will look like this:

Node Health Check starting.
Running check:  "check_fs_mount_rw /"
Running check:  "check_ps_service -u root -S sshd"
Running check:  "check_hw_cpuinfo 2 8 8"
Health check failed:  Actual CPU core count (2) does not match expected (8).

We can see from the excerpt here that the check_hw_cpuinfo check failed and that the machine we ran on appears to be a dual-socket single-core system (2 cores total). Since our configuration expected a dual-socket quad-core system (8 cores total), this was flagged as a failure. Since we're testing our configuration, this is most likely a mismatch between what we told NHC to expect and what the system actually has, so we need to fix the configuration file. Once we have a working configuration and have gone into production, a failure like this would likely represent a hardware issue.

Once the configuration has been modified, try running /usr/sbin/nhc again. Continue fixing the discrepancies and re-running the script until it succeeds; then, proceed with the next section.

Implementation

Instructions for putting NHC into production depend entirely on your use case. We can't possibly hope to delineate them all, but we'll cover some of the most common.

Slurm Integration

Add the following to /etc/slurm.conf (or /etc/slurm/slurm.conf, depending on version) on your master node AND your compute nodes (because, even though the HealthCheckProgram only runs on the nodes, your slurm.conf file must be the same across your entire system):

HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300

This will execute NHC every 5 minutes.

For optimal support of Slurm, NHC version 1.3 or higher is recommended. Prior versions will require manual intervention.

TORQUE Integration

NHC can be executed by the pbs_mom process at job start, job end, and/or regular intervals (irrespective of whether or not the node is running job(s)). More detailed information on how to configure the pbs_mom health check can be found in the TORQUE Documentation. The configuration used here at LBNL is as follows:

$node_check_script /usr/sbin/nhc
$node_check_interval 5,jobstart,jobend
$down_on_error 1

This causes pbs_mom to launch /usr/sbin/nhc every 5 "MOM intervals" (45 seconds by default), when starting a job, and when a job completes (or is terminated). Failures will cause the node to be marked as "down."

NOTE: Some concern has been expressed over the possibility for "OS jitter" caused by NHC. NHC was designed to keep jitter to an absolute minimum, and the implementation goes to extreme lengths to reduce and eliminate as many potential causes of jitter as possible. No significant jitter has been experienced so far (and similar checks at similar intervals are used on extremely jitter-sensitive systems); however, increase the interval to 80 instead of 5 for once-hourly checks if you suspect NHC-generated jitter to be an issue for your system. Alternatively, some sites have configured NHC to detect running jobs and simply exit (or run fewer checks); that works too!

In addition, NHC will by default mark the node "offline" (i.e., pbsnodes -o) and add a note (viewable with pbsnodes -ln) specifying the failure. Once the failure has been corrected and NHC completes successfully, it will remove the note it set and clear the "offline" status from the node. In order for this to work, however, each node must have "operator" access to the TORQUE daemon. Unfortunately, the support for wildcards in pbs_server attributes is limited to replacing the host, subdomain, and/or domain portions with asterisks, so for most setups this will likely require omitting the entire hostname section. The following has been tested and is known to work:

qmgr -c "set server operators += root@*"

This functionality is not strictly required, but it makes determining the reason nodes are marked down significantly easier!

Another possible caveat to this functionality is that it only works if the canonical hostname (as returned by the hostname command or the file /proc/sys/kernel/hostname) of each node matches its identity within TORQUE. If your site uses FQDNs on compute nodes but has them listed in TORQUE using the short versions, you will need to add something like this to the top of your NHC configuration file:

* || HOSTNAME="$HOSTNAME_S"

This will cause the offline/online helpers to use the shorter hostname when invoking pbsnodes. This will NOT, however, change how the hostnames are matched in the NHC configuration, so you'll still need to use FQDN matching there.

It's also important to note here that NHC will only set a note on nodes that don't already have one (and aren't yet offline) or have one set by NHC itself; also, it will only online nodes and clear notes if it sees a note that was set by NHC. It looks for the string "NHC:" in the note to distinguish between notes set by NHC and notes set by operators. If you use this feature, and you need to mark nodes offline manually (e.g., for testing), setting a note when doing so is strongly encouraged. (You can do this via the -N option, like this: pbsnodes -o -N 'Testing stuff' n0000 n0001 n0002) There was a bug in versions prior to 1.2.1 which would cause it to treat nodes with no notes the same way it treats nodes with NHC-assigned notes. This should be fixed in 1.2.1 and higher, but you never know....

Grid Engine Integration

Sun Grid Engine (SGE) has had a somewhat "colorful" history over the years. It has evolved and changed hands numerous times, and there are currently multiple incarnations of it which are developed under both commercial and open source models. Unfortunately, I don't have a whole lot of experience with any of them -- it was on its way out when I first joined the team at LBNL and was eliminated completely shortly thereafter. So I'm afraid I don't have the expertise to get NHC working with any of the Grid Engine variants.

The good news, though, is that Dave Love -- developer of the Son of Grid Engine open source project -- does! He has made multiple contributions over the years to help NHC integrate effectively with SGE and the assorted Grid Engine variants. Additionally, he put together a great recipe to help SGE users (and users of other Grid Engine incarnations), so rather than try to reproduce it here and keep it updated, I recommend you peruse his work in its entirety if you're a user of one of those products!

Periodic Execution

The original method for doing this was to employ a simple crontab entry, like this one:

MAILTO=<your e-mail address>
*/5 * * * * /usr/sbin/nhc

Annoyingly, this would result in an e-mail being sent every 5 minutes for as long as a health check kept failing. It was for this very reason that the contributed nhc.cron script was originally written. However, even though it avoided the former technique's flood of e-mail when a problem arose, it still had no clean way of dealing with multiple contexts and could not be set up to send periodic reminders of issues. Additionally, it would fail to notify if a new problem was detected before, or at the same time as, the old problem was resolved.

Version 1.4.1 introduces a vastly superior option: nhc-wrapper. This tool executes nhc and records the results. It then compares the results to the output of the previous run, if present, and ignores results that are identical to those previously obtained. Old results can be set to expire after a given length of time (and thus be re-reported). Results may be echoed to stdout or sent via e-mail. Once an unrecognized command-line option or non-option argument is encountered, it and the rest of the command-line arguments are passed to the wrapped program intact.

This tool will typically be run via cron(8). It can be used to wrap distinct contexts of NHC in a manner identical to NHC itself (i.e., specified via executable name or command line arg); also, unlike the old nhc.cron script, this one does a comparison of the results rather than only distinguishing between the presence/absence of output, and those results can have a finite lifespan.

nhc-wrapper also offers another option for periodic execution: looping (-L). When launched from a terminal or inittab/init.d entry in looping mode, nhc-wrapper will execute a loop which runs the wrapped program (e.g., nhc) at a time interval you supply. It attempts to be smart about interpreting your intent as well, calculating sleep times after subprogram execution (i.e., the interval is from start time to start time, not end time to start time) and using nice, round execution times when applicable (i.e., based on 00:00 local time instead of whatever random time the wrapper loop happened to be entered). For example, if you ask it to run every 5 minutes, it'll run at :00, :05, :10, :15, etc. If you ask for every 4 hours, it'll run at 00:00, 04:00, 08:00, 12:00, 16:00, and 20:00 exactly--regardless of what time it was when you originally launched nhc-wrapper!
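The "round execution time" calculation described above can be sketched in a few lines of bash. The next_aligned_sleep helper below is hypothetical (not part of nhc-wrapper); it computes how long to sleep so the next run lands on an exact multiple of the interval, measured from the epoch rather than from launch time:

```shell
#!/bin/bash
# Hypothetical sketch of interval alignment: given an interval in seconds,
# return the number of seconds until the next interval boundary.
next_aligned_sleep() {
    local interval="$1" now
    now=$(date +%s)
    # Seconds remaining until the next multiple of $interval.
    echo $(( interval - now % interval ))
}

# A looping driver could then do something like:
#   while :; do sleep "$(next_aligned_sleep 300)"; /usr/sbin/nhc; done
```

With an interval of 300, runs land on :00, :05, :10, etc., regardless of when the loop was started.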

This allows the user to run nhc-wrapper in a terminal to keep tabs on it while still running checks at predictable times (just like crond would). It also has some flags to provide timestamps (-L t) and/or ASCII horizontal rulers (-L r) between executions; clearing the screen (-L c) before each execution (watch-style) is also available.

Examples:

To run nhc and notify root when errors appear, are cleared, or every 12 hours while they persist:

# /usr/sbin/nhc-wrapper -M root -X 12h

Same as above, but run the "nhc-cron" context instead (nhc -n nhc-cron):

# /usr/sbin/nhc-wrapper -M root -X 12h -n nhc-cron
  OR
# /usr/sbin/nhc-wrapper -M root -X 12h -A '-n nhc-cron'

Same as above, but run nhc-cron (symlink to nhc) instead:

# /usr/sbin/nhc-wrapper -M root -X 12h -P nhc-cron
  OR
# ln -s nhc-wrapper /usr/sbin/nhc-cron-wrapper
# /usr/sbin/nhc-cron-wrapper -M root -X 12h

Expire results after 1 week, 1 day, 1 hour, 1 minute, and 1 second:

# /usr/sbin/nhc-wrapper -M root -X 1w1d1h1m1s

Run verbosely, looping every minute with ruler and timestamp:

# /usr/sbin/nhc-wrapper -L tr1m -V

Or for something quieter and more cron-like:

# /usr/sbin/nhc-wrapper -L 1h -M root -X 12h

Configuration

Now that you have a basic working configuration, we'll go more in-depth into how NHC is configured, including command-line invocation, configuration file syntax, modes of operation, how individual checks are matched against a node's hostname, and what checks are already available in the NHC distribution for your immediate use.

Configuration of NHC is generally done in one of 3 ways: passing option flags and/or configuration (i.e., environment) variables on the command line, setting variables and specifying checks in the configuration file (/etc/nhc/nhc.conf by default), and/or setting variables in the sysconfig initialization file (/etc/sysconfig/nhc by default). The latter works essentially the same as any other sysconfig file (it is directly sourced into NHC's bash session using the . operator), so this document does not go into great detail about using it. The following sections discuss the other two mechanisms.
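For illustration, a hypothetical sysconfig file might look like the fragment below. Since the file is sourced directly into NHC's bash session, plain VAR=value assignments are all that's needed; the variables shown are described in the Supported Variables table later in this document, but the specific values are only examples:

```shell
# Hypothetical /etc/sysconfig/nhc -- sourced directly into NHC's bash session.
CONFDIR=/etc/nhc     # where nhc.conf and the scripts/ directory live
TIMEOUT=60           # allow slow nodes more than the 30-second default
MARK_OFFLINE=1       # mark nodes offline in the RM when a check fails
```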

Command-Line Invocation

From version 1.3 onward, NHC supports a subset of command-line options and arguments in addition to the configuration and sysconfig files. A few specific settings have CLI options associated with them as shown in the table below; additionally, any configuration variable which is valid in the configuration or sysconfig file may also be passed on the command line instead.

Options

| Command-Line Option | Equivalent Configuration Variable | Purpose |
|---------------------|-----------------------------------|---------|
| -D confdir | CONFDIR=confdir | Use config directory confdir (default: /etc/name) |
| -a | NHC_CHECK_ALL=1 | Run ALL checks; don't exit on first failure (useful for cron-based monitoring) |
| -c conffile | CONFFILE=conffile | Load config from conffile (default: confdir/name.conf) |
| -d | DEBUG=1 | Activate debugging output |
| -e check | EVAL_LINE=check | Evaluate check and exit immediately based on its result |
| -f | NHC_CHECK_FORKED=1 | Run each check in a separate background process (EXPERIMENTAL) |
| -h | N/A | Show command line help |
| -l logspec | LOGFILE=logspec | File name/path or BASH-syntax directive for logging output (- for STDOUT) |
| -n name | NAME=name | Set program name to name (default: nhc); see -D & -c |
| -q | SILENT=1 | Run quietly |
| -t timeout | TIMEOUT=timeout | Use timeout of timeout seconds (default: 30) |
| -v | VERBOSE=1 | Run verbosely (i.e., show check progress) |

NOTE: Due to the use of the getopts bash built-in, and the limitations thereof, POSIX-style bundling of options (e.g., -da) is NOT supported, and all command-line options MUST PRECEDE any additional variable/value-type arguments!

Variable/Value Arguments

Instead of, or possibly in addition to, the use of command-line options, NHC accepts configuration via variables specified on the command line. Simply pass any number of VARIABLE=value arguments on the command line, and each variable will be set to its respective value immediately upon NHC startup. This happens before the sysconfig file is loaded, so it can be used to alter such values as $SYSCONFIGDIR (/etc/sysconfig by default) which would normally be unmodifiable.

It's important to note that while command-line configuration directives will override NHC's built-in defaults for various variables, variables set in the configuration file (see below) will NOT be overridden. The config file takes precedence over the command line, in contrast to most other CLI tools out there (and possibly contrary to user expectation) due to the way bash deals with variables and initialization. If you want the command line to take precedence, you'll need to test the value of the variable in the config file and only alter it if the current value matches NHC's built-in default.
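That suggestion can be expressed as a config-file line like the one below. This is a sketch of the pattern, not a stock NHC config line; 30 is NHC's documented built-in default for TIMEOUT, and the `||`-chaining keeps the line's exit status successful either way so it isn't treated as a failed check:

```shell
# Raise the timeout to 120s, but only if TIMEOUT still holds NHC's
# built-in default (30) -- i.e., it wasn't set on the command line.
* || [[ "$TIMEOUT" != "30" ]] || TIMEOUT=120
```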

Example Invocations

Most sites just run nhc by itself with no options when launching from a resource manager daemon. However, when running from cron or manually at the command line, numerous other possible scenarios exist for invoking NHC in various ways. Here are some real-world examples.

To run in debug mode, either of the following two command lines may be used:

# nhc -d
# nhc DEBUG=1

To run for testing purposes in debug mode with no timeout and with node online/offline disabled:

# nhc -d -t 0 MARK_OFFLINE=0

To force use of Slurm as the resource manager and use a sysconfig path in /opt:

# nhc NHC_RM=slurm SYSCONFIGDIR=/opt/etc/sysconfig

NHC can also be invoked with the -e option to run a specific single check rather than reading checks from a config file. This technique is great for debugging a new check you're writing, testing your syntax for a particular check line before adding it to a config file, or running a single system check (possibly with an auto-fix option activated) across a cluster or nodeset during routine maintenance or cluster bring-up. For example, to check that /net/scratch1 and /net/scratch2 are mounted (type lustre) from the correct location, or if they are not, to restart the rlustre service to (hopefully) correct the problem:

# nhc -e 'check_fs_mount_rw -t lustre -s "fs[0-9]:/export/fs/scratch[12]" -e "/sbin/service rlustre restart" -f /net/scratch1 /net/scratch2'

To run NHC out-of-band (e.g., from cron) with the name nhc-oob (which will load its config from /etc/sysconfig/nhc-oob and /etc/nhc/nhc-oob.conf):

# nhc -n nhc-oob

NOTE: As an alternative, you may symlink /usr/sbin/nhc-oob to nhc and run nhc-oob instead. This will accomplish the same thing.

Configuration File Syntax

The configuration file is fairly straightforward. Stored by default in /etc/nhc/nhc.conf, the file is plain text and recognizes the traditional # introducer for comments. Any line that starts with a # (with or without leading whitespace) is ignored, as are blank lines.

Examples:

# This is a comment.
       # This is also a comment.

# This line and the previous one will both be ignored.

Configuration lines contain a target specifier, the separator string ||, and the check command. The target specifies which hosts should execute the check; only nodes whose hostname matches the given target will execute the check on that line. All other nodes will ignore it and proceed to the next check.

A check is simply a shell command. All NHC checks are bash functions defined in the various included files in /etc/nhc/scripts/*.nhc, but in actuality any valid shell command that properly returns success or failure will work. This documentation and all examples will only reference bash function checks. Each check can take zero or more arguments and is executed exactly as seen in the configuration.
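As a sketch of what that means in practice, here is a hypothetical check function (not part of the NHC distribution) following the convention that a check returns 0 on success and nonzero, with a message, on failure. Real NHC checks report failures via NHC's helper functions; plain echo/return is used here to keep the sketch self-contained:

```shell
#!/bin/bash
# Hypothetical check: fail if the 1-minute load average is at or above MAX.
check_loadavg_max() {
    local MAX="$1" LOAD
    # First field of /proc/loadavg is the 1-minute load average.
    read -r LOAD _ < /proc/loadavg
    # Compare integer parts only (bash has no floating-point arithmetic).
    if [ "${LOAD%%.*}" -ge "$MAX" ]; then
        echo "check_loadavg_max:  1-minute load average $LOAD >= $MAX"
        return 1
    fi
    return 0
}
```

A config line like `* || check_loadavg_max 64` would then run it on every node.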

As of version 1.2, configuration variables may also be set in the config file with the same syntax. This makes it easy to alter specific settings, commands, etc. globally or for individual hosts/hostgroups!

Example:

    * || SOMEVAR="value"
    * || check_something
*.foo || another_check 1 2 3

Match Strings

As noted in the last section, the first item on each line of the NHC configuration file specifies the target for the check which will follow. When NHC runs on a particular host, it reads and parses each line of the configuration file, comparing the hostname of the host (taken from the $HOSTNAME variable) with the specified target expression; if the target matches, the check will be saved for later execution. Lines whose targets don't match the current host are ignored completely. The target is expressed in the form of a match string -- an NHC expression that allows for exact string matches or a variety of dynamic comparison methods. Match strings are a very important concept and are used throughout NHC, not just for check targets, but as parameters to individual checks as well, so it's important that users fully understand how they work.

There are multiple forms of match string supported by NHC. The default style is a glob, also known as a wildcard. bash will determine if the hostname of the node (specifically, the contents of /proc/sys/kernel/hostname) matches the supplied glob expression (e.g., n*.viz) and execute only those checks which have matching target expressions. If the hostname does not match the glob, the corresponding check is ignored.

The second method for specifying host matches is via regular expression. Regex targets must be surrounded by slashes to identify them as regular expressions. The internal regex matching engine of bash is used to compare the hostname to the given regular expression. For example, given a target of /^n00[0-5][0-9]\.cc2$/, the corresponding check would execute on n0017.cc2 but not on n0017.cc1 or n0083.cc2.

The third form of match string (supported in NHC versions 1.2.2 and later) is node range expressions similar to those used by pdsh, Warewulf, and other open source HPC tools. (Please note that not all expressions supported by other tools will work in NHC due to limitations in bash.) The match expression is placed in curly braces and specifies one or more comma-separated node name ranges, and the corresponding check will only execute on nodes which fall into at least one of the specified ranges. Note that only one range expression is supported per range, and commas within ranges are not supported. So, for example, the target {n00[00-99].phys,n000[0-4].bio} would cause its check to execute on n0030.phys, n0099.phys, and n0001.bio, but not on n0100.phys nor n0005.bio. Expressions such as {n[0-3]0[00-49].r[00-29]} and {n00[00-29,54,87].sci} are not supported (though the latter may be written instead as {n00[00-29].sci,n0054.sci,n0087.sci}).
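To make the range semantics concrete, here is a small bash sketch (not NHC code) that expands a single simple range like n00[00-04].bio into its member hostnames, preserving zero-padding by reusing the width of the lower bound:

```shell
#!/bin/bash
# Hypothetical expander for one simple NHC-style node range,
# e.g. "n00[00-04].bio" -> n0000.bio ... n0004.bio
expand_range() {
    local spec="$1" pre suf body low high i
    pre="${spec%%\[*}"                   # text before the bracket
    suf="${spec##*\]}"                   # text after the bracket
    body="${spec#*\[}"; body="${body%%\]*}"
    low="${body%-*}"; high="${body#*-}"
    # 10# forces base-10 so leading zeros aren't read as octal.
    for (( i = 10#$low; i <= 10#$high; i++ )); do
        printf '%s%0*d%s\n' "$pre" "${#low}" "$i" "$suf"
    done
}
```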

Match strings of any form (glob/wildcard, regular expression, node range, or external) can be negated. This simply means that a match string which would otherwise have matched will instead fail to match, and vice versa (i.e., the boolean result of the match is inverted). To negate any match string, simply prefix it (before the initial type character, if any) with an exclamation mark (!). For example, to run a check on all but the I/O nodes, you could use the expression: !io*

Examples:

                *  || valid_check1
              !ln* || valid_check2
       /n000[0-9]/ || valid_check3
    !/\.(gpu|htc)/ || valid_check4
      {n00[20-39]} || valid_check5
!{n03,n05,n0[7-9]} || valid_check6
   {n00[10-21,23]} || this_target_is_invalid

Throughout the rest of the documentation, we will refer to this concept as a match string (or abbreviated mstr). Anywhere a match string is expected, either a glob, a regular expression surrounded by slashes, or node range expression in braces, possibly with a leading ! to negate it, may be specified.
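The matching itself can be done entirely with bash built-ins. The mstr_match function below is an illustrative sketch (not NHC's actual implementation) covering glob targets, /regex/ targets, and ! negation as described above:

```shell
#!/bin/bash
# Sketch of match-string evaluation using only bash built-ins.
# Returns 0 if $1 (a hostname) matches $2 (a match string), else 1.
mstr_match() {
    local host="$1" mstr="$2" neg=0 hit=1
    # A leading '!' inverts the result.
    if [[ "$mstr" == '!'* ]]; then neg=1; mstr="${mstr#!}"; fi
    if [[ "$mstr" == /*/ ]]; then
        # Regular expression: strip the surrounding slashes and use =~.
        local re="${mstr#/}"; re="${re%/}"
        if [[ "$host" =~ $re ]]; then hit=0; fi
    elif [[ "$host" == $mstr ]]; then
        # Unquoted right-hand side makes == perform glob matching.
        hit=0
    fi
    if (( neg )); then hit=$(( 1 - hit )); fi
    return "$hit"
}
```

(Node range expressions in braces are omitted here for brevity.)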

Supported Variables

As mentioned above, version 1.2 and higher support setting/changing shell variables within the configuration file. Many aspects of NHC's behavior can be modified through the use of shell variables, including a number of the commands in the various checks and helper scripts NHC employs.

There are, however, some variables which can only be specified in /etc/sysconfig/nhc, the global initial settings file for NHC. This is typically for obvious reasons (e.g., you can't change the path to the config file from within the config file!).

The table below provides a list of the configuration variables which may be used to modify NHC's behavior; those which won't work in a config file (only sysconfig or command line) are marked with an asterisk ("*"):

Variable Name Default Value Purpose
*CONFDIR /etc/nhc Directory for NHC configuration data
*CONFFILE $CONFDIR/$NAME.conf Path to NHC config file
DEBUG 0 Set to 1 to activate debugging output
*DETACHED_MODE 0 Set to 1 to activate Detached Mode
*DETACHED_MODE_FAIL_NODATA 0 Set to 1 to cause Detached Mode to fail if no prior check result exists
DF_CMD df Command used by check_fs_free, check_fs_size, and check_fs_used
DF_FLAGS -Tka Flags to pass to $DF_CMD for space checks. NOTE: Adding the -l flag is strongly recommended if only checking local filesystems.
DFI_CMD df Command used by check_fs_inodes, check_fs_ifree, and check_fs_iused
DFI_FLAGS -Tia Flags to pass to $DFI_CMD. NOTE: Adding the -l flag is strongly recommended if only checking local filesystems.
*EVAL_LINE None (unset) Same as -e command-line option: If set, evaluate $EVAL_LINE as a check and exit immediately based on its result.
*FORCE_SETSID 1 Re-execute NHC as a session leader if it isn't already one at startup
*HELPERDIR /usr/libexec/nhc Directory for NHC helper scripts
*HOSTNAME Set from /proc/sys/kernel/hostname Canonical name of current node
*HOSTNAME_S $HOSTNAME truncated at first . Short name (no domain or subdomain) of current node
IGNORE_EMPTY_NOTE 0 Set to 1 to treat empty notes like NHC-assigned notes (<1.2.1 behavior)
*INCDIR $CONFDIR/scripts Directory for NHC check scripts
JOBFILE_PATH TORQUE/PBS: $PBS_SERVER_HOME/mom_priv/jobs; Slurm: $SLURM_SERVER_HOME Directory on compute nodes where job records are kept
*LOGFILE >>/var/log/nhc.log File name/path or BASH-syntax directive for logging output (- for STDOUT)
LSF_BADMIN badmin Command to use for LSF's badmin (may include path)
LSF_BHOSTS bhosts Command to use for LSF's bhosts (may include path)
LSF_OFFLINE_ARGS hclose -C Arguments to LSF's badmin to offline node
LSF_ONLINE_ARGS hopen Arguments to LSF's badmin to online node
MARK_OFFLINE 1 Set to 0 to disable marking nodes offline on check failure
MAX_SYS_UID 99 UIDs <= this number are exempt from rogue process checks
MCELOG mcelog Command to use to check for MCE log errors
MCELOG_ARGS --client Parameters passed to $MCELOG command
MCELOG_MAX_CORRECTED_RATE 9 Maximum number of corrected MCEs allowed before check_hw_mcelog() returns failure
MCELOG_MAX_UNCORRECTED_RATE 0 Maximum number of uncorrected MCEs allowed before check_hw_mcelog() returns failure
MDIAG_CMD mdiag Command to use to invoke Moab's mdiag command (may include path)
*NAME nhc Used to populate default paths/filenames for configuration
NHC_AUTH_USERS root nobody Users authorized to have arbitrary processes running on compute nodes
NHC_CHECK_ALL 0 Forces all checks to be non-fatal. Displays each failure message, reports total number of failed checks, and returns that number.
NHC_CHECK_FORKED 0 Forces each check to be executed in a separate forked subprocess. NHC attempts to detect directives which set environment variables to avoid forking those. Enhances resiliency if checks hang.
NHC_RM Auto-detected Resource manager with which to interact (pbs, slurm, sge, or lsf)
NVIDIA_HEALTHMON nvidia-healthmon Command used by check_nv_healthmon to check nVidia GPU status
NVIDIA_HEALTHMON_ARGS -e -v Arguments to $NVIDIA_HEALTHMON command
OFFLINE_NODE $HELPERDIR/node-mark-offline Helper script used to mark nodes offline
ONLINE_NODE $HELPERDIR/node-mark-online Helper script used to mark nodes online
PASSWD_DATA_SRC /etc/passwd Colon-delimited file in standard passwd format from which to load user account data
PATH /sbin:/usr/sbin:/bin:/usr/bin If a path is not specified for a particular command, this variable defines the directory search order.
PBSNODES pbsnodes Command used by above helper scripts to mark nodes online/offline
PBSNODES_LIST_ARGS -n -l all Arguments to $PBSNODES to list nodes and their status notes
PBSNODES_OFFLINE_ARGS -o -N Arguments to $PBSNODES to mark node offline with note
PBSNODES_ONLINE_ARGS -c -N Arguments to $PBSNODES to mark node online with note
PBS_SERVER_HOME /var/spool/torque Directory for TORQUE files
RESULTFILE /var/run/nhc/$NAME.status Used in Detached Mode to store result of checks for subsequent handling
RM_DAEMON_MATCH TORQUE/PBS: /\bpbs_mom\b/; Slurm: /\bslurmd\b/; SGE/UGE: /\bsge_execd\b/ Match string used by check_ps_userproc_lineage to make sure all user processes were spawned by the RM daemon
SILENT 0 Set to 1 to disable logging via $LOGFILE
SLURM_SCONTROL scontrol Command to use for Slurm's scontrol (may include path)
SLURM_SC_OFFLINE_ARGS update State=DRAIN Arguments to pass to Slurm's scontrol to offline a node
SLURM_SC_ONLINE_ARGS update State=IDLE Arguments to pass to Slurm's scontrol to online a node
SLURM_SERVER_HOME /var/spool/slurmd Location of Slurm data files (see also: $JOBFILE_PATH)
SLURM_SINFO sinfo Command to use for Slurm's sinfo (may include path)
STAT_CMD /usr/bin/stat Command to use to stat() files
STAT_FMT_ARGS -c Parameter to introduce format string to stat command
*TIMEOUT 30 Watchdog timer (in seconds)
VERBOSE 0 Set to 1 to display each check line before it's executed

Example usage:

       * || export PATH="$PATH:/opt/torque/bin:/opt/torque/sbin"
  n*.rh6 || MAX_SYS_UID=499
  n*.deb || MAX_SYS_UID=999
  *.test || DEBUG=1
       * || export MARK_OFFLINE=0
       * || NVIDIA_HEALTHMON="/global/software/rhel-6.x86_64/modules/nvidia/tdk/3.304.3/nvidia-healthmon/nvidia-healthmon"

Detached Mode

Version 1.2 and higher support a feature called "detached mode." When this feature is activated on the command line or in /etc/sysconfig/nhc (by setting DETACHED_MODE=1), the nhc process will immediately fork itself. The child (background) process will run all the checks and record the results in $RESULTFILE (default: /var/run/nhc/$NAME.status). The next time nhc is executed, just before forking off the child process (which will again run the checks in the background), it will load the results of the previous execution from $RESULTFILE. Once the child process has been spawned, the foreground (parent) process will return those previous results to its caller. On the very first run, when no prior results are available, the parent simply returns success (but see DETACHED_MODE_FAIL_NODATA).

The advantage of detached mode is that any hangs or long-running commands which occur in the checks will not cause the resource manager daemon (e.g., pbs_mom) to block. Sites that use home-grown health check scripts often use a similar technique for this very reason -- it's non-blocking.

However, a word of caution: if a detached-mode nhc encounters a failure, it won't be acted upon until the next execution. Suppose you have NHC configured to run only at job start and job end. Suppose further that the /tmp filesystem encounters an error and gets remounted read-only at some point after the completion of the last job, and that you have check_fs_mount_rw /tmp in your nhc.conf. In normal mode, when a new job tries to start, nhc will detect the read-only mount at job start and will take the node out of service before the job is allowed to begin executing on the node. In detached mode, however, since nhc has not been run in the meantime, and the previous run was successful, nhc will return success and allow the job to start before the error condition is noticed!

For this reason, when using detached mode, periodic checks are HIGHLY recommended. This will not completely prevent the above scenario, but it will drastically reduce the odds of its occurring. Users of detached mode, as with any similar method of delayed reporting, must be aware of and accept this caveat in exchange for the benefits of fully non-blocking behavior.
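The detached-mode flow described above can be sketched in a few lines of bash. This is illustrative only, not NHC's actual implementation; run_all_checks is a hypothetical stand-in for the real check loop:

```shell
#!/bin/bash
# Illustrative sketch of detached mode (NOT actual NHC code).
# run_all_checks is a hypothetical stand-in for the real check loop.
RESULTFILE="${RESULTFILE:-/var/run/nhc/nhc.status}"

run_checks_detached() {
    # Load the result of the PREVIOUS background run; with no prior
    # data, default to success (cf. DETACHED_MODE_FAIL_NODATA=0).
    local prev_rc=0 prev_msg="OK"
    if [[ -s "$RESULTFILE" ]]; then
        read -r prev_rc prev_msg < "$RESULTFILE"
    fi

    # Child: run the real checks in the background and record results.
    (
        if run_all_checks; then
            echo "0 OK" > "$RESULTFILE"
        else
            echo "1 check-failed" > "$RESULTFILE"
        fi
    ) &

    # Parent: report the previous run's outcome without waiting.
    echo "$prev_msg"
    return "$prev_rc"
}
```

Note the one-cycle lag this structure creates: a failure is only reported on the invocation after the one that detected it, which is exactly the caveat discussed above.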

Built-in Checks

In the documentation below, parameters surrounded by square brackets ([like this]) are optional. All others are required.

The LBNL Node Health Check distribution supplies the following checks:

check_cmd_output

check_cmd_output [-t timeout] [-r retval] [-m match [...]] { -e 'command [arg1 [...]]' | command [arg1 [...]] }

check_cmd_output executes a command and compares each line of its output against any mstrs (match strings) passed in. If any positive match is not found in the command output, or if any negative match is found, the check fails. The check also fails if the exit status of command does not match retval (if supplied) or if the command fails to complete within timeout seconds (default 5). Options to this check are as follows:

Check Option Purpose
-e command Execute command and gather its output. The command is split on word boundaries, much like /bin/sh -c '...' does.
-m mstr If the match string is negated, no line of the output may match the specified mstr expression. Otherwise, at least one line must match. This option may be used multiple times as needed.
-r retval Exit status (a.k.a. return code or return value) of command must equal retval or the check will fail.
-t secs Command will timeout if not completed within secs seconds (default is 5).

NOTE: If the command is passed using -e, the command string is split on word boundaries to create the argv[] array for the command. If passed on the end of the check line, DO NOT quote the command. Each parameter must be distinct. Only use quotes to group multiple words into a single argument. For example, passing command as "service bind restart" will work if used with -e but will fail if passed at the end of the check line (use without quotes instead)!
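For instance, both of the following nhc.conf lines are equivalent ways of expressing the rpcbind check shown in the example below; note the quoting difference between the -e form and the bare form (a sketch; adjust paths and match strings for your system):

```
 * || check_cmd_output -t 1 -r 0 -m '/is running/' -e '/sbin/service rpcbind status'
 * || check_cmd_output -t 1 -r 0 -m '/is running/' /sbin/service rpcbind status
```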

Example (Verify that the rpcbind service is alive): check_cmd_output -t 1 -r 0 -m '/is running/' /sbin/service rpcbind status


check_cmd_status

check_cmd_status [-t timeout] -r retval command [arg1 [...]]

check_cmd_status executes a command and redirects its output to /dev/null. The check fails if the exit status of command does not match retval or if the command fails to complete within timeout seconds (default 5). Options to this check are as follows:

Check Option Purpose
-r retval Exit status (a.k.a. return code or return value) of command must equal retval or the check will fail.
-t secs Command will timeout if not completed within secs seconds (default is 5).

Example (Make sure SELinux is disabled): check_cmd_status -t 1 -r 1 selinuxenabled


check_dmi_data_match

check_dmi_data_match [-! | -n | '!'] [-h handle] [-t type] string

check_dmi_data_match uses parsed, structured data taken from the output of the dmidecode command to allow the administrator to make very specific assertions regarding the contents of the DMI (a.k.a. SMBIOS) data. Matches can be made against any output or against specific types (classifications of data) or even handles (identifiers of data blocks, typically sequential). Output is restructured such that sections which are indented underneath a section header have the text of the section header prepended to the output line along with a colon and intervening space. So, for example, the string "ISA is supported" which appears underneath the "Characteristics:" header, which in turn is underneath the "BIOS Information" header/type, would be parsed by check_dmi_data_match as "BIOS Information: Characteristics: ISA is supported".

See the dmidecode man page for more details.

WARNING: Although string is technically a match string, and supports negation in its own right, you probably don't want to use negated match strings here. Passing the -! or -n parameters to the check means, "check all relevant DMI data and pass the check only if no matching line is found." Using a negated match string here would mean, "The check passes as soon as ANY non-matching line is found" -- almost certainly not the desired behavior! A subtle but important distinction.

Example (check for BIOS version): check_dmi_data_match "BIOS Information: Version: 1.0.37"
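The header-prepending transformation described above can be approximated with a short awk filter. This is an illustrative sketch, not NHC's actual parser; flatten_dmi is a hypothetical name, and dmidecode's tab indentation is assumed:

```shell
#!/bin/bash
# Sketch: flatten dmidecode-style tab-indented output into
# "Section: Subsection: value" lines (NOT NHC's actual parser).
flatten_dmi() {
    awk '
        {
            # Count leading tabs to determine nesting depth.
            depth = 0
            while (substr($0, depth + 1, 1) == "\t") depth++
            line = substr($0, depth + 1)
            if (line == "") next

            # Prepend the stored headers of all enclosing sections.
            out = ""
            for (i = 0; i < depth; i++) out = out prefix[i] ": "
            print out line

            # Remember this line (sans trailing colon) as the header
            # for any more deeply indented lines that follow.
            sub(/:$/, "", line)
            prefix[depth] = line
        }'
}
```

Feeding it the BIOS example from the text yields, among other lines, "BIOS Information: Characteristics: ISA is supported".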


check_dmi_raw_data_match

check_dmi_raw_data_match match_string [...]

check_dmi_raw_data_match is basically like a grep on the raw output of the dmidecode command. If you don't need to match specific strings in specific sections but just want to match a particular string anywhere in the raw output, you can use this check instead of check_dmi_data_match (above) to avoid the additional overhead of parsing the output into handles, types, and expanded strings.

Example (check for firmware version in raw output; could really match any version): check_dmi_raw_data_match "Version: 1.24.4175.33"


check_file_contents

check_file_contents file mstr [...]

check_file_contents looks at the specified file and allows one or more (possibly negated) mstr match strings (glob, regexp, etc.) to be applied to the contents of the file. The check fails unless ALL specified expressions successfully match the file content, but the order in which they appear in the file need not match the order specified on the check line. No post-processing is done on the file, but take care to quote any shell metacharacters in your match expressions properly. Also remember that matching against the contents of large files will slow down NHC and potentially cause a timeout. Reading of the file stops when all match expressions have been successfully found in the file.

The file is only read once per invocation of check_file_contents, so if you need to match several expressions in the same file, passing them all to the same check is advisable.

NOTE: This check handles negated match strings internally so that they "do the right thing:" ensure that no matching lines exist in the entire file.

Example (verify setting of $pbsserver in pbs_mom config): check_file_contents /var/spool/torque/mom_priv/config '/^\$pbsserver master$/'


check_file_stat

check_file_stat [-D num] [-G name] [-M mode] [-N secs] [-O secs] [-T num] [-U name] [-d num] [-g gid] [-m mode] [-n secs] [-o secs] [-t num] [-u uid] filename(s)

check_file_stat allows the user to assert specific properties on one or more files, directories, and/or other filesystem objects based on metadata returned by the Linux/Unix stat command. Each option specifies a test which is applied to each of the filename(s) in order. The check fails if any of the comparisons does not match. Options to this check are as follows:

Check Option Purpose
-D num Specifies that the device ID for filename(s) should be num (decimal or hex)
-G name Specifies that filename(s) should be owned by group name
-M mode Specifies that the permissions for filename(s) should include at LEAST the bits set in mode
-N secs Specifies that the ctime (i.e., inode change time) of filename(s) should be newer than secs seconds ago
-O secs Specifies that the ctime (i.e., inode change time) of filename(s) should be older than secs seconds ago
-T num Specifies that the minor device number for filename(s) be num
-U name Specifies that filename(s) should be owned by user name
-d num Specifies that the device ID for filename(s) should be num (decimal or hex)
-g gid Specifies that filename(s) should be owned by group id gid
-m mode Specifies that the permissions for filename(s) should EXACTLY be the bits set in mode
-n secs Specifies that the mtime (i.e., modification time) of filename(s) should be newer than secs seconds ago
-o secs Specifies that the mtime (i.e., modification time) of filename(s) should be older than secs seconds ago
-t num Specifies that the major device number for filename(s) be num
-u uid Specifies that filename(s) should be owned by uid uid

Example (Assert correct uid, gid, owner, group, & major/minor device numbers for /dev/null): check_file_stat -u 0 -g 0 -U root -G root -t 1 -T 3 /dev/null


check_file_test

check_file_test [-a] [-b] [-c] [-d] [-e] [-f] [-g] [-h] [-k] [-p] [-r] [-s] [-t] [-u] [-w] [-x] [-O] [-G] [-L] [-S] [-N] filename(s)

check_file_test allows the user to assert very simple attributes on one or more files, directories, and/or other filesystem objects based on tests which can be performed via the shell's built-in test command. Each option specifies a test which is applied to each of the filename(s) in order. NHC internally evaluates the shell expression testoption filename for each option given for each filename specified. (In other words, passing 2 options and 3 filenames will evaluate 6 test expressions in total.) The check fails if any of the test command evaluations returns false. For efficiency, this check should be used in preference to check_file_stat whenever possible as it does not require calling out to the stat command. Options to this check are as follows:

Check Option Purpose
-a Evaluates to true if the filename being tested exists (same as -e).
-b Evaluates to true if the filename being tested exists and is block special.
-c Evaluates to true if the filename being tested exists and is character special.
-d Evaluates to true if the filename being tested exists and is a directory.
-e Evaluates to true if the filename being tested exists.
-f Evaluates to true if the filename being tested exists and is a regular file.
-g Evaluates to true if the filename being tested exists and is setgid.
-h Evaluates to true if the filename being tested exists and is a symbolic link.
-k Evaluates to true if the filename being tested exists and has its sticky bit set.
-p Evaluates to true if the filename being tested exists and is a named pipe.
-r Evaluates to true if the filename being tested exists and is readable.
-s Evaluates to true if the filename being tested exists and is not empty.
-t Evaluates to true if the filename being tested is a numeric file descriptor which references a valid tty.
-u Evaluates to true if the filename being tested exists and is setuid.
-w Evaluates to true if the filename being tested exists and is writable.
-x Evaluates to true if the filename being tested exists and is executable.
-O Evaluates to true if the filename being tested exists and is owned by NHC's EUID.
-G Evaluates to true if the filename being tested exists and is owned by NHC's EGID.
-L Evaluates to true if the filename being tested exists and is a symbolic link (same as -h).
-S Evaluates to true if the filename being tested exists and is a socket.
-N Evaluates to true if the filename being tested exists and has been modified since it was last read.

Example (Assert correct ownerships and permissions on /dev/null similar to above, assuming NHC runs as root): check_file_test -O -G -c -r -w /dev/null


check_fs_inodes

check_fs_inodes mountpoint [min] [max]

Ensures that the specified mountpoint has at least min but no more than max total inodes. Either may be blank.

WARNING: Use of this check requires execution of the /usr/bin/df command which may HANG in cases of NFS failure! If you use this check, consider also using Detached Mode!

Example (make sure /tmp has at least 1000 inodes): check_fs_inodes /tmp 1k


check_fs_ifree

check_fs_ifree mountpoint min

Ensures that the specified mountpoint has at least min free inodes.

WARNING: Use of this check requires execution of the /usr/bin/df command which may HANG in cases of NFS failure! If you use this check, consider also using Detached Mode!

Example (make sure /local has at least 100 inodes free): check_fs_ifree /local 100


check_fs_iused

check_fs_iused mountpoint max

Ensures that the specified mountpoint has no more than max used inodes.

WARNING: Use of this check requires execution of the /usr/bin/df command which may HANG in cases of NFS failure! If you use this check, consider also using Detached Mode!

Example (make sure /tmp has no more than 1 million used inodes): check_fs_iused /tmp 1M


check_fs_mount

check_fs_mount [-0] [-r] [-t fstype] [-s source] [-o options] [-O remount_options] [-e missing_action] [-E found_action] {-f|-F} mountpoint [...]

-OR- (deprecated)

check_fs_mount mountpoint [source] [fstype] [options]

check_fs_mount examines the list of mounted filesystems on the local machine to verify that the specified entry is present. mountpoint specifies the directory on the node where the filesystem should be mounted. source is a match string which is compared against the device, whatever that may be (e.g., server:/path for NFS or /dev/sda1 for local). fstype is a match string for the filesystem type (e.g., nfs, ext4, tmpfs). options is a match string for the mount options. Any number (zero or more) of these 3 items (i.e., sources, types, and/or options) may be specified; additionally, one or more mountpoints may be specified. Use -f for normal filesystems and -F for auto-mounted filesystems (to trigger them to be mounted prior to performing the check).

Unless the -0 (non-fatal) option is given, this check will fail if any of the specified filesystems is not found or does not match the type(s)/source(s)/option(s) specified. The -r (remount) option will cause NHC to attempt to re-mount missing filesystem(s) by issuing the system command "mount -o remount_options filesystem" in the background as root. This is "best effort," so success or failure of the mount attempt is not taken into account. If specified, missing_action is executed if a filesystem is not found. Also, if specified, found_action is executed for each filesystem which is found and correctly mounted.

Example (check for NFS hard-mounted /home from bluearc1:/global/home and mount if missing): check_fs_mount -r -s bluearc1:/global/home -t nfs -o *hard* -f /home


check_fs_mount_ro

check_fs_mount_ro [-0] [-r] [-t fstype] [-s source] [-o options] [-O remount_options] [-e missing_action] [-E found_action] -f mountpoint

-OR- (deprecated)

check_fs_mount_ro mountpoint [source] [fstype]

Checks that a particular filesystem is mounted read-only. Shortcut for check_fs_mount -o '/(^|,)ro($|,)/' ...


check_fs_mount_rw

check_fs_mount_rw [-0] [-r] [-t fstype] [-s source] [-o options] [-O remount_options] [-e missing_action] [-E found_action] -f mountpoint

-OR- (deprecated)

check_fs_mount_rw mountpoint [source] [fstype]

Checks that a particular filesystem is mounted read-write. Shortcut for check_fs_mount -o '/(^|,)rw($|,)/' ...


check_fs_free

check_fs_free mountpoint minfree

(Version 1.2+) Checks that a particular filesystem has at least minfree space available. The value for minfree may be specified either as a percentage or a numerical value with an optional suffix (k or kB for kilobytes, the default; M or MB for megabytes; G or GB for gigabytes; etc., all case insensitive).

WARNING: Use of this check requires execution of the /usr/bin/df command which may HANG in cases of NFS failure! If you use this check, consider also using Detached Mode!

Example: check_fs_free /tmp 128MB
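The size-suffix convention used by these filesystem checks can be illustrated with a small converter. This is a sketch under stated assumptions: to_kb is a hypothetical name, binary (1024-based) multiples are assumed, and the percentage form is not handled here:

```shell
#!/bin/bash
# Hypothetical helper (NOT actual NHC code): convert "128MB", "4g",
# "512k", or a bare number into kilobytes, case-insensitively.
to_kb() {
    local val="$1" num unit
    num="${val%%[!0-9]*}"      # leading digits
    unit="${val#$num}"         # whatever follows them
    case "${unit,,}" in
        ''|k|kb) echo "$num" ;;
        m|mb)    echo $(( num * 1024 )) ;;
        g|gb)    echo $(( num * 1024 * 1024 )) ;;
        t|tb)    echo $(( num * 1024 * 1024 * 1024 )) ;;
        *)       echo "to_kb: unknown suffix '$unit'" >&2; return 1 ;;
    esac
}

to_kb 128MB   # prints 131072
to_kb 4g      # prints 4194304
```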


check_fs_size

check_fs_size mountpoint [minsize] [maxsize]

(Version 1.2+) Checks that the total size of a particular filesystem is between minsize and maxsize (inclusive). Either may be blank; to check for a specific size, pass the same value for both parameters. The value(s) for minsize and/or maxsize are specified as positive integers with an optional suffix (k or kB for kilobytes, the default; M or MB for megabytes; G or GB for gigabytes; etc., all case insensitive).

WARNING: Use of this check requires execution of the /usr/bin/df command which may HANG in cases of NFS failure! If you use this check, consider also using Detached Mode!

Example: check_fs_size /tmp 512m 4g


check_fs_used

check_fs_used mountpoint maxused

(Version 1.2+) Checks that a particular filesystem has less than maxused space consumed. The value for maxused may be specified either as a percentage or a numerical value with an optional suffix (k or kB for kilobytes, the default; M or MB for megabytes; G or GB for gigabytes; etc., all case insensitive).

WARNING: Use of this check requires execution of the /usr/bin/df command which may HANG in cases of NFS failure! If you use this check, consider also using Detached Mode!

Example: check_fs_used / 98%


check_hw_cpuinfo

check_hw_cpuinfo [sockets] [cores] [threads]

check_hw_cpuinfo compares the properties of the OS-detected CPU(s) to the specified values to ensure that the correct number of physical sockets, execution cores, and "threads" (or "virtual cores") are present and functioning on the system. For a single-core, non-hyperthreading-enabled processor, all 3 parameters would be identical. Multicore CPUs will have more cores than sockets, and CPUs with Intel HyperThreading Technology (HT) turned on will have more threads than cores. Since HPC workloads often suffer when HT is active, this check is a handy way to make sure that doesn't happen.

Example (dual-socket 4-core Intel Nehalem with HT turned off): check_hw_cpuinfo 2 8 8
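To see where the three numbers come from, the sockets/cores/threads triple can be derived from /proc/cpuinfo roughly as follows. This is an illustrative sketch assuming the x86 /proc/cpuinfo layout; cpuinfo_counts is a hypothetical name, not an NHC function:

```shell
#!/bin/bash
# Sketch (NOT actual NHC code): derive sockets, cores, and threads
# from an x86-style /proc/cpuinfo.
cpuinfo_counts() {
    local file="${1:-/proc/cpuinfo}"
    local threads sockets per_socket cores

    # Each "processor" stanza is one schedulable thread.
    threads=$(grep -c '^processor' "$file")

    # Distinct "physical id" values correspond to sockets.
    sockets=$(grep '^physical id' "$file" | sort -u | grep -c .)

    # "cpu cores" is the core count PER socket.
    per_socket=$(grep -m1 '^cpu cores' "$file" | awk '{print $NF}')
    cores=$(( sockets * per_socket ))

    echo "$sockets $cores $threads"
}
```

On the dual-socket Nehalem example above, this would print 2 8 8 with HT off and 2 8 16 with HT on.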


check_hw_eth

check_hw_eth device

check_hw_eth verifies that a particular Ethernet device is available. Note that it cannot check for IP configuration at this time.

Example: check_hw_eth eth0


check_hw_gm

check_hw_gm device

check_hw_gm verifies that the specified Myrinet device is available. This check will fail if the Myrinet kernel drivers are not loaded but does not distinguish between missing drivers and a missing interface.

Example: check_hw_gm myri0


check_hw_ib

check_hw_ib rate [device]

check_hw_ib determines whether or not an active IB link is present with the specified data rate (in Gb/sec). Version 1.3 and later support the device parameter for specifying the name of the IB device. Version 1.4.1 and later also verify that the kernel drivers and userspace libraries are the same OFED version.

Example (QDR Infiniband): check_hw_ib 40


check_hw_mcelog

check_hw_mcelog

check_hw_mcelog queries the running mcelog daemon, if present. If the daemon is not running or has detected no errors, the check passes. If errors are present, the check fails and sends the output to the log file and syslog.

The default behavior is to run mcelog --client, but this is configurable via the $MCELOG and $MCELOG_ARGS variables.

(Version 1.4.1 and higher) check_hw_mcelog will now also check the correctable and uncorrectable error counts in the past 24 hours and compare them to the settings $MCELOG_MAX_CORRECTED_RATE and $MCELOG_MAX_UNCORRECTED_RATE, respectively; if either actual count exceeds the value specified in the threshold, the check will fail. Set either or both variables to the empty string to obtain the old behavior.
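For example, the thresholds could be tuned, or the uncorrected-rate check disabled entirely, with nhc.conf lines like these (illustrative values):

```
 * || MCELOG_MAX_CORRECTED_RATE=24
 * || MCELOG_MAX_UNCORRECTED_RATE=""
```

The first line raises the corrected-error threshold; the empty string in the second reverts the uncorrected count to the pre-1.4.1 behavior of not rate-checking.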


check_hw_mem

check_hw_mem min_kb max_kb [fudge]

check_hw_mem compares the total system memory (RAM + swap) with the minimum and maximum values provided (in kB). If the total memory is less than min_kb or more than max_kb kilobytes, the check fails. To require an exact amount of memory, use the same value for both parameters.

If the optional fudge value is specified, either as an absolute size value or as a percentage of the total amount of memory, it represents a "fudge factor," a tolerance by which the amount of memory detected in the system may vary (either below min_kb or above max_kb) without failing the check. This allows both for slight variations in the Linux kernel's reported values and for rounding errors in the size calculations and unit conversions.

Example (exactly 26 GB system memory required): check_hw_mem 27262976 27262976


check_hw_mem_free

check_hw_mem_free min_kb

check_hw_mem_free adds the free physical RAM to the free swap (see below for details) and compares that to the minimum provided (in kB). If the total free memory is less than min_kb kilobytes, the check fails.

Example (require at least 640 kB free): check_hw_mem_free 640


check_hw_physmem

check_hw_physmem min_kb max_kb [fudge]

check_hw_physmem compares the amount of physical memory (RAM) present in the system with the minimum and maximum values provided (in kB). If the physical memory is less than min_kb or more than max_kb kilobytes, the check fails. To require an exact amount of RAM, use the same value for both parameters.

If the optional fudge value is specified, either as an absolute size value or as a percentage of the total amount of RAM, it represents a "fudge factor," a tolerance by which the amount of RAM detected in the system may vary (either below min_kb or above max_kb) without failing the check. This allows both for slight variations in the Linux kernel's reported values and for rounding errors in the size calculations and unit conversions.

Example (at least 12 GB RAM/node, no more than 48 GB): check_hw_physmem 12582912 50331648


check_hw_physmem_free

check_hw_physmem_free min_kb

check_hw_physmem_free compares the free physical RAM to the minimum provided (in kB). If less than min_kb kilobytes of physical RAM are free, the check fails. For purposes of this calculation, kernel buffers and cache are considered to be free memory.

Example (require at least 1 kB free): check_hw_physmem_free 1


check_hw_swap

check_hw_swap min_kb max_kb

check_hw_swap compares the total system virtual memory (swap) size with the minimum and maximum values provided (in kB). If the total swap size is less than min_kb or more than max_kb kilobytes, the check fails. To require an exact amount of memory, use the same value for both parameters.

Example (at most 2 GB swap): check_hw_swap 0 2097152


check_hw_swap_free

check_hw_swap_free min_kb

check_hw_swap_free compares the amount of free virtual memory to the minimum provided (in kB). If the total free swap is less than min_kb kilobytes, the check fails.

Example (require at least 1 GB free): check_hw_swap_free 1048576


check_moab_sched

check_moab_sched [-t timeout] [-a alert_match] [-m [!]mstr] [-v version_match]

check_moab_sched executes mdiag -S -v and examines its output, similarly to check_cmd_output. In addition to the arbitrary positive/negative mstr match strings, it also accepts an alert_match for flagging specific Moab alerts and a version_match for making sure the expected version is running. The check will fail based on any of these matches, or if mdiag does not return within the specified timeout.

Example (ensure we're running Moab 7.2.3 and it's not paused): check_moab_sched -t 45 -m '!/PAUSED/' -v 7.2.3


check_moab_rm

check_moab_rm [-t timeout] [-m [!]mstr]

check_moab_rm executes mdiag -R -v and examines its output, similarly to check_cmd_output. In addition to the arbitrary positive/negative mstr match strings, it also checks for any RMs which are not in the Active state (and fails if any are inactive). The check will also fail if mdiag does not return within the specified timeout.

Example (basic Moab RM sanity check): check_moab_rm -t 45


check_moab_torque

check_moab_torque [-t timeout] [-m [!]mstr]

check_moab_torque executes qmgr -c 'print server' and examines its output, similarly to check_cmd_output. In addition to the arbitrary positive/negative mstr match strings, it also checks to make sure that the scheduling parameter is set to True (and fails if it isn't). The check will also fail if qmgr does not return within the specified timeout.

Example (basic TORQUE configuration/responsiveness sanity check): check_moab_torque -t 45


check_net_ping

check_net_ping [-I interface] [-W timeout] [-c count] [-i interval] [-s packetsize] [-t ttl] [-w deadline] target(s)

(Version 1.4.2+) check_net_ping provides an NHC-based wrapper around the standard Linux/UNIX ping command. The most common command-line options for ping are supported, and any number of hostnames and/or IP addresses may be specified as targets. All options specified on the check_net_ping command line are passed directly to ping -q -n for each target specified. The following options are supported:

Check Option Purpose
-I interface interface is either an address or an interface name from which to send packets
-W timeout Wait timeout seconds for a response
-c count Stop after sending count packets
-i interval Wait interval seconds before sending each packet
-s packetsize Specifies that packets with packetsize bytes of data be sent
-t ttl Set IP Time To Live in each packet to ttl
-w deadline ping will exit after deadline seconds regardless of how many packets were sent/received

Example (check network connectivity to master, io, and xfer nodes): check_net_ping -W 3 -i 0.25 -c 5 master io000 xfer


check_net_socket

check_net_socket [-0] [-a] [-!] [-n <name>] [-p <proto>] [-s <state>] [-l <locaddr>[:<locport>]] [-r <rmtaddr>[:<rmtport>]] [-t <type>] [-u <user>] [-d <daemon>] [-e <action>] [-E <found_action>]

(Version 1.4.1+) check_net_socket executes either the command $NETSTAT_CMD $NETSTAT_ARGS (default: netstat -Tanpee -A inet,inet6,unix) or (if $NETSTAT_CMD is not in $PATH) the command $SS_CMD $SS_ARGS (default: ss -anpee -A inet,unix). The output of the command is parsed for socket information. Then each socket is compared with the match criteria passed in to the check: protocol proto, state state, local and/or remote address(es) locaddr/rmtaddr with optional ports locport/rmtport, type type, owner user, and/or process name daemon. If a matching socket is found, found_action is executed, and the check returns successfully. If no match is found, action is executed, and the check fails. Reverse the success/failure logic by specifying -! (i.e., if NHC finds one or more matching sockets, the check will fail).

The name parameter may be used to label the type of socket being sought (e.g., -n 'SSH daemon TCP listening socket'). If -0 is specified, the check is non-fatal (i.e., missing matches will be noted but will not terminate NHC). Use -a to locate all matching sockets (mainly for debugging).

Example (search for HTTP daemon IPv4 listening socket and restart if missing): check_net_socket -n "HTTP daemon" -p tcp -s LISTEN -l '0.0.0.0:80' -d httpd -e 'service httpd start'


check_nv_healthmon

check_nv_healthmon

(Version 1.2+) check_nv_healthmon runs the command $NVIDIA_HEALTHMON (default: nvidia-healthmon) with the arguments specified in $NVIDIA_HEALTHMON_ARGS (default: -e -v) to check for problems with any nVidia Tesla GPU devices on the system. If any errors are found, the entire (human-readable) output of the command is logged, and the check fails. NOTE: Version 3.304 or higher of the nVidia Tesla Deployment Kit (TDK) is required! See http://developer.nvidia.com/cuda/tesla-deployment-kit for details and downloads.

Example: check_nv_healthmon


check_ps_blacklist

(deprecated) check_ps_blacklist command [[!]owner] [args]

(Version 1.2+) check_ps_blacklist looks for a running process matching command (or, if args is specified, command+args). If owner is specified, the process must be owned by owner; if the optional ! is also specified, the process must NOT be owned by owner. If any matching process is found, the check fails. (This is the opposite of check_ps_daemon.)

NOTE: This check (as well as its complementary check, check_ps_daemon) has largely been replaced with check_ps_service. The latter should be used instead whenever possible.

Example (prohibit sshd NOT owned by root): check_ps_blacklist sshd !root


check_ps_cpu

check_ps_cpu [-0] [-a] [-f] [-K] [-k] [-l] [-s] [-r value] [-u [!]user] [-m [!]mstr] [-e action] threshold

(Version 1.4+) check_ps_cpu is a resource consumption check. It flags any/all matching processes whose current percentage of CPU utilization meets or exceeds the specified threshold. The % suffix on the threshold is optional but fully supported. Options to this check are as follows:

| Check Option | Purpose |
|--------------|---------|
| `-0` | Non-fatal. Failure of this check will be ignored. |
| `-a` | Find, report, and act on all matching processes. Default behavior is to fail the check after the first matching process. |
| `-e action` | Execute `/bin/bash -c action` if a matching process is found. |
| `-f` | Full match. Match against the entire command line, not just the first word. |
| `-K` | Kill the parent of the matching process (or processes, if used with `-a`) with SIGKILL. (NOTE: Does NOT imply `-k`.) |
| `-k` | Kill the matching process (or processes, if used with `-a`) with SIGKILL. |
| `-l` | Log the matching process (or processes, if used with `-a`) to the NHC log (`$LOGFILE`). |
| `-m mstr` | Look only at processes matching *mstr* (an NHC match string, possibly negated). Default is to check all processes. |
| `-r value` | Renice the matching process (or processes, if used with `-a`) by the specified *value* (may be positive or negative). |
| `-s` | Log the matching process (or processes, if used with `-a`) to the syslog. |
| `-u [!]user` | User match. Matches only processes owned by *user* (or, if negated, NOT owned by *user*). |

Example (look for non-root-owned process consuming 99% CPU or more; renice it to the max): check_ps_cpu -u !root -r 20 99%


check_ps_daemon

(deprecated) check_ps_daemon command [owner] [args]

check_ps_daemon looks for a running process matching command (or, if args is specified, command+args). If owner is specified, the process must be owned by owner. If no matching process is found, the check fails.

NOTE: This check (as well as its complementary check, check_ps_blacklist) has largely been replaced with check_ps_service. The latter should be used instead whenever possible.

Example (look for a root-owned sshd): check_ps_daemon sshd root


check_ps_kswapd

check_ps_kswapd cpu_time discrepancy [action [actions...]]

check_ps_kswapd compares the accumulated CPU time (in seconds) between kswapd kernel threads to make sure there's no imbalance among different NUMA nodes (which could be an early symptom of failure). Threads may not exceed cpu_time seconds nor differ by more than a factor of discrepancy. Unlike most checks, check_ps_kswapd need not be fatal. Zero or more actions may be specified from the following allowed actions: ignore (do nothing), log (write error to log file and continue), syslog (write error to syslog and continue), or die (fail the check as normal). The default is "die" if no action is specified.

Example (max 500 CPU hours, 100x discrepancy limit, only log and syslog on error): check_ps_kswapd 1800000 100 log syslog
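The discrepancy test can be pictured roughly as follows. This is a hypothetical sketch for illustration only; `kswapd_ok` is not NHC's actual implementation:

```shell
# Hypothetical sketch (not NHC's code): given kswapd CPU times in
# seconds, fail if any time exceeds the limit, or if the largest and
# smallest times differ by more than the allowed factor.
kswapd_ok() {
    local LIMIT=$1 FACTOR=$2
    shift 2
    local T MIN=$1 MAX=$1
    for T in "$@"; do
        (( T > LIMIT )) && return 1    # absolute CPU-time limit exceeded
        (( T > MAX )) && MAX=$T
        (( T < MIN )) && MIN=$T
    done
    (( MIN == 0 )) && MIN=1            # guard against divide-by-zero
    (( MAX <= FACTOR * MIN ))          # discrepancy factor check
}
```

With the example arguments above, this logic would pass as long as every kswapd thread stays under 1800000 CPU-seconds and no thread has accumulated more than 100 times the CPU time of another.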


check_ps_loadavg

check_ps_loadavg limit_1m limit_5m limit_15m

check_ps_loadavg looks at the 1-minute, 5-minute, and 15-minute load averages reported by the kernel and compares them to the parameters limit_1m, limit_5m, and limit_15m, respectively. If any limit has been exceeded, the check fails. Limits which are empty (i.e., '') or not supplied are ignored (i.e., assumed to be infinite) and will never fail.

Example (ensure the 5-minute load average stays below 30): check_ps_loadavg '' 30
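Under the hood, the kernel exposes these values in /proc/loadavg; a minimal sketch of reading them natively in bash (illustrative only, not NHC's actual code):

```shell
# Read the 1-, 5-, and 15-minute load averages without spawning a
# subprocess; field layout per proc(5). Illustrative sketch only.
read -a LOADAVG < /proc/loadavg
ONE_MIN=${LOADAVG[0]}
FIVE_MIN=${LOADAVG[1]}
FIFTEEN_MIN=${LOADAVG[2]}
```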


check_ps_mem

check_ps_mem [-0] [-a] [-f] [-K] [-k] [-l] [-s] [-r value] [-u [!]user] [-m [!]mstr] [-e action] threshold

(Version 1.4+) check_ps_mem is a resource consumption check. It flags any/all matching processes whose total memory consumption (including both physical and virtual memory) meets or exceeds the specified threshold. The threshold is interpreted as kilobytes (1024 bytes) or can use NHC's standard byte-suffix syntax (e.g., 32GB). Percentages are not supported for this check at this time. Options to this check are as follows:

| Check Option | Purpose |
|--------------|---------|
| `-0` | Non-fatal. Failure of this check will be ignored. |
| `-a` | Find, report, and act on all matching processes. Default behavior is to fail the check after the first matching process. |
| `-e action` | Execute `/bin/bash -c action` if a matching process is found. |
| `-f` | Full match. Match against the entire command line, not just the first word. |
| `-K` | Kill the parent of the matching process (or processes, if used with `-a`) with SIGKILL. (NOTE: Does NOT imply `-k`.) |
| `-k` | Kill the matching process (or processes, if used with `-a`) with SIGKILL. |
| `-l` | Log the matching process (or processes, if used with `-a`) to the NHC log (`$LOGFILE`). |
| `-m mstr` | Look only at processes matching *mstr* (an NHC match string, possibly negated). Default is to check all processes. |
| `-r value` | Renice the matching process (or processes, if used with `-a`) by the specified *value* (may be positive or negative). |
| `-s` | Log the matching process (or processes, if used with `-a`) to the syslog. |
| `-u [!]user` | User match. Matches only processes owned by *user* (or, if negated, NOT owned by *user*). |

Example (look for process owned by baduser consuming 32GB or more of memory; log, syslog, and kill it): check_ps_mem -u baduser -l -s -k 32G


check_ps_physmem

check_ps_physmem [-0] [-a] [-f] [-K] [-k] [-l] [-s] [-r value] [-u [!]user] [-m [!]mstr] [-e action] threshold

(Version 1.4+) check_ps_physmem is a resource consumption check. It flags any/all matching processes whose physical memory consumption (i.e., resident RAM only) meets or exceeds the specified threshold. The threshold is interpreted as a percentage if followed by a %, or as a number of kilobytes (1024 bytes) if numeric only, or can use NHC's standard byte-suffix syntax (e.g., 32GB). Options to this check are as follows:

| Check Option | Purpose |
|--------------|---------|
| `-0` | Non-fatal. Failure of this check will be ignored. |
| `-a` | Find, report, and act on all matching processes. Default behavior is to fail the check after the first matching process. |
| `-e action` | Execute `/bin/bash -c action` if a matching process is found. |
| `-f` | Full match. Match against the entire command line, not just the first word. |
| `-K` | Kill the parent of the matching process (or processes, if used with `-a`) with SIGKILL. (NOTE: Does NOT imply `-k`.) |
| `-k` | Kill the matching process (or processes, if used with `-a`) with SIGKILL. |
| `-l` | Log the matching process (or processes, if used with `-a`) to the NHC log (`$LOGFILE`). |
| `-m mstr` | Look only at processes matching *mstr* (an NHC match string, possibly negated). Default is to check all processes. |
| `-r value` | Renice the matching process (or processes, if used with `-a`) by the specified *value* (may be positive or negative). |
| `-s` | Log the matching process (or processes, if used with `-a`) to the syslog. |
| `-u [!]user` | User match. Matches only processes owned by *user* (or, if negated, NOT owned by *user*). |

Example (look for all non-root-owned processes consuming more than 20% of system RAM; syslog and kill them all, but continue running): check_ps_physmem -0 -a -u !root -s -k 20%


check_ps_service

check_ps_service [-0] [-f] [-S|-r|-c|-s|-k] [-u user] [-d daemon | -m mstr] [ -e action | -E action ] service

(Version 1.4+) check_ps_service is similar to check_ps_daemon except it has the ability to start, restart, or cycle services which aren't running but should be, and to stop or kill services which shouldn't be running but are. Options to this check are as follows:

| Check Option | Purpose |
|--------------|---------|
| `-0` | Non-fatal. Failure of this check will be ignored. |
| `-S` | Start service. Service *service* will be started if not found running. Equivalent to `-e '/sbin/service service start'` |
| `-c` | Cycle service. Service *service* will be cycled if not found running. Equivalent to `-e '/sbin/service service stop ; sleep 2 ; /sbin/service service start'` |
| `-d daemon` | Match the running process by *daemon* instead of *service*. Equivalent to `-m '*daemon'` |
| `-e action` | Execute `/bin/bash -c action` if the process IS NOT found running. |
| `-E action` | Execute `/bin/bash -c action` if the process IS found running. |
| `-f` | Full match. Match against the entire command line, not just the first word. |
| `-k` | Kill service. Service *service* will be killed (and the check will fail) if found running. Similar to `pkill -9 service` |
| `-m mstr` | Use *mstr* to search the process list for the service. Default is `*service` |
| `-r` | Restart service. Service *service* will be restarted if not found running. Equivalent to `-e '/sbin/service service restart'` |
| `-s` | Stop service. Service *service* will be stopped (and the check will fail) if found running. Performs `/sbin/service service stop` |
| `-u [!]user` | User match. Matches only processes owned by *user* (or, if negated, NOT owned by *user*). |

Example (look for a root-owned sshd and start if missing): check_ps_service -u root -S sshd


check_ps_time

check_ps_time [-0] [-a] [-f] [-K] [-k] [-l] [-s] [-r value] [-u [!]user] [-m [!]mstr] [-e action] threshold

(Version 1.4+) check_ps_time is a resource consumption check. It flags any/all matching processes whose total utilization of CPU time meets or exceeds the specified threshold. The threshold is a quantity of minutes suffixed by an M and/or a quantity of seconds suffixed by an S. A number with no suffix is interpreted as seconds. Options to this check are as follows:

| Check Option | Purpose |
|--------------|---------|
| `-0` | Non-fatal. Failure of this check will be ignored. |
| `-a` | Find, report, and act on all matching processes. Default behavior is to fail the check after the first matching process. |
| `-e action` | Execute `/bin/bash -c action` if a matching process is found. |
| `-f` | Full match. Match against the entire command line, not just the first word. |
| `-K` | Kill the parent of the matching process (or processes, if used with `-a`) with SIGKILL. (NOTE: Does NOT imply `-k`.) |
| `-k` | Kill the matching process (or processes, if used with `-a`) with SIGKILL. |
| `-l` | Log the matching process (or processes, if used with `-a`) to the NHC log (`$LOGFILE`). |
| `-m mstr` | Look only at processes matching *mstr* (an NHC match string, possibly negated). Default is to check all processes. |
| `-r value` | Renice the matching process (or processes, if used with `-a`) by the specified *value* (may be positive or negative). |
| `-s` | Log the matching process (or processes, if used with `-a`) to the syslog. |
| `-u [!]user` | User match. Matches only processes owned by *user* (or, if negated, NOT owned by *user*). |

Example (look for runawayd daemon process consuming more than a day of CPU time; restart service and continue running): check_ps_time -0 -m '/runawayd/' -e '/sbin/service runawayd restart' 3600m
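A threshold like 90M30S can be reduced to a count of seconds along these lines. This is a hypothetical sketch for illustration; `parse_time_threshold` is not part of NHC, and it assumes uppercase suffixes:

```shell
# Hypothetical sketch of reducing a "<minutes>M<seconds>S" threshold
# (bare number = seconds) to total seconds; not NHC's actual code.
parse_time_threshold() {
    local SPEC=$1 MINS=0 SECS=0
    if [[ "$SPEC" == *M* ]]; then
        MINS=${SPEC%%M*}       # digits before the M
        SPEC=${SPEC#*M}        # remainder after the M
    fi
    SECS=${SPEC%S}             # strip a trailing S, if present
    [[ -z "$SECS" ]] && SECS=0
    echo $(( MINS * 60 + SECS ))
}
```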


check_ps_unauth_users

check_ps_unauth_users [action [actions...]]

check_ps_unauth_users examines all processes running on the system to determine if the owner of each process is authorized to be on the system. Authorized users are anyone with a UID below, by default, 100 (including root) and any users currently running jobs on the node. All other processes are unauthorized. If an unauthorized user process is found, the specified action(s) are taken. The following actions are valid: kill (terminate the process), ignore (do nothing), log (write error to log file and continue), syslog (write error to syslog and continue), or die (fail the check as normal). The default is "die" if no action is specified.

Example (log, syslog, and kill rogue user processes): check_ps_unauth_users log syslog kill


check_ps_userproc_lineage

check_ps_userproc_lineage [action [actions...]]

check_ps_userproc_lineage examines all processes running on the system to check for any processes not owned by an "authorized user" (see previous check) which are not children (directly or indirectly) of the Resource Manager daemon. Refer to the $RM_DAEMON_MATCH configuration variable for how NHC determines the RM daemon process. If such a rogue process is found, the specified action(s) are taken. The following actions are valid: kill (terminate the process), ignore (do nothing), log (write error to log file and continue), syslog (write error to syslog and continue), or die (fail the check as normal). The default is "die" if no action is specified.

Example (mark the node bad on rogue user processes): check_ps_userproc_lineage die


Customization

Once you've fully configured NHC to run the built-in checks you need for your nodes, you're probably at the point where you've thought of something else you wish it could do but currently can't. NHC's design makes it very easy to create additional checks for your site and have NHC load and use them at runtime. This section will detail how to create new checks, where to place them, and what NHC will do with them.

While technically a "check" can be anything the nhc driver script can execute, for consistency and extensibility purposes (as well as usefulness to others), we prefer and recommend that checks be shell functions defined in a distinct, namespaced .nhc file. The instructions contained in this section will assume that this is the model you wish to use.

NOTE: If you do choose to write your own checks, and you feel they might be useful to the NHC community, we encourage you to share them. You can either e-mail them to the NHC Developers' Mailing List or submit a Pull Request on GitHub. GitHub PRs are definitely preferred, but if you choose to use e-mail instead, please provide either individual file attachments or a unified diff (i.e., diff -Nurp) against the NHC tarball/git tree if at all possible (though any usable format will likely be accepted).

Writing Checks

The first decision to be made is what to name your check file. As mentioned above, check files live (by default; see the $INCDIR and $CONFDIR configuration variables) in /etc/nhc/scripts/ and are named something.nhc[2]. A file containing utility and general-purpose functions called common.nhc can be found here. All other files placed here by the upstream package follow the naming convention siteid_class.nhc (e.g., the NHC project's file containing hardware checks is named lbnl_hw.nhc). Your siteid can be anything you'd like (other than lbnl, obviously) but should be recognizable. The class should refer to the subsystem or conceptual group of things you'll be monitoring.

For purposes of this example, we'll pretend we're from John Sheridan University, using site abbreviation "jsu," and we want to write checks for our "stuff."

Your /etc/nhc/scripts/jsu_stuff.nhc file should start with a header which provides a summary of what will be checked, the name and e-mail of the author, possibly the date or other useful information, and any copyright or license restrictions you are placing on the file[3]. It should look something like this:

```shell
# NHC -- John Sheridan University's Checks for Stuff
#
# Your Name <[email protected]>
# Date
#
# Copyright and/or license information if different from upstream
#
```

Next, initialize any variables you will use to sane defaults. This does two things: it provides anyone reading your code a single place to look for "global" variables, and it makes sure you have something to test for later if you need to check the existence of cache data. Make sure your variables are properly namespaced; that is, they should start with a prefix corresponding to your site, the system you're checking, etc.

```shell
# Initialize variables
declare STUFF_NORMAL_VARIABLE=""
declare -a STUFF_ARRAY_VARIABLE=( )
declare -A STUFF_HASH_VARIABLE=( )
```

If your check may run more than once and does anything that's resource-intensive (running subprocesses, file I/O, etc.), you should (in most cases, unless it would cause malfunctions to occur) perform the intensive tasks only once and store the information in one or more shell variables for later use. These should be the variables you just initialized in the section above. They can be arrays or scalars.

```shell
# Function to populate data structures with data
function nhc_stuff_gather_data() {
    # Gather and cache data for later use.
    STUFF_NORMAL_VARIABLE="value"
    STUFF_ARRAY_VARIABLE=( "value" )
    STUFF_HASH_VARIABLE=( [key]="value" )
}
```

Next, you need to write your check function(s). These should be named check_class_purpose where class is the same as used previously ("stuff" for this example), and purpose gives a descriptive name to the check to convey what it checks. Our example will use the obvious-but-potentially-vague "works" as its purpose, but the name you choose will undoubtedly be more clever.

If you have created a data-gathering function as shown above and populated one or more cache variables, the first thing your check should do is see if the cache has been populated already. If not, run your data-gathering function before proceeding with the check.

As for how you write the check...well, that's entirely up to you. It will depend on what you need to check and the available options for doing so. (However, consult the next section for some tips and bashisms to make your checks more efficient.) The example here is clearly a useless and contrived one but should nevertheless be illustrative of the general concept:

```shell
# Check to make sure stuff is functioning properly
function check_stuff_works() {
    # Load cache if empty
    if [[ ${#STUFF_ARRAY_VARIABLE[*]} -eq 0 ]]; then
        nhc_stuff_gather_data
    fi

    # Use cached data
    if [[ "${STUFF_ARRAY_VARIABLE[0]}" = "" ]]; then
        # check failed
        die 1 "Stuff is not working"
        return 1
    fi

    # check passed
    return 0
}
```

If other check functions are needed for a particular subsystem, write those similarly. If you're using a cache, each check should look for (and call the gather function if necessary) the cache variables before doing the actual checking as shown above.

Once you have all the checks you need, you can add them to the configuration file on your node(s), like so:

```
 *  || check_stuff_works
```

Next time NHC runs, it will automatically pick up your new check(s)!

Tips and Best Practices for Checks

Several of the philosophies and underlying principles which governed the design and implementation of the LBNL Node Health Check project were mentioned above in the Introduction. To fulfill these principles, NHC uses certain code constructs that are not typical for the average run-of-the-mill shell script, largely because code which must be highly performant tends not to be written as a shell script. Why? Two reasons: (1) bash lacks a lot of the fancier, more complex features of the dedicated (i.e., non-shell) scripting languages; and (2) many script authors simply don't know about the features bash does offer because they're used so infrequently. It can become a vicious cycle/feedback loop when nobody bothers to learn something specifically because no one else is using it.

So why was bash chosen for this project? Simple: it's everywhere. If you're running Linux, it's almost guaranteed to be there[4]. The same cannot be said of any other scripting or non-compiled language (not even PERL or Python). And forcing everyone to write their checks in C or another compiled language would raise the barrier to entry and reduce the number of sites for which NHC could be useful. Since half the point is getting more places using a common tool (or at least a common framework), that would defeat the purpose. Thus, bash made the most sense.

The important question, then, becomes how to make bash scripts more efficient. And the solution is clear: do as much as possible with native bash constructs instead of shelling out to subcommands like sed, awk, grep, and the other common UNIX swashbucklers. The more one investigates the features bash provides, the more one finds how many of its long-held features tend to go unused and just how much one truly is able to do without the need to fork-and-exec. In this section, several aspects of common shell script constructs (plus 1 or 2 not-so-common ones) will be reviewed along with ways to improve efficiency and avoid subcommands whenever possible.
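As a tiny illustration of the principle, both of the following snippets detect a substring, but only the first pays for a pipeline and an external grep:

```shell
STRING="a pattern appears here"

# Subcommand version: the pipeline spawns a subshell plus /bin/grep
if echo "$STRING" | grep -q pattern; then
    MATCHED_PIPE=1
fi

# Native version: the same test with zero forks
if [[ "$STRING" == *pattern* ]]; then
    MATCHED_NATIVE=1
fi
```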

Arrays

Arrays are an important tool in any sufficiently-capable scripting language. Bash has had support for arrays for quite some time; recent versions even add associative array support (i.e., string-based indexing, akin to hashes in PERL). To maintain compatibility, associative arrays are not currently used in NHC, but traditional arrays are used quite heavily. Though a complete tutorial on arrays in bash is beyond the scope of this document, a brief "cheat sheet" is probably a good idea. So here you go:

| Syntax | Purpose |
|--------|---------|
| `declare -a AVAR` | Declare the shell variable `$AVAR` to be an array (not strictly required, but good form). |
| `AVAR=( ... )` | Assign elements of array `$AVAR` based on the word expansion of the contents of the parentheses. `...` is one or more words of the form `[subscript]=value` or an expression which expands to such. Only the value(s) are required. |
| `${AVAR[subscript]}` | Evaluates to the *subscript*th element of the array `$AVAR`. Array indexes in bash start from 0, just like in C or PERL. *subscript* must evaluate to an integer >= 0. |
| `${#AVAR[*]}` | Evaluates to the number of elements in the array `$AVAR`. |
| `${AVAR[*]}` | Evaluates to all the values in the `$AVAR` array as a single word (like `$*`). Use only where keeping values separate doesn't matter. |
| `"${AVAR[@]}"` | Evaluates to all values in the `$AVAR` array, each as a separate word. This keeps values distinct (just like `$@` vs. `$*`). |
| `"${AVAR[@]:offset:length}"` | Evaluates to the values of `$AVAR` as above, starting at element `${AVAR[offset]}` and including at most *length* elements. *length* may be omitted, and *offset* may be negative. |

A more detailed examination of bash arrays can be found here.
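The cheat sheet entries can be exercised in a few lines:

```shell
declare -a AVAR
AVAR=( alpha beta gamma delta )

NUM_ELEMENTS=${#AVAR[*]}       # 4
THIRD=${AVAR[2]}               # "gamma" (indexes start at 0)
SLICE=( "${AVAR[@]:1:2}" )     # ( beta gamma )
AVAR[${#AVAR[*]}]=epsilon      # append by indexing one past the end
```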

Several examples of array-based techniques will appear in the following sections, so make sure you have a solid grasp on the basic usage of array syntax before continuing.

File I/O

When using the command prompt, most of us reach for things like cat or less when we need to view the contents of a file; thus, our inclination tends to be to reach for the same tools when writing shell scripts. cat, however, is not a bash built-in, so a fork-and-exec is required to spawn /bin/cat just so it can read a file and return the contents. This overhead is negligible for interactive shell usage, and may be a non-issue for many shell-scripting scenarios, but for efficiency-critical scenarios like NHC, we can and should do better!

File input and output (either truncate or append) are both natively supported by bash using the (mostly) well-known Redirection Operators. Rather than reading data from files into variables (arrays or scalars) using command substitution (i.e., the `` and $() operators), use redirection operators to pull the contents of the file into the variable. One technique for doing this is to redirect to the read built-in. So instead of this:

```shell
MOTD=`cat /etc/motd`
```

use:

```shell
read MOTD < /etc/motd
```

bash also allows an even simpler form for using this technique:

```shell
MOTD=$(< /etc/motd)
```

It looks similar to command substitution but uses I/O redirection in place of an actual command. It does, however, still do a fork() and pipe() to do the file I/O. On Linux, this is done via clone() which is fairly lightweight but still not quite as efficient as the read command shown above (which is a bash built-in).

The same syntax can be used to populate array variables with multiple fields' worth of data:

```shell
UPTIME=( $(< /proc/uptime) )
```

This will store the system uptime (in seconds) in the variable ${UPTIME[0]} and the idle time in ${UPTIME[1]}. Declare $UPTIME as an array in advance using declare -a or local -a to make this clearer, and (as always!) make sure to add comments! To avoid the fork() (see above), use read instead:

```shell
read -a UPTIME < /proc/uptime
```

Though not as easy to spot, other subcommands may also be able to be eliminated using this technique. For example, the Linux kernel makes the full hostname of the system available in a file in the /proc filesystem. Knowing this, the hostname command substitution may be eliminated by utilizing the contents of this file:

```shell
read HOSTNAME < /proc/sys/kernel/hostname
```

As an aside... Knowing these tricks may also be helpful in other situations. If you're trying to repair a system in which the root filesystem has become partially corrupted, and the cat command no longer works, this can provide you a way to view the contents of system files directly in your shell!

Line Parsing and Loops

While certainly not as capable as PERL at text processing, the shell does offer some seldom-used features to facilitate the processing of line-oriented input. By default, the shell splits things up based on whitespace (i.e., space characters, tabs, and newlines) to distinguish each "word" from the next. This is why quoting must be used to join arguments which contain spaces to allow them to be treated as single parameters. As with many aspects of the shell, however, this behavior can be customized, allowing for different delimiter characters to be applied to input (typically file I/O). Since character-delimited files are commonplace in UNIX, this idiom is quite frequently useful when shell scripting.

One easily-recognized example would be /etc/passwd. It is both line-oriented and colon-delimited. Parsing its contents is often useful for shell scripts, but most which need this data tend to use awk or cut to pull the appropriate fields. Direct splitting and parsing of this file can be done in native bash without the use of subcommands:

```shell
IFS=':'
while read -a LINE ; do
    THIS_UID=${LINE[2]}
    UIDS[${#UIDS[*]}]=$THIS_UID
    PWDATA_USER[$THIS_UID]="${LINE[0]}"
    PWDATA_GID[$THIS_UID]=${LINE[3]}
    PWDATA_GECOS[$THIS_UID]="${LINE[4]}"
    PWDATA_HOME[$THIS_UID]="${LINE[5]}"
    PWDATA_SHELL[$THIS_UID]="${LINE[6]}"
done < /etc/passwd
IFS=$' \t\n'
```

The above code reads a line at a time from /etc/passwd into the $LINE array. Because the bash Internal Field Separator variable, $IFS, has been set to a colon (':') instead of whitespace, each field of the passwd file will go into a separate element of the $LINE array. The values in $LINE are then used to populate 5 parallel arrays with the userid, GID, GECOS field, home directory, and shell for each user (indexed by UID). It also keeps an array of all the UIDs it has seen. Everything here is done in the same bash process which is executing the script, so it is quite efficient. The $IFS variable is reset to its proper value after the loop completes.

Sometimes, however, the elimination of a subprocess is impractical or impossible. A similar approach may still be used to keep the parsing of the command's output as efficient as possible. For example, a bash-native implementation of the netstat -nap command would be impossible (or at least a very close approximation thereof), so we could use the following method to populate our cache data from its output:

```shell
IFS=$'\n'
LINES=( $(netstat -nap) )

IDX=0
for ((i=0; i<${#LINES[*]}; i++)); do
    IFS=$' \t\n'
    LINE=( ${LINES[$i]} )
    if [[ "${LINE[0]}" != "tcp" && "${LINE[0]}" != "udp" ]]; then
        continue
    fi
    NET_PROTO[$IDX]=${LINE[0]}
    NET_RECVQ[$IDX]=${LINE[1]}
    NET_SENDQ[$IDX]=${LINE[2]}
    NET_LOCADDR[$IDX]=${LINE[3]}
    NET_REMADDR[$IDX]=${LINE[4]}
    if [[ "${NET_PROTO[$IDX]}" == "tcp" ]]; then
        NET_STATE[$IDX]=${LINE[5]}
        NET_PROC[$IDX]=${LINE[6]}
    else
        NET_STATE[$IDX]=""
        NET_PROC[$IDX]=${LINE[5]}
    fi
    if [[ "${NET_PROC[$IDX]}" == */* ]]; then
        IFS='/'
        LINE=( ${NET_PROC[$IDX]} )
        NET_PROCPID[$IDX]=${LINE[0]}
        NET_PROCNAME[$IDX]=${LINE[1]}
    else
        NET_PROCPID[$IDX]='???'
        NET_PROCNAME[$IDX]="unknown"
    fi
    ((IDX++))
done
IFS=$' \t\n'
```

By resetting $IFS to contain only a newline character, we can easily split the command results into individual lines. We place these results into the $LINES array. Each line is then split on the traditional whitespace characters and placed into the $LINE (with no 'S' on the end) array. We're tracking only TCP and UDP sockets here, so everything else (including column headers) gets thrown away. We store each field in our cache arrays, and we even further split one of the fields which uses '/' as a separator. After our loop is complete, we reset $IFS, and we now have a fully-populated set of cache variables containing all our TCP- and UDP-based sockets, all with only 1 fork-and-exec required!

Text Transformations

Bash got a regular expression matching operator in version 3, but it still lacks regex-based transforms. However, with a minimum of extra effort, glob-based transforms can often provide the necessary functionality.

The following basic variable transformations are available:

| Syntax | Purpose |
|--------|---------|
| `${VAR:offset}` | Evaluates to the substring of `$VAR` starting at *offset* and continuing until the end of the string. If *offset* is negative, it is interpreted relative to the end of `$VAR`. |
| `${VAR:offset:length}` | Same as above, but the result will contain at most *length* characters from `$VAR`. |
| `${#VAR}` | Gives the length, in characters, of the value assigned to `$VAR`. |
| `${VAR#pattern}` | Removes the shortest string matching *pattern* from the beginning of `$VAR`. |
| `${VAR##pattern}` | Same as above, but the longest string matching *pattern* is removed. |
| `${VAR%pattern}` | Removes the shortest string matching *pattern* from the end of `$VAR`. |
| `${VAR%%pattern}` | Same as above, but the longest string matching *pattern* is removed. |
| `${VAR/pattern/replacement}` | The first string matching *pattern* in `$VAR` is replaced with *replacement*. *replacement* and the final `/` may be omitted to simply remove the matching string. Patterns starting with `#` or `%` must match the beginning or end (respectively) of `$VAR`. |
| `${VAR//pattern/replacement}` | Same as above, but ALL strings matching *pattern* are replaced/removed. |

So here are some ways the above constructs can be used to do common operations on strings/files:

| Traditional Method | Native bash Method |
|--------------------|--------------------|
| `sed 's/^ *//'` | `while [[ "$LINE" != "${LINE## }" ]]; do LINE="${LINE## }" ; done` |
| `sed 's/ *$//'` | `while [[ "$LINE" != "${LINE%% }" ]]; do LINE="${LINE%% }" ; done` |
| `echo ${LIST[*]} \| fgrep string` | `[[ "${LIST[*]//string}" != "${LIST[*]}" ]]` |
| `tail -1` | `"${LINES[@]: -1}"` (note the space; `:-1` would be a use-default expansion) |
| `cat file \| tr -d '\r'` | `LINES=( "${LINES[@]//$'\r'}" )` |

There are infinitely more, of course, but these should get you thinking along the right lines!
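To see a couple of the entries above in action:

```shell
LINE="   hello world   "

# sed 's/^ *//' equivalent: strip leading spaces
while [[ "$LINE" != "${LINE# }" ]]; do LINE="${LINE# }"; done
# sed 's/ *$//' equivalent: strip trailing spaces
while [[ "$LINE" != "${LINE% }" ]]; do LINE="${LINE% }"; done

# tr -d '\r' equivalent: remove carriage returns
CRLF_LINE=$'some text\r'
CLEANED="${CRLF_LINE//$'\r'}"
```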

Matching

Matching input data against potential or expected patterns is common to all programming, and NHC is no exception. As previously mentioned, however, bash 2.x did not have regular expression matching capability. To abstract this out, NHC's common.nhc file (loaded automatically by nhc when it runs) provides the mcheck_regexp(), mcheck_range(), and mcheck_glob() functions which return 0 (i.e., bash's "true" or "success" value) if the first argument matches the pattern provided as the second argument. To allow for a single matching interface to support all styles of matching, the mcheck() function is also provided. If the pattern is surrounded by slashes (e.g., /pattern/), mcheck() will attempt a regular expression match; if the pattern is surrounded by braces (e.g., {pattern}), a range match is attempted; otherwise, it attempts a glob match. (For older bash versions which lack the regex matching operator, egrep is used instead...which unfortunately will mean additional subshells.) The mcheck() function is used to implement the pattern matching of the first field in nhc.conf as well as all other occurrences of match strings (a.k.a. mstrs) used as check parameters throughout the configuration file.

For consistency with NHC's built-in checks, it is recommended that user-supplied checks which require matching functionality do so by simply calling mcheck string expression and evaluating the return value. If true (i.e., 0), string did match the /regex/, {range}, external match expression, or glob supplied as expression. If false, the match failed.

See the earlier section on Match Strings for details.
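To make the dispatch concrete, here is a deliberately simplified stand-in for mcheck() that handles only the /regex/ and glob cases (NHC's real implementation in common.nhc also supports {range} and external match expressions):

```shell
#!/bin/bash
# Simplified mcheck() sketch: a pattern wrapped in slashes is
# treated as a regular expression; anything else is a glob.
mcheck() {
    local STRING="$1" MSTR="$2"
    if [[ "$MSTR" == /*/ ]]; then
        # Strip the surrounding slashes and do a regex match.
        local RE="${MSTR#/}"
        RE="${RE%/}"
        [[ "$STRING" =~ $RE ]]
    else
        # Unquoted right-hand side of == gives a glob match.
        [[ "$STRING" == $MSTR ]]
    fi
}

mcheck "node042" '/^node[0-9]+$/' && echo "regex matched"
mcheck "node042" 'node*'          && echo "glob matched"
mcheck "login1"  '/^node/'        || echo "no match"
```

As in NHC, callers simply test the return value: 0 means the string matched the expression.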


Footnotes

[1]: Actually, nhc-wrapper will strip "-wrapper" off the end of its name and execute whatever remains, or you can specify a subprogram directly using the -P option on the nhc-wrapper command line. It was intentionally written to be somewhat generic in its operation so as to be potentially useful in wrapping other utilities.

[2]: Previously, any file in that directory got loaded regardless of extension. This is no longer the case, so use of the .nhc extension is now required. This change was made to avoid loading *.nhc.rpmnew files, for example.

[3]: If you don't specify otherwise, all checks made available publicly or directly to the NHC development team are copyrighted by the author and licensed as specified in the BSD-3/LBNL-BSD license used by NHC.

[4]: Well, okay... If you're running enough of Linux that it can function as a compute node. Bootstrap images and other embedded/super-minimal cases aren't really applicable to NHC anyway.

nhc's People

Contributors

basvandervlies, bbbbbrie, ianlee1521, jthiltges, martbhell, martijnkruiten, mej, mrobbert, mslacken, naureensaba, rpabel, saford91, treydock, wickberg


nhc's Issues

IB hw check gets confusing output

When running nhc with check_hw_ib, the output when an error is found is:

ERROR: nhc: Health check failed: check_hw_ib: No IB port hfi1_0:1 is ACTIVE (LinkUp 100 Gb/sec).

I suggest improving the output to something more understandable, e.g.:

ERROR: nhc: Health check failed: check_hw_ib: No IB port hfi1_0:1 is ACTIVE (State:INIT, Physical state:LinkUp, Rate:100Gbps).

Another example:
ERROR: nhc: Health check failed: check_hw_ib: No IB port hfi1_0:1 is ACTIVE (State:DOWN, Physical state:<unknown>, Rate:100Gbps).

Note that if the state is DOWN or INIT, it is not shown in the output of nhc; moreover, NHC reports LinkUp even when the physical state is unknown.

The suggested patch is attached.

0001-Changed-output-format-for-IB-hw-checks.zip

From 0ca568f558354e4ba77f18ba45b84cf70f1a1b10 Mon Sep 17 00:00:00 2001
From: Felip Moll <[email protected]>
Date: Tue, 16 May 2017 15:37:52 +0200
Subject: [PATCH] Changed output format for IB hw checks

---
 scripts/lbnl_hw.nhc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/lbnl_hw.nhc b/scripts/lbnl_hw.nhc
index 29cce8d..b7627b3 100644
--- a/scripts/lbnl_hw.nhc
+++ b/scripts/lbnl_hw.nhc
@@ -353,7 +353,7 @@ function check_hw_ib() {
         RATE=" $RATE Gb/sec"
     fi
 
-    die 1 "$FUNCNAME:  No IB port$DEV is $STATE ($PHYS_STATE$RATE)."
+    die 1 "$FUNCNAME:  No IB port$DEV is $STATE (State:$HW_IB_STATE, Physical state:$HW_IB_PHYS_STATE, Rate:${HW_IB_RATE}Gbps)."
     return 1
 }
 
-- 
2.9.3

Slurm resource manager detected incorrectly as NHC_RM="pbs"

I've installed the RPM lbnl-nhc-1.4.2-1.el7.noarch.rpm on a new cluster which is using the SLURM resource manager. The NHC_RM is unfortunately autodetected as "pbs" because function nhcmain_find_rm() contains this code:
    # Search PATH for commands
    if type -a -p -f -P pbsnodes >&/dev/null ; then
        NHC_RM="pbs"
The problem is that the SLURM (current version 16.05.2) RPM package "slurm-torque" contains a file /usr/bin/pbsnodes for convenience/compatibility reasons.
My suggestion is to replace "pbsnodes" in function nhcmain_find_rm() with "qmgr", which presumably wouldn't exist on a Slurm system:

    if type -a -p -f -P qmgr >&/dev/null ; then
        NHC_RM="pbs"
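As a hedged sketch of the reporter's suggestion (not the actual nhcmain_find_rm() code), the detection could also probe for a Slurm-specific command first, so that the pbsnodes shim shipped by slurm-torque no longer causes a false positive:

```shell
#!/bin/bash
# Sketch: detect the resource manager, preferring Slurm-specific
# commands so that compatibility shims like slurm-torque's
# /usr/bin/pbsnodes do not trigger a false "pbs" detection.
detect_rm() {
    if type -a -p -f -P scontrol >&/dev/null ; then
        echo "slurm"    # scontrol ships only with Slurm
    elif type -a -p -f -P qmgr >&/dev/null ; then
        echo "pbs"      # qmgr is part of TORQUE/PBS proper
    else
        echo "unknown"
    fi
}

NHC_RM=$(detect_rm)
echo "Detected resource manager: $NHC_RM"
```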

NHC on Centos/7.4 - problem generating rpm - fail test lbnl_file.nhc

Hi all,

I am trying to generate the rpm to deploy on our cluster here but the usual procedure I follow is giving me errors on Centos 7.4.
For a straight rpm build I get:

[root@nslurmdb1 nhc]# rpmbuild --rebuild lbnl-nhc-1.4.2-1.el7.src.rpm
Installing lbnl-nhc-1.4.2-1.el7.src.rpm
warning: user mej does not exist - using root
warning: group mej does not exist - using root
warning: user mej does not exist - using root
warning: group mej does not exist - using root
Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.iZv7yu

  • umask 022
  • cd /root/rpmbuild/BUILD
  • cd /root/rpmbuild/BUILD
  • rm -rf lbnl-nhc-1.4.2
  • /usr/bin/gzip -dc /root/rpmbuild/SOURCES/lbnl-nhc-1.4.2.tar.gz
  • /usr/bin/tar -xvvf -
    drwxr-xr-x 1000/1000 0 2015-11-11 21:11 lbnl-nhc-1.4.2/
    -rw-r--r-- 1000/1000 25217 2015-11-11 21:11 lbnl-nhc-1.4.2/aclocal.m4
    -rw-r--r-- 1000/1000 2508 2015-10-14 02:13 lbnl-nhc-1.4.2/LICENSE
    -rw-r--r-- 1000/1000 6340 2015-10-14 02:13 lbnl-nhc-1.4.2/nhc.conf
    drwxr-xr-x 1000/1000 0 2015-11-11 21:11 lbnl-nhc-1.4.2/test/
    -rw-r--r-- 1000/1000 4314 2015-10-14 02:13 lbnl-nhc-1.4.2/test/test_lbnl_nv.nhc
    -rw-r--r-- 1000/1000 11140 2015-10-14 02:13 lbnl-nhc-1.4.2/test/test_lbnl_hw.nhc
    -rw-r--r-- 1000/1000 11131 2015-11-11 21:11 lbnl-nhc-1.4.2/test/Makefile.in
    -rw-r--r-- 1000/1000 2243 2015-10-14 02:13 lbnl-nhc-1.4.2/test/test_lbnl_cmd.nhc
    -rw-r--r-- 1000/1000 14050 2015-10-14 02:13 lbnl-nhc-1.4.2/test/test_common.nhc
    -rw-r--r-- 1000/1000 343 2015-10-14 02:13 lbnl-nhc-1.4.2/test/test_lbnl_moab.nhc
    -rw-r--r-- 1000/1000 20813 2015-11-11 20:56 lbnl-nhc-1.4.2/test/test_lbnl_ps.nhc
    -rw-r--r-- 1000/1000 12770 2015-10-14 02:13 lbnl-nhc-1.4.2/test/test_lbnl_dmi.nhc
    -rw-r--r-- 1000/1000 475 2015-10-14 02:13 lbnl-nhc-1.4.2/test/Makefile.am
    -rw-r--r-- 1000/1000 12745 2015-10-14 02:13 lbnl-nhc-1.4.2/test/test_lbnl_fs.nhc
    -rw-r--r-- 1000/1000 8514 2015-10-14 02:13 lbnl-nhc-1.4.2/test/test_lbnl_net.nhc
    -rw-r--r-- 1000/1000 322 2015-10-14 02:13 lbnl-nhc-1.4.2/test/test_lbnl_job.nhc
    -rw-r--r-- 1000/1000 5148 2015-10-14 02:13 lbnl-nhc-1.4.2/test/shut.inc.sh
    -rw-r--r-- 1000/1000 12362 2015-10-14 02:13 lbnl-nhc-1.4.2/test/test_lbnl_file.nhc
    -rwxr-xr-x 1000/1000 7541 2015-10-14 02:13 lbnl-nhc-1.4.2/test/nhc-test
    drwxr-xr-x 1000/1000 0 2015-11-11 21:11 lbnl-nhc-1.4.2/bench/
    -rw-r--r-- 1000/1000 10768 2015-11-11 21:11 lbnl-nhc-1.4.2/bench/Makefile.in
    -rw-r--r-- 1000/1000 109 2015-10-14 02:13 lbnl-nhc-1.4.2/bench/Makefile.am
    drwxr-xr-x 1000/1000 0 2015-11-11 21:11 lbnl-nhc-1.4.2/contrib/
    -rw-r--r-- 1000/1000 756 2015-10-14 02:10 lbnl-nhc-1.4.2/contrib/nhc.cron
    -rwxr-xr-x 1000/1000 16160 2015-10-14 02:13 lbnl-nhc-1.4.2/nhc-genconf
    -rw-r--r-- 1000/1000 31533 2015-11-11 21:11 lbnl-nhc-1.4.2/Makefile.in
    drwxr-xr-x 1000/1000 0 2015-11-11 21:11 lbnl-nhc-1.4.2/scripts/
    -rw-r--r-- 1000/1000 13215 2015-10-14 02:13 lbnl-nhc-1.4.2/scripts/lbnl_net.nhc
    -rw-r--r-- 1000/1000 20358 2015-10-14 02:13 lbnl-nhc-1.4.2/scripts/common.nhc
    -rw-r--r-- 1000/1000 4819 2015-10-14 02:13 lbnl-nhc-1.4.2/scripts/lbnl_moab.nhc
    -rw-r--r-- 1000/1000 1250 2015-10-14 02:13 lbnl-nhc-1.4.2/scripts/lbnl_nv.nhc
    -rw-r--r-- 1000/1000 19698 2015-10-14 02:13 lbnl-nhc-1.4.2/scripts/lbnl_fs.nhc
    -rw-r--r-- 1000/1000 8272 2015-11-11 20:56 lbnl-nhc-1.4.2/scripts/lbnl_cmd.nhc
    -rw-r--r-- 1000/1000 14024 2015-10-14 02:13 lbnl-nhc-1.4.2/scripts/lbnl_file.nhc
    -rw-r--r-- 1000/1000 3808 2015-10-14 02:13 lbnl-nhc-1.4.2/scripts/lbnl_job.nhc
    -rw-r--r-- 1000/1000 7886 2015-10-14 02:13 lbnl-nhc-1.4.2/scripts/lbnl_dmi.nhc
    -rw-r--r-- 1000/1000 31697 2015-11-11 20:56 lbnl-nhc-1.4.2/scripts/lbnl_ps.nhc
    -rw-r--r-- 1000/1000 14980 2015-10-14 02:13 lbnl-nhc-1.4.2/scripts/lbnl_hw.nhc
    -rwxr-xr-x 1000/1000 216 2015-10-14 02:10 lbnl-nhc-1.4.2/autogen.sh
    -rwxr-xr-x 1000/1000 23116 2015-11-11 20:56 lbnl-nhc-1.4.2/nhc
    -rwxr-xr-x 1000/1000 6873 2015-11-11 21:11 lbnl-nhc-1.4.2/missing
    -rw-r--r-- 1000/1000 952 2015-10-14 02:13 lbnl-nhc-1.4.2/configure.ac
    -rw-r--r-- 1000/1000 2536 2015-11-11 21:11 lbnl-nhc-1.4.2/lbnl-nhc.spec
    -rw-r--r-- 1000/1000 30 2015-10-14 02:13 lbnl-nhc-1.4.2/nhc-test.conf
    -rwxr-xr-x 1000/1000 13997 2015-11-11 21:11 lbnl-nhc-1.4.2/install-sh
    -rw-r--r-- 1000/1000 1394 2015-10-14 02:13 lbnl-nhc-1.4.2/Makefile.am
    -rwxr-xr-x 1000/1000 12669 2015-10-14 02:13 lbnl-nhc-1.4.2/nhc-wrapper
    -rwxr-xr-x 1000/1000 103144 2015-11-11 21:11 lbnl-nhc-1.4.2/configure
    -rw-r--r-- 1000/1000 50966 2015-10-14 02:10 lbnl-nhc-1.4.2/ChangeLog
    drwxr-xr-x 1000/1000 0 2015-11-11 21:11 lbnl-nhc-1.4.2/helpers/
    -rw-r--r-- 1000/1000 5383 2015-10-14 02:13 lbnl-nhc-1.4.2/helpers/node-mark-offline
    -rw-r--r-- 1000/1000 5402 2015-10-14 02:13 lbnl-nhc-1.4.2/helpers/node-mark-online
    -rw-r--r-- 1000/1000 31 2015-10-14 02:10 lbnl-nhc-1.4.2/COPYING
    -rw-r--r-- 1000/1000 2548 2015-11-11 21:07 lbnl-nhc-1.4.2/lbnl-nhc.spec.in
    -rw-r--r-- 1000/1000 95 2015-10-14 02:13 lbnl-nhc-1.4.2/nhc.logrotate
  • STATUS=0
  • '[' 0 -ne 0 ']'
  • cd lbnl-nhc-1.4.2
  • /usr/bin/chmod -Rf a+rX,u+w,g-w,o-w .
  • exit 0
    Executing(%build): /bin/sh -e /var/tmp/rpm-tmp.TIVW18
  • umask 022
  • cd /root/rpmbuild/BUILD
  • cd lbnl-nhc-1.4.2
  • CFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic'
  • export CFLAGS
  • CXXFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic'
  • export CXXFLAGS
  • FFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -I/usr/lib64/gfortran/modules'
  • export FFLAGS
  • FCFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -I/usr/lib64/gfortran/modules'
  • export FCFLAGS
  • LDFLAGS='-Wl,-z,relro '
  • export LDFLAGS
  • '[' 1 == 1 ']'
  • '[' x86_64 == ppc64le ']'
    ++ find . -name config.guess -o -name config.sub
  • ./configure --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info
    configure: WARNING: unrecognized options: --disable-dependency-tracking
    checking for a BSD-compatible install... /usr/bin/install -c
    checking whether build environment is sane... yes
    checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
    checking for gawk... gawk
    checking whether make sets $(MAKE)... yes
    checking whether make supports nested variables... yes
    checking for Git version description... 1.4.2
    checking that generated files are newer than configure... done
    configure: creating ./config.status
    config.status: creating Makefile
    config.status: creating bench/Makefile
    config.status: creating test/Makefile
    config.status: creating lbnl-nhc.spec
    configure: WARNING: unrecognized options: --disable-dependency-tracking
  • /usr/bin/make
    Making all in bench
    make[1]: Entering directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2/bench'
    make[1]: Nothing to be done for `all'.
    make[1]: Leaving directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2/bench'
    Making all in test
    make[1]: Entering directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2/test'
    make[1]: Nothing to be done for `all'.
    make[1]: Leaving directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2/test'
    make[1]: Entering directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2'
    make[1]: Nothing to be done for `all-am'.
    make[1]: Leaving directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2'
  • exit 0
    Executing(%install): /bin/sh -e /var/tmp/rpm-tmp.YjEJzR
  • umask 022
  • cd /root/rpmbuild/BUILD
  • '[' /root/rpmbuild/BUILDROOT/lbnl-nhc-1.4.2-1.el7.centos.x86_64 '!=' / ']'
  • rm -rf /root/rpmbuild/BUILDROOT/lbnl-nhc-1.4.2-1.el7.centos.x86_64
    ++ dirname /root/rpmbuild/BUILDROOT/lbnl-nhc-1.4.2-1.el7.centos.x86_64
  • mkdir -p /root/rpmbuild/BUILDROOT
  • mkdir /root/rpmbuild/BUILDROOT/lbnl-nhc-1.4.2-1.el7.centos.x86_64
  • cd lbnl-nhc-1.4.2
  • umask 0077
  • /usr/bin/make install DESTDIR=/root/rpmbuild/BUILDROOT/lbnl-nhc-1.4.2-1.el7.centos.x86_64
    Making install in bench
    make[1]: Entering directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2/bench'
    make[2]: Entering directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2/bench'
    make[2]: Nothing to be done for `install-exec-am'.
    make[2]: Nothing to be done for `install-data-am'.
    make[2]: Leaving directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2/bench'
    make[1]: Leaving directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2/bench'
    Making install in test
    make[1]: Entering directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2/test'
    make[2]: Entering directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2/test'
    make[2]: Nothing to be done for `install-exec-am'.
    make[2]: Nothing to be done for `install-data-am'.
    make[2]: Leaving directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2/test'
    make[1]: Leaving directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2/test'
    make[1]: Entering directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2'
    make[2]: Entering directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2'
    /usr/bin/mkdir -p '/root/rpmbuild/BUILDROOT/lbnl-nhc-1.4.2-1.el7.centos.x86_64/usr/sbin'
    /usr/bin/install -c nhc nhc-genconf nhc-wrapper '/root/rpmbuild/BUILDROOT/lbnl-nhc-1.4.2-1.el7.centos.x86_64/usr/sbin'
    /usr/bin/mkdir -p '/root/rpmbuild/BUILDROOT/lbnl-nhc-1.4.2-1.el7.centos.x86_64/etc/logrotate.d' '/root/rpmbuild/BUILDROOT/lbnl-nhc-1.4.2-1.el7.centos.x86_64/var/lib/nhc' '/root/rpmbuild/BUILDROOT/lbnl-nhc-1.4.2-1.el7.centos.x86_64/var/run/nhc'
    /usr/bin/install -c -m 644 ./nhc.logrotate '/root/rpmbuild/BUILDROOT/lbnl-nhc-1.4.2-1.el7.centos.x86_64/etc/logrotate.d/nhc'
    /usr/bin/mkdir -p '/root/rpmbuild/BUILDROOT/lbnl-nhc-1.4.2-1.el7.centos.x86_64/etc/nhc'
    /usr/bin/install -c -m 644 nhc.conf '/root/rpmbuild/BUILDROOT/lbnl-nhc-1.4.2-1.el7.centos.x86_64/etc/nhc'
    /usr/bin/mkdir -p '/root/rpmbuild/BUILDROOT/lbnl-nhc-1.4.2-1.el7.centos.x86_64/usr/libexec/nhc'
    /usr/bin/install -c helpers/node-mark-online helpers/node-mark-offline '/root/rpmbuild/BUILDROOT/lbnl-nhc-1.4.2-1.el7.centos.x86_64/usr/libexec/nhc'
    /usr/bin/mkdir -p '/root/rpmbuild/BUILDROOT/lbnl-nhc-1.4.2-1.el7.centos.x86_64/etc/nhc'
    /usr/bin/mkdir -p '/root/rpmbuild/BUILDROOT/lbnl-nhc-1.4.2-1.el7.centos.x86_64/etc/nhc/scripts'
    /usr/bin/install -c -m 644 scripts/lbnl_cmd.nhc scripts/common.nhc scripts/lbnl_dmi.nhc scripts/lbnl_file.nhc scripts/lbnl_fs.nhc scripts/lbnl_hw.nhc scripts/lbnl_job.nhc scripts/lbnl_moab.nhc scripts/lbnl_net.nhc scripts/lbnl_nv.nhc scripts/lbnl_ps.nhc '/root/rpmbuild/BUILDROOT/lbnl-nhc-1.4.2-1.el7.centos.x86_64/etc/nhc/scripts'
    make[2]: Leaving directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2'
    make[1]: Leaving directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2'
  • /usr/lib/rpm/find-debuginfo.sh --strict-build-id -m --run-dwz --dwz-low-mem-die-limit 10000000 --dwz-max-die-limit 110000000 /root/rpmbuild/BUILD/lbnl-nhc-1.4.2
    /usr/lib/rpm/sepdebugcrcfix: Updated 0 CRC32s, 0 CRC32s did match.
  • /usr/lib/rpm/check-buildroot
  • /usr/lib/rpm/redhat/brp-compress
  • /usr/lib/rpm/redhat/brp-strip-static-archive /usr/bin/strip
  • /usr/lib/rpm/brp-python-bytecompile /usr/bin/python 1
  • /usr/lib/rpm/redhat/brp-python-hardlink
  • /usr/lib/rpm/redhat/brp-java-repack-jars
    Executing(%check): /bin/sh -e /var/tmp/rpm-tmp.Z6sEMA
  • umask 022
  • cd /root/rpmbuild/BUILD
  • cd lbnl-nhc-1.4.2
  • /usr/bin/make test
    /usr/bin/make -C test test
    make[1]: Entering directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2/test'
    Running unit tests for NHC:
      nhcmain_init_env...ok 6/6
      nhcmain_help...ok 8/8
      nhcmain_parse_cmdline...ok 16/16
      nhcmain_finalize_env...ok 14/14
      nhcmain_check_conffile...ok 1/1
      nhcmain_load_scripts...ok 6/6
      nhcmain_set_watchdog...ok 1/1
      nhcmain_watchdog_timer...ok 3/3
      nhcmain_run_checks...ok 2/2
      common.nhc...ok 108/108
      lbnl_cmd.nhc...ok 13/13
      lbnl_dmi.nhc...ok 45/45
      lbnl_file.nhc...../scripts/lbnl_file.nhc: line 83: /dev/fd/63: No such file or directory
      failed 4/71
      TEST FAILED: Single regexp match success: Got "1" but expected "0"
    make[1]: *** [test] Error 255
    make[1]: Leaving directory `/root/rpmbuild/BUILD/lbnl-nhc-1.4.2/test'
    make: *** [test] Error 2
    error: Bad exit status from /var/tmp/rpm-tmp.Z6sEMA (%check)

RPM build errors:
user mej does not exist - using root
group mej does not exist - using root
user mej does not exist - using root
group mej does not exist - using root
Bad exit status from /var/tmp/rpm-tmp.Z6sEMA (%check)

Running autogen.sh, configure, and make test from a git clone does not give this error.

I did not have this problem on Centos 7.3.

Any idea on what is going on?

Best Regards,

Miguel Afonso Oliveira

check_dmi_data_match fails because of whitespaces

The autogenerated nhc.conf on some of our nodes contains spaces in check_dmi_data_match tests like this:

check_dmi_data_match -h 0x0400 -t 4 "Processor Information: Version: Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz"

The string is later read in and passed around, and when it is fed to eval, the whitespace is collapsed:

[1510827263] - DEBUG:  Glob match check:  Processor Information: Version: Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz does not match Processor Information: Version: Intel(R) Xeon(R) CPU X5550 @ 2.67GHz

The easiest workaround I found is escaping all whitespace twice:

check_dmi_data_match -h 0x0401 -t 4 "Processor\\ Information:\\ Version:\\ Intel(R)\\ Xeon(R)\\ CPU\\ \\ \\ \\ \\ \\ \\ \\ \\ \\ \\ X5550\\ \\ @\\ 2.67GHz"

Now the tests run just fine:

Running check:  "check_dmi_data_match -h 0x0400 -t 4 "Processor\ Information:\ Version:\ Intel(R)\ Xeon(R)\ CPU\ \ \ \ \ \ \ \ \ \ \ X5550\ \ @\ 2.67GHz"

I've adapted nhc-genconf to change the output when two consecutive whitespace characters are found:

diff -ruN nhc.orig/nhc-genconf nhc/nhc-genconf
--- nhc.orig/nhc-genconf        2017-11-16 11:38:10.872479034 +0100
+++ nhc/nhc-genconf     2017-11-16 11:37:43.041458469 +0100
@@ -248,7 +248,12 @@
             IFS=$' \t\n'
             for ((i=0; i<${#LINES[*]}; i++)); do
                 if mcheck "${LINES[$i]}" "$DMI_MATCH"; then
-                    echo "# $HOSTNAME || check_dmi_data_match -h $HANDLE ${DMI_TYPE_IDS[$HANDLE]:+-t ${DMI_TYPE_IDS[$HANDLE]}} \"${LINES[i]}\""
+                    if [[ "${LINES[$i]}" == *"  "* ]]; then
+                        ESCAPED=$(echo "${LINES[i]}" | sed 's/\([ ]\)/\\\\\1/g')
+                        echo "# $HOSTNAME || check_dmi_data_match -h $HANDLE ${DMI_TYPE_IDS[$HANDLE]:+-t ${DMI_TYPE_IDS[$HANDLE]}} \"${ESCAPED}\""
+                    else
+                        echo "# $HOSTNAME || check_dmi_data_match -h $HANDLE ${DMI_TYPE_IDS[$HANDLE]:+-t ${DMI_TYPE_IDS[$HANDLE]}} \"${LINES[i]}\""
+                    fi
                 fi
             done
         done

But maybe this should be fixed at the point where the whitespace is collapsed due to missing quotation marks?

Roland
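The underlying behavior is easy to reproduce in plain bash, independent of NHC's code paths (a minimal demo; the function names are made up for illustration): when eval re-parses an unquoted expansion, word splitting collapses runs of whitespace, while a quoted expansion preserves them.

```shell
#!/bin/bash
ARG='CPU           X5550'

# Unquoted: $* is word-split before eval re-parses it, so the run
# of internal spaces collapses to a single space.
collapsed() { eval echo $*; }

# Quoted: eval sees  echo "$1"  and the whitespace survives intact.
preserved() { eval echo \"\$1\"; }

collapsed "$ARG"    # -> CPU X5550
preserved "$ARG"    # -> CPU           X5550
```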

nhc-genconf errors: nhc_common_unparse_size: command not found

An unmodified install dumps the following error when calling nhc-genconf:

[root@yslogin6 ~]# /usr/local/sbin/nhc-genconf -H '*' -c -
# NHC Configuration File
#
# Lines are in the form "<hostmask>||<check>"
# Hostmask is a glob, /regexp/, or {noderange}
# Comments begin with '#'
#
# This file was automatically generated by nhc-genconf
# Thu Jun 7 10:29:00 MDT 2018
#

#######################################################################
###
### NHC Configuration Variables
###
# * || export MARK_OFFLINE=1 NHC_CHECK_ALL=0


#######################################################################
###
### Hardware checks
###
 * || check_hw_cpuinfo   
/usr/local/sbin/nhc-genconf: line 330: nhc_common_unparse_size: command not found
 * || check_hw_physmem   3%
/usr/local/sbin/nhc-genconf: line 334: nhc_common_unparse_size: command not found
 * || check_hw_swap   3%


#######################################################################
###
### nVidia GPU checks
###
 * || check_nv_healthmon

OS:

[root@yslogin6 ~]# lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.4.1708 (Core) 
Release:        7.4.1708
Codename:       Core

Git: Installed tag 1.4.2

NHC should check whether another nhc is already running

When a check interacts with a failing filesystem, the NHC process can stay in the D (uninterruptible sleep) state forever. There should be an option to disallow concurrent NHC runs in order to avoid situations like this:

root     344249  0.0  0.0  15940  4980 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344405  0.0  0.0  15940  4820 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344433  0.0  0.0  15940  4952 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344445  0.0  0.0  15940  4832 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344458  0.0  0.0  15940  4984 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344487  0.0  0.0  15940  4952 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344500  0.0  0.0  15940  4980 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344515  0.0  0.0  15940  4944 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344542  0.0  0.0  15940  4940 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344555  0.0  0.0  15940  5004 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344571  0.0  0.0  15940  4832 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344598  0.0  0.0  15940  4992 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344611  0.0  0.0  15940  4924 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344625  0.0  0.0  15940  4840 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344654  0.0  0.0  15940  4836 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344666  0.0  0.0  15940  4952 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344680  0.0  0.0  15940  4824 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344709  0.0  0.0  15940  4792 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344722  0.0  0.0  15940  4792 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344736  0.0  0.0  15940  4952 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344764  0.0  0.0  15940  4952 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344777  0.0  0.0  15940  4928 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344792  0.0  0.0  15940  4812 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344819  0.0  0.0  15940  4904 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344832  0.0  0.0  15940  4952 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344846  0.0  0.0  15940  4980 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344875  0.0  0.0  15940  4808 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344887  0.0  0.0  15940  4824 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344901  0.0  0.0  15940  4968 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344930  0.0  0.0  15940  4952 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344943  0.0  0.0  15940  4932 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344957  0.0  0.0  15940  4796 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344985  0.0  0.0  15940  4840 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     344998  0.0  0.0  15940  4980 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345013  0.0  0.0  15940  4960 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345040  0.0  0.0  15940  4840 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345053  0.0  0.0  15940  4908 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345067  0.0  0.0  15940  4956 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345096  0.0  0.0  15940  4824 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345108  0.0  0.0  15940  4968 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345122  0.0  0.0  15940  4996 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345151  0.0  0.0  15940  4852 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345164  0.0  0.0  15940  4952 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345177  0.0  0.0  15940  4836 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345206  0.0  0.0  15940  4964 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345219  0.0  0.0  15940  4820 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345234  0.0  0.0  15940  4836 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345261  0.0  0.0  15940  4796 ?        Ds   Jul07   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345621  0.0  0.0  15940  4932 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345649  0.0  0.0  15940  4972 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345661  0.0  0.0  15940  4956 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345675  0.0  0.0  15940  4796 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345705  0.0  0.0  15940  4956 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345718  0.0  0.0  15940  4836 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345731  0.0  0.0  15940  4948 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345760  0.0  0.0  15940  4952 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345773  0.0  0.0  15940  4904 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345788  0.0  0.0  15940  4996 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345815  0.0  0.0  15940  4908 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345828  0.0  0.0  15940  4824 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345843  0.0  0.0  15940  4956 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345871  0.0  0.0  15940  4796 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345883  0.0  0.0  15940  4836 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345897  0.0  0.0  15940  4796 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345926  0.0  0.0  15940  4812 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345939  0.0  0.0  15940  4948 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345952  0.0  0.0  15940  4904 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345981  0.0  0.0  15940  4808 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     345994  0.0  0.0  15940  4996 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346009  0.0  0.0  15940  4936 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346036  0.0  0.0  15940  4984 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346049  0.0  0.0  15940  4996 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346063  0.0  0.0  15940  4972 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346092  0.0  0.0  15940  4792 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346104  0.0  0.0  15940  4840 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346119  0.0  0.0  15940  4812 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346147  0.0  0.0  15940  4980 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346160  0.0  0.0  15940  4940 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346173  0.0  0.0  15940  4980 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346202  0.0  0.0  15940  4904 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346215  0.0  0.0  15940  4956 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346229  0.0  0.0  15940  4908 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346257  0.0  0.0  15940  4820 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346270  0.0  0.0  15940  4996 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346283  0.0  0.0  15940  4924 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346312  0.0  0.0  15940  4920 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346323  0.0  0.0  15940  4836 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346338  0.0  0.0  15940  4824 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346366  0.0  0.0  15940  4960 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346378  0.0  0.0  15940  4960 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346391  0.0  0.0  15940  4980 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346421  0.0  0.0  15940  4956 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346434  0.0  0.0  15940  4960 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346449  0.0  0.0  15940  4824 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346476  0.0  0.0  15940  4948 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346489  0.0  0.0  15940  4928 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346504  0.0  0.0  15940  4808 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346532  0.0  0.0  15940  4972 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346545  0.0  0.0  15940  4948 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346559  0.0  0.0  15940  4948 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346587  0.0  0.0  15940  4820 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346600  0.0  0.0  15940  4812 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346615  0.0  0.0  15940  4960 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346643  0.0  0.0  15940  4904 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346655  0.0  0.0  15940  4972 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346669  0.0  0.0  15940  4828 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346698  0.0  0.0  15940  4980 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346711  0.0  0.0  15940  4956 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346725  0.0  0.0  15940  4940 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346753  0.0  0.0  15940  4944 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346766  0.0  0.0  15940  4832 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346781  0.0  0.0  15940  4948 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346808  0.0  0.0  15940  4944 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346821  0.0  0.0  15940  4836 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346836  0.0  0.0  15940  4936 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346864  0.0  0.0  15940  4832 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346876  0.0  0.0  15940  4980 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346891  0.0  0.0  15940  4792 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346919  0.0  0.0  15940  4832 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346932  0.0  0.0  15940  4968 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346946  0.0  0.0  15940  4924 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346975  0.0  0.0  15940  4924 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     346988  0.0  0.0  15940  4944 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     347002  0.0  0.0  15940  4840 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     347030  0.0  0.0  15940  4956 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     347043  0.0  0.0  15940  4944 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     347058  0.0  0.0  15940  4972 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     347086  0.0  0.0  15940  5000 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     347098  0.0  0.0  15940  4836 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     347112  0.0  0.0  15940  4944 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     347141  0.0  0.0  15940  4936 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     347154  0.0  0.0  15940  4832 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     347167  0.0  0.0  15940  4960 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     347196  0.0  0.0  15940  4932 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     347209  0.0  0.0  15940  4996 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     347223  0.0  0.0  15940  4980 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     347251  0.0  0.0  15940  4792 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     347264  0.0  0.0  15940  4996 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     347281  0.0  0.0  16964  4840 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     354366  0.0  0.0  15940  4840 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     354379  0.0  0.0  15940  4964 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     354391  0.0  0.0  15940  4824 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     354420  0.0  0.0  15940  4964 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     354433  0.0  0.0  15940  4960 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     354448  0.0  0.0  15940  4792 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     354476  0.0  0.0  15940  4956 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     354489  0.0  0.0  15940  4836 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     354504  0.0  0.0  16964  4936 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     354534  0.0  0.0  15940  4948 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     354547  0.0  0.0  15940  4796 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30
root     354561  0.0  0.0  16964  4984 ?        Ds   Jul08   0:00 /bin/bash /usr/sbin/nhc -q -t 30

Optimize nhc_hw_gather_data for gathering CPU information on KNL nodes

While the code is perfectly functional in its current state, the function nhc_hw_gather_data can take upwards of 40-60 seconds on multithreaded KNL nodes. It would be nice to optimize this portion of the check so that we aren't forced to wait tens of seconds to grab CPU information.

If it matters, we are using lbnl-nhc-1.4.2-1 and have tried the -f flag, which doesn't appear to help.

[root@sknl0705 ~]# /usr/sbin/nhc -f
ERROR: nhc: Health check failed: Script timed out while executing "check_hw_cpuinfo 1 68 272".
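For reference, the socket/core/thread counts that check_hw_cpuinfo compares against can be gathered in a single awk pass over /proc/cpuinfo, avoiding per-CPU subshell work that gets slow at 272 logical CPUs. A minimal sketch (the helper name is made up for illustration; this is not NHC's actual nhc_hw_gather_data):

```shell
# Count sockets, cores, and threads in one pass over cpuinfo-format data.
# Hypothetical helper for illustration only -- not NHC's implementation.
cpu_counts() {
    awk -F: '
        /^processor/   { threads++ }
        /^physical id/ { phys = $2 + 0; if (!(phys in socks)) { socks[phys] = 1; ns++ } }
        /^core id/     { key = phys "," ($2 + 0); if (!(key in cores)) { cores[key] = 1; nc++ } }
        END { printf "%d %d %d\n", ns, nc, threads }
    ' "$1"
}

# Small sample: 1 socket, 2 cores, 2 threads per core.
sample=$(mktemp)
cat > "$sample" <<'EOF'
processor : 0
physical id : 0
core id : 0
processor : 1
physical id : 0
core id : 1
processor : 2
physical id : 0
core id : 0
processor : 3
physical id : 0
core id : 1
EOF
counts=$(cpu_counts "$sample")
echo "$counts"    # 1 2 4
rm -f "$sample"
```

On a real node the function would be pointed at /proc/cpuinfo; the whole pass is one fork regardless of thread count.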

NHC must understand the Slurm node state "resv" (Reserved)

We're installing some new nodes in our Slurm cluster and their fabric cables are not yet in place, so the Node Health Check (NHC) gives an error as expected:

[root@b001 ~]# nhc
ERROR: nhc: Health check failed: check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).

However, because we have temporarily set the Slurm state of these nodes to "resv" (Reserved), some warning messages are printed in /var/log/nhc.log:

ERROR: nhc: Health check failed: check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).
20190409 13:20:33 /usr/libexec/nhc/node-mark-offline b001 check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).
/usr/libexec/nhc/node-mark-offline: Not sure how to handle node state "resv" on b001

I would like to request the addition of Slurm state "resv" to the /usr/libexec/nhc/node-mark-offline script as in this diff:

--- /usr/libexec/nhc/node-mark-offline.orig	2015-11-11 22:46:52.000000000 +0100
+++ /usr/libexec/nhc/node-mark-offline	2019-04-09 13:29:48.587902690 +0200
@@ -63,7 +63,7 @@
 OLD_NOTE_LEADER="${LINE[1]}"
 OLD_NOTE="${LINE[*]:2}"
 case "$STATUS" in
-    alloc*|comp*|drain*|drng*|fail*|idle*|maint*|mix*|resume*|undrain*)
+    resv*|alloc*|comp*|drain*|drng*|fail*|idle*|maint*|mix*|resume*|undrain*)
         case "$STATUS" in
             drain*|drng*|fail*|maint*)
                 # If the node is already offline, and there is no old note, and

With this change I do get the expected behavior of NHC, and the nhc.log shows:

ERROR: nhc: Health check failed: check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).
20190409 13:29:51 /usr/libexec/nhc/node-mark-offline b001 check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).
/usr/libexec/nhc/node-mark-offline: Marking resv b001 offline: NHC: check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).

See also this Slurm bug report: https://bugs.schedmd.com/show_bug.cgi?id=6816
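For what it's worth, the case patterns in node-mark-offline are bash globs, so a `resv*` entry would also cover any suffixed variants of the state string. A small illustration of how the matching behaves (the helper name is made up; the real script does more than classify):

```shell
# Sketch of how the glob patterns in the case statement classify Slurm
# node states. "classify_state" is an illustrative name only.
classify_state() {
    case "$1" in
        resv*|alloc*|comp*|drain*|drng*|fail*|idle*|maint*|mix*|resume*|undrain*)
            echo "handled" ;;
        *)
            echo "unknown" ;;
    esac
}

classify_state resv    # handled
classify_state down    # unknown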

Thanks,
Ole

syntax error: invalid arithmetic operator from ` * || check_ps_service -u root -S sshd`

I haven't spent a ton of time with NHC, but I am finding one of the default NHC tests isn't working on my system.

Test is: * || check_ps_service -u root -S sshd
Error is: /etc/nhc/scripts/common.nhc: line 381: IntelOPA-Basic.RHEL72-x86_64.10.1.1.0.9: syntax error: invalid arithmetic operator (error token is ".RHEL72-x86_64.10.1.1.0.9")

Does this mean anything to you? I commented the test out, but at some point I would like to start building up the tests at our center. This one is obviously basic.
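For context, that bash diagnostic is produced whenever a non-numeric string reaches an arithmetic context; the Intel OPA package name in the message suggests a version string leaked into a numeric comparison somewhere in common.nhc. A minimal reproduction (illustrative only, not NHC code):

```shell
# The '.' in a version-like string is an invalid arithmetic operator, so
# evaluating the string numerically reproduces the reported error.
msg=$(bash -c 'ver="IntelOPA-Basic.RHEL72-x86_64.10.1.1.0.9"; (( ver > 0 ))' 2>&1)
echo "$msg"
```

The usual fix on the script side is to sanitize or string-compare such values before they ever reach `(( ))` or `$(( ))`.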

nhc.conf

Hi,

I'm not sure if this is the right place to ask.

I added 2 additional compute nodes to my existing Slurm cluster and installed all the required software (NHC, Slurm, etc.) using Ansible, but both new nodes show 'drain' status because of the error NHC: check_fs_mount: /run/user/1000 not mounted.

Then I found that the 2 new nodes have the following line in nhc.conf:

testnode 1 || check_fs_mount_rw -t "tmpfs" -s "tmpfs" -f "/run/user/1000"

I commented this line out and the 2 new nodes went back to normal. I don't know how that line got added to nhc.conf. I looked at the nhc-genconf file but couldn't find the reason either.

When I first created the cluster we used the same Ansible code to install NHC, and the nodes' nhc.conf did not contain that /run/user/1000 line. All the nodes were created from the same CentOS image.

I'd appreciate it if you could help.

Thanks very much,
Yi
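(For anyone hitting the same thing: systemd creates a per-login tmpfs at /run/user/&lt;uid&gt;, so if someone is logged in while nhc-genconf runs, that ephemeral mount gets captured into the generated config. A hedged sketch of filtering such mounts out of a /proc/mounts-style list before generating checks; the helper name is made up and this is not part of nhc-genconf:)

```shell
# Illustrative filter: drop per-login tmpfs mounts (/run/user/<uid>) from
# /proc/mounts-style input before turning lines into check_fs_mount checks.
filter_ephemeral_mounts() {
    grep -Ev '^[^ ]+ /run/user/[0-9]+ '
}

printf '%s\n' \
    'tmpfs /run/user/1000 tmpfs rw,nosuid,nodev 0 0' \
    '/dev/sda1 / ext4 rw,relatime 0 0' | filter_ephemeral_mounts
```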

NHC seems to be ignoring 'NHC_RM=slurm' in nhc.conf

Hi Michael,

I've added this line in nhc.conf to use slurm as the RM:

* || export NHC_RM=slurm

However, NHC still seems to be checking /var/spool/torque (which also exists on our systems), and deciding that we must be using torque:

nhc -x -c /etc/nhc/nhc.conf | grep NHC_RM
...
+ local -a DIRLIST
+ [[ -d /var/spool/torque ]]
+ NHC_RM=pbs
+ return 0
...
(truncated)

to confirm:

# grep NHC_RM /etc/nhc/nhc.conf
   * || export NHC_RM=slurm

Is this by design or a bug? I'd think once explicitly defined in nhc.conf, nhc would not need to perform any further checks to identify the RM... Skipping this check would also save some cycles ;)
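One common shell idiom for honoring an explicit setting and auto-detecting only as a fallback is default-value parameter expansion. A sketch (detect_rm and its detection order are illustrative, not NHC's actual code):

```shell
# Only run detection when NHC_RM has not already been set by the admin.
detect_rm() {
    if [ -d /var/spool/torque ]; then
        echo pbs
    elif command -v scontrol >/dev/null 2>&1; then
        echo slurm
    else
        echo none
    fi
}

NHC_RM=slurm                        # e.g. "* || export NHC_RM=slurm" in nhc.conf
NHC_RM="${NHC_RM:-$(detect_rm)}"    # detection is skipped: already set
echo "$NHC_RM"                      # slurm
```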

Thanks!

No configure script in tarball

I'm installing on Ubuntu, so I tried to follow the instructions to install from source, which just say to download the latest tarball and run configure. That would be fine, but there is no configure script in the tarball. There is a configure.in, and I was able to generate the configure script with autoconf, but either the documentation needs to be updated or the tarball needs to be created after generating the configure script.

e-mail address [email protected] does not work

The address is mentioned in several places in the README.

I got this from the Mail Delivery Subsystem:

From: Mail Delivery Subsystem 
To: johan.guldmyr@domain
X-Failed-Recipients: [email protected]
Subject: Delivery Status Notification (Failure)
Message-ID: <[email protected]>
Date: Wed, 27 Jan 2016 07:14:10 +0000
Content-Type: text/plain; charset=UTF-8
X-Bayes-Prob: 0.0001 (Score 0, tokens from: 01_Tag_Only, default, @@RPTN)
X-Spam-Score: 0.00 () [Tag at 5.00] FREEMAIL_FROM:0.001
X-CanIt-Geo: ip=2607:f8b0:400e:c00::242; country=US; region=Oregon; city=The Dalles; latitude=45.5447; longitude=-121.1543; http://maps.google.com/maps?q=45.5447,-121.1543&z=6
X-CanItPRO-Stream: 01_Tag_Only (inherits from default)
X-Canit-Stats-ID: 01QaTwhro - d13946051ad9
X-Scanned-By: CanIt (www . roaringpenguin . com)

Hello johan.guldmyr@domain

We're writing to let you know that the group you tried to contact (nhc) may not exist, or you may not have permission to post messages to the group. A few more details on why you weren't able to post:

 * You might have spelled or formatted the group name incorrectly.
 * The owner of the group may have removed this group.
 * You may need to join the group before receiving permission to post.
 * This group may not be open to posting.

If you have questions related to this or any other Google Group, visit the Help Center at https://support.google.com/a/lbl.gov/bin/topic.py?topic=25838.

Thanks,

lbl.gov admins

check_ps_userproc_lineage prints with UID instead of username

I currently have check_ps_userproc_lineage set to log and syslog and noticed that the log messages all use a user's UID instead of the username. My read of the code is that username is supposed to be what's printed. Seems that check_ps_unauth_users is correctly using username.

Example from nhc -l- -v:

Running check:  "check_ps_userproc_lineage log syslog"
check_ps_userproc_lineage:  20705's "sh" process is rogue. (PID 27519)

Example from syslog:

Mar  3 16:28:02 o0199 nhc[52778]: check_ps_userproc_lineage:  20705's "sh" process is rogue. (PID 27519)
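For reference, turning a UID back into a username is a one-liner with getent; something along these lines (the helper name is illustrative, not NHC's code) is presumably what the check intends to print:

```shell
# Resolve a numeric UID to a username, falling back to the UID itself
# when no passwd entry exists. Assumes getent(1) is available.
uid_to_name() {
    name=$(getent passwd "$1" | cut -d: -f1)
    echo "${name:-$1}"
}

uid_to_name 0             # root on virtually every system
uid_to_name 4000000000    # no passwd entry: prints the UID back
```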

potential space leak in SGE load sensor loop

With the version of bash in RHEL6 (and presumably others), an unbounded space leak appears if NHC is used as an SGE load sensor. Running under (a rebuild of) Fedora's bash 4.3 is OK; I haven't tried other versions.

This patch causes it to bail out of the loop if the size of the process doubles, at which point it will get restarted.

diff --git a/nhc b/nhc
index 1705e79..706c07d 100755
--- a/nhc
+++ b/nhc
@@ -681,6 +692,29 @@ if [[ "$NHC_RM" == "sge" ]]; then
         if nhcmain_run_checks ; then
             nhcmain_finish
         fi
+       # This loop leaks space with some versions of bash,
+       # e.g. RHEL6's version of 4.1; Fedora's version of 4.3 is OK.
+       # We'll bail out if memory use has ballooned too much, and
+       # execd will re-start us.  Arbitrarily decide on RSS more than
+       # doubling after the first run (which doesn't take that long).
+       # For what it's worth, after triggering a core dump with nhc
+       # in a hard loop from "yes ''|nhc":
+       #   strings core.5137|sort |uniq -c|sort -r -n |head 
+       #    157052 e_size
+       #     25489 :1}"
+       #      7002 [*]}
+       #      6730 ARG"
+       #      6688 TARG"
+       #      ...
+       # (The number of "e_size" entries is a bit less than the
+       # number of iterations.)
+       if [[ -z "$INIT_RSS" ]]; then
+           INIT_RSS=$(ps -p $$ -o rss | tail -n1)
+           elif (( $(ps -p $$ -o rss | tail -n1) > 2*INIT_RSS )); then
+           syslog "nhc bailing out with bash leak -- try a recent version of bash"
+           syslog_flush
+           exit 1
+       fi
     done
 else
     nhcmain_load_scripts

check_ps_userproc_lineage misses users whose UID == MAX_SYS_UID

My UID is 1000 which happens to correspond to the MAX_SYS_UID value. As a result my processes are never flagged as rogue processes.
Diff to fix it is:

diff --git a/scripts/lbnl_ps.nhc b/scripts/lbnl_ps.nhc
index 1c894cb..cfffbed 100644
--- a/scripts/lbnl_ps.nhc
+++ b/scripts/lbnl_ps.nhc
@@ -674,7 +674,7 @@ function check_ps_unauth_users() {
                 log )    log "$UNAUTH_MSG" ;;
                 syslog ) syslog "$UNAUTH_MSG" ;;
                 die )    die 1 "$UNAUTH_MSG" ; return 1 ;;
-                kill )   [[ ${THIS_UID:-0} -gt $MAX_SYS_UID ]] && kill -9 $THIS_PID ;;
+                kill )   [[ ${THIS_UID:-0} -ge $MAX_SYS_UID ]] && kill -9 $THIS_PID ;;
                 ignore ) break ;;
             esac
         done
@@ -704,7 +704,7 @@ function check_ps_userproc_lineage() {
         THIS_UID="${PS_UID[$THIS_PID]}"
         THIS_USER="${PS_USER[$THIS_PID]:-${PWUID_USER[$THIS_UID]:-$THIS_UID}}"
         THIS_CMD="${PS_ARGS[$THIS_PID]/% *}"
-        if [[ ${THIS_UID:-0} -le $MAX_SYS_UID ]]; then
+        if [[ ${THIS_UID:-0} -lt $MAX_SYS_UID ]]; then
             continue
         fi
         if mcheck "${NHC_AUTH_USERS}" "/(^|[^A-Za-z0-9])$THIS_USER(\$|[^A-Za-z0-9])/" ; then
@@ -723,7 +723,7 @@ function check_ps_userproc_lineage() {
                 log )    log "$UNAUTH_MSG" ;;
                 syslog ) syslog "$UNAUTH_MSG" ;;
                 die )    die 1 "$UNAUTH_MSG" ; return 1 ;;
-                kill )   [[ ${THIS_UID:-0} -gt $MAX_SYS_UID ]] && kill -9 $THIS_PID ;;
+                kill )   [[ ${THIS_UID:-0} -ge $MAX_SYS_UID ]] && kill -9 $THIS_PID ;;
                 ignore ) break ;;
             esac
         done

nhc-wrapper error

Hello,
The following error appears to be reoccurring:

/usr/sbin/nhc-wrapper: line 199: 25013 Killed "$SUBPROG" "${ARGLIST[@]}" &>"$OUTFILE"

Can you please help me to better understand this error and resolve it?

avoiding repeated messages when used as SGE load sensor

If NHC is used as an SGE load sensor with syslogging, it currently spams syslog with a message on each run until the problem is resolved. This change avoids sending messages when the state hasn't changed.

diff --git a/nhc b/nhc
index 1705e79..706c07d 100755
--- a/nhc
+++ b/nhc
@@ -40,6 +40,10 @@

 ### Library functions

+# Cache for the last message to avoid spamming syslog in the SGE loop
+# until the state changes.
+last_died_msg=
+
 # Declare a print-error-and-exit function.
 function die() {
     IFS=$' \t\n'
@@ -48,8 +52,11 @@ function die() {

     CHECK_DIED=1
     log "ERROR:  $NAME:  Health check failed:  $*"
-    syslog "Health check failed:  $*"
-    syslog_flush
+    if [[ "$NHC_RM" != "sge" || "$*" != "$last_died_msg" ]]; then
+       last_died_msg="$*"
+       syslog "Health check failed:  $*"
+       syslog_flush
+    fi
     if [[ -n "$NHC_RM" && "$MARK_OFFLINE" -eq 1 && "$FAIL_CNT" -eq 0 ]]; then
         eval $OFFLINE_NODE "'$HOSTNAME'" "'$*'"
     fi
@@ -628,6 +635,10 @@ function nhcmain_mark_online() {
 function nhcmain_finish() {
     local ELAPSED

+    if [[ -n "$last_died_msg" ]]; then
+       syslog "Health check recovered"
+       last_died_msg=
+    fi
     syslog_flush
     ELAPSED=$((SECONDS-NHC_START_TS))
     vlog "Node Health Check completed successfully (${ELAPSED}s)."

Handle Changes to Slurm's REBOOT workflow

Recent versions of Slurm have made changes in the reboot workflow and added a reboot state. As it is, in Slurm 17.11, doing an scontrol reboot or scontrol reboot asap returns the node back to IDLE before NHC can have a chance to offline the node. This causes jobs to run and potentially fail if a required resource isn't up yet.

I have a preliminary set of patches here that I'm opening up for feedback and suggestions.

I'm still testing things, and I still need to make sure that Slurm 18.08's scontrol reboot nextstate=<STATE> command works with these changes, though I don't believe it will have any effect.

Thanks,
Michael

P.S. This was originally brought up here: https://bugs.schedmd.com/show_bug.cgi?id=6391

SLES Does Not Use libexecdir Either

Per @chrissamuel and https://en.opensuse.org/openSUSE:Specfile_guidelines#Libexecdir, SLES and OpenSuSE don't use /usr/libexec, preferring instead to use the older FHS guidance and path of /usr/lib. The fix committed by @basvandervlies in 86b80f4 (#56) restricts the use of /usr/lib to only Debian systems; instead, it should be used any time /usr/libexec is unavailable.

Ultimately, per #63, we probably want to honor user-supplied autoconf paths in a more sane way, but for now, the existence test likely makes the most sense.

Problem with Slurm node state check?

I have a node that was being used for testing and now I want to bring it back into service.

Looking at the NHC logs, the Slurm node state check is failing both to mark the node offline and mark it online, when appropriate:

20160704 14:18:00 /usr/libexec/nhc/node-mark-offline hmem000. check_fs_mount:  /camp not mounted
/usr/libexec/nhc/node-mark-offline:  Not sure how to handle node state "" on hmem000.
20160704 14:23:03 [slurm] /usr/libexec/nhc/node-mark-online hmem000.
/usr/libexec/nhc/node-mark-online:  Not sure how to handle node state "" on hmem000.
/usr/libexec/nhc/node-mark-online:  Skipping  node hmem000. ( )

Variables in nhcmain_init_env() parameters not being updated according to configure parameters

I'm not sure this is really a bug; it's possible that I'm missing how the mechanism works. I'm configuring nhc as follows:

./configure --prefix=${basepath} --sysconfdir=${basepath}/etc --libexecdir=${basepath}/libexec

"make install" moves the bits in correct path, specified by $basepath, as designed. I'd expect, however, that the nhcmain_init_env() function in nhc would also use these locations, but I'm still seeing the default paths embedded in the code:

SYSCONFIGDIR="/etc/sysconfig"
LIBEXECDIR="/usr/libexec" 

Old nhc versions included a "BASEDIR" variable, which is no longer in the nhc code.

Thanks!

NHC returns false "OK" when checking for mounted GPFS filesystems

To be honest, I'm not exactly sure if this is because GPFS is doing something non-standard, or this would happen with any stale remote filesystem type.

[root@node001 ~]# nhc -a

[root@node001 ~]# mount | grep projectsn
projectsn on /projectsn type gpfs (rw,relatime)

[root@node001 ~]# df -h /projectsn
df: '/projectsn': Stale file handle

It makes the filesystem check pretty unreliable, as this is one of the more likely things to go wrong. Any advice? This is with NHC 1.4.2, but I suspect this is not something that is version dependent.
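(The mount table only says the filesystem was mounted at some point, not that it still answers I/O; a probe that actually stats through the filesystem under a timeout catches stale handles and hung mounts where a mount-table check cannot. A hedged sketch, assuming GNU coreutils stat and timeout; fs_alive is an illustrative name, not an NHC check:)

```shell
# Probe a mountpoint with a bounded stat: a stale or hung mount either
# errors out (e.g. ESTALE) or times out; both count as failure.
fs_alive() {
    timeout "${2:-5}" stat -t -- "$1" >/dev/null 2>&1
}

if fs_alive /tmp; then
    echo "/tmp ok"
else
    echo "/tmp stale or hung"
fi
```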

check_ps_unauth_users() killing interactive SLURM jobs

I experienced trouble with the SLURM implementation of check_ps_unauth_users() in release 1.4.2 of NHC killing interactive jobs. (Jobs submitted via sbatch are left alone.)

Undesired/unexpected behavior
check_ps_unauth_users: foo's "sleep" process is unauthorized. (PID 12347)
check_ps_unauth_users: foo's "/bin/bash" process is unauthorized. (PID 12372)

Upon closer inspection, this appeared to be a result of how the list of users with currently running jobs was calculated:

STAT_OUT=$(${STAT_CMD:-/usr/bin/stat} ${STAT_FMT_ARGS:--c} %U $JOBFILE_PATH/job*/slurm_script)

Details
Job files like slurm_script are not created when interactive jobs are launched. Instead, there is a file with the node's hostname and job ID as a part of the filename:

|-- compute-0-2_1084.4294967294
|-- cred_state
|-- cred_state.old
`-- job01084
    `-- slurm_script

Potential solution
I successfully addressed this locally using squeue, which can be configured to report just usernames:

STAT_OUT=$(squeue -w localhost --noheader -o %u)

This should report the username of all users with jobs running on the local node. (If a user is running jobs but not on this node, any processes she has on localhost are unauthorized.)

This has been tested with SLURM 15.08.7.

Please let me know if I have overlooked something or if you have any questions.

Thanks!

(NHC is awesome; thank you!)

check_fs_used reporting issues on wrong filesystem

Using nhc 1.4.2, I have several check_fs_used checks for our grid nodes. They work great most of the time, auto-draining the nodes when utilization gets too high. However, when a specific filesystem is getting hammered and is slow to respond, check_fs_used() tends to report an issue with the first filesystem checked, not the specific one that's slow to respond. This appears to be because it calls df on all filesystems, not just the one being tested, every time check_fs_used() is called.
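A per-mountpoint df invocation (with a timeout) would scope the check to the filesystem actually under test, so an unrelated hung mount cannot stall it. A minimal sketch, not NHC's code (fs_used_pct is an illustrative name):

```shell
# Report the used percentage of one filesystem only: df is pointed at the
# target path, so other mounts are never touched.
fs_used_pct() {
    timeout "${2:-10}" df -P -- "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

fs_used_pct /    # prints a bare percentage for the root filesystem
```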

Please add a parameter to check_hw_ib for comparison operators (e.g., '>' or '<=')

It would be great if there were a way to specify a minimum rate for IB (we have a mixture of 40 and 56).

check_hw_ib

check_hw_ib rate [device]

check_hw_ib determines whether or not an active IB link is present with the specified data rate (in Gb/sec). Version 1.3 and later support the device parameter for specifying the name of the IB device. Version 1.4.1 and later also verify that the kernel drivers and userspace libraries are the same OFED version.

Example (QDR Infiniband): check_hw_ib 40

I contacted the mailing list about this and the response was:

"To support this properly, I'll need to rewrite the check to use the more modern getopt-based scheme which uses command-line options rather than simply fixed-position parameters to define what options are supported and which ones mean what. So it's a bit weightier of a change than if it were just a simple "let's add a new flag to support" patch. Having said that, though, it's still pretty straight-forward stuff. Shouldn't be too difficult. :-)"
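The getopt-based scheme described in that reply might look roughly like this (the option letter and helper name are hypothetical, not the eventual NHC interface):

```shell
# Compare a measured rate against a threshold with a caller-chosen test(1)
# operator (-eq by default), instead of requiring an exact match.
check_rate() {
    local OPTIND=1 op="-eq"    # reset OPTIND so repeated calls parse cleanly
    while getopts "O:" flag; do
        case "$flag" in
            O) op="$OPTARG" ;;
        esac
    done
    shift $((OPTIND - 1))
    # $1 = required rate, $2 = measured rate
    [ "$2" "$op" "$1" ]
}

check_rate -O -ge 40 56 && echo "56 Gb/sec satisfies >= 40"   # mixed 40/56 fabric
```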

Thanks for this great tool!

check_ps_unauth_users causes nhc to crash

It seems that because check_ps_unauth_users returns an error string containing a single quote ('), it causes a parsing error in nhc. Below are a snippet of my nhc.conf and the nhc log that demonstrate the issue.

$ grep check_ps_unauth /etc/nhc.conf
* || check_ps_unauth_users log syslog kill die

$ tail /var/log/nhc.log

/usr/sbin/nhc: eval: line 54: syntax error near unexpected token `('
/usr/sbin/nhc: eval: line 54: `/usr/libexec/nhc/node-mark-offline 'n10000' 'check_ps_unauth_users:  david's "mono" process is unauthorized. (PID 777)''

I have a few ideas for a possible fix: one is to filter the note down to a set of valid characters, and another is to let bash escape the arguments by using a subprocess instead of eval. I just wanted to bring this to your attention and see what your opinion is on the best way to resolve this.

/usr/sbin/nhc - filtering the note

<         eval $OFFLINE_NODE "'$HOSTNAME'" "'$*'"

---
>         MARK_OFFLINE_NOTE="$*"
>         eval $OFFLINE_NODE "'$HOSTNAME'" "'${MARK_OFFLINE_NOTE//[^0-9a-zA-Z :_()\-]}'"

/usr/sbin/nhc - bash subprocess, which escapes the command arguments

<         eval $OFFLINE_NODE "'$HOSTNAME'" "'$*'"

---
>         ($OFFLINE_NODE "$HOSTNAME" "$*")
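A third option worth considering is printf %q, bash's built-in way to escape an arbitrary string so a later eval re-parses it as a single word. A sketch using the note from the log above:

```shell
# printf %q escapes the apostrophe so that a later eval cannot misparse
# the note (cf. the quote in "david's" above).
note="check_ps_unauth_users:  david's \"mono\" process is unauthorized. (PID 777)"
printf -v quoted '%q' "$note"
eval "set -- $quoted"
[ "$1" = "$note" ] && echo "round-trip ok"
```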

lbnl_file.nhc...../scripts/lbnl_file.nhc: line 83: /dev/fd/63: No such file or directory

Running make test on the git (9ac3a1a) version throws an error. This makes it impossible to build the RPMs:

[lipi@llagosti src]$ git clone https://github.com/mej/nhc.git
..
[lipi@llagosti src]$ cd nhc/
..
[lipi@llagosti nhc]$ ./autogen.sh 
..
[lipi@llagosti nhc]$ make test
make -C test test
make[1]: Entering directory '/home/lipi/src/nhc/test'
Running unit tests for NHC:
nhcmain_init_env...ok 6/6
nhcmain_help...ok 8/8
nhcmain_parse_cmdline...ok 16/16
nhcmain_finalize_env...ok 14/14
nhcmain_check_conffile...ok 1/1
nhcmain_load_scripts...ok 6/6
nhcmain_set_watchdog...ok 1/1
nhcmain_watchdog_timer...ok 3/3
nhcmain_run_checks...ok 2/2
common.nhc...ok 108/108
lbnl_cmd.nhc...ok 13/13
lbnl_dmi.nhc...ok 45/45
lbnl_file.nhc...../scripts/lbnl_file.nhc: line 83: /dev/fd/63: No such file or directory
failed 4/71
TEST FAILED:  Single regexp match success:  Got "1" but expected "0"
Makefile:391: recipe for target 'test' failed
make[1]: *** [test] Error 255
make[1]: Leaving directory '/home/lipi/src/nhc/test'
Makefile:925: recipe for target 'test' failed
make: *** [test] Error 2

This was tested on Fedora 25 and SLES 12 SP1.
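(The /dev/fd/63 path in the error is how bash implements process substitution, `<(...)`; it fails on systems where /dev/fd is unavailable, typically minimal chroots without /proc mounted, or when the script runs under a shell lacking process-substitution support. A quick illustration of the mechanism and a temp-file fallback:)

```shell
# <(...) expands to a /dev/fd/NN path; when that is unavailable, the same
# comparison can be routed through an explicit temporary file instead.
ps_ok=false
if diff <(printf 'a\n') <(printf 'a\n') >/dev/null 2>&1; then
    ps_ok=true
fi
echo "process substitution: $ps_ok"

tmp=$(mktemp)
printf 'a\n' > "$tmp"
diff "$tmp" "$tmp" >/dev/null && echo "temp-file fallback: ok"
rm -f "$tmp"
```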

Suggestion of new check: check_fstype_used

Hi Michael,

We have a number of backup servers and storage servers with up to 5-10-20 mounted logical volumes, and for each server and each file system I configure in nhc.conf the check_fs_used, for example:

xx.fysik.dtu.dk || check_fs_used /u/snapshots/localhost 90%
xx.fysik.dtu.dk || check_fs_used /u/snapshots/nexmap 90%
xx.fysik.dtu.dk || check_fs_used /u/snapshots/tap 90%
xx.fysik.dtu.dk || check_fs_used /u/snapshots/tekhist 90%
and so on...

It has become cumbersome to maintain the dynamically changing nhc.conf file on many servers, so this is my suggestion for a new check:

check_fstype_used fstype maxused

where fstype might be a file system type such as xfs, ext4 and so on.

NHC could run "mount -t fstype" to construct a list of file systems, and the idea is that NHC could automagically loop over the list of file systems and execute check_fs_used for each of them.

Thanks for considering this idea.

/Ole

torque-6.0.1 breaks nhc

With torque-6.0.1, the notes begin with ERROR: instead of NHC:, and each time nhc is run by the pbs_mom it appends the same error. I included an example of the error below.

n157 down,offline ERROR: nhc: Health check failed: check_fs_mount: /projects not mounted - ERROR: nhc: Health check failed: check_fs_mount: /projects not mounted - ERROR: nhc: Health check failed: check_fs_mount: /projects not mounted - ERROR: nhc: Health check failed: check_fs_mount: /projects not mounted - ERROR: nhc: Health check failed: check_fs_mount: /projects not mounted - ERROR: nhc: Health check failed: check_fs_mount: /projects not mounted

I wrote a test node_check_script to try and determine the root cause:

#!/bin/bash

pbsnodes -o -N 'NHC: test' n158
echo "ERROR: test msg"
exit 50

When the above is run, the node is marked offline, but the note is set to "ERROR: test msg" instead of "NHC: test".

The question is: should this be fixed in nhc, or should it be fixed in torque? I suspect it's this change that causes this behavior:
adaptivecomputing/torque@560926f#diff-7b5688dd0d82e64680c2e797f7859bde

NVVS (part of NVIDIA DCGM) has replaced nv-healthmon. NHC will fail on new GPUs w/o code mods

NVVS (part of NVIDIA's DCGM: Data Center GPU Manager) is the replacement for nv-healthmon, which is deprecated and unsupported for new and future NVIDIA hardware. Health checking for Pascal microarchitecture (P100/P4/P40 and later) NVIDIA GPUs installed on clusters using NHC will fail without appropriate modifications to NHC.

DCGM link: http://www.nvidia.com/object/data-center-gpu-manager.html

I can put you in direct contact with the DCGM engineering team at NVIDIA and get you the appropriate GPUs for your development and testing. When you are interested, just send me an email.

John Coombs
Tesla BU Alliance Management
NVIDIA
[email protected]

check_dmi_data_match is failing for strings containing '[' and ']'

Problem description:
When performing a check_dmi_data_match for a string that contains a '[' and a ']', the string is never found.

How to reproduce:

  1. Generate the config with nhc-genconf
    s03r1b01:/dev/shm # nhc-genconf -c nhc-auto.conf -H '*' -b '*'

  2. Remove all checks from the generated nhc-auto.conf except one check_dmi_data_match containing a '['. Leave another one without brackets just to verify that matching works, so the conf file looks like:

s03r1b01:/dev/shm # cat nhc-auto.conf
 * || export MARK_OFFLINE=0 NHC_CHECK_ALL=1
 * || check_dmi_data_match -h 0x0000 -t 0 "BIOS Information: Vendor: Lenovo"
 * || check_dmi_data_match -h 0x0000 -t 0 "BIOS Information: Version: -[TEE110H-1.00]-"
  3. Perform the checks and see how it fails for the string containing [ and ]:
s03r1b01:/dev/shm # nhc -c nhc-auto.conf -l /dev/fd/0 -v
Node Health Check starting.
Running check:  "export MARK_OFFLINE=0 NHC_CHECK_ALL=1"
Running check:  "check_dmi_data_match -h 0x0000 -t 0 "BIOS Information: Vendor: Lenovo""
Running check:  "check_dmi_data_match -h 0x0000 -t 0 "BIOS Information: Version: -[TEE110H-1.00]-""
ERROR:  nhc:  Health check failed:  check_dmi_data_match:  No match found for BIOS Information: Version: -[TEE110H-1.00]-
ERROR:  nhc:  Health check failed:  check_dmi_data_match:  No match found for BIOS Information: Version: -[TEE110H-1.00]-
ERROR:  nhc:  1 health checks failed.
ERROR:  nhc:  1 health checks failed.

Cause:
There seems to be a problem with the comparison done at common.nhc in function mcheck_glob.

The debug trace ends with this comparison, which fails:

++ mcheck_glob 'BIOS Information: Firmware Revision: 1.0' 'BIOS Information: Version: -[TEE110H-1.00]-'
++ [[ BIOS Information: Firmware Revision: 1.0 == BIOS Information: Version: -[TEE110H-1.00]- ]]
++ dbg 'Glob match check:  BIOS Information: Firmware Revision: 1.0 does not match BIOS Information: Version: -[TEE110H-1.00]-'

It seems to be because of line 257, which does:

if [[ "$1" == $2 ]]; then

Note that $2 is not quoted, so bash treats it as a glob pattern rather than a literal string.

To reproduce the case in bash you can do a shell script:

[lipi@llagosti shm]$ cat test.sh
#!/bin/bash

A="[X]"
B="[X]"

if [[ "$A" == $B ]]
then
    echo "A equals B"
else
    echo "A not equals B"
fi

if [[ "$A" == "$B" ]]
then
    echo "A equals B"
else
    echo "A not equals B"
fi
[lipi@llagosti shm]$ ./test.sh
A not equals B
A equals B

nhc running on KNL node

Hi,

I'm running into an issue with nhc on a Knights Landing node. The hardware is an HPE XL260.

uname -a

Linux cn3102 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

#cat /etc/redhat-release
CentOS Linux release 7.3.1611 (Core)

The node has 256 GB of memory on it, but the check_hw_mem_free command of NHC is timing out:

nhc -d -v

DEBUG: Debugging activated via -d option.
DEBUG: Verbose mode activated via -v option.
[0] - DEBUG: NHC process 74923 is session leader.
[1502985678] - ERROR: nhc: Health check failed: Script timed out while executing "check_hw_mem_free 500mb".


cat /proc/meminfo

MemTotal: 280321744 kB
MemFree: 273449596 kB
MemAvailable: 274450436 kB
Buffers: 2224 kB
Cached: 2446640 kB
SwapCached: 28 kB
Active: 1170856 kB
Inactive: 1600764 kB
Active(anon): 482984 kB
Inactive(anon): 152932 kB
Active(file): 687872 kB
Inactive(file): 1447832 kB
Unevictable: 111348 kB
Mlocked: 111348 kB
SwapTotal: 4194300 kB
SwapFree: 4194272 kB
Dirty: 8 kB
Writeback: 0 kB
AnonPages: 434260 kB
Mapped: 104612 kB
Shmem: 304576 kB
Slab: 879420 kB
SReclaimable: 261204 kB
SUnreclaim: 618216 kB
KernelStack: 46880 kB
PageTables: 4808 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 144355172 kB
Committed_AS: 984060 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 1295136 kB
VmallocChunk: 34358355964 kB
HardwareCorrupted: 0 kB
AnonHugePages: 241664 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 707068 kB
DirectMap2M: 10692608 kB
DirectMap1G: 275775488 kB

Is there a fix for this, or how can I debug it?

Thanks,
Jeff
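One thing worth trying while debugging: check_hw_mem_free is essentially a parse of /proc/meminfo, so a timeout on it usually means NHC's 30-second watchdog is firing for some other reason (a slow earlier check or a hung subprocess). Raising the watchdog timeout can confirm this. A hedged sketch, assuming your install reads settings from /etc/sysconfig/nhc:

```sh
# /etc/sysconfig/nhc -- raise the NHC watchdog timeout (default: 30 seconds)
TIMEOUT=300
```

The same can be done for a single run with `nhc -d -v -t 300`, and (if nhc is installed at /usr/sbin/nhc) running `bash -x /usr/sbin/nhc -d -v` will trace exactly which command the script is stuck on when the watchdog fires.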

Slurm node state "resv" not understood by node-mark-online

We had a node with a Slurm reservation, so its state was "reserved". This state is not understood by node-mark-online:

20170711 16:05:45 [slurm] /usr/libexec/nhc/node-mark-online h001
/usr/libexec/nhc/node-mark-online: Not sure how to handle node state "resv" on h001
/usr/libexec/nhc/node-mark-online: Skipping resv node h001 (none )

It seems to me that this could be solved by changing /usr/libexec/nhc/node-mark-online:
77c77
< alloc*|comp*|idle*|mix*|resume*|undrain*)
---
> alloc*|comp*|idle*|mix*|resume*|undrain*|resv*)
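A minimal, self-contained sketch of the proposed pattern change (the real node-mark-online runs scheduler-specific online commands where this echoes a label):

```sh
#!/bin/bash
# With "resv*" added to the pattern list, a node state of "resv" is
# recognized as online-able instead of being skipped.
classify_state() {
    case "$1" in
        alloc*|comp*|idle*|mix*|resume*|undrain*|resv*)
            echo online-able ;;
        *)
            echo skip ;;
    esac
}

classify_state resv     # prints "online-able" (previously skipped)
classify_state idle     # prints "online-able"
classify_state drain    # prints "skip"
```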

Detached Mode error

When detached mode is set, the log shows: ERROR: nhc: Unable to write to "/var/run/nhc/nhc.status" -- is /var/run/nhc read-only? However, the permissions on /var/run/nhc are drwxr-xr-x.

check_fs_{free,used} fails to check root file system

When checking the root file system on CentOS 7 with lbnl-nhc 1.4.2, df is invoked with the '-a' flag, causing output similar to the following:

Filesystem                             Type            1K-blocks     Used Available Use% Mounted on
rootfs                                 -                       -        -         -    - /
sysfs                                  sysfs                   0        0         0    - /sys
proc                                   proc                    0        0         0    - /proc
devtmpfs                               devtmpfs          8069440        0   8069440   0% /dev
securityfs                             securityfs              0        0         0    - /sys/kernel/security
tmpfs                                  tmpfs             8085876    48828   8037048   1% /dev/shm
devpts                                 devpts                  0        0         0    - /dev/pts
tmpfs                                  tmpfs             8085876     9456   8076420   1% /run
tmpfs                                  tmpfs             8085876        0   8085876   0% /sys/fs/cgroup
cgroup                                 cgroup                  0        0         0    - /sys/fs/cgroup/systemd
pstore                                 pstore                  0        0         0    - /sys/fs/pstore
cgroup                                 cgroup                  0        0         0    - /sys/fs/cgroup/perf_event
cgroup                                 cgroup                  0        0         0    - /sys/fs/cgroup/cpu,cpuacct
/dev/mapper/vg_centos-lv_root          xfs              41918468 21748416  20170052  52% /
selinuxfs                              selinuxfs               0        0         0    - /sys/fs/selinux
systemd-1                              -                       -        -         -    - /proc/sys/fs/binfmt_misc
hugetlbfs                              hugetlbfs               0        0         0    - /dev/hugepages
mqueue                                 mqueue                  0        0         0    - /dev/mqueue
debugfs                                debugfs                 0        0         0    - /sys/kernel/debug
nfsd                                   nfsd                    0        0         0    - /proc/fs/nfsd
binfmt_misc                            binfmt_misc             0        0         0    - /proc/sys/fs/binfmt_misc
/dev/sda1                              xfs                505580   404616    100964  81% /boot
/dev/mapper/vg_centos-lv_var           xfs              10475520  1026820   9448700  10% /var
/dev/mapper/vg_centos-lv_home          xfs             225968332 61061460 164906872  28% /home
/dev/mapper/vg_centos-lv_tmp           xfs              10475520   349428  10126092   4% /tmp
/dev/mapper/vg_centos-lv_tmp           xfs              10475520   349428  10126092   4% /var/tmp
/dev/mapper/vg_centos-lv_var_log       xfs               8378368    97572   8280796   2% /var/log
/dev/mapper/vg_centos-lv_var_log_audit xfs               2082816    68788   2014028   4% /var/log/audit

While parsing the output, NHC picks up the 'rootfs' entry instead of the logical volume actually mounted on /. NHC notices that this filesystem reports a zero size and aborts the check without looking further into the df output. The following error appears in syslog:

nhc[7739]: WARNING:  Possible bogus check:  check_fs_used on pseudofilesystem /

This error is generated from within check_fs_free, check_fs_used, and check_fs_size from lbnl_fs.nhc:

    for ((i=0; i < ${#DF_DEV[*]}; i++)); do
        if [[ "${DF_MNTPT[$i]}" != "$FS" ]]; then
            continue
        fi
        if [[ "${DF_SIZE[$i]}" == "0" ]]; then
            syslog "WARNING:  Possible bogus check:  ${FUNCNAME[0]} on pseudofilesystem $FS"
            return 0
        fi

I can get around this by setting DF_FLAGS to '-Tk' instead of '-Tka'. Should the '-a' flag be the default? I think this code should continue instead of returning 0, allowing it to find the real root filesystem.
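A self-contained sketch of the proposed change, using stand-in data that mimics the df -Tka output above (both 'rootfs' at size 0 and the real logical volume are mounted on /); the syslog call is replaced by an echo for illustration:

```sh
#!/bin/bash
# Stand-in arrays mimicking NHC's parsed `df -Tka` output.
DF_DEV=(rootfs /dev/mapper/vg_centos-lv_root)
DF_MNTPT=(/ /)
DF_SIZE=(0 41918468)

fs_size() {
    local FS="$1" i
    for ((i=0; i < ${#DF_DEV[*]}; i++)); do
        if [[ "${DF_MNTPT[$i]}" != "$FS" ]]; then
            continue
        fi
        if [[ "${DF_SIZE[$i]}" == "0" ]]; then
            # Proposed fix: warn but keep scanning instead of `return 0`,
            # so the real filesystem later in the df output is found.
            echo "WARNING:  Possible bogus check on pseudofilesystem $FS" >&2
            continue
        fi
        echo "${DF_SIZE[$i]}"
        return 0
    done
    return 1
}

fs_size /    # prints 41918468, the size of the real root filesystem
```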

Move README to Sphinx docs (and ReadTheDocs)

Michael —

Congratulations on the move to GitHub! I understand this might not really have been your first choice, but for what it’s worth, I’m happy about it. :)

I wanted to reach out and inquire whether you had any thoughts about "dissecting" the README.md file and moving that into more of a true documentation guide? I would be happy to do some work and submit a pull request for you. My thought would be to use Sphinx and ReadTheDocs.org to build and maintain a live copy of the documentation. Again, would like to get some input before diving into the work.

Thoughts?

watchdog kill bugs

The nhc script has a watchdog timer that uses a construct like

kill pid || return 0

and then proceeds to escalate the kill if that appears to fail. But it always proceeds, because in bash a zero return status is "true", so the || branch never fires after a successful kill. I think the killing lines should instead read

kill pid && return 0

I tested this on a code snippet and indeed, killing the watchdog works fine with && rather than ||.
The easiest way to see this for yourself is to try something like this in a shell:

sleep 60 &
kill $! || echo hello

vs

sleep 60 &
kill $! && echo hello

I'm too lazy this morning to make a fork + pull request for this. The statements to fix are in the nhc functions nhcmain_watchdog_timer and kill_watchdog.

Hope this helps, and thanks for nhc !
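Wrapped in function form, the difference looks like this (a sketch, not the actual nhc code):

```sh
#!/bin/bash
# `kill PID || return 0` only returns when kill FAILS; on a successful
# kill, control falls through and the escalation runs anyway.
try_kill_or() {
    sleep 60 & local pid=$!
    kill "$pid" || return 0        # buggy: kill succeeded, so keep going
    echo "escalating anyway"
}
# `kill PID && return 0` stops escalating as soon as the kill succeeds.
try_kill_and() {
    sleep 60 & local pid=$!
    kill "$pid" && return 0        # fixed: kill succeeded, stop here
    echo "escalating anyway"
}

out1=$(try_kill_or)
out2=$(try_kill_and)
echo "||: $out1"    # prints "||: escalating anyway"
echo "&&: $out2"    # prints "&&: " (no escalation)
```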

Not able to provide a custom configuration directory for nhc-genconf. It always picks /etc/nhc

LBNL NHC
The nhc utility has a -D option to specify a custom configuration directory, from which its scripts are loaded.

The nhc-genconf utility doesn't have this option; it always loads the '*.nhc' scripts from the /etc/nhc/scripts directory.

I have configured NHC so that all my '*.nhc' scripts are in the /opt/clustertest/nhc/etc/nhc/scripts directory.

nhc-genconf has the following option:
-c Write config to (default: /etc/nhc/nhc.conf.auto)

But I am not able to generate the configuration file using the -c option, as nhc-genconf still expects the '*.nhc' scripts to be in the /etc/nhc/scripts directory.

It always gives me the following error:
/opt/clustertest/nhc/sbin/nhc-genconf: line 330: nhc_common_unparse_size: command not found
/opt/clustertest/nhc/sbin/nhc-genconf: line 334: nhc_common_unparse_size: command not found

NHC hangs when ps has long commands in output

This is an extreme edge case. We have a user who seems to dump a perl script as the command for their job. So /proc/$PID/cmdline ends up being an entire perl script:

# wc /proc/30638/cmdline 
3754 6542 64551 /proc/30638/cmdline

This seems to cause problems for NHC when using check_ps_service. Running the actual ps command does not hang or take a long time.

[root@n0410 ~]# nhc -l- -v
<SNIP>
Running check:  "check_ps_service -r -d sssd_be sssd"
NHC watchdog timer 23612 (30 secs) has expired.  Signaling NHC:  kill -s ALRM -- -23611
ERROR:  nhc:  Health check failed:  Script timed out while executing "check_ps_service -r -d sssd_be sssd".
NHC watchdog timer 23612 terminating NHC:  kill -s TERM -- -23611
[root@n0410 ~]# NHC watchdog timer 23612 terminating NHC with extreme prejudice:  kill -s KILL -- -23611

I've saved the output of the NHC ps command, let me know if it would be useful to send directly via email as a reproducer.

Replacement for NVIDIA_HEALTHMON check?

We have some GPU nodes, and I would like to use the NVIDIA_HEALTHMON check in nhc.conf. Unfortunately, it seems that NVIDIA no longer offers nvidia-healthmon (at least I was unable to find it after a lot of searching).

Question: How can NHC check GPU health in the absence of nvidia-healthmon? One simple-minded check is for the existence of the /dev/nvidia* files, as in this example nhc.conf line:

gpu* || check_file_test -c -r /dev/nvidia0 /dev/nvidia1 /dev/nvidia2 /dev/nvidia3
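NVIDIA's Data Center GPU Manager (DCGM) is the documented successor to nvidia-healthmon; its dcgmi diag command runs comparable health checks. If your NHC build includes the command checks from lbnl_cmd.nhc, something along these lines might work. This is a hedged sketch: verify the check_cmd_status option syntax and the dcgmi invocation against your installed versions.

```sh
# Hypothetical nhc.conf lines; option syntax may differ by NHC version.
# Require nvidia-smi to exit cleanly:
gpu* || check_cmd_status -t 60 nvidia-smi
# Run a quick DCGM diagnostic (level 1) and require a clean exit:
gpu* || check_cmd_status -t 120 dcgmi diag -r 1
```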
