atc0005 / check-vmware

Go-based tooling to monitor VMware environments; NOT affiliated with or endorsed by VMware, Inc.

License: MIT License

Go 95.10% Makefile 3.64% Dockerfile 0.22% Shell 1.04%

vmware nagios tools plugin golang nagios-plugin monitoring metrics

check-vmware's Introduction

About me

Role: Systems Administrator

Experience

support: troubleshooting, training, documentation
proxies & web servers: Squid, Apache, Nginx, HAProxy, IIS
mail servers: Postfix, Dovecot, Roundcube, DKIM, Postgrey
config/change management: Subversion, Git, Ansible
containers: Docker, LXD
virtualization: VMware, Hyper-V, VirtualBox
databases: MySQL/MariaDB, PostgreSQL, Microsoft SQL Server
monitoring: Nagios, custom tooling, Microsoft Teams, fail2ban
logging: rsyslog (local, central receivers), Graylog
ticketing: Redmine, GitHub, GitLab, Service Now

Role: Intermediate developer

Experience

current:
- Go, Python, PowerShell, shell scripting
- MySQL/MariaDB, SQLite
- Docker, LXD
- Markdown, Textile, MediaWiki, reStructuredText, HTML, CSS
- Redmine, GitHub (including GitHub Actions), Gitea, GitLab
past: batch files (don't laugh, it gets the job done), Perl
academic: C, C++

check-vmware's Issues

check_vmware_tools: Long Service Output listing omits affected VMs when only one affected

Spotted this while working on #87.

check-vmware/internal/vsphere/tools.go

Line 182 in bc6ce79

case len(vmsWithIssues) > 1:

I discovered this when testing various combinations of changes for Resource Pool handling.
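
A minimal sketch of the likely fix, assuming the report only needs to list affected VMs whenever at least one exists. The `toolsReport` function and its string-based VM list are hypothetical simplifications, not the project's actual types:

```go
package main

import (
	"fmt"
	"strings"
)

// toolsReport is a simplified, hypothetical stand-in for the report logic in
// internal/vsphere/tools.go; it only models the list of affected VM names.
func toolsReport(vmsWithIssues []string) string {
	var report strings.Builder

	switch {
	// Compare against 0 (not 1) so that a single affected VM is still listed.
	case len(vmsWithIssues) > 0:
		for _, name := range vmsWithIssues {
			fmt.Fprintf(&report, "* %s\n", name)
		}
	default:
		report.WriteString("* No VMware Tools issues detected.\n")
	}

	return report.String()
}

func main() {
	fmt.Print(toolsReport([]string{"vm1"}))
}
```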

Compare old/new plugin output for missing details

Off the top of my head, I'm thinking of the CRITICAL and WARNING threshold details shown in the one-line summary output for the older plugins. Those details make it easy to see at a glance why a Service Check has been determined to be in a non-OK state.

Create plugin to monitor snapshots

Overview

In the old codebase this was implemented as two plugins:

size
age

Both plugins allowed excluding individual VMs or resource pools as did other plugins in the set. I'm not sure yet whether this project will have two plugins or a shared plugin to handle both items. The check-path project uses a shared plugin approach where monitoring criteria can be specified as needed. If not specified, those thresholds are not checked.

Goals

accept CRITICAL/WARNING threshold values (with useful default values)
(IncludeRP) allow restricting VMs to select Resource Pools
optional User Domain (with automatic selection applied if not given)
(ExcludeRP) allow excluding a list of Resource Pools
- reverse mode where VMs from all pools are checked, except for any VMs in this optional list of Resource Pools
(IgnoreVM) allow excluding a list of individual VMs
skip cert validation
emit ManagedObjectReference ID value in the Long Service Output
- won't be needed for the vast majority of use cases, but could be useful with troubleshooting work

References

PowerCLI
govmomi
- vmware/govmomi 1197
- vmware/govmomi 2243

Create plugin for detecting whether a Virtual Machine is running on a snapshot?

Data object: VirtualMachineSnapshotInfo(vim.vm.SnapshotInfo)

Property: currentSnapshot

Description:

Current snapshot of the virtual machine

This property is set by calling RevertToSnapshot_Task or CreateSnapshot_Task. This property will be empty when the working snapshot is at the root of the snapshot tree.

Idea: Report any virtual machines running with a snapshot active. Flags could allow specifying a time range for WARNING and CRITICAL states. Perhaps support a flag that toggles whether any active snapshot is enough to trigger an alert (presumably a WARNING state).

refs https://vdc-download.vmware.com/vmwb-repository/dcr-public/a5f4000f-1ea8-48a9-9221-586adff3c557/7ff50256-2cf2-45ea-aacd-87d231ab1ac7/vim.vm.SnapshotInfo.html#field_detail

Create plugin to monitor vCPUs allocation

Goals/flags:

(IncludeRP) allow restricting VMs to select Resource Pools
accept CRITICAL/WARNING threshold values (with useful default values)
accept Max vCPUs allowed value
optional User Domain (with automatic selection applied if not given)
(ExcludeRP) allow excluding a list of Resource Pools
- reverse mode where VMs from all pools are checked, except for any VMs in this optional list of Resource Pools
(IgnoreVM) allow excluding a list of individual VMs
skip cert validation
optional power state override
- powered on VMs only
- powered off VMs also

check_vmware_datastore command definition template uses wrong plugin name

check-vmware/contrib/nagios/etc/nagios-plugins/config/vmware-datastores.cfg

Lines 11 to 14 in d5d9466

 define command{
     command_name check_vmware_datastore
     command_line /usr/lib/nagios/plugins/check_vmware_tools --server '$HOSTNAME$' --domain '$ARG1$' --username '$ARG2$' --password '$ARG3$' --ds-usage-warning '$ARG4$' --ds-usage-critical '$ARG5$' --ds-name '$ARG6$' --trust-cert --log-level info
 }

Copy/paste/modify error.

GoDoc coverage missing for project plugins

Documentation coverage per pkg.go.dev listing:

Coverage is already provided by the README, so it shouldn't be too much work to copy/paste into new small doc.go files, one per plugin directory.

Review and update threshold listings in extended output

While working on #66 the language used for that plugin and the other snapshot plugins stood out:

snapshots age
- CRITICAL: 2 day old snapshots present
- WARNING: 1 day old snapshots present
snapshots size
- CRITICAL: snapshots of 50 GB (combined size) present
- WARNING: snapshots of 30 GB (combined size) present

These are thresholds, and the description should clearly indicate that. For example, the present word above makes it sound like having a 1 day old snapshot is enough to trigger a WARNING state (if specifying 1 day), but it's not, that is the threshold. The same for the 30 GB snapshot. Both scenarios are not enough to trigger a WARNING state.

Once the values (age, size) go past the threshold is when the state changes.

In short, the present word will need to go. I'll also need to review the other threshold statements to make sure they're accurate.

check_vmware_snapshots_age: Misreported VMs, snapshots count

Some snapshots taken yesterday during a maintenance window were properly flagged today as having a WARNING state, but the one-line summary counts for affected VMs and snapshots were off by 2.

I checked and the logic problem is here:

check-vmware/internal/vsphere/snapshots.go

Lines 228 to 233 in ffd0c23

 for _, set := range sss {
     if set.ExceedsAge(days) > 1 {
         setsExceeded++
         snapshotsExceeded += set.ExceedsAge(days)
     }
 }

Specifically, here:

check-vmware/internal/vsphere/snapshots.go

Line 229 in ffd0c23

if set.ExceedsAge(days) > 1 {

This should be >=, not just >.
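
A sketch of the corrected tally, under the assumption that `ExceedsAge` returns the number of snapshots in a set that are older than the threshold. The `snapshotSet` type and `tallyExceeded` helper are hypothetical stand-ins for the project's own types:

```go
package main

import "fmt"

// snapshotSet is a hypothetical stand-in for the project's snapshot summary
// set type; ages are tracked in whole days for simplicity.
type snapshotSet struct {
	agesInDays []int
}

// ExceedsAge returns how many snapshots in the set are older than the given
// number of days.
func (s snapshotSet) ExceedsAge(days int) int {
	var count int
	for _, age := range s.agesInDays {
		if age > days {
			count++
		}
	}
	return count
}

// tallyExceeded counts the sets (VMs) and snapshots over the age threshold.
func tallyExceeded(sets []snapshotSet, days int) (setsExceeded, snapshotsExceeded int) {
	for _, set := range sets {
		// Use >= 1 so that a set with exactly one old snapshot is counted.
		if n := set.ExceedsAge(days); n >= 1 {
			setsExceeded++
			snapshotsExceeded += n
		}
	}
	return setsExceeded, snapshotsExceeded
}

func main() {
	sets := []snapshotSet{
		{agesInDays: []int{3}},
		{agesInDays: []int{3, 5}},
		{agesInDays: []int{0}},
	}
	vms, snaps := tallyExceeded(sets, 1)
	fmt.Printf("%d VMs, %d snapshots\n", vms, snaps)
}
```

With the original `> 1` comparison, the first set (one old snapshot) would be skipped, which matches the undercount described above.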

Should snapshots for powered off VMs be ignored by default?

Currently this is not the assumption. Snapshots are subject to both Age and Size checks by default, regardless of a VM's power state. While ignoring issues for powered off VMs by default makes sense to me in some cases (e.g., VMware Tools versions), ignoring powered off VMs seems more risky when dealing with snapshots.

Opening this issue to invite feedback from others.

Snippet currently used in the README, cmd-specific doc files:

The current design of this plugin is to evaluate all Virtual Machines, whether powered off or powered on. If you have a use case for evaluating only powered on VMs by default, please add a comment to GH-79 providing some details for your use-case. In our environment, I have yet to see a need to only evaluate powered on VMs for old snapshots. For cases where the snapshots needed to be ignored, we added the VM to the ignore list. We then relied on datastore usage monitoring to let us know when space was becoming an issue.

Create plugin to monitor VMware Tools status

Base goals

Stretch goals

toggle to extend matches to powered off VMs also
- defaults to limiting results to powered on VMs only
- if enabling this setting, other states would be checked
  - toolsNotInstalled
  - toolsOld

"Snapshots not yet exceeding age thresholds" list not populated

While performing maintenance today I noticed that a fresh snapshot was not showing in the list, just the snapshots which had already hit an age threshold.

Flesh out README

Cover basic ground that I use with other projects.

check_vmware_snapshots_age plugin: wrong state label for OK check results

Noticed this after deploying the plugin today and pruning snapshots from a prior maintenance window.

Create plugin for VMs with an active "Question"

Summary.Runtime.Question
- if set, the VM is waiting for an answer (interactively)

I found that in at least one case a VM crashed due to lack of feedback on one of these prompts. That's been some time, so this is likely not as great an issue as it once was, but this could still prove useful.

Create plugin for detecting whether a host is connected to a vCenter instance?

From vmware/govmomi issue 2257:

If you connect to an ESX host with govc, you can check this way:
% govc object.collect -s -type h / summary.managementServerIp                       
10.182.4.228
It'll be empty if not connected to any vCenter.

See also the managementServerIp field:

IP address of the VirtualCenter server managing this host, if any.

refs https://vdc-download.vmware.com/vmwb-repository/dcr-public/a5f4000f-1ea8-48a9-9221-586adff3c557/7ff50256-2cf2-45ea-aacd-87d231ab1ac7/vim.host.Summary.html

Create plugin to monitor host CPU usage

This plugin is intended to monitor a specific host. This is intended to help identify hosts that are overburdened in a shared hosting environment where an automated rebalancing policy may not be in effect.

Create plugin to monitor VM "power cycle" uptime

Summary.QuickStats.UptimeSeconds
- e.g., too long of an uptime could mean that a kernel update didn't install properly (they're usually released monthly)

Not sure if this is based on power state, or guest OS uptime. If the former, this might require setting a lengthy value in order to be useful. For example, the power state "uptime" for some VMs could be many months at a time if there isn't a hard requirement to shut it down. This is with regular maintenance, OS updates and reboots.

If the Summary.QuickStats.UptimeSeconds value is tied to a VM "reboot", then that will do nicely.

Allow specifying lists of values with or without quotes

Flags for these values currently support comma-separated lists of items:

Ignored vms
Ignored datastores
Excluded resource pools
Included resource pools

This works if the whole collection is double-quoted (quotes removed by shell presumably?), but not if the individual items are quoted.

Examples:

works
- "item 1, item 2, item3, item 4"
does not work
- "item1", "item2", "item3", "item4"
- '"item1", "item2", "item3", "item4"'
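
One possible approach, sketched under the assumption that item names never contain commas or quote characters themselves: split on commas, then trim whitespace and any surrounding single or double quotes from each item. The `splitList` helper is hypothetical and not part of the project:

```go
package main

import (
	"fmt"
	"strings"
)

// splitList splits a comma-separated flag value into items, trimming
// surrounding whitespace and any single or double quotes around each item.
// Empty items are dropped.
func splitList(value string) []string {
	var items []string
	for _, raw := range strings.Split(value, ",") {
		item := strings.TrimSpace(raw)
		item = strings.Trim(item, `"'`)
		item = strings.TrimSpace(item)
		if item != "" {
			items = append(items, item)
		}
	}
	return items
}

func main() {
	fmt.Printf("%q\n", splitList(`item 1, item 2`))
	fmt.Printf("%q\n", splitList(`"item1", "item2"`))
	fmt.Printf("%q\n", splitList(`'"item1", "item2"'`))
}
```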

Create plugin for reporting connected optical drives?

This would likely prove incredibly annoying if it runs frequently, so the docs would need to suggest that the retry frequency be set high enough to reflect a forgotten ISO, vs one in active use to install or rescue an operating system.

Create plugin to monitor virtual hardware version

Perhaps check for the highest hardware version deployed and use that as the baseline for all other VMs?

If there is a boolean attribute we can check that will make it easier and more reliable. Otherwise, this plugin has to end up waiting for one of the VMs to be upgraded so that all others will be measured accordingly.

vCPUs plugin: "more than allowed" error appears to be incorrect

As shown here:

check-vmware/cmd/check_vmware_vcpus/main.go

Lines 251 to 256 in d5d9466

 nagiosExitState.LastError = fmt.Errorf(
     "%d of %d vCPUs allocated (%0.1f%% more than allowed)",
     vCPUsAllocated,
     cfg.VCPUsMaxAllowed,
     vCPUsPercentageUsedOfAllowed,
 )

I believe I see what I intended, but I would need to either reword this statement or fix the math.

For example, let's say the allocation percentage is 110%.

This would mean that the wording should be:

110% of allowed
10% more than allowed
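
The two wordings correspond to two different calculations. A small sketch, assuming the allowed value is always greater than zero; both helper names are hypothetical:

```go
package main

import "fmt"

// percentOfAllowed returns allocated as a percentage of allowed.
// The allowed value is assumed to be greater than zero.
func percentOfAllowed(allocated, allowed int) float64 {
	return float64(allocated) * 100 / float64(allowed)
}

// percentOverAllowed returns how far allocated exceeds allowed, as a
// percentage of allowed. It returns 0 when allocated is within the limit.
func percentOverAllowed(allocated, allowed int) float64 {
	if allocated <= allowed {
		return 0
	}
	return float64(allocated-allowed) * 100 / float64(allowed)
}

func main() {
	allocated, allowed := 110, 100
	fmt.Printf(
		"%d of %d vCPUs allocated (%.1f%% of allowed, %.1f%% more than allowed)\n",
		allocated,
		allowed,
		percentOfAllowed(allocated, allowed),
		percentOverAllowed(allocated, allowed),
	)
}
```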

VMs outside of Resource Pools excluded from evaluation

A question from @HisArchness on Twitter:

For check_vmware_tools, for instance, it seems it will ignore all virtual machines that is not in a Resource Pool and ignores the default 'Resources' RP altogether. Is there a way to change this behavior with the switches provided?

I don't know the answer off-hand, but this does not sound like the desired behavior for the plugin.

I wrote the original PowerCLI-based Nagios plugin with the intent of using it with standalone ESXi hosts (where on some systems we did not place them in Resource Pools) and with clusters managed by a vCenter instance (where all VMs are managed by Resource Pool). The new plugin is intended to mirror the behavior of the original while adding some additional functionality (and verbose Long Service Output content useful for troubleshooting).

Based on the description alone, there is likely a bug in the plugin's logic. I'll look into this and note my findings.

refs https://twitter.com/HisArchness/status/1353761328591237125

Add thresholds support for virtual hardware version plugin

By default the flag values could be unset or otherwise configured to provide the same behavior as the version of the plugin created for GH-15.

This enhancement would add support for determining a distance from current version to highest version and use that to set CRITICAL or WARNING states.
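
A rough sketch of the distance calculation, assuming hardware versions are available as `vmx-NN` strings and that "distance" is measured as the plain difference between version numbers (which are not contiguous across releases). The helper names and the map-based input are hypothetical:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// hwVersionNumber extracts the numeric part of a "vmx-NN" version string.
func hwVersionNumber(version string) (int, error) {
	if !strings.HasPrefix(version, "vmx-") {
		return 0, fmt.Errorf("unexpected hardware version format: %q", version)
	}
	return strconv.Atoi(strings.TrimPrefix(version, "vmx-"))
}

// versionsBehind maps each VM name to the difference between the highest
// hardware version found in the set and that VM's own version.
func versionsBehind(versions map[string]string) (map[string]int, error) {
	nums := make(map[string]int, len(versions))
	highest := 0
	for vm, v := range versions {
		n, err := hwVersionNumber(v)
		if err != nil {
			return nil, fmt.Errorf("VM %s: %w", vm, err)
		}
		nums[vm] = n
		if n > highest {
			highest = n
		}
	}

	behind := make(map[string]int, len(nums))
	for vm, n := range nums {
		behind[vm] = highest - n
	}
	return behind, nil
}

func main() {
	behind, err := versionsBehind(map[string]string{
		"old-vm": "vmx-13",
		"new-vm": "vmx-19",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("old-vm is behind by", behind["old-vm"])
}
```

The resulting distance could then be compared against WARNING and CRITICAL values supplied via flags.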

check_vmware_datastore | Datastore-specific storage usage for VMs appears to be incorrect

While reviewing the vSphere API for work on #4, I took a closer look at how the space used by each VM on a specific datastore was calculated.

This is the logic as of this writing:

check-vmware/internal/vsphere/datastores.go

Lines 326 to 337 in 48fb7ae

 for _, vm := range dsVMs {
     vmStorageUsed := vm.Summary.Storage.Committed + vm.Summary.Storage.Uncommitted
     vmPercentOfDSUsed := float64(vmStorageUsed) / float64(dsUsageSummary.StorageTotal) * 100
     fmt.Fprintf(
         tw,
         "%s\t%v\t%1.f%%%s",
         vm.Name,
         units.ByteSize(vmStorageUsed),
         vmPercentOfDSUsed,
         nagios.CheckOutputEOL,
     )
 }

these lines in particular:

check-vmware/internal/vsphere/datastores.go

Lines 326 to 327 in 48fb7ae

 for _, vm := range dsVMs {
     vmStorageUsed := vm.Summary.Storage.Committed + vm.Summary.Storage.Uncommitted

Looking at the API docs, it seems that the storage values available from vm.Summary.Storage (vim.vm.Summary.StorageSummary) are an aggregate across all datastores, not just the one we're examining with this plugin.
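
One possible direction, assuming per-datastore usage entries for each VM are available (the vSphere API appears to expose these as perDatastoreUsage on vim.vm.StorageInfo): sum only the entries that match the datastore being examined. The `datastoreUsage` type and `usageOnDatastore` helper below are hypothetical stand-ins, and adding committed and uncommitted values simply mirrors the quoted logic:

```go
package main

import "fmt"

// datastoreUsage is a hypothetical stand-in for one per-datastore usage
// entry of a VM. In the real API the datastore would be identified by a
// managed object reference rather than a plain string.
type datastoreUsage struct {
	Datastore   string
	Committed   int64
	Uncommitted int64
}

// usageOnDatastore sums the storage a VM uses on one specific datastore,
// ignoring entries for any other datastore.
func usageOnDatastore(entries []datastoreUsage, datastore string) int64 {
	var total int64
	for _, e := range entries {
		if e.Datastore == datastore {
			total += e.Committed + e.Uncommitted
		}
	}
	return total
}

func main() {
	entries := []datastoreUsage{
		{Datastore: "datastore-1", Committed: 20, Uncommitted: 5},
		{Datastore: "datastore-2", Committed: 100, Uncommitted: 0},
	}
	fmt.Println(usageOnDatastore(entries, "datastore-1"))
}
```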

refs:

Various linting issues exposed from enabling GHAWs

The following linting issues were exposed from dropping in the https://github.com/atc0005/check-vmware/blob/master/.golangci.yml file as part of enabling GitHub Actions Workflows for this repo:

$ make linting
Running linting tools ...
Running go vet ...
Running golangci-lint ...
internal/vsphere/datastores.go:123:12: string `error: datacenter not provided, failed to fallback to default datacenter` has 3 occurrences, make it a constant (goconst)
                errMsg = "error: datacenter not provided, failed to fallback to default datacenter"
                         ^
internal/vsphere/datastores.go:126:12: string `error: failed to use provided datacenter, failed to fallback to default datacenter` has 3 occurrences, make it a constant (goconst)
                errMsg = "error: failed to use provided datacenter, failed to fallback to default datacenter"
                         ^
internal/config/constants.go:54:2: exported const PluginTypeTools should have comment (or a comment on this block) or be unexported (golint)
        PluginTypeTools                 string = "vmware-tools"
        ^
internal/vsphere/constants.go:10:1: comment on exported const `ParentResourcePool` should be of the form `ParentResourcePool ...` (golint)
// Virtual machine hosts have a hidden resource pool named Resources, which is
^
internal/vsphere/login.go:20:1: exported function `Login` should have comment or be unexported (golint)
func Login(
^
internal/vsphere/resource-pools.go:134:1: exported function `GetEligibleRPs` should have comment or be unexported (golint)
func GetEligibleRPs(ctx context.Context, c *vim25.Client, includeRPs []string, excludeRPs []string, propsSubset bool) ([]mo.ResourcePool, error) {
^
internal/vsphere/tools.go:86:1: exported function `VMToolsOneLineCheckSummary` should have comment or be unexported (golint)
func VMToolsOneLineCheckSummary(stateLabel string, vmsWithIssues []mo.VirtualMachine, evaluatedVMs []mo.VirtualMachine, rps []mo.ResourcePool) string {
^
internal/vsphere/tools.go:110:1: exported function `VMToolsReport` should have comment or be unexported (golint)
func VMToolsReport(
^
internal/vsphere/vms.go:143:1: comment on exported function `GetVMsFromRPs` should be of the form `GetVMsFromRPs ...` (golint)
// GetVMsFromRP receives a list of ResourcePool object references and returns
^
internal/config/config.go:85:13: struct of size 272 bytes could be of size 264 bytes (maligned)
type Config struct {
            ^
internal/vsphere/resource-pools.go:81:2: Consider preallocating `poolNamesFound` (prealloc)
        var poolNamesFound []string
        ^
Makefile:114: recipe for target 'linting' failed
make: *** [linting] Error 1

snapshots size plugin properly detects WARNING cumulative size state, but unhelpfully notes 0 (individual) snapshots exceeding size

Example output:

WARNING: 0 snapshots larger than 20 GB detected (evaluated 86 VMs, 4 Resource Pools)

**ERRORS**

* snapshot exceeds specified size threshold

**THRESHOLDS**

* CRITICAL: 30 GB size snapshots present
* WARNING: 20 GB size snapshots present

**DETAILED INFO**

Snapshots exceeding WARNING (20GB) or CRITICAL (30GB) size thresholds:

* "RHEL7-TEST" [Age: 1059.21 days, Size (item: 27.3KB, sum: 22.5GB), Name: "Fresh install, activation and patches", ID: snapshot-18946]
* "RHEL7-TEST" [Age: 471.91 days, Size (item: 8.4GB, sum: 22.5GB), Name: "2019-10-15", ID: snapshot-126800]
* "RHEL7-TEST" [Age: 420.81 days, Size (item: 6.3GB, sum: 22.5GB), Name: "2019-12-05", ID: snapshot-138143]
* "RHEL7-TEST" [Age: 305.86 days, Size (item: 7.8GB, sum: 22.5GB), Name: "2020-03-29", ID: snapshot-163887]
* "RHEL7-TEST" [Age: 13.73 days, Size (item: 1.0MB, sum: 22.5GB), Name: "Test Snapshot", ID: snapshot-229096]
* "RHEL7-TEST" [Age: 13.73 days, Size (item: 1.0MB, sum: 22.5GB), Name: "Test Child snapshot", ID: snapshot-229097]

Snapshots *not yet* exceeding size thresholds:

* "TEST-AC-000001" [Age: 11.11 days, Size (item: 2.0MB, sum: 2.0MB), Name: "VM Snapshot 1%252f18%252f2021, 3:29:43 AM", ID: snapshot-229822]
* "TEST-hwv10" [Age: 12.95 days, Size (item: 10.1KB, sum: 2.0MB), Name: "Snap1", ID: snapshot-229336]
* "TEST-hwv10" [Age: 12.95 days, Size (item: 2.0MB, sum: 2.0MB), Name: "Snap2", ID: snapshot-229337]

check_vmware_tools plugin does not clearly define what thresholds are used for service check logic

Example output:

OK: No VMware Tools issues detected (evaluated 5 VMs, 1 Resource Pools)

**ERRORS**

* None

**THRESHOLDS**

* Not specified

**DETAILED INFO**

* No VMware Tools issues detected.

The logic for thresholds handling is defined here (from the README):

Tools Status	Nagios State	Description
`toolsOk`	`OK`	Ideal state, no problems with VMware Tools (or `open-vm-tools`) detected.
`toolsOld`	`WARNING`	Outdated VMware Tools installation. The host ESXi system was likely recently updated.
`toolsNotRunning`	`CRITICAL`	VMware Tools (or `open-vm-tools`) not currently running. It likely crashed or was terminated due to low memory scenario.
`toolsNotInstalled`	`CRITICAL`	Fresh virtual environment, or VMware Tools removed as part of an upgrade of an existing installation.

Add support for listing Resource Pool memory usage as percentage of total cluster capacity

OK: Memory usage is at 93.89% of 40 GB allowed (2.45 GB remaining), 0.96% of total capacity. [WARNING: 101% , CRITICAL: 110%]

The 0.96% of total capacity remark seems to be computed using these bits of PowerCLI logic:

$poolDetails = @{
    "name" = $_.Name;
    "cpuActive" = ($_.Runtime.Cpu.OverallUsage / 1000);
    "memoryConsumed" = ($_.Runtime.Memory.OverallUsage / 1GB)
    "memoryTotal" = ($_.Runtime.Memory.MaxUsage / 1GB)
}

and

# This property is attached to each entry in the pool; fetch value from first
# array entry.
if ($detailedPools.Count -gt 0) {
    $totalMemoryAvailable = $detailedPools[0].memoryTotal
}

$memoryPercentageAllowed = [math]::Round(($totalMemoryUsed / $MaxMemoryAllowed) * 100, 2)
$memoryPercentageTotalCapacity = [math]::Round(($totalMemoryUsed / $totalMemoryAvailable) * 100, 2)
$memoryRemaining = [math]::Round(($MaxMemoryAllowed - $totalMemoryUsed), 2)

Per the Data Object - ResourcePoolResourceUsage(vim.ResourcePool.ResourceUsage) doc, this is what the maxUsage field is about:

NAME	TYPE	DESCRIPTION
maxUsage	xsd:long	Current upper-bound on usage. The upper-bound is based on the limit configured on this resource pool, as well as limits configured on any parent resource pool.

It may be that I was able to compute the total memory available in the cluster due to the memory limit on the pool being unlimited? This doesn't seem like a reliable way to list the overall percentage of memory consumed from the cluster. Instead you'd have to get the list of hosts, tally the total memory, then calculate per pool and in aggregate.

If there are pool caps, that would need to factor in somehow?

Originally posted by @atc0005 in #32 (comment)

Recreate shared functionality from prior PowerShell (PowerCLI-based) module

As a checklist for what to create in this project, here is the shared functionality that I created as a VMware.Monitoring PowerShell module at the end of last Summer:

Not all of these items will have the same form in the new codebase, but this checklist is worth having as I begin building Go replacements.

Replace godoc.org badge with pkg.go.dev badge

https://pkg.go.dev/badge/

Review contrib vc1.example.com Nagios host config file

While copy/pasting a block to setup a new example service check I "noticed" this text which has been included in most (all?) of the service checks:

check-vmware/contrib/nagios/etc/nagios3/conf/hosts/servers/vc1.example.com.cfg

Lines 122 to 127 in a1f9b04

 # Virtual machine hosts have a hidden resource pool named 'Resources',
 # which is a parent of all resource pools of the host. This pool throws
 # off our calculations, so we explicitly ignore it in the script logic
 # itself. Because of that, we do NOT have to list it here.
 # https://code.vmware.com/docs/9638/cmdlet-reference/doc/Get-ResourcePool.html
 # https://pubs.vmware.com/vsphere-51/topic/com.vmware.powercli.cmdletref.doc/Get-ResourcePool.html

This may still be relevant (haven't read over it in detail yet) for some service check examples, but likely not all where it has been included.

Create plugin to monitor for mismatched storage/host pairings (using Custom Attributes)

This may take some work to get right, but this plugin is intended to detect VMs housed on datastores that are distant to the hosts that are running them.

In our environment we have a total of 6 hosts. Three are in one datacenter, three are in another datacenter. Years ago the workload was light enough and the network connection between the DCs fast enough that most of our VMs could run on any set of hosts with minimal impact. At present attempting this causes no end of headaches.

Even numbered hosts are in one DC, odd numbered hosts are in the other. Datastores are prefixed with DC location. Knowing this, we can hard-code pair patterns to note when we have a mismatch.

The vSphere structure is composed of only a single datacenter, so we can't use DC separation as a search pattern. We could use a set of flags to specify a set of hostnames and datastore prefixes. The plugin could list all VMs housed on the datastores and verify what hosts they're running on. Either a pair of flags or a single boolean flag could identify whether a mismatch is considered WARNING or CRITICAL.

Regarding the service check, I suspect it would be easier to configure one service check per set of datastores & hosts. Presumably this would mean if there were 3 separate locations with storage intended for each (though connected to other hosts as a "fallback" option), this would mean three service checks.

An enhancement to this plugin could pivot to using tags or attributes to identify pairings and alert when a mismatch is found. This is likely the most flexible option for long-term use. This could catch for example an I/O demanding VM running on a lower tier of storage hardware, or a VM used by one team running on a datastore intended for another team.

References:

check_vmware_snapshots_age plugin: duplicated structured logging field

While working on #4 I noticed that the ignored_vms structured logging field is duplicated. I'll need to check other plugins to see if I made the same copy/paste/modify mistake there as well.

Add "contrib" content, expand README to cover v0.2.0 functionality

Now that I've gotten the first four plugins functional and others have shown interest in this project, I've decided to hit pause on further plugin work and document existing functionality.

Once that is done, I'll turn back to further development efforts.

Create plugin to monitor Datastores

Goals:

Accept Datastore name
accept CRITICAL/WARNING threshold values (with useful default values)
skip cert validation

Add GitHub Actions Workflows

The standard suite used in other projects.

check_vmware_datastore | Angle brackets for pre tags (in VM listing) shown in CLI, missing from Nagios generated notifications

I noticed this when deploying the plugin today. Example:

pre
Name                                            Space used  Datastore Usage
Ubuntu-MATE-18.04-disk-test-RES-DC1-S6200-vol12  29.1GB      0.06%
/pre

The output looks fine when displayed in a terminal. I'm not sure if this is due to a Nagios-specific setting or if tabwriter or fmt.Fprintf are encoding the angle brackets somehow.

Create plugin to monitor host memory

Unlike GH-5 which is intended to monitor a percentage of a set amount of memory across a cluster (e.g., are "we" within our leased memory range), this plugin is intended to monitor a specific host. This is intended to help identify hosts that are overburdened in a shared hosting environment where an automated rebalancing policy may not be in effect.

check_vmware_datastore | Datastore-specific storage usage for VMs is rounded without sufficient precision

For example, a VM of roughly 22.1 GB on a 7 TB datastore is reported as using 0% of the total storage, when in reality it is closer to 0.28%, assuming I'm doing my math correctly:

22100000000 / 7696581394432
0.0028714047013117 * 100
0.2871404701311673
0.28%
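
The `%1.f` verb used in the VM listing sets a width of 1 and a precision of 0, so small percentages collapse to a whole number. A short sketch comparing it with a two-decimal format; the helper names are hypothetical. Note that `%.2f` rounds rather than truncates, so this value prints as 0.29% rather than 0.28%:

```go
package main

import "fmt"

// percentOfTotal returns used as a percentage of total.
func percentOfTotal(used, total int64) float64 {
	return float64(used) / float64(total) * 100
}

// formatCoarse mirrors the format verb from the quoted listing code:
// width 1, precision 0.
func formatCoarse(percent float64) string {
	return fmt.Sprintf("%1.f%%", percent)
}

// formatPrecise keeps two decimal places.
func formatPrecise(percent float64) string {
	return fmt.Sprintf("%.2f%%", percent)
}

func main() {
	p := percentOfTotal(22100000000, 7696581394432)
	fmt.Println(formatCoarse(p))  // 0%
	fmt.Println(formatPrecise(p)) // 0.29%
}
```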

Create plugin to monitor Resource Pools

Goals:

Accept max memory
accept CRITICAL/WARNING threshold values (with useful default values)
(IncludeRP) allow restricting VMs to select Resource Pools
optional User Domain (with automatic selection applied if not given)
(ExcludeRP) allow excluding a list of Resource Pools
- reverse mode where VMs from all pools are checked, except for any VMs in this optional list of Resource Pools
skip cert validation

Choice of including/excluding VMs from evaluation based on power status not exposed

One thing not clearly noted in current one-line summary or Long Service Output results is whether the plugin was asked to include or exclude VMs from evaluation based on their power status.

This should probably be noted for all plugins which allow filtering on power status.

The same goes for any other explicit evaluation criteria toggled by the sysadmin configuring the service check command definition. Choices there should be explicitly noted in the Long Service Output, if not in the one-line summary.

Originally posted by @atc0005 in #32 (comment)

Create a plugin to report whether a VM has exceeded a specified max number of snapshots

I read recently that VMware supports no more than 32 snapshots per VM. I think the recommended maximum number was around 3-4, but only for a short period of time.

This plugin should look for any VM with more than X snapshots and flag it as problematic. Perhaps the WARNING threshold at 4 (default), then CRITICAL somewhere before 32, maybe 25 snapshots (80% of 32, rounded down).
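
A sketch of the threshold check using the tentative defaults mentioned above (WARNING at 4, CRITICAL at 25). The constant and function names are hypothetical, and "more than X" is read as a strict comparison:

```go
package main

import "fmt"

const (
	// Tentative default thresholds taken from the proposal above.
	defaultCountWarning  = 4
	defaultCountCritical = 25
)

// stateForSnapshotCount returns the Nagios state label for a VM with the
// given number of snapshots. A state changes only once the count goes past
// a threshold.
func stateForSnapshotCount(count, warning, critical int) string {
	switch {
	case count > critical:
		return "CRITICAL"
	case count > warning:
		return "WARNING"
	default:
		return "OK"
	}
}

func main() {
	for _, count := range []int{4, 5, 26} {
		fmt.Printf("%d snapshots: %s\n", count,
			stateForSnapshotCount(count, defaultCountWarning, defaultCountCritical))
	}
}
```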

check_vmware_snapshots_age plugin: incomplete logic for young snapshots switch case

This statement does not handle the zero snapshots scenario properly:

check-vmware/internal/vsphere/snapshots.go

Line 614 in 64b45b0

 case !snapshotSummarySets.IsAgeCriticalState() && !snapshotSummarySets.IsAgeWarningState(): 

Because zero snapshots meets that case statement logic, it triggers instead of allowing the default (and intended here) logic to trigger:

check-vmware/internal/vsphere/snapshots.go

Lines 632 to 634 in 64b45b0

 default:
     fmt.Fprintln(&report, "* None detected")
 }
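
One way to let the zero snapshots scenario reach the default case is to require at least one snapshot in the state-based case. A simplified sketch; the function, its parameters, and the string-based snapshot list are hypothetical and only model the two cases quoted above:

```go
package main

import (
	"fmt"
	"strings"
)

// youngSnapshotsReport lists snapshots that have not yet exceeded an age
// threshold. The boolean arguments stand in for the IsAgeCriticalState and
// IsAgeWarningState results.
func youngSnapshotsReport(snapshots []string, isCritical, isWarning bool) string {
	var report strings.Builder

	switch {
	// Require at least one snapshot so that an empty set falls through to
	// the default case instead of matching here.
	case len(snapshots) > 0 && !isCritical && !isWarning:
		for _, name := range snapshots {
			fmt.Fprintf(&report, "* %s\n", name)
		}
	default:
		fmt.Fprintln(&report, "* None detected")
	}

	return report.String()
}

func main() {
	fmt.Print(youngSnapshotsReport(nil, false, false))
}
```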

Plugins require write permission on home directory in order to cache login sessions

When deploying the check_vmware_vcpus plugin today I ran into this error:

mkdir .govmomi: permission denied

Light digging indicated it was related to sessions support. We're using that here:

check-vmware/internal/vsphere/login.go

Lines 46 to 52 in 505431e

 // Use session cache to help avoid "leaking sessions"; Session.Login will
 // only create a new authenticated session if the cached session does not
 // exist or is invalid.
 s := &cache.Session{
     URL: u,
     Insecure: trustCert,
 }

Add CHANGELOG

Use same format as with other projects.

Enable Dependabot updates

This has proven invaluable in other projects. Enable here also.

Create plugin to monitor (vCenter) server time

Per the docs, methods.GetCurrentTime(ctx, c) will retrieve the vCenter server time in UTC.

We should be able to gather the current time from a reference NTP server and compare against this value. If the difference is more than X, then one state, if more than Y, then another state.

Not sure if this capability is present for standalone ESXi hosts or only through vCenter.
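
A sketch of the comparison step only, assuming the server time and the reference time have already been retrieved by other means. The `driftState` function and the threshold values are hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

// driftState compares the absolute time difference between the server and a
// reference source against WARNING and CRITICAL thresholds.
func driftState(drift, warning, critical time.Duration) string {
	if drift < 0 {
		drift = -drift
	}

	switch {
	case drift > critical:
		return "CRITICAL"
	case drift > warning:
		return "WARNING"
	default:
		return "OK"
	}
}

func main() {
	// Placeholder values; in practice these would come from the vCenter
	// server and an NTP query respectively.
	reference := time.Date(2021, 1, 29, 12, 0, 0, 0, time.UTC)
	server := reference.Add(-45 * time.Second)

	fmt.Println(driftState(server.Sub(reference), 30*time.Second, 60*time.Second))
}
```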

vsphere.getObjects accepts unsupported types.ManagedObjectReference for use with CreateContainerView

While working on #6 earlier I thought I'd be clever and use the current datastore as a "container" for a view. The idea was that the view would be limited to just the VMs in the datastore.

This resulted in this error "bubbling up":

`ServerFaultCode: A specified parameter was not correct: container`

After digging into the docs (see below), I learned that only a subset of vSphere inventory types could be used as a container for a view:

The Folder, Datacenter, ComputeResource, ResourcePool, or HostSystem instance that provides the objects that the view presents.

Since the API docs explicitly note the supported types, we should probably enforce those types and return a more verbose error message if something else (such as a mo.Datastore) is passed in.
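
A minimal sketch of such a check, using only the types named in the quoted API docs. The `objectRef` type is a hypothetical stand-in for types.ManagedObjectReference, and a strict match like this would also reject subtypes (such as ClusterComputeResource) unless they are added to the list:

```go
package main

import "fmt"

// objectRef is a hypothetical stand-in for a managed object reference.
type objectRef struct {
	Type  string
	Value string
}

// supportedContainerTypes lists the inventory types that the API docs note
// as valid containers for a view.
var supportedContainerTypes = map[string]bool{
	"Folder":          true,
	"Datacenter":      true,
	"ComputeResource": true,
	"ResourcePool":    true,
	"HostSystem":      true,
}

// validateContainer returns a descriptive error if the given reference is
// not one of the supported container types.
func validateContainer(ref objectRef) error {
	if !supportedContainerTypes[ref.Type] {
		return fmt.Errorf(
			"unsupported container type %q (%s) for view; supported types: "+
				"Folder, Datacenter, ComputeResource, ResourcePool, HostSystem",
			ref.Type,
			ref.Value,
		)
	}
	return nil
}

func main() {
	err := validateContainer(objectRef{Type: "Datastore", Value: "datastore-123"})
	fmt.Println(err)
}
```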

Refs:

Add support to toggle all internal/vsphere package debug log messages

Example messages generated now (sent to os.Stderr):

It took 11.2361ms to execute ValidateRPs func (and validate 4 Resource Pools).
It took 11.5973ms to execute getObjects func (and retrieve 6 ResourcePool objects).
It took 11.8661ms to execute GetEligibleRPs func (and retrieve 4 Resource Pools).
It took 33.6523ms to execute getObjects func (and retrieve 4 VirtualMachine objects).
It took 18.5961ms to execute getObjects func (and retrieve 6 VirtualMachine objects).
It took 294.8642ms to execute getObjects func (and retrieve 85 VirtualMachine objects).
It took 5.5516ms to execute getObjects func (and retrieve 85 VirtualMachine objects).
It took 353.6448ms to execute GetVMsFromRPs func (and retrieve 85 VMs).
It took 17.3µs to execute FilterVMsWithSnapshots func (for 83 VMs, yielding 2 VMs).

This has been very useful as I've worked on the package, and I expect the output will continue to be useful when troubleshooting plugins from this project in the future. However, while continuing work on #4 I believe I've hit a point where the output, while useful, may be a bit too much for anyone but myself to deal with.

I think the above is fine, but this block (one of many for a VM's snapshot tree) is an example of content that a sysadmin might not care to see (by default):

Processing snapshot: [ID: snapshot-229096, Name: Test Snapshot, HasParent: true]
Adding key 3 to vmParentSnapshotDiskFileKeys
Adding key 4 to vmParentSnapshotDiskFileKeys
Adding key 26 (vmsn, snapData) to vmSnapshotDiskFileKeys
snapLayout [Name: [HUSVM-Library-vol6] RHEL7-TEST/RHEL7-TEST-Snapshot11.vmsn, Size: 19564 (19.1KB), Key: 26]
Adding key 3 to vmSnapshotDiskFileKeys
Adding key 4 to vmSnapshotDiskFileKeys
Adding key 11 to vmSnapshotDiskFileKeys
Adding key 12 to vmSnapshotDiskFileKeys
Range vmParentSnapshotDiskFileKeys ...
Removing key 3 from vmSnapshotDiskFileKeys
Removing key 4 from vmSnapshotDiskFileKeys
Remaining keys in vmSnapshotDiskFileKeys: map[11:11 12:12 26:26]
Range vmDiskFileKeys ...
Removing key 5 from vmSnapshotDiskFileKeys
Removing key 6 from vmSnapshotDiskFileKeys
Removing key 27 from vmSnapshotDiskFileKeys
Removing key 28 from vmSnapshotDiskFileKeys
Removing key 3 from vmSnapshotDiskFileKeys
Removing key 4 from vmSnapshotDiskFileKeys
Remaining keys in vmSnapshotDiskFileKeys: map[11:11 12:12 26:26]
Tally size of vmSnapshotDiskFileKeys
Size [bytes: 1068140, HR: 1.0MB] calculated for Test Snapshot snapshot

This is output that would be hidden away by default and exposed only when requested.
