atc0005 / check-vmware

Go-based tooling to monitor VMware environments; NOT affiliated with or endorsed by VMware, Inc.
License: MIT License
Data object: VirtualMachineSnapshotInfo(vim.vm.SnapshotInfo)
Property: currentSnapshot
Description:
Current snapshot of the virtual machine
This property is set by calling RevertToSnapshot_Task or CreateSnapshot_Task. This property will be empty when the working snapshot is at the root of the snapshot tree.
Idea: Report any virtual machines running with a snapshot active. Flags could allow specifying a time range for WARNING and CRITICAL states. Perhaps support a flag that toggles whether any active snapshot is enough to trigger an alert (presumably a WARNING state).
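The active-snapshot check described above could be sketched as follows. This is a hedged illustration only, not the plugin's actual implementation: the `VM` and `SnapshotInfo` types below are simplified stand-ins for govmomi's `mo.VirtualMachine` and `vim.vm.SnapshotInfo`, and `vmsWithActiveSnapshot` is a hypothetical helper name.

```go
package main

import "fmt"

// SnapshotInfo mirrors the relevant part of vim.vm.SnapshotInfo.
type SnapshotInfo struct {
	// CurrentSnapshot is empty/nil when the working snapshot is at the
	// root of the snapshot tree.
	CurrentSnapshot *string
}

// VM is a simplified stand-in for mo.VirtualMachine.
type VM struct {
	Name     string
	Snapshot *SnapshotInfo // nil when the VM has no snapshots
}

// vmsWithActiveSnapshot returns the names of VMs that report a current
// snapshot; any hit would presumably map to a WARNING state by default.
func vmsWithActiveSnapshot(vms []VM) []string {
	var names []string
	for _, vm := range vms {
		if vm.Snapshot != nil && vm.Snapshot.CurrentSnapshot != nil {
			names = append(names, vm.Name)
		}
	}
	return names
}

func main() {
	snap := "snapshot-229096"
	vms := []VM{
		{Name: "clean-vm"},
		{Name: "snap-vm", Snapshot: &SnapshotInfo{CurrentSnapshot: &snap}},
	}
	fmt.Println(vmsWithActiveSnapshot(vms)) // [snap-vm]
}
```

Time-range flags for WARNING/CRITICAL states would then apply age thresholds on top of this basic filter.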
While working on #4 I noticed that the ignored_vms structured logging field is duplicated. I'll need to check other plugins to see if I made the same copy/paste/modify mistake there as well.
The following linting issues were exposed from dropping in the https://github.com/atc0005/check-vmware/blob/master/.golangci.yml file as part of enabling GitHub Actions Workflows for this repo:
$ make linting
Running linting tools ...
Running go vet ...
Running golangci-lint ...
internal/vsphere/datastores.go:123:12: string `error: datacenter not provided, failed to fallback to default datacenter` has 3 occurrences, make it a constant (goconst)
errMsg = "error: datacenter not provided, failed to fallback to default datacenter"
^
internal/vsphere/datastores.go:126:12: string `error: failed to use provided datacenter, failed to fallback to default datacenter` has 3 occurrences, make it a constant (goconst)
errMsg = "error: failed to use provided datacenter, failed to fallback to default datacenter"
^
internal/config/constants.go:54:2: exported const PluginTypeTools should have comment (or a comment on this block) or be unexported (golint)
PluginTypeTools string = "vmware-tools"
^
internal/vsphere/constants.go:10:1: comment on exported const `ParentResourcePool` should be of the form `ParentResourcePool ...` (golint)
// Virtual machine hosts have a hidden resource pool named Resources, which is
^
internal/vsphere/login.go:20:1: exported function `Login` should have comment or be unexported (golint)
func Login(
^
internal/vsphere/resource-pools.go:134:1: exported function `GetEligibleRPs` should have comment or be unexported (golint)
func GetEligibleRPs(ctx context.Context, c *vim25.Client, includeRPs []string, excludeRPs []string, propsSubset bool) ([]mo.ResourcePool, error) {
^
internal/vsphere/tools.go:86:1: exported function `VMToolsOneLineCheckSummary` should have comment or be unexported (golint)
func VMToolsOneLineCheckSummary(stateLabel string, vmsWithIssues []mo.VirtualMachine, evaluatedVMs []mo.VirtualMachine, rps []mo.ResourcePool) string {
^
internal/vsphere/tools.go:110:1: exported function `VMToolsReport` should have comment or be unexported (golint)
func VMToolsReport(
^
internal/vsphere/vms.go:143:1: comment on exported function `GetVMsFromRPs` should be of the form `GetVMsFromRPs ...` (golint)
// GetVMsFromRP receives a list of ResourcePool object references and returns
^
internal/config/config.go:85:13: struct of size 272 bytes could be of size 264 bytes (maligned)
type Config struct {
^
internal/vsphere/resource-pools.go:81:2: Consider preallocating `poolNamesFound` (prealloc)
var poolNamesFound []string
^
Makefile:114: recipe for target 'linting' failed
make: *** [linting] Error 1
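For reference, the goconst and golint findings above are typically resolved by hoisting repeated strings into constants and adding doc comments on exported identifiers. A minimal sketch (constant names other than `PluginTypeTools` are hypothetical; the doc comment wording is illustrative):

```go
package main

import "fmt"

// Repeated error strings hoisted into constants (satisfies goconst).
const (
	errMsgNoDatacenter  = "error: datacenter not provided, failed to fallback to default datacenter"
	errMsgBadDatacenter = "error: failed to use provided datacenter, failed to fallback to default datacenter"
)

// PluginTypeTools is the plugin type label used for VMware Tools checks
// (golint: exported const should have a comment or be unexported).
const PluginTypeTools string = "vmware-tools"

func main() {
	fmt.Println(PluginTypeTools)
}
```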
Per the docs, methods.GetCurrentTime(ctx, c) will retrieve the vCenter server time in UTC.
We should be able to gather the current time from a reference NTP server and compare against this value. If the difference is more than X, then one state, if more than Y, then another state.
Not sure if this capability is present for standalone ESXi hosts or only available through vCenter.
Use same format as with other projects.
Summary.QuickStats.UptimeSeconds
Not sure if this is based on power state or guest OS uptime. If the former, this might require setting a lengthy threshold value in order to be useful. For example, the power state "uptime" for some VMs could be many months at a time if there isn't a hard requirement to shut them down, even with regular maintenance, OS updates, and reboots.
If the Summary.QuickStats.UptimeSeconds value is tied to a VM "reboot", then that will do nicely.
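A minimal sketch of the evaluation, assuming day-based thresholds supplied via flags (`uptimeState` is a hypothetical helper; only the `Summary.QuickStats.UptimeSeconds` field name comes from the vSphere API):

```go
package main

import "fmt"

const secondsPerDay = 24 * 60 * 60

// uptimeState converts a Summary.QuickStats.UptimeSeconds value into a
// Nagios-style state label based on day thresholds.
func uptimeState(uptimeSeconds int32, warnDays, critDays int) string {
	days := int(uptimeSeconds) / secondsPerDay
	switch {
	case days >= critDays:
		return "CRITICAL"
	case days >= warnDays:
		return "WARNING"
	default:
		return "OK"
	}
}

func main() {
	// e.g. ~400 days of "uptime" against 365/730 day thresholds
	fmt.Println(uptimeState(400*secondsPerDay, 365, 730)) // WARNING
}
```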
Unlike GH-5 which is intended to monitor a percentage of a set amount of memory across a cluster (e.g., are "we" within our leased memory range), this plugin is intended to monitor a specific host. This is intended to help identify hosts that are overburdened in a shared hosting environment where an automated rebalancing policy may not be in effect.
By default the flag values could be unset or otherwise configured to provide the same behavior as the version of the plugin created for GH-15.
This enhancement would add support for determining the distance from the current version to the highest version and using that to set CRITICAL or WARNING states.
While working on #6 earlier I thought I'd be clever and use the current datastore as a "container" for a view. The idea was that the view would be limited to just the VMs in the datastore.
This resulted in this error "bubbling up":
`ServerFaultCode: A specified parameter was not correct: container`
After digging into the docs (see below), I learned that only a subset of vSphere inventory types could be used as a container for a view:
The Folder, Datacenter, ComputeResource, ResourcePool, or HostSystem instance that provides the objects that the view presents.
Since the API docs explicitly note the supported types, we should probably enforce those types and return a more verbose error message if something else (such as a mo.Datastore) is passed in.
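A hedged sketch of such a guard (the helper name is hypothetical; the type strings are the vSphere managed object type names listed in the docs):

```go
package main

import "fmt"

// validContainerTypes are the only inventory types the API docs permit
// as a container for a view.
var validContainerTypes = map[string]struct{}{
	"Folder":          {},
	"Datacenter":      {},
	"ComputeResource": {},
	"ResourcePool":    {},
	"HostSystem":      {},
}

// validateContainerType returns a descriptive error for unsupported
// container types instead of letting the terse server fault bubble up.
func validateContainerType(objType string) error {
	if _, ok := validContainerTypes[objType]; !ok {
		return fmt.Errorf(
			"container type %q not supported for views; expected one of "+
				"Folder, Datacenter, ComputeResource, ResourcePool, HostSystem",
			objType,
		)
	}
	return nil
}

func main() {
	fmt.Println(validateContainerType("Datastore") != nil) // true
}
```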
Refs:
Example messages generated now (sent to os.Stderr):
It took 11.2361ms to execute ValidateRPs func (and validate 4 Resource Pools).
It took 11.5973ms to execute getObjects func (and retrieve 6 ResourcePool objects).
It took 11.8661ms to execute GetEligibleRPs func (and retrieve 4 Resource Pools).
It took 33.6523ms to execute getObjects func (and retrieve 4 VirtualMachine objects).
It took 18.5961ms to execute getObjects func (and retrieve 6 VirtualMachine objects).
It took 294.8642ms to execute getObjects func (and retrieve 85 VirtualMachine objects).
It took 5.5516ms to execute getObjects func (and retrieve 85 VirtualMachine objects).
It took 353.6448ms to execute GetVMsFromRPs func (and retrieve 85 VMs).
It took 17.3µs to execute FilterVMsWithSnapshots func (for 83 VMs, yielding 2 VMs).
This has been very useful as I've worked on the package, and I expect the output will continue to be useful when troubleshooting plugins from this project in the future. However, while continuing work on #4 I believe I've hit a point where the output, while useful, may be a bit too much for anyone but myself to deal with.
I think the above is fine, but this block (one of many for a VM's snapshot tree) is an example of content that a sysadmin might not care to see (by default):
Processing snapshot: [ID: snapshot-229096, Name: Test Snapshot, HasParent: true]
Adding key 3 to vmParentSnapshotDiskFileKeys
Adding key 4 to vmParentSnapshotDiskFileKeys
Adding key 26 (vmsn, snapData) to vmSnapshotDiskFileKeys
snapLayout [Name: [HUSVM-Library-vol6] RHEL7-TEST/RHEL7-TEST-Snapshot11.vmsn, Size: 19564 (19.1KB), Key: 26]
Adding key 3 to vmSnapshotDiskFileKeys
Adding key 4 to vmSnapshotDiskFileKeys
Adding key 11 to vmSnapshotDiskFileKeys
Adding key 12 to vmSnapshotDiskFileKeys
Range vmParentSnapshotDiskFileKeys ...
Removing key 3 from vmSnapshotDiskFileKeys
Removing key 4 from vmSnapshotDiskFileKeys
Remaining keys in vmSnapshotDiskFileKeys: map[11:11 12:12 26:26]
Range vmDiskFileKeys ...
Removing key 5 from vmSnapshotDiskFileKeys
Removing key 6 from vmSnapshotDiskFileKeys
Removing key 27 from vmSnapshotDiskFileKeys
Removing key 28 from vmSnapshotDiskFileKeys
Removing key 3 from vmSnapshotDiskFileKeys
Removing key 4 from vmSnapshotDiskFileKeys
Remaining keys in vmSnapshotDiskFileKeys: map[11:11 12:12 26:26]
Tally size of vmSnapshotDiskFileKeys
Size [bytes: 1068140, HR: 1.0MB] calculated for Test Snapshot snapshot
This is output that would be hidden away by default and exposed only when requested.
Goals:
Off the top of my head I'm thinking of the CRITICAL, WARNING threshold details shown in the one-line summary output for the older plugins. Those are useful for seeing at a glance why a Service Check has been determined to be in a non-OK state.
Currently this is not the assumption. Snapshots are subject to both Age and Size checks by default, regardless of a VM's power state. While ignoring issues for powered off VMs by default makes sense to me in some cases (e.g., VMware Tools versions), ignoring powered off VMs seems more risky when dealing with snapshots.
Opening this issue to invite feedback from others.
Snippet currently used in the README, cmd-specific doc files:
The current design of this plugin is to evaluate all Virtual Machines, whether powered off or powered on. If you have a use case for evaluating only powered on VMs by default, please add a comment to GH-79 providing some details for your use-case. In our environment, I have yet to see a need to only evaluate powered on VMs for old snapshots. For cases where the snapshots needed to be ignored, we added the VM to the ignore list. We then relied on datastore usage monitoring to let us know when space was becoming an issue.
Goals:
* IncludeRP: allow restricting VMs to select Resource Pools
* ExcludeRP: allow excluding a list of Resource Pools
When deploying the check_vmware_vcpus plugin today I ran into this error:
mkdir .govmomi: permission denied
Light digging indicated it was related to sessions support. We're using that there:
check-vmware/internal/vsphere/login.go
Lines 46 to 52 in 505431e
Flags for these values currently support comma-separated lists of items. This works if the whole collection is double-quoted (the quotes presumably being removed by the shell?), but not if the individual items are quoted.
Examples:
"item 1, item 2, item3, item 4"
"item1", "item2", "item3", "item4"
'"item1", "item2", "item3", "item4"'
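One way to make the flag parsing tolerant of both forms is sketched below. This is an illustration, not the plugin's current code: a `flag.Value` implementation that splits on commas, then strips surrounding whitespace and stray double quotes from each item so `"item1", "item2"` parses the same as "item1, item2".

```go
package main

import (
	"fmt"
	"strings"
)

// multiValueFlag implements the flag.Value interface (String + Set).
type multiValueFlag []string

func (m *multiValueFlag) String() string { return strings.Join(*m, ", ") }

// Set splits a comma-separated value and normalizes each item by
// trimming whitespace and surrounding double quotes.
func (m *multiValueFlag) Set(value string) error {
	for _, item := range strings.Split(value, ",") {
		item = strings.Trim(strings.TrimSpace(item), `"`)
		if item != "" {
			*m = append(*m, item)
		}
	}
	return nil
}

func main() {
	var f multiValueFlag
	_ = f.Set(`"item1", "item2", "item3"`)
	fmt.Println(f) // [item1 item2 item3]
}
```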
Spotted this while working on #87.
check-vmware/internal/vsphere/tools.go
Line 182 in bc6ce79
I discovered this when testing various combinations of changes for Resource Pool handling.
I read recently that VMware supports no more than 32 snapshots per VM. I think the recommended maximum number was around 3-4, but only for a short period of time.
This plugin should look for any VM with more than X snapshots and flag it as problematic. Perhaps the WARNING threshold at 4 (default), then CRITICAL somewhere before 32, maybe 25 snapshots (80% of 32, rounded down).
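The count-based evaluation described above, with the suggested defaults (WARNING at 4, CRITICAL at 25 of VMware's 32-snapshot cap), could be sketched as (the helper name is hypothetical):

```go
package main

import "fmt"

// snapshotCountState maps a VM's snapshot count onto a Nagios-style
// state using configurable WARNING/CRITICAL count thresholds.
func snapshotCountState(count, warn, crit int) string {
	switch {
	case count >= crit:
		return "CRITICAL"
	case count >= warn:
		return "WARNING"
	default:
		return "OK"
	}
}

func main() {
	// 5 snapshots against the suggested 4/25 defaults.
	fmt.Println(snapshotCountState(5, 4, 25)) // WARNING
}
```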
This would likely prove incredibly annoying if it runs frequently, so the docs would need to suggest that the retry frequency be set high enough to reflect a forgotten ISO, vs one in active use to install or rescue an operating system.
This has proven invaluable in other projects. Enable here also.
This statement does not handle the zero snapshots scenario properly:
check-vmware/internal/vsphere/snapshots.go
Line 614 in 64b45b0
Because zero snapshots meets that case statement logic, it triggers instead of allowing the default (and intended here) logic to trigger:
check-vmware/internal/vsphere/snapshots.go
Lines 632 to 634 in 64b45b0
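A hedged illustration of the fix (the actual statement at line 614 is not reproduced here, and the function below is hypothetical): guard the case expression so a zero-snapshot count falls through to the intended default branch.

```go
package main

import "fmt"

// snapshotSummary only takes the case branch when snapshots actually
// exist; zero snapshots falls through to the default branch.
func snapshotSummary(exceeded, total int) string {
	switch {
	case exceeded > 0 && total > 0:
		return fmt.Sprintf("%d of %d snapshots exceed thresholds", exceeded, total)
	default:
		// Intended branch for the zero-snapshots scenario.
		return "no snapshots detected"
	}
}

func main() {
	fmt.Println(snapshotSummary(0, 0)) // no snapshots detected
}
```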
One thing not clearly noted in current one-line summary or Long Service Output results is whether the plugin was asked to include or exclude VMs from evaluation based on their power status.
This should probably be noted for all plugins which allow filtering on power status.
The same goes for any other explicit evaluation criteria toggled by the sysadmin configuring the service check command definition. Choices there should be explicitly noted in the Long Service Output, if not in the one-line summary.
Originally posted by @atc0005 in #32 (comment)
Now that I've gotten the first four plugins functional and others have shown interest in this project, I've decided to hit pause on further plugin work and document existing functionality.
Once that is done, I'll turn back to further development efforts.
From vmware/govmomi issue 2257:
If you connect to an ESX host with govc, you can check this way:
% govc object.collect -s -type h / summary.managementServerIp
10.182.4.228

It'll be empty if not connected to any vCenter.
See also the managementServerIp field:
IP address of the VirtualCenter server managing this host, if any.
As shown here:
check-vmware/cmd/check_vmware_vcpus/main.go
Lines 251 to 256 in d5d9466
I believe I see what I intended, but I would need to either reword this statement or fix the math.
For example, let's say the allocation percentage is 110%.
This would mean that the wording should be:
I noticed this when deploying the plugin today. Example:
pre
Name Space used Datastore Usage
Ubuntu-MATE-18.04-disk-test-RES-DC1-S6200-vol12 29.1GB 0.06%
/pre
The output looks fine when displayed in a terminal. I'm not sure if this is due to a Nagios-specific setting or if tabwriter or fmt.Fprintf are encoding the angle brackets somehow.
Summary.Runtime.Question
I found that in at least one case a VM crashed due to lack of feedback on one of these prompts. That's been some time, so this is likely not as great an issue as it once was, but this could still prove useful.
Example output:
WARNING: 0 snapshots larger than 20 GB detected (evaluated 86 VMs, 4 Resource Pools)
**ERRORS**
* snapshot exceeds specified size threshold
**THRESHOLDS**
* CRITICAL: 30 GB size snapshots present
* WARNING: 20 GB size snapshots present
**DETAILED INFO**
Snapshots exceeding WARNING (20GB) or CRITICAL (30GB) size thresholds:
* "RHEL7-TEST" [Age: 1059.21 days, Size (item: 27.3KB, sum: 22.5GB), Name: "Fresh install, activation and patches", ID: snapshot-18946]
* "RHEL7-TEST" [Age: 471.91 days, Size (item: 8.4GB, sum: 22.5GB), Name: "2019-10-15", ID: snapshot-126800]
* "RHEL7-TEST" [Age: 420.81 days, Size (item: 6.3GB, sum: 22.5GB), Name: "2019-12-05", ID: snapshot-138143]
* "RHEL7-TEST" [Age: 305.86 days, Size (item: 7.8GB, sum: 22.5GB), Name: "2020-03-29", ID: snapshot-163887]
* "RHEL7-TEST" [Age: 13.73 days, Size (item: 1.0MB, sum: 22.5GB), Name: "Test Snapshot", ID: snapshot-229096]
* "RHEL7-TEST" [Age: 13.73 days, Size (item: 1.0MB, sum: 22.5GB), Name: "Test Child snapshot", ID: snapshot-229097]
Snapshots *not yet* exceeding size thresholds:
* "TEST-AC-000001" [Age: 11.11 days, Size (item: 2.0MB, sum: 2.0MB), Name: "VM Snapshot 1%252f18%252f2021, 3:29:43 AM", ID: snapshot-229822]
* "TEST-hwv10" [Age: 12.95 days, Size (item: 10.1KB, sum: 2.0MB), Name: "Snap1", ID: snapshot-229336]
* "TEST-hwv10" [Age: 12.95 days, Size (item: 2.0MB, sum: 2.0MB), Name: "Snap2", ID: snapshot-229337]
While working on #66 the language used for that plugin and the other snapshot plugins stood out:
CRITICAL: 2 day old snapshots present
WARNING: 1 day old snapshots present
CRITICAL: snapshots of 50 GB (combined size) present
WARNING: snapshots of 30 GB (combined size) present
These are thresholds, and the description should clearly indicate that. For example, the word present above makes it sound like having a 1 day old snapshot is enough to trigger a WARNING state (if specifying 1 day), but it's not; that is the threshold. The same goes for the 30 GB snapshot. Neither scenario is enough to trigger a WARNING state. Only once the values (age, size) go past the threshold does the state change.
In short, the word present will need to go. I'll also need to review the other threshold statements to make sure they're accurate.
In the old codebase this was implemented as two plugins:
Both plugins allowed excluding individual VMs or resource pools, as did other plugins in the set. I'm not sure yet whether this project will have two plugins or a shared plugin to handle both items. The check-path project uses a shared plugin approach where monitoring criteria can be specified as needed. If not specified, those thresholds are not checked.
* IncludeRP: allow restricting VMs to select Resource Pools
* ExcludeRP: allow excluding a list of Resource Pools
* IgnoreVM: allow excluding a list of individual VMs

A question from @HisArchness on Twitter:
For check_vmware_tools, for instance, it seems it will ignore all virtual machines that is not in a Resource Pool and ignores the default 'Resources' RP altogether. Is there a way to change this behavior with the switches provided?
I don't know the answer off-hand, but this does not sound like the desired behavior for the plugin.
I wrote the original PowerCLI-based Nagios plugin with the intent of using it with standalone ESXi hosts (where on some systems we did not place them in Resource Pools) and with clusters managed by a vCenter instance (where all VMs are managed by Resource Pool). The new plugin is intended to mirror the behavior of the original while adding some additional functionality (and verbose Long Service Output content useful for troubleshooting).
Based on the description alone, there is likely a bug in the plugin's logic. I'll look into this and note my findings.
refs https://twitter.com/HisArchness/status/1353761328591237125
While reviewing the vSphere API for work on #4, I took a closer look at how the space used by each VM on a specific datastore was calculated.
This is the logic as of this writing:
check-vmware/internal/vsphere/datastores.go
Lines 326 to 337 in 48fb7ae
these lines in particular:
check-vmware/internal/vsphere/datastores.go
Lines 326 to 327 in 48fb7ae
Looking at the API docs, it seems that the storage values available from vm.Summary.Storage (vim.vm.Summary.StorageSummary) are an aggregate for all datastores, not just the current one we're examining with this plugin.
refs:
Example output:
OK: No VMware Tools issues detected (evaluated 5 VMs, 1 Resource Pools)
**ERRORS**
* None
**THRESHOLDS**
* Not specified
**DETAILED INFO**
* No VMware Tools issues detected.
The logic for thresholds handling is defined here (from the README):
Tools Status | Nagios State | Description
---|---|---
toolsOk | OK | Ideal state, no problems with VMware Tools (or open-vm-tools) detected.
toolsOld | WARNING | Outdated VMware Tools installation. The host ESXi system was likely recently updated.
toolsNotRunning | CRITICAL | VMware Tools (or open-vm-tools) not currently running. It likely crashed or was terminated due to a low memory scenario.
toolsNotInstalled | CRITICAL | Fresh virtual environment, or VMware Tools removed as part of an upgrade of an existing installation.
OK: Memory usage is at 93.89% of 40 GB allowed (2.45 GB remaining), 0.96% of total capacity. [WARNING: 101% , CRITICAL: 110%]
The 0.96% of total capacity remark seems to be computed using these bits of PowerCLI logic:
$poolDetails = @{
"name" = $_.Name;
"cpuActive" = ($_.Runtime.Cpu.OverallUsage / 1000);
"memoryConsumed" = ($_.Runtime.Memory.OverallUsage / 1GB)
"memoryTotal" = ($_.Runtime.Memory.MaxUsage / 1GB)
}
and
# This property is attached to each entry in the pool; fetch value from first
# array entry.
if ($detailedPools.Count -gt 0) {
$totalMemoryAvailable = $detailedPools[0].memoryTotal
}
$memoryPercentageAllowed = [math]::Round(($totalMemoryUsed / $MaxMemoryAllowed) * 100, 2)
$memoryPercentageTotalCapacity = [math]::Round(($totalMemoryUsed / $totalMemoryAvailable) * 100, 2)
$memoryRemaining = [math]::Round(($MaxMemoryAllowed - $totalMemoryUsed), 2)
Per the Data Object - ResourcePoolResourceUsage(vim.ResourcePool.ResourceUsage) doc, this is what the maxUsage field is about:
NAME | TYPE | DESCRIPTION
---|---|---
maxUsage | xsd:long | Current upper-bound on usage. The upper-bound is based on the limit configured on this resource pool, as well as limits configured on any parent resource pool.
It may be that I was able to compute the total memory available in the cluster due to the memory limit on the pool being unlimited? This doesn't seem like a reliable way to list the overall percentage of memory consumed from the cluster. Instead you'd have to get the list of hosts, tally the total memory, then calculate per pool and in aggregate.
If there are pool caps, that would need to factor in somehow?
Originally posted by @atc0005 in #32 (comment)
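The host-tally approach suggested above could be sketched as follows. This is a hedged illustration: the `Host` type is a simplified stand-in for govmomi's `mo.HostSystem` (whose hardware summary reports physical memory in bytes), and both helper names are hypothetical.

```go
package main

import "fmt"

// Host is a simplified stand-in for mo.HostSystem.
type Host struct {
	Name        string
	MemoryBytes int64
}

// clusterMemoryTotal tallies physical memory across the cluster hosts,
// independent of any resource pool limits.
func clusterMemoryTotal(hosts []Host) int64 {
	var total int64
	for _, h := range hosts {
		total += h.MemoryBytes
	}
	return total
}

// percentOfCluster reports consumed memory as a percentage of the
// summed host capacity.
func percentOfCluster(usedBytes int64, hosts []Host) float64 {
	total := clusterMemoryTotal(hosts)
	if total == 0 {
		return 0
	}
	return float64(usedBytes) / float64(total) * 100
}

func main() {
	hosts := []Host{
		{Name: "esx1", MemoryBytes: 512 << 30},
		{Name: "esx2", MemoryBytes: 512 << 30},
	}
	fmt.Printf("%.2f%%\n", percentOfCluster(40<<30, hosts)) // 3.91%
}
```

Per-pool percentages and any configured pool caps would then be layered on top of this cluster-wide total.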
Some snapshots taken yesterday during a maintenance window were properly flagged today as having a WARNING state, but the one-line summary counts for affected VMs and snapshots were off by 2.
I checked and the logic problem is here:
check-vmware/internal/vsphere/snapshots.go
Lines 228 to 233 in ffd0c23
Specifically, here:
check-vmware/internal/vsphere/snapshots.go
Line 229 in ffd0c23
This should be >=, not just >.
While performing maintenance today I noticed that a fresh snapshot was not showing in the list, just the snapshots which had already hit an age threshold.
Copy/paste/modify error.
Perhaps check for the highest hardware version deployed and use that as the baseline for all other VMs?
If there is a boolean attribute we can check that will make it easier and more reliable. Otherwise, this plugin has to end up waiting for one of the VMs to be upgraded so that all others will be measured accordingly.
The standard suite used in other projects.
Noticed this after deploying the plugin today and pruning snapshots from a prior maintenance window.
Cover basic ground that I use with other projects.
For example, a VM of roughly 22.1 GB on a 7 TB datastore is reported as using 0% of the total storage, when in reality it is closer to 0.28%, assuming I'm doing my math correctly:
22100000000 / 7696581394432 = 0.0028714047013117
0.0028714047013117 * 100 = 0.2871404701311673 ≈ 0.28%
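The 0% result suggests integer math or premature rounding somewhere in the calculation; a hedged sketch of the likely fix (the helper name is hypothetical) is to convert to float64 before dividing so sub-percent usage values survive:

```go
package main

import "fmt"

// usagePercent computes used/capacity as a percentage using float64
// division, so small ratios don't truncate to zero.
func usagePercent(usedBytes, capacityBytes int64) float64 {
	if capacityBytes == 0 {
		return 0
	}
	return float64(usedBytes) / float64(capacityBytes) * 100
}

func main() {
	// ~22.1 GB VM on a ~7 TB datastore, per the example above.
	fmt.Printf("%.3f%%\n", usagePercent(22100000000, 7696581394432)) // 0.287%
}
```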
While copy/pasting a block to setup a new example service check I "noticed" this text which has been included in most (all?) of the service checks:
This may still be relevant (haven't read over it in detail yet) for some service check examples, but likely not all where it has been included.
toolsNotInstalled
toolsNotRunning
toolsOld
* IncludeRP: allow restricting VMs to select Resource Pools
* ExcludeRP: allow excluding a list of Resource Pools
* IgnoreVM: allow excluding a list of individual VMs

toolsNotInstalled
toolsOld
This may take some work to get right, but this plugin is intended to detect VMs housed on datastores that are distant from the hosts that are running them.
In our environment we have a total of 6 hosts. Three are in one datacenter, three are in another datacenter. Years ago the workload was light enough and the network connection between the DCs fast enough that most of our VMs could run on any set of hosts with minimal impact. At present attempting this causes no end of headaches.
Even numbered hosts are in one DC, odd numbered hosts are in the other. Datastores are prefixed with DC location. Knowing this, we can hard-code pair patterns to note when we have a mismatch.
The vSphere structure is composed of only a single datacenter, so we can't use DC separation as a search pattern. We could use a set of flags to specify a set of hostnames and datastore prefixes. The plugin could list all VMs housed on the datastores and verify which hosts they're running on. One of two flags (or a single boolean flag) could identify whether a mismatch is considered WARNING or CRITICAL.
Regarding the service check, I suspect it would be easier to configure one service check per set of datastores & hosts. Presumably, if there were 3 separate locations with storage intended for each (though connected to other hosts as a "fallback" option), this would mean three service checks.
An enhancement to this plugin could pivot to using tags or attributes to identify pairings and alert when a mismatch is found. This is likely the most flexible option for long-term use. This could catch for example an I/O demanding VM running on a lower tier of storage hardware, or a VM used by one team running on a datastore intended for another team.
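The hard-coded pairing idea could be sketched as below. Everything here is an assumption specific to the environment described above: the host names, the even/odd parity rule, the datastore prefix convention, and all three helper names are illustrative only.

```go
package main

import (
	"fmt"
	"strings"
)

// hostDC maps a host like "esx3" to a datacenter label using the
// even/odd host-number convention described above.
func hostDC(host string) string {
	n := host[len(host)-1] - '0'
	if n%2 == 0 {
		return "DC2"
	}
	return "DC1"
}

// datastoreDC extracts the DC location prefix from a datastore name
// like "DC1-S6200-vol12".
func datastoreDC(ds string) string {
	return strings.SplitN(ds, "-", 2)[0]
}

// mismatch reports whether a VM's host and datastore disagree on
// datacenter location.
func mismatch(host, datastore string) bool {
	return hostDC(host) != datastoreDC(datastore)
}

func main() {
	fmt.Println(mismatch("esx3", "DC2-S6200-vol12")) // true
	fmt.Println(mismatch("esx2", "DC2-S6200-vol12")) // false
}
```

The tag/attribute-based enhancement mentioned below would replace these hard-coded conventions with pairings read from vSphere itself.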
References:
Goals/flags:
* IncludeRP: allow restricting VMs to select Resource Pools
* ExcludeRP: allow excluding a list of Resource Pools
* IgnoreVM: allow excluding a list of individual VMs

As a checklist for what to create in this project, here is the shared functionality that I created as a VMware.Monitoring PowerShell module at the end of last Summer:
Connect-VMwareEnvironment.ps1
Get-AvailableSnapshotInfo.ps1
Get-EligibleResourcePools.ps1
Get-EligibleVMs.ps1
Get-NagiosCommonEnvironmentSettings.ps1
Get-ResourcePoolsWithStateInfo.ps1
Get-VMsWithToolsIssues.ps1
Set-NagiosCheckStatus.ps1
Set-SnapshotAgeStateInfo.ps1
Set-SnapshotSizeStateInfo.ps1
Set-VMToolsStateInfo.ps1
VMware.Monitoring.psd1
VMware.Monitoring.psm1
Write-SnapshotInfo.ps1
Write-ToolsInfo.ps1
Not all of these items will have the same form in the new codebase, but this checklist is worth having as I begin building Go replacements.