
check-vmware's Introduction

About me

Role: Systems Administrator

Experience
  • support: troubleshooting, training, documentation
  • proxies & web servers: Squid, Apache, Nginx, HAProxy, IIS
  • mail servers: Postfix, Dovecot, Roundcube, DKIM, Postgrey
  • config/change management: Subversion, Git, Ansible
  • containers: Docker, LXD
  • virtualization: VMware, Hyper-V, VirtualBox
  • databases: MySQL/MariaDB, PostgreSQL, Microsoft SQL Server
  • monitoring: Nagios, custom tooling, Microsoft Teams, fail2ban
  • logging: rsyslog (local, central receivers), Graylog
  • ticketing: Redmine, GitHub, GitLab, Service Now

Role: Intermediate developer

Experience
  • current:
    • Go, Python, PowerShell, shell scripting
    • MySQL/MariaDB, SQLite
    • Docker, LXD
    • Markdown, Textile, MediaWiki, reStructuredText, HTML, CSS
    • Redmine, GitHub (including GitHub Actions), Gitea, GitLab
  • past: batch files (don't laugh, it gets the job done), Perl
  • academic: C, C++

check-vmware's People

Contributors

atc0005, dependabot[bot]

check-vmware's Issues

Compare old/new plugin output for missing details

Off the top of my head, I'm thinking of the CRITICAL and WARNING threshold details shown in the one-line summary output of the older plugins. Those details make it possible to see at a glance why a Service Check has been determined to be in a non-OK state.

Create plugin to monitor snapshots

Overview

In the old codebase this was implemented as two plugins:

  • size
  • age

Both plugins allowed excluding individual VMs or resource pools as did other plugins in the set. I'm not sure yet whether this project will have two plugins or a shared plugin to handle both items. The check-path project uses a shared plugin approach where monitoring criteria can be specified as needed. If not specified, those thresholds are not checked.

Goals

  • accept CRITICAL/WARNING threshold values (with useful default values)
  • (IncludeRP) allow restricting VMs to select Resource Pools
  • optional User Domain (with automatic selection applied if not given)
  • (ExcludeRP) allow excluding a list of Resource Pools
    • reverse mode where VMs from all pools are checked, except for any VMs in this optional list of Resource Pools
  • (IgnoreVM) allow excluding a list of individual VMs
  • skip cert validation
  • emit ManagedObjectReference ID value in the Long Service Output
    • won't be needed for the vast majority of use cases, but could be useful with troubleshooting work

References

Create plugin for detecting whether a Virtual Machine is running on a snapshot?

Data object: VirtualMachineSnapshotInfo(vim.vm.SnapshotInfo)

Property: currentSnapshot

Description:

Current snapshot of the virtual machine

This property is set by calling RevertToSnapshot_Task or CreateSnapshot_Task. This property will be empty when the working snapshot is at the root of the snapshot tree.

Idea: Report any virtual machines running with a snapshot active. Flags could allow specifying a time range for WARNING and CRITICAL states. Perhaps support a flag that toggles whether any active snapshot is enough to trigger an alert (presumably a WARNING state).

refs https://vdc-download.vmware.com/vmwb-repository/dcr-public/a5f4000f-1ea8-48a9-9221-586adff3c557/7ff50256-2cf2-45ea-aacd-87d231ab1ac7/vim.vm.SnapshotInfo.html#field_detail

Create plugin to monitor vCPUs allocation

Goals/flags:

  • (IncludeRP) allow restricting VMs to select Resource Pools
  • accept CRITICAL/WARNING threshold values (with useful default values)
  • accept Max vCPUs allowed value
  • optional User Domain (with automatic selection applied if not given)
  • (ExcludeRP) allow excluding a list of Resource Pools
    • reverse mode where VMs from all pools are checked, except for any VMs in this optional list of Resource Pools
  • (IgnoreVM) allow excluding a list of individual VMs
  • skip cert validation
  • optional power state override
    • powered on VMs only
    • powered off VMs also

GoDoc coverage missing for project plugins

Documentation coverage per pkg.go.dev listing:

[screenshot: pkg.go.dev documentation listing]

Coverage is already provided by the README, so it shouldn't be too much work to copy/paste into new small doc.go files, one per plugin directory.

Review and update threshold listings in extended output

While working on #66 the language used for that plugin and the other snapshot plugins stood out:

  • snapshots age
    • CRITICAL: 2 day old snapshots present
    • WARNING: 1 day old snapshots present
  • snapshots size
    • CRITICAL: snapshots of 50 GB (combined size) present
    • WARNING: snapshots of 30 GB (combined size) present

These are thresholds, and the description should clearly indicate that. For example, the word "present" above makes it sound like having a 1 day old snapshot is enough to trigger a WARNING state (if specifying 1 day), but it is not; that value is the threshold. The same goes for the 30 GB snapshot. Neither scenario is enough to trigger a WARNING state.

The state changes only once the values (age, size) go past the threshold.

In short, the word "present" will need to go. I'll also need to review the other threshold statements to make sure they're accurate.

check_vmware_snapshots_age: Misreported VMs, snapshots count

Some snapshots taken yesterday during a maintenance window were properly flagged today as having a WARNING state, but the one-line summary counts for affected VMs and snapshots were off by 2.

I checked and the logic problem is here:

for _, set := range sss {
	if set.ExceedsAge(days) > 1 {
		setsExceeded++
		snapshotsExceeded += set.ExceedsAge(days)
	}
}

Specifically, here:

if set.ExceedsAge(days) > 1 {

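This should be >=, not just >. As written, a set with exactly one snapshot exceeding the age threshold is skipped, so neither the VM nor its snapshot is counted.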
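A minimal sketch of the corrected tally follows. The `snapshotSet` type and its fields are hypothetical stand-ins (the real type behind `sss` is not shown in this issue); only the `ExceedsAge(days)` method name and the two counters come from the snippet above.

```go
package main

import "fmt"

// snapshotSet is a hypothetical stand-in for the plugin's per-VM snapshot
// summary type; the real type is not shown in the issue.
type snapshotSet struct {
	vmName     string
	agesInDays []float64
}

// ExceedsAge returns how many snapshots in the set are older than the given
// number of days.
func (s snapshotSet) ExceedsAge(days int) int {
	var count int
	for _, age := range s.agesInDays {
		if age > float64(days) {
			count++
		}
	}
	return count
}

// countExceeded tallies the sets (VMs) and the individual snapshots that
// exceed the age threshold. Using >= 1 ensures that a VM with exactly one
// old snapshot is counted.
func countExceeded(sets []snapshotSet, days int) (setsExceeded, snapshotsExceeded int) {
	for _, set := range sets {
		if n := set.ExceedsAge(days); n >= 1 {
			setsExceeded++
			snapshotsExceeded += n
		}
	}
	return setsExceeded, snapshotsExceeded
}

func main() {
	sets := []snapshotSet{
		{vmName: "vm1", agesInDays: []float64{1.5}},
		{vmName: "vm2", agesInDays: []float64{1.2}},
		{vmName: "vm3", agesInDays: []float64{0.2}},
	}
	vms, snapshots := countExceeded(sets, 1)
	fmt.Printf("%d VMs, %d snapshots exceed the age threshold\n", vms, snapshots)
}
```

Calling `ExceedsAge` once per set and reusing the result also avoids computing the same count twice.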

Should snapshots for powered off VMs be ignored by default?

Currently this is not the assumption. Snapshots are subject to both Age and Size checks by default, regardless of a VM's power state. While ignoring issues for powered off VMs by default makes sense to me in some cases (e.g., VMware Tools versions), ignoring powered off VMs seems more risky when dealing with snapshots.

Opening this issue to invite feedback from others.

Snippet currently used in the README, cmd-specific doc files:

The current design of this plugin is to evaluate all Virtual Machines, whether powered off or powered on. If you have a use case for evaluating only powered on VMs by default, please add a comment to GH-79 providing some details for your use-case. In our environment, I have yet to see a need to only evaluate powered on VMs for old snapshots. For cases where the snapshots needed to be ignored, we added the VM to the ignore list. We then relied on datastore usage monitoring to let us know when space was becoming an issue.

Create plugin to monitor VMware Tools status

Base goals

  • automatically apply Nagios state based on the specific VMware Tools status (as noted later on this issue)
    • toolsNotInstalled
      • CRITICAL
    • toolsNotRunning
      • CRITICAL
    • toolsOld
      • WARNING
    • default, "Unknown issue with VMware Tools, flag this for research"
  • (IncludeRP) allow restricting VMs to select Resource Pools
  • optional User Domain (with automatic selection applied if not given)
  • (ExcludeRP) allow excluding a list of Resource Pools
    • reverse mode where VMs from all pools are checked, except for any VMs in this optional list of Resource Pools
  • (IgnoreVM) allow excluding a list of individual VMs
  • allow skipping cert validation
  • Long Service Output section (i.e., "DETAILED INFO")
    • Explicitly list (even if empty) which VMs have been excluded
      • this feature is present in the PowerShell version of the plugin
    • Explicitly list (even if empty) which resource pools have been excluded or included
      • this feature is not present in either version of the plugin
    • Consider listing remote server
    • Consider listing remote port
    • Consider listing remote vCenter URL

Stretch goals

  • toggle to extend matches to powered off VMs also
    • defaults to limiting results to powered on VMs only
    • if enabling this setting, other states would be checked
      • toolsNotInstalled
      • toolsOld

Create plugin for VMs with an active "Question"

  • Summary.Runtime.Question
    • if set, the VM is waiting for an answer (interactively)

I found that in at least one case a VM crashed due to lack of feedback on one of these prompts. That was some time ago, so this is likely not as great an issue as it once was, but this could still prove useful.

Create plugin for detecting whether a host is connected to a vCenter instance?

From vmware/govmomi issue 2257:

If you connect to an ESX host with govc, you can check this way:

% govc object.collect -s -type h / summary.managementServerIp                       
10.182.4.228

It'll be empty if not connected to any vCenter.

See also the managementServerIp field:

IP address of the VirtualCenter server managing this host, if any.

refs https://vdc-download.vmware.com/vmwb-repository/dcr-public/a5f4000f-1ea8-48a9-9221-586adff3c557/7ff50256-2cf2-45ea-aacd-87d231ab1ac7/vim.host.Summary.html

Create plugin to monitor host CPU usage

This plugin is intended to monitor a specific host. This is intended to help identify hosts that are overburdened in a shared hosting environment where an automated rebalancing policy may not be in effect.


See also GH-7 and GH-5.

Note: This may be folded into the work for GH-7.

Create plugin to monitor VM "power cycle" uptime

  • Summary.QuickStats.UptimeSeconds
    • e.g., too long of an uptime means that a kernel update didn't install properly (they're usually released monthly)

Not sure if this is based on power state, or guest OS uptime. If the former, this might require setting a lengthy value in order to be useful. For example, the power state "uptime" for some VMs could be many months at a time if there isn't a hard requirement to shut it down. This is with regular maintenance, OS updates and reboots.

If the Summary.QuickStats.UptimeSeconds value is tied to a VM "reboot", then that will do nicely.

Allow specifying lists of values with or without quotes

Flags for these values currently support comma-separated lists of items:

  • Ignored vms
  • Ignored datastores
  • Excluded resource pools
  • Included resource pools

This works if the whole collection is double-quoted (quotes removed by shell presumably?), but not if the individual items are quoted.

Examples:

  • works
    • "item 1, item 2, item3, item 4"
  • does not work
    • "item1", "item2", "item3", "item4"
    • '"item1", "item2", "item3", "item4"'
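One possible way to accept both forms is to strip optional quotes from each item after splitting on commas. The sketch below is an illustration only: `parseList` is a hypothetical helper, not the project's flag-handling code, and it assumes item names never contain commas or quote characters themselves.

```go
package main

import (
	"fmt"
	"strings"
)

// parseList splits a comma-separated flag value into items, trimming
// surrounding whitespace and any single or double quotes around each item.
// Empty items are dropped. (Hypothetical helper; assumes no item contains a
// comma or a quote character.)
func parseList(value string) []string {
	var items []string
	for _, item := range strings.Split(value, ",") {
		item = strings.TrimSpace(item)
		item = strings.Trim(item, `"'`)
		item = strings.TrimSpace(item)
		if item != "" {
			items = append(items, item)
		}
	}
	return items
}

func main() {
	inputs := []string{
		`item 1, item 2, item3, item 4`,
		`"item1", "item2", "item3", "item4"`,
	}
	for _, input := range inputs {
		fmt.Printf("%q\n", parseList(input))
	}
}
```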

Create plugin for reporting connected optical drives?

This would likely prove incredibly annoying if it runs frequently, so the docs would need to suggest that the retry frequency be set high enough to reflect a forgotten ISO, vs one in active use to install or rescue an operating system.

Create plugin to monitor virtual hardware version

Perhaps check for the highest hardware version deployed and use that as the baseline for all other VMs?

If there is a boolean attribute we can check, that will make it easier and more reliable. Otherwise, this plugin has to wait for one of the VMs to be upgraded so that all others will be measured accordingly.

vCPUs plugin: "more than allowed" error appears to be incorrect

As shown here:

nagiosExitState.LastError = fmt.Errorf(
	"%d of %d vCPUs allocated (%0.1f%% more than allowed)",
	vCPUsAllocated,
	cfg.VCPUsMaxAllowed,
	vCPUsPercentageUsedOfAllowed,
)

I believe I see what I intended, but I would need to either reword this statement or fix the math.

For example, let's say the allocation percentage is 110%.

This would mean that the wording should be:

  • 110% of allowed
  • 10% more than allowed
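The two wordings correspond to two different calculations. The following sketch shows both side by side; the function names are hypothetical, and it assumes the maximum allowed value is greater than zero.

```go
package main

import "fmt"

// percentOfAllowed returns the allocation as a percentage of the maximum
// allowed, e.g. 110 of 100 is 110%. Assumes maxAllowed > 0.
func percentOfAllowed(allocated, maxAllowed int) float64 {
	return float64(allocated) / float64(maxAllowed) * 100
}

// percentOverAllowed returns how far the allocation exceeds the maximum
// allowed, as a percentage of that maximum, e.g. 110 of 100 is 10% over.
// It returns 0 when the allocation is within the limit. Assumes
// maxAllowed > 0.
func percentOverAllowed(allocated, maxAllowed int) float64 {
	if allocated <= maxAllowed {
		return 0
	}
	return float64(allocated-maxAllowed) / float64(maxAllowed) * 100
}

func main() {
	allocated, maxAllowed := 110, 100
	fmt.Printf(
		"%d of %d vCPUs allocated (%0.1f%% of allowed, %0.1f%% more than allowed)\n",
		allocated,
		maxAllowed,
		percentOfAllowed(allocated, maxAllowed),
		percentOverAllowed(allocated, maxAllowed),
	)
}
```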

VMs outside of Resource Pools excluded from evaluation

A question from @HisArchness on Twitter:

For check_vmware_tools, for instance, it seems it will ignore all virtual machines that is not in a Resource Pool and ignores the default 'Resources' RP altogether. Is there a way to change this behavior with the switches provided?

I don't know the answer off-hand, but this does not sound like the desired behavior for the plugin.

I wrote the original PowerCLI-based Nagios plugin with the intent of using it with standalone ESXi hosts (where on some systems we did not place them in Resource Pools) and with clusters managed by a vCenter instance (where all VMs are managed by Resource Pool). The new plugin is intended to mirror the behavior of the original while adding some additional functionality (and verbose Long Service Output content useful for troubleshooting).

Based on the description alone, there is likely a bug in the plugin's logic. I'll look into this and note my findings.

refs https://twitter.com/HisArchness/status/1353761328591237125

Add thresholds support for virtual hardware version plugin

By default the flag values could be unset or otherwise configured to provide the same behavior as the version of the plugin created for GH-15.

This enhancement would add support for determining a distance from current version to highest version and use that to set CRITICAL or WARNING states.
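As a rough sketch of the idea, the distance could be computed from the numeric part of each hardware version string. The `vmx-NN` string format, the function names, the threshold semantics (distance at or above a threshold triggers the state) and the example values are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// hardwareVersionNumber extracts the numeric part of a hardware version
// string such as "vmx-13" (assumed format).
func hardwareVersionNumber(version string) (int, error) {
	number, err := strconv.Atoi(strings.TrimPrefix(version, "vmx-"))
	if err != nil {
		return 0, fmt.Errorf("unexpected hardware version %q: %w", version, err)
	}
	return number, nil
}

// versionDistanceState maps the distance between a VM's hardware version and
// the highest version found to a Nagios state label. The thresholds are
// hypothetical: a distance at or above a threshold triggers that state.
func versionDistanceState(current, highest, warning, critical int) string {
	distance := highest - current
	switch {
	case distance >= critical:
		return "CRITICAL"
	case distance >= warning:
		return "WARNING"
	default:
		return "OK"
	}
}

func main() {
	highest, _ := hardwareVersionNumber("vmx-15")
	for _, version := range []string{"vmx-15", "vmx-13", "vmx-10"} {
		current, err := hardwareVersionNumber(version)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s: %s\n", version, versionDistanceState(current, highest, 1, 5))
	}
}
```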

check_vmware_datastore | Datastore-specific storage usage for VMs appears to be incorrect

While reviewing the vSphere API for work on #4, I took a closer look at how the space used by each VM on a specific datastore was calculated.

This is the logic as of this writing:

for _, vm := range dsVMs {
	vmStorageUsed := vm.Summary.Storage.Committed + vm.Summary.Storage.Uncommitted
	vmPercentOfDSUsed := float64(vmStorageUsed) / float64(dsUsageSummary.StorageTotal) * 100
	fmt.Fprintf(
		tw,
		"%s\t%v\t%1.f%%%s",
		vm.Name,
		units.ByteSize(vmStorageUsed),
		vmPercentOfDSUsed,
		nagios.CheckOutputEOL,
	)
}

these lines in particular:

for _, vm := range dsVMs {
	vmStorageUsed := vm.Summary.Storage.Committed + vm.Summary.Storage.Uncommitted

Looking at the API docs, it seems that the storage values available from vm.Summary.Storage (vim.vm.Summary.StorageSummary) are an aggregate for all datastores, not just the current one we're examining with this plugin.
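One possible direction is to use per-datastore usage entries instead of the aggregate summary. The vSphere API describes a perDatastoreUsage list on the VM's storage info; the sketch below models that list with a local stand-in type so it runs without govmomi. The type, field names, and datastore IDs here are assumptions for illustration and are not taken from the plugin.

```go
package main

import "fmt"

// usageOnDatastore is a local stand-in for a per-datastore usage entry
// (assumed shape; only the fields needed here are included).
type usageOnDatastore struct {
	Datastore   string // datastore identifier, e.g. "datastore-101" (made up)
	Committed   int64  // bytes
	Uncommitted int64  // bytes
}

// storageUsedOnDatastore sums the committed and uncommitted bytes a VM
// reports for one specific datastore, ignoring its usage on all others.
func storageUsedOnDatastore(perDatastoreUsage []usageOnDatastore, datastoreID string) int64 {
	var used int64
	for _, usage := range perDatastoreUsage {
		if usage.Datastore == datastoreID {
			used += usage.Committed + usage.Uncommitted
		}
	}
	return used
}

func main() {
	vmUsage := []usageOnDatastore{
		{Datastore: "datastore-101", Committed: 40 << 30, Uncommitted: 10 << 30},
		{Datastore: "datastore-202", Committed: 200 << 30, Uncommitted: 0},
	}
	used := storageUsedOnDatastore(vmUsage, "datastore-101")
	fmt.Printf("VM uses %d GiB on datastore-101\n", used>>30)
}
```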

refs:

Various linting issues exposed from enabling GHAWs

The following linting issues were exposed from dropping in the https://github.com/atc0005/check-vmware/blob/master/.golangci.yml file as part of enabling GitHub Actions Workflows for this repo:

$ make linting
Running linting tools ...
Running go vet ...
Running golangci-lint ...
internal/vsphere/datastores.go:123:12: string `error: datacenter not provided, failed to fallback to default datacenter` has 3 occurrences, make it a constant (goconst)
                errMsg = "error: datacenter not provided, failed to fallback to default datacenter"
                         ^
internal/vsphere/datastores.go:126:12: string `error: failed to use provided datacenter, failed to fallback to default datacenter` has 3 occurrences, make it a constant (goconst)
                errMsg = "error: failed to use provided datacenter, failed to fallback to default datacenter"
                         ^
internal/config/constants.go:54:2: exported const PluginTypeTools should have comment (or a comment on this block) or be unexported (golint)
        PluginTypeTools                 string = "vmware-tools"
        ^
internal/vsphere/constants.go:10:1: comment on exported const `ParentResourcePool` should be of the form `ParentResourcePool ...` (golint)
// Virtual machine hosts have a hidden resource pool named Resources, which is
^
internal/vsphere/login.go:20:1: exported function `Login` should have comment or be unexported (golint)
func Login(
^
internal/vsphere/resource-pools.go:134:1: exported function `GetEligibleRPs` should have comment or be unexported (golint)
func GetEligibleRPs(ctx context.Context, c *vim25.Client, includeRPs []string, excludeRPs []string, propsSubset bool) ([]mo.ResourcePool, error) {
^
internal/vsphere/tools.go:86:1: exported function `VMToolsOneLineCheckSummary` should have comment or be unexported (golint)
func VMToolsOneLineCheckSummary(stateLabel string, vmsWithIssues []mo.VirtualMachine, evaluatedVMs []mo.VirtualMachine, rps []mo.ResourcePool) string {
^
internal/vsphere/tools.go:110:1: exported function `VMToolsReport` should have comment or be unexported (golint)
func VMToolsReport(
^
internal/vsphere/vms.go:143:1: comment on exported function `GetVMsFromRPs` should be of the form `GetVMsFromRPs ...` (golint)
// GetVMsFromRP receives a list of ResourcePool object references and returns
^
internal/config/config.go:85:13: struct of size 272 bytes could be of size 264 bytes (maligned)
type Config struct {
            ^
internal/vsphere/resource-pools.go:81:2: Consider preallocating `poolNamesFound` (prealloc)
        var poolNamesFound []string
        ^
Makefile:114: recipe for target 'linting' failed
make: *** [linting] Error 1

snapshots size plugin properly detects WARNING cumulative size state, but unhelpfully notes 0 (individual) snapshots exceeding size

Example output:

WARNING: 0 snapshots larger than 20 GB detected (evaluated 86 VMs, 4 Resource Pools)

**ERRORS**

* snapshot exceeds specified size threshold

**THRESHOLDS**

* CRITICAL: 30 GB size snapshots present
* WARNING: 20 GB size snapshots present

**DETAILED INFO**

Snapshots exceeding WARNING (20GB) or CRITICAL (30GB) size thresholds:

* "RHEL7-TEST" [Age: 1059.21 days, Size (item: 27.3KB, sum: 22.5GB), Name: "Fresh install, activation and patches", ID: snapshot-18946]
* "RHEL7-TEST" [Age: 471.91 days, Size (item: 8.4GB, sum: 22.5GB), Name: "2019-10-15", ID: snapshot-126800]
* "RHEL7-TEST" [Age: 420.81 days, Size (item: 6.3GB, sum: 22.5GB), Name: "2019-12-05", ID: snapshot-138143]
* "RHEL7-TEST" [Age: 305.86 days, Size (item: 7.8GB, sum: 22.5GB), Name: "2020-03-29", ID: snapshot-163887]
* "RHEL7-TEST" [Age: 13.73 days, Size (item: 1.0MB, sum: 22.5GB), Name: "Test Snapshot", ID: snapshot-229096]
* "RHEL7-TEST" [Age: 13.73 days, Size (item: 1.0MB, sum: 22.5GB), Name: "Test Child snapshot", ID: snapshot-229097]

Snapshots *not yet* exceeding size thresholds:

* "TEST-AC-000001" [Age: 11.11 days, Size (item: 2.0MB, sum: 2.0MB), Name: "VM Snapshot 1%252f18%252f2021, 3:29:43 AM", ID: snapshot-229822]
* "TEST-hwv10" [Age: 12.95 days, Size (item: 10.1KB, sum: 2.0MB), Name: "Snap1", ID: snapshot-229336]
* "TEST-hwv10" [Age: 12.95 days, Size (item: 2.0MB, sum: 2.0MB), Name: "Snap2", ID: snapshot-229337]

check_vmware_tools plugin does not clearly define what thresholds are used for service check logic

Example output:

OK: No VMware Tools issues detected (evaluated 5 VMs, 1 Resource Pools)

**ERRORS**

* None

**THRESHOLDS**

* Not specified

**DETAILED INFO**

* No VMware Tools issues detected.

The logic for thresholds handling is defined here (from the README):

| Tools Status | Nagios State | Description |
| --- | --- | --- |
| toolsOk | OK | Ideal state, no problems with VMware Tools (or open-vm-tools) detected. |
| toolsOld | WARNING | Outdated VMware Tools installation. The host ESXi system was likely recently updated. |
| toolsNotRunning | CRITICAL | VMware Tools (or open-vm-tools) not currently running. It likely crashed or was terminated due to low memory scenario. |
| toolsNotInstalled | CRITICAL | Fresh virtual environment, or VMware Tools removed as part of an upgrade of an existing installation. |
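The table above can be expressed as a small lookup, which could also be used to render the THRESHOLDS section instead of "Not specified". This is a sketch only: the function name is hypothetical, and mapping unrecognized statuses to UNKNOWN is an assumption, since the README table does not cover that case.

```go
package main

import "fmt"

// toolsStatusToState maps a VMware Tools status value to the Nagios state
// label listed in the README table. Unrecognized values map to UNKNOWN,
// which is an assumption made for this sketch.
func toolsStatusToState(status string) string {
	switch status {
	case "toolsOk":
		return "OK"
	case "toolsOld":
		return "WARNING"
	case "toolsNotRunning", "toolsNotInstalled":
		return "CRITICAL"
	default:
		return "UNKNOWN"
	}
}

func main() {
	statuses := []string{"toolsOk", "toolsOld", "toolsNotRunning", "toolsNotInstalled"}
	for _, status := range statuses {
		fmt.Printf("* %s: %s\n", toolsStatusToState(status), status)
	}
}
```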

Add support for listing Resource Pool memory usage as percentage of total cluster capacity

OK: Memory usage is at 93.89% of 40 GB allowed (2.45 GB remaining), 0.96% of total capacity. [WARNING: 101% , CRITICAL: 110%]

The 0.96% of total capacity remark seems to be computed using these bits of PowerCLI logic:

$poolDetails = @{
    "name" = $_.Name;
    "cpuActive" = ($_.Runtime.Cpu.OverallUsage / 1000);
    "memoryConsumed" = ($_.Runtime.Memory.OverallUsage / 1GB)
    "memoryTotal" = ($_.Runtime.Memory.MaxUsage / 1GB)
}

and

# This property is attached to each entry in the pool; fetch value from first
# array entry.
if ($detailedPools.Count -gt 0) {
    $totalMemoryAvailable = $detailedPools[0].memoryTotal
}

$memoryPercentageAllowed = [math]::Round(($totalMemoryUsed / $MaxMemoryAllowed) * 100, 2)
$memoryPercentageTotalCapacity = [math]::Round(($totalMemoryUsed / $totalMemoryAvailable) * 100, 2)
$memoryRemaining = [math]::Round(($MaxMemoryAllowed - $totalMemoryUsed), 2)

Per the Data Object - ResourcePoolResourceUsage(vim.ResourcePool.ResourceUsage) doc, this is what the maxUsage field is about:

| NAME | TYPE | DESCRIPTION |
| --- | --- | --- |
| maxUsage | xsd:long | Current upper-bound on usage. The upper-bound is based on the limit configured on this resource pool, as well as limits configured on any parent resource pool. |

It may be that I was able to compute the total memory available in the cluster due to the memory limit on the pool being unlimited? This doesn't seem like a reliable way to list the overall percentage of memory consumed from the cluster. Instead you'd have to get the list of hosts, tally the total memory, then calculate per pool and in aggregate.

If there are pool caps, that would need to factor in somehow?
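The host-based approach described above reduces to simple arithmetic once the values are collected. The sketch below assumes the per-pool memory usage and per-host memory sizes have already been retrieved (in bytes); the function names and sample values are hypothetical, and pool caps are not factored in.

```go
package main

import "fmt"

// sum adds up a list of byte counts.
func sum(values []int64) int64 {
	var total int64
	for _, value := range values {
		total += value
	}
	return total
}

// percentOfClusterMemory returns the combined pool memory usage as a
// percentage of the combined host memory. It returns 0 if no host memory is
// reported. Pool limits are not considered in this sketch.
func percentOfClusterMemory(poolUsageBytes, hostMemoryBytes []int64) float64 {
	total := sum(hostMemoryBytes)
	if total == 0 {
		return 0
	}
	return float64(sum(poolUsageBytes)) / float64(total) * 100
}

func main() {
	const gib = int64(1) << 30
	poolUsage := []int64{20 * gib, 12 * gib, 5 * gib}
	hostMemory := []int64{512 * gib, 512 * gib, 512 * gib}
	fmt.Printf(
		"Memory usage is %0.2f%% of total capacity\n",
		percentOfClusterMemory(poolUsage, hostMemory),
	)
}
```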

Originally posted by @atc0005 in #32 (comment)

Recreate shared functionality from prior PowerShell (PowerCLI-based) module

As a checklist for what to create in this project, here is the shared functionality that I created as a VMware.Monitoring PowerShell module at the end of last Summer:

  • Connect-VMwareEnvironment.ps1
  • Get-AvailableSnapshotInfo.ps1
  • Get-EligibleResourcePools.ps1
  • Get-EligibleVMs.ps1
  • Get-NagiosCommonEnvironmentSettings.ps1
  • Get-ResourcePoolsWithStateInfo.ps1
  • Get-VMsWithToolsIssues.ps1
  • Set-NagiosCheckStatus.ps1
  • Set-SnapshotAgeStateInfo.ps1
  • Set-SnapshotSizeStateInfo.ps1
  • Set-VMToolsStateInfo.ps1
  • VMware.Monitoring.psd1
  • VMware.Monitoring.psm1
  • Write-SnapshotInfo.ps1
  • Write-ToolsInfo.ps1

Not all of these items will have the same form in the new codebase, but this checklist is worth having as I begin building Go replacements.

Review contrib vc1.example.com Nagios host config file

While copy/pasting a block to setup a new example service check I "noticed" this text which has been included in most (all?) of the service checks:

# Virtual machine hosts have a hidden resource pool named 'Resources',
# which is a parent of all resource pools of the host. This pool throws
# off our calculations, so we explicitly ignore it in the script logic
# itself. Because of that, we do NOT have to list it here.
# https://code.vmware.com/docs/9638/cmdlet-reference/doc/Get-ResourcePool.html
# https://pubs.vmware.com/vsphere-51/topic/com.vmware.powercli.cmdletref.doc/Get-ResourcePool.html

This may still be relevant (haven't read over it in detail yet) for some service check examples, but likely not all where it has been included.

Create plugin to monitor for mismatched storage/host pairings (using Custom Attributes)

This may take some work to get right, but this plugin is intended to detect VMs housed on datastores that are distant to the hosts that are running them.

In our environment we have a total of 6 hosts. Three are in one datacenter, three are in another datacenter. Years ago the workload was light enough and the network connection between the DCs fast enough that most of our VMs could run on any set of hosts with minimal impact. At present attempting this causes no end of headaches.

Even numbered hosts are in one DC, odd numbered hosts are in the other. Datastores are prefixed with DC location. Knowing this, we can hard-code pair patterns to note when we have a mismatch.

The vSphere structure is composed of only a single datacenter, so we can't use DC separation as a search pattern. We could use a set of flags to specify a set of hostnames and datastore prefixes. The plugin could list all VMs housed on the datastores and verify what hosts they're running on. One of two flags (or a single boolean flag) could identify whether a mismatch is considered WARNING or CRITICAL.

Regarding the service check, I suspect it would be easier to configure one service check per set of datastores & hosts. Presumably this would mean if there were 3 separate locations with storage intended for each (though connected to other hosts as a "fallback" option), this would mean three service checks.

An enhancement to this plugin could pivot to using tags or attributes to identify pairings and alert when a mismatch is found. This is likely the most flexible option for long-term use. This could catch for example an I/O demanding VM running on a lower tier of storage hardware, or a VM used by one team running on a datastore intended for another team.


References:

Create plugin to monitor host memory

Unlike GH-5 which is intended to monitor a percentage of a set amount of memory across a cluster (e.g., are "we" within our leased memory range), this plugin is intended to monitor a specific host. This is intended to help identify hosts that are overburdened in a shared hosting environment where an automated rebalancing policy may not be in effect.

Create plugin to monitor Resource Pools

Goals:

  • Accept max memory
  • accept CRITICAL/WARNING threshold values (with useful default values)
  • (IncludeRP) allow restricting VMs to select Resource Pools
  • optional User Domain (with automatic selection applied if not given)
  • (ExcludeRP) allow excluding a list of Resource Pools
    • reverse mode where VMs from all pools are checked, except for any VMs in this optional list of Resource Pools
  • skip cert validation

Choice of including/excluding VMs from evaluation based on power status not exposed

One thing not clearly noted in current one-line summary or Long Service Output results is whether the plugin was asked to include or exclude VMs from evaluation based on their power status.

This should probably be noted for all plugins which allow filtering on power status.

The same goes for any other explicit evaluation criteria toggled by the sysadmin configuring the service check command definition. Choices there should be explicitly noted in the Long Service Output, if not in the one-line summary.

Originally posted by @atc0005 in #32 (comment)

Create a plugin to report whether a VM has exceeded a specified max number of snapshots

I read recently that VMware supports no more than 32 snapshots per VM. I think the recommended maximum number was around 3-4, but only for a short period of time.

This plugin should look for any VM with more than X snapshots and flag it as problematic. Perhaps the WARNING threshold at 4 (default), then CRITICAL somewhere before 32, maybe 25 snapshots (80% of 32, rounded down).
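A sketch of the proposed defaults and state logic follows. The constant and function names are hypothetical; the values (WARNING at 4, CRITICAL at 80% of 32, rounded down) come from the paragraph above, and "more than X" is read as a strict comparison.

```go
package main

import "fmt"

const (
	// maxSupportedSnapshots is the per-VM limit mentioned above.
	maxSupportedSnapshots = 32

	// defaultWarning is the proposed default WARNING threshold.
	defaultWarning = 4

	// defaultCritical is 80% of the supported maximum, rounded down by
	// integer division (32 * 80 / 100 = 25).
	defaultCritical = maxSupportedSnapshots * 80 / 100
)

// snapshotCountState returns the Nagios state label for a VM with the given
// number of snapshots. A count must exceed a threshold to trigger its state.
func snapshotCountState(count, warning, critical int) string {
	switch {
	case count > critical:
		return "CRITICAL"
	case count > warning:
		return "WARNING"
	default:
		return "OK"
	}
}

func main() {
	for _, count := range []int{0, 4, 5, 25, 26} {
		fmt.Printf(
			"%d snapshots: %s\n",
			count,
			snapshotCountState(count, defaultWarning, defaultCritical),
		)
	}
}
```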

check_vmware_snapshots_age plugin: incomplete logic for young snapshots switch case

This statement does not handle the zero snapshots scenario properly:

case !snapshotSummarySets.IsAgeCriticalState() && !snapshotSummarySets.IsAgeWarningState():

Because zero snapshots meets that case statement logic, it triggers instead of allowing the default (and intended here) logic to trigger:

default:
	fmt.Fprintln(&report, "* None detected")
}
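One way to let the zero snapshots scenario reach the default branch is to require at least one snapshot in the case condition. The sketch below models only the two branches quoted above; the function, its parameters, and the message text for the non-default branch are hypothetical, and the other cases of the real switch statement are omitted.

```go
package main

import "fmt"

// youngSnapshotsSummary models the two switch branches quoted in this issue.
// The isCritical and isWarning parameters stand in for the results of
// IsAgeCriticalState() and IsAgeWarningState().
func youngSnapshotsSummary(snapshotCount int, isCritical, isWarning bool) string {
	switch {
	// Requiring at least one snapshot keeps the zero snapshots scenario from
	// matching here, so it falls through to the default branch.
	case snapshotCount > 0 && !isCritical && !isWarning:
		return fmt.Sprintf("* %d snapshots not yet exceeding age thresholds", snapshotCount)
	default:
		return "* None detected"
	}
}

func main() {
	fmt.Println(youngSnapshotsSummary(0, false, false))
	fmt.Println(youngSnapshotsSummary(3, false, false))
}
```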

Plugins require write permission on home directory in order to cache login sessions

When deploying the check_vmware_vcpus plugin today I ran into this error:

mkdir .govmomi: permission denied

Light digging indicated it was related to sessions support. We're using that here:

// Use session cache to help avoid "leaking sessions"; Session.Login will
// only create a new authenticated session if the cached session does not
// exist or is invalid.
s := &cache.Session{
	URL:      u,
	Insecure: trustCert,
}

Create plugin to monitor (vCenter) server time

Per the docs, methods.GetCurrentTime(ctx, c) will retrieve the vCenter server time in UTC.

We should be able to gather the current time from a reference NTP server and compare against this value. If the difference is more than X, then one state, if more than Y, then another state.

Not sure if this capability is present for standalone ESXi hosts or only through vCenter.

vsphere.getObjects accepts unsupported types.ManagedObjectReference for use with CreateContainerView

While working on #6 earlier I thought I'd be clever and use the current datastore as a "container" for a view. The idea was that the view would be limited to just the VMs in the datastore.

This resulted in this error "bubbling up":

`ServerFaultCode: A specified parameter was not correct: container`

After digging into the docs (see below), I learned that only a subset of vSphere inventory types could be used as a container for a view:

The Folder, Datacenter, ComputeResource, ResourcePool, or HostSystem instance that provides the objects that the view presents.

Since the API docs explicitly note the supported types, we should probably enforce those types and return a more verbose error message if something else (such as a mo.Datastore) is passed in.
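A minimal sketch of such a check follows, operating on the type name of the managed object reference. The function name and error wording are hypothetical. The list of types is taken from the API doc excerpt quoted above; whether subtypes of those types are also accepted is not modeled here.

```go
package main

import (
	"fmt"
	"strings"
)

// supportedContainerTypes lists the inventory types named in the API docs
// as valid containers for a view.
var supportedContainerTypes = []string{
	"Folder",
	"Datacenter",
	"ComputeResource",
	"ResourcePool",
	"HostSystem",
}

// validateContainerType returns a descriptive error if the given managed
// object type is not one of the documented container types.
func validateContainerType(moType string) error {
	for _, supported := range supportedContainerTypes {
		if moType == supported {
			return nil
		}
	}
	return fmt.Errorf(
		"unsupported container type %q for CreateContainerView; supported types: %s",
		moType,
		strings.Join(supportedContainerTypes, ", "),
	)
}

func main() {
	for _, moType := range []string{"ResourcePool", "Datastore"} {
		if err := validateContainerType(moType); err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s is a valid container type\n", moType)
	}
}
```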

Refs:

Add support to toggle all internal/vsphere package debug log messages

Example messages generated now (sent to os.Stderr):

It took 11.2361ms to execute ValidateRPs func (and validate 4 Resource Pools).
It took 11.5973ms to execute getObjects func (and retrieve 6 ResourcePool objects).
It took 11.8661ms to execute GetEligibleRPs func (and retrieve 4 Resource Pools).
It took 33.6523ms to execute getObjects func (and retrieve 4 VirtualMachine objects).
It took 18.5961ms to execute getObjects func (and retrieve 6 VirtualMachine objects).
It took 294.8642ms to execute getObjects func (and retrieve 85 VirtualMachine objects).
It took 5.5516ms to execute getObjects func (and retrieve 85 VirtualMachine objects).
It took 353.6448ms to execute GetVMsFromRPs func (and retrieve 85 VMs).
It took 17.3µs to execute FilterVMsWithSnapshots func (for 83 VMs, yielding 2 VMs).

This has been very useful as I've worked on the package, and I expect the output will continue to be useful when troubleshooting plugins from this project in the future. However, while continuing work on #4 I believe I've hit a point where the output, while useful, may be a bit too much for anyone but myself to deal with.

I think the above is fine, but this block (one of many for a VM's snapshot tree) is an example of content that a sysadmin might not care to see (by default):

Processing snapshot: [ID: snapshot-229096, Name: Test Snapshot, HasParent: true]
Adding key 3 to vmParentSnapshotDiskFileKeys
Adding key 4 to vmParentSnapshotDiskFileKeys
Adding key 26 (vmsn, snapData) to vmSnapshotDiskFileKeys
snapLayout [Name: [HUSVM-Library-vol6] RHEL7-TEST/RHEL7-TEST-Snapshot11.vmsn, Size: 19564 (19.1KB), Key: 26]
Adding key 3 to vmSnapshotDiskFileKeys
Adding key 4 to vmSnapshotDiskFileKeys
Adding key 11 to vmSnapshotDiskFileKeys
Adding key 12 to vmSnapshotDiskFileKeys
Range vmParentSnapshotDiskFileKeys ...
Removing key 3 from vmSnapshotDiskFileKeys
Removing key 4 from vmSnapshotDiskFileKeys
Remaining keys in vmSnapshotDiskFileKeys: map[11:11 12:12 26:26]
Range vmDiskFileKeys ...
Removing key 5 from vmSnapshotDiskFileKeys
Removing key 6 from vmSnapshotDiskFileKeys
Removing key 27 from vmSnapshotDiskFileKeys
Removing key 28 from vmSnapshotDiskFileKeys
Removing key 3 from vmSnapshotDiskFileKeys
Removing key 4 from vmSnapshotDiskFileKeys
Remaining keys in vmSnapshotDiskFileKeys: map[11:11 12:12 26:26]
Tally size of vmSnapshotDiskFileKeys
Size [bytes: 1068140, HR: 1.0MB] calculated for Test Snapshot snapshot

This is output that would be hidden away by default and exposed only when requested.
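One common way to get "hidden by default, exposed on request" behavior is a package-level logger that discards output until it is enabled. The sketch below uses only the standard library; the logger variable, prefix, and function names are hypothetical and not taken from the internal/vsphere package.

```go
package main

import (
	"io"
	"log"
	"os"
)

// debugLogger is a hypothetical package-level logger. It discards all
// messages until logging is explicitly enabled.
var debugLogger = log.New(io.Discard, "[vsphere] ", 0)

// enableDebugLogging sends debug messages to os.Stderr.
func enableDebugLogging() { debugLogger.SetOutput(os.Stderr) }

// disableDebugLogging discards debug messages again.
func disableDebugLogging() { debugLogger.SetOutput(io.Discard) }

// debugLoggingEnabled reports whether debug messages are currently emitted.
func debugLoggingEnabled() bool { return debugLogger.Writer() != io.Discard }

func main() {
	// Discarded: logging is off by default.
	debugLogger.Printf("Adding key %d to vmSnapshotDiskFileKeys", 3)

	// Written to os.Stderr once enabled, e.g. via a command-line flag.
	enableDebugLogging()
	debugLogger.Printf("Adding key %d to vmSnapshotDiskFileKeys", 4)
}
```

A plugin could call the enable function when a debug-level flag is set, leaving the timing summaries and the detailed per-snapshot messages to be split across separate loggers if needed.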
