daviswr / zenpacks.daviswr.zfs
ZFS monitoring for Zenoss
License: MIT License
Investigate if viable/useful
An OpenZFS 2.1 host with 3 zpools hangs indefinitely on modeling. I had to disable the ZFS plugin to get the rest of the model committed to the DB. With the plugin enabled, the modeler hangs even after the zpools are discovered.
There is an error message in Zenoss:
stderr:
interval cannot be zero
usage: status [-c [script1,script2,...]] [-igLpPstvxD] [-T d|u] [pool] ... [interval [count]]
The pool which is failing to model is a raidz2 of 6 drives. There's another raidz2 in there with more disks, and both of them have faults showing. The one which works has one UNAVAIL and one FAULTED - both show up as events in Zenoss. The failing pool has a single FAULTED disk in it.
I'm seeing a host with 4 pools (a single-vdev rpool, a raidz1 pool, another raidz1, and a 5-wide span of 2-disk mirrors with a SLOG mirror and a 2-disk L2ARC span) return data for only two of the pools (rpool and a raidz1). The other two pools show a warning of "Code: 2 - Msg: Misuse of shell builtins"
and no pool state whatsoever. The VDEVs of the "stateless" pools are not accounted for, nor are their constituent storage devices.
The host systems are Arch Linux (so tip bash), running sudoless as root (isolated env), and the ZFS revision is 2.0.0.
Currently both modelers and all ZenCommand parsers implement parsing of tool output individually. Shared parsing functions for each output type would reduce complexity and perhaps make modeling a little more fault-tolerant.
Should be considered a prerequisite to #6.
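As a rough illustration of what one shared parsing function could look like, here is a minimal sketch. It assumes tab-separated input of the kind `zpool get -H` or `zfs get -H` produce (name, property, value, source); the pack's actual commands and output formats may differ, and `parse_property_lines` is a hypothetical name, not something taken from the pack.

```python
def parse_property_lines(output):
    """Parse 'name<TAB>property<TAB>value[<TAB>source]' lines.

    Returns {name: {property: value}}. Lines with fewer than three
    tab-separated fields are skipped rather than raising, so a stray
    warning in the command output does not abort the whole parse.
    """
    parsed = {}
    for line in output.splitlines():
        fields = line.split('\t')
        if len(fields) < 3:
            continue
        name, prop, value = fields[0], fields[1], fields[2]
        parsed.setdefault(name, {})[prop] = value
    return parsed
```

A modeler plugin and a ZenCommand parser could both call the same function and differ only in what they do with the resulting dictionary.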
@tcaputi has pretty much completed work on the native crypto implementation for OpenZFS (openzfs/zfs#4329). This work adds some complexity to how information is stored and presented, as well as to the CLI. Given that the ZenPack works off zdb output, and that dataset-level attributes remain CT, I'm assuming that we should be able to see all relevant attributes whether we have a key loaded or not (i.e., it should still work while the DS is encrypted). We would, however, want to output information regarding the crypto config (on/off, keysource, cipher, and pbkdfiters) to be logged by Zenoss.
@daviswr: Could I ask you to take a look at implementing this? Every time I start working on this ZenPack I get bogged down by the idiosyncratic differences between Python and my 3rd gen language of choice (Ruby) relating to string parsing, indents, and set manipulation. I should have some cycles in Jan, but I'm massively behind on Metasploit work, so I'm throwing this up as an issue instead of a PR, presuming you have the cycles to tackle it. Thanks as always.
The zpool.status parser should generate events based on messages in the status and errors fields for the pool.
Additionally, it should generate vdev & device events if other components in the output have error messages.
Collect error counters from 'zpool status -v' output
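A minimal sketch of how the status and errors fields and the per-device READ/WRITE/CKSUM counters might be pulled out of `zpool status` output for a single pool. The function names, the returned dictionary shapes, and the event format are all hypothetical and do not reflect the pack's actual parser or the Zenoss event API; counters are kept as strings because they can be abbreviated (for example `1.2K`).

```python
import re

# Top-level "field: value" lines in zpool status output.
FIELD_RE = re.compile(
    r'^\s*(pool|state|status|action|see|scan|config|errors):\s*(.*)$')
# Config rows: NAME STATE READ WRITE CKSUM [trailing message]
DEVICE_RE = re.compile(
    r'^\s+(\S+)\s+([A-Z]+)\s+(\d\S*)\s+(\d\S*)\s+(\d\S*)')


def parse_zpool_status(output):
    """Return (fields, devices) for the status text of a single pool."""
    fields = {}
    devices = []
    current = None
    for line in output.splitlines():
        match = FIELD_RE.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2).strip()
        elif current == 'config':
            row = DEVICE_RE.match(line)
            if row:  # the NAME/STATE header row has no digits, so it is skipped
                devices.append(dict(zip(
                    ('name', 'state', 'read', 'write', 'cksum'),
                    row.groups())))
        elif current and line.strip():
            # Continuation line of a multi-line field such as status.
            fields[current] = (fields[current] + ' ' + line.strip()).strip()
    return fields, devices


def status_events(fields, devices):
    """Build simple event dicts from parsed status (hypothetical format)."""
    events = []
    pool = fields.get('pool', 'unknown')
    if fields.get('status'):
        events.append({'component': pool, 'message': fields['status']})
    errors = fields.get('errors', '')
    if errors and errors != 'No known data errors':
        events.append({'component': pool, 'message': errors})
    for dev in devices:
        counters = [dev['read'], dev['write'], dev['cksum']]
        if dev['state'] != 'ONLINE' or any(c != '0' for c in counters):
            events.append({
                'component': dev['name'],
                'message': '%s: read=%s write=%s cksum=%s' % (
                    dev['state'], dev['read'], dev['write'], dev['cksum']),
            })
    return events
```

Rows without counters, such as spares listed as AVAIL, are ignored by this sketch and would need separate handling.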
We had a failure go unnoticed this morning - pool shows up as ONLINE in Zenoss (with no IO in the graphs) but went SUSPENDED on the host hours ago.
Zenoss 6.3 with zenpack built off of 8a17ca6
Thanks as always
We use /dev/disk/by-id/ paths referencing the ata- or scsi-/sas- symlinks pointing to our devices in our zpool configurations. Just noticed a pool lost a drive, OS removed it altogether at fail-time, and the zenpack is having some issues with this.
The disk is still showing as online, but there is a warning message generated saying:
Component: raidz1-0
Event Class: /Cmd/Fail
Status: New
Message: Traceback (most recent call last):
  File "/opt/zenoss/Products/ZenRRD/zencommand.py", line 819, in _processDatasourceResults
    parser.processResults(datasource, results)
  File "/opt/zenoss/packs/ZenPacks.daviswr.ZFS/ZenPacks/daviswr/ZFS/parsers/zpool/status.py", line 68, in processResults
    health = pool_match.groups()[0]
AttributeError: 'NoneType' object has no attribute 'groups'
I'm assuming a problem in the zpool status output parser.
As a result, the disk itself is not marked as being offline in Zenoss, but the VDEV does show yellow (warning state) due to the parsing problem resulting in the reference to a 'NoneType' object.
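The traceback above is what happens when a regex match returns None and `.groups()` is called on it anyway. A minimal sketch of guarding against that follows; the pattern and the `extract_health` helper are hypothetical, since the actual regex in parsers/zpool/status.py is not shown in the report.

```python
import re

# Hypothetical pattern; the real one used by the parser is not shown above.
HEALTH_RE = re.compile(r'^\s*state:\s*(\S+)', re.MULTILINE)


def extract_health(output, default='UNKNOWN'):
    """Return the matched health string, or a default when nothing matches."""
    match = HEALTH_RE.search(output)
    if match is None:
        # No match (e.g. a device vanished from the output): fall back
        # instead of raising AttributeError on None.
        return default
    return match.group(1)
```

Returning an explicit fallback value would let the parser raise a meaningful event for the component instead of a /Cmd/Fail traceback.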
Modelers are not executed in any guaranteed order, so the ZFS modeler can run before the ZPool modeler the first time a system is modeled, thus missing the datasets because the ZPool components have not yet been created.
Probably can't stub-out Pools in case ZFS runs after ZPool, which would replace the previously-made components (I think...)
Subsequent models are normally fine.
Will track down the cause but noting this here for record-keeping - catching this error against a 2.2.3 built, packaged, and installed on Ubuntu 22.04. May happen on others, will check Arch shortly:
2024-03-26 23:38:57,653 ERROR zen.ZenModeler: Traceback (most recent call last):
  File "/opt/zenoss/Products/DataCollector/zenmodeler.py", line 669, in processClient
    datamaps = plugin.process(device, results, self.log)
  File "/opt/zenoss/ZenPacks/ZenPacks.daviswr.ZFS-0.8.0-py2.7.egg/ZenPacks/daviswr/ZFS/modeler/plugins/daviswr/cmd/ZFS.py", line 200, in process
    comp[key] = int(datasets[ds][key])
ValueError: invalid literal for int() with base 10: 'none'
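The ValueError comes from passing the literal string 'none' to `int()`. One possible way to tolerate such values is a small conversion helper; this is a sketch only, and `to_int` and `convert_numeric` are hypothetical names, not functions from the pack.

```python
def to_int(value, default=None):
    """Convert value to int, returning default for 'none', '-', None, etc."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


def convert_numeric(props, int_keys):
    """Copy props, converting int_keys to int and dropping unparsable ones."""
    converted = {}
    for key, raw in props.items():
        if key in int_keys:
            number = to_int(raw)
            if number is not None:
                converted[key] = number
        else:
            converted[key] = raw
    return converted
```

Dropping the unparsable key is one design choice; substituting 0 or another sentinel may suit some properties better, depending on how the component attribute is displayed.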
Thank you for this zenpack - it's a lifesaver in our environment. At the latest version (0.7.0), cache drive vdev enumeration fails, and I've had to comment it out (https://github.com/daviswr/ZenPacks.daviswr.ZFS/blob/master/ZenPacks/daviswr/ZFS/modeler/plugins/daviswr/cmd/ZPool.py#L153). I'll spin up a lab system to replicate the error, but it is along the lines of "NoneType has no member named 'dev'".
Separately, I've added local thresholds for pool capacity notification - may be useful to have them in the zenpack. A pool at 90% is something to be concerned about (especially with automated snapshots or heavy use). Also, it would be very useful to have a configuration option to disable enumeration of snapshots. Some of our systems have thousands of snapshots across datasets, and it gets painful pretty quickly (we are only monitoring pools for now anyway, but DS usage and ZVOL IO would be nice).
The monkeypatch for TALES evaluation in CommandPlugin's command should be broken out into a separate ZenPack. This pack can list it as a requirement.
Might be better handled in the modeler/template commands, finding zpool, zdb, and zfs, determining if escalation is necessary, and what the best route for escalation is, similar to the SMART pack.
Early versions based directly on Chet's and Jane's examples are still causing problems.