sscargal / pmemchk
License: MIT License
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: ThermalThrottleLossPercent
Description: This rule should check if ThermalThrottleLossPercent is not equal to "N/A" or 0 and return a warning to the user
Expect:
ThermalThrottleLossPercent=N/A
- or -
ThermalThrottleLossPercent=0
man page description
ThermalThrottleLossPercent
The average performance loss percentage due to thermal throttling in the current boot of the PMem module.
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: ConfigurationStatus
Description: This rule should check if the ConfigurationStatus != "Valid" and report an INFO message to the user
Rule Passes:
ConfigurationStatus=Valid
Possible Values
ConfigurationStatus
The status of the PMem module memory configuration. One of:
• Valid: The configuration is valid.
• Not Configured: The PMem module has not been configured.
• Failed - Bad configuration: The configuration is corrupt.
• Failed - Broken interleave: This PMem module is part of an interleave set that is not complete.
• Failed - Reverted: The configuration failed and was reverted to the last known good configuration.
• Failed - Unsupported: The configuration is not compatible with the installed BIOS.
• Unknown: The configuration cannot be determined.
Similar to the SOS Report, customers running the SUSE Linux distro will collect host configuration data with the SupportConfig utility. pmemchk should support analyzing the data collected by SupportConfig, as it is very similar to the data collected by the pmemchk collector and the SOS Report.
pmemchk should support collecting data only, without automatically running the analyzer. Propose using the -C option.
We need to validate pull requests using https://www.shellcheck.net/ and GitHub Actions
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: MasterPassphraseEnabled
Description: This rule should check if MasterPassphraseEnabled is enabled (1) and return an INFO message to the user. An enabled master passphrase is not a failure, but it's an uncommon situation that the user should be made aware of in case it's relevant to a current or future issue.
Rule Passes:
MasterPassphraseEnabled=0
Possible Values
MasterPassphraseEnabled
This property indicates if master passphrase is enabled. If it is disabled, then it cannot be enabled. One of:
• 0: Disabled - Cannot be enabled.
• 1: Enabled - Master passphrase can be changed. Cannot be disabled.
Example:
# ./pmemchk -A ./pmemchk.hostname.0113-1210
=======================================================================
Starting PMem Checker
pmemchk Version 0.1.0
Started: Wed Jan 19 03:26:37 PM MST 2022
=======================================================================
Using NDCTL command: /usr/local/bin/ndctl
NDCTL version: 72
Using IPMCTL command: /usr/local/bin/ipmctl
IPMCTL version: 02.00.00.3871
Using CXL command: /usr/local/bin/cxl
CXL version: 72
Operating System: Fedora Linux 35 (Server Edition)
Kernel Version : 5.15.10-200.fc35.x86_64
CPU(s): 96
Model name: Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz
Socket(s): 2
NUMA node(s): 2
NUMA node0 CPU(s): 0-23,48-71
NUMA node1 CPU(s): 24-47,72-95
=======================================================================
Starting analysis of the data
=======================================================================
[ INFO ] optane_check_region_freecapacity : Could not process './pmemchk.hostname.0113-1210/ipmctl_show_-region'. File not found
[ INFO ] optane_check_dimm_lockstate : Could not process './pmemchk.hostname.0113-1210/ipmctl_show_-dimm'. File not found
[ INFO ] optane_check_dimm_firmware_version : Could not process './pmemchk.hostname.0113-1210/ipmctl_show_-dimm'. File not found
[ INFO ] optane_check_dimm_health_status : Could not process './pmemchk.hostname.0113-1210/ipmctl_show_-dimm'. File not found
[ INFO ] optane_check_dimm_population : Could not process './pmemchk.hostname.0113-1210/ipmctl_show_-dimm'. File not found
[ INFO ] optane_check_region_capacity : Could not process './pmemchk.hostname.0113-1210/ipmctl_show_-region'. File not found
[ INFO ] optane_check_region_health : Could not process './pmemchk.hostname.0113-1210/ipmctl_show_-region'. File not found
[ INFO ] optane_check_dimm_capacity : Could not process './pmemchk.hostname.0113-1210/ipmctl_show_-dimm'. File not found
=======================================================================
Data analysis completed
=======================================================================
=======================================================================
Analysis Report Summary
=======================================================================
[ PASSED ] = 0
[ FAILED ] = 0
[ INFO ] = 8
[ WARNING ] = 0
=======================================================================
PMem Checker Complete
Ended: Wed Jan 19 03:26:38 PM MST 2022
Duration: 1 seconds
Results: ./pmemchk.hostname.0113-1210
=======================================================================
Add a new (-m) option to allow the user to specify which Analyzer module(s) to execute or exclude. Regular expressions should be supported.
In combination with the list (-l), show a list of filtered modules rather than execute the pmemchk tool.
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: SKUViolation
Description: This rule should check if SKUViolation != 0 and report an INFO message to the user
Rule Passes:
SKUViolation=0
Possible values:
SKUViolation
The configuration of the PMem module is unsupported due to a license issue. One of:
• 0: False
• 1: True
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: SoftwareTriggersEnabled & SoftwareTriggersEnabledDetails
Applies to: Optane 100 or later
Description: This rule should first check if SoftwareTriggersEnabled is enabled, then check SoftwareTriggersEnabledDetails
Expect:
SoftwareTriggersEnabled=0
SoftwareTriggersEnabledDetails=None
Possible Values
SoftwareTriggersEnabled
Software trigger status.
• 0: Disabled - This is the default.
• 1: At least one software trigger enabled.
SoftwareTriggersEnabledDetails
Comma separated list of software triggers currently enabled. One or more of:
• None
• Package Sparing
• Fatal Error
• Percentage Remaining
• Dirty Shutdown
The usage information shows "-c" is a valid option, but we're missing the implementation logic in process_args() to support it.
The non-all outputs from ipmctl (those collected without -a) are displayed in a pipe-separated value table format, e.g.:
$ head -n5 ipmctl_show_-firmware
DimmID | ActiveFWVersion | StagedFWVersion
============================================
0x0001 | 01.02.00.5444 | N/A
0x0011 | 01.02.00.5444 | N/A
0x0021 | 01.02.00.5444 | N/A
When adding the all (-a) option, the output format is different, eg:
$ head ipmctl_show_-a_-firmware
---DimmID=0x0001---
ActiveFWVersion=01.02.00.5444
StagedFWVersion=N/A
StagedFWActivatable=Not activatable, reboot is required
FWUpdateStatus=Update loaded successfully
FWImageMaxSize=266240
QuiesceRequired=Not required
ActivationTime=0
To make the rules easier to implement and consistent, the output from the 'all' command should be reformatted into the pipe-separated value table format.
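A minimal sketch of such a conversion is shown below. It assumes the layout in the sample above: each record starts with a `---DimmID=...---` line, followed by one `Property=Value` pair per line. The function name `all_to_psv` is hypothetical, the column order is taken from the first record, and values that themselves contain a `|` are not escaped.

```bash
# all_to_psv (hypothetical helper): read 'ipmctl show -a ...' style text on
# stdin and write a pipe-separated table (header row + one row per DIMM).
all_to_psv() {
    awk '
    # Print the buffered record; print the header once, before the first row.
    function flush_row(   i, line) {
        if (ncols == 0) return
        if (!header_done) {
            line = cols[1]
            for (i = 2; i <= ncols; i++) line = line "|" cols[i]
            print line
            header_done = 1
        }
        line = row[cols[1]]
        for (i = 2; i <= ncols; i++) line = line "|" row[cols[i]]
        print line
        split("", row)                      # clear the record buffer
    }
    # A "---DimmID=0x0001---" line closes the previous record and opens a new one.
    /^---.*---$/ {
        flush_row()
        sub(/^---/, ""); sub(/---$/, "")    # leaves "DimmID=0x0001"
    }
    {
        eq = index($0, "=")
        if (eq == 0) next                   # not a Property=Value line
        key = substr($0, 1, eq - 1)
        val = substr($0, eq + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", key)
        if (!header_done && !(key in seen)) { cols[++ncols] = key; seen[key] = 1 }
        row[key] = val
    }
    END { flush_row() }
    '
}

# Example: all_to_psv < ipmctl_show_-a_-firmware > ipmctl_show_-a_-firmware.psv
```

Taking the header from the first record keeps the sketch short; a property that only appears on later DIMMs would be dropped, so a real implementation may need a first pass to collect every column name.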
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: LatchedLastShutdownStatus
Applies to: Optane 100 or later
Description: This rule should check the status bits and report when unexpected conditions were reported by the PMem module
Expect:
LatchedLastShutdownStatus=PM ADR Command Received, DDRT Power Fail Command Received, PMIC 12V/DDRT 1.2V Power Loss (PLI), Controller's FW State Flush Complete, Write Data Flush Complete, Extended Flush Not Complete
Possible Values
LatchedLastShutdownStatus
The status of the last shutdown of the PMem module. One or more of:
• Unknown: The last shutdown status cannot be determined.
• PM ADR Command Received: Power management ADR command received.
• PM S3 Received: Power management S3 command received.
• PM S5 Received: Power management S5 command received.
• DDRT Power Fail Command Received: DDR power fail command received.
• PMIC 12V/DDRT 1.2V Power Loss (PLI)
• PM Warm Reset Received: Power management warm reset received.
• Thermal Shutdown Received: Thermal shutdown triggered.
• Controller’s FW State Flush Complete: Flush Completed.
• Viral Interrupt Received: Viral interrupt received.
• Surprise Clock Stop Received: Surprise clock stop received.
• Write Data Flush Complete: Write data flush completed.
• PM S4 Received: Power management S4 command received.
• PM Idle Received: Power management idle received.
• DDRT Surprise Reset Received: Surprise reset received.
• Extended Flush Not Complete.
• Extended Flush Complete.
TODO: Mark which events are Passes, and which are Fails
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: BootStatus
Applies to: Optane 100 or later
Description: This rule should check the value of BootStatus and BootStatusRegister
Expected:
BootStatus=Success
For all other non-Success values, this rule should report an error along with the BootStatusRegister value. Ideally, the BootStatusRegister should be decoded to explain WHY the PMem failed to boot.
Possible Values
BootStatus
The initialization status of the PMem module as reported by the firmware in the boot status register. One or more of:
• Unknown - The boot status register cannot be read.
• Success - No errors were reported during initialization.
The following statuses indicate that the media is not functional and, therefore, access to user data and
operations that require use of the media will fail.
• Media Not Ready - The firmware did not complete media training.
• Media Error - The firmware detected an error during media training.
• Media Disabled - The firmware disabled the media due to a critical issue.
The following statuses indicate that communication with the firmware is not functional.
• FW Assert - The firmware reported an assert during initialization.
BootStatusRegister
The raw hex value of the PMem module Boot Status Register.
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: ExtendedAdrEnabled
Applies to: Optane 200 or later
Description: This rule should report if eADR is Enabled or Disabled in the BIOS. Both are acceptable; the rule just needs to report an INFO message when it's Enabled, as this is an uncommon scenario.
Possible values
ExtendedAdrEnabled
Specifies whether extended ADR flow is enabled in the FW.
• 0: Disabled
• 1: Enabled
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: ManageabilityState
Applies to: Optane 100 or later
Description: This rule should fail when ManageabilityState = "Unmanageable". This could be caused by a bad Optane module, or by non-Intel Optane PMem installed in the system.
Possible Values
ManageabilityState
Ability of the PMem module host software to manage the PMem module. Manageability is determined by the interface
format code, the vendor identifier, device identifier and the firmware API version. One of:
• Manageable: The PMem module is manageable by the software.
• Unmanageable: The PMem module is not supported by this version of the software.
An SOS Report is a common method for collecting data from Linux hosts. A PMem module exists to collect ipmctl and ndctl data. pmemchk should support analyzing the data collected from an SOS Report, as it is very similar to the data the pmemchk collector gathers.
As the number of tests/rules grows, the output may become quite large. We need a new argument/option that allows the user to suppress tests/rules that pass and display only those that fail or report info messages.
Add a new '-q' option that makes the output quiet. Optionally, the user could supply a minimum message level, and the tool would display only messages of that level or higher.
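One way the minimum-level idea could work is sketched below. The ordering PASSED < INFO < WARNING < FAILED is an assumption based on the labels in the report summary, and the names `level_rank`, `report`, and `PMEMCHK_MIN_LEVEL` are hypothetical, not part of pmemchk.

```bash
# level_rank (hypothetical): map a message level to a number for comparison.
# Assumed ordering: PASSED < INFO < WARNING < FAILED.
level_rank() {
    case "$1" in
        PASSED)  echo 0 ;;
        INFO)    echo 1 ;;
        WARNING) echo 2 ;;
        FAILED)  echo 3 ;;
        *)       echo 0 ;;   # unknown levels are treated as the lowest rank
    esac
}

# report (hypothetical): print "[ LEVEL ] message" only when LEVEL is at or
# above the minimum level, which a '-q <level>' option could set.
report() {
    local level="$1"
    shift
    local min="${PMEMCHK_MIN_LEVEL:-PASSED}"
    if [ "$(level_rank "$level")" -ge "$(level_rank "$min")" ]; then
        echo "[ $level ] $*"
    fi
}

# Example: PMEMCHK_MIN_LEVEL=WARNING hides PASSED and INFO lines.
```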
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: PoisonErrorInjectionsCounter & PoisonErrorClearCounter
Applies to: Optane 100 or later
Description: This rule should fail when the error count is > 0.
Expect:
PoisonErrorInjectionsCounter=0
PoisonErrorClearCounter=0
If the number of injected errors equals the number of cleared errors, this is okay but should be reported. Otherwise, somebody forgot to clear the poison.
Test outcome:
Man page entry
PoisonErrorInjectionsCounter
This counter is incremented each time the set poison error is successfully executed.
PoisonErrorClearCounter
This counter is incremented each time the clear poison error is successfully executed.
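The comparison described above could be sketched as follows. This is one reading of the rule, not the implemented one: both counters at zero pass, equal non-zero counters produce an INFO message, and anything else fails. The function name `check_poison_counters` is hypothetical, and the input is assumed to be the `Property=Value` text with `---DimmID=...---` record headers.

```bash
# check_poison_counters (hypothetical): read 'ipmctl show -a -dimm' style text
# on stdin and report one line per DIMM.
check_poison_counters() {
    awk -F= '
    /^---DimmID=/ { dimm = $2; sub(/---$/, "", dimm) }     # remember current DIMM
    { key = $1; gsub(/[ \t]/, "", key) }                   # tolerate indentation
    key == "PoisonErrorInjectionsCounter" { inj[dimm] = $2 + 0; order[++n] = dimm }
    key == "PoisonErrorClearCounter"      { clr[dimm] = $2 + 0 }
    END {
        for (i = 1; i <= n; i++) {
            d = order[i]
            if (inj[d] == 0 && clr[d] == 0) s = "PASSED"   # nothing was injected
            else if (inj[d] == clr[d])      s = "INFO"     # injected, then cleared
            else                            s = "FAILED"   # poison left behind
            printf "[ %s ] DimmID=%s injected=%d cleared=%d\n", s, d, inj[d], clr[d]
        }
    }'
}

# Example: check_poison_counters < ipmctl_show_-a_-dimm
```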
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: ViralState
Applies to: Optane 200 or later
Description: This rule should check the value of ViralState
Possible Values
ViralState
Whether the PMem module is currently viral. One of:
• 0: Not Viral
• 1: Viral - The viral policies of the PMem module have switched the persistent memory to read-only mode due to
the host operating system software detecting an uncorrectable error situation and indicating a viral state.
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: ReservedCapacity
Description: This rule should check if the ReservedCapacity is > 0GiB and report an INFO message to the user
Expected:
ReservedCapacity=0.000 GiB
From the ipmctl-show-device man page:
ReservedCapacity
PMem module capacity reserved for proper alignment.
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: PpcExtendedAdrEnabled
Applies to: Optane 200 or later
Description: This rule should report the value of PpcExtendedAdrEnabled
Possible Values
PpcExtendedAdrEnabled
Specifies whether extended ADR flow was enabled in the FW during the last power cycle.
• 0: Disabled
• 1: Enabled
Currently, the collector displays a single dot/period (.) for each command executed. This gives the user feedback that the tool is doing something, but doesn't provide accurate progress. Implementing a progress bar with a % complete value gives much better user feedback.
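A progress bar of this kind could be sketched as below, assuming the collector knows how many commands it will run in total. The function name `progress_bar` and its arguments are hypothetical.

```bash
# progress_bar (hypothetical): redraw a one-line bar for step $1 of $2.
# $3 is the bar width in characters (default 20).
progress_bar() {
    local current="$1" total="$2" width="${3:-20}"
    local pct filled bar i
    [ "$total" -gt 0 ] || return 1
    pct=$(( current * 100 / total ))
    filled=$(( current * width / total ))
    bar=""
    i=0
    while [ "$i" -lt "$width" ]; do
        if [ "$i" -lt "$filled" ]; then bar="${bar}#"; else bar="${bar}."; fi
        i=$(( i + 1 ))
    done
    # '\r' returns to the start of the line so each call overwrites the last.
    printf '\r[%s] %3d%%' "$bar" "$pct"
}

# Example: call progress_bar "$n" "$total" after each collected command,
# then print a newline once collection finishes.
```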
Data collected by the collector should be tar'd and [g]zipped before transit.
Implement '-Z' to enable/disable the zip feature
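As a rough sketch, the archive step could look like the following. The function name `archive_results` is hypothetical; the caller would decide whether to invoke it based on the '-Z' setting.

```bash
# archive_results (hypothetical): create <dir>.tar.gz next to the results
# directory and print the archive path on success.
archive_results() {
    local dir="${1%/}"
    if [ ! -d "$dir" ]; then
        echo "archive_results: '$dir' is not a directory" >&2
        return 1
    fi
    # -C keeps only the directory name, not its full path, inside the archive.
    tar -czf "${dir}.tar.gz" -C "$(dirname "$dir")" "$(basename "$dir")" \
        && echo "${dir}.tar.gz"
}

# Example: archive_results ./pmemchk.hostname.0113-1210
```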
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: PackageSparingCapable
Description: This rule should check if the PackageSparingCapable is not TRUE (1) and report it to the user
Expected (Pass):
PackageSparingCapable=1
Not Expected (Fail):
PackageSparingCapable=0
Possible Values
PackageSparingCapable
Whether or not the PMem module supports package sparing. One of:
• 0: False
• 1: True
We need a way to validate the code works and doesn't break. Unit Tests need to be created to validate the code.
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: MixedSKU
Applies to: Optane 100 or later
Description: This rule should fail when MixedSKU=1
Expect:
MixedSKU=0
Possible Values
MixedSKU
One or more PMem modules in the system have different SKUs. One of:
• 0: False
• 1: True - In this case, the host software operates in a read-only mode and does not allow changes to the PMem
modules and their associated capacity.
Many of the rules use a while loop assuming grep finds data. When no data is found in the input file(s), the test should be skipped as it's likely due to an older/newer Optane product that doesn't support the feature.
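The guard could be sketched as below: run grep first, and only enter the loop when it returned something. The function name `check_property` and the `[ SKIPPED ]` label are hypothetical; the "File not found" message mirrors the existing analyzer output.

```bash
# check_property (hypothetical): evaluate property $2 in input file $1, and
# skip the rule when the property is not reported at all.
check_property() {
    local file="$1" prop="$2"
    local matches
    if [ ! -f "$file" ]; then
        echo "[ INFO ] $prop : Could not process '$file'. File not found"
        return 0
    fi
    matches=$(grep "^[[:space:]]*${prop}=" "$file")
    if [ -z "$matches" ]; then
        # Likely an older/newer Optane product that does not support this property.
        echo "[ SKIPPED ] $prop : not reported in '$file'"
        return 0
    fi
    printf '%s\n' "$matches" | while IFS='=' read -r key value; do
        # The real rule logic would go here; this sketch only echoes the pair.
        echo "evaluating ${key}=${value}"
    done
}

# Example: check_property ipmctl_show_-a_-dimm ViralState
```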
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: MediaTemperatureInjectionsCounter
Applies to: Optane 100 or later
Description: This test should report non-zero values for MediaTemperatureInjectionsCounter
Man page
MediaTemperatureInjectionsCounter
This counter is incremented each time the media temperature is injected.
The Collector and Analyzer need to be modularized.
Each of the current collector functions for ipmctl, ndctl, cxl, and log files should be modularized and capable of running in isolation when the pmemchk program detects the tool/command/utility isn't available - see pmemchk::verify_cmds(). For example: if ipmctl isn't installed, we should skip collecting the data using ipmctl, but continue with the other functions/modules. Currently, pmemchk exits and requests the user install the missing tool.
Similarly, the Analyzer should only run tests for data that was collected. Currently, it executes all tests and prints an INFO message to the user when the file(s) a test requires are not available. These INFO messages could be silenced and shown only if '-v' or a higher verbosity level is provided.
One idea is to store which Collector modules ran, then the Analyzer can read this information and only execute the appropriate modules.
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: ARSStatus
Applies to: Optane 100 or later
Description: This rule should check if the ARSStatus is Completed and report INFO or WARN messages depending on the status
Expected
ARSStatus=Completed
Possible Values
ARSStatus
The address range scrub (ARS) operation status for the PMem module. The status is a reflection of the last
requested ARS, but not necessarily within the current platform power cycle. One of:
• Unknown - The ARS operation status cannot be determined.
• Not started - An ARS operation has not started.
• In progress - An ARS operation is currently in progress.
• Completed - The last ARS operation has completed.
• Aborted - The last ARS operation was aborted.
In a datacenter or cloud environment where there can be 10s-1000s of nodes, a single command should run pmemchk against a list of hostnames or IP addresses.
Analyzer Module: Optane
Input File: ipmctl_show_-a_-firmware
Property: FWUpdateStatus, StagedFWVersion, & StagedFWActivatable
Applies to: Optane 100 or later
Description: This rule should check if there is a staged firmware. If so, provide more information.
Expected
StagedFWVersion=N/A
StagedFWActivatable=Not activatable, reboot is required
Man Page
FWUpdateStatus
The status of the last firmware update operation. One of:
• Unknown
• Staged successfully
• Update loaded successfully
• Update failed to load, fell back to previous firmware
StagedFWVersion
(Default) The BCD-formatted revision of the firmware staged for execution on the next power cycle in the format
PN.RN.SV.bbbb where:
• PN = 2-digit product number
• RN = 2-digit revision number
• SV = 2-digit security version number
• bbbb = 4-digit build version
StagedFWActivatable
The state of whether the staged firmware is activatable or not, where:
• 0 = Not activatable, reboot is required
• 1 = Activatable
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: AitDramEnabled
Applies to: Optane 100 or later
Description: Report the value of AitDramEnabled
Rule Passes:
AitDramEnabled=1
Possible Values
AitDramEnabled
If the PMem module AIT DRAM is enabled. One of:
• 0: Disabled - The device will suffer performance degradation if the AIT DRAM becomes disabled.
• 1: Enabled
pmemchk should allow a user to configure an email address, or list of email addresses, to send the output to once complete. This will be particularly useful when pmemchk is run from a cron job.
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: ErrorInjectionEnabled
Applies to: Optane 100 or later
Description: This rule should report when ErrorInjectionEnabled = 1. This is not recommended in production environments, but is okay in a test/dev environment. The test should FAIL when ErrorInjectionEnabled = 1.
Possible Values
ErrorInjectionEnabled
Error injection status.
• 0: Disabled - This is the default.
• 1: Enabled
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: OverwriteStatus
Applies to: Optane 100 or later
Description: This rule should report the OverwriteStatus to the user
Possible Values
OverwriteStatus
The overwrite PMem module operation status for the PMem module. One of:
• Unknown - The overwrite PMem module operation status cannot be determined. This may occur if the status gets
overwritten due to a different long operation running on this PMem module.
• Not started - An overwrite PMem module operation was not started on the last boot.
• In progress - An overwrite PMem module operation is currently in progress.
• Completed - An overwrite PMem module operation completed and a reboot is required to use the PMem module.
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: PackageSparesAvailable
Description: This rule should check if the number of PackageSparesAvailable is less than 1 and report a Critical error to the user
Expected
PackageSparesAvailable=1
Failure Condition
PackageSparesAvailable=0
Man page description
PackageSparesAvailable
The number of spare devices available for package sparing.
Add a new list command that lists available Analyzer modules and rules (-l). The output should group the modules and rules.
An example output could look like this:
Modules
----------
module1
module2
...
Rules
------
rule1
rule2
...
A filter could be supplied to list only modules or rules, e.g. -l modules or -l rules, or alternatively -l -m and -l -r
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: PackageSparingEnabled
Description: This rule should check if the PackageSparingEnabled is True (1) and report if not
Expected:
PackageSparingEnabled=1
Not Expected:
PackageSparingEnabled=0
Possible Values
PackageSparingEnabled
Whether or not the PMem module package sparing policy is enabled. One of:
• 0: Disabled
• 1: Enabled
The -F option caused ndctl to run slowly, as it calls into ipmctl for that data. ipmctl already collects the data, so we don't need it. In the future, when systems have CXL only, -F may be required.
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm.psv
Property: PartNumber
Applies to: Optane 100 or later
Description: This rule should validate that the PartNumber values for all PMem modules are the same.
Rule Passes: If all PartNumber values are the same
Rule Fails: When multiple part numbers are detected
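A sketch of this check against the .psv input follows. It assumes the file has a header row naming the columns, fields separated by `|` (with or without surrounding spaces), and optionally a `====` separator line. The function name `check_part_numbers` and the `[ SKIPPED ]` label are hypothetical.

```bash
# check_part_numbers (hypothetical): pass when every row of PSV file $1
# carries the same PartNumber value.
check_part_numbers() {
    awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    NR == 1 {
        # Locate the PartNumber column by name rather than by position.
        for (i = 1; i <= NF; i++) if (trim($i) == "PartNumber") col = i
        if (!col) { print "[ SKIPPED ] PartNumber column not found"; exit }
        next
    }
    /^=+$/ { next }                              # table separator line
    NF >= col {
        pn = trim($col)
        if (!(pn in seen)) { seen[pn] = 1; list = list (n++ ? ", " : "") pn }
    }
    END {
        if (!col) exit
        if (n == 0)      print "[ SKIPPED ] no PartNumber values found"
        else if (n == 1) print "[ PASSED ] all PartNumber values match: " list
        else             print "[ FAILED ] multiple PartNumber values: " list
    }' "$1"
}

# Example: check_part_numbers ipmctl_show_-a_-dimm.psv
```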
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: MemoryBandwidthBoostFeature
Applies to: Optane 200 or later
Description: This rule should report whether the feature is enabled or disabled. Both are expected (pass). The intent is to provide this info to the user in case of performance related issues.
Possible Values
MemoryBandwidthBoostFeature
Returns if the Memory Bandwidth Boost Feature is currently enabled or not. One of:
• 0x0: Disabled
• 0x1: Enabled
Analyzer Module: Optane
Input File: ipmctl_show_-a_-goal
Applies to: Optane 100 or later
Description: This rule should check if an uncommitted goal is present
Rule Passes:
There are no goal configs defined in the system.
Please use 'show -region' to display currently valid persistent memory regions.
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: ViralPolicy
Applies to: Optane 200 or later
Description: This rule should check if the ViralPolicy is Enabled/Disabled
Possible Values
ViralPolicy
Whether viral policies are enabled on the PMem module. One of:
• 0: Disabled - This is the default.
• 1: Enabled - The persistent memory on the PMem module will be put into read-only mode if the host operating
system software detects an uncorrectable error situation and indicates a viral state in order to prevent the
spread of damage.
The Exec ACL is not required except on the pmemchk script
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: SoftwareTriggersCounter
Applies to: Optane 100 or later
Description: This rule should fail when SoftwareTriggersCounter is non-zero
Expect:
SoftwareTriggersCounter=0
Man page:
SoftwareTriggersCounter
This counter is incremented each time a software trigger is enabled.
Add a new (-r) option to allow the user to specify which Analyzer rule(s) to execute or exclude. Regular expressions should be supported.
In combination with the list (-l), show a list of filtered rules rather than execute the pmemchk tool.
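The include/exclude filtering could be sketched as a thin wrapper around `grep -E`, and the same approach would apply to the module filter. The function name `filter_rules` and the leading `!` used to mark an exclusion are hypothetical conventions, not existing pmemchk syntax.

```bash
# filter_rules (hypothetical): read rule names on stdin, one per line, and
# keep those matching extended regex $1. A leading '!' inverts the match.
filter_rules() {
    local pattern="$1"
    case "$pattern" in
        '!'*) grep -Ev -- "${pattern#?}" ;;   # exclude matching rules
        *)    grep -E  -- "$pattern" ;;       # include matching rules
    esac
}

# Example: list_all_rules | filter_rules 'optane_check_dimm_.*'
# (list_all_rules stands in for whatever produces the rule names)
```

With '-l', the filtered names would simply be printed; without it, the analyzer would run only the rules that survive the filter.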
Analyzer Module: Optane
Input File: ipmctl_show_-a_-dimm
Parameter: MediaTemperatureInjectionEnabled
Applies to: Optane 100 or later
Description: This rule should fail when MediaTemperatureInjectionEnabled is enabled (1), as this is not recommended in production environments but may be used in test/dev environments. Fail anyway, as pmemchk doesn't know what the host is used for.
Possible Values
MediaTemperatureInjectionEnabled
Media temperature injection status.
• 0: Disabled - This is the default.
• 1: Enabled