sscargal / cxlchk-bash
System Checker for CXL Devices
License: MIT License
cxlchk should check and report whether kernel TPP is enabled or disabled. There are two things to check. First, check whether TPP is enabled by reading /proc/sys/kernel/numa_balancing. A value of 2 means TPP is enabled. Second, check /sys/kernel/mm/numa/demotion_enabled to verify whether page demotion is enabled (1) or disabled (0).
From numa_balancing
numa_balancing
Enables/disables and configures automatic page fault based NUMA memory balancing. Memory is moved automatically to nodes that access it often. The value to set can be the result of ORing the following:
0 NUMA_BALANCING_DISABLED
1 NUMA_BALANCING_NORMAL
2 NUMA_BALANCING_MEMORY_TIERING
Or NUMA_BALANCING_NORMAL to optimize page placement among different NUMA nodes to reduce remote accessing. On NUMA machines, there is a performance penalty if remote memory is accessed by a CPU. When this feature is enabled the kernel samples what task thread is accessing memory by periodically unmapping pages and later trapping a page fault. At the time of the page fault, it is determined if the data being accessed should be migrated to a local memory node.
The unmapping of pages and trapping faults incur additional overhead that ideally is offset by improved memory locality but there is no universal guarantee. If the target workload is already bound to NUMA nodes then this feature should be disabled.
Or NUMA_BALANCING_MEMORY_TIERING to optimize page placement among different types of memory (represented as different NUMA nodes) to place the hot pages in the fast memory. This is implemented based on unmapping and page fault too.
numa_balancing_promote_rate_limit_MBps
Too high promotion/demotion throughput between different memory types may hurt application latency. This can be used to rate limit the promotion throughput. The per-node max promotion throughput in MB/s will be limited to be no more than the set value.
A rule of thumb is to set this to less than 1/10 of the PMEM node write bandwidth.
From demotion_enabled
/sys/kernel/mm/numa/demotion_enabled
Defined on file sysfs-kernel-mm-numa
Enable/disable demoting pages during reclaim
Page migration during reclaim is intended for systems with tiered memory configurations. These systems have multiple types of memory with varied performance characteristics instead of plain NUMA systems where the same kind of memory is found at varied distances. Allowing page migration during reclaim enables these systems to migrate pages from fast tiers to slow tiers when the fast tier is under pressure. This migration is performed before swap. It may move data to a NUMA node that does not fall into the cpuset of the allocating process which might be construed to violate the guarantees of cpusets. This should not be enabled on systems which need strict cpuset location guarantees.
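The two TPP checks described above could be sketched roughly as follows. The function name `check_tpp` and its optional path arguments are hypothetical (they are not part of cxlchk); the arguments exist only so the check can be pointed at files copied into a dataset. Because numa_balancing is documented as an OR of flags, the sketch tests the bit with value 2 rather than comparing for equality.

```shell
# Report whether kernel TPP (memory tiering) and page demotion are enabled.
# check_tpp and its path arguments are illustrative names, not cxlchk code.
check_tpp() {
    nb_file=${1:-/proc/sys/kernel/numa_balancing}
    demo_file=${2:-/sys/kernel/mm/numa/demotion_enabled}

    if [ -r "$nb_file" ]; then
        nb=$(cat "$nb_file")
        # The value is a bitmask; 2 is NUMA_BALANCING_MEMORY_TIERING.
        if [ $(( ${nb:-0} & 2 )) -ne 0 ]; then
            echo "TPP: enabled (numa_balancing=$nb)"
        else
            echo "TPP: disabled (numa_balancing=$nb)"
        fi
    else
        echo "TPP: unknown ($nb_file not readable)"
    fi

    if [ -r "$demo_file" ]; then
        demo=$(cat "$demo_file")
        # Assumption: some kernels print true/false instead of 1/0,
        # so both spellings are accepted here.
        case "$demo" in
            1|true) echo "Demotion: enabled" ;;
            *)      echo "Demotion: disabled" ;;
        esac
    else
        echo "Demotion: unknown ($demo_file not readable)"
    fi
}
```

Calling `check_tpp` with no arguments reads the live system files; passing two paths reads copies from a dataset instead.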
Hardware errors are reported to dmesg and look similar to the following:
[1206211.934903] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[1206211.934911] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[1206211.934913] {2}[Hardware Error]: event severity: corrected
[1206211.934915] {2}[Hardware Error]: Error 0, type: corrected
[1206211.934917] {2}[Hardware Error]: section_type: PCIe error
[1206211.934918] {2}[Hardware Error]: port_type: 10, root complex event collector
[1206211.934920] {2}[Hardware Error]: version: 3.0
[1206211.934921] {2}[Hardware Error]: command: 0x0504, status: 0x0010
[1206211.934923] {2}[Hardware Error]: device_id: 0000:00:00.4
[1206211.934925] {2}[Hardware Error]: slot: 0
[1206211.934927] {2}[Hardware Error]: secondary_bus: 0x00
[1206211.934928] {2}[Hardware Error]: vendor_id: 0x8086, device_id: 0x0b23
[1206211.934929] {2}[Hardware Error]: class_code: 080700
[1206211.934946] pcieport 0000:00:00.4: AER: aer_status: 0x00004000, aer_mask: 0x00000000
[1206211.934993] pcieport 0000:00:00.4: [14] CorrIntErr
[1206211.934997] pcieport 0000:00:00.4: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
cxlchk should report the section_type and Error 0, type values, with a counter to show how many events of each type have been reported.
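One possible way to produce those counters is sketched below. It assumes the log lines follow the `[Hardware Error]:` layout shown above; the function name `count_hw_errors` and the output layout are illustrative choices, not existing cxlchk behaviour.

```shell
# Count "section_type" and "Error N, type" values in kernel log text on stdin.
# count_hw_errors is a hypothetical helper name.
count_hw_errors() {
    grep -F '[Hardware Error]:' |
    sed -n \
        -e 's/.*\[Hardware Error\]: *section_type: *\(.*\)$/section_type: \1/p' \
        -e 's/.*\[Hardware Error\]: *Error [0-9]*, type: *\(.*\)$/type: \1/p' |
    sort | uniq -c | sed 's/^ *//'   # "<count> <field>: <value>"
}
```

It would typically be fed from `dmesg | count_hw_errors`, or from a saved dmesg file in a dataset.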
A CXL device should appear in the lspci output, e.g.:
# lspci | grep -i CXL
ab:00.0 CXL: Device 1f2d:1031 (rev 01)
If no 'CXL' device is found, return an error.
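A minimal sketch of this check, reading lspci output from stdin so it works on both a live system and a saved dataset. The name `check_cxl_pci` and the message wording are assumptions made for illustration.

```shell
# Return non-zero when no line of lspci output (stdin) mentions CXL.
# check_cxl_pci is a hypothetical helper name.
check_cxl_pci() {
    matches=$(grep -i 'CXL' || true)
    if [ -n "$matches" ]; then
        echo "PASS: CXL device(s) found"
        echo "$matches"
        return 0
    fi
    echo "ERROR: no CXL device found in lspci output" >&2
    return 1
}
```

Usage would be `lspci | check_cxl_pci`.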
If a dataset exists, the cxlchk script should not require root privileges to read the files within the dataset. Currently, analyzing a dataset as a non-root user fails:
./cxlchk -A ./cxlchk.hostname.0829-1451/
Please run this script with root privilege or use -h to display help information.
Check if SELinux is enabled and in Enforcing mode. This isn't CXL specific, but it can prevent third-party apps, such as MemVerge Memory Viewer, from starting. See https://www.golinuxcloud.com/disable-selinux/ for info on how to check.
With recent kernel versions, a user can explicitly disable the Special Purpose Memory UEFI feature with the efi=nosoftreserve kernel boot option. For example:
$ cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.3.4-350.vanilla.fc38.x86_64 root=UUID=ebbbb014-e320-4361-b90c-e70f519a0fee ro rhgb quiet efi=nosoftreserve nopat
A test should be made to detect and report this as an INFO message.
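Such a test might look like the sketch below. The function name `check_nosoftreserve` and the message text are hypothetical. For simplicity it only matches the option exactly as shown above, so a combined form such as `efi=nosoftreserve,debug` would need extra handling.

```shell
# Report as INFO whether efi=nosoftreserve appears on the kernel command line.
# check_nosoftreserve is a hypothetical helper name.
check_nosoftreserve() {
    cmdline_file=${1:-/proc/cmdline}
    # Pad with spaces so the option only matches as a whole word.
    case " $(cat "$cmdline_file") " in
        *" efi=nosoftreserve "*)
            echo "INFO: efi=nosoftreserve is set" ;;
        *)
            echo "INFO: efi=nosoftreserve is not set" ;;
    esac
}
```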
If CXL.mem is configured as a memory node, whether it is Special Purpose Memory or not, there should be a CPU-less (memory-only) NUMA node.
Example:
$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 0 size: 128383 MB
node 0 free: 126151 MB
node 1 cpus:
node 1 size: 8053 MB
node 1 free: 7929 MB
node distances:
node 0 1
0: 10 14 <<< DRAM
1: 14 10 <<< CXL
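A rough way to find such nodes from `numactl -H` output is sketched here. The helper name `list_cpuless_nodes` is hypothetical, and the sketch assumes the "node N cpus:" / "node N size:" line layout shown in the example. Nodes with an empty CPU list but 0 MB of memory are skipped, since they hold no online memory.

```shell
# List NUMA nodes that have memory but no CPUs, from `numactl -H` on stdin.
# list_cpuless_nodes is a hypothetical helper name.
list_cpuless_nodes() {
    awk '
        # "node 1 cpus:" with nothing after the colon has exactly 3 fields
        $1 == "node" && $3 == "cpus:" && NF == 3 { cpuless[$2] = 1 }
        $1 == "node" && $3 == "size:" && ($2 in cpuless) && $4 > 0 {
            print "node " $2 ": memory-only, " $4 " " $5
        }
    '
}
```

For the two-node example above, `numactl -H | list_cpuless_nodes` would report node 1 only.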
The kernel config check should verify that the following options are set to 'm' or 'y':
CONFIG_X86_PMEM_LEGACY=m
CONFIG_ZONE_DEVICE=y
CONFIG_LIBNVDIMM=m
CONFIG_BLK_DEV_PMEM=m
CONFIG_BTT=y
CONFIG_NVDIMM_PFN=y
CONFIG_NVDIMM_DAX=y
CONFIG_DEV_DAX_PMEM=m
CONFIG_ENCRYPTED_KEYS=y
This is a minimum list. If these are not present or incorrect, return an error.
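The check could be sketched along these lines. The function name `check_kernel_config` is hypothetical, and the default path `/boot/config-$(uname -r)` is an assumption that holds on many distributions but not all (some only expose /proc/config.gz).

```shell
# Verify that each required option is set to y or m in a kernel config file.
# check_kernel_config and the default config path are assumptions.
check_kernel_config() {
    config_file=${1:-/boot/config-$(uname -r)}
    rc=0
    for opt in CONFIG_X86_PMEM_LEGACY CONFIG_ZONE_DEVICE CONFIG_LIBNVDIMM \
               CONFIG_BLK_DEV_PMEM CONFIG_BTT CONFIG_NVDIMM_PFN \
               CONFIG_NVDIMM_DAX CONFIG_DEV_DAX_PMEM CONFIG_ENCRYPTED_KEYS; do
        if grep -Eq "^${opt}=[ym]\$" "$config_file" 2>/dev/null; then
            echo "PASS: $opt"
        else
            # Covers both "# CONFIG_X is not set" and a missing entry.
            echo "ERROR: $opt is not set to y or m"
            rc=1
        fi
    done
    return $rc
}
```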
# dmesg | grep e820
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000003dfff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000003e000-0x000000000003efff] reserved
[ 0.000000] BIOS-e820: [mem 0x000000000003f000-0x000000000009ffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000006d3c6fff] usable
[ 0.000000] BIOS-e820: [mem 0x000000006d3c7000-0x000000006f4c6fff] reserved
[ 0.000000] BIOS-e820: [mem 0x000000006f4c7000-0x000000006fdc6fff] ACPI data
[ 0.000000] BIOS-e820: [mem 0x000000006fdc7000-0x00000000734f1fff] ACPI NVS
[ 0.000000] BIOS-e820: [mem 0x00000000734f2000-0x00000000777fefff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000777ff000-0x00000000777fffff] usable
[ 0.000000] BIOS-e820: [mem 0x0000000077800000-0x000000008fffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fe010000-0x00000000fe010fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000807fffffff] usable
[ 0.000000] BIOS-e820: [mem 0x0000008080000000-0x000000907fffffff] soft reserved <<<
Intel MLC tests (benchmarks/IntelMLC/mlc.sh) may fail with:
alloc_mem_onnode(): unable to mbind: : Invalid argument
Buffer allocation failed!
This is caused when the memory blocks for the selected DRAM or CXL node are OFFLINE. In the following example, NUMA Node 2 is CXL memory:
# numactl -H
available: 5 nodes (0-4)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155
node 0 size: 64206 MB
node 0 free: 54657 MB
node 1 cpus: 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
node 1 size: 128947 MB
node 1 free: 122649 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus:
node 3 size: 64496 MB
node 3 free: 64245 MB
node 4 cpus:
node 4 size: 0 MB
node 4 free: 0 MB
node distances:
node 0 1 2 3 4
0: 10 21 14 14 24
1: 21 10 24 24 14
2: 14 24 10 16 26
3: 14 24 16 10 26
4: 24 14 26 26 10
# lsmem -o+ZONES,NODE
RANGE SIZE STATE REMOVABLE BLOCK ZONES NODE
0x0000000000000000-0x000000007fffffff 2G online yes 0 None 0
0x0000000100000000-0x000000107fffffff 62G online yes 2-32 Normal 0
0x0000001080000000-0x000000307fffffff 128G online yes 33-96 Normal 1
0x0000003080000000-0x000000407fffffff 64G online yes 97-128 Normal 3
0x0000004080000000-0x000000607fffffff 128G offline 129-192 Normal/Movable 2 <<<
0x0000006080000000-0x000000707fffffff 64G offline 193-224 Normal/Movable 4
Memory block size: 2G
Total online memory: 256G
Total offline memory: 192G
Bring the memory ONLINE using:
# cd /sys/bus/node/devices/node2
# for m in $(find . -name "memory*[0-9]")
do
echo online > "$m/state"
done
# lsmem -o+ZONES,NODE
RANGE SIZE STATE REMOVABLE BLOCK ZONES NODE
0x0000000000000000-0x000000007fffffff 2G online yes 0 None 0
0x0000000100000000-0x000000107fffffff 62G online yes 2-32 Normal 0
0x0000001080000000-0x000000307fffffff 128G online yes 33-96 Normal 1
0x0000003080000000-0x000000407fffffff 64G online yes 97-128 Normal 3
0x0000004080000000-0x000000607fffffff 128G online yes 129-192 Normal 2 <<<
0x0000006080000000-0x000000707fffffff 64G offline 193-224 Normal/Movable 4
Memory block size: 2G
Total online memory: 384G
Total offline memory: 64G
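Before running the MLC tests, cxlchk could detect this condition by listing the offline memory blocks of a node. The sketch below is one way to do that; `list_offline_blocks` is a hypothetical name, and the default directory simply reuses the node2 path from the example above.

```shell
# Print the names of offline memory blocks under a NUMA node directory.
# list_offline_blocks is a hypothetical helper; node2 is only the example node.
list_offline_blocks() {
    node_dir=${1:-/sys/bus/node/devices/node2}
    for state in "$node_dir"/memory[0-9]*/state; do
        [ -r "$state" ] || continue          # glob matched nothing
        if [ "$(cat "$state")" = "offline" ]; then
            block=${state%/state}            # strip the trailing /state
            echo "${block##*/}"              # keep only memoryNNN
        fi
    done
}
```

An empty result means every block on the node is online.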
Verify the correct Kernel modules/drivers are loaded
From dmidecode -t memory, verify that 'Speed' == 'Configured Memory Speed':
Handle 0x0021, DMI type 17, 92 bytes
Memory Device
Array Handle: 0x001F
Error Information Handle: No Error
Total Width: 80 bits
Data Width: 64 bits
Size: 64 GB
Form Factor: DIMM
Set: None
Locator: P1-DIMMA1
Bank Locator: P0_Node0_Channel0_Dimm0
Type: DDR5
Type Detail: Synchronous Registered (Buffered)
Speed: 4800 MT/s
Manufacturer: Samsung
Serial Number: 80CE04230143FAE831
Asset Tag: P1-DIMMA1_AssetTag (date:23/01)
Part Number: M321R8GA0BB0-CQKMS
Rank: 2
Configured Memory Speed: 4800 MT/s
Minimum Voltage: 1.1 V
Maximum Voltage: 1.1 V
Configured Voltage: 1.1 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 1, Hex 0xCE
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 64 GB
Cache Size: None
Logical Size: None
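A sketch of the comparison, assuming the field labels printed above and that 'Speed' precedes 'Configured Memory Speed' within each Memory Device record. The name `check_dimm_speed` is hypothetical, and skipping "Unknown" values (as seen on empty slots) is a design choice of this sketch.

```shell
# Compare Speed with Configured Memory Speed for each DIMM.
# Reads `dmidecode -t memory` output on stdin; check_dimm_speed is hypothetical.
check_dimm_speed() {
    awk -F': *' '
        { sub(/^[ \t]+/, "", $1) }            # drop leading indentation
        $1 == "Locator" { loc = $2 }
        $1 == "Speed"   { speed = $2 }
        $1 == "Configured Memory Speed" {
            if (speed == "Unknown" || $2 == "Unknown") next
            status = (speed == $2) ? "PASS" : "FAIL"
            if (status == "FAIL") bad = 1
            print status ": " loc " Speed=" speed " Configured=" $2
        }
        END { exit bad + 0 }                  # non-zero if any DIMM mismatched
    '
}
```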
This test should check the output of lsmod to ensure the correct CXL-related drivers are loaded.
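The lsmod test could be sketched as below. The list of expected modules is left to the caller because the issue does not name them; the names in the usage line (`cxl_core`, `cxl_pci`, `cxl_acpi`) are assumptions, and drivers built into the kernel will not appear in lsmod at all, so a miss is reported as a warning rather than a failure.

```shell
# Check that each module named as an argument appears in lsmod output (stdin).
# check_modules_loaded is a hypothetical helper name.
check_modules_loaded() {
    loaded=$(awk 'NR > 1 { print $1 }')      # skip the header, keep names
    rc=0
    for mod in "$@"; do
        if printf '%s\n' "$loaded" | grep -qxF "$mod"; then
            echo "PASS: $mod loaded"
        else
            echo "WARN: $mod not listed by lsmod"
            rc=1
        fi
    done
    return $rc
}
```

Example usage: `lsmod | check_modules_loaded cxl_core cxl_pci cxl_acpi`.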
$ cat /sys/devices/system/memory/auto_online_blocks
Auto-onlining can be enabled by writing online, online_kernel or online_movable to that file, like:
$ echo online > /sys/devices/system/memory/auto_online_blocks
cxlchk already has a command option that supports providing the path to the cxl utility. cxlchk should also accept a path to daxctl.
-d <Path to the DAXCTL executable>
Specify the path to the DAXCTL executable
# ./cxl list -M
Warning: no matching devices found
[
]
This may happen if the devdax is disabled. We can check with cxl list -Mi.
cxlchk should report a count of how many oom-kill events were found in dmesg. Part of an example message looks like this:
[539824.438309] oom-kill:constraint=CONSTRAINT_MEMORY_POLICY,nodemask=1,cpuset=user.slice,mems_allowed=0-3,global_oom,task_memcg=/user.slice/user-0.slice/session-241.scope,task=numa_alloc,pid=2638010,uid=0
[539824.438321] Out of memory: Killed process 2638010 (numa_alloc) total-vm:15731464kB, anon-rss:14473472kB, file-rss:1664kB, shmem-rss:0kB, UID:0 pgtables:28788kB oom_score_adj:0
[539891.267033] numa_alloc invoked oom-killer: gfp_mask=0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO), order=0, oom_score_adj=0
[539891.267041] CPU: 145 PID: 2641883 Comm: numa_alloc Tainted: G S 6.2.16-060216-generic #202305171336
Example:
# dmesg | grep -c "oom-kill:"
2
We would expect to see CXL memory as 'Movable' in the following output
# lsmem -o+ZONES,NODE
RANGE SIZE STATE REMOVABLE BLOCK ZONES NODE
0x0000000000000000-0x000000007fffffff 2G online yes 0 None 0
0x0000000100000000-0x000000407fffffff 254G online yes 2-128 Normal 0
0x0000004080000000-0x000000807fffffff 256G online yes 129-256 Normal 1
0x0000008080000000-0x000000887fffffff 32G online yes 257-272 Movable 2
Memory block size: 2G
Total online memory: 544G
Total offline memory: 0B
Warn if not. It's not fatal because if the CXL memory is not presented as Special Purpose Memory (SPM), then it will appear as a memory NUMA node without a devdax.
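A sketch of that warning, reading `lsmem -o+ZONES,NODE` output from stdin. The helper name `check_node_movable` is hypothetical, and the caller must supply the NUMA node number of the CXL memory. The zone and node are taken from the last two columns because the REMOVABLE column is empty for offline ranges.

```shell
# Warn when memory ranges of the given node are not in the Movable zone.
# $1: NUMA node number; stdin: `lsmem -o+ZONES,NODE` output.
# check_node_movable is a hypothetical helper name.
check_node_movable() {
    awk -v node="$1" '
        $1 ~ /^0x/ && $NF == node {
            seen = 1
            zones = $(NF - 1)
            if (zones != "Movable") warn = 1
            print ((zones == "Movable") ? "PASS" : "WARN") ": " $1 " node " node " zone " zones
        }
        END {
            if (!seen) { print "WARN: no memory ranges found for node " node; exit 1 }
            exit warn + 0
        }
    '
}
```

Example usage: `lsmem -o+ZONES,NODE | check_node_movable 2`.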
Look for disabled dax devices
# daxctl list
[
{
"chardev":"dax1.0",
"size":34359738368,
"target_node":3,
"align":2097152,
"mode":"devdax"
},
{
"chardev":"dax2.0",
"size":68719476736,
"target_node":2,
"align":2097152,
"mode":"devdax",
"state":"disabled"
},
{
"chardev":"dax0.0",
"size":34359738368,
"target_node":2,
"align":2097152,
"mode":"system-ram",
"online_memblocks":16,
"total_memblocks":16,
"movable":true
}
]
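A simple text-based sketch for spotting the disabled devices in that output follows. It assumes the pretty-printed layout shown above (one key per line, "chardev" before "state" within each object); a real JSON parser would be more robust. The name `list_disabled_dax` is hypothetical.

```shell
# Print the chardev name of every dax device whose state is "disabled".
# Reads pretty-printed `daxctl list` output on stdin (one key per line).
# list_disabled_dax is a hypothetical helper name.
list_disabled_dax() {
    awk -F'"' '
        /"chardev"/                      { dev = $4 }   # remember current device
        /"state"[ \t]*:[ \t]*"disabled"/ { print dev }
    '
}
```

For the listing above, `daxctl list | list_disabled_dax` would print dax2.0.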
If a cxl driver fails to load, an error message will be displayed in dmesg, for example:
[ 464.285235] cxl_pci: Unknown symbol cxl_enumerate_cmds (err -2)
[ 464.285265] cxl_pci: Unknown symbol devm_cxl_add_nvdimm (err -2)
[ 464.285284] cxl_pci: Unknown symbol cxl_find_regblock (err -2)
[ 464.285298] cxl_pci: Unknown symbol cxl_map_component_regs (err -2)
[ 464.285311] cxl_pci: Unknown symbol devm_cxl_add_memdev (err -2)
[ 464.285321] cxl_pci: Unknown symbol cxl_probe_device_regs (err -2)
[ 464.285335] cxl_pci: Unknown symbol cxl_map_device_regs (err -2)
[ 464.285348] cxl_pci: Unknown symbol cxl_dev_state_identify (err -2)
[ 464.285367] cxl_pci: Unknown symbol cxl_mem_create_range_info (err -2)
[ 464.285379] cxl_pci: Unknown symbol cxl_dev_state_create (err -2)
This test should check for these failures and report a FAILED status.
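One way to express this test is sketched below. It assumes the "<module>: Unknown symbol <name>" message layout shown above and that the affected modules have names starting with `cxl`; `check_cxl_symbol_errors` is a hypothetical name.

```shell
# Report FAILED when kernel log text (stdin) contains unresolved-symbol
# errors from cxl modules. check_cxl_symbol_errors is a hypothetical name.
check_cxl_symbol_errors() {
    errors=$(grep -E 'cxl[a-z_]*: Unknown symbol ' || true)
    if [ -n "$errors" ]; then
        echo "FAILED: CXL driver(s) reported unresolved symbols"
        # Reduce each line to "<module> <symbol>" and drop repeats.
        echo "$errors" |
            sed 's/.*\(cxl[a-z_]*\): Unknown symbol \([^ ]*\).*/\1 \2/' |
            sort -u
        return 1
    fi
    echo "PASSED: no CXL driver symbol errors found"
}
```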
For each CXL device, check the Health field(s). Many are boolean (True | False)
$ sudo cxl list --health -i
[
{
"memdev":"mem0",
"ram_size":137438953472,
"health":{
"maintenance_needed":true,
"performance_degraded":false,
"hw_replacement_needed":false,
"media_normal":false,
"media_not_ready":false,
"media_persistence_lost":true,
"media_data_lost":false,
"media_powerloss_persistence_loss":false,
"media_shutdown_persistence_loss":false,
"media_persistence_loss_imminent":false,
"media_powerloss_data_loss":false,
"media_shutdown_data_loss":false,
"media_data_loss_imminent":false,
"ext_life_used":"unknown",
"ext_temperature":"normal",
"ext_corrected_volatile":"normal",
"ext_corrected_persistent":"normal",
"life_used_percent":4,
"temperature":0,
"dirty_shutdowns":0,
"volatile_errors":0,
"pmem_errors":0
},
"serial":9947034466371306773,
"numa_node":0,
"host":"0000:38:00.0",
"state":"disabled"
}
]
If the host is configured to map CXL.mem as Special Purpose Memory, there should be entries matching /dev/cxl/mem* and /dev/dax*.
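A sketch of that existence check. The helper names `count_existing` and `check_spm_devices` are hypothetical, and the optional argument (an alternative root in place of /dev) is only there to make the check testable.

```shell
# Count how many of the given paths exist (an unmatched glob stays literal
# and therefore counts as zero). count_existing is a hypothetical helper.
count_existing() {
    n=0
    for p in "$@"; do
        if [ -e "$p" ]; then n=$((n + 1)); fi
    done
    echo "$n"
}

# Check for /dev/cxl/mem* and /dev/dax* entries. check_spm_devices is hypothetical.
check_spm_devices() {
    dev_root=${1:-/dev}
    rc=0
    cxl_n=$(count_existing "$dev_root"/cxl/mem*)
    dax_n=$(count_existing "$dev_root"/dax*)
    if [ "$cxl_n" -gt 0 ]; then
        echo "PASS: $cxl_n entries match $dev_root/cxl/mem*"
    else
        echo "FAIL: nothing matches $dev_root/cxl/mem*"; rc=1
    fi
    if [ "$dax_n" -gt 0 ]; then
        echo "PASS: $dax_n entries match $dev_root/dax*"
    else
        echo "FAIL: nothing matches $dev_root/dax*"; rc=1
    fi
    return $rc
}
```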
Using dmidecode -t memory, check that the Size, Manufacturer, and Part Number are the same for all DIMMs installed in the system. PASS if true, FAIL if false. It may be okay to have a substitute part number installed, but flag it anyway.
Handle 0x0021, DMI type 17, 92 bytes
Memory Device
Array Handle: 0x001F
Error Information Handle: No Error
Total Width: 80 bits
Data Width: 64 bits
Size: 64 GB
Form Factor: DIMM
Set: None
Locator: P1-DIMMA1
Bank Locator: P0_Node0_Channel0_Dimm0
Type: DDR5
Type Detail: Synchronous Registered (Buffered)
Speed: 4800 MT/s
Manufacturer: Samsung
Serial Number: 80CE04230143FAE831
Asset Tag: P1-DIMMA1_AssetTag (date:23/01)
Part Number: M321R8GA0BB0-CQKMS
Rank: 2
Configured Memory Speed: 4800 MT/s
Minimum Voltage: 1.1 V
Maximum Voltage: 1.1 V
Configured Voltage: 1.1 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 1, Hex 0xCE
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 64 GB
Cache Size: None
Logical Size: None
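The consistency check might be sketched as follows. It assumes the field labels and ordering shown above (Size, then Manufacturer, then Part Number within each record) and skips slots that dmidecode reports as "No Module Installed". The name `check_dimm_consistency` is hypothetical.

```shell
# PASS when every installed DIMM has the same Size, Manufacturer and
# Part Number. Reads `dmidecode -t memory` output on stdin.
# check_dimm_consistency is a hypothetical helper name.
check_dimm_consistency() {
    awk -F': *' '
        { sub(/^[ \t]+/, "", $1) }            # drop leading indentation
        $1 == "Size"         { size = $2 }
        $1 == "Manufacturer" { mfr = $2 }
        $1 == "Part Number"  {
            if (size != "No Module Installed") {
                key = size " | " mfr " | " $2
                if (!(key in seen)) { seen[key] = 1; kinds++ }
            }
        }
        END {
            if (kinds == 1) { print "PASS: all DIMMs match"; exit 0 }
            print "FAIL: " (kinds + 0) " distinct Size/Manufacturer/Part Number combinations"
            for (k in seen) print "  " k
            exit 1
        }
    '
}
```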
cxlchk partially implements the -m command option, which should allow a user to provide a comma-separated list of modules and tests to execute.
# echo " -m <module1, module2, ..., moduleN>"
# echo " Specify which Analyzer modules to include or exclude"
# echo " "
A user can explicitly exclude a module/test by prefixing the name with a minus sign (-). Tests or modules in the list without the prefix are implicitly included.
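Splitting the -m argument into include and exclude lists could be sketched as below. This is not the existing cxlchk implementation; `parse_module_list` and the module names in the usage line are hypothetical.

```shell
# Split a comma-separated -m argument into included and excluded names.
# A leading "-" marks an exclusion. parse_module_list is a hypothetical name.
parse_module_list() {
    include=""
    exclude=""
    old_ifs=$IFS
    IFS=','
    set -f                                   # no glob expansion while splitting
    for item in $1; do
        item=$(printf '%s' "$item" | tr -d '[:space:]')
        case "$item" in
            "") ;;                           # ignore empty entries
            -*) exclude="${exclude:+$exclude }${item#-}" ;;
            *)  include="${include:+$include }$item" ;;
        esac
    done
    set +f
    IFS=$old_ifs
    echo "include: $include"
    echo "exclude: $exclude"
}
```

For example, `parse_module_list "numa, -selinux,dmesg"` yields an include list of `numa dmesg` and an exclude list of `selinux`.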
Write a rule that looks for errors similar to these in dmesg:
# dmesg | grep -i dax
[ 20.551421] device_dax dax0.0: mapping0: 0x880000000-0x187fffffff could not reserve range
[ 20.553485] device_dax: probe of dax0.0 failed with error -16
[ 138.941186] kmem dax0.0: mapping0: 0x880000000-0x187fffffff could not reserve region
[ 138.941206] kmem: probe of dax0.0 failed with error -16
[ 138.941253] kmem dax0.0: mapping0: 0x880000000-0x187fffffff could not reserve region
[ 138.941259] kmem: probe of dax0.0 failed with error -16
[ 158.604997] device_dax dax0.0: dynamic-dax with pre-populated page map
[ 158.605010] device_dax: probe of dax0.0 failed with error -22
The dax device may not be visible, or could be in a disabled state:
# daxctl list
[
{
"chardev":"dax0.0",
"size":68719476736,
"target_node":2,
"align":2097152,
"mode":"devdax",
"state":"disabled"
}
]
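A possible shape for the dmesg rule is sketched below. It matches the "probe of daxX.Y failed with error N" lines from the device_dax and kmem drivers shown above and summarizes them per device, driver and error code. The name `check_dax_probe_errors` and the output layout are assumptions.

```shell
# Report FAILED when kernel log text (stdin) contains dax probe failures.
# check_dax_probe_errors is a hypothetical helper name.
check_dax_probe_errors() {
    failures=$(grep -E '(device_dax|kmem): probe of dax[0-9]+\.[0-9]+ failed with error -?[0-9]+' || true)
    if [ -z "$failures" ]; then
        echo "PASSED: no dax probe failures found"
        return 0
    fi
    echo "FAILED: dax probe failures found"
    # Summarize as "<count> <device> <driver> <error code>".
    echo "$failures" |
        sed -E 's/.*(device_dax|kmem): probe of (dax[0-9.]+) failed with error (-?[0-9]+).*/\2 \1 \3/' |
        sort | uniq -c | sed 's/^ *//'
    return 1
}
```

Usage would be `dmesg | check_dax_probe_errors`.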