cxlchk-bash's People

Contributors

asemmaa, sscargal

cxlchk-bash's Issues

[Linux] Check if Linux TPP is Enabled

cxlchk should check and report whether Kernel TPP is enabled or disabled. There are two things to check. First, check whether TPP is enabled by reading /proc/sys/kernel/numa_balancing. A value of 2 means TPP is enabled. Second, check /sys/kernel/mm/numa/demotion_enabled to verify whether page demotion is enabled (1) or disabled (0).

From numa_balancing

numa_balancing

Enables/disables and configures automatic page fault based NUMA memory balancing. Memory is moved automatically to nodes that access it often. The value to set can be the result of ORing the following:

0	NUMA_BALANCING_DISABLED
1	NUMA_BALANCING_NORMAL
2	NUMA_BALANCING_MEMORY_TIERING

Or NUMA_BALANCING_NORMAL to optimize page placement among different NUMA nodes to reduce remote accessing. On NUMA machines, there is a performance penalty if remote memory is accessed by a CPU. When this feature is enabled the kernel samples what task thread is accessing memory by periodically unmapping pages and later trapping a page fault. At the time of the page fault, it is determined if the data being accessed should be migrated to a local memory node.
The unmapping of pages and trapping faults incur additional overhead that ideally is offset by improved memory locality but there is no universal guarantee. If the target workload is already bound to NUMA nodes then this feature should be disabled.

Or NUMA_BALANCING_MEMORY_TIERING to optimize page placement among different types of memory (represented as different NUMA nodes) to place the hot pages in the fast memory. This is implemented based on unmapping and page fault too.

numa_balancing_promote_rate_limit_MBps

Too high promotion/demotion throughput between different memory types may hurt application latency. This can be used to rate limit the promotion throughput. The per-node max promotion throughput in MB/s will be limited to be no more than the set value.

A rule of thumb is to set this to less than 1/10 of the PMEM node write bandwidth.

From demotion_enabled

/sys/kernel/mm/numa/demotion_enabled

Defined on file sysfs-kernel-mm-numa

Enable/disable demoting pages during reclaim

Page migration during reclaim is intended for systems with tiered memory configurations. These systems have multiple types of memory with varied performance characteristics instead of plain NUMA systems where the same kind of memory is found at varied distances. Allowing page migration during reclaim enables these systems to migrate pages from fast tiers to slow tiers when the fast tier is under pressure. This migration is performed before swap. It may move data to a NUMA node that does not fall into the cpuset of the allocating process which might be construed to violate the guarantees of cpusets. This should not be enabled on systems which need strict cpuset location guarantees.
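
A minimal sketch of how such a check could look in bash. The function name `check_tpp` and the INFO/WARN wording are hypothetical and not taken from cxlchk; only the two paths come from the text above. Because numa_balancing is described as an ORed value, the sketch tests bit 2 rather than comparing against exactly 2, and it accepts `true` as well as `1` for demotion_enabled on the assumption that some kernels print a boolean word.

```shell
# Hypothetical helper: report TPP and page demotion status.
# $1 and $2 optionally override the two paths so the check can also
# run against files captured in a dataset.
check_tpp() {
  local nb_file="${1:-/proc/sys/kernel/numa_balancing}"
  local demo_file="${2:-/sys/kernel/mm/numa/demotion_enabled}"
  local nb demo

  if [[ -r "$nb_file" ]]; then
    nb=$(<"$nb_file")
    # Value 2 is NUMA_BALANCING_MEMORY_TIERING; test it as a bit.
    if [[ "$nb" =~ ^[0-9]+$ ]] && (( nb & 2 )); then
      echo "INFO: TPP is enabled (numa_balancing=${nb})"
    else
      echo "INFO: TPP is disabled (numa_balancing=${nb})"
    fi
  else
    echo "WARN: cannot read ${nb_file}"
  fi

  if [[ -r "$demo_file" ]]; then
    demo=$(<"$demo_file")
    case "$demo" in
      1|true) echo "INFO: page demotion is enabled (demotion_enabled=${demo})" ;;
      *)      echo "INFO: page demotion is disabled (demotion_enabled=${demo})" ;;
    esac
  else
    echo "WARN: cannot read ${demo_file}"
  fi
}
```

Calling `check_tpp` with no arguments reads the live system files.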

Check for '[Hardware Error]' in dmesg

Hardware errors are reported to dmesg and look similar to the following:

[1206211.934903] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[1206211.934911] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[1206211.934913] {2}[Hardware Error]: event severity: corrected
[1206211.934915] {2}[Hardware Error]:  Error 0, type: corrected
[1206211.934917] {2}[Hardware Error]:   section_type: PCIe error
[1206211.934918] {2}[Hardware Error]:   port_type: 10, root complex event collector
[1206211.934920] {2}[Hardware Error]:   version: 3.0
[1206211.934921] {2}[Hardware Error]:   command: 0x0504, status: 0x0010
[1206211.934923] {2}[Hardware Error]:   device_id: 0000:00:00.4
[1206211.934925] {2}[Hardware Error]:   slot: 0
[1206211.934927] {2}[Hardware Error]:   secondary_bus: 0x00
[1206211.934928] {2}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x0b23
[1206211.934929] {2}[Hardware Error]:   class_code: 080700
[1206211.934946] pcieport 0000:00:00.4: AER: aer_status: 0x00004000, aer_mask: 0x00000000
[1206211.934993] pcieport 0000:00:00.4:    [14] CorrIntErr            
[1206211.934997] pcieport 0000:00:00.4: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID

cxlchk should report the section_type and Error 0, type with a counter to show how many events of each type have been reported.
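
One possible way to produce those counters, as a sketch only: the function name `count_hw_errors` is hypothetical, and the patterns assume the dmesg line layout shown in the example above.

```shell
# Hypothetical helper: read dmesg text on stdin and print one line per
# distinct "section_type: ..." or "Error N, type: ..." value, prefixed
# by the number of times it was seen.
count_hw_errors() {
  grep -F '[Hardware Error]' \
    | sed -nE 's/.*\[Hardware Error\]:[[:space:]]*(section_type: .*|Error [0-9]+, type: .*)$/\1/p' \
    | LC_ALL=C sort \
    | uniq -c \
    | sed -E 's/^[[:space:]]+//'
}
```

Typical use would be `dmesg | count_hw_errors` on a live system, or `count_hw_errors < dmesg.txt` against a saved log (the file name is illustrative).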

Check `lspci` for 'CXL'

A CXL device should appear in the lspci output, e.g.:

# lspci | grep -i CXL 
ab:00.0 CXL: Device 1f2d:1031 (rev 01)

If no 'CXL' device is found, return an error.
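
A sketch of the check, reading lspci output from stdin so it works on live output or a saved file. The function name `check_lspci_cxl` and the PASS/FAIL wording are hypothetical.

```shell
# Hypothetical helper: succeed if any line of the lspci output on stdin
# mentions CXL (case-insensitive), otherwise report a failure.
check_lspci_cxl() {
  local matches
  matches=$(grep -i 'CXL')
  if [[ -n "$matches" ]]; then
    echo "PASS: CXL device(s) found"
    echo "$matches"
    return 0
  fi
  echo "FAIL: no CXL device found in lspci output"
  return 1
}
```

Typical use would be `lspci | check_lspci_cxl`.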

Analyzing a dataset doesn't require root

If a dataset exists, the cxlchk script should not require root privileges to read the files within the dataset.

./cxlchk -A ./cxlchk.hostname.0829-1451/
Please run this script with root privilege or use -h to display help information.
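
One way the privilege check could be relaxed, sketched under the assumption that the script knows the `-A` argument before it tests for root. The function name and its parameters are hypothetical and do not reflect how cxlchk is actually structured.

```shell
# Hypothetical helper: decide whether this invocation needs root.
# $1 = dataset directory given to -A (empty when collecting live data)
# $2 = effective UID (defaults to the caller's; overridable for testing)
require_root_if_needed() {
  local dataset_dir="$1"
  local uid="${2:-$EUID}"

  if [[ -n "$dataset_dir" ]]; then
    # Analyzing an existing dataset only needs read access to it.
    if [[ -d "$dataset_dir" && -r "$dataset_dir" ]]; then
      return 0
    fi
    echo "ERROR: cannot read dataset directory '${dataset_dir}'" >&2
    return 1
  fi

  if (( uid != 0 )); then
    echo "Please run this script with root privilege or use -h to display help information." >&2
    return 1
  fi
  return 0
}
```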

Check for `efi=nosoftreserve` in the Kernel boot entry

With recent Kernel versions, a user can explicitly disable the Special Purpose Memory UEFI feature with the efi=nosoftreserve boot option. For example:

$ cat /proc/cmdline 
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-6.3.4-350.vanilla.fc38.x86_64 root=UUID=ebbbb014-e320-4361-b90c-e70f519a0fee ro rhgb quiet efi=nosoftreserve nopat

A test should be made to detect and report this as an INFO message.
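
A sketch of such a test. The function name `check_nosoftreserve` and the message text are hypothetical. The sketch also handles a comma-separated `efi=` value, on the assumption that several efi options may be combined in one parameter.

```shell
# Hypothetical helper: report whether efi=nosoftreserve is on the kernel
# command line. $1 optionally overrides /proc/cmdline (e.g. a saved copy).
check_nosoftreserve() {
  local cmdline_file="${1:-/proc/cmdline}"
  local -a params
  local p

  read -r -a params < "$cmdline_file"
  for p in "${params[@]}"; do
    # Match "efi=nosoftreserve" alone or inside a comma-separated list.
    if [[ "$p" == efi=* && ",${p#efi=}," == *",nosoftreserve,"* ]]; then
      echo "INFO: efi=nosoftreserve is set; Special Purpose Memory is disabled"
      return 0
    fi
  done
  echo "INFO: efi=nosoftreserve is not set"
  return 1
}
```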

Check `numactl -H` for CPU-less NUMA Nodes

If CXL.mem is configured as a memory node, whether it is Special Purpose Memory or not, there should be a CPU-less (memory-only) NUMA node.

Example:

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 0 size: 128383 MB
node 0 free: 126151 MB
node 1 cpus:
node 1 size: 8053 MB
node 1 free: 7929 MB
node distances:
node   0   1 
  0:  10  14  <<< DRAM
  1:  14  10  <<< CXL
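
A sketch that lists CPU-less nodes from `numactl -H` output on stdin. The function name `cpuless_nodes` is hypothetical. It only reports nodes with a non-zero size, which is a design choice of this sketch: a node whose memory is still offline shows 0 MB and would otherwise be reported too.

```shell
# Hypothetical helper: print the ID of every NUMA node that has no CPUs
# but does have memory, based on "numactl -H" output read from stdin.
cpuless_nodes() {
  awk '
    # "node N cpus: ..." has more than 3 fields only when CPUs are listed.
    /^node [0-9]+ cpus:/ { has_cpu[$2] = (NF > 3) }
    /^node [0-9]+ size:/ { if (!has_cpu[$2] && $4 > 0) print $2 }
  '
}
```

Typical use would be `numactl -H | cpuless_nodes`; empty output means no memory-only node was found.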

Check `/boot/config-$(uname -r)` for config options

cxlchk should check that the following Kernel config options are set to 'm' or 'y':

CONFIG_X86_PMEM_LEGACY=m
CONFIG_ZONE_DEVICE=y
CONFIG_LIBNVDIMM=m
CONFIG_BLK_DEV_PMEM=m
CONFIG_BTT=y
CONFIG_NVDIMM_PFN=y
CONFIG_NVDIMM_DAX=y
CONFIG_DEV_DAX_PMEM=m
CONFIG_ENCRYPTED_KEYS=y

This is a minimum list. If these are not present or incorrect, return an error.
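
A sketch of the config check. The function name `check_kernel_config` and the PASS/FAIL wording are hypothetical; the option list is the one given above.

```shell
# Hypothetical helper: verify each required option is set to y or m.
# $1 optionally overrides the config file path (e.g. a saved copy).
check_kernel_config() {
  local config_file="${1:-/boot/config-$(uname -r)}"
  local -a required=(
    CONFIG_X86_PMEM_LEGACY CONFIG_ZONE_DEVICE CONFIG_LIBNVDIMM
    CONFIG_BLK_DEV_PMEM CONFIG_BTT CONFIG_NVDIMM_PFN
    CONFIG_NVDIMM_DAX CONFIG_DEV_DAX_PMEM CONFIG_ENCRYPTED_KEYS
  )
  local opt rc=0

  for opt in "${required[@]}"; do
    if grep -Eq "^${opt}=[ym]\$" "$config_file"; then
      echo "PASS: ${opt}"
    else
      echo "FAIL: ${opt} is not set to 'y' or 'm'"
      rc=1
    fi
  done
  return "$rc"
}
```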

Check e820 tables in `dmesg` for 'soft reserved'

# dmesg | grep e820
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000003dfff] usable
[    0.000000] BIOS-e820: [mem 0x000000000003e000-0x000000000003efff] reserved
[    0.000000] BIOS-e820: [mem 0x000000000003f000-0x000000000009ffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000006d3c6fff] usable
[    0.000000] BIOS-e820: [mem 0x000000006d3c7000-0x000000006f4c6fff] reserved
[    0.000000] BIOS-e820: [mem 0x000000006f4c7000-0x000000006fdc6fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x000000006fdc7000-0x00000000734f1fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x00000000734f2000-0x00000000777fefff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000777ff000-0x00000000777fffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000077800000-0x000000008fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fe010000-0x00000000fe010fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000807fffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000008080000000-0x000000907fffffff] soft reserved <<<
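
A sketch that extracts the 'soft reserved' ranges from dmesg text on stdin. The function name `find_soft_reserved` is hypothetical, and the pattern assumes the BIOS-e820 line layout shown above.

```shell
# Hypothetical helper: print each address range that the BIOS-e820 table
# marks as "soft reserved". Prints nothing if there is none.
find_soft_reserved() {
  sed -nE 's/.*BIOS-e820: \[mem (0x[0-9a-fA-F]+-0x[0-9a-fA-F]+)\] soft reserved.*/\1/p'
}
```

Typical use would be `dmesg | find_soft_reserved`; empty output means no soft reserved range was reported.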

Memory in the NUMA node should be ONLINE before use

Intel MLC tests (benchmarks/IntelMLC/mlc.sh) may fail with:

alloc_mem_onnode(): unable to mbind: : Invalid argument
Buffer allocation failed!

This is caused when the memory blocks for the selected DRAM or CXL node are OFFLINE. In the following example, NUMA Node 2 is CXL memory:

# numactl -H
available: 5 nodes (0-4)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155
node 0 size: 64206 MB
node 0 free: 54657 MB
node 1 cpus: 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
node 1 size: 128947 MB
node 1 free: 122649 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus:
node 3 size: 64496 MB
node 3 free: 64245 MB
node 4 cpus:
node 4 size: 0 MB
node 4 free: 0 MB
node distances:
node   0   1   2   3   4 
  0:  10  21  14  14  24 
  1:  21  10  24  24  14 
  2:  14  24  10  16  26 
  3:  14  24  16  10  26 
  4:  24  14  26  26  10 

# lsmem -o+ZONES,NODE
RANGE                                  SIZE   STATE REMOVABLE   BLOCK          ZONES NODE
0x0000000000000000-0x000000007fffffff    2G  online       yes       0           None    0
0x0000000100000000-0x000000107fffffff   62G  online       yes    2-32         Normal    0
0x0000001080000000-0x000000307fffffff  128G  online       yes   33-96         Normal    1
0x0000003080000000-0x000000407fffffff   64G  online       yes  97-128         Normal    3
0x0000004080000000-0x000000607fffffff  128G offline           129-192 Normal/Movable    2 <<<
0x0000006080000000-0x000000707fffffff   64G offline           193-224 Normal/Movable    4

Memory block size:         2G
Total online memory:     256G
Total offline memory:    192G

Bring the memory ONLINE using:

# cd /sys/bus/node/devices/node2
# for m in $(find . -name "memory*[0-9]")
do
  # Run as root: 'sudo echo ... > file' does not elevate the redirection.
  echo online > "$m/state"
done

# lsmem -o+ZONES,NODE
RANGE                                  SIZE   STATE REMOVABLE   BLOCK          ZONES NODE
0x0000000000000000-0x000000007fffffff    2G  online       yes       0           None    0
0x0000000100000000-0x000000107fffffff   62G  online       yes    2-32         Normal    0
0x0000001080000000-0x000000307fffffff  128G  online       yes   33-96         Normal    1
0x0000003080000000-0x000000407fffffff   64G  online       yes  97-128         Normal    3
0x0000004080000000-0x000000607fffffff  128G  online       yes 129-192         Normal    2 <<<
0x0000006080000000-0x000000707fffffff   64G offline           193-224 Normal/Movable    4

Memory block size:         2G
Total online memory:     384G
Total offline memory:     64G
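
A sketch of how a check could flag offline ranges before any benchmark runs. The function name `offline_memory` is hypothetical. It reads `lsmem -o+ZONES,NODE` output from stdin and assumes the column layout shown above, where the node ID is the last field.

```shell
# Hypothetical helper: list every offline memory range with its size
# and NUMA node, from "lsmem -o+ZONES,NODE" output on stdin.
offline_memory() {
  awk '/^0x/ && $3 == "offline" {
    print "node " $NF ": " $2 " offline (" $1 ")"
  }'
}
```

Typical use would be `lsmem -o+ZONES,NODE | offline_memory`.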

Check Memory Speed == Configured Memory Speed

From dmidecode -t memory, verify 'Speed' == 'Configured Memory Speed'

Handle 0x0021, DMI type 17, 92 bytes
Memory Device
        Array Handle: 0x001F
        Error Information Handle: No Error
        Total Width: 80 bits
        Data Width: 64 bits
        Size: 64 GB
        Form Factor: DIMM
        Set: None
        Locator: P1-DIMMA1
        Bank Locator: P0_Node0_Channel0_Dimm0
        Type: DDR5
        Type Detail: Synchronous Registered (Buffered)
        Speed: 4800 MT/s
        Manufacturer: Samsung
        Serial Number: 80CE04230143FAE831
        Asset Tag: P1-DIMMA1_AssetTag (date:23/01)
        Part Number: M321R8GA0BB0-CQKMS  
        Rank: 2
        Configured Memory Speed: 4800 MT/s
        Minimum Voltage: 1.1 V
        Maximum Voltage: 1.1 V
        Configured Voltage: 1.1 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: Unknown
        Module Manufacturer ID: Bank 1, Hex 0xCE
        Module Product ID: Unknown
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 64 GB
        Cache Size: None
        Logical Size: None
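
A sketch of the comparison, reading `dmidecode -t memory` output from stdin. The function name `check_dimm_speed` is hypothetical. It assumes the field names and ordering shown above, and it skips slots whose configured speed is reported as Unknown, on the assumption that those are empty.

```shell
# Hypothetical helper: for each memory device, compare 'Speed' with
# 'Configured Memory Speed'. Returns non-zero if any DIMM differs.
check_dimm_speed() {
  awk -F': ' '
    /^[[:space:]]*Locator:/ { loc = $2 }
    /^[[:space:]]*Speed:/   { speed = $2 }
    /^[[:space:]]*Configured Memory Speed:/ {
      if ($2 == "Unknown") next          # assumed to be an empty slot
      if ($2 == speed) {
        print "PASS: " loc " " $2
      } else {
        print "FAIL: " loc " Speed=" speed " Configured=" $2
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }
  '
}
```

Typical use would be `sudo dmidecode -t memory | check_dimm_speed`.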

Report 'oom-kill' activity in dmesg

cxlchk should report a count of how many oom-kill events were found in dmesg. An example of part of the message looks like this:

[539824.438309] oom-kill:constraint=CONSTRAINT_MEMORY_POLICY,nodemask=1,cpuset=user.slice,mems_allowed=0-3,global_oom,task_memcg=/user.slice/user-0.slice/session-241.scope,task=numa_alloc,pid=2638010,uid=0
[539824.438321] Out of memory: Killed process 2638010 (numa_alloc) total-vm:15731464kB, anon-rss:14473472kB, file-rss:1664kB, shmem-rss:0kB, UID:0 pgtables:28788kB oom_score_adj:0
[539891.267033] numa_alloc invoked oom-killer: gfp_mask=0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO), order=0, oom_score_adj=0
[539891.267041] CPU: 145 PID: 2641883 Comm: numa_alloc Tainted: G S                 6.2.16-060216-generic #202305171336

Example:

# dmesg | grep -c "oom-kill:"
2

Verify the CXL memory is 'Movable' in `lsmem`

We would expect to see CXL memory as 'Movable' in the following output

# lsmem -o+ZONES,NODE
RANGE                                  SIZE  STATE REMOVABLE   BLOCK   ZONES NODE
0x0000000000000000-0x000000007fffffff    2G online       yes       0    None    0
0x0000000100000000-0x000000407fffffff  254G online       yes   2-128  Normal    0
0x0000004080000000-0x000000807fffffff  256G online       yes 129-256  Normal    1
0x0000008080000000-0x000000887fffffff   32G online       yes 257-272 Movable    2

Memory block size:         2G
Total online memory:     544G
Total offline memory:      0B

Warn if not. It's not fatal because if the CXL memory is not presented as Special Purpose Memory (SPM), then it will appear as a memory NUMA node without a devdax.
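
A sketch of the warning logic for one NUMA node. The function name `check_node_movable` is hypothetical, and the CXL node ID has to be supplied by the caller. It reads `lsmem -o+ZONES,NODE` output from stdin and assumes ZONES and NODE are the last two columns, as shown above.

```shell
# Hypothetical helper: warn if any memory range of NUMA node $1 is not
# in the Movable zone. Reads "lsmem -o+ZONES,NODE" output on stdin.
check_node_movable() {
  local node="$1"
  awk -v node="$node" '
    /^0x/ && $NF == node {
      seen = 1
      if ($(NF-1) != "Movable") {
        print "WARN: node " node " range " $1 " is in zone " $(NF-1)
        warn = 1
      }
    }
    END {
      if (!seen) { print "WARN: no memory ranges found for node " node; exit 1 }
      if (!warn) print "PASS: all memory on node " node " is Movable"
      exit (warn ? 1 : 0)
    }
  '
}
```

Typical use would be `lsmem -o+ZONES,NODE | check_node_movable 2`.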

Look for `disabled` entries in `daxctl list`

Look for disabled dax devices

# daxctl list
[
  {
    "chardev":"dax1.0",
    "size":34359738368,
    "target_node":3,
    "align":2097152,
    "mode":"devdax"
  },
  {
    "chardev":"dax2.0",
    "size":68719476736,
    "target_node":2,
    "align":2097152,
    "mode":"devdax",
    "state":"disabled"
  },
  {
    "chardev":"dax0.0",
    "size":34359738368,
    "target_node":2,
    "align":2097152,
    "mode":"system-ram",
    "online_memblocks":16,
    "total_memblocks":16,
    "movable":true
  }
]
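
A sketch that names the disabled devices without needing a JSON parser. The function name `disabled_dax_devices` is hypothetical. It assumes the pretty-printed layout shown above, with one key per line and "chardev" appearing before "state" within each object; a JSON-aware tool would be more robust.

```shell
# Hypothetical helper: print the chardev name of every dax device whose
# state is "disabled", from "daxctl list" output on stdin.
disabled_dax_devices() {
  awk -F'"' '
    /"chardev":/          { dev = $4 }   # remember the current device name
    /"state":"disabled"/  { print dev }
  '
}
```

Typical use would be `daxctl list | disabled_dax_devices`; empty output means no disabled device was found.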

Check `dmesg` for errors related to CXL drivers

If a cxl driver fails to load, an error message will be displayed in dmesg, for example:

[  464.285235] cxl_pci: Unknown symbol cxl_enumerate_cmds (err -2)
[  464.285265] cxl_pci: Unknown symbol devm_cxl_add_nvdimm (err -2)
[  464.285284] cxl_pci: Unknown symbol cxl_find_regblock (err -2)
[  464.285298] cxl_pci: Unknown symbol cxl_map_component_regs (err -2)
[  464.285311] cxl_pci: Unknown symbol devm_cxl_add_memdev (err -2)
[  464.285321] cxl_pci: Unknown symbol cxl_probe_device_regs (err -2)
[  464.285335] cxl_pci: Unknown symbol cxl_map_device_regs (err -2)
[  464.285348] cxl_pci: Unknown symbol cxl_dev_state_identify (err -2)
[  464.285367] cxl_pci: Unknown symbol cxl_mem_create_range_info (err -2)
[  464.285379] cxl_pci: Unknown symbol cxl_dev_state_create (err -2)

This test should check for these failures and report a FAILED status.
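
A sketch of that test, reading dmesg text from stdin. The function name `check_cxl_driver_errors` is hypothetical, and the pattern only covers the 'Unknown symbol' failures shown above; other driver failure messages would need their own patterns.

```shell
# Hypothetical helper: count "Unknown symbol" load failures reported by
# any module whose name starts with "cxl".
check_cxl_driver_errors() {
  local count
  count=$(grep -Ec '(^|[^a-z_])cxl[a-z_]*: Unknown symbol ')
  if (( count > 0 )); then
    echo "FAILED: ${count} CXL driver symbol error(s) found in dmesg"
    return 1
  fi
  echo "PASSED: no CXL driver symbol errors found in dmesg"
  return 0
}
```

Typical use would be `dmesg | check_cxl_driver_errors`.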

Check CXL Device Health

For each CXL device, check the Health field(s). Many are boolean (True | False)

$ sudo cxl list --health -i
[
  {
    "memdev":"mem0",
    "ram_size":137438953472,
    "health":{
      "maintenance_needed":true,
      "performance_degraded":false,
      "hw_replacement_needed":false,
      "media_normal":false,
      "media_not_ready":false,
      "media_persistence_lost":true,
      "media_data_lost":false,
      "media_powerloss_persistence_loss":false,
      "media_shutdown_persistence_loss":false,
      "media_persistence_loss_imminent":false,
      "media_powerloss_data_loss":false,
      "media_shutdown_data_loss":false,
      "media_data_loss_imminent":false,
      "ext_life_used":"unknown",
      "ext_temperature":"normal",
      "ext_corrected_volatile":"normal",
      "ext_corrected_persistent":"normal",
      "life_used_percent":4,
      "temperature":0,
      "dirty_shutdowns":0,
      "volatile_errors":0,
      "pmem_errors":0
    },
    "serial":9947034466371306773,
    "numa_node":0,
    "host":"0000:38:00.0",
    "state":"disabled"
  }
]
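
A sketch of how the boolean fields might be evaluated. The function name `check_cxl_health` is hypothetical, and the pass/fail rule is an assumption of this sketch, not something stated above: it treats `media_normal` as healthy when true and every other boolean as healthy when false. It also assumes the pretty-printed layout shown above, with one key per line.

```shell
# Hypothetical helper: warn about every boolean health field whose value
# looks unhealthy. Reads "cxl list --health -i" output on stdin.
check_cxl_health() {
  sed -n '/"health":{/,/}/p' \
    | sed -nE 's/^[[:space:]]*"([a-z_]+)":(true|false),?[[:space:]]*$/\1 \2/p' \
    | (
        rc=0
        while read -r key val; do
          if [[ "$key" == "media_normal" ]]; then
            [[ "$val" == "true" ]] && continue    # assumed healthy value
          else
            [[ "$val" == "false" ]] && continue   # assumed healthy value
          fi
          echo "WARN: ${key}=${val}"
          rc=1
        done
        exit "$rc"
      )
}
```

The string-valued fields such as `ext_temperature` and the counters are not examined by this sketch.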

Check all DDR modules are the same

Using dmidecode -t memory, check that all the DDR module Size, Manufacturer, and Part Numbers are the same for all DIMMs installed in the system. PASS if True, FAIL if False. It may be okay to have a substitute part number installed, but flag it anyway.

Handle 0x0021, DMI type 17, 92 bytes
Memory Device
        Array Handle: 0x001F
        Error Information Handle: No Error
        Total Width: 80 bits
        Data Width: 64 bits
        Size: 64 GB
        Form Factor: DIMM
        Set: None
        Locator: P1-DIMMA1
        Bank Locator: P0_Node0_Channel0_Dimm0
        Type: DDR5
        Type Detail: Synchronous Registered (Buffered)
        Speed: 4800 MT/s
        Manufacturer: Samsung
        Serial Number: 80CE04230143FAE831
        Asset Tag: P1-DIMMA1_AssetTag (date:23/01)
        Part Number: M321R8GA0BB0-CQKMS  
        Rank: 2
        Configured Memory Speed: 4800 MT/s
        Minimum Voltage: 1.1 V
        Maximum Voltage: 1.1 V
        Configured Voltage: 1.1 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: Unknown
        Module Manufacturer ID: Bank 1, Hex 0xCE
        Module Product ID: Unknown
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 64 GB
        Cache Size: None
        Logical Size: None
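
A sketch of the comparison, reading `dmidecode -t memory` output from stdin. The function name `check_dimms_identical` is hypothetical. It assumes the field order shown above (Size, then Manufacturer, then Part Number) and skips slots whose Size starts with "No Module", on the assumption that those are empty.

```shell
# Hypothetical helper: group installed DIMMs by Size, Manufacturer and
# Part Number. PASS when exactly one group exists, FAIL otherwise.
check_dimms_identical() {
  awk -F': *' '
    /^[[:space:]]*Size:/         { size = $2 }
    /^[[:space:]]*Manufacturer:/ { mfr = $2 }
    /^[[:space:]]*Part Number:/  {
      part = $2
      sub(/[[:space:]]+$/, "", part)     # part numbers are often padded
      if (size ~ /^No Module/) next      # assumed to be an empty slot
      key = size " | " mfr " | " part
      if (!(key in count)) groups++
      count[key]++
    }
    END {
      for (k in count) print count[k] " x " k
      if (groups == 1) { print "PASS"; exit 0 }
      print "FAIL"
      exit 1
    }
  '
}
```

The per-group counts are printed even on FAIL so that a substitute part number can be reviewed rather than silently rejected.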

Implement the `-m` option to allow the user to specify which modules or tests to execute

cxlchk partially implements the -m command option, which should allow a user to provide a comma-separated list of modules and tests to execute.

  # echo "   -m <module1, module2, ..., moduleN>"
  # echo "      Specify which Analyzer modules to include or exclude"
  # echo " "

A user can explicitly exclude a module/test by prefixing the name with a minus sign (-). Tests or modules in the list without a prefix are implicitly included.
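
A sketch of how the list could be parsed. The function name `parse_module_list` and the two global arrays `INCLUDE_MODULES` and `EXCLUDE_MODULES` are hypothetical and not taken from cxlchk; the module names in the usage note are made up for illustration.

```shell
# Hypothetical helper: split a comma-separated -m argument into two
# global arrays. Names prefixed with '-' are excluded, others included.
parse_module_list() {
  local list="$1"
  local -a items
  local item

  INCLUDE_MODULES=()
  EXCLUDE_MODULES=()

  IFS=',' read -r -a items <<< "$list"
  for item in "${items[@]}"; do
    item="${item//[[:space:]]/}"         # tolerate "a, b, c" spacing
    [[ -z "$item" ]] && continue
    if [[ "$item" == -* ]]; then
      EXCLUDE_MODULES+=("${item#-}")
    else
      INCLUDE_MODULES+=("$item")
    fi
  done
}
```

For example, `parse_module_list "numa, -dmesg,lspci"` would leave `numa` and `lspci` in `INCLUDE_MODULES` and `dmesg` in `EXCLUDE_MODULES`.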

Check for dax errors in dmesg

Write a rule that looks for errors similar to these in dmesg

# dmesg | grep -i dax
[   20.551421] device_dax dax0.0: mapping0: 0x880000000-0x187fffffff could not reserve range
[   20.553485] device_dax: probe of dax0.0 failed with error -16
[  138.941186] kmem dax0.0: mapping0: 0x880000000-0x187fffffff could not reserve region
[  138.941206] kmem: probe of dax0.0 failed with error -16
[  138.941253] kmem dax0.0: mapping0: 0x880000000-0x187fffffff could not reserve region
[  138.941259] kmem: probe of dax0.0 failed with error -16
[  158.604997] device_dax dax0.0: dynamic-dax with pre-populated page map
[  158.605010] device_dax: probe of dax0.0 failed with error -22

The dax device may not be visible, or could be in a disabled state:

# daxctl list
[
  {
    "chardev":"dax0.0",
    "size":68719476736,
    "target_node":2,
    "align":2097152,
    "mode":"devdax",
    "state":"disabled"
  }
]
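
A sketch of such a rule, reading dmesg text from stdin. The function name `find_dax_errors` is hypothetical, and the pattern only covers the two message shapes shown above ('could not reserve' and 'failed with error') from the device_dax and kmem drivers.

```shell
# Hypothetical helper: report dax-related reservation and probe failures.
find_dax_errors() {
  local matches count
  matches=$(grep -E '(device_dax|kmem).*(could not reserve|failed with error)')
  if [[ -z "$matches" ]]; then
    echo "PASSED: no dax errors found"
    return 0
  fi
  count=$(grep -c '' <<< "$matches")     # number of matching lines
  echo "FAILED: ${count} dax error line(s) found"
  echo "$matches"
  return 1
}
```

Typical use would be `dmesg | find_dax_errors`.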
