
xpumanager's Introduction

Intel(R) XPU Manager and XPU System Management Interface

Intel(R) XPU Manager is a free and open-source tool for monitoring and managing Intel data center GPUs.

It is designed to simplify administration, maximize reliability and uptime, and improve utilization.

XPU Manager can be used standalone through its command line interface (CLI) to manage GPUs locally, or through its RESTful APIs to manage GPUs remotely. Intel(R) XPU System Management Interface (XPU-SMI) is the daemon-less version of XPU Manager and provides only the local interface. The XPU-SMI feature scope is a subset of XPU Manager's; the features of both are listed in the table below. Note that XPU-SMI and XPU Manager cannot be installed or run on the same system because of resource conflicts. XPU-SMI is included in the GPU driver repository. To use XPU Manager, uninstall XPU-SMI first and then install XPU Manager.

amcmcli is a portable CLI tool for managing GPU AMC firmware on Linux. It is independent of the GPU driver.

Third-party open-source and commercial workload and cluster managers, job schedulers, and monitoring solutions can also integrate XPU Manager or XPU-SMI to manage Intel data center GPUs.

Intel(R) XPU Manager features

  • Administration:
    • GPU discovery and information - name, model, serial, stepping, location, frequency, memory capacity, firmware version
    • GPU topology and grouping
    • GPU Firmware updating, including GPU GFX firmware and AMC (Add-in card Management Controller) firmware updating.
  • Monitoring:
    • GPU telemetry – utilization, power, frequency, temperature, fabric speed, memory throughput, errors
    • GPU health – memory, power, temperature, fabric port, etc.
  • Diagnostics:
    • 3 levels of GPU diagnostic tests
    • Pre-check GPU hardware and driver critical issues
    • GPU log collection for issue investigation
  • Configuration:
    • GPU Settings - GPU power limits, frequency range, standby mode, scheduler mode, ECC On/Off, performance factor, fabric port status
    • GPU policies - Throttle the GPU when the set temperature threshold is reached
  • Supported Frameworks:
    • Prometheus exporter, Docker container support, Icinga plugin

CLI output of GPU device info, telemetries and firmware update

xpumcli discovery -d 0
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Type: GPU                                                                     |
|           | Device Name: Intel(R) Graphics [0x56c0]                                              |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | UUID: 01000000-0000-0000-0000-0000004d0000                                           |
|           | Serial Number: LQAC20305316                                                          |
|           | Core Clock Rate: 2050 MHz                                                            |
|           | Stepping: C0                                                                         |
|           |                                                                                      |
|           | Driver Version:                                                                      |
|           | Kernel Version: 5.15.47+prerelease3762                                               |
|           | GFX Firmware Name: GFX                                                               |
|           | GFX Firmware Version: DG02_1.3170                                                    |
|           | GFX Data Firmware Name: GFX_DATA                                                     |
|           | GFX Data Firmware Version: 0x12d                                                     |
|           |                                                                                      |
|           | PCI BDF Address: 0000:4d:00.0                                                        |
|           | PCI Slot: J37 - Riser 1, Slot 1                                                      |
|           | PCIe Generation: 4                                                                   |
|           | PCIe Max Link Width: 16                                                              |
+-----------+--------------------------------------------------------------------------------------+

xpumcli dump -d 0 -m 0,1,2,3
Timestamp, DeviceId, GPU Utilization (%), GPU Power (W), GPU Frequency (MHz), GPU Core Temperature (Celsius Degree)
21:23:00.000,    0, 99.55, 119.61, 1800, 49.00
21:23:01.000,    0, 99.45, 119.36, 1800, 50.00
21:23:02.000,    0, 99.48, 119.55, 1750, 50.50
21:23:03.000,    0, 99.65, 119.24, 1700, 51.00
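
Because the dump output above is comma-separated, it can be post-processed with standard tools. The following is a minimal sketch rather than a feature of XPU Manager: the helper name summarize_dump is hypothetical, and the column positions assume the exact layout printed by the -m 0,1,2,3 invocation shown above.

```shell
#!/bin/sh
# Sketch only: summarize_dump is a hypothetical helper, not an xpumcli feature.
# It assumes the column order printed by `xpumcli dump -d 0 -m 0,1,2,3`:
#   Timestamp, DeviceId, Utilization, Power, Frequency, Core Temperature
summarize_dump() {
    awk -F',' '
        NR == 1 { next }            # skip the header row
        NF < 6  { next }            # skip blank or truncated lines
        {
            power += $4; n++
            if (n == 1 || $6 + 0 > max_temp) max_temp = $6 + 0
        }
        END {
            if (n > 0)
                printf "samples=%d avg_power_w=%.2f max_temp_c=%.2f\n", n, power / n, max_temp
        }
    '
}

# Sample rows copied from the dump output above.
summarize_dump <<'EOF'
Timestamp, DeviceId, GPU Utilization (%), GPU Power (W), GPU Frequency (MHz), GPU Core Temperature (Celsius Degree)
21:23:00.000,    0, 99.55, 119.61, 1800, 49.00
21:23:01.000,    0, 99.45, 119.36, 1800, 50.00
21:23:02.000,    0, 99.48, 119.55, 1750, 50.50
21:23:03.000,    0, 99.65, 119.24, 1700, 51.00
EOF
```

On a system with the tool installed, the same function could presumably be fed from a pipe instead of the embedded sample, for example by piping the dump command into summarize_dump.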


sudo xpumcli updatefw -d 0 -t GFX -f ATS_M150_512_C0_PVT_ES_032_gfx_fwupdate_SOC1.bin
Device 0 FW version: DG02_1.3170
Image FW version: DG02_1.3172
Do you want to continue? (y/n) y
Start to update firmware
Firmware Name: GFX
Image path: /home/dcm/ATS_M150_512_C0_PVT_ES_032_gfx_fwupdate_SOC1.bin
[============================================================] 100 %
Update firmware successfully.

Feature set of XPU Manager, XPU-SMI and XPU-SMI Windows CLI tool

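+--------------------------+-----------------------+----------------------+----------------------+-------------+
| Feature                  | XPU Manager           | XPU-SMI              | XPU-SMI Windows CLI  | amcmcli     |
+--------------------------+-----------------------+----------------------+----------------------+-------------+
| Device Info and Topology | Yes                   | Yes                  | Yes                  | No          |
| GPU Telemetries          | Yes (aggregated data) | Yes (real-time data) | Yes (real-time data) | No          |
| GPU Firmware Update      | GFX, GFX_Data, AMC    | GFX, GFX_Data, AMC   | GFX, GFX_Data, AMC   | AMC (IPMI)  |
| GPU Configuration        | Yes                   | Yes                  | Yes                  | No          |
| GPU Diagnostics          | Yes                   | Yes                  | No                   | No          |
| GPU Health               | Yes                   | Yes                  | No                   | No          |
| GPU Grouping             | Yes                   | No                   | No                   | No          |
| GPU policy               | Yes                   | No                   | No                   | No          |
| Architecture             | Daemon based          | Daemon-less          | Daemon-less          | Daemon-less |
| Interfaces               | CLI, RESTful, Library | CLI, Library         | CLI, Library         | CLI         |
+--------------------------+-----------------------+----------------------+----------------------+-------------+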

How to get XPU Manager, XPU-SMI, Windows CLI and amcmcli binaries.

You may get the latest installers or binaries in Releases.

Supported Devices

Supported OSes

  • XPU Manager
    • Ubuntu 20.04.3/22.04
    • RHEL 8.8/9.2
    • CentOS 8/9 Stream
    • CentOS 7.4/7.9
    • SLES 15 SP4/SP5
  • XPU-SMI
    • Ubuntu 20.04.3/22.04
    • RHEL 8.8/9.2
    • CentOS 8/9 Stream
    • CentOS 7.4/7.9
    • SLES 15 SP4/SP5
    • Debian 10.13
    • Windows Server 2019/2022 (limited features including: GPU device info, GPU telemetry, GPU firmware update and GPU configuration)

Documentation

Architecture

XPU Manager Architecture

GPU telemetry exported to Grafana

GPU telemetry exported from XPU Manager to Grafana


xpumanager's Issues

xpumcli for Windows failed to work

The tool xpumcli.exe fails to execute with a dynamic library error. The executable downloaded from the releases page and the one built from source both have the same issue.
I suspect ze_loader.dll is not consistent with ze_loader.lib. The ze_loader.dll on my machine was installed by the latest oneAPI installer, version 1.2.3. Should the ze_loader.lib in the source code be updated?


xpumd / xpumcli do not error out & show help on incorrect arguments

Version: V1.2.9

Issues

xpumd keeps running when it is given arguments it does not recognize or support, instead of exiting with help output:

# xpumd policy
[2023-05-23 16:34:58.288] [I] [247-247] XPUM: Init xpum library
[2023-05-23 16:34:58.288] [I] [247-247] XPU Manager:	1.2.9.20230517
...
^C

After accidentally starting a second XPUM instance as above and terminating it with ^C, the original XPUM instance also dies for some reason.

After that, xpumcli no longer shows help output when required options are missing, or when it is given an argument it does not recognize at all:

# xpumcli agentset
Error: XPUM Service Status Error.
# xpumcli foobar
Error: XPUM Service Status Error.

Document needed on how to build/install on fedora platform.

I always see the following missing libraries when I try to install xpumanager:
"
$ rpm -i xpu-smi-1.2.36-20240428.081009.377f9162.x86_64.rpm
warning: xpu-smi-1.2.36-20240428.081009.377f9162.x86_64.rpm: Header V4 RSA/SHA256 Signature, key ID 15ef8f2b: NOKEY
error: Failed dependencies:
intel-gsc >= 0.8.4 is needed by xpu-smi-1.2.36-20240428.081009.377f9162.x86_64
intel-level-zero-gpu >= 1.3.23726 is needed by xpu-smi-1.2.36-20240428.081009.377f9162.x86_64
level-zero >= 1.7.9.1 is needed by xpu-smi-1.2.36-20240428.081009.377f9162.x86_64
"
I can't find intel-gsc anywhere, and I already have intel-level-zero-gpu and level-zero installed.
OS: Fedora 38
GPU: A770

Environment variables are not documented

There should be a proper document describing the environment variables that affect how XPUM works, and the values they can take.

For example, the XPUM_METRICS values needed to keep XPUM from filling the disk with log spam seem to be listed only in its header file:
https://github.com/intel/xpumanager/blob/master/core/include/xpum_api.h

Additionally, any other document that mentions these environment variables should link to the environment variable document, for example the references to XPUM_METRICS on the XPUM DockerHub page: https://hub.docker.com/r/intel/xpumanager

Integrated GPU support

Is the iGPU supported?
I am getting N/A for every iGPU metric, and the reported frequency is not the real-time frequency but the maximum frequency the SoC supports.

$ xpu-smi stats -d 0000:00:02.0
+-----------------------------+--------------------------------------------------------------------+
| Device ID                   | 0                                                                  |
+-----------------------------+--------------------------------------------------------------------+
| GPU Utilization (%)         | N/A                                                                |
| EU Array Active (%)         | N/A                                                                |
| EU Array Stall (%)          | N/A                                                                |
| EU Array Idle (%)           | N/A                                                                |
|                             |                                                                    |
| Compute Engine Util (%)     | N/A                                                                |
| Render Engine Util (%)      | N/A                                                                |
| Media Engine Util (%)       | N/A                                                                |
| Decoder Engine Util (%)     | N/A                                                                |
| Encoder Engine Util (%)     | N/A                                                                |
| Copy Engine Util (%)        | N/A                                                                |
| Media EM Engine Util (%)    | N/A                                                                |
| 3D Engine Util (%)          | N/A                                                                |
+-----------------------------+--------------------------------------------------------------------+
| Reset                       | N/A                                                                |
| Programming Errors          | N/A                                                                |
| Driver Errors               | N/A                                                                |
| Cache Errors Correctable    | N/A                                                                |
| Cache Errors Uncorrectable  | N/A                                                                |
| Mem Errors Correctable      | N/A                                                                |
| Mem Errors Uncorrectable    | N/A                                                                |
+-----------------------------+--------------------------------------------------------------------+
| GPU Power (W)               | N/A                                                                |
| GPU Frequency (MHz)         | 1400                                                               |
| Media Engine Freq (MHz)     | N/A                                                                |
| GPU Core Temperature (C)    | N/A                                                                |
| GPU Memory Temperature (C)  | N/A                                                                |
| GPU Memory Read (kB/s)      | N/A                                                                |
| GPU Memory Write (kB/s)     | N/A                                                                |
| GPU Memory Bandwidth (%)    | N/A                                                                |
| GPU Memory Used (MiB)       | N/A                                                                |
| GPU Memory Util (%)         | N/A                                                                |
| Xe Link Throughput (kB/s)   | N/A                                                                |
+-----------------------------+--------------------------------------------------------------------+

GPUDeviceStub::init zeInit error

Hi,

I cannot initialize xpumanager on my device.

[2022-06-08 16:43:10.937] [I] [2925-2925] XPUM: Init xpum library
[2022-06-08 16:43:10.937] [I] [2925-2925] XPU Manager: 1.0.0.20220406
[2022-06-08 16:43:10.937] [I] [2925-2925] Build: 0ba1c207
[2022-06-08 16:43:10.937] [I] [2925-2925] Level Zero: 1.8.1
[2022-06-08 16:43:10.937] [I] [2925-2925] xpumd core starts to initialize
[2022-06-08 16:43:10.938] [I] [2925-2925] initialize configuration
[2022-06-08 16:43:10.938] [I] [2925-2925] initialize datalogic
[2022-06-08 16:43:10.938] [I] [2925-2925] initialize device manager
[2022-06-08 16:43:10.940] [E] [2925-2925] GPUDeviceStub::init zeInit error: 78000001
[2022-06-08 16:43:10.940] [I] [2925-2925] GPUDeviceStub::checkInitDependency start
[2022-06-08 16:43:10.940] [I] [2925-2925] Environment variables check pass
[2022-06-08 16:43:10.940] [I] [2925-2925] Libraries check pass.
[2022-06-08 16:43:10.940] [I] [2925-2925] Permission check pass.
[2022-06-08 16:43:10.940] [I] [2925-2925] GPUDeviceStub::checkInitDependency done
[2022-06-08 16:43:10.940] [E] [2925-2925] xpumInit LevelZeroInitializationException
[2022-06-08 16:43:10.940] [E] [2925-2925] Failed to init xpum core: zeInit error
[2022-06-08 16:43:10.940] [E] [2925-2925] XPUM: Load xpum library failed! 37

[2022-06-08 16:43:10.940] [I] [2925-2925] XPUM: start XPUM RPC Server.
[2022-06-08 16:43:10.940] [I] [2925-2925] XPUM: start RPC server ...
[2022-06-08 16:43:10.941] [I] [2925-2925] XPUM: RPC server is listening at /tmp/xpum.sock

Do you need additional information to understand this problem?

XPUM Service Status Error on CentOS 8 (ATS-M3)

xpumcli -v
Error: XPUM Service Status Error.

cat /proc/version
Linux version 5.15.47 (ubit@fm6pudocker160) (gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), GNU ld version 2.30-113.el8) #4485.el8 SMP Fri Feb 10 23:56:46 UTC 2023


xpumcli -h
Intel XPU Manager Command Line Interface -- v1.2
Intel XPU Manager Command Line Interface provides the Intel data center GPU model and monitoring capabilities. It can also be used to change the Intel data center GPU settings and update the firmware.
Intel XPU Manager is based on Intel oneAPI Level Zero. Before using Intel XPU Manager, the GPU driver and Intel oneAPI Level Zero should be installed rightly.

Supported devices:
  - Intel Data Center GPU

Usage: xpumcli [Options]
  xpumcli -v
  xpumcli -h
  xpumcli discovery

Options:
  -h,--help                   Print this help message and exit
  -v,--version                Display version information and exit.

Subcommands:
  discovery                   Discover the GPU devices installed on this machine and provide the device info.
  topology                    Get the system topology.
  group                       Group the managed GPU devices.
  diag                        Run some test suites to diagnose GPU.
  health                      Get the GPU device component health status.
  policy                      Get and set the GPU policies.
  updatefw                    Update GPU firmware
  config                      Get and change the GPU settings.
  topdown                     Expected feature.
  ps                          List status of processes.
  stats                       List the GPU aggregated statistics since last execution of this command or XPU Manager daemon is started.
  dump                        Dump device statistics data.
  log                         Collect GPU debug logs.
  agentset                    Get or change some XPU Manager settings.
  amcsensor                   List the AMC real-time sensor readings.

Support for Arc GPUs?

Hi!

Are there any plans to include at least partial support for Arc GPUs, which don't seem to be currently supported? It would be very useful to be able to see the output similar to nvidia-smi for NVIDIA GPUs, where one can (among other things) see:

  • the temperature
  • driver version
  • total VRAM
  • currently used VRAM along with the list of processes using that
  • GPU utilization

If there are no plans for that currently, is there any documentation for any specific kind of API that one should consult in order to implement such a tool? I'm very new to Intel's stack, so any pointers would be much appreciated!

Thanks in advance for your time!

xpumanager xpumd container fails with errors

Hi, with an Intel Data Center GPU Flex 140 on OCP-, using the Intel device plugins operator GPU plugin, the xpumanager daemonset and the xpumanager sidecar, xpumd fails with the errors below. I used the kustomization YAML with the xpumanager master branch, the latest release v1.2.18, and v1.2.13 (the intel/xpumanager:v1.2.13 Docker image tag).

[2023-09-12 18:50:09.947] [I] [1-1] XPUM: Init xpum library
[2023-09-12 18:50:09.947] [I] [1-1] XPU Manager:        1.2.13.20230629
[2023-09-12 18:50:09.947] [I] [1-1] Build:              aeeedfec
[2023-09-12 18:50:09.947] [I] [1-1] Level Zero: 1.9.0
[2023-09-12 18:50:09.947] [I] [1-1] xpumd core starts to initialize
[2023-09-12 18:50:09.947] [I] [1-1] initialize configuration
[2023-09-12 18:50:09.947] [I] [1-1] xpum mode: xpum
[2023-09-12 18:50:09.947] [I] [1-1] The environment variable XPUM_METRICS is detected: 0-38
[2023-09-12 18:50:09.947] [I] [1-1] initialize datalogic
[2023-09-12 18:50:09.947] [I] [1-1] initialize device manager
[2023-09-12 18:50:09.975] [E] [1-1] Failed to load msr kernel module
sh: 1: modprobe: not found
[2023-09-12 18:50:10.815] [W] [1-25] Device Intel(R) Data Center GPU Flex 1400000:3c:00.0 has no Memory Temperature capability.
[2023-09-12 18:50:10.815] [W] [1-25] Capability Memory Temperature detection returned: No temperature sensor detected
[2023-09-12 18:50:10.815] [W] [1-24] Device Intel(R) Data Center GPU Flex 1400000:37:00.0 has no Memory Temperature capability.
[2023-09-12 18:50:10.815] [W] [1-24] Capability Memory Temperature detection returned: No temperature sensor detected
[2023-09-12 18:50:10.815] [W] [1-25] Device Intel(R) Data Center GPU Flex 1400000:3c:00.0 has no Memory Bandwidth capability.
[2023-09-12 18:50:10.815] [W] [1-25] Capability Memory Bandwidth detection returned: [toGetMemoryBandwidth:1978] zesMemoryGetBandwidth-1:0x78000003
[2023-09-12 18:50:10.815] [W] [1-25] Device Intel(R) Data Center GPU Flex 1400000:3c:00.0 has no Memory Read Write Throughput capability.
[2023-09-12 18:50:10.815] [W] [1-25] Capability Memory Read Write Throughput detection returned: [toGetMemoryReadWrite:2056] zesMemoryGetBandwidth:0x78000003
[2023-09-12 18:50:10.815] [W] [1-24] Device Intel(R) Data Center GPU Flex 1400000:37:00.0 has no Memory Bandwidth capability.
[2023-09-12 18:50:10.815] [W] [1-24] Capability Memory Bandwidth detection returned: [toGetMemoryBandwidth:1978] zesMemoryGetBandwidth-1:0x78000003
[2023-09-12 18:50:10.815] [W] [1-24] Device Intel(R) Data Center GPU Flex 1400000:37:00.0 has no Memory Read Write Throughput capability.
[2023-09-12 18:50:10.815] [W] [1-24] Capability Memory Read Write Throughput detection returned: [toGetMemoryReadWrite:2056] zesMemoryGetBandwidth:0x78000003
malloc(): unaligned tcache chunk detected

Is it recommended to build a specific release image from scratch for deployment? Or are there specific requirements that I missed in the deployment? Thank you!

build.sh does not return an error when build fails

Build scripts should return an error when build fails, but "build.sh" does NOT.

You should either use "set -e" in the script, or check return values of all commands called in "build.sh" (and print error message + return with an error from the script if command fails).
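
Both options mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the contents of the real build.sh: run_step and build_all are hypothetical names, and the steps are placeholders (true and a simulated failure) rather than actual build commands.

```shell
#!/bin/sh
# Sketch only: run_step and build_all are hypothetical names, and the steps
# are placeholders rather than the real contents of build.sh.

# Option 1: `set -e` makes the shell exit at the first failing command.
# A child shell is used so the demonstration does not stop this script.
child_status=0
sh -c 'set -e; false; echo "not reached"' || child_status=$?
echo "child shell with set -e exited with status $child_status"

# Option 2: check each command explicitly and report which step failed.
run_step() {
    desc=$1
    shift
    "$@" || {
        step_rc=$?
        echo "ERROR: step '$desc' failed with exit status $step_rc" >&2
        return "$step_rc"
    }
}

build_all() {
    run_step "configure" true           || return $?
    run_step "compile"   sh -c 'exit 3' || return $?   # simulated failure
    run_step "package"   true           || return $?   # never reached
}

build_rc=0
build_all || build_rc=$?
echo "build_all returned $build_rc"
```

With set -e the script stops silently at the failing command, while the explicit variant also names the failing step; either way the caller sees a non-zero exit status.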

XPU-SMI not working with A770

Running Ubuntu 22.04 with kernel 5.19.0-41-generic with an Intel Arc A770, XPU-SMI is not working. It mostly reports empty fields when running xpu-smi stats -d 0, and when it does report something, the values don't make sense. For example, GPU Memory Used doesn't agree with the values I get from IPEX (a difference of more than an order of magnitude).

It's probably not a driver issue on my system: XPU Manager is somewhat working and I have no trouble with IPEX.

If there are no plans for more comprehensive Arc support in XPU Manager/XPU-SMI, are there other tools from Intel that offer basic support for things like checking temperatures and memory usage? Also, though not necessarily specific to XPU Manager, more documentation in general would be useful. For example, the XPU Manager documentation is the only place I can find that refers to updating the device firmware. Is that something that needs to be done on Arc cards, or only on data center GPUs?

GPU temperature is not reported by Prometheus exporter for Max 1100

Steps to reproduce:

  • Install the latest xpumanager (V1.2.29) as Kubernetes DaemonSet.
  • Install the provided Grafana dashboard.

Result:

  • Grafana shows metrics such as "GPU Utilization"

  • Grafana shows "No data" for "GPU Temperature"

GPU: Max 1100
Driver: Agama 775.20
xpumanager: 1.2.29

Documentation mismatches in regards to what metrics XPUM supports

I compared the following documents and the metrics they list XPU Manager as providing:

The CSV file info in particular seems very out of date, and the install guide, for example, lists frequency throttle ratio (as not supported by the current L0 backend) while the user guide does not. IMHO it would be better to keep the list of supported metrics in a single place and refer to it from the other documents.

Dump FW versions.

This is more of a feature request. Is it possible to add the ability to dump the FW versions so fleets of cards can be maintained when new drivers or editions are released? Is it possible to add this to IGSC/L?

Regards.

Invalid SUSE distribution detection in CMakeLists.txt

CMakeLists.txt has this:

if(NOT DEFINED CPACK_GENERATOR)
  if(EXISTS "/etc/debian_version")
    set(CPACK_GENERATOR "DEB")
  elseif(EXISTS "/etc/redhat-release")
    set(CPACK_GENERATOR "RPM")
  elseif(EXISTS "/etc/SUSE-brand" OR EXISTS "/etc/SUSE-release")
    set(CPACK_GENERATOR "RPM")
  else()
    set(CPACK_GENERATOR "ZIP")
  endif()
endif()

This is wrong for both SLES and openSUSE (base containers): neither of them includes files named SUSE* in its /etc directory. The check should be made against the tooling that is actually needed (dpkg/rpm/zip), or e.g. against the /etc/os-release contents instead.
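
The /etc/os-release approach suggested here can be sketched as below. The real fix would live in CMakeLists.txt; shell is used only to illustrate the decision logic. The function name detect_pkg_format is hypothetical, and the ID and ID_LIKE values matched in the case statement (and used in the sample file) are assumptions about typical os-release contents, not values taken from this report.

```shell
#!/bin/sh
# Sketch only: detect_pkg_format is a hypothetical helper that mirrors the
# DEB/RPM/ZIP choice made in CMakeLists.txt, based on os-release fields.
# The distribution IDs below are assumptions and may need extending.
detect_pkg_format() {
    os_release=${1:-/etc/os-release}
    if [ ! -r "$os_release" ]; then
        echo "ZIP"
        return 0
    fi
    # Source the file inside a command substitution so that its variables
    # do not leak into the calling shell. The path must contain a slash.
    ids=$(
        unset ID ID_LIKE
        . "$os_release"
        echo "${ID:-} ${ID_LIKE:-}"
    )
    for id in $ids; do
        case $id in
            debian|ubuntu) echo "DEB"; return 0 ;;
            rhel|centos|fedora|suse|sles|opensuse*) echo "RPM"; return 0 ;;
        esac
    done
    echo "ZIP"
}

# Demonstration with an assumed SLES-like os-release file.
sample=$(mktemp)
printf 'NAME="SLES"\nID="sles"\nID_LIKE="suse"\n' > "$sample"
echo "SLES-like os-release -> $(detect_pkg_format "$sample")"
rm -f "$sample"
```

Checking ID_LIKE as well as ID lets derivatives fall into the right family without each one being listed explicitly.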

Because it fails to detect that SLES is RPM-based, the XPUM CMakeLists.txt defaults to ZIP, which also appears to be buggy:

---------Create installation package-----------
CPack: Create package using ZIP
CPack: Install projects
CPack: - Run preinstall target for: xpumanager
CPack: - Install project: xpumanager []
CMake Error at /home/root/xpumanager/build/cmake_install.cmake:182 (file):
  file INSTALL cannot find
  "/home/someuser/xpumanager/daemon/xpum.service.template": No such file or directory.

xpu-smi dump -m 1,2,3,4,5 not reporting temperature.

As a non-root user on an Ubuntu 22.04.2 LTS system, xpu-smi dump -m 1,2,3,4,5 does not report temperatures for the GPU or GPU memory. Refer to the output below.
xpu-smi -v
CLI:
Version: 1.2.5.20230313
Build ID: f458af77

Service:
Version: 1.2.5.20230313
Build ID: f458af77
Level Zero Version: 1.8.8

randalls@DUT5169PVC: xpu-smi dump -m 1,2,3,4,5
Timestamp, DeviceId, GPU Power (W), GPU Frequency (MHz), GPU Core Temperature (Celsius Degree), GPU Memory Temperature (Celsius Degree), GPU Memory Utilization (%)
18:10:25.000, 0, 265.02, 1600.00, , , 0.06
18:10:25.000, 1, 270.21, 1600.00, , , 0.06
18:10:26.000, 0, 265.16, 1600.00, , , 0.06
18:10:26.000, 1, 269.84, 1600.00, , , 0.06
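
Gaps like the empty temperature columns above can be detected mechanically. The following is a rough sketch, not part of xpu-smi: the helper name empty_metric_columns is hypothetical, and it assumes the first line of the dump is the header and that no header name contains a comma.

```shell
#!/bin/sh
# Sketch only: empty_metric_columns is a hypothetical helper. It prints the
# header name of every column that is blank in all data rows of a dump.
empty_metric_columns() {
    awk -F',' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        NR == 1 {
            ncol = NF
            for (i = 1; i <= NF; i++) name[i] = trim($i)
            next
        }
        NF == ncol {
            rows++
            for (i = 1; i <= NF; i++) if (trim($i) != "") seen[i] = 1
        }
        END {
            if (rows > 0)
                for (i = 1; i <= ncol; i++) if (!(i in seen)) print name[i]
        }
    '
}

# Sample rows copied from the output above.
empty_metric_columns <<'EOF'
Timestamp, DeviceId, GPU Power (W), GPU Frequency (MHz), GPU Core Temperature (Celsius Degree), GPU Memory Temperature (Celsius Degree), GPU Memory Utilization (%)
18:10:25.000, 0, 265.02, 1600.00, , , 0.06
18:10:25.000, 1, 270.21, 1600.00, , , 0.06
18:10:26.000, 0, 265.16, 1600.00, , , 0.06
18:10:26.000, 1, 269.84, 1600.00, , , 0.06
EOF
```

For the sample rows this names the two temperature columns, which matches what the report describes.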

xpu-smi can be displayed like nvidia-smi

nvidia-smi directly shows the GPU status, including but not limited to GPU utilization, memory utilization, and running processes.
xpu-smi is somewhat hard to use by comparison.
Could your team improve the UX?
Thanks.

Contact xpum test method

Hello

We are a company called Teratec, located in Korea. We are preparing to develop a management solution for the PVC graphics cards to be launched next year.
I am interested in the released xpumanager.
We would like to set it up and test it, but we don't have a PVC card. Is there a way to test it?
Is it possible to test with the integrated Intel GPU in a laptop?

Thanks
CS.HAN

xpum won't install in a rockylinux container

$ sudo rpm -i tmp/xpumanager_centos.1.0.0.20220610.164551.01e95b37.rpm
warning: tmp/xpumanager_centos.1.0.0.20220610.164551.01e95b37.rpm: Header V4 RSA/SHA256 Signature, key ID a0661990: NOKEY
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to connect to bus: Host is down
warning: %post(xpumanager-1.0.0-1.0.0.20220610.164551.01e95b37.x86_64) scriptlet failed, exit status 1

Any ideas? Is there a way to install just xpumcli without tying it to systemd?

Thanks,
James

xpu-smi dump stats shows incorrect GPU utilization

I am running a workload on an Intel Max Series 1550 with xpu-smi installed. I can see that the model is training on the GPU, but xpu-smi dump -d 0 -m 0 shows N/A.

However, I can see the GPU memory utilization percentage with xpu-smi dump -d 0 -m 5.

The xpu-smi version details on my machine were attached as a screenshot.

xpu-smi and PyTorch on GPUs

On NVIDIA GPUs there is a relation between nvidia-smi and PyTorch. nvidia-smi, like xpu-smi, is used to detect GPUs and monitor GPU telemetry, and when nvidia-smi is absent on the host, torch.cuda.is_available returns False. For Intel GPUs, however, there seems to be no relation between PyTorch GPU support and xpu-smi: PyTorch reports xpu as available (ipex.xpu.is_available() returns True) even when xpu-smi is not installed.

Is this integrated or am I missing something?

Container images on Docker Hub

The latest container image version available on Docker Hub is v1.2.13. Will newer versions continue to be published?
If not, which Dockerfile should be used, and what are the prerequisites for building the image? What is the difference between Dockerfile.ubuntu.22.04 and Dockerfile.ubuntu.22.04.max?

xpumd fails to start if -m is specified

If I try to start xpumd with the -m option, it fails regardless of which metrics I specify; even "xpumd -m 1" fails. The error is:

[2022-07-13 08:59:50.193] [I] [85075-85075] XPUM: Init xpum library
[2022-07-13 08:59:50.193] [I] [85075-85075] XPU Manager:	1.0.0.20220711
[2022-07-13 08:59:50.193] [I] [85075-85075] Build:		00000000
[2022-07-13 08:59:50.193] [I] [85075-85075] Level Zero:	1.8.0
[2022-07-13 08:59:50.193] [I] [85075-85075] xpumd core starts to initialize
[2022-07-13 08:59:50.193] [I] [85075-85075] initialize configuration
[2022-07-13 08:59:50.193] [I] [85075-85075] The environment variable XPUM_METRICS is detected: 1
[2022-07-13 08:59:50.193] [I] [85075-85075] initialize datalogic
[2022-07-13 08:59:50.193] [I] [85075-85075] initialize device manager
[2022-07-13 08:59:51.023] [E] [85075-85075] GPUDeviceStub::init zeInit error: 70020000
[2022-07-13 08:59:51.023] [I] [85075-85075] GPUDeviceStub::checkInitDependency start
[2022-07-13 08:59:51.023] [I] [85075-85075] Environment variables check pass
[2022-07-13 08:59:51.028] [I] [85075-85075] Libraries check pass.
[2022-07-13 08:59:51.028] [I] [85075-85075] Permission check pass.
[2022-07-13 08:59:51.028] [I] [85075-85075] GPUDeviceStub::checkInitDependency done
[2022-07-13 08:59:51.028] [E] [85075-85075] xpumInit LevelZeroInitializationException
[2022-07-13 08:59:51.028] [E] [85075-85075] Failed to init xpum core: zeInit error
[2022-07-13 08:59:51.028] [E] [85075-85075] XPUM: Load xpum library failed! 37
[2022-07-13 08:59:51.028] [I] [85075-85075] XPUM: start XPUM RPC Server.
[2022-07-13 08:59:51.028] [I] [85075-85075] XPUM: start RPC server ...
[2022-07-13 08:59:51.029] [I] [85075-85075] XPUM: RPC server is listening at /tmp/xpum.sock
^C[2022-07-13 09:00:05.894] [W] [85075-85075] XPUM: recieved SIGTERM signal 2, service shutdown.
[2022-07-13 09:00:05.894] [I] [85075-85075] XPUM: Shutting down RPC server...
[2022-07-13 09:00:05.894] [I] [85075-85075] XPUM: Waiting for RPC server shutdown...
[2022-07-13 09:00:05.894] [I] [85075-85075] XPUM: Shut down.
[2022-07-13 09:00:05.895] [I] [85075-85075] xpumd stopped
[2022-07-13 09:00:05.895] [I] [85075-85075] XPUM: xpum service is closed.

xpu-smi unable to set powerlimit

I am using an Arc A770 and Ubuntu 20.04 with kernel 5.14.0-1034-oem.

I want to limit the GPU TDP to 50 W to see the impact on GPU performance.
root@ubuntuserver20:/home# xpu-smi config -d 0 --powerlimit 50
Return: Succeed to set the power limit on GPU 0.

The message seems to indicate that the power limit was set successfully.
root@ubuntuserver20:/home# xpu-smi config -d 0
+-------------+-------------------+----------------------------------------------------------------+
| Device Type | Device ID/Tile ID | Configuration                                                  |
+-------------+-------------------+----------------------------------------------------------------+
| GPU         | 0                 | Power Limit (w): 50                                            |
|             |                   | Valid Range: 1 to 0                                            |
|             |                   |                                                                |
|             |                   | Memory ECC:                                                    |
|             |                   | Current:                                                       |
|             |                   | Pending:                                                       |
+-------------+-------------------+----------------------------------------------------------------+

However, when I run GPU workloads, I still see the GPU consuming more than 50 W of power.

root@ubuntuserver20:/home/openvino# xpu-smi dump -d 0 -m 0,1,2
Timestamp, DeviceId, GPU Utilization (%), GPU Power (W), GPU Frequency (MHz)
05:07:33.000, 0, 94.10, 104.60, 2000
05:07:34.000, 0, 94.04, 103.34, 2000
05:07:35.000, 0, 93.31, 103.36, 2000
05:07:36.000, 0, 93.37, 102.90, 2000
05:07:37.000, 0, 93.47, 102.56, 2150

Am I missing something here?

Will ARC be supported?

There's currently no way to get most performance statistics on ARC GPUs. intel_gpu_top doesn't have memory usage, and while it appears xpu-smi has some metrics it's missing a lot on ARC.

I'm working on a multi-GPU ARC system and it's hard to troubleshoot certain things without knowing what the GPUs are doing outside of code.

Thanks!

"xpumcli dump" doesn't report all stats unless you use the --rawdata option

If I run xpumcli dump and start a raw data dump like this:

xpumcli dump --rawdata --start -d 0 -t 0 -m 0,1,2,3,4,5

It writes the requested metrics to the dump file as expected:

Timestamp, DeviceId, TileId, GPU Utilization (%), GPU Power (W), GPU Frequency (MHz), GPU Core Temperature (Celsius Degree), GPU Memory Temperature (Celsius Degree), GPU Memory Utilization (%)
2022-07-13T20:41:44.046Z,0,0,0.00,11.00,1400,31.00,32.00,0.00
2022-07-13T20:41:44.546Z,0,0,0.00,11.00,1400,31.00,32.00,0.00
2022-07-13T20:41:45.046Z,0,0,0.00,11.00,1400,30.00,32.00,0.00
2022-07-13T20:41:45.546Z,0,0,0.00,11.00,1400,31.00,32.00,0.00

If I run xpumcli dump on the command-line to retrieve stats on-demand, most fields are blank:

# ./xpumcli dump -d 0 -m 0,1,2,3,4,5 -n 4
Timestamp, DeviceId, GPU Utilization (%), GPU Power (W), GPU Frequency (MHz), GPU Core Temperature (Celsius Degree), GPU Memory Temperature (Celsius Degree), GPU Memory Utilization (%)
2022-07-13T21:04:48.000Z,    0,     , 44.60,     ,     ,     ,     
2022-07-13T21:04:49.000Z,    0,     , 44.60,     ,     ,     ,     
2022-07-13T21:04:50.000Z,    0,     , 44.61,     ,     ,     ,     
2022-07-13T21:04:51.000Z,    0,     , 44.62,     ,     ,     ,     

This is the CentOS build. Tested both in a CentOS 8.4 container, and on bare metal with Rocky 8.5.
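
To make this kind of report easier to reproduce, the blank columns can be detected automatically. This is a sketch that assumes the comma-separated layout shown above. `blank_columns` is a hypothetical helper name, not an xpumcli feature.

```python
def blank_columns(dump_text):
    """Return header names whose value is empty in every data row."""
    lines = [ln for ln in dump_text.splitlines() if ln.strip()]
    header = [h.strip() for h in lines[0].split(",")]
    rows = [[c.strip() for c in ln.split(",")] for ln in lines[1:]]
    blank = []
    for i, name in enumerate(header):
        # A column counts as blank if it is missing or empty in all rows.
        if all(i >= len(r) or r[i] == "" for r in rows):
            blank.append(name)
    return blank

# Header and two rows taken from the on-demand dump above.
SAMPLE = (
    "Timestamp, DeviceId, GPU Utilization (%), GPU Power (W), GPU Frequency (MHz), "
    "GPU Core Temperature (Celsius Degree), GPU Memory Temperature (Celsius Degree), "
    "GPU Memory Utilization (%)\n"
    "2022-07-13T21:04:48.000Z,    0,     , 44.60,     ,     ,     ,\n"
    "2022-07-13T21:04:49.000Z,    0,     , 44.60,     ,     ,     ,\n"
)

for name in blank_columns(SAMPLE):
    print("blank:", name)
```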

xpumd generates the same error messages ad nauseam

My instances of xpumd generate this error message for every card, at every poll interval:

[2022-07-14 13:37:47.002] [W] [144540-144567] partial monitoring failure: [toGetMemoryWrite:1394] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:47.002] [W] [144540-144580] partial monitoring failure: [toGetMemoryRead:1343] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:48.502] [W] [144540-144567] partial monitoring failure: [toGetMemoryRead:1343] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:49.002] [W] [144540-144568] partial monitoring failure: [toGetMemoryWrite:1394] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:49.003] [W] [144540-144567] partial monitoring failure: [toGetMemoryReadThroughput:1446] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:49.192] [W] [144540-144578] partial monitoring failure: [toGetMemoryBandwidth:1292] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:50.000] [W] [144540-144574] partial monitoring failure: [toGetMemoryReadThroughput:1446] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:50.076] [W] [144540-144578] partial monitoring failure: [toGetMemoryBandwidth:1292] zesMemoryGetBandwidth:0x7ffffffe
[2022-07-14 13:37:50.502] [W] [144540-144580] partial monitoring failure: [toGetMemoryWriteThroughput:1498] zesMemoryGetBandwidth:0x7ffffffe

And it repeats forever. This floods the console/tty unless you redirect stderr. If you turn on the logfile option, the log file grows and grows and grows...

There needs to be a mechanism that keeps xpumd from repeating the same information basically forever, especially rapid-fire like this. It makes the logging feature unusable because the logs are filled with noise.
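
Until something like that exists in xpumd itself, the noise can be collapsed after the fact. The sketch below is a post-processing filter, not a change to xpumd. It assumes the "[timestamp] [level] [pid-tid]" prefix seen in the log lines above, and `collapse_repeats` is a hypothetical name.

```python
import re
from collections import Counter

# Matches the "[timestamp] [level] [pid-tid] " prefix of the xpumd lines above.
PREFIX = re.compile(r"^\[[^\]]+\] \[(\w)\] \[\d+-\d+\] ")

def collapse_repeats(lines):
    """Keep the first occurrence of each message and count its repeats."""
    counts = Counter()
    first_seen = []
    for line in lines:
        line = line.strip()
        key = PREFIX.sub("", line)  # message text without timestamp and thread id
        if not key:
            continue
        if key not in counts:
            first_seen.append(line)
        counts[key] += 1
    return [(line, counts[PREFIX.sub("", line)]) for line in first_seen]

# Three lines copied from the log excerpt above.
LINES = [
    "[2022-07-14 13:37:47.002] [W] [144540-144567] partial monitoring failure: [toGetMemoryWrite:1394] zesMemoryGetBandwidth:0x7ffffffe",
    "[2022-07-14 13:37:47.002] [W] [144540-144580] partial monitoring failure: [toGetMemoryRead:1343] zesMemoryGetBandwidth:0x7ffffffe",
    "[2022-07-14 13:37:48.502] [W] [144540-144567] partial monitoring failure: [toGetMemoryRead:1343] zesMemoryGetBandwidth:0x7ffffffe",
]

for line, n in collapse_repeats(LINES):
    print(f"{line} (seen {n}x)")
```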

"xpumcli stats" JSON format is awkward

xpumcli stats JSON output puts all metric types into the same array. That makes it hard to retrieve a specific metric in an automated way, as one basically gets a random metric type (here it happens to be power):

$ xpumcli stats --json --device 0 | jq .device_level[0]
{
  "avg": 40.12,
  "max": 41.15,
  "metrics_type": "XPUM_STATS_POWER",
  "min": 40.11,
  "value": 40.63
}

If each metric type were under its own key, it would be trivial to get specific metric values:

$ xpumcli stats --json --device 0 | jq .device_level.XPUM_STATS_POWER[].avg
40.12

(If there were multiple power values, the jq clause above would list them all.)
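
As a workaround with the current format, the array can be re-keyed on the client side. This is a minimal sketch that assumes the top-level object has a "device_level" array of entries shaped like the one shown above. `group_by_metric` is a hypothetical helper, not part of xpumcli.

```python
import json
from collections import defaultdict

def group_by_metric(stats):
    """Re-key the device_level array by each entry's metrics_type field."""
    grouped = defaultdict(list)
    for entry in stats.get("device_level", []):
        values = {k: v for k, v in entry.items() if k != "metrics_type"}
        grouped[entry["metrics_type"]].append(values)
    return dict(grouped)

# Entry copied from the jq output above, wrapped in the assumed top-level object.
RAW = (
    '{"device_level": [{"avg": 40.12, "max": 41.15, '
    '"metrics_type": "XPUM_STATS_POWER", "min": 40.11, "value": 40.63}]}'
)

grouped = group_by_metric(json.loads(RAW))
print(grouped["XPUM_STATS_POWER"][0]["avg"])
```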

Provide kustomize overlays for configuring k8s deployment

The current XPUM deployment enables and requires a lot of privileges, which are not really needed in most installations: intel/intel-device-plugins-for-kubernetes#1342 (comment)

The deployment should be split into a minimal base and overlays that add the extra features, similarly to how e.g. GPU plugin features are handled.

See:

Sampling interval option for "xpumd"

Currently the "xpumd" internal sampling interval can be set only with the external "xpumcli agentset -t" command.

While it's nice to be able to change it at run-time, it should also be possible to set the interval directly from the "xpumd" command line.

In some situations, using an external utility can be either awkward or a potential security issue, compared to just restarting the "xpumd" container with a new sampling interval option value.

The currently supported set of sampling intervals is also very limited:

# xpumcli agentset -t 5000
--time: 5000 not in {100,200,500,1000}
Run with --help for more information.

IMHO it would be better to allow any value, and return an error only when counters for the selected metrics can overflow with that interval.
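
The overflow-based check could look roughly like the sketch below. All names and numbers are hypothetical: xpumd's real counter widths and tick rates are not stated here, so the 32-bit, one-tick-per-microsecond counter is only an illustration. The idea is that a sample delta stays unambiguous as long as the counter cannot wrap more than once between two samples.

```python
def wrap_period_ms(counter_bits, max_ticks_per_second):
    """Time in milliseconds for a counter to wrap at its maximum tick rate."""
    return (2 ** counter_bits) / max_ticks_per_second * 1000.0

def counters_at_risk(interval_ms, counters):
    """Return names of counters that could wrap within one sampling interval.

    counters maps a name to (counter_bits, max_ticks_per_second).
    """
    return [
        name
        for name, (bits, rate) in counters.items()
        if interval_ms >= wrap_period_ms(bits, rate)
    ]

# Hypothetical counter: 32 bits wide, ticking once per microsecond.
EXAMPLE = {"example_counter": (32, 1_000_000)}

print(counters_at_risk(5000, EXAMPLE))       # interval rejected by agentset above
print(counters_at_risk(5_000_000, EXAMPLE))  # longer than the wrap period
```

Under these assumed numbers a 5000 ms interval is far below the wrap period, so a fixed allow-list is stricter than an overflow check would need to be.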

Intel Arc?

Are there any plans to add support for Intel Arc GPUs? Or is there any CLI to configure Arc GPUs on Linux?

CentOS build container fails

If you try to build XPUM using the CentOS build container it fails when running pip3:

      File "/tmp/pip-build-42dfn1zm/sphinxcontrib-openapi/.eggs/setuptools_scm-7.0.5-py3.6.egg/setuptools_scm/__init__.py", line 5
        from __future__ import annotations
        ^
    SyntaxError: future feature annotations is not defined
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-42dfn1zm/sphinxcontrib-openapi/
The command '/bin/sh -c pip3 --proxy=$http_proxy install grpcio-tools mistune==0.8.4 apispec apispec_webframeworks Sphinx     sphinx_rtd_theme sphinxcontrib-openapi apispec-webframeworks myst-parser marshmallow     prometheus-client flask flask_httpauth' returned a non-zero code: 1

The issue is that support for "from __future__ import annotations" was added in Python 3.7, but CentOS 8.4 ships with Python 3.6. An easy fix is to replace the references to python3 here:

python3 python3-devel python3-pip rpm-build wget && \

with:

python38 python38-devel python38-pip
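
The version requirement can be verified inside the container before running pip3. This is a small sketch. The helper name `supports_future_annotations` is hypothetical, and the 3.7 threshold is the one stated above.

```python
import sys

def supports_future_annotations(version=sys.version_info):
    """True if `from __future__ import annotations` is accepted (Python 3.7+)."""
    return tuple(version[:2]) >= (3, 7)

if not supports_future_annotations():
    print("Python %d.%d is too old for setuptools_scm 7.x" % sys.version_info[:2])
```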

"_total" suffix missing from counter metrics with older Prometheus clients

When using python3-prometheus-client in openSUSE 15.4 (which corresponds to SLES 15.4), XPUM counter metrics are missing the _total suffix required by the OpenMetrics spec: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md

XPUM should not rely on a (newer) prometheus-client adding the required suffix, but should use the correct metric names directly.

PS. While the openSUSE prometheus client is a very old v0.0.20 (from 2017), _total suffix enforcement is a more recent feature, i.e. this bug will also manifest with client versions newer than that.
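
Adding the suffix explicitly is a one-line normalization. The sketch below does not use prometheus-client at all. `counter_sample_name` and the metric name in the example are hypothetical and not taken from XPUM.

```python
def counter_sample_name(name):
    """Return the counter sample name with the `_total` suffix applied once."""
    return name if name.endswith("_total") else name + "_total"

# Hypothetical metric name, for illustration only.
print(counter_sample_name("example_energy_joules"))
```

Because the function is idempotent, it is safe to apply regardless of whether the installed client library also appends the suffix.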

"xpumcli vgpu --precheck" does not return error on failures

Version: V1.2.9

Test-case:
xpumcli vgpu --precheck

Expected outcome:

  • Non-zero error code returned on failures

Actual outcome:

  • Success (0) returned on failures:
+----------+---------------------------------------------------------------------------------------+
| VMX Flag | Result: Pass                                                                          |
|          | Message:                                                                              |
+----------+---------------------------------------------------------------------------------------+
| SR-IOV   | Result: Fail                                                                          |
|          | Message: SR-IOV is disabled.                                                          |
+----------+---------------------------------------------------------------------------------------+
| IOMMU    | Result: Fail                                                                          |
|          | Message: IOMMU is disabled                                                            |
+----------+---------------------------------------------------------------------------------------+
# echo $?
0
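
Until the exit code is fixed, a wrapper can derive one from the table text. This is a sketch that assumes the "Result: Pass" / "Result: Fail" wording shown above. `precheck_exit_code` is a hypothetical helper.

```python
import re

def precheck_exit_code(table_text):
    """Return 1 if any row of the precheck table reports 'Result: Fail', else 0."""
    results = re.findall(r"Result:\s*(\w+)", table_text)
    return 1 if any(r.lower() == "fail" for r in results) else 0

# Rows abbreviated from the precheck output above.
SAMPLE = """| VMX Flag | Result: Pass |
| SR-IOV   | Result: Fail |
| IOMMU    | Result: Fail |
"""

print(precheck_exit_code(SAMPLE))
```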

Multiple copies for parsing "uevent" sysfs file

I noticed multiple places with almost identical code for parsing the sysfs uevent file:

$ git grep printf.*/uevent
cli/src/local_functions.cpp:            snprintf(path, PATH_MAX, "/sys/class/drm/%s/device/uevent", pdirent->d_name);
core/src/device/gpu/gpu_device_stub.cpp:        len = snprintf(path, PATH_MAX, "/sys/class/drm/%s/device/uevent",
core/src/device/gpu/gpu_device_stub.cpp:        len = snprintf(path, PATH_MAX, "/sys/class/drm/%s/device/uevent",
core/src/diagnostic/diagnostic_manager.cpp:        len = snprintf(path, PATH_MAX, "/sys/class/drm/%s/device/uevent",
core/src/diagnostic/precheck.cpp:                    snprintf(path, PATH_MAX, "/sys/class/drm/%s/device/uevent", pdirent->d_name);

I would suggest consolidating all of them into a single helper function that takes e.g. the PCI BDF string that should be matched.
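
The logic of such a helper is small. The sketch below shows it in Python rather than the project's C++, purely to illustrate the shape. The function names are hypothetical, and the sample uevent content is illustrative rather than copied from a real system.

```python
def parse_uevent(text):
    """Parse KEY=VALUE lines from a sysfs uevent file into a dict."""
    entries = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without '='
            entries[key.strip()] = value.strip()
    return entries

def uevent_matches_bdf(text, bdf):
    """True if the uevent content belongs to the PCI device with the given BDF."""
    return parse_uevent(text).get("PCI_SLOT_NAME", "").lower() == bdf.lower()

# Illustrative content in the format of /sys/class/drm/<card>/device/uevent.
SAMPLE = "DRIVER=i915\nPCI_ID=8086:56A0\nPCI_SLOT_NAME=0000:03:00.0\n"

print(uevent_matches_bdf(SAMPLE, "0000:03:00.0"))
```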

request: info per subdevice

Since each device may have multiple subdevices, it would be extremely useful if we can have info per subdevice. For example the output of xpu-smi ps could have an option to display per subdevice info, and other tools that have -d,--device option could accept <num>.<num> for specifying subdevices.
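
Parsing the proposed <num>.<num> form is straightforward. This is a sketch of the requested syntax only: the tools do not accept it today, `parse_device_spec` is a hypothetical name, and only the numeric form is handled.

```python
def parse_device_spec(spec):
    """Parse '<device>' or '<device>.<subdevice>' into (device, subdevice or None)."""
    device, sep, sub = spec.partition(".")
    if not device.isdigit() or (sep and not sub.isdigit()):
        raise ValueError(f"invalid device spec: {spec!r}")
    return int(device), (int(sub) if sep else None)

print(parse_device_spec("0"))
print(parse_device_spec("1.0"))
```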

fail to change ARC 770 frequency.

Hi, I am trying to use this tool to limit the GPU's frequency to a range I specify.
I installed the v1.2.29 deb package (xpumanager_1.2.29_20240201.035533.2b2f658d.u22.04_amd64.deb) on my machine.
After I execute
xpumcli discovery
I get an error:
Error: XPUM Service Status Error.
Then I checked the state of the xpum service and got:

 systemctl status xpum
× xpum.service - XPUM daemon
     Loaded: loaded (/lib/systemd/system/xpum.service; enabled; vendor preset: enabled)
     Active: failed (Result: signal) since Mon 2024-02-05 06:49:38 UTC; 22min ago
    Process: 9781 ExecStartPre=/bin/sh -c ulimit -c unlimited (code=exited, status=0/SUCCESS)
    Process: 9782 ExecStart=/usr/bin/xpumd -p /var/xpum_daemon.pid -d /usr/lib/xpum/dump (code=killed, signal=FPE)
   Main PID: 9782 (code=killed, signal=FPE)
        CPU: 150ms

Feb 05 06:49:38 DUT001DG2SVC xpumd[9782]: [2024-02-05 06:49:38.593] [I] [9782-9782] Level Zero:        1.15.0
Feb 05 06:49:38 DUT001DG2SVC xpumd[9782]: [2024-02-05 06:49:38.593] [I] [9782-9782] xpumd core starts to initialize
Feb 05 06:49:38 DUT001DG2SVC xpumd[9782]: [2024-02-05 06:49:38.593] [I] [9782-9782] initialize configuration
Feb 05 06:49:38 DUT001DG2SVC xpumd[9782]: [2024-02-05 06:49:38.593] [I] [9782-9782] xpum mode: xpum
Feb 05 06:49:38 DUT001DG2SVC xpumd[9782]: [2024-02-05 06:49:38.594] [I] [9782-9782] initialize datalogic
Feb 05 06:49:38 DUT001DG2SVC xpumd[9782]: [2024-02-05 06:49:38.594] [I] [9782-9782] initialize device manager
Feb 05 06:49:38 DUT001DG2SVC xpumd[9782]: [2024-02-05 06:49:38.675] [W] [9782-9816] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Memory Temperatur>
Feb 05 06:49:38 DUT001DG2SVC xpumd[9782]: [2024-02-05 06:49:38.676] [W] [9782-9816] Capability Memory Temperature detection returned:
Feb 05 06:49:38 DUT001DG2SVC systemd[1]: xpum.service: Main process exited, code=killed, status=8/FPE
Feb 05 06:49:38 DUT001DG2SVC systemd[1]: xpum.service: Failed with result 'signal'.

But if I run xpumd directly, the process is not killed, although there are some warnings and errors:

xpumd
[2024-02-05 07:14:11.566] [I] [15258-15258] XPUM: Init xpum library
[2024-02-05 07:14:11.566] [I] [15258-15258] XPU Manager:        1.2.28.20240118
[2024-02-05 07:14:11.566] [I] [15258-15258] Build:              89af66d7
[2024-02-05 07:14:11.566] [I] [15258-15258] Level Zero: 1.15.0
[2024-02-05 07:14:11.566] [I] [15258-15258] xpumd core starts to initialize
[2024-02-05 07:14:11.566] [I] [15258-15258] initialize configuration
[2024-02-05 07:14:11.566] [I] [15258-15258] xpum mode: xpum
[2024-02-05 07:14:11.566] [I] [15258-15258] initialize datalogic
[2024-02-05 07:14:11.566] [I] [15258-15258] initialize device manager
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no GPU Temperature capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Capability GPU Temperature detection returned: [toGetTemperature:1827] zesTemperatureGetState:0x70020000
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Memory Temperature capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Capability Memory Temperature detection returned: 
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Memory Throughput and Bandwidth capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Capability Memory Throughput and Bandwidth detection returned: [toGetMemoryThroughputAndBandwidth:1954] zesMemoryGetBandwidth:0x70020000
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no GPU Utilization capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Capability GPU Utilization detection returned: 
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Engine Utilization capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Capability Engine Utilization detection returned: 
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Ras Error capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Capability Ras Error detection returned: toGetRasErrorOnSubdevice error
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Frequency Throttle capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Capability Frequency Throttle detection returned: [toGetFrequencyThrottle:1680] zesFrequencyGetThrottleTime:0x78000003
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no fabric throughput capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Capability fabric throughput detection returned: fabric port not found
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Compute Engine Group Utilization monitoring capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Media Engine Group Utilization monitoring capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Copy Engine Group Utilization monitoring capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Render Engine Group Utilization monitoring capability.
[2024-02-05 07:14:11.645] [W] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no 3D Engine Group Utilization monitoring capability.
[2024-02-05 07:14:11.645] [I] [15258-15273] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has the following monitoring metric types: power, energy, frequency, request frequency, throttle reason, media engine frequency.
[2024-02-05 07:14:11.730] [I] [15258-15258] initialize health manager
[2024-02-05 07:14:11.730] [I] [15258-15258] initialize group manager
[2024-02-05 07:14:11.735] [I] [15258-15258] initialize diagnostic manager
[2024-02-05 07:14:11.735] [I] [15258-15258] initialize policy manager
[2024-02-05 07:14:11.735] [I] [15258-15258] initialize dump raw data manager
[2024-02-05 07:14:11.736] [I] [15258-15258] initialize firmware manager
[2024-02-05 07:14:11.736] [E] [15258-15289] Fail to get SoC fw version from device: /dev/mei2
[2024-02-05 07:14:11.736] [I] [15258-15258] IpmiAmcManager preInit
[2024-02-05 07:14:11.736] [E] [15258-15258] Unable to open /dev/ipmi0. errno: 2(No such file or directory)

[2024-02-05 07:14:11.736] [I] [15258-15258] IpmiAmcManager can not find AMC device
[2024-02-05 07:14:11.737] [I] [15258-15258] SMCRedfishAmcManager preInit
[2024-02-05 07:14:11.739] [I] [15258-15258] fail to parse redfish host interface
[2024-02-05 07:14:11.739] [I] [15258-15258] initialize monitor manager
[2024-02-05 07:14:11.739] [I] [15258-15258] xpumd core initialization completed
[2024-02-05 07:14:11.739] [I] [15258-15258] xpumd is providing services
[2024-02-05 07:14:11.739] [I] [15258-15258] XPUM: start XPUM RPC Server.
[2024-02-05 07:14:11.739] [I] [15258-15258] XPUM: start RPC server ...
[2024-02-05 07:14:11.741] [I] [15258-15258] XPUM: RPC server is listening at /tmp/xpum_p.sock

By the way, if I run xpumd with sudo, it crashes:

sudo xpumd
[2024-02-05 07:14:07.813] [I] [15219-15219] XPUM: Init xpum library
[2024-02-05 07:14:07.813] [I] [15219-15219] XPU Manager:        1.2.28.20240118
[2024-02-05 07:14:07.813] [I] [15219-15219] Build:              89af66d7
[2024-02-05 07:14:07.813] [I] [15219-15219] Level Zero: 1.15.0
[2024-02-05 07:14:07.813] [I] [15219-15219] xpumd core starts to initialize
[2024-02-05 07:14:07.813] [I] [15219-15219] initialize configuration
[2024-02-05 07:14:07.813] [I] [15219-15219] xpum mode: xpum
[2024-02-05 07:14:07.813] [I] [15219-15219] initialize datalogic
[2024-02-05 07:14:07.813] [I] [15219-15219] initialize device manager
[2024-02-05 07:14:07.893] [W] [15219-15234] Device Intel(R) Arc(TM) A770 Graphics0000:03:00.0 has no Memory Temperature capability.
[2024-02-05 07:14:07.893] [W] [15219-15234] Capability Memory Temperature detection returned: 
Floating point exception (core dumped)

Anyway, after I run xpumd, xpumcli does give me some useful output:

sudo xpumcli config -d 0 -t 0
+-------------+-------------------+----------------------------------------------------------------+
| Device Type | Device ID/Tile ID | Configuration                                                  |
+-------------+-------------------+----------------------------------------------------------------+
| GPU         | 0                 | Power Limit (w): 190                                           |
|             |                   |   Valid Range: 1 to 0                                          |
|             |                   |                                                                |
|             |                   | Memory ECC:                                                    |
|             |                   |   Current: N/A                                                 |
|             |                   |   Pending: N/A                                                 |
+-------------+-------------------+----------------------------------------------------------------+
| GPU         | 0/0               | GPU Min Frequency (MHz): 300                                   |
|             |                   | GPU Max Frequency (MHz): 2400                                  |
|             |                   |   Valid Options: 300, 350, 400, 450, 500, 550, 600, 650, 700,  |
|             |                   |     750, 800, 850, 900, 950, 1000, 1050, 1100, 1150, 1200,     |
|             |                   |     1250, 1300, 1350, 1400, 1450, 1500, 1550, 1600, 1650,      |
|             |                   |     1700, 1750, 1800, 1850, 1900, 1950, 2000, 2050, 2100,      |
|             |                   |     2150, 2200, 2250, 2300, 2350, 2400                         |
|             |                   |                                                                |
|             |                   | Standby Mode: default                                          |
|             |                   |   Valid Options: default, never                                |
|             |                   |                                                                |
|             |                   | Scheduler Mode: timeslice                                      |
|             |                   |   Timeout (us): N/A                                            |
|             |                   |   Interval (us): 5000                                          |
|             |                   |   Yield Timeout (us): 640000                                   |
|             |                   |                                                                |
|             |                   | Engine Type: compute                                           |
|             |                   |   Performance Factor: N/A                                      |
|             |                   | Engine Type: media                                             |
|             |                   |   Performance Factor: 50                                       |
|             |                   |                                                                |
|             |                   | Xe Link ports:                                                 |
|             |                   |   Up: N/A                                                      |
|             |                   |   Down: N/A                                                    |
|             |                   |   Beaconing On: N/A                                            |
|             |                   |   Beaconing Off: N/A                                           |
+-------------+-------------------+----------------------------------------------------------------+

But if I specify a frequency range, xpumcli reports an error without any hint:

sudo xpumcli config -d 0 -t 0 --frequencyrange 2400,2400
Error: Error

Write timestamps with data during dumping

I'm using xpu-smi to track an application which runs for multiple hours. I'd like to get xpu-smi to write a date along with the time for each interval so I don't have to detect the hours switching from 23->0 at midnight.
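
As a stopgap, the date can be reconstructed from the time-only stamps when the start date is known. This is a sketch assuming the HH:MM:SS.mmm format seen in xpu-smi dump output and sampling intervals shorter than 24 hours. `add_dates` is a hypothetical helper.

```python
from datetime import date, datetime, time, timedelta

def add_dates(time_strings, start_date):
    """Attach a date to time-only stamps, advancing the day when the clock wraps."""
    current = start_date
    previous = None
    stamped = []
    for text in time_strings:
        t = time.fromisoformat(text)
        if previous is not None and t < previous:
            current += timedelta(days=1)  # clock went backwards: midnight passed
        previous = t
        stamped.append(datetime.combine(current, t))
    return stamped

for stamp in add_dates(["23:59:59.000", "00:00:00.000"], date(2024, 1, 1)):
    print(stamp.isoformat())
```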

Build error

Hi,

I want to build this tool, but I hit the issue shown below.
I am using CMake 3.25 and running "sh build.sh" in the repo root.
image

Could someone help take a look?
The commit in use is: 04ebef9

Thank you.

xpu-smi doesn't show anything about gpu utilization on a380

Here's my output. Any idea what went wrong? I get similar output on the UHD card.

$ xpu-smi stats -d 0
+-----------------------------+--------------------------------------------------------------------+
| Device ID                   | 0                                                                  |
+-----------------------------+--------------------------------------------------------------------+
| GPU Utilization (%)         |                                                                    |
| EU Array Active (%)         |                                                                    |
| EU Array Stall (%)          |                                                                    |
| EU Array Idle (%)           |                                                                    |
|                             |                                                                    |
| Compute Engine Util (%)     |                                                                    |
| Render Engine Util (%)      |                                                                    |
| Media Engine Util (%)       |                                                                    |
| Decoder Engine Util (%)     |                                                                    |
| Encoder Engine Util (%)     |                                                                    |
| Copy Engine Util (%)        |                                                                    |
| Media EM Engine Util (%)    |                                                                    |
| 3D Engine Util (%)          |                                                                    |
+-----------------------------+--------------------------------------------------------------------+
| Reset                       |                                                                    |
| Programming Errors          |                                                                    |
| Driver Errors               |                                                                    |
| Cache Errors Correctable    |                                                                    |
| Cache Errors Uncorrectable  |                                                                    |
| Mem Errors Correctable      |                                                                    |
| Mem Errors Uncorrectable    |                                                                    |
+-----------------------------+--------------------------------------------------------------------+
| GPU Power (W)               | 43                                                                 |
| GPU Frequency (MHz)         | 2400                                                               |
| GPU Core Temperature (C)    |                                                                    |
| GPU Memory Temperature (C)  |                                                                    |
| GPU Memory Read (kB/s)      |                                                                    |
| GPU Memory Write (kB/s)     |                                                                    |
| GPU Memory Bandwidth (%)    |                                                                    |
| GPU Memory Used (MiB)       | 2112                                                               |
| GPU Memory Util (%)         | 35                                                                 |
| Xe Link Throughput (kB/s)   |                                                                    |

xpu-smi returns no or unexpected info

  1. Looks like temperature is missing and temperature/freq columns are reversed
(base) ats@localhost:~> xpu-smi dump -d 0 -m 0,1,2,3
Timestamp, DeviceId, GPU Utilization (%), GPU Power (W), GPU Frequency (MHz), GPU Core Temperature (Celsius Degree)
18:38:50.000,    0,     , 43.72, 2050,
18:38:51.000,    0,     , 43.74, 2050,
18:38:52.000,    0,     , 43.72, 2050,
18:38:53.000,    0,     , 43.72, 2050,
18:38:54.000,    0,     , 43.72, 2050,
18:38:55.000,    0,     , 43.70, 2050,

  2. xpu-smi returns PCIe Generation and Max Link width as -1

System is openSUSE 15.5 Leap.

(base) ats@localhost:~> xpu-smi discovery -d 0
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Type: GPU                                                                     |
|           | Device Name: Intel(R) Data Center GPU Flex 170                                       |
|           | PCI Device ID: 0x56c0                                                                |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | UUID: 00000000-0000-0000-b58c-cba33270b5fc                                           |
|           | Core Clock Rate: 2050 MHz                                                            |
|           | Stepping: C0                                                                         |
|           | SKU Type:                                                                            |
|           |                                                                                      |
|           | Driver Version: I915_23.6.37_PSB_230425.49                                           |
|           | Kernel Version: 5.14.21-150500.55.36-default                                         |
|           | GFX Firmware Name: GFX                                                               |
|           | GFX Firmware Version: unknown                                                        |
|           | GFX Firmware Status: normal                                                          |
|           |                                                                                      |
|           | PCI BDF Address: 0000:3a:00.0                                                        |
|           | PCI Slot:                                                                            |
|           | PCIe Generation: -1                                                                  |
|           | PCIe Max Link Width: -1                                                              |
|           |                                                                                      |
|           | Memory Physical Size: 16288.00 MiB                                                   |
|           | Max Mem Alloc Size: 4095.99 MiB                                                      |
|           | ECC State:                                                                           |
|           | Number of Memory Channels: 2                                                         |
|           | Memory Bus Width: 128                                                                |
|           | Max Hardware Contexts: 65536                                                         |
|           | Max Command Queue Priority: 0                                                        |
|           |                                                                                      |
|           | Number of EUs: 512                                                                   |
|           | Number of Tiles: 1                                                                   |
|           | Number of Slices: 1                                                                  |
|           | Number of Sub Slices per Slice: 32                                                   |
|           | Number of Threads per EU: 8                                                          |
|           | Physical EU SIMD Width: 8                                                            |
|           | Number of Media Engines: 0                                                           |
|           | Number of Media Enhancement Engines: 0                                               |
|           |                                                                                      |
|           | Xe Link Calibration Date: Not Calibrated                                             |
+-----------+--------------------------------------------------------------------------------------+
(base) ats@localhost:~>

System:

CPU: Intel(R) Xeon(R) Platinum 8480+
(base) ats@localhost:~> cat /etc/os-release
NAME="openSUSE Leap"
VERSION="15.5"
ID="opensuse-leap"
ID_LIKE="suse opensuse"
VERSION_ID="15.5"
PRETTY_NAME="openSUSE Leap 15.5"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:leap:15.5"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"
DOCUMENTATION_URL="https://en.opensuse.org/Portal:Leap"
LOGO="distributor-logo-Leap"
(base) ats@localhost:~>

zero and empty values in xpu monitor

The frequency and temperature info in xpu monitor doesn't seem right. Thanks for your updates.

Device Name: Intel(R) Graphics [0x56a0]

Kernel version: 5.17.0-1020-oem

PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy

xpu-smi stats -d 0

| GPU Frequency (MHz)         | 0                                                                  |
| GPU Memory Temperature (C)  |                                                                    |
