
nvidia_gpu_exporter's Introduction

nvidia_gpu_exporter


Nvidia GPU exporter for Prometheus, using the nvidia-smi binary to gather metrics.


⚠️ Maintenance status: I understand that it can be frustrating not to hear back about the issues you've raised or the changes you've suggested. Honestly, for over a year now I've had very little time to keep up with my personal open-source projects, including this one. I am still committed to keeping this tool working and slowly moving it forward, but please bear with me if I can't tackle your fixes or review your code for a while. Thanks for your understanding.


Introduction

There are many Nvidia GPU exporters out there; however, they have problems such as being unmaintained, not providing pre-built binaries, depending on Linux and/or Docker, targeting enterprise setups (DCGM), and so on.

This is a simple exporter that uses nvidia-smi(.exe) binary to collect, parse and export metrics. This makes it possible to run it on Windows and get GPU metrics while gaming - no Docker or Linux required.

This project is based on a0s/nvidia-smi-exporter. However, this one is written in Go to produce a single, static binary.

If you are a gamer who's into monitoring, you are in for a treat.

Highlights

  • Works on any system that has the nvidia-smi(.exe) binary - Windows, Linux, macOS... No C bindings required
  • Doesn't even need to run on the monitored machine: can be configured to execute the nvidia-smi command remotely
  • No need for a Docker or Kubernetes environment
  • Auto-discovery of the metric fields nvidia-smi can expose (future-compatible)
  • Comes with its own Grafana dashboard

Visualization

You can use the official Grafana dashboard to see your GPU metrics in a nicely visualized way.

Here's how it looks: [Grafana dashboard screenshot]

Installation

See INSTALL.md for details.

Configuration

See CONFIGURE.md for details.

Metrics

See METRICS.md for details.

Contributing

See CONTRIBUTING.md for details.

nvidia_gpu_exporter's People

Contributors

echoblag, renovate-bot, renovate[bot], retrodaredevil, utkuozdemir


nvidia_gpu_exporter's Issues

Metric per process/pod

Is it possible to see memory utilization per process instead of just the total memory usage on a specific GPU?

If not, this could be quite useful. Given that this information is already available through nvidia-smi, I imagine it should be doable.

relabel fail

Problem Description:
Before relabeling, the instance label I got looked like this: 192.168.1.1:9835, which includes the :9835 port. Whenever an alarm is triggered it carries the :9835 port information, which is not very pretty, so I tried to use relabeling. After relabeling, I found that the original instance labels are not overwritten, so the pre-relabel and post-relabel data coexist! This looks like a serious bug. I don't know if something was missed during the development of nvidia_gpu_exporter, so I am submitting an issue. I tried to do the same with the officially maintained node_exporter and found that after relabeling, the data before relabeling is replaced; the two do not coexist. Below is my relabel configuration:

relabel_configs:
   - action: replace
     source_labels: [__address__]
     regex: (.*):(.*)
     replacement: $1
     target_label: instance
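
For reference, a minimal sketch of how this rule sits inside a full Prometheus scrape job (the job name and target address below are placeholders, not my real config):

scrape_configs:
  - job_name: nvidia_gpu_exporter          # placeholder job name
    static_configs:
      - targets: ["192.168.1.1:9835"]      # placeholder target
    relabel_configs:
      - action: replace
        source_labels: [__address__]
        regex: (.*):(.*)
        replacement: $1
        target_label: instance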

Access remote metrics

Running this exporter with docker-compose on a machine that does not have any Nvidia GPU. The command override works when I execute it from my local host itself: I receive the nvidia-smi output from the remote host.

From within the container, however, the remote ssh command fails because ssh is not found.
Maybe have a look here; you will probably need to add openssh-client and provide the host's SSH key so the container can connect to the remote host.

docker-compose.yml

nvidia-truenas-exporter:
  image: utkuozdemir/nvidia_gpu_exporter:1.1.0
  container_name: nvidia-truenas-exporter
  hostname: nvidia-truenas-exporter
  environment:
    - web.listen-address=":9835"
    - web.telemetry-path="/metrics"
    - nvidia-smi-command="ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null [email protected] -p yyy nvidia-smi"
    - query-field-names="AUTO"
    - log.level=info
  ports:
    - 9835:9835

Container logs:
ts=2023-03-19T10:19:49.184Z caller=exporter.go:130 level=warn msg="Failed to auto-determine query field names, falling back to the built-in list" error="error running command: exec: "nvidia-smi": executable file not found in $PATH: command failed. code: -1 | command: nvidia-smi --help-query-gpu | stdout: | stderr: "
ts=2023-03-19T10:19:49.186Z caller=main.go:84 level=info msg="Listening on address" address=:9835
ts=2023-03-19T10:19:49.186Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
ts=2023-03-19T10:12:04.680Z caller=exporter.go:184 level=error error="error running command: exec: "nvidia-smi": executable file not found in $PATH: command failed. code: -1 | command: nvidia-smi --query-gpu=driver_version,temperature.gpu,clocks.max.sm,fan.speed,memory.total,ecc.errors.uncorrected.volatile.device_memory,enforced.power.limit,persistence_mode,ecc.errors.corrected.aggregate.dram,power.default_limit,pci.domain,inforom.ecc,power.management,ecc.errors.corrected.volatile.register_file,ecc.errors.uncorrected.aggregate.l1_cache,serial,gom.pending,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.texture_memory,clocks.current.sm,clocks_throttle_reasons.active,ecc.errors.corrected.volatile.sram,power.draw,clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.sync_boost,ecc.mode.pending,ecc.errors.corrected.volatile.l1_cache,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.texture_memory,memory.used,ecc.errors.corrected.volatile.l2_cache,ecc.errors.corrected.volatile.cbu,temperature.memory,pci.device_id,inforom.pwr,pcie.link.width.max,utilization.memory,encoder.stats.averageLatency,ecc.errors.uncorrected.volatile.l1_cache,clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.sw_thermal_slowdown,ecc.errors.corrected.volatile.dram,ecc.errors.uncorrected.aggregate.total,clocks.max.memory,uuid,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.volatile.sram,retired_pages.single_bit_ecc.count,pci.bus,vbios_version,ecc.errors.corrected.aggregate.device_memory,ecc.errors.uncorrected.volatile.register_file,retired_pages.double_bit.count,mig.mode.current,utilization.gpu,clocks.current.graphics,clocks.applications.graphics,pci.device,accounting.buffer_size,ecc.mode.current,power.limit,count,pcie.link.gen.current,ecc.errors.corrected.volatile.texture_memory,ecc.errors.uncorrected.volatile.l2_cache,ecc.errors.uncorrected.aggregate.device_memory,mig.mode.pending,clocks_throttle_reasons.hw_thermal_slowdown,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.dram,memory.free,encoder.stats.sessionCount,encoder.stats.averageFps,ecc.errors.corrected.aggregate.l1_cache,ecc.errors.corrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.cbu,compute_mode,driver_model.pending,ecc.errors.corrected.volatile.device_memory,ecc.errors.uncorrected.volatile.cbu,power.max_limit,clocks.default_applications.graphics,clocks.default_applications.memory,pci.bus_id,name,pcie.link.gen.max,display_mode,clocks_throttle_reasons.hw_slowdown,power.min_limit,clocks_throttle_reasons.applications_clocks_setting,clocks_throttle_reasons.sw_power_cap,retired_pages.pending,clocks.max.graphics,driver_model.current,gom.current,ecc.errors.corrected.aggregate.sram,ecc.errors.uncorrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.register_file,clocks.applications.memory,pcie.link.width.current,inforom.img,inforom.oem,clocks_throttle_reasons.supported,clocks.current.memory,clocks.current.video,index,display_active,ecc.errors.uncorrected.aggregate.register_file,ecc.errors.uncorrected.aggregate.cbu,ecc.errors.uncorrected.aggregate.sram,timestamp,accounting.mode,pci.sub_device_id,pstate --format=csv | stdout: | stderr: "
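
A possible workaround sketch: the logs above show the exporter still running the plain nvidia-smi command, so the environment variables don't appear to be picked up. Passing the settings as command-line flags instead, and using an image with openssh-client added plus a mounted key, might look roughly like this (the host, key path, and volume are placeholders):

nvidia-truenas-exporter:
  image: utkuozdemir/nvidia_gpu_exporter:1.1.0   # assumes openssh-client has been added to this image
  command:
    - --web.listen-address=:9835
    - --nvidia-smi-command=ssh -i /keys/id_ed25519 -o StrictHostKeyChecking=no user@remote-host nvidia-smi
    - --query-field-names=AUTO
  volumes:
    - ./keys:/keys:ro    # hypothetical mount for the SSH key
  ports:
    - 9835:9835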

Run nvidia-gpu-exporter in k8s an error

1. Using the nvidia-gpu-exporter 0.3.0 image.

This is the error. Running nvidia-smi inside the nvidia-gpu-exporter container outputs:
/tmp/cuda-control/src/register.c: 66 can't register to manager, error No such file or directory
/tmp/cuda-control/src/register.c: 87 rpc client exit with 255

Number of process per GPU

Hello, thanks for the project, it is very clear!

I was wondering if it would be possible to add a metric that counts the number of processes per GPU?
This feature is available in nvidia-smi and it would be a great help.

Thanks

Question on displaying node name for a multinode setup

Hey, sorry for bombarding you with issues :)

Could you give advice on how to add the node name to the dashboard for multinode setups?
Currently a GPU is identified by its UUID. We have a lot of nodes in our cluster, and knowing the UUIDs of all nodes' GPUs is a bit impractical, so I would like to extend the Grafana dashboard and, ideally, have something like this:
[screenshot]

that would switch to the correct node name when you switch GPU UUIDs in the top-left dropdown.

It seems it could be possible to derive the node name by combining Prometheus queries: grabbing the IP from instance in nvidia_smi_gpu_info and somehow filtering the results of kube_pod_info{pod=~"ozdemir-nvidia-gpu-expo.*"} on host_ip, from which the node name is available.
[screenshot]

But I'm not sure if that's possible.
I will continue to dig into Prometheus queries; I would appreciate any advice on that!
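
Something like the following might work - a hedged sketch written as a Prometheus recording rule, assuming kube_pod_info carries the usual host_ip and node labels from kube-state-metrics (the group and recorded metric names are made up):

groups:
  - name: gpu-node-names                       # hypothetical rule group
    rules:
      - record: nvidia_smi_gpu_info:with_node  # hypothetical recorded metric name
        expr: |
          # copy the IP part of "instance" (e.g. 192.168.1.1:9835) into a host_ip label,
          # then join against kube_pod_info of the exporter pods to pull in the node label
          label_replace(nvidia_smi_gpu_info, "host_ip", "$1", "instance", "(.*):.*")
            * on (host_ip) group_left (node)
          kube_pod_info{pod=~"ozdemir-nvidia-gpu-expo.*"}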

got errors "couldn't parse number from: [n/a]"

Describe the bug
executed command: # ./nvidia_gpu_exporter --web.listen-address :20127 --nvidia-smi-command="nvidia-smi" --log.level=debug
Refreshing the nvidia-gpu-metrics dashboard in Grafana makes the command console throw errors, and the dashboard shows nothing.

To Reproduce
Steps to reproduce the behavior:

  1. Run command './nvidia_gpu_exporter --web.listen-address :20127 --nvidia-smi-command="nvidia-smi" --log.level=debug'
  2. See error
    [screenshots of the errors]

Expected behavior
dashboard shows metrics data normally

Model and Version
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57 Driver Version: 450.57 CUDA Version: 11.0 |

  • GPU Model [e.g. GeForce RTX 2080 TI]
  • App version and architecture [' linux_x86_64']
  • Operating System [e.g. Ubuntu 18.04]
  • Nvidia GPU driver version [e.g. Linux driver nvidia-driver-450]

root@4d15723e44d8:/home# ./nvidia_gpu_exporter --version
nvidia_gpu_exporter, version 0.4.0 (branch: HEAD, revision: 76d7496)
build user: goreleaser
build date: 2022-02-08T00:42:44Z
go version: go1.17.5
platform: linux/amd64

could you give a suggestion? Thx!

Can't visualize metrics on Grafana Dashboard

Hey, I'm currently facing a problem where I can't visualize the metrics on my Grafana dashboard. I'll describe the steps I followed:

  1. I installed the .deb package according to what is described in INSTALL.md. My laptop has Ubuntu 20.04 and my GPU is a GeForce 3060.
  2. Following CONFIGURE.md, I set the nvidia-smi command as
nvidia_gpu_exporter --nvidia-smi-command 'nvidia-smi'
  3. I imported the ready-to-use dashboard into Grafana, but no metrics appeared.

After step (3), I looked through this GitHub repo for similar issues and found this one: #7 . However, one of the suggestions is to verify the metrics in Prometheus at http://localhost:9090, but that page gives me a 404 Page Not Found error. So I believe there's a Prometheus setup step that I'm missing, and I think that's my problem.

How can I set up Prometheus properly to be able to visualize the metrics on the dashboard?
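
For reference, a minimal prometheus.yml sketch that scrapes the exporter on its default port looks roughly like this (the job name is arbitrary, and it assumes Prometheus and the exporter run on the same machine):

scrape_configs:
  - job_name: nvidia_gpu_exporter
    static_configs:
      - targets: ["localhost:9835"]   # the exporter's default listen address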

scoop install is not updated

Describe the bug
https://github.com/utkuozdemir/nvidia_gpu_exporter/blob/master/INSTALL.md
To Reproduce
Steps to reproduce the behavior:

  1. Run command 'Invoke-Expression (New-Object System.Net.WebClient).DownloadString('https://get.scoop.sh')'
  2. See error Running the installer as administrator is disabled by default, see https://github.com/ScoopInstaller/Install#for-admin for details.

Expected behavior
According to https://github.com/ScoopInstaller/Install#for-admin,
the command should be:

Set-ExecutionPolicy RemoteSigned -Scope CurrentUser
iex "& {$(irm get.scoop.sh)} -RunAsAdmin"

Exporter not able to recover after first scrape failure

Describe the bug
We are using this exporter for datacenter monitoring and metrics analysis.
Sometimes the exporter fails to gather information and cannot recover from this.

To Reproduce
Steps to reproduce the behavior:
Hard to say! It happens from time to time. The service runs fine, then the error appears and it is not able to recover.

Expected behavior
The service should stop or enter a failure state when this happens

Console output
curl: nvidia_smi_failed_scrapes_total 4718
Systemctl: Mar 10 18:24:48 HOSTNAME prometheus-nvidia-exporter-2[874]: level=error ts=2022-03-10T17:24:48.652Z caller=exporter.go:148 error="command failed. stderr: err: exit status 2"

Model and Version

  • GPU Model K3100M + GRIDs
  • App version and architecture v0.4.0 - linux_x86_64
  • Installation method: binary download
  • Operating System: Centos7 + Rocky8
  • Nvidia GPU driver version
    --- Quadro K3100M: Linux 431
    --- GRID: Linux 418
    --- etc....

Updated to 0.4.0, Problems still occur.


Updated to 0.5.0
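
As a stopgap, since the exporter exposes nvidia_smi_failed_scrapes_total (see the curl output above), a Prometheus alerting rule can at least flag the stuck state. A rough sketch (group, alert name and threshold are made up):

groups:
  - name: nvidia-gpu-exporter-alerts        # hypothetical rule group
    rules:
      - alert: NvidiaSmiScrapesFailing      # hypothetical alert name
        expr: increase(nvidia_smi_failed_scrapes_total[10m]) > 0
        for: 10m
        annotations:
          summary: "nvidia_gpu_exporter keeps failing to run nvidia-smi"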

Action Required: Fix Renovate Configuration

There is an error with this repository's Renovate configuration that needs to be fixed. As a precaution, Renovate will stop PRs until it is resolved.

Location: .renovaterc.json
Error type: The renovate configuration file contains some invalid settings
Message: Regex Managers must contain depNameTemplate configuration or regex group named depName

Change from 'throttle' to 'event' in output from nvidia-smi v535.113.01

Describe the bug

In version v535.113.01 of nvidia-smi, the fields containing throttle seem to have been renamed to use event instead. I ran the example command from here and it "silently" renames the fields accordingly. It's worth noting that nvidia_gpu_exporter seems to handle this gracefully, but the dashboard does not. I modified the dashboard for my own use but figured I would report the change here as well.

To Reproduce
Steps to reproduce the behavior:

  1. Run command:
nvidia-smi --query-gpu="clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.sw_thermal_slowdown,clocks_throttle_reasons.sync_boost" --format=csv
  2. See that the CSV header names have been renamed to 'event'.

Expected behavior
None.

Console output
None.

Model and Version

  • GPU Model: NVIDIA GeForce RTX 4080
  • App version and architecture: v1.2.0 - linux_x86_64
  • Installation method: binary download through AUR
  • Operating System: ArchLinux 6.5.5-arch1-1
  • Nvidia GPU driver version: NVIDIA Linux driver v535.113.01

Additional context
None.

Error starting HTTP server

Hi. I get this error as I'm trying to connect to Grafana

ts=2022-09-15T10:57:07.347Z caller=main.go:84 level=info msg="Listening on address" address=:9835
ts=2022-09-15T10:57:07.348Z caller=main.go:99 level=error msg="Error starting HTTP server" err="listen tcp :9835: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted."

What can I do?

Can you also show an example of the --web.config.file ?

Thanks
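
On the --web.config.file question: the exporter uses the Prometheus exporter-toolkit (visible as tls_config.go in the logs), so the file follows the toolkit's web configuration format. A minimal sketch with placeholder paths and a placeholder bcrypt hash:

tls_server_config:
  cert_file: /path/to/server.crt
  key_file: /path/to/server.key
basic_auth_users:
  # username mapped to a bcrypt hash of the password (placeholder value below)
  admin: $2y$10$REPLACE_WITH_A_REAL_BCRYPT_HASH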

Great Work

Wanted to tell you that this is a really cool project!

Awesome work, you can close this whenever you want!

Have a great day!

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

Detected dependencies

dockerfile
Dockerfile
  • ubuntu 22.04
github-actions
.github/workflows/build.yml
  • actions/checkout v4.1.5
  • actions/setup-go v5.0.1
  • golangci/golangci-lint-action v6.0.1
  • codecov/codecov-action v4.3.1
  • goreleaser/goreleaser-action v5.1.0
  • ubuntu 22.04
.github/workflows/release.yml
  • actions/checkout v4.1.5
  • actions/setup-go v5.0.1
  • docker/login-action v3.1.0
  • goreleaser/goreleaser-action v5.1.0
  • ubuntu 22.04
gomod
go.mod
  • go 1.22
  • github.com/alecthomas/kingpin/v2 v2.4.0
  • github.com/coreos/go-systemd/v22 v22.5.0
  • github.com/go-kit/log v0.2.1
  • github.com/prometheus/client_golang v1.19.1
  • github.com/prometheus/common v0.53.0
  • github.com/prometheus/exporter-toolkit v0.11.0
  • github.com/stretchr/testify v1.9.0
  • golang.org/x/exp v0.0.0-20240506185415-9bf2ced13842@9bf2ced13842
regex
.github/workflows/build.yml
  • golangci/golangci-lint v1.58.1
  • kyoh86/richgo v0.3.12
  • goreleaser/goreleaser v1.25.1
.github/workflows/release.yml
  • goreleaser/goreleaser v1.25.1

  • Check this box to trigger a request for Renovate to run again on this repository

Exporter not scraping metrics

Describe the bug
Exporter is not able to gather information.

Console Output

./nvidia_gpu_exporter --query-field-names="AUTO" --log.level=debug

level=info ts=2022-03-23T00:23:30.344Z caller=main.go:65 msg="Listening on address" address=:9835
level=info ts=2022-03-23T00:23:30.346Z caller=tls_config.go:191 msg="TLS is disabled." http2=false
level=debug ts=2022-03-23T00:23:50.520Z caller=exporter.go:171 error="couldn't parse number from: 2022/03/22 20:23:46.283" query_field_name=timestamp raw_value="2022/03/22 20:23:46.283"
level=debug ts=2022-03-23T00:23:50.520Z caller=exporter.go:171 error="couldn't parse number from: 510.47.03" query_field_name=driver_version raw_value=510.47.03
level=debug ts=2022-03-23T00:23:50.520Z caller=exporter.go:171 error="couldn't parse number from: nvidia a100-pcie-40gb" query_field_name=name raw_value="NVIDIA A100-PCIE-40GB"
level=debug ts=2022-03-23T00:23:50.520Z caller=exporter.go:171 error="couldn't parse number from: gpu-7bdeeff7-f7c6-e13c-f368-227523e670a7" query_field_name=uuid raw_value=GPU-7bdeeff7-f7c6-e13c-f368-227523e670a7
level=debug ts=2022-03-23T00:23:50.520Z caller=exporter.go:171 error="couldn't parse number from: 00000000:17:00.0" query_field_name=pci.bus_id raw_value=00000000:17:00.0

$ nvidia-smi --query-gpu="timestamp,driver_version" --format=csv
timestamp, driver_version
2022/03/22 20:26:27.040, 510.47.03
2022/03/22 20:26:27.040, 510.47.03
2022/03/22 20:26:27.040, 510.47.03
2022/03/22 20:26:27.040, 510.47.03

$ nvidia-smi
Tue Mar 22 20:28:58 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+

Monitor can't turn off after windows going to saving mode

When the monitor goes to sleep, it wakes up immediately. So basically the exporter breaks monitor sleep.

To Reproduce
Steps to reproduce the behavior:

  1. When the exporter is running as a service, the monitor can't fall asleep
  2. When I stop the service, the monitor can turn off when going to sleep

Expected behavior
The monitor should turn off after the delay set in the Windows power mode settings

Console output

Model and Version

  • GPU Model: Gigabyte RTX 3060 Gaming OC
  • App version and architecture: v0.3.0 [x86_64.zip]
  • Installation method: binary download, runs as a service with nssm
  • Operating System: Windows 10 LTSC
  • Nvidia GPU driver version: Windows Studio Driver 472.84

Additional context
In Grafana I update information from the exporter every 5 seconds, and the monitor turns back on every 5 seconds after it goes to sleep.

Failed to initialize NVML: Unknown Error

Describe the bug
I'm running the current version of your Docker image, and it works most of the time - but sometimes it starts to fail and I need to restart the container.
It sometimes runs for a whole day, and sometimes only a couple of minutes.

To Reproduce
Steps to reproduce the behavior:

  1. Systemd Unit ExecStart:
/usr/bin/docker run --name prometheus-nvidia-gpu-exporter \
  --gpus all \
  -p 9835:9835 \
  -v /dev/nvidiactl:/dev/nvidiactl \
  -v /dev/nvidia0:/dev/nvidia0 \
  -v /usr/lib/x86_64-linux-gnu/libnvidia-ml.so:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so \
  -v /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 \
  -v /usr/bin/nvidia-smi:/usr/bin/nvidia-smi \
  utkuozdemir/nvidia_gpu_exporter:1.2.0

Expected behavior
I'd expect the exporter to not start throwing errors ;-)

Console output
(Disregard the mismatched timestamps; I copy-pasted the error first, and then also added the initial log from starting the container.)

May 24 19:01:22 hades systemd[1]: Stopped Prometheus Nvidia GPU Exporter.
May 24 19:01:22 hades systemd[1]: Starting Prometheus Nvidia GPU Exporter...
May 24 19:01:22 hades docker[1915038]: prometheus-nvidia-gpu-exporter
May 24 19:01:23 hades docker[1915048]: 1.2.0: Pulling from utkuozdemir/nvidia_gpu_exporter
May 24 19:01:23 hades docker[1915048]: Digest: sha256:cc407f77ab017101ce233a0185875ebc75d2a0911381741b20ad91f695e488c7
May 24 19:01:23 hades docker[1915048]: Status: Image is up to date for utkuozdemir/nvidia_gpu_exporter:1.2.0
May 24 19:01:23 hades docker[1915048]: docker.io/utkuozdemir/nvidia_gpu_exporter:1.2.0
May 24 19:01:23 hades systemd[1]: Started Prometheus Nvidia GPU Exporter.
May 24 19:01:24 hades docker[1915066]: ts=2023-05-24T17:01:24.380Z caller=tls_config.go:232 level=info msg="Listening on" address=[::]:9835
May 24 19:01:24 hades docker[1915066]: ts=2023-05-24T17:01:24.380Z caller=tls_config.go:235 level=info msg="TLS is disabled." http2=false address=[::]:9835
[...]
May 24 19:00:45 hades docker[1903720]: ts=2023-05-24T17:00:45.428Z caller=exporter.go:184 level=error error="error running command: exit status 255: command failed. code: 255 | command: nvidia-smi --query-gpu=timestamp,driver_version,vgpu_driver_capability.heterogenous_multivGPU,count,name,serial,uuid,pci.bus_id,pci.domain,pci.bus,pci.device,pci.device_id,pci.sub_device_id,vgpu_device_capability.fractional_multiVgpu,vgpu_device_capability.heterogeneous_timeSlice_profile,vgpu_device_capability.heterogeneous_timeSlice_sizes,pcie.link.gen.current,pcie.link.gen.gpucurrent,pcie.link.gen.max,pcie.link.gen.gpumax,pcie.link.gen.hostmax,pcie.link.width.current,pcie.link.width.max,index,display_mode,display_active,persistence_mode,accounting.mode,accounting.buffer_size,driver_model.current,driver_model.pending,vbios_version,inforom.img,inforom.oem,inforom.ecc,inforom.pwr,gom.current,gom.pending,fan.speed,pstate,clocks_throttle_reasons.supported,clocks_throttle_reasons.active,clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.sw_thermal_slowdown,clocks_throttle_reasons.sync_boost,memory.total,memory.reserved,memory.used,memory.free,compute_mode,compute_cap,utilization.gpu,utilization.memory,encoder.stats.sessionCount,encoder.stats.averageFps,encoder.stats.averageLatency,ecc.mode.current,ecc.mode.pending,ecc.errors.corrected.volatile.device_memory,ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.register_file,ecc.errors.corrected.volatile.l1_cache,ecc.errors.corrected.volatile.l2_cache,ecc.errors.corrected.volatile.texture_memory,ecc.errors.corrected.volatile.cbu,ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total,ecc.errors.corrected.aggregate.device_memory,ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.register_file,ecc.errors.corrected.aggregate.l1_cache,ecc.errors.corrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.corrected.aggregate.cbu,ecc.errors.corrected.aggregate.sram,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.volatile.device_memory,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.register_file,ecc.errors.uncorrected.volatile.l1_cache,ecc.errors.uncorrected.volatile.l2_cache,ecc.errors.uncorrected.volatile.texture_memory,ecc.errors.uncorrected.volatile.cbu,ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.device_memory,ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.register_file,ecc.errors.uncorrected.aggregate.l1_cache,ecc.errors.uncorrected.aggregate.l2_cache,ecc.errors.uncorrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.cbu,ecc.errors.uncorrected.aggregate.sram,ecc.errors.uncorrected.aggregate.total,retired_pages.single_bit_ecc.count,retired_pages.double_bit.count,retired_pages.pending,temperature.gpu,temperature.memory,power.management,power.draw,power.draw.average,power.draw.instant,power.limit,enforced.power.limit,power.default_limit,power.min_limit,power.max_limit,clocks.current.graphics,clocks.current.sm,clocks.current.memory,clocks.current.video,clocks.applications.graphics,clocks.applications.memory,clocks.default_applications.graphics,clocks.default_applications.memory,clocks.max.graphics,clocks.max.sm,clocks.max.memory,mig.mode.current,mig.
mode.pending,fabric.state,fabric.status --format=csv | stdout: Failed to initialize NVML: Unknown Error\n | stderr: "

(The error from the title is at the end of this very long last line.)

Model and Version

  • GPU Model: RTX 4070 Ti
  • App version: 1.2.0 amd64
  • Installation method: Docker image
  • Operating System: Debian 11/bullseye
  • Nvidia GPU driver version:

Running on Docker with Nvidia Container Toolkit:

$ docker info
Client: Docker Engine - Community
 Version:    24.0.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.10.4
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx

Server:
 Containers: 84
  Running: 83
  Paused: 0
  Stopped: 1
 Images: 87
 Server Version: 24.0.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 3dce8eb055cbb6872793272b4f20ed16117344f8
 runc version: v1.1.7-0-g860f061
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.10.0-23-amd64
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 125.7GiB
 Docker Root Dir: /srv/docker
 Debug Mode: false
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: true
$ dpkg -l | grep nvidia
ii  libnvidia-container-tools             1.13.1-1                                                                   amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64            1.13.1-1                                                                   amd64        NVIDIA container runtime library
ii  nvidia-container-toolkit              1.13.1-1                                                                   amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base         1.13.1-1                                                                   amd64        NVIDIA Container Toolkit Base
$ nvidia-smi
Wed May 24 19:10:49 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:42:00.0 Off |                  N/A |
|  0%   56C    P2    34W / 285W |   5122MiB / 12282MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    937698      C   /usr/bin/zmc                      225MiB |
|    0   N/A  N/A   3332933      C   python3                          1838MiB |
|    0   N/A  N/A   3469008      C   python                           3056MiB |
+-----------------------------------------------------------------------------+

Helm chart

Implement a helm chart after figuring out containerization

Core/Mem Clock Speed (MHz) and Memory Junction Temp Displays?

E.g. maybe have an extra bit of info about the clock speeds in MHz and the memory junction temperature, if possible? I'd suggest placing the MHz next to the respective utilization percentages and the memory temperature under the core temperature.

Not sure how doable that is with nvidia-smi, but it would be a neat addition.

Wait, never mind, I'm dumb and didn't see the bottom, lol. Still, the memory temperature would be nice.

nvidia_gpu_exporter doesn't work for NVIDIA A10

Describe the bug
The exporter doesn't work in a lab with an NVIDIA A10. It cannot collect the GPU information normally.

Console output
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: 2023/07/18 07:23:20.045" query_field_name=timestamp raw_value="2023/07/18 07:23:20.045"
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: 535.54.03" query_field_name=driver_version raw_value=535.54.03
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: [n/a]" query_field_name=vgpu_driver_capability.heterogenous_multivGPU raw_value=[N/A]
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: gpu-5e10b7bc-91f1-640a-e927-963f7f82de44" query_field_name=uuid raw_value=GPU-5e10b7bc-91f1-640a-e927-963f7f82de44
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: 00000000:00:0c.0" query_field_name=pci.bus_id raw_value=00000000:00:0C.0
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: [n/a]" query_field_name=vgpu_device_capability.fractional_multiVgpu raw_value=[N/A]
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: [n/a]" query_field_name=vgpu_device_capability.heterogeneous_timeSlice_profile raw_value=[N/A]
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: [n/a]" query_field_name=vgpu_device_capability.heterogeneous_timeSlice_sizes raw_value=[N/A]
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: [n/a]" query_field_name=pcie.link.gen.hostmax raw_value=[N/A]
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: none" query_field_name=addressing_mode raw_value=None
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: [n/a]" query_field_name=driver_model.current raw_value=[N/A]
ts=202

Model and Version

  • GPU Model: NVIDIA A10
  • Operating System: Ubuntu Server 20.04
  • Nvidia GPU driver version: 535.54.03

Additional context
$ dpkg -l | grep nvidia
ii libnvidia-cfg1-525:amd64 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-525 525.125.06-0ubuntu0.20.04.3 all Shared files used by the NVIDIA libraries
ii libnvidia-compute-525:amd64 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA libcompute package
rc libnvidia-compute-535:amd64 535.54.03-0ubuntu0.20.04.4 amd64 NVIDIA libcompute package
ii libnvidia-decode-525:amd64 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-encode-525:amd64 525.125.06-0ubuntu0.20.04.3 amd64 NVENC Video Encoding runtime library
ii libnvidia-extra-525:amd64 525.125.06-0ubuntu0.20.04.3 amd64 Extra libraries for the NVIDIA driver
ii libnvidia-fbc1-525:amd64 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-525:amd64 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii nvidia-compute-utils-525 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA compute utilities
ii nvidia-dkms-525 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA DKMS package
ii nvidia-driver-525 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA driver metapackage
ii nvidia-driver-local-repo-ubuntu2004-515.105.01 1.0-1 amd64 nvidia-driver-local repository configuration files
ii nvidia-kernel-common-525 525.125.06-0ubuntu0.20.04.3 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-525 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA kernel source package
ii nvidia-prime 0.8.16~0.20.04.2 all Tools to enable NVIDIA's Prime
ii nvidia-settings 470.57.01-0ubuntu0.20.04.3 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-525 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA driver support binaries
ii screen-resolution-extra 0.18build1 all Extension for the nvidia-settings control panel
ii xserver-xorg-video-nvidia-525 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA binary Xorg driver

$ nvidia-smi
Tue Jul 18 08:37:46 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10 Off | 00000000:00:0C.0 Off | 0 |
| 0% 54C P0 63W / 150W | 8594MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 28269 C python 8582MiB |
+---------------------------------------------------------------------------------------+

[Discussion] Offering CPU and Memory Monitoring Support

Firstly, I'd like to thank you for providing this repository. It has been instrumental in helping us set up our cluster monitoring.

In the course of using your tool, I've added a file to support CPU and memory monitoring, specifically for Linux systems. While this addition outputs CPU monitoring in a manner akin to a plugin, I'm uncertain whether it aligns with the direction of your contributions.

I'd be happy to contribute this addition to the community. If you find this functionality valuable, I'm more than willing to refine my code further and submit a PR.

Thanks once again for your invaluable contributions and hard work!

Addition:
I also adjusted the dashboard to show cluster info and single-node info. 😎
[screenshot]

  • CPU&Memory metrics
# HELP basic_cpu_sy system process cost
# TYPE basic_cpu_sy gauge
basic_cpu_sy{uuid="123"} 1.3
# HELP basic_cpu_tot cpu cost percetage
# TYPE basic_cpu_tot gauge
basic_cpu_tot{uuid="123"} 1.6
# HELP basic_cpu_us user process cost
# TYPE basic_cpu_us gauge
basic_cpu_us{uuid="123"} 0.3
# HELP basic_info_command_exit_code Exit code of the last scrape command
# TYPE basic_info_command_exit_code gauge
basic_info_command_exit_code 0
# HELP basic_mem_free memory free
# TYPE basic_mem_free gauge
basic_mem_free{uuid="123"} 3.35515268e+08
# HELP basic_mem_tot memory total
# TYPE basic_mem_tot gauge
basic_mem_tot{uuid="123"} 5.93782332e+08
# HELP basic_mem_used memory used
# TYPE basic_mem_used gauge
basic_mem_used{uuid="123"} 2.5704732e+07

Running Exporter Causes Stuttering in All Games

I've been using this amazing exporter for a month or two now, but I noticed in almost all games (and even some videos) that there'd be some pretty constant stuttering. This would occur once every 30 seconds or so and would essentially look like someone pressed pause and then resume really quickly.

I troubleshot everything under the sun. I ran DDU, did a full Windows reinstall, and disabled/uninstalled any overlays I had running. It turns out the culprit was this exporter: disabling the exporter made the problem go away immediately.

I also run the Prometheus Windows Exporter (https://github.com/prometheus-community/windows_exporter) and it doesn't seem to cause the same issue.

Unfortunately I don't really have any other info to share with you about this, and maybe there's nothing that can be done, but I thought I'd mention it in case there is a possible solution.

My Main Specs:
AMD Ryzen 3900x
EVGA 3080 Ultra
1TB NVME
64GB DDR4 @3600

Thanks!

Driver version is not displayed

I have a minor issue of driver version not being displayed in the dashboard.

[screenshot]

It seems Prometheus can't get this info either:
[screenshots]

I run a multinode k8s cluster, where I have the nvidia-gpu-exporter Helm chart deployed.
Each node has Ubuntu 18.04 installed.
When I log in to an ozdemir-nvidia-gpu-exporter pod, I can fetch the driver version easily like this:

root@ozdemir-nvidia-gpu-exporter-d2j74:/# nvidia-smi --query-gpu=driver_version --format=csv
driver_version
460.91.03

NSSM missing metrics

Describe the bug
Missing metrics when nvidia_gpu_exporter is running as a service in Windows 10. Running nvidia_gpu_exporter manually, all metrics are exposed.

To Reproduce
Steps to reproduce the behavior:

  1. Follow Install docs
  2. Run the nssm service
  3. Connect to Prometheus, choose your target and 99% of the nvidia metrics are missing

Expected behavior
Have all metrics exposed running it as a service

Console output
I see in the metrics that scraping nvidia_smi is failing, just cannot determine why..

# HELP nvidia_gpu_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, goversion from which nvidia_gpu_exporter was built, and the goos and goarch for the build.
# TYPE nvidia_gpu_exporter_build_info gauge
nvidia_gpu_exporter_build_info{branch="HEAD",goarch="amd64",goos="windows",goversion="go1.20",revision="01f163635ca74aefcfb62cab4dc0d25cc26c0562",version="1.2.0"} 1
# HELP nvidia_smi_command_exit_code Exit code of the last scrape command
# TYPE nvidia_smi_command_exit_code gauge
nvidia_smi_command_exit_code -1
# HELP nvidia_smi_failed_scrapes_total Number of failed scrapes
# TYPE nvidia_smi_failed_scrapes_total counter
nvidia_smi_failed_scrapes_total 2

Model and Version

  • GPU Model [NVIDIA RTXA6000-48Q]
  • App version and architecture [amd64 ]
  • Installation method [scoop]
  • Operating System [Windows 10]
  • Nvidia GPU driver version [Production Driver 513.46]

Additional context
Running nvidia_gpu_exporter manually from powershell, all the metrics work fine. I am looking to see if anyone else has this issue or if I am doing something wrong here...

Grafana not showing any data

The error: "Templating [gpu] Error updating options: Browser access mode in the Prometheus datasource is no longer available. Switch to server access mode."

I set the execution policy to unrestricted so that the script would run in the first place. Then I ran the installer as a Windows service, and everything was installed without problems. Prometheus works (both 9090 and 9835) and Grafana works; it's just that I can't see any data in Grafana.

GPU is a 3090Ti

ps: If it's relevant, I also have prometheus windows exporter running on 9182

command failed. stderr: err: exit status 12

Describe the bug
Error command failed. stderr: err: exit status 12 when running in docker.

To Reproduce
docker-compose.yml

version: "3"
services:
  nvidia_smi_exporter:
    image: utkuozdemir/nvidia_gpu_exporter:0.4.0
    devices:
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia0:/dev/nvidia0
    volumes:
      - /usr/lib/libnvidia-ml.so:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so
      - /usr/lib/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
      - /usr/bin/nvidia-smi:/usr/bin/nvidia-smi
    ports:
      - 9835:9835

Console output
docker-compose service console output

ts=2022-03-05T12:14:57.407Z caller=exporter.go:108 level=warn msg="Failed to auto-determine query field names, falling back to the built-in list"
2022-03-05T12:14:57.408274606Z ts=2022-03-05T12:14:57.408Z caller=main.go:66 level=info msg="Listening on address" address=:9835
2022-03-05T12:14:57.408511827Z ts=2022-03-05T12:14:57.408Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
2022-03-05T12:15:01.058200295Z ts=2022-03-05T12:15:01.058Z caller=exporter.go:157 level=error error="command failed. stderr:  err: exit status 12"
2022-03-05T12:19:49.798104217Z ts=2022-03-05T12:19:49.797Z caller=exporter.go:157 level=error error="command failed. stderr:  err: exit status 12"
2022-03-05T12:20:14.187971066Z ts=2022-03-05T12:20:14.187Z caller=exporter.go:157 level=error error="command failed. stderr:  err: exit status 12"
2022-03-05T12:20:44.187095757Z ts=2022-03-05T12:20:44.186Z caller=exporter.go:157 level=error error="command failed. stderr:  err: exit status 12"
2022-03-05T12:21:14.187231908Z ts=2022-03-05T12:21:14.187Z caller=exporter.go:157 level=error error="command failed. stderr:  err: exit status 12"
2022-03-05T12:21:44.187147375Z ts=2022-03-05T12:21:44.187Z caller=exporter.go:157 level=error error="command failed. stderr:  err: exit status 12"
2022-03-05T12:22:14.186874585Z ts=2022-03-05T12:22:14.186Z caller=exporter.go:157 level=error error="command failed. stderr:  err: exit status 12"
2022-03-05T12:22:44.186995854Z ts=2022-03-05T12:22:44.186Z caller=exporter.go:157 level=error error="command failed. stderr:  err: exit status 12"
2022-03-05T12:23:14.188342901Z ts=2022-03-05T12:23:14.187Z caller=exporter.go:157 level=error error="command failed. stderr:  err: exit status 12"

Model and Version
OS: Fedora Linux 35
Qt version: 5.15.2
Kernel version: 5.16.11-200.fc35.x86_64
CPU: i7-11800H
GPU: NVIDIA GeForce RTX 3050 Ti Laptop GPU/PCIe/SSE2
NVIDIA Driver Version: 510.47.03
NVML Version: 11.510.47.03

$ ll /dev | grep nvidia
crw-rw-rw-.  1 root root    195,     0 Mar  5 16:14 nvidia0
crw-rw-rw-.  1 root root    195,   255 Mar  5 16:14 nvidiactl
crw-rw-rw-.  1 root root    195,   254 Mar  5 16:14 nvidia-modeset
crw-rw-rw-.  1 root root    505,     0 Mar  5 16:14 nvidia-uvm
crw-rw-rw-.  1 root root    505,     1 Mar  5 16:14 nvidia-uvm-tools
$ ll /usr/lib | grep nvidia
lrwxrwxrwx.  1 root root        26 Feb  1 20:33 libEGL_nvidia.so.0 -> libEGL_nvidia.so.510.47.03
-rwxr-xr-x.  1 root root   1224012 Jan 25 03:35 libEGL_nvidia.so.510.47.03
lrwxrwxrwx.  1 root root        32 Feb  1 20:33 libGLESv1_CM_nvidia.so.1 -> libGLESv1_CM_nvidia.so.510.47.03
-rwxr-xr-x.  1 root root     71120 Jan 25 03:34 libGLESv1_CM_nvidia.so.510.47.03
lrwxrwxrwx.  1 root root        29 Feb  1 20:33 libGLESv2_nvidia.so.2 -> libGLESv2_nvidia.so.510.47.03
-rwxr-xr-x.  1 root root    128464 Jan 25 03:34 libGLESv2_nvidia.so.510.47.03
lrwxrwxrwx.  1 root root        26 Feb  1 20:33 libGLX_nvidia.so.0 -> libGLX_nvidia.so.510.47.03
-rwxr-xr-x.  1 root root   1082980 Jan 25 03:34 libGLX_nvidia.so.510.47.03
lrwxrwxrwx.  1 root root        32 Feb  1 20:33 libnvidia-allocator.so.1 -> libnvidia-allocator.so.510.47.03
-rwxr-xr-x.  1 root root    121408 Jan 25 03:34 libnvidia-allocator.so.510.47.03
-rwxr-xr-x.  1 root root  59574832 Jan 25 03:56 libnvidia-compiler.so.510.47.03
-rwxr-xr-x.  1 root root  28190356 Jan 25 03:48 libnvidia-eglcore.so.510.47.03
lrwxrwxrwx.  1 root root        29 Feb  1 20:33 libnvidia-encode.so.1 -> libnvidia-encode.so.510.47.03
-rwxr-xr-x.  1 root root    124048 Jan 25 03:34 libnvidia-encode.so.510.47.03
lrwxrwxrwx.  1 root root        26 Feb  1 20:33 libnvidia-fbc.so.1 -> libnvidia-fbc.so.510.47.03
-rwxr-xr-x.  1 root root    136828 Jan 25 03:34 libnvidia-fbc.so.510.47.03
-rwxr-xr-x.  1 root root  30472084 Jan 25 03:49 libnvidia-glcore.so.510.47.03
-rwxr-xr-x.  1 root root    613928 Jan 25 03:35 libnvidia-glsi.so.510.47.03
-rwxr-xr-x.  1 root root  18955008 Jan 25 03:53 libnvidia-glvkspirv.so.510.47.03
lrwxrwxrwx.  1 root root        25 Feb  1 20:33 libnvidia-ml.so -> libnvidia-ml.so.510.47.03
lrwxrwxrwx.  1 root root        25 Feb  1 20:33 libnvidia-ml.so.1 -> libnvidia-ml.so.510.47.03
-rwxr-xr-x.  1 root root   1702708 Jan 25 03:36 libnvidia-ml.so.510.47.03
lrwxrwxrwx.  1 root root        29 Feb  1 20:33 libnvidia-opencl.so.1 -> libnvidia-opencl.so.510.47.03
-rwxr-xr-x.  1 root root  17126348 Jan 25 03:56 libnvidia-opencl.so.510.47.03
lrwxrwxrwx.  1 root root        34 Feb  1 20:33 libnvidia-opticalflow.so.1 -> libnvidia-opticalflow.so.510.47.03
-rwxr-xr-x.  1 root root     46224 Jan 25 03:33 libnvidia-opticalflow.so.510.47.03
lrwxrwxrwx.  1 root root        37 Feb  1 20:33 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.510.47.03
-rwxr-xr-x.  1 root root  12802792 Jan 25 03:40 libnvidia-ptxjitcompiler.so.510.47.03
-rwxr-xr-x.  1 root root     13560 Jan 25 03:33 libnvidia-tls.so.510.47.03
drwxr-xr-x.  2 root root      4096 Feb  6 18:06 nvidia
$ ll /usr/bin | grep nvidia
-rwxr-xr-x.  1 root root       36981 Jan 25 04:57 nvidia-bug-report.sh
-rwxr-xr-x.  1 root root       47528 Feb 14 17:03 nvidia-container-cli
-rwxr-xr-x.  1 root root     2260408 Feb 14 17:04 nvidia-container-runtime
lrwxrwxrwx.  1 root root          33 Feb 17 08:09 nvidia-container-runtime-hook -> /usr/bin/nvidia-container-toolkit
-rwxr-xr-x.  1 root root     2156344 Feb 14 17:04 nvidia-container-toolkit
-rwxr-xr-x.  1 root root       49920 Jan 25 04:09 nvidia-cuda-mps-control
-rwxr-xr-x.  1 root root       14488 Jan 25 04:09 nvidia-cuda-mps-server
-rwxr-xr-x.  1 root root      260912 Jan 25 03:48 nvidia-debugdump
-rwxr-xr-x.  1 root root         721 Feb 14 17:05 nvidia-docker
-rwxr-xr-x.  1 root root     3896400 Jan 25 03:49 nvidia-ngx-updater
-rwxr-xr-x.  1 root root       45272 Feb  2 01:13 nvidia-persistenced
-rwxr-xr-x.  1 root root      978560 Jan 25 03:49 nvidia-powerd
-rwxr-xr-x.  1 root root      323128 Feb  2 01:29 nvidia-settings
-rwxr-xr-x.  1 root root         904 Jan 25 03:45 nvidia-sleep.sh
-rwxr-xr-x.  1 root root      690808 Jan 25 03:49 nvidia-smi

Add option to specify IPv4

Hi,

I've been using your software, and it works great, but I noticed that there doesn't seem to be an option to specify IPv4. I would like to request an option to allow users to specify IPv4 only.

I tried with --web.listen-address=0.0.0.0:9835 but it still uses IPv6.

This feature would be particularly useful for those of us who are running the software on systems that have both IPv4 and IPv6 enabled, and need to ensure that the software is using IPv4 only.

I understand that this may not be a high priority feature, but I believe that it would greatly enhance the functionality of your software for users like myself.

Thank you for your consideration.

Not able to scrape metrics

A DaemonSet has been set up on an EKS cluster. Logs of the pod:
ts=2023-03-14T11:29:48.011Z caller=exporter.go:121 level=warn msg="Failed to auto-determine query field names, falling back to the built-in list" error="error running command: exit status 12: command failed. code: 12 | command: nvidia-smi --help-query-gpu | stdout: NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.\nPlease also try adding directory that contains libnvidia-ml.so to your system PATH.\n | stderr: NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.\nPlease also try adding directory that contains libnvidia-ml.so to your system PATH.\n"

Add metric to get `nvidia-smi` command's exit status that it is error or success

In some environments the nvidia-smi command returns an error code.

Ex:

$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

Can you add a simple metric (e.g. nvidia_smi_command_status) to report whether the nvidia-smi command ran successfully?

I think we need to add a new metric if there is stderr:

cmd.Stdout = &stdout
cmd.Stderr = &stderr

err := runCmd(cmd)
if err != nil {
	return nil, fmt.Errorf("command failed. stderr: %s err: %w", stderr.String(), err)
}

t, err := parseCSVIntoTable(strings.TrimSpace(stdout.String()), qFields)
if err != nil {
	return nil, err
}

return &t, nil

Thanks

Docker: use nvidia/cuda instead of ubuntu

Is your feature request related to a problem? Please describe.
I use the Docker installation method. Since I have multiple systems with different numbers of GPUs, bind-mounting every GPU device into every container does not seem good enough.

Describe the solution you'd like
I would like to be able to run the same docker-compose file on every system.

Describe alternatives you've considered
I tried to rebuild your Dockerfile with FROM nvidia/cuda:11.6.2-base-ubuntu20.04 instead of ubuntu:22.04, following https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html.
It doesn't work, since the provided Dockerfile does not build. If you can make the Dockerfile actually work, I will try to make a pull request.
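
A hedged sketch of what I have in mind, assuming the host has the NVIDIA Container Toolkit installed and the nvidia runtime configured in Docker, so the toolkit injects nvidia-smi and the driver libraries instead of individual bind mounts (the service name and image tag are just examples):

services:
  nvidia_gpu_exporter:
    image: utkuozdemir/nvidia_gpu_exporter:1.2.0
    runtime: nvidia                           # requires the NVIDIA Container Toolkit on the host
    environment:
      - NVIDIA_VISIBLE_DEVICES=all            # expose every GPU on the host
      - NVIDIA_DRIVER_CAPABILITIES=utility    # enough for nvidia-smi
    ports:
      - 9835:9835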

nvidia_smi_gpu_info is in reverse order (by uuid) than other metrics

Describe the bug
nvidia_smi_gpu_info is in reverse order (by UUID) compared to the other metrics:
[screenshot]
This makes machine-wide dashboards malfunction and show flipped metrics:
[screenshot]

To Reproduce
Steps to reproduce the behavior:
Have 2 different GPUs in the system? Not sure; I currently have access to only one such machine. On servers with identical cards this does not happen.

Expected behavior
All metrics should be sorted by UUID so that they align

Console output
N/A

Model and Version

  • GPU Model: NVIDIA GeForce GTX 1080 Ti and NVIDIA GeForce RTX 3060
  • App version and architecture: nvidia_gpu_exporter, version 1.1.0 (branch: HEAD, revision: 086b41f286814c3d1b0eb93141664ff8932eb0c8)
  • Installation method: binary download
  • Operating System Ubuntu Server 22.04 uname -a: Linux ubuntu 5.15.0-47-generic #51-Ubuntu SMP Thu Aug 11 07:51:15 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Nvidia GPU driver version: nvidia-driver-515-server 515.65.01

Additional context: [screenshot]

Grafana customizable GPU uuid

First of all, I would like to say this repo is great work and it partly solved my requirements.
Since my lab's compute cards are distributed across different hosts with different IPs, I can't easily tell which GPU belongs to which server by UUID. So I'm wondering: is the name of the GPU selector in the top-left corner of the dashboard customizable?

how to make docker images

Hello, I want to use this project to build Docker images, but I failed. Can you tell me how to build the Docker image?

AUR package

I created an AUR package for this repository to make installing on Arch Linux easier.
Maybe this is also helpful for others :-)

(Feel free to close this issue again)

startup error

Ubuntu 20.04.4

x86_64

GeForce RTX 3090

Error:

● nvidia_gpu_exporter.service - Nvidia GPU Exporter
     Loaded: loaded (/etc/systemd/system/nvidia_gpu_exporter.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2022-07-20 16:09:11 CST; 4s ago
    Process: 558328 ExecStart=/usr/local/bin/nvidia_gpu_exporter (code=exited, status=217/USER)
   Main PID: 558328 (code=exited, status=217/USER)

Jul 20 16:09:11 guodi-yaogan-ai-system systemd[1]: nvidia_gpu_exporter.service: Scheduled restart job, restart counter is at 5.
Jul 20 16:09:11 guodi-yaogan-ai-system systemd[1]: Stopped Nvidia GPU Exporter.
Jul 20 16:09:11 guodi-yaogan-ai-system systemd[1]: nvidia_gpu_exporter.service: Start request repeated too quickly.
Jul 20 16:09:11 guodi-yaogan-ai-system systemd[1]: nvidia_gpu_exporter.service: Failed with result 'exit-code'.
Jul 20 16:09:11 guodi-yaogan-ai-system systemd[1]: Failed to start Nvidia GPU Exporter.

No metrics for multiple master nodes

Hey man, me again!

Having another issue, maybe you could help.
I have the same setup as before, but now more master nodes: 3 master nodes and 2 worker nodes.
The issue is that Prometheus does not seem to have access to the metrics from the 2 new master nodes for some reason.
Consequently, they are not displayed in the Grafana dashboard.
All nodes have nvidia_gpu_exporter running.

[screenshots]

That's how I deploy nvidia_gpu_exporter:

helm install --version=0.3.1 ozdemir utkuozdemir/nvidia-gpu-exporter -f nvidia-gpu-ozdemir-values.yml

with these values:

tolerations:
  - key: node-role.kubernetes.io/master
    effect: NoSchedule
    operator: Exists

hostPort:
  enabled: true
  port: 31585

I might be missing something obvious; I will continue the investigation.

Prometheus setup problems

[screenshot]

Here's what I'm seeing while trying to configure the Data Source:

Error reading Prometheus: bad_response: readObjectStart: expect { or n, but found <, error found in #1 byte of ...|<html lang=|..., bigger context ...|<html lang="en"> <head><title>Nvidia GPU Exporter</|...

I think I followed the install guide correctly but still something is not set up properly somewhere...

Multinode dashboard extension

Hi, great project, mate, thanks!

I don't seem to be able to get data from multiple nodes, though.
In our company we have an on-premise cluster of PCs with GPUs, on top of which there is a k8s cluster, and we are looking for a proper monitoring solution.
I use 3 nodes in my test setup -> 1 master and 2 workers. All of them are supposed to pick up GPU workloads.

I used the Helm chart to deploy the exporters:

helm install ozdemir utkuozdemir/nvidia-gpu-exporter -f nvidia-gpu-utku-ozdemir-values.yml

with the following values to allow deploying to master nodes as well

tolerations:
  - key: node-role.kubernetes.io/master
    effect: NoSchedule
    operator: Exists

service:
  type: NodePort
  nodePort: 30699

NOTE: By the way, as it turns out, it's not possible to set a nodePort in the Helm values. Are you planning to add this feature by any chance? It would be very convenient.

The exporters seem to be deployed successfully:

vvcServiceAccount@k8s-master-node0:~$ k get pod -o wide
NAME                                                      READY   STATUS    RESTARTS   AGE   IP          NODE               NOMINATED NODE   READINESS GATES
utku-ozdemir-nvidia-gpu-exporter-6gt5r                         1/1     Running   1          19d   10.44.0.1   k8s-worker-node1   <none>           <none>
utku-ozdemir-nvidia-gpu-exporter-7jqgs                         1/1     Running   1          19d   10.32.0.2   k8s-master-node0   <none>           <none>
utku-ozdemir-nvidia-gpu-exporter-m9h72                         1/1     Running   0          19d   10.36.0.1   k8s-worker-node2   <none>           <none>

The logs from all 3 exporters are the following:

level=info ts=2021-08-13T15:08:03.434Z caller=main.go:65 msg="Listening on address" address=:9835
ts=2021-08-13T15:08:03.435Z caller=log.go:124 level=info msg="TLS is disabled." http2=false

Prometheus spec for scraping:

prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
    - job_name: nvidia_gpu_exporter
      static_configs:
        - targets: [ '10.0.10.3:31585' ] 

with 31585 being the random nodePort assigned to the service and 10.0.10.3 being the IP address of the master node:

NAME                               TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
utku-ozdemir-nvidia-gpu-exporter   NodePort   10.106.182.203   <none>        9835:31585/TCP   19d

I used this Grafana dashboard to display the collected data: https://grafana.com/grafana/dashboards/14574

The results are good: [screenshot]

But there is only one node in the dropdown (one of the worker nodes).
Am I missing something in my setup, or is it not possible to have all nodes in the dropdown list and get an overview per node?

Ideally, it would be awesome to also have a summary dashboard with averaged metrics to monitor the whole cluster at once.
Do you by any chance plan on developing something like that, or maybe know of some dashboards that already do that?
I tried the official one (nvidia/gpu-operator with https://grafana.com/grafana/dashboards/12239 as the dashboard) but it is not nearly as impressive as yours, and it also had a bunch of empty charts and "no data" situations.

So, to sum up:

  • it would be cool to add a nodePort option to the Helm chart values
  • is it possible to have an overview of multiple nodes? (see the scrape-config sketch below)

thanks!
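
A hedged sketch of scraping every node rather than a single address, assuming each node exposes the exporter on the same hostPort/nodePort (the two extra IPs are placeholders for the worker nodes):

prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
    - job_name: nvidia_gpu_exporter
      static_configs:
        - targets:
            - 10.0.10.3:31585   # master node
            - 10.0.10.4:31585   # worker node 1 (placeholder IP)
            - 10.0.10.5:31585   # worker node 2 (placeholder IP)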

Running nvidia-smi multiple times causes the GPU to hang

Hello, Thank you for the great work.

First of all, this is not a bug report! It's related to the Nvidia drivers and I'm just informing you. The story is:
We have been using the tool for a while. After the last Nvidia update, unfortunately, we are facing some trouble. There is no problem with the newer GPUs, but the old ones hang after calling nvidia-smi. Has anyone else faced this weird problem recently?

GPU: RTX Titan, driver version 450.80.02 and kernel is modules/5.4.0-124-generic

Unable to view encoder stats - "1:87: parse error: unexpected number \"0\""

Describe the bug
Trying to view encoder stats in a gauge as suggested results in the error "1:87: parse error: unexpected number \"0\"".

To Reproduce
Add a panel with queries (replacing $gpu with the actual uuid does not work either):

nvidia_smi_encoder_stats_average_fps{uuid="$gpu"} 0
nvidia_smi_encoder_stats_average_latency{uuid="$gpu"} 0
nvidia_smi_encoder_stats_session_count{uuid="$gpu"} 0
  2. See the error:
    1:87: parse error: unexpected number \"0\"

Expected behavior
To see 3 gauges with encoder stats

Model and Version

  • GTX 950 and Quadro P400
  • v0.3.0 - linux_x86_64
  • Installed via binary download
  • Arch Linux
  • nvidia-driver-470.74
