Comments (10)
Thanks Utku,
I just rolled out the package on our nodes.
I will post the exporter's output the next time it fails.
from nvidia_gpu_exporter.
The service has been running fine since the last restart (1d 4h).
Since we install the drivers via the run file, there is no NVIDIA service running.
So I will try to put
After=systemd-modules-load.service
in the service unit.
I will observe the service over the next few days and reply to you.
from nvidia_gpu_exporter.
When that happens, can you please check if the process itself is crashing? (does the PID stay the same?)
Also, after it happens, can you do this and share the outputs here: #68 (comment)
It's helpful overall for troubleshooting.
from nvidia_gpu_exporter.
Here are the logs. You can see that the process has been running for over a week:
● prometheus-nvidia-exporter-2.service - Prometheus nVidia GPU Exporter
Loaded: loaded (/usr/lib/systemd/system/prometheus-nvidia-exporter-2.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2022-03-29 17:05:05 CEST; 1 weeks 0 days ago
Main PID: 1022 (prometheus-nvid)
Tasks: 17
CGroup: /system.slice/prometheus-nvidia-exporter-2.service
└─1022 /usr/bin/prometheus-nvidia-exporter-2
Apr 06 12:13:48 c019dc6u024 prometheus-nvidia-exporter-2[1022]: ts=2022-04-06T10:13:48.935Z caller=exporter.go:157 level=error error="command failed. stderr: err: exit status 2"
Apr 06 12:18:48 c019dc6u024 prometheus-nvidia-exporter-2[1022]: ts=2022-04-06T10:18:48.936Z caller=exporter.go:157 level=error error="command failed. stderr: err: exit status 2"
Apr 06 12:23:48 c019dc6u024 prometheus-nvidia-exporter-2[1022]: ts=2022-04-06T10:23:48.936Z caller=exporter.go:157 level=error error="command failed. stderr: err: exit status 2"
Apr 06 12:28:48 c019dc6u024 prometheus-nvidia-exporter-2[1022]: ts=2022-04-06T10:28:48.936Z caller=exporter.go:157 level=error error="command failed. stderr: err: exit status 2"
Apr 06 12:33:48 c019dc6u024 prometheus-nvidia-exporter-2[1022]: ts=2022-04-06T10:33:48.935Z caller=exporter.go:157 level=error error="command failed. stderr: err: exit status 2"
Apr 06 12:38:48 c019dc6u024 prometheus-nvidia-exporter-2[1022]: ts=2022-04-06T10:38:48.942Z caller=exporter.go:157 level=error error="command failed. stderr: err: exit status 2"
Apr 06 12:43:48 c019dc6u024 prometheus-nvidia-exporter-2[1022]: ts=2022-04-06T10:43:48.936Z caller=exporter.go:157 level=error error="command failed. stderr: err: exit status 2"
Apr 06 12:48:48 c019dc6u024 prometheus-nvidia-exporter-2[1022]: ts=2022-04-06T10:48:48.939Z caller=exporter.go:157 level=error error="command failed. stderr: err: exit status 2"
Apr 06 12:53:48 c019dc6u024 prometheus-nvidia-exporter-2[1022]: ts=2022-04-06T10:53:48.936Z caller=exporter.go:157 level=error error="command failed. stderr: err: exit status 2"
Apr 06 12:58:48 c019dc6u024 prometheus-nvidia-exporter-2[1022]: ts=2022-04-06T10:58:48.948Z caller=exporter.go:157 level=error error="command failed. stderr: err: exit status 2"
from nvidia_gpu_exporter.
Please also share the outputs I described here: #68 (comment)
from nvidia_gpu_exporter.
This is the output:
Field "ecc.errors.corrected.volatile.dram" is not a valid field to query.
This may have to do with the quite old version of the driver installed?!
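For reference, the error above names the rejected field, so a collector could in principle drop that field and retry the query. The following is only a minimal Python sketch of that idea, not what the exporter (which is written in Go) does. `invalid_field`, `prune_invalid_fields`, and the `query` callback are hypothetical names; `query` is assumed to run `nvidia-smi --query-gpu=<fields> --format=csv` and return the exit code together with stdout, which is where the message shows up in the logs here.

```python
import re

# Matches the message nvidia-smi printed in this thread.
_INVALID = re.compile(r'Field "([^"]+)" is not a valid field to query')


def invalid_field(output):
    """Return the field name rejected by nvidia-smi, or None if no match."""
    match = _INVALID.search(output)
    return match.group(1) if match else None


def prune_invalid_fields(fields, query):
    """Drop rejected fields one at a time until the query succeeds.

    `query` is a hypothetical callable: query(fields) -> (exit_code, stdout).
    """
    remaining = list(fields)
    while remaining:
        code, out = query(remaining)
        if code == 0:
            return remaining
        bad = invalid_field(out)
        if bad is None or bad not in remaining:
            # Some other failure (e.g. driver not reachable); do not guess.
            raise RuntimeError("query failed: " + out.strip())
        remaining.remove(bad)
    return remaining
```

Each retry removes exactly one field, so the loop ends after at most as many attempts as there are fields.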
from nvidia_gpu_exporter.
Thanks. I have just released v0.5.0 with more helpful logs. Can you install it and post the log output here?
> This may have to do with the quite old version of the driver installed?!
I don't think that is the reason, because the exporter tries to auto-detect the valid field names at startup. But let's see what the v0.5.0 output says.
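To illustrate what auto-detection with a fallback can look like, here is a rough Python sketch (the exporter itself is Go, and this is not its code). It assumes that `nvidia-smi --help-query-gpu` lists each field as a double-quoted name at the start of a line; `FALLBACK_FIELDS`, `parse_field_names`, and `detect_fields` are made-up names, and the fallback list is just a placeholder.

```python
import re
import subprocess

# Placeholder fallback list; the real built-in list is much longer.
FALLBACK_FIELDS = ["name", "uuid", "driver_version"]

# Assumption: each field is documented on a line that starts with "field.name".
_FIELD_LINE = re.compile(r'^"([^"]+)"', re.MULTILINE)


def parse_field_names(help_text):
    """Extract the quoted field names that start a line of the help text."""
    return _FIELD_LINE.findall(help_text)


def detect_fields(run=subprocess.run):
    """Return (fields, detected). Falls back if nvidia-smi fails or lists nothing."""
    try:
        result = run(
            ["nvidia-smi", "--help-query-gpu"],
            capture_output=True,
            text=True,
            check=True,  # non-zero exit raises CalledProcessError
        )
    except (OSError, subprocess.CalledProcessError):
        return list(FALLBACK_FIELDS), False
    names = parse_field_names(result.stdout)
    if not names:
        return list(FALLBACK_FIELDS), False
    return names, True
```

The second return value makes it visible to the caller whether the list came from the driver or from the fallback.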
from nvidia_gpu_exporter.
Dear Utku,
It just happened again on a system that had just booted:
# w
09:17:49 up 15 min, 1 user, load average: 0.10, 0.17, 0.28
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
root pts/0 SERVER 09:11 5.00s 0.00s 0.00s w
# systemctl status prometheus-nvidia-exporter-2.service -l
● prometheus-nvidia-exporter-2.service - Prometheus nVidia GPU Exporter
Loaded: loaded (/usr/lib/systemd/system/prometheus-nvidia-exporter-2.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2022-04-11 09:02:38 CEST; 9min ago
Main PID: 1144 (prometheus-nvid)
Tasks: 9
CGroup: /system.slice/prometheus-nvidia-exporter-2.service
└─1144 /usr/bin/prometheus-nvidia-exporter-2
Apr 11 09:02:38 client systemd[1]: Started Prometheus nVidia GPU Exporter.
Apr 11 09:02:39 client prometheus-nvidia-exporter-2[1144]: ts=2022-04-11T07:02:39.006Z caller=exporter.go:121 level=warn msg="Failed to auto-determine query field names, falling back to the built-in list" error="error running command: exit status 9: command failed. code: 9 | command: nvidia-smi --help-query-gpu | stdout: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.\n\n | stderr: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.\n\n"
Apr 11 09:02:39 client prometheus-nvidia-exporter-2[1144]: ts=2022-04-11T07:02:39.008Z caller=main.go:68 level=info msg="Listening on address" address=:9835
Apr 11 09:02:39 client prometheus-nvidia-exporter-2[1144]: ts=2022-04-11T07:02:39.009Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
Apr 11 09:10:28 client prometheus-nvidia-exporter-2[1144]: ts=2022-04-11T07:10:28.283Z caller=exporter.go:175 level=error error="error running command: exit status 2: command failed. code: 2 | command: nvidia-smi --query-gpu=ecc.errors.corrected.volatile.sram,ecc.errors.uncorrected.volatile.l1_cache,ecc.errors.corrected.volatile.dram,ecc.errors.uncorrected.aggregate.device_memory,driver_model.current,ecc.mode.current,ecc.errors.corrected.volatile.l1_cache,ecc.errors.corrected.aggregate.dram,ecc.errors.uncorrected.aggregate.register_file,ecc.errors.uncorrected.aggregate.texture_memory,power.default_limit,power.min_limit,inforom.ecc,fan.speed,clocks_throttle_reasons.gpu_idle,encoder.stats.sessionCount,ecc.errors.uncorrected.aggregate.cbu,clocks.current.sm,pci.sub_device_id,pcie.link.width.max,retired_pages.double_bit.count,ecc.mode.pending,ecc.errors.corrected.aggregate.cbu,mig.mode.current,inforom.oem,ecc.errors.corrected.volatile.register_file,power.limit,clocks.current.video,pci.domain,accounting.buffer_size,driver_model.pending,clocks_throttle_reasons.supported,ecc.errors.uncorrected.aggregate.dram,clocks.applications.memory,driver_version,name,clocks_throttle_reasons.active,memory.free,clocks.default_applications.graphics,display_mode,display_active,power.max_limit,clocks.max.sm,memory.used,ecc.errors.corrected.aggregate.register_file,memory.total,ecc.errors.corrected.aggregate.device_memory,ecc.errors.corrected.volatile.device_memory,retired_pages.pending,pcie.link.gen.max,pcie.link.width.current,mig.mode.pending,pstate,power.draw,ecc.errors.corrected.volatile.texture_memory,retired_pages.single_bit_ecc.count,inforom.pwr,clocks_throttle_reasons.hw_power_brake_slowdown,ecc.errors.uncorrected.volatile.dram,clocks.max.graphics,serial,accounting.mode,clocks_throttle_reasons.sync_boost,utilization.gpu,ecc.errors.corrected.volatile.l2_cache,ecc.errors.uncorrected.aggregate.sram,persistence_mode,clocks_throttle_reasons.applications_clocks_setting,ecc.errors.uncorrected.volatile.register_file,ecc.errors.corrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.aggregate.l1_cache,ecc.errors.uncorrected.aggregate.l2_cache,enforced.power.limit,vbios_version,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.uncorrected.volatile.l2_cache,clocks_throttle_reasons.sw_power_cap,encoder.stats.averageLatency,ecc.errors.corrected.aggregate.sram,ecc.errors.uncorrected.volatile.texture_memory,clocks.max.memory,pci.bus,clocks_throttle_reasons.sw_thermal_slowdown,ecc.errors.uncorrected.volatile.total,power.management,gom.pending,ecc.errors.corrected.volatile.total,ecc.errors.corrected.aggregate.l1_cache,encoder.stats.averageFps,ecc.errors.corrected.volatile.cbu,clocks_throttle_reasons.hw_thermal_slowdown,count,pci.device,clocks.default_applications.memory,index,utilization.memory,timestamp,temperature.gpu,inforom.img,compute_mode,ecc.errors.uncorrected.aggregate.total,temperature.memory,pci.device_id,pcie.link.gen.current,ecc.errors.uncorrected.volatile.cbu,gom.current,clocks_throttle_reasons.hw_slowdown,ecc.errors.uncorrected.volatile.device_memory,clocks.current.graphics,clocks.current.memory,clocks.applications.graphics,uuid,pci.bus_id --format=csv | stdout: Field \"ecc.errors.corrected.volatile.sram\" is not a valid field to query.\n\n | stderr: "
# nvidia-smi
Mon Apr 11 09:13:57 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50 Driver Version: 430.50 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P6000 Off | 00000000:03:00.0 On | Off |
| 26% 41C P8 11W / 250W | 58MiB / 24449MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 9765 G /usr/bin/X 40MiB |
| 0 10781 G /usr/bin/gnome-shell 16MiB |
+-----------------------------------------------------------------------------+
I guess this happens because the driver was not ready when the exporter started?
stdout: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
So I am adding a start delay to the systemd service.
from nvidia_gpu_exporter.
Ok cool, it seems the auto-detection of field names at startup using `nvidia-smi --help-query-gpu` failed with `NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver...`.
So the exporter used the fallback set of fields, which is hardcoded into the exporter code. It turns out that one of those fallback fields, `ecc.errors.corrected.volatile.sram`, is not supported on your GPU/OS/driver version.
Can you check whether it recovers (can auto-detect the fields) after you restart the service?
If it does, I suspect that the exporter starts too early, before the GPU driver is ready, and that is why the auto-detection fails.
To verify this, please check the systemd services you have and see if there is anything NVIDIA-related. If there is, please add an `After=` setting to the exporter's systemd unit file to make sure the exporter starts after that NVIDIA service. Then give it a try for some time to see whether the issue reproduces.
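If there turns out to be no NVIDIA-related unit to order against, one alternative might be to block until the driver responds before the exporter starts, for example from a small helper called via `ExecStartPre=` in the unit file. Below is a minimal Python sketch of such a helper. It assumes that `nvidia-smi -L` exits with a non-zero code while the driver is unreachable; `driver_ready` and `wait_for_driver` are hypothetical names, and the timeout and interval are arbitrary.

```python
import subprocess
import time


def driver_ready():
    """True if `nvidia-smi -L` runs and exits with status 0."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # nvidia-smi is missing or not executable.
        return False
    return result.returncode == 0


def wait_for_driver(probe=driver_ready, timeout=120.0, interval=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll `probe` until it succeeds or `timeout` seconds have passed."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A wrapper script could call `wait_for_driver()` and exit non-zero when it returns `False`, so that systemd reports the start as failed instead of letting the exporter fall back to the built-in field list.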
from nvidia_gpu_exporter.
Closing this, but please let me know if you still face the issue.
from nvidia_gpu_exporter.