datadog_nvml

Monitor NVIDIA GPU status using Datadog.

Datadog Agent Check: captures and sends the metrics

nvidia-ml-py Python module: provides the API interface to NVML

Currently Supported Monitoring

The check currently collects the following items for each GPU.

Metrics

  • nvml.util.gpu: Percent of time over the past sample period during which one or more kernels was executing on the GPU.
  • nvml.util.memory: Percent of time over the past sample period during which global (device) memory was being read or written.
  • nvml.util.decode: Percent of usage of HW Decoding (NVDEC) from the last sample period (*)
  • nvml.util.encode: Percent of usage of HW Encoding (NVENC) from the last sample period (*)
  • nvml.mem.total: Total Memory
  • nvml.mem.used: Used Memory
  • nvml.mem.free: Free Memory
  • nvml.temp: Temperature
  • nvml.gpus.number: Number of active GPUs
  • nvml.throttle.appsettings: Clocks are being throttled by the applications settings
  • nvml.throttle.display: Clocks are being throttled by the Display clocks settings
  • nvml.throttle.hardware: Clocks are being throttled by a factor of 2 or more due to high temperature, high power draw, and/or PState or clock change
  • nvml.throttle.power.hardware: Clocks are being throttled due to the External Power Brake Assertion being triggered (e.g., by the system power supply)
  • nvml.throttle.idle: Clocks are being throttled to Idle state because nothing is running on the GPU
  • nvml.throttle.power.software: Clocks are being throttled by the software power scaling algorithm
  • nvml.throttle.syncboost: Clocks are being throttled because this GPU is in a sync boost group and will sync to the lowest possible clocks across the group
  • nvml.throttle.temp.hardware: Clocks are being throttled by a factor of 2 or more due to high temperature
  • nvml.throttle.temp.software: Clocks are being throttled due to high GPU core and/or memory temperature
  • nvml.throttle.unknown: Clocks are being throttled due to an unknown reason

(*) HW accelerated encode and decode are supported on NVIDIA GeForce, Quadro, Tesla, and GRID products with Fermi, Kepler, Maxwell and Pascal generation GPUs.
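The nvml.throttle.* metrics above correspond to individual bits in the mask that NVML reports for the current clock throttle reasons. Below is a minimal sketch of how such a mask could be split into one 0/1 gauge per metric. The bit values are taken from NVIDIA's nvml.h and are written out here as an assumption; the helper name decode_throttle_reasons is hypothetical and is not necessarily how nvml.py does it.

```python
# Bit values as defined in NVIDIA's nvml.h (assumed here; prefer the
# pynvml.nvmlClocksThrottleReason* constants when your pynvml provides them).
THROTTLE_BITS = {
    "nvml.throttle.idle": 0x1,
    "nvml.throttle.appsettings": 0x2,
    "nvml.throttle.power.software": 0x4,
    "nvml.throttle.hardware": 0x8,
    "nvml.throttle.syncboost": 0x10,
    "nvml.throttle.temp.software": 0x20,
    "nvml.throttle.temp.hardware": 0x40,
    "nvml.throttle.power.hardware": 0x80,
    "nvml.throttle.display": 0x100,
    "nvml.throttle.unknown": 0x8000000000000000,
}


def decode_throttle_reasons(mask):
    """Map a throttle-reason bitmask to {metric_name: 0 or 1}.

    Hypothetical helper: `mask` is the integer returned by
    pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle).
    """
    return {
        metric: int(bool(mask & bit))
        for metric, bit in THROTTLE_BITS.items()
    }


if __name__ == "__main__":
    # Example mask: GPU idle plus hardware thermal slowdown.
    for metric, value in sorted(decode_throttle_reasons(0x1 | 0x40).items()):
        print("%s = %d" % (metric, value))
```

Several reasons can be active at the same time, which is why each one is reported as its own gauge rather than as a single value.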

Tags

  • name: GPU name (e.g., GEFORCE_GTX_660)

REQUIRES

nvidia-ml-py (v7.352.0)

# Python 2
$ sudo /opt/datadog-agent/embedded/bin/pip install nvidia-ml-py==7.352.0

# Python 3
$ sudo /opt/datadog-agent/embedded/bin/pip install nvidia-ml-py3==7.352.0

Check that it was installed correctly:

# /opt/datadog-agent/embedded/bin/pip show nvidia-ml-py
Name: nvidia-ml-py
Version: 7.352.0
Summary: Python Bindings for the NVIDIA Management Library
Home-page: http://www.nvidia.com/
Author: NVIDIA Corporation
Author-email: [email protected]
License: BSD
Location: /opt/datadog-agent/embedded/lib/python2.7/site-packages
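As an additional check, you can confirm that the bindings are importable by the Agent's embedded interpreter rather than by the system Python. The sketch below assumes the package exposes a module named pynvml (the name that appears in Agent tracebacks for this check). The helper name module_available is hypothetical.

```python
import importlib


def module_available(name):
    """Return True if `name` can be imported by the running interpreter."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Run this with the Agent's embedded Python so that the result
    # reflects what the check itself will see.
    print("pynvml importable: %s" % module_available("pynvml"))
```

Importing the module only proves that the Python bindings are installed. The NVIDIA driver library is loaded later, when NVML is initialized.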

SETUP

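Download the two files and place them in the Agent's check and configuration directories (under /etc/dd-agent on Agent v5, under /etc/datadog-agent on Agent v6 and later):

  • nvml.py: /etc/datadog-agent/checks.d
  • nvml.yaml.default: /etc/datadog-agent/conf.d

with the commands: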

$ sudo wget https://raw.githubusercontent.com/ngi644/datadog_nvml/master/nvml.py -O /etc/datadog-agent/checks.d/nvml.py
$ sudo wget https://raw.githubusercontent.com/ngi644/datadog_nvml/master/nvml.yaml.default -O /etc/datadog-agent/conf.d/nvml.yaml.default

Restart the Datadog Agent so that it compiles the Python source and picks up the new check file.

$ sudo service datadog-agent restart

Check that the module was loaded correctly. On Agent v5, which provides the info command:

$ sudo service datadog-agent info

Result should look like:

Checks
  ======
...
    nvml (5.14.1)

      - instance #0 [OK]
      - Collected 16 metrics, 0 events & 1 service check
...

or, on Agent v6 and later, with:

$ sudo datadog-agent status

Result should look like:

Checks
  ======
...
    nvml (0.1.4)
    ------------
      Instance ID: nvml:d9950296b931429 [OK]
      Total Runs: 1
      Metric Samples: Last Run: 8, Total: 8
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 1
      Average Execution Time : 700ms
...
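If you want to verify this from a script, for example in a provisioning step, the status text can be parsed. The following is a small sketch that assumes the `datadog-agent status` layout shown above. The function name summarize_check is hypothetical, and the layout may differ between Agent versions.

```python
import re


def summarize_check(status_text, check_name="nvml"):
    """Extract the state and last-run sample count for one check.

    Returns None when the check does not appear in the status text.
    The search starts at the check's header line, so the first
    'Instance ID' and 'Metric Samples' lines after it are used.
    """
    header = re.compile(r"^\s*%s \(" % re.escape(check_name), re.MULTILINE)
    match = header.search(status_text)
    if match is None:
        return None
    section = status_text[match.end():]
    state = re.search(r"Instance ID: \S+ \[(\w+)\]", section)
    samples = re.search(r"Metric Samples: Last Run: (\d+)", section)
    return {
        "state": state.group(1) if state else None,
        "last_run_samples": int(samples.group(1)) if samples else None,
    }
```

Feed it the captured output of `sudo datadog-agent status`. A state other than OK, or a last-run sample count of 0, suggests the check is loaded but not collecting.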

With Docker:

$ docker build -t datadog_nvml .
$ docker run -d --gpus=all \
              -v /var/run/docker.sock:/var/run/docker.sock:ro \
              -v /proc/:/host/proc/:ro \
              -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
              -v /opt/datadog-agent-conf.d:/conf.d:ro \
              -v /opt/datadog-agent-checks.d:/checks.d:ro \
              -e DD_API_KEY=${DD_API_KEY} \
              -e DD_SITE=datadoghq.com \
              datadog_nvml:latest

Tested

Tested on the AWS EC2 G2 family (g2.2xlarge), which includes 1x NVIDIA GRID K520 card. Also tested on a bare-metal Supermicro server with NVIDIA Tesla P4 and P40 cards.

datadog_nvml's Issues

doesn't work for Mac OS X

Hi,

I have tried to get this working for Mac OS X, but it seems that the NVML is not available for this platform. I was hoping that you might have some ideas on how to get this plugin working on Mac OS X?

Thanks,

Can we use this in Kubernetes?

With docker.

$ docker build -t datadog_nvml .
$ docker run -d --gpus=all \
              -v /var/run/docker.sock:/var/run/docker.sock:ro \
              -v /proc/:/host/proc/:ro \
              -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
              -v /opt/datadog-agent-conf.d:/conf.d:ro \
              -v /opt/datadog-agent-checks.d:/checks.d:ro \
              -e DD_API_KEY=${DD_API_KEY} \
              -e DD_SITE=datadoghq.com \
              datadog_nvml:latest

The docs show this. Do we have to do the same thing in Kubernetes?

Auto-generated dashboard?

Metrics collection seems to work here with a Tesla M60.

I have metrics, but no dashboard. Should I create one myself, or is there a default generated dashboard somewhere?

Thanks for your work.

How to install on a Windows agent?

I would like to be able to set this up for Datadog agents running on Windows Server (2016 and 2019). I am able to do the first pip install step, but fail to find where to put the custom scripts on the agent for Datadog to use them.

Getting Insufficient Permissions error for nvml

Anyone know why I am getting this error?
    nvml (0.1.5)
    ------------
      Instance ID: nvml:d884b5186b651429 [ERROR]
      Configuration Source: file:/etc/datadog-agent/conf.d/nvml.yaml.default
      Total Runs: 11
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 2ms
      Last Execution Date : 2020-06-26 13:16:33.000000 PDT
      Last Successful Execution Date : Never
      Error: Insufficient Permissions
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/base/checks/base.py", line 820, in run
          self.check(instance)
        File "/etc/datadog-agent/checks.d/nvml.py", line 47, in check
          pynvml.nvmlInit()
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/pynvml.py", line 615, in nvmlInit
          _nvmlCheckReturn(ret)
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/pynvml.py", line 310, in _nvmlCheckReturn
          raise NVMLError(ret)
      NVMLError_NoPermission: Insufficient Permissions
