NVIDIA SMI Telegraf Input Plugin

Powerful performance with an easy integration, powered by Telegraf, the open source data connector built by InfluxData.

5B+ Telegraf downloads
#1 time series database (source: DB Engines)
1B+ downloads of InfluxDB
2,800+ contributors

Powerful Performance, Limitless Scale

Collect, organize, and act on massive volumes of high-velocity data with InfluxDB, the #1 time series platform built to scale with Telegraf. Any data is more valuable when you think of it as time series data.


The NVIDIA System Management Interface (SMI) is a command line utility for managing NVIDIA Graphics Processing Unit (GPU) devices. A GPU is a processor designed for parallel processing and is frequently used in graphics and video rendering. The NVIDIA SMI lets users query and modify the state of a GPU device. It can report query results as plain text or Extensible Markup Language (XML), either to a file or to another output. It is meant to work with Tesla, GRID, Quadro, and Titan X products, and it offers limited support for other NVIDIA GPUs.
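As a rough illustration of the XML reporting described above, the following Python sketch runs the utility with its query (-q) and XML (-x) flags and pulls a few values out of the result. The helper names run_nvidia_smi and parse_gpus are hypothetical, and SAMPLE_XML is a hand-written, heavily trimmed stand-in rather than real output. The element names are assumed from typical nvidia-smi XML and may differ between driver versions, so check them against your own system.

```python
import subprocess
import xml.etree.ElementTree as ET


def run_nvidia_smi(bin_path="/usr/bin/nvidia-smi", timeout=5.0):
    """Run `nvidia-smi -q -x` and return its XML output as text.

    Hypothetical helper: requires an NVIDIA driver to be installed.
    """
    result = subprocess.run(
        [bin_path, "-q", "-x"],
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return result.stdout


def parse_gpus(xml_text):
    """Extract a few per-GPU values (as raw strings) from the XML report."""
    root = ET.fromstring(xml_text)
    gpus = []
    for gpu in root.iter("gpu"):
        gpus.append(
            {
                "name": gpu.findtext("product_name"),
                "uuid": gpu.findtext("uuid"),
                "temperature": gpu.findtext("temperature/gpu_temp"),
                "memory_used": gpu.findtext("fb_memory_usage/used"),
            }
        )
    return gpus


# Hand-written sample for illustration only; real output has many more elements.
SAMPLE_XML = """<nvidia_smi_log>
  <gpu id="00000000:01:00.0">
    <product_name>GeForce GTX 1070 Ti</product_name>
    <uuid>GPU-f9ba66fc-a7f5-94c5-da19-019ef2f9c665</uuid>
    <fb_memory_usage><used>42 MiB</used></fb_memory_usage>
    <temperature><gpu_temp>45 C</gpu_temp></temperature>
  </gpu>
</nvidia_smi_log>"""

if __name__ == "__main__":
    for gpu in parse_gpus(SAMPLE_XML):
        print(gpu)
```

On a machine with an NVIDIA driver, parse_gpus(run_nvidia_smi()) would be the equivalent call against live data. Values come back as strings with their units attached (for example "45 C"), so a collector has to strip the units before storing them as numbers.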

Why use a Telegraf plugin for NVIDIA SMI?

This plugin pulls GPU stats such as memory usage, GPU usage, and temperature from the NVIDIA SMI binary. This lets you monitor your GPU devices closely and detect problems quickly. You can send this data to InfluxDB and use its built-in tools to analyze it over time. You can also set up alerts on metric changes, for example when the temperature of a device crosses an established threshold.

How to monitor NVIDIA SMI using the Telegraf plugin

To configure this plugin, set the path to your NVIDIA SMI binary (bin_path), or leave it at the default, /usr/bin/nvidia-smi. If the binary isn't found at that path, the plugin tries to locate it on PATH (using exec.LookPath) and returns an error if that also fails. You can optionally set a timeout for GPU polling, for example timeout = "5s".
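The lookup order described above can be sketched in Python. This is not the plugin's own code (Telegraf is written in Go and uses exec.LookPath); shutil.which is used here as the closest Python analog. The function name resolve_nvidia_smi and its search_path parameter are hypothetical, with search_path added only so the fallback can be exercised without touching the real PATH.

```python
import os
import shutil

DEFAULT_BIN_PATH = "/usr/bin/nvidia-smi"


def resolve_nvidia_smi(bin_path=DEFAULT_BIN_PATH, search_path=None):
    """Return a usable path to nvidia-smi or raise FileNotFoundError.

    1. Use the configured path if a file exists there.
    2. Otherwise search PATH (or `search_path`, if given) for "nvidia-smi".
    3. If both fail, report an error.
    """
    if os.path.isfile(bin_path):
        return bin_path

    # shutil.which falls back to the PATH environment variable when path=None.
    found = shutil.which("nvidia-smi", path=search_path)
    if found is None:
        raise FileNotFoundError(
            f"nvidia-smi not found at {bin_path!r} or on the search path"
        )
    return found


if __name__ == "__main__":
    try:
        print("Using", resolve_nvidia_smi())
    except FileNotFoundError as err:
        print("Error:", err)
```

Checking the configured path first keeps an explicit setting authoritative, while the PATH search covers systems where the driver installs the binary somewhere else.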

Key NVIDIA SMI metrics to use for monitoring

Some of the important NVIDIA SMI metrics that you should proactively monitor include:

  • measurement: nvidia_smi
    • tags
      • name (Type of GPU, e.g. GeForce GTX 1070 Ti)
      • compute_mode (The compute mode of the GPU, e.g. Default)
      • index (The port index where the GPU is connected to the motherboard, e.g. 1)
      • pstate (Performance state of the GPU, e.g. P0)
      • uuid (A unique identifier for the GPU, e.g. GPU-f9ba66fc-a7f5-94c5-da19-019ef2f9c665)
    • fields
      • fan_speed (integer, percentage)
      • fbc_stats_session_count (integer)
      • fbc_stats_average_fps (integer)
      • fbc_stats_average_latency (integer)
      • memory_free (integer, MiB)
      • memory_used (integer, MiB)
      • memory_total (integer, MiB)
      • power_draw (float, W)
      • temperature_gpu (integer, degrees C)
      • utilization_gpu (integer, percentage)
      • utilization_memory (integer, percentage)
      • utilization_encoder (integer, percentage)
      • utilization_decoder (integer, percentage)
      • pcie_link_gen_current (integer)
      • pcie_link_width_current (integer)
      • encoder_stats_session_count (integer)
      • encoder_stats_average_fps (integer)
      • encoder_stats_average_latency (integer)
      • clocks_current_graphics (integer, MHz)
      • clocks_current_sm (integer, MHz)
      • clocks_current_memory (integer, MHz)
      • clocks_current_video (integer, MHz)
      • driver_version (string)
      • cuda_version (string)
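
To show how the measurement, tags, and fields listed above fit together, here is a minimal Python sketch that renders one reading in InfluxDB line protocol. The function names are hypothetical, and the numeric values and driver version in the example are made up for illustration. Telegraf does this serialization itself, so the sketch only illustrates the resulting data shape.

```python
def escape_key(value):
    """Escape commas, equals signs, and spaces in tag keys/values and field keys."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def escape_measurement(value):
    """Escape commas and spaces in the measurement name."""
    return value.replace(",", "\\,").replace(" ", "\\ ")


def format_field(value):
    """Render a field value: integers get an 'i' suffix, strings are quoted."""
    if isinstance(value, bool):  # bool is a subclass of int, so check it first
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Build one line: measurement,tag=value,... field=value,... [timestamp]."""
    tag_part = "".join(
        f",{escape_key(k)}={escape_key(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{escape_key(k)}={format_field(v)}" for k, v in fields.items()
    )
    line = f"{escape_measurement(measurement)}{tag_part} {field_part}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


if __name__ == "__main__":
    # Illustrative values only, not real readings.
    print(
        to_line_protocol(
            "nvidia_smi",
            {"name": "GeForce GTX 1070 Ti", "index": "1", "pstate": "P0"},
            {"temperature_gpu": 45, "power_draw": 30.5, "driver_version": "470.57"},
        )
    )
```

Tags are stored as indexed strings, which makes them suitable for grouping by GPU (for example by uuid or index), while fields carry the typed measurements you would chart or alert on.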
For more information, please check out the documentation.

