Deploying InfluxDB and Telegraf to Monitor Kubernetes

I run a small Kubernetes cluster at home, which I originally set up as somewhere to experiment.

Because it started as a playground, I never bothered to set up monitoring. However, as time passed, I’ve ended up dropping more production-esque workloads onto it, so I decided I should probably put some observability in place.

Not having visibility into the cluster was actually a little odd, considering that even my fish tank can page me. I don’t need (or want) the cluster to be able to generate pages, but I do still want the underlying metrics, if only for capacity planning.

In this post I talk about deploying and automatically pre-configuring an InfluxDB OSS 2.x instance (with optional EDR) and Telegraf to collect metrics for visualization with Grafana.

The plan

First, a quick overview of the plan:

  • Create a dedicated namespace (called monitoring)
  • Deploy and preconfigure an InfluxDB OSS 2.x instance to write metrics into
  • Deploy Telegraf as a DaemonSet to run on every node using both the kubernetes and kube_inventory plugins
  • Graph out metrics in Grafana (I already had a Grafana instance—see here if you need to roll one out)

Because this is a home lab, I’m going to commit an ops-sin and deploy InfluxDB into the same cluster that I’m using it to monitor.

If you’re doing this in production, you should ensure that metrics are available during an outage by deploying InfluxDB elsewhere, using InfluxDB Cloud, or configuring Edge Data Replication (EDR) so that the data is replicated out of the cluster.

In the steps below, I’ve made it easy to configure EDR (no changes are needed if you don’t want it; just don’t create that secret).

Throughout this doc, I’m going to include snippets of YAML. The assumption is that you’ll append them to a file (mine’s called influxdb-kubemonitoring.yml) ready for feeding into kubectl at various points.

For those in a hurry, there’s also a copy of the manifests below in my article_scripts repo.

Why InfluxDB?

Using InfluxDB and Telegraf offers a number of benefits:

  • Easy off-site replication (via EDR)
  • Ease of setup: Telegraf has everything rolled in
  • Metrics can be queried with InfluxQL (and/or SQL if replicated into v3)

Because Telegraf can buffer data and InfluxDB can accept writes into the past, if something were to happen to the cluster, there’s the potential to gain retrospective visibility into its state when later analysing the cause of an incident.

And, of course, there’s familiarity: all of my metrics go into InfluxDB, so it makes most sense to use the solution I’m most comfortable with.

Namespace

Everything will be deployed into a namespace called monitoring, so first we need to create that:
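
    apiVersion: v1
    kind: Namespace
    metadata:
      name: monitoring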

Secret creation

We’re going to need to define a secret containing several items, including:

  • Admin credentials for the InfluxDB instance
  • A token to create in the InfluxDB instance
  • The name of the InfluxDB org
  • Credentials for two non-privileged users

These will be used to create the accounts within InfluxDB.

I use gen_passwd to generate passwords, but use whatever suits you.

In order to create a random token, I used:
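
The exact command doesn't matter; any random-string generator will do. For example, with openssl:

    openssl rand -hex 32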

Create a secret called influxdb-info:
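
The key names below are illustrative (they just need to match whatever the later manifests reference); kubestatsro is the read-only user we'll point Grafana at later, and the write-only username is a placeholder. Note that the monitoring namespace needs to exist before this will work, so apply the namespace manifest (or run kubectl create namespace monitoring) first:

    kubectl create secret generic influxdb-info \
      --namespace monitoring \
      --from-literal=influxdb-user=admin \
      --from-literal=influxdb-password='<admin password>' \
      --from-literal=influxdb-token='<token generated above>' \
      --from-literal=influxdb-org=home \
      --from-literal=ro-user=kubestatsro \
      --from-literal=ro-password='<read-only password>' \
      --from-literal=wo-user=kubestatswo \
      --from-literal=wo-password='<write-only password>'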

Storage

InfluxDB needs some persistent storage so our metrics remain available even after a pod is rolled.

Generally, I use NFS to access my NAS, so I defined an NFS-backed PersistentVolume and an associated PersistentVolumeClaim:
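
A sketch of the pair; the NFS server, export path, and size are placeholders for your own values:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: influxdb-pv
    spec:
      capacity:
        storage: 20Gi
      accessModes:
        - ReadWriteOnce
      persistentVolumeReclaimPolicy: Retain
      nfs:
        server: nas.example.com
        path: /volume1/influxdb
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: influxdb-pvc
      namespace: monitoring
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: ""
      volumeName: influxdb-pv
      resources:
        requests:
          storage: 20Gi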

Optional: Edge Data Replication

As we’re deploying InfluxDB into the monitored cluster, to help ensure continuity, we might want to replicate the data onward (to InfluxDB cloud or another OSS instance) so that it also exists outside the cluster.

OSS 2.X has a feature called Edge Data Replication—this involves defining a remote (i.e., where we want to replicate to) and then specifying which bucket should be replicated.

To do this, you will need:

  • The URL of your remote instance
  • An authentication token for the remote instance
  • The org ID to use with the remote instance
  • The ID of the remote bucket you want to write into

If you decide not to configure EDR and want to add it later, you can always enable it manually.

To enable EDR, we’re going to define a secret to hold the information necessary to configure it:
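
Again, the key names are only illustrative; the init script sketched later reads them, so keep the two consistent (the URL below is just an example InfluxDB Cloud endpoint):

    kubectl create secret generic influxdb-edr \
      --namespace monitoring \
      --from-literal=edr-url='https://eu-central-1-1.aws.cloud2.influxdata.com' \
      --from-literal=edr-token='<remote API token>' \
      --from-literal=edr-orgid='<remote org ID>' \
      --from-literal=edr-bucketid='<remote bucket ID>'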

The manifests we create later in this post will reference these values (they’re optional, so nothing should error if the secret isn’t defined).

Preconfiguring InfluxDB Auth

The InfluxDB container supports pre-configuration of credentials via environment variables, so we’re going to pass in references to the secret that we created earlier.

However, there is still a catch: although the image allows us to pass a known token value (via DOCKER_INFLUXDB_INIT_ADMIN_TOKEN), it's used to create the operator token.

The operator token is an extremely privileged credential, and not something that we really want to be passing into Grafana and Telegraf.

We need to pre-populate some less privileged credentials, but neither the image nor InfluxDB provides a means to create a (non-operator) token with a known value. We could write a script to call the API and mint a pair of tokens before writing them into a k8s secret, but we’d then need to give something the ability to update secrets, which is somewhat less than ideal.

Instead, we can create a username/password pair that can then be used with the v1 API.

The InfluxDB image can run arbitrary init scripts, but it only triggers them if InfluxDB's data files don't already exist, so there's no risk of them accidentally being re-run.

We’re going to define a ConfigMap to store a small script that will create a DBRP policy and a pair of non-privileged users (one read-only, one write-only). The script also configures EDR if we’ve chosen to enable it.
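
A sketch of that ConfigMap follows. It relies on variables that the official image exposes to init scripts (DOCKER_INFLUXDB_INIT_ORG and DOCKER_INFLUXDB_INIT_BUCKET_ID), plus some extra variables (INFLUXDB_RO_*, INFLUXDB_WO_*, EDR_*) that the Deployment below wires in from our secrets; those names, and the exact influx CLI flags, are assumptions worth checking against your CLI version rather than taking as gospel:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: influxdb-init
      namespace: monitoring
    data:
      setup.sh: |
        #!/bin/bash
        set -e

        # Map the v1 database name "telegraf" onto the bucket created at setup,
        # so the v1 query and write APIs work
        influx v1 dbrp create \
          --bucket-id "${DOCKER_INFLUXDB_INIT_BUCKET_ID}" \
          --db telegraf --rp autogen --default

        # Read-only user (used by Grafana via the v1 query API)
        influx v1 auth create \
          --username "${INFLUXDB_RO_USER}" --password "${INFLUXDB_RO_PASSWORD}" \
          --read-bucket "${DOCKER_INFLUXDB_INIT_BUCKET_ID}" \
          --org "${DOCKER_INFLUXDB_INIT_ORG}"

        # Write-only user (used by Telegraf via the v1 write API)
        influx v1 auth create \
          --username "${INFLUXDB_WO_USER}" --password "${INFLUXDB_WO_PASSWORD}" \
          --write-bucket "${DOCKER_INFLUXDB_INIT_BUCKET_ID}" \
          --org "${DOCKER_INFLUXDB_INIT_ORG}"

        # If the (optional) EDR secret exists, configure replication
        if [ -n "${EDR_URL}" ]; then
          influx remote create \
            --name edr \
            --remote-url "${EDR_URL}" \
            --remote-api-token "${EDR_TOKEN}" \
            --remote-org-id "${EDR_ORG_ID}"

          remote_id=$(influx remote list --name edr --hide-headers | awk '{print $1}')

          influx replication create \
            --name edr \
            --remote-id "${remote_id}" \
            --local-bucket-id "${DOCKER_INFLUXDB_INIT_BUCKET_ID}" \
            --remote-bucket-id "${EDR_BUCKET_ID}"
        fi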

Deploying InfluxDB

Next, we need to deploy InfluxDB itself.

Things to note about this definition are:

  • Variables DOCKER_INFLUXDB_INIT_BUCKET and DOCKER_INFLUXDB_INIT_RETENTION tell InfluxDB to create a bucket called telegraf with a retention policy of 90 days.
  • My NFS share is configured to squash permissions, so I included a securityContext to use the appropriate UID.
  • We make the various values from our secrets available via environment variables.
  • As well as the PVC, we mount the configmap containing the init script into the container.
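
A sketch of the Deployment, using the illustrative secret keys, ConfigMap, and PVC names from earlier (the image tag and the UID in the securityContext are placeholders; use whatever your NFS server squashes to):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: influxdb
      namespace: monitoring
      labels:
        app: influxdb
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: influxdb
      template:
        metadata:
          labels:
            app: influxdb
        spec:
          securityContext:
            runAsUser: 1000   # placeholder: match the UID your NFS server squashes to
            fsGroup: 1000
          containers:
            - name: influxdb
              image: influxdb:2.7
              ports:
                - containerPort: 8086
              env:
                # Trigger the image's automated setup on first boot
                - name: DOCKER_INFLUXDB_INIT_MODE
                  value: setup
                # Create a bucket called telegraf with 90 day retention
                - name: DOCKER_INFLUXDB_INIT_BUCKET
                  value: telegraf
                - name: DOCKER_INFLUXDB_INIT_RETENTION
                  value: 90d
                # Admin credentials, org and operator token from our secret
                - name: DOCKER_INFLUXDB_INIT_USERNAME
                  valueFrom:
                    secretKeyRef: {name: influxdb-info, key: influxdb-user}
                - name: DOCKER_INFLUXDB_INIT_PASSWORD
                  valueFrom:
                    secretKeyRef: {name: influxdb-info, key: influxdb-password}
                - name: DOCKER_INFLUXDB_INIT_ORG
                  valueFrom:
                    secretKeyRef: {name: influxdb-info, key: influxdb-org}
                - name: DOCKER_INFLUXDB_INIT_ADMIN_TOKEN
                  valueFrom:
                    secretKeyRef: {name: influxdb-info, key: influxdb-token}
                # Non-privileged users for the init script to create
                - name: INFLUXDB_RO_USER
                  valueFrom:
                    secretKeyRef: {name: influxdb-info, key: ro-user}
                - name: INFLUXDB_RO_PASSWORD
                  valueFrom:
                    secretKeyRef: {name: influxdb-info, key: ro-password}
                - name: INFLUXDB_WO_USER
                  valueFrom:
                    secretKeyRef: {name: influxdb-info, key: wo-user}
                - name: INFLUXDB_WO_PASSWORD
                  valueFrom:
                    secretKeyRef: {name: influxdb-info, key: wo-password}
                # Optional EDR settings: absent unless the influxdb-edr secret exists
                - name: EDR_URL
                  valueFrom:
                    secretKeyRef: {name: influxdb-edr, key: edr-url, optional: true}
                - name: EDR_TOKEN
                  valueFrom:
                    secretKeyRef: {name: influxdb-edr, key: edr-token, optional: true}
                - name: EDR_ORG_ID
                  valueFrom:
                    secretKeyRef: {name: influxdb-edr, key: edr-orgid, optional: true}
                - name: EDR_BUCKET_ID
                  valueFrom:
                    secretKeyRef: {name: influxdb-edr, key: edr-bucketid, optional: true}
              volumeMounts:
                - name: influxdb-data
                  mountPath: /var/lib/influxdb2
                - name: influxdb-init
                  mountPath: /docker-entrypoint-initdb.d
          volumes:
            - name: influxdb-data
              persistentVolumeClaim:
                claimName: influxdb-pvc
            - name: influxdb-init
              configMap:
                name: influxdb-init
                defaultMode: 0755   # ensure the init script is executable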

The Deployment will need to be fronted by a service, so we define that next, passing port 8086 through:
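
    apiVersion: v1
    kind: Service
    metadata:
      name: influxdb
      namespace: monitoring
    spec:
      selector:
        app: influxdb
      ports:
        - name: http
          port: 8086
          targetPort: 8086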

If we apply our work so far:
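
    kubectl apply -f influxdb-kubemonitoring.yml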

We should now be able to retrieve service details:
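
    kubectl get svc -n monitoring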

It should now also be possible to use the cluster IP to run a query against InfluxDB (replace $INFLUX_TOKEN with the token you generated when first defining the secret).
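
For example, a quick Flux query via the v2 API (the org name here is whatever you stored in the secret; mine is just the placeholder used earlier):

    curl -s "http://[cluster ip]:8086/api/v2/query?org=home" \
      --header "Authorization: Token $INFLUX_TOKEN" \
      --header "Content-Type: application/vnd.flux" \
      --data 'buckets()'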

We should also be able to use the read-only creds with the v1 query API:
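
At this point the database will still be empty (Telegraf comes next), but a successful response confirms the credentials work ($RO_PASSWORD stands in for the read-only password from the secret):

    curl -sG "http://[cluster ip]:8086/query" \
      -u "kubestatsro:$RO_PASSWORD" \
      --data-urlencode "db=telegraf" \
      --data-urlencode "q=SHOW MEASUREMENTS"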

Deploying Telegraf

Deploying Telegraf into Kubernetes is really easy. However, authorizing it to fetch information from the Kubernetes APIs involves interacting with the k8s auth model, which is a bit more complex.

First, we’re going to define a service account called telegraf and then authorize it to talk to both the kubelet and the cluster APIs.

Define the service account:
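
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: telegraf
      namespace: monitoring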

Define a pair of cluster roles:
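
Something along these lines should do it: the first covers the kubelet's stats endpoints used by the kubernetes plugin, the second the cluster resources read by kube_inventory (the role names and resource lists are illustrative; trim them to match the plugins you actually enable):

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: telegraf-stats
    rules:
      # Access to the kubelet's stats endpoints (inputs.kubernetes)
      - apiGroups: [""]
        resources: ["nodes/proxy", "nodes/stats"]
        verbs: ["get"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: telegraf-inventory
    rules:
      # Cluster resources read by inputs.kube_inventory
      - apiGroups: [""]
        resources: ["nodes", "namespaces", "pods", "services", "endpoints", "persistentvolumes", "persistentvolumeclaims"]
        verbs: ["get", "list", "watch"]
      - apiGroups: ["apps"]
        resources: ["daemonsets", "deployments", "statefulsets"]
        verbs: ["get", "list", "watch"]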

Note: Technically, you could probably combine those, but as they give access to different things, I prefer to keep them separate.

Finally, create the role bindings to link both back to the Service Account:
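
One binding per role, both pointing at the telegraf service account (names match the sketch above):

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: telegraf-stats
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: telegraf-stats
    subjects:
      - kind: ServiceAccount
        name: telegraf
        namespace: monitoring
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: telegraf-inventory
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: telegraf-inventory
    subjects:
      - kind: ServiceAccount
        name: telegraf
        namespace: monitoring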

With that unpleasantness out of the way, we’re ready to configure and deploy Telegraf.

As the Telegraf config is simple, we drop it into a ConfigMap.

Note: There’s no need to replace the variables by hand; we’ll be setting them in the container config.
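
A sketch of that ConfigMap is below. It assumes Telegraf writes via the v1-compatible output using the write-only user (so the operator token never has to leave InfluxDB), and that the variable names line up with those set in the DaemonSet further down; check the plugin READMEs if you want to tune the inputs:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: telegraf-config
      namespace: monitoring
    data:
      telegraf.conf: |
        [agent]
          hostname = "${HOSTNAME}"

        # Write via the v1-compatible endpoint using the write-only user
        [[outputs.influxdb]]
          urls = ["${MONITOR_HOST}"]
          database = "telegraf"
          username = "${INFLUX_USER}"
          password = "${INFLUX_PASSWORD}"
          skip_database_creation = true

        # Per-node metrics from the local kubelet
        [[inputs.kubernetes]]
          url = "https://${HOSTIP}:10250"
          bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
          insecure_skip_verify = true

        # Cluster-level inventory from the Kubernetes API
        [[inputs.kube_inventory]]
          url = "https://kubernetes.default.svc"
          bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
          namespace = ""
          insecure_skip_verify = true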

Then, we’re ready to define our DaemonSet.

There are a couple of important points here:

  • We’re exposing the node IP via env var HOSTIP
  • We’re setting the env var HOSTNAME to the node Hostname via spec.nodeName
  • We’re passing in credentials from the secret we created earlier
  • We’re mapping directories from the host into the container
  • We’re mounting our configmap as config
  • We set MONITOR_HOST to use the name of the service we created for InfluxDB (http://influxdb:8086).
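
A sketch of the DaemonSet covering those points (the image tag, host mounts, and secret key names follow the assumptions made earlier; the /proc and /sys mounts only matter if you later enable host-level inputs such as cpu, mem, or disk):

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: telegraf
      namespace: monitoring
      labels:
        app: telegraf
    spec:
      selector:
        matchLabels:
          app: telegraf
      template:
        metadata:
          labels:
            app: telegraf
        spec:
          serviceAccountName: telegraf
          containers:
            - name: telegraf
              image: telegraf:1.30
              env:
                # Node IP, used by inputs.kubernetes to reach the local kubelet
                - name: HOSTIP
                  valueFrom:
                    fieldRef: {fieldPath: status.hostIP}
                # Node name, reported as the hostname
                - name: HOSTNAME
                  valueFrom:
                    fieldRef: {fieldPath: spec.nodeName}
                # Where to write metrics: the InfluxDB service created earlier
                - name: MONITOR_HOST
                  value: "http://influxdb:8086"
                # Write-only credentials from the secret created earlier
                - name: INFLUX_USER
                  valueFrom:
                    secretKeyRef: {name: influxdb-info, key: wo-user}
                - name: INFLUX_PASSWORD
                  valueFrom:
                    secretKeyRef: {name: influxdb-info, key: wo-password}
                # Only needed if you enable host-level input plugins
                - name: HOST_PROC
                  value: /host/proc
                - name: HOST_SYS
                  value: /host/sys
              volumeMounts:
                - name: config
                  mountPath: /etc/telegraf
                - name: proc
                  mountPath: /host/proc
                  readOnly: true
                - name: sys
                  mountPath: /host/sys
                  readOnly: true
          volumes:
            - name: config
              configMap:
                name: telegraf-config
            - name: proc
              hostPath:
                path: /proc
            - name: sys
              hostPath:
                path: /sys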

If we apply the updated config:
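
    kubectl apply -f influxdb-kubemonitoring.yml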

Telegraf should spring to life and start collecting metrics from the Kubernetes APIs.

We can check by tailing Telegraf’s logs:
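
kubectl can tail the DaemonSet directly (it will pick one of the pods; name a specific pod if you want a particular node):

    kubectl logs -n monitoring -f daemonset/telegraf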

If it’s not logging errors, then everything is working.

Checking for metrics

We can confirm that metrics are arriving in InfluxDB by logging into its web interface.

Assuming that you deployed into the cluster, grab the cluster IP by running the following:
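
    kubectl get svc -n monitoring influxdb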

Then, visit http://[cluster ip]:8086 in a browser.

You should be able to log in with the credentials you created at the beginning of this process.

If you browse to the Data Explorer, you should see a bunch of Kubernetes-related measurements.

Dashboarding

With metrics coming into InfluxDB, we now just need a dashboard to help visualize things. I do all of my dashboarding in Grafana so that I only have to go to one place to visualize metrics from a wide variety of sources.

We’ll need to tell Grafana how to speak to our new InfluxDB instance, so:

  • Burger Menu
  • Connections
  • Connect Data
  • Search for InfluxDB

When adding the new InfluxDB datasource in Grafana, you’ll need to provide the following:

  • Language: InfluxQL (because we’re using the v1 API)
  • URL: If InfluxDB and Grafana are running in the same cluster, you can simply use the InfluxDB service name. Otherwise you’ll need to provide an IP or domain name

Use the following settings in Grafana:

  • Database: telegraf
  • User: kubestatsro
  • Password: the password we defined earlier

Once the datasource has been created, we can start to pull stats out with queries like:
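
The following charts per-pod memory usage (the measurement and field names come from Telegraf's kubernetes input; $timeFilter and $__interval are Grafana's InfluxQL macros):

    SELECT mean("memory_usage_bytes")
    FROM "kubernetes_pod_container"
    WHERE $timeFilter
    GROUP BY time($__interval), "pod_name"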

Ultimately, we are building a single dashboard that allows us to view resource usage at node, namespace, and pod levels.

For example, we can see that my Atuin sync server has been assigned far more resources than it's actually using (potentially preventing other workloads from being scheduled on that node).

An importable copy of the dashboard can be found on GitHub. Because it uses InfluxQL, it should be compatible with InfluxDB 1.x, 2.x, and 3.x.

Conclusion

If you’ve been following along, you should now have an InfluxDB instance receiving metrics about each resource in your Kubernetes cluster. If you enabled EDR, these metrics will be replicated to an external instance in order to help ensure continuity of monitoring during outages.

I now have the means to see a bit more about what’s going on inside my cluster, including the information I need to ensure that resource requests aren’t being set too high, unnecessarily limiting capacity in the process.

Although YAML’s verbosity makes it seem like a lot more than it is, it didn’t take much to get InfluxDB and Telegraf up and running in the cluster.

The auto-provisioning of credentials and configuration means that if I ever want to start over, I can simply wipe the PVCs and roll the pods—the image will recreate the credentials, and everything will come back up.

To get started on this or another InfluxDB project, sign up for a free cloud account now.