Deploying InfluxDB and Telegraf to Monitor Kubernetes

I run a small Kubernetes cluster at home, which I originally set up as somewhere to experiment.

Because it started as a playground, I never bothered to set up monitoring. However, as time passed, I’ve ended up dropping more production-esque workloads onto it, so I decided I should probably put some observability in place.

Not having visibility into the cluster was actually a little odd, considering that even my fish tank can page me. I don’t need (or want) the cluster to be able to generate pages, but I do still want the underlying metrics, if only for capacity planning.

In this post I talk about deploying and automatically pre-configuring an InfluxDB OSS 2.x instance (with optional EDR) and Telegraf to collect metrics for visualization with Grafana.

The plan

First, a quick overview of the plan:

  • Create a dedicated namespace (called monitoring)
  • Deploy and preconfigure an InfluxDB OSS 2.x instance to write metrics into
  • Deploy Telegraf as a DaemonSet to run on every node using both the kubernetes and kube_inventory plugins
  • Graph out metrics in Grafana (I already had a Grafana instance—see here if you need to roll one out)

Because this is a home lab, I’m going to commit an ops-sin and deploy InfluxDB into the same cluster that I’m using it to monitor.

If you’re doing this in production, you should ensure that metrics are available during an outage by deploying InfluxDB elsewhere, using InfluxDB Cloud, or configuring Edge Data Replication (EDR) so that the data is replicated out of the cluster.

In the steps below, I’ve made it easy to configure EDR (no changes are needed if you don’t want it; just don’t create that secret).

Throughout this doc, I’m going to include snippets of YAML. The assumption is that you’ll append them to a file (mine’s called influxdb-kubemonitoring.yml) ready for feeding into kubectl at various points.

For those in a hurry, there’s also a copy of the manifests below in my article_scripts repo.

Why InfluxDB?

Using InfluxDB and Telegraf offers a number of benefits:

  • Easy off-site replication (via EDR)
  • Ease of setup: Telegraf has everything rolled in
  • Metrics can be queried with InfluxQL (and/or SQL if replicated into v3)

Because Telegraf can buffer data and InfluxDB can accept writes into the past, if something were to happen to the cluster, there’s the potential to gain retrospective visibility into its state when later analysing the cause of an incident.

And, of course, there’s familiarity: all of my metrics go into InfluxDB, so it makes most sense to use the solution I’m most comfortable with.

Namespace

Everything will be deployed into a namespace called monitoring, so first we need to create that:

kubectl create namespace monitoring

Secret creation

We’re going to need to define a secret containing several items, including:

  • Admin credentials for the InfluxDB instance
  • A token to create in the InfluxDB instance
  • The name of the InfluxDB org
  • Credentials for two non-privileged users

These will be used to create the accounts within InfluxDB.

I use gen_passwd to generate passwords, but use whatever suits you.

In order to create a random token, I used:

gen_passwd 36 | base64
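If you don't have gen_passwd to hand, anything that produces a random string will do; openssl (shown here purely as an alternative) works just as well:

openssl rand -hex 32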

Create a secret called influxdb-info:

kubectl -n monitoring create secret generic influxdb-info \
--from-literal=user=influxadmin \
--from-literal=password='MYPASSWORD' \
--from-literal=org="kubernetes" \
--from-literal=token="MYTOKEN" \
--from-literal=readuser="kubestatsro" \
--from-literal=readpass="CHANGEME" \
--from-literal=writeuser="kubestatsw" \
--from-literal=writepass='CHANGEME'

Storage

InfluxDB needs some persistent storage so our metrics remain available even after a pod is rolled.

Generally, I use NFS to access my NAS, so I defined an NFS-backed persistent volume and an associated claim:

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: influxdb-pv
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  storageClassName: nfs
  nfs:
    server: 192.168.3.233
    path: "/volume1/kubernetes_misc_mounts"
    readOnly: false
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: influxdb-pvc
  namespace: monitoring
  labels:
    app: influxdb
spec:
  storageClassName: nfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
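Once the manifests are applied (we do that a little further down), you can check that the claim has bound to the volume:

kubectl -n monitoring get pvc influxdb-pvc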

Optional: Edge Data Replication

As we’re deploying InfluxDB into the monitored cluster, to help ensure continuity, we might want to replicate the data onward (to InfluxDB cloud or another OSS instance) so that it also exists outside the cluster.

OSS 2.x has a feature called Edge Data Replication—this involves defining a remote (i.e., where we want to replicate to) and then specifying which bucket should be replicated.

To do this, you will need:

  • The URL of your remote instance
  • An authentication token for the remote instance
  • The org ID to use with the remote instance
  • The ID of the remote bucket you want to write into

If you decide not to configure EDR and want to add it later, you can always enable it manually.

To enable EDR, we’re going to define a secret to hold the information necessary to configure it:

kubectl -n monitoring create secret generic upstream-influxdb \
--from-literal=url='<URL>' \
--from-literal=org="<ORG>" \
--from-literal=token="<TOKEN>" \
--from-literal=bucket="<BUCKET>"

The manifests we create later in this post will reference these values (they’re optional, so nothing should error if the secret isn’t defined).

Preconfiguring InfluxDB Auth

The InfluxDB container supports pre-configuration of credentials via environment variables, so we’re going to pass in references to the secret that we created earlier.

However, there is still a catch: although the image allows us to pass a known token value (via DOCKER_INFLUXDB_INIT_ADMIN_TOKEN), that value is used to create the operator token.

The operator token is an extremely privileged credential, and not something that we really want to be passing into Grafana and Telegraf.

We need to pre-populate some less privileged credentials, but neither the image nor InfluxDB provides a means to create a (non-operator) token with a known value. We could write a script to call the API and mint a pair of tokens before writing them into a k8s secret, but we’d then need to give something the ability to update secrets, which is somewhat less than ideal.

Instead, we can create a username/password pair that can then be used with the v1 API.

The InfluxDB image can run arbitrary init scripts, but it only triggers them during initial setup (i.e., when InfluxDB's data files don't already exist), so there's no risk of accidental re-runs.

We’re going to define a ConfigMap to store a small script that will create a DBRP policy and a pair of non-privileged users (one read-only, one write-only). The script also configures EDR if we’ve chosen to enable it.

apiVersion: v1
kind: ConfigMap
metadata:
  name: influx-monitoring-files
  namespace: monitoring
data:
  00_setup_dbrp: |
    #!/bin/bash
    #
    # Create the DBRP policy
    # This isn't _strictly_ necessary as more
    # recent versions auto-map, but as we're here
    # anyway let's be explicit
    influx v1 dbrp create \
      --bucket-id ${DOCKER_INFLUXDB_INIT_BUCKET_ID} \
      --db telegraf \
      --rp autogen \
      --default \
      --org ${DOCKER_INFLUXDB_INIT_ORG}

    # Create the write user
    influx v1 auth create \
      --username ${V1_WRITE_USERNAME} \
      --password ${V1_WRITE_PASSWORD} \
      --write-bucket ${DOCKER_INFLUXDB_INIT_BUCKET_ID} \
      --org ${DOCKER_INFLUXDB_INIT_ORG}

    # Create the read user
    influx v1 auth create \
      --username ${V1_READ_USERNAME} \
      --password ${V1_READ_PASSWORD} \
      --read-bucket ${DOCKER_INFLUXDB_INIT_BUCKET_ID} \
      --org ${DOCKER_INFLUXDB_INIT_ORG}

    if [ ! "$UPSTREAM_URL" == "" ]
    then
        # Configure EDR
        # Create the remote and capture the ID
        REMOTE_ID=`influx remote create \
          --name replicated \
          --remote-url "$UPSTREAM_URL" \
          --remote-api-token "$UPSTREAM_TOKEN" \
          --remote-org-id "$UPSTREAM_ORG" | tail -n1 | awk '{print $1}'`

        # Set up replication
        influx replication create \
          --name replicated_data \
          --remote-id $REMOTE_ID \
          --local-bucket-id "${DOCKER_INFLUXDB_INIT_BUCKET_ID}" \
          --remote-bucket "$UPSTREAM_BUCKET"
    fi

Deploying InfluxDB

Next, we need to deploy InfluxDB itself.

Things to note about this definition are:

  • Variables DOCKER_INFLUXDB_INIT_BUCKET and DOCKER_INFLUXDB_INIT_RETENTION tell InfluxDB to create a bucket called telegraf with a retention policy of 90 days.
  • My NFS share is configured to squash permissions, so I included a securityContext to use the appropriate UID.
  • We make the various values from our secret available via environment variables.
  • As well as the PVC, we mount the configmap containing the init script into the container.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: influxdb
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: influxdb
  template:
    metadata:
      labels:
        app: influxdb
    spec:
      # Optional
      # run as the UID that the NAS squashes to
      securityContext:
        fsGroup: 100
        runAsUser: 1024
        runAsGroup: 100
      containers:
      - env:
        - name: DOCKER_INFLUXDB_INIT_MODE
          value: "setup"
        # Provide account details using the secret
        - name: DOCKER_INFLUXDB_INIT_USERNAME
          valueFrom:
            secretKeyRef:
              name: influxdb-info
              key: user
              optional: false
        - name: DOCKER_INFLUXDB_INIT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: influxdb-info
              key: password
              optional: false
        - name: DOCKER_INFLUXDB_INIT_ORG
          valueFrom:
            secretKeyRef:
              name: influxdb-info
              key: org
              optional: false
        - name: V1_READ_USERNAME
          valueFrom:
            secretKeyRef:
              name: influxdb-info
              key: readuser
              optional: false
        - name: V1_READ_PASSWORD
          valueFrom:
            secretKeyRef:
              name: influxdb-info
              key: readpass
              optional: false
        - name: V1_WRITE_USERNAME
          valueFrom:
            secretKeyRef:
              name: influxdb-info
              key: writeuser
              optional: false
        - name: V1_WRITE_PASSWORD
          valueFrom:
            secretKeyRef:
              name: influxdb-info
              key: writepass
              optional: false
        - name: DOCKER_INFLUXDB_INIT_ADMIN_TOKEN
          valueFrom:
            secretKeyRef:
              name: influxdb-info
              key: token
              optional: false
        # These are optional and will be used if EDR is being
        # configured
        - name: UPSTREAM_URL
          valueFrom:
            secretKeyRef:
              name: upstream-influxdb
              key: url
              optional: true
        - name: UPSTREAM_ORG
          valueFrom:
            secretKeyRef:
              name: upstream-influxdb
              key: org
              optional: true
        - name: UPSTREAM_TOKEN
          valueFrom:
            secretKeyRef:
              name: upstream-influxdb
              key: token
              optional: true
        - name: UPSTREAM_BUCKET
          valueFrom:
            secretKeyRef:
              name: upstream-influxdb
              key: bucket
              optional: true
        # Define the bucket to auto-create
        - name: DOCKER_INFLUXDB_INIT_BUCKET
          value: "telegraf"
        - name: DOCKER_INFLUXDB_INIT_RETENTION
          value: "90d"
        image: influxdb:2.7
        name: influxdb2
        ports:
        - containerPort: 8086
          name: http-influxport
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 1Gi
          requests:
            cpu: 250m
            memory: 1Gi
        volumeMounts:
        # Mount two paths from the pvc
        - mountPath: /etc/influxdb2
          name: influxdb-pvc
          subPath: "influxdb/config"
        - mountPath: /var/lib/influxdb2
          name: influxdb-pvc
          subPath: "influxdb/data"
        # Mount the configmap as an init script
        - mountPath: /docker-entrypoint-initdb.d/create_dbrp.sh
          name: mon-files
          subPath: "00_setup_dbrp"
      volumes:
      - name: influxdb-pvc
        persistentVolumeClaim:
          claimName: influxdb-pvc
      - name: mon-files
        configMap:
          name: influx-monitoring-files
          defaultMode: 0755

The Deployment will need to be fronted by a service, so we define that next, passing port 8086 through:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: influxdb
  name: influxdb
  namespace: monitoring
spec:
  type: LoadBalancer
  sessionAffinity: None
  ports:
  - port: 8086
    name: influxapi
    protocol: TCP
    targetPort: http-influxport
  selector:
    app: influxdb

If we apply our work so far:

kubectl apply -f influxdb-kubemonitoring.yml
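Before querying anything, it's worth checking that the InfluxDB pod has actually rolled out and that the init script ran without complaint (the deployment name matches the manifest above):

kubectl -n monitoring rollout status deployment/influxdb
kubectl -n monitoring logs deploy/influxdb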

We should now be able to retrieve service details:

$ kubectl -n monitoring get svc -l app=influxdb
NAME       TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
influxdb   LoadBalancer   10.105.178.13   <pending>     8086:31986/TCP   3m10s

It should now also be possible to use the cluster IP to run a query against InfluxDB (replace $INFLUX_TOKEN with the token you generated when first defining the secret).

$ curl \
-H "Authorization: Token $INFLUX_TOKEN" \
"http://10.105.178.13:8086/api/v2/query?org=kubernetes" \
-H "content-type: application/vnd.flux" \
-d 'buckets()'
,result,table,name,id,organizationID,retentionPolicy,retentionPeriod
,_result,0,_monitoring,6d9185f72df6fc9f,625260941b0ecaf8,,604800000000000
,_result,0,_tasks,434c2d9c07d15d2a,625260941b0ecaf8,,259200000000000
,_result,0,telegraf,98b46be7a5f243d6,625260941b0ecaf8,,7776000000000000

We should also be able to use the read-only creds with the v1 query API:

INFLUX_PASS="set me to kubestatsro password"
curl \
-u kubestatsro:"$INFLUX_PASS" \
"http://10.105.178.13:8086/query?q=show+databases"
{"results":[{"statement_id":0,"series":[{"name":"databases","columns":["name"],"values":[["telegraf"]]}]}]}
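If you want to double-check what the init script actually created (or confirm that EDR came up, if you created the upstream-influxdb secret), you can exec into the pod and use the influx CLI. Depending on whether the CLI config written during setup was persisted, you may need to pass the operator token with -t:

kubectl -n monitoring exec deploy/influxdb -- influx v1 dbrp list
kubectl -n monitoring exec deploy/influxdb -- influx v1 auth list
# Only meaningful if EDR was configured
kubectl -n monitoring exec deploy/influxdb -- influx replication list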

Deploying Telegraf

Deploying Telegraf into Kubernetes is really easy. However, authorizing it to fetch information from the Kubernetes APIs involves interacting with the k8s auth model, which is a bit more complex.

First, we’re going to define a service account called telegraf and then authorize that to talk to both kubelet and cluster APIs.

Define the service account:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: telegraf
  namespace: monitoring
  labels:
    app: telegraf

Define a pair of cluster roles:

---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: influx-stats-viewer
  namespace: monitoring
  labels:
    app: telegraf
rules:
  - apiGroups: ["metrics.k8s.io"]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["nodes/proxy", "nodes/stats", "persistentvolumes", "nodes", "secrets"]
    verbs: ["get", "list", "watch"]
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: influx:telegraf
aggregationRule:
  clusterRoleSelectors:
    - matchLabels:
        app: telegraf
    - matchLabels:
        rbac.authorization.k8s.io/aggregate-to-view: "true"
rules: [] # Rules are automatically filled in by the controller manager.

Note: Technically, you could probably combine those, but as they give access to different things, I prefer to keep them separate.

Finally, create the role bindings to link both back to the Service Account:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: influx:telegraf:viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: influx:telegraf
subjects:
  - kind: ServiceAccount
    name: telegraf
    namespace: monitoring
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: metric-scanner-kubelet-api-admin
  labels:
    app: telegraf
subjects:
  - kind: ServiceAccount
    name: telegraf
    namespace: monitoring
roleRef:
  kind: ClusterRole
  name: system:kubelet-api-admin
  apiGroup: rbac.authorization.k8s.io
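Before moving on, it's easy to sanity-check the bindings by asking Kubernetes what the service account can do; both of these should come back with yes:

kubectl auth can-i get nodes --as=system:serviceaccount:monitoring:telegraf
kubectl auth can-i get pods --as=system:serviceaccount:monitoring:telegraf --all-namespaces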

With that unpleasantness out of the way, we’re ready to configure and deploy Telegraf.

As the Telegraf config is simple, we drop it into a ConfigMap.

Note: There’s no need to replace the variables by hand; we’ll be setting them in the container config.

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: telegraf
  namespace: monitoring
  labels:
    app: telegraf
data:
  telegraf.conf: |+
    [global_tags]
      env = "$ENV"

    [agent]
      hostname = "$HOSTNAME"

    [[outputs.influxdb]]
      urls = ["$MONITOR_HOST"] # required
      database = "$MONITOR_DATABASE" # required
      username = "$V1_WRITE_USER"
      password = "$V1_WRITE_PASSWORD"
      timeout = "5s"
      skip_database_creation = true

    [[inputs.cpu]]
      percpu = true
      totalcpu = true
      collect_cpu_time = false
      report_active = false

    [[inputs.disk]]
      ignore_fs = ["tmpfs", "devtmpfs", "devfs"]

    [[inputs.diskio]]
    [[inputs.kernel]]
    [[inputs.mem]]
    [[inputs.processes]]
    [[inputs.swap]]
    [[inputs.system]]

    [[inputs.kubernetes]]
      url = "https://$HOSTIP:10250"
      bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
      insecure_skip_verify = true

    [[inputs.kube_inventory]]
      namespace = "" # Collect from all namespaces
      bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"

Then, we’re ready to define our DaemonSet.

There are a couple of important points here:

  • We’re exposing the node IP via env var HOSTIP
  • We’re setting the env var HOSTNAME to the node’s name via spec.nodeName
  • We’re passing in credentials from the secret we created earlier
  • We’re mapping directories from the host into the container
  • We’re mounting our configmap as config
  • We set MONITOR_HOST to use the name of the service we created for InfluxDB (http://influxdb:8086).
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: telegraf
  namespace: monitoring
  labels:
    app: telegraf
spec:
  selector:
    matchLabels:
      app: telegraf
  template:
    metadata:
      labels:
        app: telegraf
    spec:
      serviceAccountName: telegraf
      containers:
      - name: telegraf
        image: telegraf:1.31
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 500m
            memory: 250Mi
        env:
        # Pass the node hostname through
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        # Pass the node IP through
        - name: HOSTIP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        # Mount some host directories so that telegraf can pull
        # process info etc
        - name: "HOST_PROC"
          value: "/rootfs/proc"
        - name: "HOST_SYS"
          value: "/rootfs/sys"
        - name: ENV
          value: "lab"
        # Provide credentials to use with the write API
        - name: V1_WRITE_USER
          valueFrom:
            secretKeyRef:
              name: influxdb-info
              key: writeuser
              optional: false
        - name: V1_WRITE_PASSWORD
          valueFrom:
            secretKeyRef:
              name: influxdb-info
              key: writepass
              optional: false
        # What URL can we find InfluxDB at?
        - name: MONITOR_HOST
          value: "http://influxdb:8086"
        # Which DB are we writing into?
        - name: MONITOR_DATABASE
          value: "telegraf"
        volumeMounts:
        - name: sys
          mountPath: /rootfs/sys
          readOnly: true
        - name: proc
          mountPath: /rootfs/proc
          readOnly: true
        - name: utmp
          mountPath: /var/run/utmp
          readOnly: true
        - name: config
          mountPath: /etc/telegraf
      # Allow Telegraf time to flush when terminating
      terminationGracePeriodSeconds: 30
      volumes:
      - name: sys
        hostPath:
          path: /sys
      - name: proc
        hostPath:
          path: /proc
      - name: utmp
        hostPath:
          path: /var/run/utmp
      - name: config
        configMap:
          name: telegraf

If we apply the updated config:

kubectl apply -f influxdb-kubemonitoring.yml

Telegraf should spring to life and start collecting metrics from the Kubernetes APIs.
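You can confirm that a Telegraf pod has been scheduled onto every node:

kubectl -n monitoring rollout status daemonset/telegraf
kubectl -n monitoring get pods -l app=telegraf -o wide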

We can check by tailing Telegraf’s logs:

kubectl -n monitoring logs -f daemonset/telegraf

If it’s not logging errors, then everything is working.
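If you'd like a more positive confirmation than an absence of errors, you can also exec into one of the pods and run a one-shot collection; --test prints the gathered metrics to stdout rather than writing them to InfluxDB (the config path is the one we mounted above):

kubectl -n monitoring exec ds/telegraf -- telegraf --config /etc/telegraf/telegraf.conf --test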

Checking for metrics

We can confirm that metrics are arriving in InfluxDB by logging into its web interface.

Assuming that you deployed into the cluster, grab the cluster IP by running the following:

kubectl -n monitoring get svc -l app=influxdb

Then, visit http://[cluster ip]:8086 in a browser.

You should be able to log in with the credentials you created at the beginning of this process.

If you browse to the Data Explorer, you should see a bunch of Kubernetes-related measurements.
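If you'd rather check from the command line, the same v2 query endpoint we used earlier can list the measurements that have landed in the telegraf bucket (swap in your own cluster IP):

curl \
-H "Authorization: Token $INFLUX_TOKEN" \
"http://10.105.178.13:8086/api/v2/query?org=kubernetes" \
-H "content-type: application/vnd.flux" \
-d 'import "influxdata/influxdb/schema"
schema.measurements(bucket: "telegraf")'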

Dashboarding

With metrics coming into InfluxDB, we now just need a dashboard to help visualize things. I do all of my dashboarding in Grafana so that I only have to go to one place to visualize metrics from a wide variety of sources.

We’ll need to tell Grafana how to speak to our new InfluxDB instance, so:

  • Burger Menu
  • Connections
  • Connect Data
  • Search for InfluxDB

When adding the new InfluxDB datasource in Grafana, you’ll need to provide the following:

  • Language: InfluxQL (because we’re using the v1 API)
  • URL: If InfluxDB and Grafana are running in the same cluster, you can simply use the InfluxDB service name. Otherwise you’ll need to provide an IP or domain name

Use the following settings in Grafana:

  • Database: telegraf
  • User: kubestatsro
  • Password: the password we defined earlier

Once the datasource has been created, we can start to pull stats out with queries like:

SELECT max(cpu_usage_nanocores) / 1000000000 AS usage
FROM kubernetes_pod_container
WHERE $timeFilter
  AND namespace = '${namespace}'
GROUP BY time($__interval), pod_name fill(null)
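Memory can be graphed the same way; the kubernetes input writes a memory_usage_bytes field into the same measurement (check the field name against your own data if nothing plots):

SELECT max(memory_usage_bytes) AS usage
FROM kubernetes_pod_container
WHERE $timeFilter
  AND namespace = '${namespace}'
GROUP BY time($__interval), pod_name fill(null)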

Ultimately, we are building a single dashboard that allows us to view resource usage at node, namespace, and pod levels.

For example, we can see that my Atuin sync server has been assigned far more resources than it’s actually using (potentially preventing other workloads from being scheduled on that node).

An importable copy of the dashboard can be found on GitHub. Because it uses InfluxQL, it should be compatible with InfluxDB 1.x, 2.x, and 3.x.

Conclusion

If you’ve been following along, you should now have an InfluxDB instance receiving metrics about each resource in your Kubernetes cluster. If you enabled EDR, these metrics will be replicated to an external instance in order to help ensure continuity of monitoring during outages.

I now have the means to see a bit more about what’s going on inside my cluster, including the information I need to ensure that resource requests aren’t being set too high, unnecessarily limiting capacity in the process.

Although YAML’s verbosity makes it seem like a lot more than it is, it didn’t take much to get InfluxDB and Telegraf up and running in the cluster.

The auto-provisioning of credentials and configuration means that if I ever want to start over, I can simply wipe the PVCs and roll the pods—the image will recreate the credentials, and everything will come back up.
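In practice, that reset looks something like this (names match the manifests above; the data itself lives on the NFS share backing the PV, so it needs clearing by hand):

# Stop InfluxDB so nothing is writing
kubectl -n monitoring scale deployment/influxdb --replicas=0
# Delete the influxdb/ directory on the NFS share backing the PV,
# then bring InfluxDB back up; the image re-runs setup and the init script
kubectl -n monitoring scale deployment/influxdb --replicas=1
# Restart Telegraf so it reconnects once credentials exist again
kubectl -n monitoring rollout restart daemonset/telegraf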

To get started on this or another project on InfluxDB, sign up for a free cloud account now.