Continuous Deployment of Telegraf Configurations

After I shared my “Using Telegraf as a Gateway” post with Rawkode, he mentioned a talk he gives, where he discusses advanced Telegraf topics, including a requirement he’s seen for automatically configuring running Telegraf instances just by editing config files in GitHub. Watch that talk here.

I had a couple of questions about Telegraf reloading its configs, which he answered, and I wondered whether I have all the toolkit components already running in my home LAN to get this working. I could look at implementing a Continuous Delivery or Release Automation application to do this, but I’ll look at using what I already have.

I found that I already use everything I need! So here’s a Continuous Deployment mechanism for Telegraf. To replicate this, you will need:

  1. Telegraf
    • I mostly use Linux servers, but I have a couple of Windows machines and I mention how to include those later in this post
  2. Root access to those machines
    • to change configuration files
  3. Node-RED
    • other applications that permit the dynamic creation of HTTP endpoints would do just as well, but I use Node-RED
  4. Err, that's it!

When installed on Linux (Debian; other distros might put config locations elsewhere), the Telegraf service’s default command line options read /etc/telegraf/telegraf.conf and /etc/telegraf/telegraf.d/*.conf. Telegraf also includes the option of reading its configuration from an HTTP endpoint, which you’ll have seen if you use InfluxDB v2. When prompted to configure Telegraf with InfluxDB v2, you are provided with an API Token to store as an environment variable on the machine where your Telegraf runs, and you’re presented with this command line:

telegraf --config https://InfluxDBCloud2Instance/api/v2/telegrafs/NNN

That command line will run Telegraf (assuming it’s in your PATH), connecting to the specified endpoint to read its configuration from there.

So, what happens if I use this configuration?

/usr/bin/telegraf -config http://192.168.1.8:1880/telegraf/myhost

This should connect to that IP address (it’s my Node-RED server) on port 1880 (the Node-RED port), to the endpoint /telegraf/myhost. If I can get Node-RED to respond to web requests on /telegraf/myhost, I should be able to send a configuration to Telegraf.
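Before pointing Telegraf at the endpoint, it can help to request the same URL by hand. Here is a minimal sketch, assuming curl is installed; the function names telegraf_config_url and fetch_telegraf_config are my own illustration, and the base address is the Node-RED server used in this post.

```shell
# Sketch only: build and fetch the URL a given Telegraf host would request.
# Change NODE_RED_BASE to the address of your own Node-RED server.
NODE_RED_BASE="http://192.168.1.8:1880"

telegraf_config_url() {
    # $1 = hostname of the machine running Telegraf
    printf '%s/telegraf/%s\n' "$NODE_RED_BASE" "$1"
}

fetch_telegraf_config() {
    # -f makes curl exit non-zero on an HTTP error, so a missing endpoint
    # shows up as a failure rather than an error page printed as "config".
    curl -fsS "$(telegraf_config_url "$1")"
}
```

Running fetch_telegraf_config myhost from any machine on the LAN should print whatever the endpoint serves for that host, or fail loudly if nothing is listening yet.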

The Node-RED http-in and http-out nodes are designed to do exactly this.

The next challenge is that each of my Telegraf instances might want a different set of configurations, so I want Node-RED to understand which instance of Telegraf is calling it, and to change configs accordingly. Telegraf will send a GET, so each instance needs to request a URL that is unique to it.

After that, I want Node-RED to provide updated configurations, for when I change any config, so I need to have Node-RED refresh its configs periodically. What might be a reasonable refresh time? I’ll go with 10 minutes.

Then I want each instance of Telegraf to refresh configurations, so everything is automated. What might be a reasonable refresh time for this? I’ll try an hour.

Of course, I want all this stuff to run without exposing any sensitive config information, so I need to have Node-RED modify any retrieved configs to add personal information such as the InfluxDB v2 API Token.

This is quite a kit-bag of things I need to do, so let’s get to it!

First, the Telegraf end:

I run Telegraf from systemd in all my LXC containers, so it runs automatically on start-up. Therefore, the configuration of Telegraf is in the systemd directories. If you’re manually running Telegraf, running on a system that uses sysvinit, or running from Docker, you’ll need to modify accordingly, but for my configuration, I need to edit the Telegraf service file located here:

/lib/systemd/system/telegraf.service

Copy that file to /etc, so it’s preserved when Telegraf is updated:

cp /lib/systemd/system/telegraf.service /etc/systemd/system

Now edit the new file. The ExecStart line shows the command line that will run Telegraf, with the -config and -config-directory options. Change it to this:

ExecStart=/usr/bin/telegraf -config http://192.168.1.8:1880/telegraf/%H $TELEGRAF_OPTS

Note the %H. It’s a systemd specifier that means “replace this with the hostname of this machine”, so each machine will request its own set of configs. That should be enough uniqueness for now. If I ever want to implement multiple instances of Telegraf on a single machine, or expand this mechanism to include environment definitions, I’ll investigate abstractions in systemd to retrieve variables from the EnvironmentFile entry rather than using %H.
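If you have more than a couple of machines, the copy-and-edit step can be scripted. This is a rough sketch rather than a tested installer: rewrite_execstart is a hypothetical helper name, it assumes the unit file has a single ExecStart line, and it assumes the base URL contains no "|", "&" or backslash characters, since those are special to sed.

```shell
# Sketch only: print a copy of a unit file with ExecStart pointed at an
# HTTP config endpoint.
#   $1 = path to the original unit file
#   $2 = base URL of the config server, without a trailing slash
rewrite_execstart() {
    # %H is left untouched for systemd to expand at service start.
    # \$TELEGRAF_OPTS keeps the variable literal, so systemd (not this
    # shell) substitutes it.
    sed "s|^ExecStart=.*|ExecStart=/usr/bin/telegraf -config $2/telegraf/%H \$TELEGRAF_OPTS|" "$1"
}
```

For example, rewrite_execstart /lib/systemd/system/telegraf.service http://192.168.1.8:1880 > /etc/systemd/system/telegraf.service writes the edited copy in one step. All other lines, including ExecReload, pass through unchanged.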

Look at the next line in that file:

ExecReload=/bin/kill -HUP $MAINPID

This states that a “systemctl reload” command will send a SIGHUP signal to Telegraf, which is designed to cause Telegraf to refresh its configuration. I want to execute that on an hourly schedule. Linux machines run cron for regular automated tasks, so we can use this facility. Create the file /etc/cron.hourly/telegraf and put these lines into it:

#!/bin/bash -e
# random sleep (max 5 min) to prevent clients from hitting the server at the same time
sleep $(( RANDOM % 300 ))
systemctl reload telegraf

This will cause Telegraf to reload its configuration about once an hour, at a random point within five minutes of the hourly cron run.

Make that file executable:

chmod +x /etc/cron.hourly/telegraf

Notes:

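  • If your filename contains a dot or other punctuation mark, cron will silently skip it. On Debian, the cron.hourly scripts are run by run-parts, which only accepts names made of letters, digits, underscores and hyphens. There's no running a telegraf.runthis file, so just keep the filename simple.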
  • If you want your config reload to happen at a different time scale than hourly, other parts of the cron system enable this, so you can edit crontab, cron.daily, etc.
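The filename rule is easy to trip over, so here is a small sketch of a check you could run before dropping a script into /etc/cron.hourly. It mirrors the default run-parts naming rule on Debian; valid_cron_name is a hypothetical helper name of my own.

```shell
# Sketch only: succeed if a file name uses only the characters run-parts
# accepts by default (letters, digits, underscores and hyphens).
valid_cron_name() {
    case "$1" in
        '' | *[!A-Za-z0-9_-]*) return 1 ;;  # empty, or has some other character
        *) return 0 ;;
    esac
}

# Example: warn about a name that cron would skip.
if ! valid_cron_name "telegraf.runthis"; then
    echo "telegraf.runthis would be ignored by run-parts"
fi
```

On Debian you can also run run-parts --test /etc/cron.hourly, which lists the scripts that would be executed without actually running them.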

Now run two commands:

systemctl daemon-reload 
systemctl restart telegraf

and Telegraf will now complain in its log file that it can’t get a configuration from anywhere. We’d better jump to Node-RED and sort this out!

Before we run off, a quick note for readers running Telegraf on Windows. Edit the registry, go to:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\telegraf

Change ImagePath to this, as a single line (change the IP address to that of your own Node-RED server):

"C:\Program Files (x86)\Telegraf\telegraf.exe" --config "http://192.168.1.8:1880/telegraf/%ComputerName%"

Either reboot, or run “Services” and restart Telegraf.

This will NOT cause Telegraf to regularly reload its configuration in Windows. Just reboot the PC or restart the Telegraf service for this to happen.

Now, let’s run to Node-RED.

Create a new tab in Node-RED. Add the required nodes into it for configs.

[Image: timestamp]

The first node, “timestamp”, updates the configs every 10 minutes.

[Image: edit inject node]

The next nodes set properties for private settings. These are “Change” nodes, and are where I’ll add all my personal information for connecting and authenticating against my Cloud instances of InfluxDB. They look like this:

[Image: edit change node]

Now we need to retrieve the configs from wherever they are stored. I created nodes like this:

[Image: creating nodes]

This is where you can get really creative! You might want to store your configs in GitHub or GitLab, or you might be into Software-Defined Networking (config-as-a-service) and so want to send updates over something like a messaging topic. Updated config data can come from many places, and the only difference in Node-RED is the type of node used to update the config information:

[Image: update config information]

I prefer to use Node-RED to directly store my configs, so my config nodes look something like this:

Note the “mustache” template for the options urls, database, username and password. These pieces of information are retrieved from the “Change” nodes and inserted into here, so the data is always correct and current.
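To make the substitution concrete, here is a rough shell imitation of what the template node does with those four placeholders. Node-RED does this for you, so this is purely illustrative. render_output is a hypothetical helper, and the [[outputs.influxdb]] snippet is an assumed example, not the exact template from my flow. It also assumes the values contain no "|", "&" or backslash characters, which are special to sed.

```shell
# Sketch only: fill mustache-style placeholders in a Telegraf output snippet.
#   $1 = urls, $2 = database, $3 = username, $4 = password
render_output() {
    # Each -e expression swaps one {{placeholder}} for its value; the
    # quoted heredoc keeps the template text exactly as written.
    sed -e "s|{{urls}}|$1|g" \
        -e "s|{{database}}|$2|g" \
        -e "s|{{username}}|$3|g" \
        -e "s|{{password}}|$4|g" <<'EOF'
[[outputs.influxdb]]
  urls = ["{{urls}}"]
  database = "{{database}}"
  username = "{{username}}"
  password = "{{password}}"
EOF
}
```

Calling render_output with a URL, database name, username and password prints the snippet with every placeholder filled in. The point is the same as in the flow: the secrets live in one place and never need to be stored alongside the config text.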

Multiple config snippets can exist in one template node. Create each template node as you would like, for best readability, understanding, and appropriate re-use, such as this one for a set of core monitoring inputs:

After a deployment of the flow, all of our config snippets are loaded into Node-RED, and refreshed every 10 minutes. The final task is to serve a web page for Telegraf to use:

[Image: config snippets]

The first node in this flow is an “http-in” node. It listens for GET requests on http://192.168.1.8:1880/telegraf/:hostname.

[Image: http in node]

The :hostname automatically becomes a property in Node-RED, which we will use in the “Select on hostname” Switch node. Remember when we configured Telegraf with %H? That parameter replacement is what we’re using here.

The Switch node matches against the hostname, and sends the request down a specific path for each host.

As an example to follow the workflow, let’s look at what happens for the host “mqtt”. After the Switch node selects the “mqtt” path, it goes next to a “MQTT config” template node, which enables specific monitoring config for my MQTT implementation.

That config contains my ActiveMQ and Jolokia-2 config information (I use ActiveMQ as my MQTT broker because it has monitoring hooks and a built-in view of all the topics I’m using, along with an indication of how many subscribers I have listening to each topic). The workflow then goes to the Linux config node, which inserts all the standard Linux monitoring configs I want to use, appending the MQTT config at the end:

The final step is an “http-out” node, which responds to the original requestor with the required config data.

http out node
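For anyone who thinks better in text than in wired-up nodes, the routing and assembly above boils down to something like the following sketch. It is not how Node-RED implements it, just the same logic in shell. The snippet contents are placeholders, “mqtt” is the only host name taken from this post, and the fallback for unknown hosts is my own assumption.

```shell
# Sketch only: mimic the Switch node plus the chained template nodes.
LINUX_CONFIG='[[inputs.cpu]]
[[inputs.mem]]
[[inputs.disk]]'

# Placeholder MQTT snippet; a real one would carry the full Jolokia setup.
MQTT_CONFIG='[[inputs.jolokia2_agent]]
  urls = ["http://localhost:8161/api/jolokia"]'

config_for_host() {
    # $1 = the :hostname part of /telegraf/:hostname
    case "$1" in
        mqtt)
            # Standard Linux inputs first, host-specific config appended.
            printf '%s\n%s\n' "$LINUX_CONFIG" "$MQTT_CONFIG"
            ;;
        *)
            # Assumed fallback: any other host gets only the common inputs.
            printf '%s\n' "$LINUX_CONFIG"
            ;;
    esac
}
```

config_for_host mqtt prints the Linux inputs followed by the MQTT section, which is the text the http-out step would send back to that host.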

Every few minutes, Node-RED logs that a connection has been made and a new config has been served:

How does this look from the Telegraf end? Here’s a section of telegraf.log from the server named “mqtt”. It shows what happens when the reload command is sent to Telegraf: it writes all buffered points to InfluxDB, then restarts and loads all its updated inputs and outputs.

[Image: telegraf output]

Note that the first outputs after the reload contain more points than the outputs before the reload. This is Telegraf processing inputs that were buffered while it was reloading its config. During testing, I noticed that, if I write to the Telegraf instance over HTTP while it’s reloading its config, the HTTP port isn’t open, so this could be a consideration for determining how frequently to reload configs.
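One way to live with that brief gap is to have whatever writes to Telegraf retry for a few seconds instead of giving up on the first refused connection. Here is a minimal sketch of such a wrapper; retry is a hypothetical helper name, and the attempt count and delay are arbitrary values you would tune.

```shell
# Sketch only: run a command until it succeeds or the attempts run out.
#   $1 = maximum attempts, $2 = seconds to wait between attempts,
#   remaining arguments = the command to run
retry() {
    retry_max=$1
    retry_delay=$2
    shift 2
    retry_n=1
    while :; do
        "$@" && return 0                                # success: stop here
        [ "$retry_n" -ge "$retry_max" ] && return 1     # out of attempts
        retry_n=$((retry_n + 1))
        sleep "$retry_delay"
    done
}
```

A sender could then wrap its write, for example retry 5 2 curl -fsS --data-binary 'demo value=1' http://192.168.1.8:8186/write, where the listener address and port are hypothetical and depend on how your Telegraf HTTP input is configured.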

So all my Telegraf instances are now updating their running config every hour, which means I can tune my jitter and buffers as required, and I can update my configs as I want, for every instance of Telegraf, a subset of instances, or individually, all without needing to log in to a server.

Node-RED can even be configured to commit your tabs to version control, so all the configs are safe and secure.

In effect, I’ve just configured Node-RED-to-Telegraf to run as a configuration management / change management / continuous deployment system, all through UI elements using model-based processes, without needing a single line of code.

I don’t know if I’m going to continue to use this mechanism for Telegraf configs, whether I’ll use the Telegraf Gateway mechanism, or whether I’ll use a combination of both. I’ll run a combination for now and revisit this at a later date to determine the most appropriate deployment mechanism for my use case.