Collecting Running Process Counts with Telegraf

Adding monitoring to your stack is one of the quickest ways to get visibility into your application, letting you catch issues more quickly and begin to make data-driven decisions about where to invest your engineering efforts. One of the most straightforward things to start monitoring is your processes themselves—it can be hard for a web server to serve requests if no processes are running, and on the flip side you can quickly deplete your available resources by running more copies of a program than you intended.

On most *nix systems, process information can be gathered in multiple ways. Depending on which specific OS you’re running, you might be able to look at the proc filesystem, which contains a number of files with information about running processes and the system in general, or you could use a tool like ps, which outputs information about running processes to the command line.

In this example we’ll use Python and Ubuntu Linux, but many of the concepts will carry over to other languages or applications and operating systems.

Getting Info about Processes on Linux

One great place to get more information about your system is the proc filesystem, which according to the man page, “is a pseudo-filesystem which provides an interface to kernel data structures. It is commonly mounted at /proc.” If you visit that directory on a Linux machine, you might see something like this (output from a fresh install of Ubuntu 16.04.3):

$ cd /proc
$ ls
1     1182  14   18   25   3    399  437  53   60   80   839        bus        driver       kallsyms     locks         partitions   sys            version_signature
10    12    141  19   26   30   4    47   54   61   81   840        cgroups    execdomains  kcore        mdstat        sched_debug  sysrq-trigger  vmallocinfo
1017  1200  15   2    266  31   413  48   544  62   817  890        cmdline    fb           keys         meminfo       schedstat    sysvipc        vmstat
1032  1230  152  20   267  320  414  480  55   66   818  9          consoles   filesystems  key-users    misc          scsi         thread-self    zoneinfo
1033  1231  16   21   277  321  420  49   56   671  820  919        cpuinfo    fs           kmsg         modules       self         timer_list
1095  1243  164  22   278  369  421  5    57   7    828  925        crypto     interrupts   kpagecgroup  mounts        slabinfo     timer_stats
11    126   165  23   28   373  423  50   58   701  831  acpi       devices    iomem        kpagecount   mtrr          softirqs     tty
1174  128   166  24   29   381  425  51   59   79   837  asound     diskstats  ioports      kpageflags   net           stat         uptime
1176  13    17   241  295  397  426  52   6    8    838  buddyinfo  dma        irq          loadavg      pagetypeinfo  swaps        version

There’s a lot there! The first thing you’ll notice is a series of numbered directories; these correspond to running processes, and each directory is named with the “Process ID”, or PID, of that process. You’ll also see a number of other directories and files with information about everything from kernel parameters and loaded modules to CPU info, network statistics and system uptime. Inside the directories for each process you’ll find almost as much information about each individual process—too much for our use case. After all, we just want to monitor whether or not the process is running, and maybe how many copies are running.
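To get a feel for how a tool might use this information, here’s a minimal Python sketch (a hypothetical illustration, not something we’ll need later) that counts processes whose command lines contain a given string by reading each numbered directory’s cmdline file:

#!/usr/bin/env python

import os

def count_matching_processes(needle):
    """Count /proc entries whose command line contains `needle`."""
    count = 0
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue  # skip non-process entries like 'cpuinfo'
        try:
            # arguments in cmdline are separated by NUL bytes
            with open('/proc/%s/cmdline' % entry, 'rb') as f:
                cmdline = f.read().replace(b'\x00', b' ')
        except IOError:
            continue  # the process exited while we were reading it
        if needle in cmdline:
            count += 1
    return count

if __name__ == '__main__':
    print(count_matching_processes(b'python'))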

When a system administrator logs into a server to verify that a process is running, it’s unlikely that /proc would be the first place they turn. Instead, they’d probably use a tool like ps, which also provides information about running processes. There are a few different versions of ps that you might use, but for the version on Ubuntu you can use the following command to get information about all running processes on your server:

$ ps aux

We’re going to use Python to create a few processes for testing purposes. Since we don’t really need these to be doing any work, we’ll write a simple program with an infinite loop and a call to the sleep function in Python’s time module, in order to avoid using unnecessary CPU cycles. Make sure you have Python installed by entering the following command at the command line:

$ python --version
Python 2.7.12

Since our program is so simple, it will work with either Python 2 or Python 3. Next, use a text editor to create a file called loop.py with the following contents:

#!/usr/bin/env python

import time

while True:
    time.sleep(5)

The first line tells the operating system which interpreter to use to execute the script. If this program were more complex, or used functionality that differed between Python 2 and Python 3, we’d want to specify which version of Python we were using instead of just saying python.
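For example, if we wanted to require Python 3 explicitly, the first line would look like this:

#!/usr/bin/env python3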

Run this command from the same directory where the file is located to make the script executable:

$ chmod 744 loop.py

and then run the program, appending the & character to the end of the command (which tells Linux to run the process in the background) so we still have access to the shell:

$ ./loop.py &
[1] 1886

After running a command with the & character, the shell prints a job number in brackets along with the PID of the new process. If you run ps aux again, you should now see a Python process with PID 1886 in the list of results.
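You can also list your shell’s background jobs at any time; in bash, passing the -l flag to the jobs builtin includes each job’s PID, with output along these lines:

$ jobs -l
[1]+  1886 Running                 ./loop.py &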

On the Ubuntu server I am using, this command returned just over 100 results, and searching through that list manually is too inefficient. We can use another command, grep, as well as some of Linux’s built-in functionality, to narrow down the results. The grep command acts like a search filter, and we can use a Linux “Pipe”, the | character, to send the data from the output of the ps aux command to the input of the grep command. Let’s try it out:

$ ps aux | grep python
noah      1886  0.0  0.2  24512  6000 pts/0    S    20:14   0:00 python ./loop.py
noah      1960  0.0  0.0  14224  1084 pts/0    S+   20:56   0:00 grep --color=auto python

First we’re getting information about all running processes, then we’re “piping” that data into the grep command, which is searching for any lines that contain the string python. Sure enough, there is our Python process, 1886, in the first line of the results. But what about that second line?

When we run the ps command, the output includes the arguments each process was started with. In this case those arguments are --color=auto, added because Ubuntu defines an alias that runs grep --color=auto when you type grep, and python, the string we were searching for. Since the grep process’s own command line contains the string “python”, grep will always match itself.

A common workaround to this issue is to search for the regular expression “[p]ython” instead of the string “python”. The brackets define a character class, which matches any single one of the characters inside them (in our case just a “p”), followed by the literal “ython”. The pattern still matches our script’s command line, which contains the word “python”, but the grep process no longer matches itself: its own command line contains the literal text “[p]ython”, which the pattern doesn’t match. We also quote the pattern so the shell doesn’t try to expand the brackets as a filename glob. Give it a shot:

$ ps aux | grep '[p]ython'
noah      1886  0.0  0.2  24512  6000 pts/0    S    20:14   0:00 python ./loop.py

Let’s start up another Python process and see what we get:

$ ./loop.py &
[2] 1978
$ ps aux | grep '[p]ython'
noah      1886  0.0  0.2  24512  6000 pts/0    S    20:14   0:00 python ./loop.py
noah      1978  0.0  0.2  24512  6004 pts/0    S    21:13   0:00 python ./loop.py

Two Python processes, two results. If we wanted to verify that a certain number of processes were running, we should just be able to count the lines output by our command; fortunately, providing the -c argument to grep does exactly that:

$ ps aux | grep -c '[p]ython'
2

Let’s bring the most recent of the two Python scripts into the foreground using the fg command, stop it with Ctrl+C, and then count the number of Python processes again:

$ fg
./loop.py
^CTraceback (most recent call last):
  File "./loop.py", line 6, in 
    time.sleep(5)
KeyboardInterrupt
$ ps aux | grep -c '[p]ython'
1

Perfect! One is the number we were looking for.

There’s another command, pgrep, which also fulfills all the requirements of our use case, though it’s less generally useful: it searches for processes matching a string and prints their PIDs by default. It also accepts a -c argument, which outputs a count of the matches instead:

$ pgrep -c python
1
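By default, pgrep matches against the process name only. If you’d rather match against the full command line (say, to count instances of one specific script), the -f flag does that:

$ pgrep -fc loop.py
1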

Gathering Process Counts with Telegraf

Now that we know how to count the number of processes running on a server, we need to start collecting that data at regular intervals. Telegraf’s exec input plugin gives us a way to run the same commands we’ve been using at the shell and collect their output as metrics.

The exec plugin will run once during each of Telegraf’s collection intervals, executing the commands from your configuration file and collecting their output. The output can be in a variety of formats, including any of the supported Input Formats, which means that if you already have scripts that output some kind of metrics data in JSON or another of the supported formats, you can use the exec plugin to quickly start collecting those metrics using Telegraf.
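For example, if you already had a script that printed JSON like this (the script name and output below are hypothetical):

$ /usr/local/bin/stats.sh
{"python_processes": 1, "worker_processes": 4}

you could point the exec plugin at it with data_format = "json", and Telegraf would turn each numeric value into a field.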

If you don’t already have Telegraf installed, you can refer to the installation documentation. After following the instructions for Ubuntu, you should find a config file located at /etc/telegraf/telegraf.conf.

For the purpose of this example, we’re going to write the output to a file, so we want to edit the [[outputs.file]] section of the config, like so:

# # Send telegraf metrics to file(s)
[[outputs.file]]
  ## Files to write to, "stdout" is a specially handled file.
  files = ["/tmp/metrics.out"]
  ## Data format to output.
  data_format = "influx"

We’ll apply those changes by restarting Telegraf, then check that metrics are being written to /tmp/metrics.out. When installing Telegraf from the package manager, the system input plugin is enabled by default, so we should start seeing metrics immediately:

$ sudo systemctl restart telegraf
$ tail -n 2 /tmp/metrics.out
diskio,name=dm-0,host=demo writes=7513i,read_bytes=422806528i,write_bytes=335978496i,write_time=23128i,io_time=9828i,iops_in_progress=0i,reads=9111i,read_time=23216i,weighted_io_time=46344i 1519701100000000000
diskio,name=dm-1,host=demo write_time=0i,io_time=108i,weighted_io_time=116i,read_time=116i,writes=0i,read_bytes=3342336i,write_bytes=0i,iops_in_progress=0i,reads=137i 1519701100000000000

Unfortunately, the exec plugin doesn’t invoke a shell, so it doesn’t know what to do with a piped command like the one above; we need to wrap it in a simple shell script instead. First, create a file called pyprocess_count in your home directory, with the following text:

#!/bin/sh

count=$(ps aux | grep -c '[p]ython')

echo $count

This script serves a secondary purpose besides letting us execute a piped command via the exec plugin: if grep -c finds zero matches, it exits with a status code of 1, indicating an error. That would cause Telegraf to ignore the command’s output and emit an error of its own. By storing the result of the command in the count variable and then outputting it with echo, we make sure the script always exits with a status code of 0. Be careful not to include “python” in the filename, or grep will match that string when the script runs. Once you’ve created the file, set its permissions so that anyone can execute it, and test it out:

$ chmod 755 pyprocess_count
$ ./pyprocess_count
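To see why the echo matters, compare exit statuses, which the shell stores in $?. If no Python processes were running, the raw pipeline would report failure while our script would still succeed:

$ ps aux | grep -c '[p]ython'
0
$ echo $?
1
$ ./pyprocess_count
0
$ echo $?
0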

Then move it to /usr/local/bin:

$ sudo mv pyprocess_count /usr/local/bin

Next, we need to configure the exec input plugin to execute the script. Edit the [[inputs.exec]] section of the config file so it looks like this:

# # Read metrics from one or more commands that can output to stdout
[[inputs.exec]]
  ## Commands array
  commands = [
    "/usr/bin/local/pyprocess_count"
  ]

  ## Timeout for each command to complete.
  timeout = "5s"

  name_override = "python_processes"

  ## Data format to consume.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
  data_format = "value"

We’ve added the script directly to the commands array, so it will be executed by Telegraf once per collection interval. We’ve also set data_format to "value" because the command outputs a single number, and we use name_override to give the metric a name (without it, the measurement would be named after the plugin, exec).

Restart Telegraf again and then look at the metrics.out file to see if our new metrics are showing up. Instead of searching through the file by eye, we can use grep again to search for any lines with “python” in them:

$ grep python < /tmp/metrics.out
python_processes,host=demo value=1i 1519703250000000000
python_processes,host=demo value=1i 1519703260000000000
python_processes,host=demo value=1i 1519703270000000000

We’re using the < character to redirect the contents of the metrics file to the grep command’s standard input (another shell feature, called input redirection). In return we get a few lines of metrics in InfluxDB line protocol, each with the name of the metric, a host tag added by Telegraf, the value (the trailing “i” marks it as an integer), and a timestamp.
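Breaking one of those lines into its parts:

python_processes        the measurement name, set by name_override
host=demo               a tag, added automatically by Telegraf
value=1i                the field; "value" is the default field name for the value format
1519703250000000000     the timestamp, in nanoseconds since the Unix epoch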

If we bring up another Python process, we should see the value change in our output:

$ ./loop.py &
[2] 2468
$ grep python < /tmp/metrics.out
python_processes,host=demo value=1i 1519703250000000000
python_processes,host=demo value=1i 1519703260000000000
python_processes,host=demo value=1i 1519703270000000000
python_processes,host=demo value=1i 1519703280000000000
python_processes,host=demo value=1i 1519703290000000000
python_processes,host=demo value=2i 1519703300000000000

And there we go! The final metric shows two Python processes running.

Next Steps

Writing metrics to disk isn’t a very useful practice, but it’s good for making sure your setup is collecting the data you expected. In order to make it actionable, you’ll need to send the data you collect to a central store somewhere so that you can visualize and alert on it.

The visualizations for these metrics would be minimal; we probably don’t need a full graph, since there shouldn’t be much variation in the data we’re getting that we need to look at historically. Instead, displaying a single number (for example, the Single Stat panel in Chronograf) should be enough to give you some confidence that things are working as expected.

How you alert on these metrics will depend on what exactly you’re monitoring. Perhaps you always want to have one copy of a process running, so you create an alert that sends an email whenever your process count drops below 1. After the first few alerts, though, your team will probably want to automate bringing up a new process if yours crashes, so you’ll need to tweak the alert so that some time must elapse between the metric hitting 0 and the first alert being sent; if your automated system can bring the process back quickly, a human doesn’t need to be contacted.

Or maybe you have a system that is regularly spawning new processes and killing old ones, but which should never have more than X processes running at a given time. You’d probably want to set up an alert similar to the one above, except instead of alerting when the metric drops below 1, you’d alert whenever the metric is greater than or less than X. You might want to give yourself a time window for this alert as well; maybe it’s OK if your system runs X+1 or X-1 processes for a short time as it kills old processes and brings up new ones.

If you decide to send your data to InfluxDB, you can use Chronograf and Kapacitor to visualize and alert on your metrics. You can read more about creating a Chronograf Dashboard or setting up a Kapacitor alert on their respective documentation pages.