Collecting Running Process Counts with Telegraf
By Noah Crowley / Feb 27, 2018
Adding monitoring to your stack is one of the quickest ways to get visibility into your application, letting you catch issues sooner and make data-driven decisions about where to invest your engineering efforts. One of the most straightforward things to start monitoring is the processes themselves: it's hard for a web server to serve requests if no processes are running, and on the flip side, you can quickly deplete your available resources by running more copies of a program than you intended.
On most 'nix systems, process information can be gathered in multiple ways. Depending on which specific OS you're running, you might be able to look at the proc filesystem, which contains a number of files with information about running processes and the system in general, or you could use a tool like ps, which outputs information about running processes to the command line.
In this example we'll use Python and Ubuntu Linux, but many of the concepts carry over to other languages, applications, and operating systems.
Getting Info about Processes on Linux
One great place to get more information about your system is the proc filesystem which, according to the man page, "is a pseudo-filesystem which provides an interface to kernel data structures. It is commonly mounted at /proc." If you visit that directory on a Linux machine, you might see something like this (output from a fresh install of Ubuntu 16.04.3):
$ cd /proc
$ ls
1 1182 14 18 25 3 399 437 53 60 80 839 bus driver kallsyms locks partitions sys version_signature
10 12 141 19 26 30 4 47 54 61 81 840 cgroups execdomains kcore mdstat sched_debug sysrq-trigger vmallocinfo
1017 1200 15 2 266 31 413 48 544 62 817 890 cmdline fb keys meminfo schedstat sysvipc vmstat
1032 1230 152 20 267 320 414 480 55 66 818 9 consoles filesystems key-users misc scsi thread-self zoneinfo
1033 1231 16 21 277 321 420 49 56 671 820 919 cpuinfo fs kmsg modules self timer_list
1095 1243 164 22 278 369 421 5 57 7 828 925 crypto interrupts kpagecgroup mounts slabinfo timer_stats
11 126 165 23 28 373 423 50 58 701 831 acpi devices iomem kpagecount mtrr softirqs tty
1174 128 166 24 29 381 425 51 59 79 837 asound diskstats ioports kpageflags net stat uptime
1176 13 17 241 295 397 426 52 6 8 838 buddyinfo dma irq loadavg pagetypeinfo swaps version
There's a lot there! The first thing you'll notice is a series of numbered directories; these correspond to running processes, and each directory is named with the "Process ID", or PID, of that process. You'll also see a number of other directories and files with information about everything from kernel parameters and loaded modules to CPU info, network statistics, and system uptime. Inside the directory for each process you'll find almost as much information about that individual process: too much for our use case. After all, we just want to monitor whether or not the process is running, and maybe how many copies are running.
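Each of those numbered directories contains files describing the process: cmdline holds the command line the process was started with (NUL-separated), and status holds human-readable details like the process name and state. Here's a quick sketch using PID 1, since it always exists; the exact output will vary by system:
$ cat /proc/1/cmdline | tr '\0' ' '
/sbin/init
$ head -2 /proc/1/status
Name:	systemd
State:	S (sleeping)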
When a system administrator logs into a server to verify that a process is running, it's unlikely that /proc would be the first place they turn. Instead, they'd probably use a tool like ps, which also provides information about running processes. There are a few different versions of ps that you might use, but for the version on Ubuntu you can use the following command to get information about all running processes on your server:
$ ps aux
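The first line of that output is a header describing each column; on Ubuntu's version of ps it looks like this:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
Each following line describes one process: the user that owns it, its PID, how much CPU and memory it's consuming, and the command that launched it, among other details.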
We’re going to use Python to create a few processes for testing purposes. Since we don’t really need these to be doing any work, we’ll write a simple program with an infinite loop and a call to the sleep function in Python’s time module, in order to avoid using unnecessary CPU cycles. Make sure you have Python installed by entering the following command at the command line:
$ python --version
Python 2.7.12
Since our program is so simple, it will work with either Python 2 or Python 3. Next, use a text editor to create a file called loop.py with the following contents:
#!/usr/bin/env python
import time

# Loop forever, sleeping between iterations so we don't burn CPU
while True:
    time.sleep(5)
The first line tells the operating system which interpreter to use to execute the script. If this program were more complex, or used functionality that differed between Python 2 and Python 3, we'd want to specify which version of Python we were using instead of just saying python.
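For example, to pin the script to Python 3, we could change the first line to the following (assuming a python3 binary is available on the PATH, as it is on most modern distributions):
#!/usr/bin/env python3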
Run this command from the same directory where the file is located to make the script executable:
$ chmod 744 loop.py
and then run the program, appending the & character to the end of the command (which tells Linux to run the process in the background) so we still have access to the shell:
$ ./loop.py &
[1] 1886
After running a command using the & character, the PID of the running process is listed in the output. If you run ps aux again, you should now see a Python process with PID 1886 in the list of results.
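As an aside, once you know the PID you can also ask ps about that process directly instead of listing everything. The -p flag selects a single PID and -o picks the columns to display; here, the PID, elapsed run time, and command line:
$ ps -p 1886 -o pid,etime,cmd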
On the Ubuntu server I am using, this command returned just over 100 results, and searching through that list manually is too inefficient. We can use another command, grep, as well as some of Linux's built-in functionality, to narrow down the results. The grep command acts like a search filter, and we can use a Linux "pipe", the | character, to send the output of the ps aux command to the input of the grep command. Let's try it out:
$ ps aux | grep python
noah 1886 0.0 0.2 24512 6000 pts/0 S 20:14 0:00 python ./loop.py
noah 1960 0.0 0.0 14224 1084 pts/0 S+ 20:56 0:00 grep --color=auto python
First we're getting information about all running processes, then we're "piping" that data into the grep command, which searches for any lines that contain the string python. Sure enough, there is our Python process, 1886, in the first line of the results. But what about that second line?
When we run the ps command, the output includes the arguments each process was started with; in this case, --color=auto appears because Ubuntu has an alias that runs grep --color=auto when you type grep, followed by the python argument, the string we were searching for. Because the grep process's own command line contains the string "python", grep will always match itself when we search for that string.
A common workaround to this issue is to search for the regular expression "[p]ython" instead of the string "python". The brackets define a character class, so grep matches any one of the characters inside them (in our case just a "p") followed by "ython". Our Python processes still match, because their command lines contain the string "python", but the grep process no longer matches itself: its command line contains the literal text "[p]ython", which the pattern doesn't match. Give it a shot:
$ ps aux | grep [p]ython
noah 1886 0.0 0.2 24512 6000 pts/0 S 20:14 0:00 python ./loop.py
Let’s start up another Python process and see what we get:
$ ./loop.py &
[2] 1978
$ ps aux | grep [p]ython
noah 1886 0.0 0.2 24512 6000 pts/0 S 20:14 0:00 python ./loop.py
noah 1978 0.0 0.2 24512 6004 pts/0 S 21:13 0:00 python ./loop.py
Two Python processes, two results. If we wanted to verify that a certain number of processes were running, we should just be able to count the lines output by our command; fortunately, providing the -c argument to grep does exactly that:
$ ps aux | grep -c [p]ython
2
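As a sanity check, piping the matches through wc -l, which counts its lines of input, produces the same number; grep -c just saves a step:
$ ps aux | grep [p]ython | wc -l
2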
Let's bring the most recent of the two Python scripts into the foreground using the fg command, then kill it with <Ctrl+C>, and count the number of Python processes again:
$ fg
./loop.py
^CTraceback (most recent call last):
File "./loop.py", line 6, in
time.sleep(5)
KeyboardInterrupt
$ ps aux | grep -c [p]ython
1
Perfect! One is the number we were looking for.
There's another command, pgrep, which also fulfills all the requirements of our use case, but it's not as generally useful. It allows you to search for all processes which match a string, and returns their PIDs by default. It also accepts a -c argument, which outputs a count of the number of matches instead:
$ pgrep -c python
1
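If you want to see what pgrep matched rather than just a count, recent versions also support an -a flag, which lists each PID alongside its full command line:
$ pgrep -a python
1886 python ./loop.py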
Gathering Process Counts with Telegraf
Now that we know how to count the number of processes running on a server, we need to start collecting that data at regular intervals. Telegraf's exec input plugin gives us a way to execute the same commands we've been using at the shell and collect their output as metrics.
The exec plugin runs once during each of Telegraf's collection intervals, executing the commands from your configuration file and collecting their output. The output can be in a variety of formats, including any of the supported Input Formats, which means that if you already have scripts that output metrics data in JSON or another supported format, you can use the exec plugin to quickly start collecting those metrics with Telegraf.
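As a hypothetical sketch of that path (not what we'll build below): a script like this one, paired with data_format = "json" in the exec plugin's config, would have its output parsed into a metric automatically. The count field name is just an illustration:
#!/bin/sh
# Emit our process count as a JSON object; the exec plugin can parse
# this when its data_format is set to "json"
echo "{\"count\": $(ps aux | grep -c '[p]ython')}"
For this example, though, our command produces a single number, so we'll use the simpler "value" format instead.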
If you don't already have Telegraf installed, you can refer to the Telegraf installation documentation. After following the instructions for Ubuntu, you should find a config file located at /etc/telegraf/telegraf.conf.
For the purpose of this example, we're going to write the output to a file, so we want to edit the [[outputs.file]] section of the config, like so:
# # Send telegraf metrics to file(s)
[[outputs.file]]
  ## Files to write to, "stdout" is a specially handled file.
  files = ["/tmp/metrics.out"]

  ## Data format to output.
  data_format = "influx"
We'll apply those changes by restarting Telegraf, then check that metrics are being written to /tmp/metrics.out. When installing Telegraf from the package manager, the system input plugin is enabled by default, so we should start seeing metrics immediately:
$ sudo systemctl restart telegraf
$ tail -n 2 /tmp/metrics.out
diskio,name=dm-0,host=demo writes=7513i,read_bytes=422806528i,write_bytes=335978496i,write_time=23128i,io_time=9828i,iops_in_progress=0i,reads=9111i,read_time=23216i,weighted_io_time=46344i 1519701100000000000
diskio,name=dm-1,host=demo write_time=0i,io_time=108i,weighted_io_time=116i,read_time=116i,writes=0i,read_bytes=3342336i,write_bytes=0i,iops_in_progress=0i,reads=137i 1519701100000000000
Unfortunately, the exec plugin doesn't run its commands through a shell, so it can't handle a pipeline like the one above on its own; we need to put the pipeline into a simple shell script instead. First, create a file called pyprocess_count in your home directory, with the following text:
#!/bin/sh
# Count running processes whose command line matches "python";
# quoting the pattern keeps the shell from expanding it as a glob
count=$(ps aux | grep -c '[p]ython')
echo $count
This script serves a secondary objective besides allowing us to execute a piped command from the exec plugin: if grep -c finds zero matches, it exits with a status code of 1, indicating an error, which causes Telegraf to ignore the output of the command and emit its own error. By storing the result of the command in the count variable and then outputting it with echo, we make sure that the script itself exits with a status code of 0. Be careful not to include "python" in the filename, or grep will match that string when the script is run. Once you've created the file, set its permissions so that anyone can execute it, and test it out:
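You can see that exit-code behavior for yourself at the shell; $? holds the exit status of the last command, and the pattern below is just one that won't match anything:
$ ps aux | grep -c '[n]onexistent'
0
$ echo $?
1
grep prints a count of 0 but exits with a non-zero status, while our wrapper script prints the count and exits 0 either way.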
$ chmod 755 pyprocess_count
$ ./pyprocess_count
Then move it to /usr/local/bin:
$ sudo mv pyprocess_count /usr/local/bin
Next, we need to configure the exec input plugin to execute the script. Edit the [[inputs.exec]] section of the config so it looks like this:
# # Read metrics from one or more commands that can output to stdout
[[inputs.exec]]
  ## Commands array
  commands = [
    "/usr/local/bin/pyprocess_count"
  ]

  ## Timeout for each command to complete.
  timeout = "5s"

  name_override = "python_processes"

  ## Data format to consume.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
  data_format = "value"
We've added the command directly to the commands array, so it will be executed by Telegraf once per collection interval. We've also set the data_format to "value", because the command outputs a single number, and we use name_override to give the metric a name.
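Before restarting the service, you can run the new input once in the foreground using Telegraf's --test flag, which gathers metrics a single time and prints them to stdout instead of sending them to outputs:
$ telegraf --config /etc/telegraf/telegraf.conf --input-filter exec --test
If the script and config are correct, you should see a python_processes line in the output.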
Restart Telegraf again and then look at the metrics.out file to see if our new metrics are showing up. Instead of searching through the file by eye, we can use grep again to search for any lines with "python" in them:
$ grep python < /tmp/metrics.out
python_processes,host=demo value=1i 1519703250000000000
python_processes,host=demo value=1i 1519703260000000000
python_processes,host=demo value=1i 1519703270000000000
Here we're using another piece of Linux functionality, the < character, which sends the contents of the metrics file to the grep command as input. In return we get a few lines of metrics in InfluxDB line protocol, with the name of the metric, a tag for the host added by Telegraf, the value (with an "i" to indicate that it is an integer), and a timestamp.
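Breaking one of those lines down into its line protocol components (the annotations are ours, not Telegraf output):
# python_processes        measurement name (from name_override)
# host=demo               tag set, added by Telegraf
# value=1i                field set; the trailing "i" marks an integer
# 1519703250000000000     timestamp in nanoseconds since the Unix epoch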
If we bring up another Python process, we should see the value change in our output:
$ ./loop.py &
[2] 2468
$ grep python < /tmp/metrics.out
python_processes,host=demo value=1i 1519703250000000000
python_processes,host=demo value=1i 1519703260000000000
python_processes,host=demo value=1i 1519703270000000000
python_processes,host=demo value=1i 1519703280000000000
python_processes,host=demo value=1i 1519703290000000000
python_processes,host=demo value=2i 1519703300000000000
And there we go! The final metric shows two Python processes running.
Next Steps
Writing metrics to disk isn't very useful in practice, but it's a good way to verify that your setup is collecting the data you expected. To make the data actionable, you'll need to send it to a central store somewhere so that you can visualize and alert on it.
The visualizations for these metrics would be minimal; we probably don’t need a full graph, since there shouldn’t be much variation in the data we’re getting that we need to look at historically. Instead, displaying a single number (for example, the Single Stat panel in Chronograf) should be enough to give you some confidence that things are working as expected.
How you alert on these metrics will depend on what exactly you're monitoring. Perhaps you always want to have one copy of a process running; you could create an alert that sends an email every time your process count drops below 1. After the first few alerts, though, your team will probably want to automate bringing up a new process if yours crashes, so you'll need to tweak the alert so that some time must elapse between the metric going to 0 and the first alert being sent; if your automated system can bring up the process quickly enough, a human doesn't need to be contacted.
Or maybe you have a system that regularly spawns new processes and kills old ones, but which should never have more than X processes running at a given time. You'd probably want a similar alert to the one above, except instead of alerting when the metric drops below 1, you'd alert if the metric was greater or less than X. You might want to give yourself a time window for this alert as well; maybe it's OK if your system briefly runs X+1 or X-1 processes as it kills old ones and brings up new ones.
If you decide to send your data to InfluxDB, you can use Chronograf and Kapacitor to visualize and alert on your metrics. You can read more about creating a Chronograf Dashboard or setting up a Kapacitor alert on their respective documentation pages.