Parsing Logs into InfluxDB Using Telegraf's Logparser Plugin
By Cameron Sparr / Product, Company / Jun 23, 2016
InfluxData is excited to announce that Telegraf now supports streaming log event data into InfluxDB, with out-of-the-box patterns for popular servers like Nginx and Apache. Log monitoring support is the latest addition to Telegraf’s already impressive list of 90+ input and output plugins. In addition to standard log formats, the Telegraf logparser plugin also supports custom log parsing using patterns based on the popular “grok” syntax.
What is Telegraf? Telegraf is a metric collection daemon that can collect metrics from a wide array of inputs and write them to a wide array of outputs. It is plugin-driven for both the collection and output of data, so it is easily extendable. It is also written in Go, which means it compiles to a single standalone binary that can be executed on any system with no external dependencies: no npm, pip, gem, or other package management tools required.
With Telegraf you can not only parse logs to create metrics and store them as time series data, but also store the raw log lines alongside those metrics in InfluxDB. We should caution that this new capability is not intended to be a replacement for full-text search tools; rather, it is for aiding root cause analysis during specific windows of triage. So, if you are investigating performance metrics within a band of time, you can search via regex for any log errors that occurred in that same band. All the data is conveniently in one place, making it fast to look up and correlate data.
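For example, once parsed logs are stored (the sketch below assumes a measurement named nginx_access_log with a string field named request, matching the examples later in this guide), an InfluxQL query can combine a time window with a regex match:
SELECT * FROM nginx_access_log WHERE request =~ /login/ AND time > now() - 1h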
The logparser plugin can be utilized in Telegraf to structure any log data into queryable and searchable metrics. In this guide we will use one of Telegraf’s built-in patterns to parse logs that are in the “Combined Log Format”, which is a common setup in Apache and Nginx access logs. It is an extension of the Common Log Format with the addition of two quoted strings for “Referrer” and “User-Agent”:
127.0.0.1 ident auth [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://localhost:8083/" "Chrome/51.0.2704.84"
In Nginx, configuring your server to output logs in this format would look something like this in /etc/nginx/nginx.conf:
log_format vhosts '$host $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"';
access_log /var/log/nginx/access.log vhosts;
And like this in Apache (see here for more details):
"%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""
Step 1 -- Install Telegraf
Telegraf packages are available for many different platforms from the InfluxData downloads page:
# Install on Ubuntu:
wget https://dl.influxdata.com/telegraf/releases/telegraf_1.0.0_amd64.deb
sudo dpkg -i telegraf_1.0.0_amd64.deb
service telegraf stop
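You can verify the install by printing the version string:
telegraf --version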
Step 2 -- Running Through a Test Metric
First create a temp log file:
touch /tmp/test.log
Next, set up a simplified telegraf.conf file by running the following echo command.
echo '
[[inputs.logparser]]
## file(s) to tail:
files = ["/tmp/test.log"]
from_beginning = false
name_override = "test_metric"
## For parsing logstash-style "grok" patterns:
[inputs.logparser.grok]
patterns = ["%{COMBINED_LOG_FORMAT}"]
[[outputs.file]]
## Files to write to, "stdout" is a specially handled file.
files = ["stdout"]' >> /tmp/telegraf.conf
The file tells Telegraf to run using the logparser input plugin and the file output plugin. The logparser input will tail the /tmp/test.log file, parsing and outputting metrics with the name test_metric. The output will be written using the file output plugin to stdout.
Now we will use that config file to run Telegraf:
telegraf --config /tmp/telegraf.conf
While Telegraf is running, we will echo the example log line to the /tmp/test.log file.
echo '127.0.0.1 ident auth [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://localhost:8083/" "Chrome/51.0.2704.84"' >> /tmp/test.log
In the Telegraf output, you should see the above log line get parsed into an InfluxDB line-protocol metric and printed to STDOUT.
test_metric,host=localhost,verb=GET agent="Chrome/51.0.2704.84",auth="auth",client_ip="127.0.0.1",http_version=1,ident="ident",referrer="http://localhost:8083/",request="/apache_pb.gif",resp_bytes=2326i,resp_code=200i 971211336000000000
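For reference, a line-protocol point has the general shape below; the grok capture names become the tags and fields, and the log’s timestamp is converted to a nanosecond-precision epoch:
measurement,tag_key=tag_value field_key="string value",int_field=0i nanosecond_timestamp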
Parsing Access Logs and Writing Real Data into InfluxDB
The next steps will build on the previous test data and parse real access log data into InfluxDB.
These steps will deal with parsing an Nginx log file. If you don’t have one handy, you can download an example here.
Step 3 -- Install InfluxDB
InfluxDB packages are available for many different platforms from the downloads page.
# Install on Ubuntu:
wget https://dl.influxdata.com/influxdb/releases/influxdb_1.0.0_amd64.deb
sudo dpkg -i influxdb_1.0.0_amd64.deb
service influxdb start
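To confirm the database is up, you can hit InfluxDB’s /ping endpoint, which should return an HTTP 204:
curl -i http://localhost:8086/ping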
Step 4 -- Configure Telegraf
First we will rename the default Telegraf config file:
mv /etc/telegraf/telegraf.conf /etc/telegraf/telegraf.conf.orig
Now write a new config file:
echo '
[[inputs.logparser]]
## files to tail.
files = ["/var/log/nginx/access.log"]
## Read file from beginning.
from_beginning = true
## Override the default measurement name, which would be "logparser_grok"
name_override = "nginx_access_log"
## For parsing logstash-style "grok" patterns:
[inputs.logparser.grok]
patterns = ["%{COMBINED_LOG_FORMAT}"]
[[outputs.influxdb]]
## The full HTTP or UDP endpoint URL for your InfluxDB instance.
urls = ["http://localhost:8086"] # required
## The target database for metrics (telegraf will create it if not exists).
database = "telegraf" # required
## Write timeout (for the InfluxDB client), formatted as a string.
timeout = "5s"
' >> /etc/telegraf/telegraf.conf
This time our config file will assume that you are running an InfluxDB instance locally. We will also be tailing an Nginx log file at /var/log/nginx/access.log. If you don’t have an Nginx access log handy, you can download an example here.
Step 5 -- Parsing the Nginx Access Log
Now we can start the Telegraf service daemon:
service telegraf start
We won’t see metrics getting written to STDOUT, but we should see log lines indicating that metrics are getting written to InfluxDB. The Telegraf service log is located at /var/log/telegraf/telegraf.log.
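You can follow it while Telegraf runs:
tail -f /var/log/telegraf/telegraf.log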
2016/06/16 15:32:50 Output [influxdb] buffer fullness: 600 / 10000 metrics. Total gathered metrics: 600. Total dropped metrics: 0.
2016/06/16 15:32:50 Output [influxdb] wrote batch of 600 metrics in 5.809µs
Step 6 -- Exploring the Data in InfluxDB
When we installed InfluxDB, we also installed the Influx CLI client for connecting to and querying the database. We can launch this by simply running influx and using the telegraf database.
$ influx
Connected to http://localhost:8086 version 1.0.0
InfluxDB shell version: 1.0.0
> USE telegraf
Using database telegraf
> precision rfc3339
>
First let’s query our parsed data and see how many metrics we’ve got:
> SELECT count(resp_code) FROM nginx_access_log
name: nginx_access_log
----------------------
time count
1970-01-01T00:00:00Z 600
Perfect! This is the number of parsable metrics in the example log file.
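If you’d rather see how those requests are spread over time than as a single total, you can group the count into intervals; a sketch, with time bounds you would adjust to your data:
> SELECT count(resp_code) FROM nginx_access_log WHERE time >= '2011-08-22T01:00:00Z' AND time < '2011-08-22T02:00:00Z' GROUP BY time(10m)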
Now let’s see how many errors we’ve returned using the count() function:
> SELECT count(resp_code) FROM nginx_access_log WHERE resp_code >= 400
name: nginx_access_log
----------------------
time count
1970-01-01T00:00:00Z 71
71 errors is quite a bit for only 600 requests in total! Let’s take a peek at a few of those errors using * to grab all fields and LIMIT to limit the number returned.
> SELECT * FROM nginx_access_log WHERE resp_code >= 400 LIMIT 5
name: nginx_access_log
----------------------
time agent auth client_ip host http_version ident referrer request resp_bytes resp_code verb
2011-08-22T01:02:01.005Z okhttp/2.4.0 - 192.168.1.2 tars 1.1 - - /huh 0 404 GET
2011-08-22T01:02:01.008Z Python-urllib/2.7 - 127.0.0.1 tars 1.1 - - /huh 0 404 GET
2011-08-22T01:02:02.005Z Firefox/5.0 - 127.0.0.1 tars 1.1 - - /huh 0 404 GET
2011-08-22T01:02:03.006Z Firefox/5.0 - 127.0.0.1 tars 1.1 - - /huh 0 404 GET
2011-08-22T01:02:04.009Z Firefox/5.0 - 127.0.0.1 tars 1.1 - - /huh 0 404 GET
Looks like we’ve got some 404s. If we look closely, we can see that each of the 404s is coming from requests on the /huh endpoint. Let’s see if every 404 is on that endpoint; for this we will use the distinct() function.
> SELECT distinct(request) FROM nginx_access_log WHERE resp_code >= 400
name: nginx_access_log
----------------------
time distinct
1970-01-01T00:00:00Z /huh
1970-01-01T00:00:00Z /foo
Looks like there might be a few people getting 404 from the /foo endpoint as well, so let’s see how many:
> SELECT count(resp_code) FROM nginx_access_log WHERE resp_code >= 400 AND request = '/foo'
name: nginx_access_log
----------------------
time count
1970-01-01T00:00:00Z 1
Great! Since there’s only one, this means that the main issue is users querying the /huh endpoint, and now we can deal with finding out where that errant endpoint is coming from.
Customizing the Logparser Plugin
Now that we’ve covered how to parse data in the standard “Combined Log Format”, let’s look into parsing logs that have additional fields. Apache and Nginx allow for a high level of customization in their log output, and so does the logparser plugin.
Let’s say that our Apache admin added two fields to the end of each log line:
- the ID of the request and
- the response time in microseconds.
While the response time is likely very useful from an analytics perspective, the ID is probably not as useful, so let’s come up with a grok pattern to deal with this.
Say the log lines look like this (see the full file here):
127.0.0.1 - - [22/Aug/2011:10:02:59 +0900] "GET /huh HTTP/1.1" 404 0 "-" "Firefox/5.0" f18064bf-a4c5-4fd9-9cf8-37aff27382c5 3648
This is the Combined Log Format with the addition of a UUID and an integer at the end. So, how do we go about parsing this?
The Telegraf Grok Logparser
In order to parse a custom log file we must create a custom “grok” pattern for Telegraf. The logparser plugin uses a slightly modified version of Logstash “grok” patterns, with the format:
%{<capture_syntax>[:<semantic_name>][:<modifier>]}
Telegraf has many of its own built-in patterns, as well as supporting Logstash’s built-in patterns. The best way to get acquainted with grok patterns is to read the Logstash docs, which are available here.
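For instance, a pattern like %{WORD:verb:tag} (assuming Telegraf’s tag modifier, which stores a capture as an InfluxDB tag rather than a field) breaks down as:
WORD -> the capture_syntax: the built-in pattern used to match the text
verb -> the semantic_name: the name given to the captured value
tag  -> the modifier: store the value as a tag instead of a field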
Setting Up a Custom Log Pattern
Telegraf allows for specifying custom patterns in a separate file, or directly in the config file. Let’s change our config file to include our custom pattern.
service telegraf stop
vim /etc/telegraf/telegraf.conf
telegraf.conf:
[[inputs.logparser]]
## files to tail.
files = ["/var/log/myapp/access.log"]
## Read file from beginning.
from_beginning = true
## Override the default measurement name, which would be "logparser_grok"
name_override = "my_log"
## For parsing logstash-style "grok" patterns:
[inputs.logparser.grok]
patterns = ["%{CUSTOM_LOG}"]
custom_patterns = '''
CUSTOM_LOG %{COMBINED_LOG_FORMAT} %{UUID} %{NUMBER:response_time_us:int}
'''
[[outputs.influxdb]]
## The full HTTP or UDP endpoint URL for your InfluxDB instance.
urls = ["http://localhost:8086"] # required
## The target database for metrics (telegraf will create it if not exists).
database = "telegraf" # required
## Write timeout (for the InfluxDB client), formatted as a string.
timeout = "5s"
Notice that we need to add the CUSTOM_LOG pattern, as well as tell Telegraf to use the %{CUSTOM_LOG} pattern when parsing.
Let’s go over the fields in this pattern:
- CUSTOM_LOG is the name of the pattern.
- %{COMBINED_LOG_FORMAT} is the name of the built-in pattern.
- %{UUID} uses the Logstash built-in UUID pattern to recognize this portion of the log line, but does not store it.
- %{NUMBER:response_time_us:int} uses the Logstash built-in NUMBER pattern. It names this field response_time_us and casts it to an int.
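With this pattern in place, the sample log line above should come out as line protocol roughly like the following (a sketch: the host tag, tag/field layout, and exact timestamp will depend on your setup):
my_log,host=tars,verb=GET agent="Firefox/5.0",auth="-",client_ip="127.0.0.1",http_version=1.1,ident="-",referrer="-",request="/huh",resp_bytes=0i,resp_code=404i,response_time_us=3648i 1313974979000000000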
Parsing the Log & Analyzing the Data
Now that our config file is set up, we can run the Telegraf service again (but first make sure that you have a file at /var/log/myapp/access.log; you can grab the example here).
service telegraf start
And once the metrics are written, we can enter the influx CLI again:
$ influx
Connected to http://localhost:8086 version 1.0.0
InfluxDB shell version: 1.0.0
> USE telegraf
Using database telegraf
> precision rfc3339
And now that we have access to the response times, let’s see what our average & 95th percentile response times are:
> SELECT mean(response_time_us),percentile(response_time_us,95) FROM my_log
name: my_log
------------
time mean percentile
1970-01-01T00:00:00Z 912411.36 10078023
Yikes, seems a bit high, so let’s peek at some of the metrics that have a very large response time:
> SELECT * FROM my_log WHERE response_time_us > 10078023 LIMIT 5
name: my_log
------------
time agent auth client_ip host http_version ident referrer request resp_bytes resp_code response_time_us verb
2011-08-22T01:02:01.008Z Firefox/5.0 userb 127.0.0.1 tars 1.1 - - /large.jpg 10485760 200 10181259 GET
2011-08-22T01:02:04.004Z Firefox/5.0 usera 127.0.0.1 tars 1.1 - - /large.jpg 10485760 200 10110514 GET
2011-08-22T01:02:06.004Z Python-urllib/2.7 - 127.0.0.1 tars 1.1 - - /large.jpg 10485760 200 10132756 GET
2011-08-22T01:02:10.002Z Firefox/5.0 userb 192.168.1.2 tars 1.1 - - /large.jpg 10485760 200 10101485 GET
2011-08-22T01:02:11.006Z Firefox/5.0 - 127.0.0.1 tars 1.1 - - /large.jpg 10485760 200 10144209 GET
Aha! These are serving up a very large image file. Let’s ignore these and view what our response time is in cases with smaller requests:
> SELECT mean(response_time_us),percentile(response_time_us,95) FROM my_log WHERE resp_bytes < 4096
name: my_log
------------
time mean percentile
1970-01-01T00:00:00Z 4003.456043956044 7633
>
Great! That looks a bit better.
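As one last sketch, if verb is stored as a tag (as assumed in the examples above), you could break response times down by HTTP method:
> SELECT mean(response_time_us) FROM my_log GROUP BY verb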
What's Next?
- Download and get started with InfluxDB 1.3.1.
- Looking for InfluxDB clustering on your infrastructure? Get started with InfluxDB Enterprise, now available for evaluation.
- Schedule a FREE 20 minute consultation with a Solutions Architect to review your InfluxDB project.