Logging Improvements for InfluxDB 1.5.0
By Stuart Carnie
Mar 06, 2018
When it comes to troubleshooting issues, log files are a high-valued asset. If things go wrong, you’ll almost always be asked to “send the logs”. InfluxDB 1.5 comes with a number of improvements to logging in an effort to simplify the task of analyzing this data.
Logging
InfluxDB generates a lot of log output that chronicles many aspects of its internal operation. This latest update has revamped how the database logs key information and the format of the log output. The primary goal of these changes is to enable tooling which can efficiently parse and analyze the log data and reduce the time required to diagnose issues.
Structured Logging
InfluxDB 1.5 can generate structured logs in either logfmt or JSON lines. This feature is found in a new section of the configuration file titled [logging]. The default log format is auto, which determines the output based on whether stderr refers to a terminal (TTY) or not. If stderr is a TTY, a less verbose "console" format is selected; otherwise, the output will be logfmt. If you would prefer consistent output, we encourage you to explicitly set the format to logfmt.
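For example, forcing logfmt output looks like this in the configuration file (a minimal sketch; only the format key is shown, and any other [logging] settings keep their defaults):

```toml
[logging]
  # Always emit logfmt, regardless of whether stderr is a TTY
  format = "logfmt"
```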
The value of structured logging is greatly improved when specific elements of a log event can be extracted easily. With that in mind, the existing logging code was reviewed to ensure notable data was moved to separate keys. The following example shows a few startup events related to opening files in version 1.3. Note some entries include the path and duration embedded in the message:
[I] 2018-03-03T00:46:48Z /Users/stuartcarnie/.influxdb/data/db/autogen/510/000000001-000000001.tsm (#0) opened in 280.401827ms engine=tsm1 service=filestore
[I] 2018-03-03T00:46:48Z reading file /Users/stuartcarnie/.influxdb/wal/db/autogen/510/_00001.wal, size 15276 engine=tsm1 service=cacheloader
[I] 2018-03-03T00:46:54Z /Users/stuartcarnie/.influxdb/data/db/autogen/510 opened in 5.709152422s service=store
As of 1.5, the same events using logfmt now look like:
ts=2018-03-03T00:48:57.246371Z lvl=info msg="Opened file" log_id=06bLAgTG000 engine=tsm1 service=filestore path=/Users/stuartcarnie/.influxdb/data/db/autogen/510/000000001-000000001.tsm id=0 duration=61.736ms
ts=2018-03-03T00:48:57.246590Z lvl=info msg="Reading file" log_id=06bLAgTG000 engine=tsm1 service=cacheloader path=/Users/stuartcarnie/.influxdb/wal/db/autogen/510/_00001.wal size=15276
ts=2018-03-03T00:49:01.023313Z lvl=info msg="Opened shard" log_id=06bLAgTG000 service=store trace_id=06bLAgv0000 op_name=tsdb_open path=/Users/stuartcarnie/.influxdb/data/db/autogen/510 duration=3841.275ms
Take note that path and duration are now separate keys. Using a tool like lcut, we can select specific keys (ts, msg, path and duration) to reformat the output:
2018-03-03T00:48:57.246371Z Opened file /Users/stuartcarnie/.influxdb/data/db/autogen/510/000000001-000000001.tsm 61.736ms
2018-03-03T00:48:57.246590Z Reading file /Users/stuartcarnie/.influxdb/wal/db/autogen/510/_00001.wal
2018-03-03T00:49:01.023313Z Opened shard /Users/stuartcarnie/.influxdb/data/db/autogen/510 3841.275ms
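If lcut is not installed, the same key selection takes only a few lines of scripting. Here is a rough Python sketch (not the tooling used above; the logfmt parsing is simplified and assumes quoted values contain no escaped quotes):

```python
import re
import sys

# Simplified logfmt pair: key=value or key="quoted value"
PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')

def select(line, keys):
    """Return the requested keys from one logfmt line, in order."""
    pairs = {k: v.strip('"') for k, v in PAIR.findall(line)}
    return " ".join(pairs[k] for k in keys if k in pairs)

if __name__ == "__main__":
    for line in sys.stdin:
        print(select(line, ["ts", "msg", "path", "duration"]))
```

Keys that are absent from a given event (such as duration on the "Reading file" line) are simply skipped.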
Collating Log Events
During the lifetime of an InfluxDB process, operations such as compactions run continuously and generate multiple events as they advance. To further complicate matters, when these operations run concurrently, the events from each will be interleaved together. Determining the outcome of a specific compaction is practically impossible, as there is no way to associate log events for each independently. To address this issue, InfluxDB 1.5 adds the keys listed in the following table:
key | comment
trace_id | A unique value associated with each log event for a single run of an operation
op_name | A searchable identifier, such as tsm1_compact_group, assigned to each operation type
op_event | Given a value of start or end to indicate whether this is the first or last event of the operation
op_elapsed | The time an operation took to complete, in milliseconds. This key is always included with the op_event=end log event
To demonstrate how these keys might be used, we’ll use the following (abridged) file named “influxd.log”, which includes at least two compactions running concurrently.
msg="TSM compaction (start)" trace_id=06avQESl000 op_name=tsm1_compact_group op_event=start
msg="Beginning compaction" trace_id=06avQESl000 op_name=tsm1_compact_group tsm1_files_n=2
msg="Compacting file" trace_id=06avQESl000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/influxdb/data/db1/rp/10/000000859-000000002.tsm
msg="Compacting file" trace_id=06avQESl000 op_name=tsm1_compact_group tsm1_index=1 tsm1_file=/influxdb/data/db1/rp/10/000000861-000000001.tsm
msg="TSM compaction (start)" trace_id=06avQEZW000 op_name=tsm1_compact_group op_event=start
msg="Beginning compaction" trace_id=06avQEZW000 op_name=tsm1_compact_group tsm1_files_n=2
msg="invalid subscription token" service=subscriber
msg="Post http://kapacitor-rw:9092/write?consistency=&db=_internal&precision=ns&rp=monitor: dial tcp: lookup kapacitor-rw on 10.0.0.1: server misbehaving" service=subscriber
msg="Compacting file" trace_id=06avQEZW000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/influxdb/data/db2/rp/12/000000027-000000002.tsm
msg="Compacting file" trace_id=06avQEZW000 op_name=tsm1_compact_group tsm1_index=1 tsm1_file=/influxdb/data/db2/rp/12/000000029-000000001.tsm
msg="Compacted file" trace_id=06avQEZW000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/influxdb/data/db2/rp/12/000000029-000000002.tsm.tmp
msg="Finished compacting files" trace_id=06avQEZW000 op_name=tsm1_compact_group tsm1_files_n=1
msg="TSM compaction (end)" trace_id=06avQEZW000 op_name=tsm1_compact_group op_event=end op_elapsed=56.907ms
msg="Compacted file" trace_id=06avQESl000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/influxdb/data/db1/rp/10/000000861-000000002.tsm.tmp
msg="Finished compacting files" trace_id=06avQESl000 op_name=tsm1_compact_group tsm1_files_n=1
msg="TSM compaction (end)" trace_id=06avQESl000 op_name=tsm1_compact_group op_event=end op_elapsed=157.739ms
Compactions are identified by op_name=tsm1_compact_group, so to summarize them, we might use the following command to output the trace id and elapsed time:
$ fgrep 'tsm1_compact_group' influxd.log | fgrep 'op_event=end' | lcut trace_id op_elapsed
which can be read as:
Find all the lines containing the text tsm1_compact_group and op_event=end, then display the trace_id and op_elapsed keys
and would produce the following output:
06avQEZW000 56.907ms
06avQESl000 157.739ms
From here it is easy to filter the logs for trace 06avQESl000 using the following:
$ fgrep '06avQESl000' influxd.log | lcut msg tsm1_file
TSM compaction (start)
Beginning compaction
Compacting file /influxdb/data/db1/rp/10/000000859-000000002.tsm
Compacting file /influxdb/data/db1/rp/10/000000861-000000001.tsm
Compacted file /influxdb/data/db1/rp/10/000000861-000000002.tsm.tmp
Finished compacting files
TSM compaction (end)
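This kind of collation is also easy to script. The following Python sketch groups events by trace_id so that each operation can be read in order (key names are taken from the log samples above; the logfmt parsing is simplified and assumes quoted values contain no escaped quotes):

```python
import re
from collections import defaultdict

# Simplified logfmt pair: key=value or key="quoted value"
PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse(line):
    """Parse one logfmt line into a dict, stripping surrounding quotes."""
    return {k: v.strip('"') for k, v in PAIR.findall(line)}

def collate(lines):
    """Group log events by trace_id; events without one are dropped."""
    traces = defaultdict(list)
    for line in lines:
        event = parse(line)
        if "trace_id" in event:
            traces[event["trace_id"]].append(event)
    return traces

if __name__ == "__main__":
    with open("influxd.log") as f:
        for trace_id, events in collate(f).items():
            print(trace_id)
            for e in events:
                print("  " + e.get("msg", ""))
```

Events that carry no trace_id, such as the subscriber errors interleaved in the sample above, fall out naturally because they never receive the key.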
We can ask more complex questions, such as "What are the top 10 slowest continuous queries?", using a command like:
$ fgrep 'continuous_querier_execute' influxd.log | fgrep 'op_event=end' | lcut trace_id op_elapsed | sort -r -h -k2 | head -10
06eXrSJG000 15007.940ms
06d7Ow3W000 15007.646ms
06axkRVG000 15007.222ms
06ay9170000 15007.118ms
06c9tbwG000 15006.701ms
06dUcXhG000 15006.533ms
06ekMi40000 15006.158ms
06c5FH7l000 15006.145ms
06bDHhkG000 15006.012ms
06a~ioYG000 15005.988ms
In summary, structured logging will enable us to analyze log data more efficiently with off-the-shelf tooling.
HTTP Access Logs
InfluxDB has long supported the ability to output HTTP request traffic in Common Log Format. This feature is enabled by setting the log-enabled option to true in the [http] section of the InfluxDB configuration.
Prior to 1.5, all log output was sent to stderr
and looked something like the following:
[I] 2018-03-02T19:59:58Z compacted level 1 4 files into 1 files in 130.832391ms engine=tsm1
[I] 2018-03-02T20:00:09Z SELECT count(v0) FROM db.autogen.m0 WHERE time >= '2018-02-27T01:00:00Z' AND time < '2018-02-27T02:00:00Z' GROUP BY * service=query
[httpd] ::1 - - [02/Mar/2018:13:00:09 -0700] "GET /query?db=db&q=select+count%28v0%29+from+m0+where+time+%3E%3D+%272018-02-27T01%3A00%3A00Z%27+and+time+%3C+%272018-02-27T02%3A00%3A00Z%27+group+by+%2A HTTP/1.1" 200 0 "-" "curl/7.54.0" 4f39378e-1e54-11e8-8001-000000000000 726
[I] 2018-03-02T20:00:57Z retention policy shard deletion check commencing service=retention
Intermingling the log event streams requires additional work to separate them before any analysis can begin. The latest release adds a configuration option to write the access log to a separate file. For example, the following configuration writes the access log to a file located at /var/log/influxd/access.log:
[http]
# ...
log-enabled = true
access-log-path = "/var/log/influxd/access.log"
# ...
When the access log is written to a file, the [httpd] prefix is stripped so that the file can be parsed by standard HTTP log analysis and monitoring tools without further processing. For example, using lnav, an admin can open an active log file and display the data using a number of different visualizations, including error rates and histograms, as demonstrated in the following asciinema.
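Because the file now contains only standard-format lines, it can also be consumed directly by small scripts. As an illustration, this Python sketch tallies response status codes (the regex covers the standard Combined Log Format fields shown in the earlier sample; InfluxDB's trailing request-id and timing fields are left unparsed):

```python
import re
from collections import Counter

# Combined Log Format: host ident user [time] "request" status size ...
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def status_counts(lines):
    """Count HTTP status codes in an access log; skip unparseable lines."""
    counts = Counter()
    for line in lines:
        m = CLF.match(line)
        if m:
            counts[m.group("status")] += 1
    return counts

if __name__ == "__main__":
    with open("/var/log/influxd/access.log") as f:
        for status, n in status_counts(f).most_common():
            print(status, n)
```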