InfluxDB Internals 101 - Part One
By Ryan Betts | Oct 27, 2017
Paul Dix led a series of internal InfluxDB 101 sessions to teach newcomers InfluxDB internals. I learned a lot from the talks and want to share the content with the community. I’m also writing this to organize my own understanding of InfluxDB and to perhaps help others who want to learn how InfluxDB is architected. A lot of this information is gathered from the InfluxDB documentation; the goal with this series is to present a consolidated overview of the InfluxDB architecture.
There’s a lot to digest, so it’s presented in three parts. This first post explains the data model and the write path. Post two explains the query path. Post three explains InfluxDB Enterprise clustering.
Series Table of Contents
- Data model and write path: adding data to InfluxDB
  - Data model terminology
  - Receiving points from clients
  - Persisting points to storage
  - Compacting persisted points
- Query path: reading data from InfluxDB
  - Indexing points for query
  - A note on TSI (on-disk indexes)
  - Parsing and planning
  - Executing queries
  - A note on IFQL
  - DELETE and DROP - removing data from InfluxDB
  - Updating points
- Clustering: InfluxDB Enterprise
  - Understanding the meta-service
  - Understanding data-nodes
  - Understanding data distribution and replication
Data model and write path: adding data to InfluxDB
Data model and terminology
An InfluxDB database stores points. A point has four components: a measurement, a tagset, a fieldset, and a timestamp.
The measurement provides a way to associate related points that might have different tagsets or fieldsets. The tagset is a dictionary of key-value pairs to store metadata with a point. The fieldset is a set of typed scalar values: the data being recorded by the point.
The serialization format for points is defined by the line protocol (which includes additional examples and explanations if you’d like more detail). An example point from the specification helps to explain the terminology:
temperature,machine=unit42,type=assembly internal=32,external=100 1434055562000000035
The measurement is temperature.
The tagset is machine=unit42,type=assembly. The keys, machine and type, in the tagset are called tag keys. The values, unit42 and assembly, in the tagset are called tag values.
The fieldset is internal=32,external=100. The keys, internal and external, in the fieldset are called field keys. The values, 32 and 100, in the fieldset are called field values.
Each point is stored within exactly one database within exactly one retention policy. A database is a container for users, retention policies, and points. A retention policy configures how long InfluxDB keeps points (duration), how many copies of those points are stored in the cluster (replication factor), and the time range covered by shard groups (shard group duration). The retention policy makes it easy for users (and efficient for the database) to drop older data that is no longer needed. This is a common pattern in time series applications.
We’ll explain replication factor, shard groups, and shards later when we describe how the write path works in InfluxDB.
There’s one additional term that we need to get started: series. A series is a group of points that share a measurement + tagset + field key.
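As a rough illustration only (the sketch below assumes a made-up key format; InfluxDB’s actual internal encoding differs), you can think of a series as being identified by a key built from the measurement, the sorted tagset, and the field key:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey builds an illustrative series identifier from a measurement, a
// tagset, and a field key. InfluxDB's real on-disk encoding is different;
// this only shows which parts define a series.
func seriesKey(measurement string, tags map[string]string, fieldKey string) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys) // sort tag keys so the key is deterministic

	parts := []string{measurement}
	for _, k := range keys {
		parts = append(parts, k+"="+tags[k])
	}
	return strings.Join(parts, ",") + "#" + fieldKey
}

func main() {
	tags := map[string]string{"machine": "unit42", "type": "assembly"}
	fmt.Println(seriesKey("temperature", tags, "internal"))
	// Output: temperature,machine=unit42,type=assembly#internal
}
```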
You can refer to the documentation glossary for these terms or others that might be used in this blog post series.
Receiving Points from Clients
Clients POST points (in line protocol format) to InfluxDB’s HTTP /write endpoint. Points can be sent individually; however, for efficiency, most applications send points in batches. A typical batch ranges in size from hundreds to thousands of points. The POST specifies a database and an optional retention policy via query parameters. If the retention policy is not specified, the default retention policy is used. All points in the body will be written to that database and retention policy. Points in a POST body can be from an arbitrary number of series; points in a batch do not have to be from the same measurement or tagset.
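As a concrete example, here’s a minimal Go sketch that POSTs a two-point batch in line protocol to the /write endpoint; the host, the database name mydb, and the retention policy autogen are placeholders for your own setup:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// A small batch: two points in line protocol, one per line.
	batch := strings.Join([]string{
		"temperature,machine=unit42,type=assembly internal=32,external=100 1434055562000000035",
		"temperature,machine=unit43,type=assembly internal=30,external=98 1434055562000000035",
	}, "\n")

	// db selects the database; rp optionally selects the retention policy.
	url := "http://localhost:8086/write?db=mydb&rp=autogen"
	resp, err := http.Post(url, "text/plain; charset=utf-8", strings.NewReader(batch))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status) // a successful write returns 204 No Content
}
```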
When the database receives new points, it must (1) make those points durable so that they can be recovered in case of a database or server crash and (2) make the points queryable. This post focuses on the first half, making points durable.
Persisting Points to Storage
To make points durable, each batch is written and fsynced to a write ahead log (WAL). The WAL is an append-only file that is only read during a database recovery. For space and disk IO efficiency, each batch in the WAL is compressed using snappy compression before being written to disk.
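To illustrate the idea (this is a simplified assumption about framing, not InfluxDB’s actual WAL entry format), an append-only, snappy-compressed, fsynced log might look something like this sketch:

```go
package main

import (
	"encoding/binary"
	"os"

	"github.com/golang/snappy"
)

// appendBatch snappy-compresses a serialized batch of points, appends it to
// the WAL segment with a simple length prefix (illustrative framing only),
// and fsyncs so the batch survives a crash before it is acknowledged.
func appendBatch(wal *os.File, batch []byte) error {
	compressed := snappy.Encode(nil, batch)

	var lenBuf [4]byte
	binary.BigEndian.PutUint32(lenBuf[:], uint32(len(compressed)))

	if _, err := wal.Write(lenBuf[:]); err != nil {
		return err
	}
	if _, err := wal.Write(compressed); err != nil {
		return err
	}
	return wal.Sync() // fsync: make the appended batch durable
}

func main() {
	f, err := os.OpenFile("_00001.wal", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	batch := []byte("temperature,machine=unit42,type=assembly internal=32,external=100 1434055562000000035")
	if err := appendBatch(f, batch); err != nil {
		panic(err)
	}
}
```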
While the WAL format efficiently makes incoming data durable, it is an exceedingly poor format for reading, making it unsuitable for supporting queries. To allow immediate queryability of new data, incoming points are also written to an in-memory cache. The cache is an in-memory data structure that is optimized for query and insert performance. The cache data structure is a map of series to a time-sorted list of fields.
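A greatly simplified sketch of that structure, with made-up names and none of the real cache’s concurrency control, typed values, or memory limits, might look like this:

```go
package main

import (
	"fmt"
	"sort"
)

// entry is one field value at one timestamp.
type entry struct {
	ts    int64
	value float64
}

// cache maps a series key (measurement + tagset + field key) to a
// time-sorted slice of values.
type cache map[string][]entry

// add inserts a value, keeping the per-series slice ordered by timestamp so
// new points are immediately readable in time order.
func (c cache) add(series string, e entry) {
	values := c[series]
	i := sort.Search(len(values), func(j int) bool { return values[j].ts >= e.ts })
	values = append(values, entry{})
	copy(values[i+1:], values[i:])
	values[i] = e
	c[series] = values
}

func main() {
	c := cache{}
	key := "temperature,machine=unit42,type=assembly#internal"
	c.add(key, entry{ts: 1434055562000000035, value: 32})
	c.add(key, entry{ts: 1434055561000000000, value: 31})
	fmt.Println(c[key]) // values come back ordered by timestamp
}
```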
The WAL makes new points durable. The cache makes new points queryable. If the system crashes or shuts down before the cache is written to TSM files, it is rebuilt when the database starts by reading and replaying the batches stored in the WAL.
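Conceptually, recovery is a replay loop over the WAL, roughly like the sketch below (it assumes the same made-up framing as the WAL sketch above, not InfluxDB’s real segment format):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"

	"github.com/golang/snappy"
)

// replayWAL reads length-prefixed, snappy-compressed batches back out of a
// WAL segment and hands each decoded batch to rebuild, which would re-insert
// its points into the in-memory cache.
func replayWAL(path string, rebuild func(batch []byte) error) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	var lenBuf [4]byte
	for {
		if _, err := io.ReadFull(f, lenBuf[:]); err == io.EOF {
			return nil // end of segment: replay complete
		} else if err != nil {
			return err
		}
		compressed := make([]byte, binary.BigEndian.Uint32(lenBuf[:]))
		if _, err := io.ReadFull(f, compressed); err != nil {
			return err
		}
		batch, err := snappy.Decode(nil, compressed)
		if err != nil {
			return err
		}
		if err := rebuild(batch); err != nil {
			return err
		}
	}
}

func main() {
	// Print each recovered batch; a real recovery re-inserts the points into
	// the cache instead.
	err := replayWAL("_00001.wal", func(batch []byte) error {
		fmt.Printf("%s\n", batch)
		return nil
	})
	if err != nil {
		panic(err)
	}
}
```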
The combination of WAL and cache works well for incoming data but is insufficient for long-term storage. Since the WAL must be replayed on startup, it is important to constrain it to a reasonable size. The cache is limited to the size of RAM, which is also undesirable for many time series use cases. Consequently, data needs to be organized and written to long-term storage blocks on disk that are size-efficient (so that the database can store a lot of points) and efficient for query.
Time series queries are frequently aggregations over time: scans of points within a bounded time range that are then reduced by a summary function like mean, max, or moving windows. Columnar database storage techniques, where data is organized on disk by column and not by row, fit this query pattern nicely. Additionally, columnar systems compress data exceptionally well, satisfying the need to store data efficiently. There is a lot of literature on column stores; Column-oriented Database Systems is one such overview.
Time series applications often evict data from storage after a period of time. Many monitoring applications, for example, will store the last month or two of data online to support monitoring queries. It needs to be efficient to remove data from the database if a configured time-to-live expires. Deleting points from columnar storage is expensive, so InfluxDB additionally organizes its columnar format into time-bounded chunks. When the time-to-live expires, the time-bounded file can simply be deleted from the filesystem rather than requiring a large update to persisted data.
Finally, when InfluxDB is run as a clustered system, it replicates data across multiple servers for availability and durability in case of failures.
The optional time-to-live duration, the granularity of time blocks within the time-to-live period, and the number of replicas are configured using an InfluxDB retention policy:
CREATE RETENTION POLICY <retention_policy_name> ON <database_name> DURATION <duration> REPLICATION <n> [SHARD DURATION <duration>] [DEFAULT]
The duration is the optional time to live (if data should not expire, set duration to INF). SHARD DURATION is the granularity of data within the expiration period. For example, a one-hour shard duration with a 24-hour duration configures the database to store 24 one-hour shards. Each hour, the oldest shard is expired (removed) from the database. Set REPLICATION to configure the replication factor: how many copies of a shard should exist within a cluster.
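Here’s a small sketch of the arithmetic behind that example, with the 24-hour duration and one-hour shard duration as placeholders: a point’s timestamp is truncated to the shard duration to find its shard group, and a shard group is dropped once its newest possible point is older than the retention duration.

```go
package main

import (
	"fmt"
	"time"
)

const (
	retention     = 24 * time.Hour // DURATION: how long points are kept
	shardDuration = time.Hour      // SHARD DURATION: granularity of expiry
)

// shardGroupStart returns the start of the shard group that owns a timestamp.
func shardGroupStart(t time.Time) time.Time {
	return t.Truncate(shardDuration)
}

// expired reports whether a whole shard group can be dropped: its newest
// possible point is already older than the retention duration.
func expired(groupStart, now time.Time) bool {
	return now.Sub(groupStart.Add(shardDuration)) > retention
}

func main() {
	now := time.Now().UTC()
	pointTime := now.Add(-30 * time.Hour) // a point written 30 hours ago

	group := shardGroupStart(pointTime)
	fmt.Println("shard group start:", group)
	fmt.Println("expired:", expired(group, now)) // true: beyond the 24h retention
}
```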
Concretely, the database creates this physical organization of data on disk:
Database directory: /db
  Retention policy directory: /db/rp
    Shard group (time bounded, logical)
      Shard directory: /db/rp/<shard id>
        TSM0001.tsm (data file)
        TSM0002.tsm (data file)
        ...
The in-memory cache is flushed to disk in the TSM format. When the flush completes, flushed points are removed from the cache and the corresponding WAL is truncated. (The WAL and cache are also maintained per-shard.) The TSM data files store the columnar-organized points. Once written, a TSM file is immutable. A detailed description of the TSM file layout is available in the InfluxDB documentation.
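In outline, the flush cycle looks roughly like this sketch (the file name and encoding are placeholders; a real TSM file has its own columnar layout), with the important property that the WAL is only truncated after the flushed data is safely on disk:

```go
package main

import (
	"fmt"
	"os"
)

// flush writes the cache contents to a new file in the shard directory
// (standing in for a real TSM file, which uses its own columnar encoding),
// clears the flushed points from the cache, and only then truncates the WAL.
func flush(cache map[string][]float64, shardDir, walPath string) error {
	f, err := os.Create(shardDir + "/TSM0001.tsm")
	if err != nil {
		return err
	}
	defer f.Close()

	for series, values := range cache {
		// Placeholder for the real TSM block encoding.
		if _, err := fmt.Fprintln(f, series, values); err != nil {
			return err
		}
	}
	if err := f.Sync(); err != nil {
		return err // do not touch the WAL unless the TSM data is on disk
	}

	for series := range cache {
		delete(cache, series) // flushed points leave the cache
	}
	return os.Truncate(walPath, 0) // flushed batches leave the WAL
}

func main() {
	// Create a stand-in WAL file so the example runs end to end.
	if err := os.WriteFile("_00001.wal", []byte("batch"), 0o644); err != nil {
		panic(err)
	}
	cache := map[string][]float64{
		"temperature,machine=unit42,type=assembly#internal": {31, 32},
	}
	if err := flush(cache, ".", "_00001.wal"); err != nil {
		panic(err)
	}
}
```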
Compacting TSM Data
The cache is a relatively small amount of data. The TSM columnar format works best when it can store long runs of values for a series in a single block. A longer run produces better compression and reduces the seeks needed to scan a field for a query. The TSM format is based heavily on log-structured merge trees. New (level one) TSM files are generated by cache flushes. These files are later combined (compacted) into level two files. Level two files are further combined into level three files. Additional levels of compaction occur as the files become larger and eventually become cold (the time range they cover is no longer hot for writes). The documentation reference above offers a detailed description of compaction.
There’s a lot of logic and sophistication in the TSM compaction code. However, the high-level goal is quite simple: organize values for a series together into long runs to best optimize compression and scanning queries.
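As a toy illustration of that goal, and not the actual compaction planner, merging two time-sorted runs for the same series into one longer run looks like this:

```go
package main

import "fmt"

// sample is one (timestamp, value) pair for a series.
type sample struct {
	ts    int64
	value float64
}

// mergeRuns merges two time-sorted runs for the same series into one longer
// sorted run, the way a compaction combines blocks from lower-level TSM files
// into larger blocks in a higher-level file.
func mergeRuns(a, b []sample) []sample {
	out := make([]sample, 0, len(a)+len(b))
	for len(a) > 0 && len(b) > 0 {
		if a[0].ts <= b[0].ts {
			out = append(out, a[0])
			a = a[1:]
		} else {
			out = append(out, b[0])
			b = b[1:]
		}
	}
	out = append(out, a...)
	return append(out, b...)
}

func main() {
	level1a := []sample{{ts: 1, value: 30}, {ts: 3, value: 31}}
	level1b := []sample{{ts: 2, value: 32}, {ts: 4, value: 33}}
	fmt.Println(mergeRuns(level1a, level1b)) // [{1 30} {2 32} {3 31} {4 33}]
}
```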
Concluding Part One
In summary, batches of points are POSTed to InfluxDB. Those batches are snappy-compressed and written to a WAL for immediate durability. The points are also written to an in-memory cache so that newly written points are immediately queryable. The cache is periodically flushed to TSM files. As TSM files accumulate, they are combined and compacted into higher-level TSM files. TSM data is organized into shards. The time range covered by a shard and the replication factor of a shard in a clustered deployment are configured by the retention policy.
Hopefully this post helps to explain how InfluxDB receives and persists incoming writes. In the next post, we’ll discuss how the system supports query, update, and delete operations.