InfluxDB Internals 101 - Part Two
By
Ryan Betts /
Product, Developer
Nov 27, 2017
Navigate to:
- Query path: reading data from InfluxDB
- Indexing points for query
- A note on TSI (on disk indexes)
- Executing queries
- A note on IFQL
- DELETE and DROP - removing data from InfluxDB
- Updating points
Introduction
Part One of this series describes the InfluxDB write path: how the database persists and organizes data being written to the database. This part (Part Two) describes the other main interaction with the database: querying data once it has been persisted. Note that Part One also defines the InfluxDB jargon used in this post (tagset
, fieldset
, measurement
, series
) which will be helpful to new readers.
InfluxDB is queried using a SQL dialect called influxql
. There is quite a bit of documentation for the language as well as a guide to using influxql
for different querying tasks. This post focuses on how the query engine works and not on the semantics of the language itself.
Time series applications tend to query in two patterns. Queries either window and produce per-window aggregates (window data into one-minute intervals and calculate the average for each minute). Or, queries search for a specific point (often the last()
or most recent point in a series). Both query patterns filter the points in the database by criteria applied to a set of dimensions; for example, all the data where region = us-east
or where measurement = 'cpu'
. In InfluxDB, these dimensions are stored as tagsets
.
Finally, before we get into more detail, it is important to note that influxql
supports selection
and projection
operators but does not support traditional relational joins
. Optimizing query performance in InfluxDB requires finding the initial point for each series and then leveraging columnar storage to efficiently scan a sequence of points following that initial point. The use of flexible schema-on-write tagsets
vs. pre-defined dimension tables in a star-schema is one of the more interesting differences between InfluxDB and a traditional SQL columnar OLAP database.
Indexing Points for Query
Part One describes the different data structures populated by incoming writes to achieve durability and compact long-term storage. There is one additional data structure populated by writes to make queries efficient: the index. InfluxDB automatically maintains an index to make filtering by tagsets
efficient.
The index maintains mappings of measurement name
to field keys
, of measurement name
to series ids
(an internal series identifier), of measurement name
to tag keys
to tag value
to series id
, and of series id
to shards
. The index (as of version 1.4) also maintains sketches of series
and measurements
for fast cardinality estimates. You can read the index implementation on GitHub for more detail.
That’s a lot of different mappings to think about and understand. Personally, I find it easier, and conceptually accurate, to think of the index as a posting list (aka inverted index) that maps tag key/value pairs to a list of series keys. This slight abstraction captures the primary purpose of the index: to make it efficient at query time to identify all series that need to be scanned based on a tagset
filter in an influxql
WHERE predicate.
A Note on TSI (On-disk Index)
The current default index is stored in-memory. This allows fast lookup for query planning. However, it also means that high-cardinality data, data that include a large number of unique tagsets
, requires a lot of memory to index. This is why we suggest that users use tagsets
for lower-cardinality dimension data and use unindexed field values
for high-cardinality data.
We are developing a new index structure, Time Series Index (TSI), which is now shipping as an opt-in preview. TSI stores the index on SSD, allowing much higher cardinality datasets than the default in-memory index.
Parsing and Planning
Having described the index, it is possible to explain the internal workflow that runs to parse, plan, and execute an example influxql
query. The query engine:
- Determines the type of query (one with an expression or a raw data query)
- Determines and then separates the time range and the condition expression for filtering data
- Determines which shards it needs to access using the list of measurements and the time frame
- Expands any wildcards
- Validates that the query is semantically correct
- Directs the storage engine to create the iterators for each shard
- And merges the shard iterator outputs, performing any post-processing on the data
Sample query:
select user, system from cpu where time > now() - 1h and host = 'serverA'
The database receives the query and parses out the measurements that are accessed, fields returned, grouping time intervals, filter predicates, and other influxql
query components. You can read the AST structure for the SELECT statement in the influxdata/influxql GitHub repository.
After parsing, the query engine determines which series are needed to produce an answer. In this example, the query engine uses the index to find all series
that are part of the cpu measurement
. It then uses the index to find all series
that have the tag key, tag value
pair host, serverA. The intersection of these sets provide the series
that need to be scanned. The time range in the query, now() - 1h, limits the scan to shard groups
covering the last one hour.
The query engine instantiates an iterator for each series, for each shard. These iterators are nested, forming a tree. The iterator tree is executed bottom-up, reading, filtering, and merging data to produce a final result set.
The version 1.4 EXPLAIN
and EXPLAIN ANALYZE
statements provide statistics on iterators created and TSM blocks decoded as part of query execution. There are example outputs in the What’s New in InfluxDB 1.4 blog post.
A Note on IFQL
The combination of schema-on-write, automatic indexing of tagsets
, and SQL-like syntax produce a system that allows newcomers to be productive quickly, that feels familiar, and requires minimal setup to get started.
However, the pre-allocation of narrowly scoped iterators means high-cardinality queries, and queries that produce a very large number of groups are expensive to plan. The iterator structures can consume, worst case, GBs of RAM. Secondly, the iterator allocation during planning and other implementation details make multi-query resource management difficult. Finally, while SQL-like syntax is a good fit for simple queries, it becomes cumbersome for more sophisticated analytics. Time series queries are often sets of functions applied to groupings of filtered streams. Expressing these queries using select-project-join logic with advanced SQL partition and over clauses requires an experienced SQL programmer and is no longer beginner- friendly.
We recently announced a prototype query language, IFQL, to explore solutions to these problems: cheaper planning, better resource management, and easier expression of complex queries.
DELETE and DROP: Removing Data from InfluxDB
InfluxDB supports retention policies to enforce time to live policies against data. This is always the preferred way to regularly delete points from the database. However, applications sometimes write bad data to the database. That data needs to be removed to return to normal operation. In these cases, DELETE
and DROP
can be used to delete unwanted points.
DELETE and DROP statements are processed through the query layer, not the write layer. This allows DELETE and DROP to re-use the selection and expression features of influxql
.
Deleting data from a columnar database is expensive. InfluxDB organizes data on disk into immutable runs of values for a single column of a series. A delete operation needs to undo a lot of that work for a subset of points.
In InfluxDB, deleting a row from the database produces a tombstone. A tombstone includes a series key
and the min and max time of the deleted range. This allows a very compact expression for the primary delete use case: delete all data for an invalid series between times t1 and t2.
When sufficient tombstones collect, TSM data is re-compacted into a new immutable file with the deleted data removed and tombstone records deleted. At query time, tombstones are checked to avoid processing data marked as deleted.
Over the last six months, substantial work has gone into making tombstone management, compaction based on accumulated deletes, and index updates after deletes, correct and efficient.
Updating Points
InfluxDB does not support an UPDATE
statement. However, re-inserting a fully qualified series key
at an existing timestamp will replace the old point’s field value
with the new field value
.
Conclusion
Hopefully this post has added to your mental model of InfluxDB. It discusses four key concepts:
series
andtagsets
are indexed for query planning.- Query planning uses the index to identify series to scan.
- Query planning generates and executes a tree of iterators.
- DELETE and DROP statements are part of
influxql
and result in tombstones to annotate deleted data.