A Guide to the New InfluxDB Database Engine
InfluxDB has been the leading time series database for years, thanks to its combination of performance and developer-friendly features for working with all forms of time series data. The InfluxDB team keeps pushing for new features and better performance, and a key part of that effort was building a new database engine to take InfluxDB to the next level.
How does InfluxDB’s database engine work?
InfluxDB’s new database engine is column-oriented, which provides a number of benefits for working with time series data. One is improved data compression, which makes storing large amounts of data more cost-effective. Compressed data also uses less memory and bandwidth while in active use, so more data can be queried and analyzed at once.
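To make the compression benefit concrete, here is a minimal sketch of why a columnar layout tends to compress well. It uses a toy run-length encoder rather than anything from InfluxDB itself, and the rows, tag values, and function name are hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical readings: each row has a host tag and a status field.
rows = [
    ("server-a", "ok"),
    ("server-a", "ok"),
    ("server-a", "ok"),
    ("server-b", "ok"),
    ("server-b", "ok"),
    ("server-b", "ok"),
]

# Row-oriented: values from different columns are interleaved, so
# neighbouring values rarely repeat.
row_layout = [value for row in rows for value in row]

# Column-oriented: each column's values sit next to each other.
host_column = [host for host, _ in rows]
status_column = [status for _, status in rows]

row_runs = run_length_encode(row_layout)
column_runs = run_length_encode(host_column) + run_length_encode(status_column)

print(len(row_runs))     # 12 runs: nothing collapses
print(len(column_runs))  # 3 runs: two for host, one for status
```

Real engines combine several encodings and general-purpose compressors, but the underlying idea is the same: storing similar values side by side gives the compressor more to work with.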
High cardinality is no longer a limiting factor with the new database engine, so use cases that were previously challenging with InfluxDB, like distributed tracing, are now viable. InfluxDB also now supports SQL natively, which not only makes querying data more accessible but also opens InfluxDB up to an entire ecosystem of tools for integration.
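For a feel of what SQL over time series data looks like, the sketch below runs a typical time-bucketed aggregation. It uses Python’s built-in sqlite3 module purely as a stand-in and does not connect to InfluxDB. The table name, column names, and sample values are hypothetical, and InfluxDB’s own SQL dialect and time functions may differ.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpu (ts INTEGER, host TEXT, usage REAL)")
conn.executemany(
    "INSERT INTO cpu VALUES (?, ?, ?)",
    [
        (0, "server-a", 10.0),
        (30, "server-a", 20.0),
        (60, "server-a", 30.0),
        (0, "server-b", 50.0),
        (90, "server-b", 70.0),
    ],
)

# Average usage per host in 60-second buckets. Integer division
# truncates each timestamp to the start of its bucket.
query = """
    SELECT (ts / 60) * 60 AS bucket, host, AVG(usage) AS avg_usage
    FROM cpu
    GROUP BY bucket, host
    ORDER BY bucket, host
"""
result = conn.execute(query).fetchall()

for bucket, host, avg_usage in result:
    print(bucket, host, avg_usage)
```

Because the query is plain SQL, the same shape of statement is familiar to anyone who has used a relational database, which is the accessibility benefit described above.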
Components of InfluxDB’s database engine
To build this new database engine, InfluxData didn’t start from scratch. The team took advantage of a number of open source projects and also contributed major features upstream to benefit the wider ecosystem. This section looks at some of the core projects used and what they provide for InfluxDB.
Apache Arrow
Apache Arrow is an open-source project aimed at providing a high-performance, in-memory data structure for data processing and analytics. Arrow’s purpose is to standardize the columnar data format used by data analytics and data processing frameworks, making it easier for these systems to efficiently exchange data with one another.
Arrow was a natural fit for InfluxDB because it provides an efficient way to move data between storage on disk and RAM. It also has useful subprojects like Arrow Flight, for moving data efficiently over the network, and Arrow DataFusion, which provides a query engine layer for working with data stored in the Apache Arrow format.
Apache Parquet
Apache Parquet is an open-source columnar storage format for big data processing. It is designed to provide efficient storage, access, and processing of large and complex data sets. Parquet is optimized for columnar storage, allowing for highly efficient compression and encoding, resulting in smaller file sizes and improved query performance. Parquet supports a variety of data types, including nested data structures, and is compatible with a wide range of data processing and analytics tools. The goal of the project is to enable organizations to efficiently store and process large data sets, making it easier to extract insights and drive business value from big data.
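One of the encodings Parquet relies on is dictionary encoding, which works well for columns with many repeated values, such as tags. The following is a simplified sketch of the idea only. It does not reproduce Parquet’s actual on-disk layout, and the function names and sample column are hypothetical.

```python
def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = []   # unique values, in order of first appearance
    positions = {}    # value -> index in the dictionary
    indices = []      # one small integer per original value
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and indices."""
    return [dictionary[i] for i in indices]


# Hypothetical tag column with only three distinct values.
region_column = ["us-east", "us-east", "eu-west", "us-east", "eu-west", "ap-south"]

dictionary, indices = dictionary_encode(region_column)
print(dictionary)  # ['us-east', 'eu-west', 'ap-south']
print(indices)     # [0, 0, 1, 0, 1, 2]

# Each index needs only enough bits to address the dictionary.
bits_per_index = max(1, (len(dictionary) - 1).bit_length())
print(bits_per_index)  # 2
```

Each string is stored once, and every row then costs a couple of bits instead of a full string, which is part of why columnar files stay small when values repeat.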
Parquet is used by InfluxDB as the persistent storage format on disk. The main advantage Parquet provides for InfluxDB is being able to map the on-disk representation of data to the in-memory representation, which makes moving data between disk and memory more efficient. Using Parquet also makes it easy to integrate with other parts of the big data ecosystem because of how ubiquitous Parquet has become.
InfluxDB use cases
InfluxDB has typically been used as a database for storing metrics for various kinds of monitoring or IoT use cases. The new columnar database engine opens up a number of additional use cases that were previously difficult to support with InfluxDB.
Observability
InfluxDB used to struggle with observability workloads that require large numbers of unique tag values, such as distributed tracing. The new database engine supports unbounded cardinality, so it can now handle distributed tracing and other observability data.
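To see why tracing data is so demanding, it helps to count series. The sketch below treats series cardinality as the number of unique combinations of measurement and tag set, which is a simplification that ignores field keys. The measurement names, tag names, and helper function are hypothetical.

```python
def series_cardinality(points):
    """Count unique (measurement, tag set) combinations."""
    return len({
        (measurement, frozenset(tags.items()))
        for measurement, tags in points
    })


# Typical metrics: a handful of hosts, so tag values repeat constantly.
metrics = [
    ("cpu", {"host": f"server-{i % 3}", "region": "us-east"})
    for i in range(1000)
]

# Tracing data: every span carries a different trace ID as a tag.
spans = [
    ("span", {"service": "checkout", "trace_id": f"trace-{i}"})
    for i in range(1000)
]

print(series_cardinality(metrics))  # 3
print(series_cardinality(spans))    # 1000
```

With metrics, the number of series stays flat as more points arrive. With traces, it grows with every new trace ID, which is the pattern an engine has to tolerate to be useful for this kind of data.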
Real-time analytics
Because data in the Apache Arrow format does not need to be serialized and deserialized, latency drops significantly, and InfluxDB can be used for a number of real-time analytics workloads on large volumes of data. These range from simple dashboards to real-time alerting and automation. Overall, InfluxDB is a good fit for many types of OLAP workloads.
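The cost being avoided can be illustrated with the Python standard library alone. This sketch is an analogy and does not use Arrow: it contrasts a serialize-and-parse round trip, which produces a separate copy of the data, with a memoryview that reads the original buffer in place. The sample values are made up.

```python
import json
from array import array

# A column of float64 readings stored in one contiguous buffer.
column = array("d", [20.5, 21.0, 21.5, 22.0])

# Serialization path: encode to text, then parse into a new list.
payload = json.dumps(column.tolist())
copy = json.loads(payload)

# Shared-buffer path: a memoryview exposes the same bytes without copying.
view = memoryview(column)

# Changing the original shows which consumer is looking at a copy.
column[0] = 99.0
print(copy[0])  # 20.5: the deserialized copy is a separate object
print(view[0])  # 99.0: the view reads the original buffer directly
```

When producer and consumer agree on one in-memory format, the encode and parse steps disappear, and so does the time and memory they would have used.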
Long-term data storage
Because of the superior compression provided by columnar storage, InfluxDB can now store much larger volumes of data at a lower cost. InfluxDB’s new engine also seamlessly moves data from memory to cheap object storage to further reduce costs. This makes InfluxDB a viable option for workloads like data warehousing.
IoT
InfluxDB is a great fit for IoT use cases because of the freedom in architecture provided by being able to combine open source, on-prem, and cloud versions of InfluxDB. InfluxDB can be deployed at the edge for real-time monitoring and data processing, then data can be synced up to the cloud for further analysis if desired.
As the new database engine gets into the hands of users, no doubt other use cases will be discovered to take advantage of the improved performance and feature set of InfluxDB.