How Does InfluxDB 3 Query Data in Real-Time?

InfluxDB 3 builds on open-source technologies—Flight, DataFusion, Arrow, and Parquet—but even if a developer made their own time series database using the same technologies, they would not be able to replicate InfluxDB 3. The FDAP stack provides many of the building blocks required for a high-performance database, such as the fast, multi-threaded, streaming, columnar execution engine that defines InfluxDB 3. However, it does not include all the low latency and time series specializations used in InfluxDB 3. The specialized parts include a custom ingester and compactor to quickly organize incoming data for querying, optimized file organization for time series, and specialized caches that allow users to see and analyze data in real-time.

Real-time analytics

What is real-time analytics, and why is it so important for time series data? Real-time analytics for time series refers to the process of analyzing and extracting insights from time-stamped data immediately after it is ingested, with minimal latency. It empowers quick decision-making through immediate trend, anomaly, and event detection, so organizations can monitor and respond to events as they happen. Any latency during data ingestion increases the time between incident and intervention, adding time to a process where milliseconds matter.

Real-time analytics and InfluxDB 3

InfluxDB 3’s high-performance analytics are supported by a combination of the custom ingester, compactor, catalog, caching, and custom file organization that underpin the database. These components work in tandem to deliver speed and efficiency.

Before diving into the ingester, let’s discuss the underlying technology. As mentioned earlier, InfluxDB 3 is built on the FDAP stack (Flight, DataFusion, Arrow, and Parquet), which significantly enhances query performance. Together with the custom engineering done by InfluxDB’s engineering teams, these technologies form a strong foundation for fast querying of historical data. That custom engineering includes both upstream contributions to DataFusion itself and work specific to InfluxDB 3.

What is the FDAP stack?

The Foundation: DataFusion, the Query Engine

DataFusion is an open source query execution framework written in Rust that efficiently processes large-scale data. It offers a modern SQL interface for querying data, supports multiple data sources, and includes features like parallel, multi-threaded query execution and vectorized processing for high performance. As part of the Apache Arrow ecosystem, DataFusion integrates seamlessly with Arrow’s in-memory columnar format and works natively with Parquet files. This compatibility enables fast and efficient analytics on large datasets stored in columnar formats and makes DataFusion ideal for analytics, ETL pipelines, and big data processing.

DataFusion uses Arrow as its in-memory format and can read Parquet, along with many other file formats. DataFusion is considered a high-level library, while Arrow and Parquet are lower-level libraries that provide more control over data storage and system performance. High-level libraries offer simplified interfaces to perform specific tasks, abstracting away complexity, while low-level libraries allow more detailed customization at the cost of requiring more technical expertise. Part of DataFusion’s speed comes from building on top of the foundation that Arrow and Parquet provide.

InfluxData’s contribution to DataFusion

Although DataFusion supports various data sources, it wasn’t originally designed with time series data in mind. When InfluxDB adopted DataFusion, InfluxData engineers played a crucial role in adding the relevant extension APIs needed to build an optimized query engine for time series data. These upstream contributions have significantly boosted DataFusion’s performance, recently making it the fastest engine for querying Apache Parquet files. InfluxData added several time series-specific optimizations through DataFusion’s extension mechanisms, making it more suitable for real-time, high-performance queries on time series data.

By adopting DataFusion, InfluxDB 3, like any other project that adopts it, gains a lightning-fast query engine for time series data. Adding real-time querying capabilities introduces an additional layer of complexity, because the data must be queryable immediately upon ingestion, which is critical for time series data. It’s important to note that querying data immediately after ingestion and querying large historical datasets are two distinct challenges. InfluxDB 3 engineers spent considerable time optimizing the ingestion process because real-time analytics doesn’t exist without lightning-fast, real-time ingestion. This is why any skilled engineer could build a high-performance engine for querying historical data using the FDAP stack, but real-time querying requires a more advanced solution such as InfluxDB 3.

The ingester

The ingester is custom-built to handle the specific needs of time series data, particularly the massive volumes and velocities at which that data arrives. InfluxDB 3’s ingester includes a custom parser for the line protocol format, a time-series-focused write-ahead log (WAL), and trade-offs that favor fast write ingestion. Speed is essential for time series workloads, so the ingester prioritizes speed over durability and over the strict consistency that ACID compliance provides.
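To make the parsing step concrete, here is a minimal Python sketch of reading one line of line protocol (measurement, optional tags, fields, optional timestamp). It is a simplified illustration, not InfluxDB 3's parser: it assumes no escaped characters and no spaces or commas inside string values, it ignores unsigned integers, and the names `Point`, `parse_field_value`, and `parse_line` are hypothetical.

```python
from typing import Dict, NamedTuple, Optional, Union

FieldValue = Union[float, int, bool, str]


class Point(NamedTuple):
    measurement: str
    tags: Dict[str, str]
    fields: Dict[str, FieldValue]
    timestamp_ns: Optional[int]


def parse_field_value(raw: str) -> FieldValue:
    """Convert one raw field value to a Python value (simplified rules)."""
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        return raw[1:-1]                      # quoted string
    if raw in ("t", "T", "true", "True", "TRUE"):
        return True
    if raw in ("f", "F", "false", "False", "FALSE"):
        return False
    if raw.endswith("i"):
        return int(raw[:-1])                  # integer, e.g. 8i
    return float(raw)                         # default: float


def parse_line(line: str) -> Point:
    """Parse 'measurement,tag=v field=v[,field=v] [timestamp]'."""
    parts = line.strip().split(" ")
    if len(parts) not in (2, 3):
        raise ValueError(f"unsupported line: {line!r}")
    series, field_part = parts[0], parts[1]
    timestamp = int(parts[2]) if len(parts) == 3 else None

    # The first comma-separated token is the measurement; the rest are tags.
    measurement, *tag_pairs = series.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)

    fields: Dict[str, FieldValue] = {}
    for pair in field_part.split(","):
        key, raw = pair.split("=", 1)
        fields[key] = parse_field_value(raw)
    return Point(measurement, tags, fields, timestamp)
```

For example, `parse_line("cpu,host=server01 usage=0.64,cores=8i 1700000000000000000")` yields a `Point` with measurement `cpu`, one tag, two typed fields, and a nanosecond timestamp. A production parser would also handle escaping and would avoid the intermediate string copies this sketch makes.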

When data enters InfluxDB 3, it doesn’t immediately go into Parquet. Instead, the data enters a specialized buffer, which is also based on Arrow, before eventually being written to an object store as Parquet files. This lets the system respond to the client without waiting for the data to land in object storage, which makes writes faster. Data is readable immediately after the write, without waiting for the write to object storage, and is eventually persisted there in batches. This design avoids adding object store communication to the write and query paths, which helps avoid the performance bottlenecks that would occur if data were written to object storage in real time.
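The buffering behavior described above can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not InfluxDB 3 code: the `WriteBuffer` class is hypothetical, a plain list stands in for the Arrow-based buffer, and `persisted_batches` stands in for Parquet files in object storage. The point it shows is that `write` acknowledges without touching "object storage", while `query` sees buffered rows right away.

```python
from typing import List, Tuple

# (timestamp_ns, series key, value): a deliberately tiny row shape
Row = Tuple[int, str, float]


class WriteBuffer:
    def __init__(self) -> None:
        self.rows: List[Row] = []                     # in-memory buffer
        self.persisted_batches: List[List[Row]] = []  # stand-in for object storage

    def write(self, row: Row) -> str:
        """Buffer the row and acknowledge; no object-store call on this path."""
        self.rows.append(row)
        return "ok"

    def flush(self) -> None:
        """Persist buffered rows as one sorted batch.

        Assumed to be called by a background task, not by write().
        """
        if self.rows:
            self.persisted_batches.append(sorted(self.rows))
            self.rows = []

    def query(self, start_ns: int, end_ns: int) -> List[Row]:
        """Return rows with start_ns <= timestamp < end_ns, buffered or persisted."""
        hits = [
            row
            for batch in self.persisted_batches
            for row in batch
            if start_ns <= row[0] < end_ns
        ]
        hits += [row for row in self.rows if start_ns <= row[0] < end_ns]
        return sorted(hits)
```

A row passed to `write` shows up in `query` results before any `flush` has run, which is the property that makes the data queryable immediately after ingestion in this model.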

[Figure: Workflow for ingest, metadata, and storage]

Trade-offs in data ingestion in distributed products

In its distributed products (Cloud Serverless, Cloud Dedicated, and Clustered), InfluxDB 3 deliberately trades some durability for ingestion speed. By writing data first to local disks and later to object storage, it enables users to ingest and query massive amounts of data quickly, without waiting for object storage to persist each record. While this approach sacrifices immediate durability, it lets the system handle high write throughput, which is crucial for time series data.
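As a rough illustration of writing to local disk first, the following Python sketch appends each record to a local log file and replays that file after a simulated restart. Everything here is an assumption made for illustration: the `WriteAheadLog` class, the JSON-lines encoding, the file name, and the decision to call `fsync` on every append are not taken from InfluxDB 3, whose WAL format is not described in this post.

```python
import json
import os
import tempfile
from typing import Dict, List


class WriteAheadLog:
    def __init__(self, path: str) -> None:
        self.path = path

    def append(self, record: Dict) -> None:
        """Append one record to the local log before acknowledging the write."""
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # sketch choice: force the bytes to local disk

    def replay(self) -> List[Dict]:
        """Read back every logged record, e.g. after a restart."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def truncate(self) -> None:
        """Drop logged records once a batch has reached object storage."""
        open(self.path, "w", encoding="utf-8").close()


def demo() -> List[Dict]:
    """Write two records, then recover them through a fresh WAL object."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ingest.wal")  # hypothetical file name
        wal = WriteAheadLog(path)
        wal.append({"series": "cpu,host=a", "ts": 1, "value": 0.5})
        wal.append({"series": "cpu,host=a", "ts": 2, "value": 0.7})
        # A new object reading the same file simulates a process restart.
        return WriteAheadLog(path).replay()
```

The trade-off from the paragraph above is visible in the model: records are only as durable as the local disk until a later batch upload succeeds and `truncate` is called.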

InfluxDB 3’s distributed architecture also eliminates the need for complex consensus protocols when ingesting data. Instead, it focuses on maximizing write speed, with eventual consistency handled through the compactor or at read time. This design allows fast, efficient data ingestion without sacrificing scalability or performance.
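One way to picture reconciliation at compaction or read time is deduplication across batches. The Python sketch below assumes a last-write-wins rule keyed on series and timestamp, with a per-row sequence number recording write order; that rule, the row shape, and the `deduplicate` function are assumptions for illustration rather than a description of InfluxDB 3's internals.

```python
from typing import Dict, Iterable, List, Tuple

# (series key, timestamp_ns, value, write sequence number)
Row = Tuple[str, int, float, int]


def deduplicate(batches: Iterable[List[Row]]) -> List[Row]:
    """Keep one row per (series, timestamp): the one written last."""
    latest: Dict[Tuple[str, int], Row] = {}
    for batch in batches:
        for row in batch:
            key = (row[0], row[1])
            # A higher sequence number means a later write, so it wins.
            if key not in latest or row[3] > latest[key][3]:
                latest[key] = row
    # Return rows ordered by series, then time.
    return sorted(latest.values(), key=lambda r: (r[0], r[1]))
```

Because the winner depends only on the sequence number, the result is the same whichever order the batches are read in, so writers do not need to coordinate with each other at ingest time in this model.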

Real-time querying and data organization

In addition to the ingester, real-time querying in InfluxDB 3 relies heavily on how data is organized in Parquet files and on advanced caching techniques. DataFusion works with Parquet files, a flexible format that gives engineers many options for data organization, including custom sorting and custom ways of dividing data across files. InfluxDB’s CTO, Paul Dix, spent considerable time optimizing how data is divided across these files, balancing write speed against query performance so that real-time querying remains fast even as the volume of ingested data grows. The resulting layout is optimized for both query speed and efficient storage, so InfluxDB 3 can deliver high-performance analytics while handling large-scale data ingestion.
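A small Python sketch can show why file organization matters for query speed. It assumes a layout with one file per day, rows sorted by series and then time within each file, and min/max timestamps kept per file; the post does not specify InfluxDB 3's actual layout, so the day-based split and the names `FileMeta`, `organize`, and `prune` are hypothetical. With that metadata, a time-range query can skip files that cannot contain matching rows.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

DAY_NS = 86_400 * 1_000_000_000  # nanoseconds per day

# (timestamp_ns, series key, value)
Row = Tuple[int, str, float]


@dataclass
class FileMeta:
    path: str        # partition label standing in for a Parquet file
    min_time: int    # smallest timestamp in the file
    max_time: int    # largest timestamp in the file (inclusive)
    rows: List[Row]


def organize(rows: List[Row]) -> List[FileMeta]:
    """Split rows into one 'file' per day, sorted by series then time."""
    by_day: Dict[int, List[Row]] = {}
    for row in rows:
        by_day.setdefault(row[0] // DAY_NS, []).append(row)

    files = []
    for day in sorted(by_day):
        day_rows = sorted(by_day[day], key=lambda r: (r[1], r[0]))
        times = [r[0] for r in day_rows]
        files.append(FileMeta(f"day={day}", min(times), max(times), day_rows))
    return files


def prune(files: List[FileMeta], start_ns: int, end_ns: int) -> List[FileMeta]:
    """Keep only files whose time range overlaps [start_ns, end_ns)."""
    return [f for f in files if f.max_time >= start_ns and f.min_time < end_ns]
```

The balance mentioned above shows up even here: more, smaller files let `prune` skip more data per query but cost more work at write time, while fewer, larger files are cheaper to write and slower to scan.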

Where to go from here

The combination of the FDAP stack and custom engineering enables a great time series experience. Though both the ingester and the FDAP stack are powerful independently, together they deliver capabilities that neither could deliver alone. InfluxDB 3 represents a breakthrough in time series database technology, combining the power of the FDAP stack with extensive custom engineering to deliver high performance for real-time analytics. By leveraging open source tools like DataFusion, Arrow, and Parquet and enhancing them with bespoke optimizations, InfluxData has created a platform that excels in both high-speed data ingestion and real-time query execution. Its ingester and custom file organization balance speed, scalability, and efficiency, so InfluxDB 3 can handle massive time series workloads without compromising performance.

For more reading on InfluxDB 3’s architecture, check out this post written by our engineering team. Get started for free in the cloud with InfluxDB Cloud Serverless, try the Alpha of our new open source product InfluxDB 3 Core, or contact sales for a custom POC.