Apache DataFusion is Now the Fastest Single Node Engine for Querying Apache Parquet Files

By Andrew Lamb / Developer
Nov 25, 2024

Navigate to:

This blog was originally published on Apache DataFusion Project News & Blog

I am extremely excited to announce that Apache DataFusion 43.0.0 is the fastest engine for querying Apache Parquet files in ClickBench. It is faster than both DuckDB and chDB/Clickhouse using the same hardware. It also marks the first time a Rust based engine holds the top spot, which has previously been held by traditional C/C++ based engines. Figure 1: 2024-11-16 ClickBench Results for the ‘hot’¹ run against the partitioned 14 GB Parquet dataset (100 files, each ~140MB) on a c6a.4xlarge (16 CPU/32 GB RAM) machine. Measurements are relative (1.x) to results using different hardware.

Best in class performance on Parquet is now available to anyone. DataFusion’s open design lets you start quickly with a full-featured Query Engine, including SQL, data formats, catalogs, and more, and then customize any behavior you need. I predict the continued emergence of new classes of data systems now that creators can focus the bulk of their innovation on areas such as query languages, system integrations, and data formats rather than trying to play catchup with core engine performance.

ClickBench also includes results for proprietary storage formats, which require costly load/export steps, making them useful in fewer use cases and thus much less important than open formats (though the idea of use case specific formats is interesting²).

This blog post highlights some of the techniques we used to achieve this performance, and celebrates the teamwork involved.

A strong history of performance improvements

Performance has long been a core focus for DataFusion’s community, and speed attracts users and contributors. Recently, we seem to have been even more focused on performance, including in July, 2024 when Mehmet Ozan Kabak, CEO of Synnada, again suggested focusing on performance. This got many of us excited (who doesn’t love a challenge!), and we have subsequently rallied to steadily improve the performance release on release as shown in Figure 2. Figure 2: ClickBench performance improved over 30% between DataFusion 34 (released Dec 2023) and DataFusion 43 (released Nov 2024).

Like all good optimization efforts, ours took sustained effort as DataFusion ran out single 2x performance improvements several years ago. Working together our community of engineers from around the world³ and all experience levels⁴ pulled it off (check out this discussion to get a sense). It may be a “hobo sandwich”⁵ but it is a tasty one!

Of course, most of these techniques have been implemented and described before, but until now they were only available in proprietary systems such as Vertica, DataBricks Photon, or Snowflake or in tightly integrated open source systems such as DuckDB or ClickHouse which were not designed to extend.

StringView

Performance improved for all queries when DataFusion switched to using Arrow StringView. Using StringView “just” saves some copies and avoids one memory access for certain comparisons. However, these copies and comparisons happen to occur in many of the hottest loops during query processing, so optimizing them resulted in noticeable performance improvements.

Figure 3: Figure from Using StringView / German Style Strings to Make Queries Faster: Part 1 showing how StringView saves copying data in many cases.

Using StringView to make DataFusion faster for ClickBench required substantial careful, low level optimization work described in Using StringView / German Style Strings to Make Queries Faster: Part 1 and Part 2. However, it also required extending the rest of DataFusion’s operations to support the new type. You can get a sense of the magnitude of the work required by looking at the 100+ pull requests linked to the epic in arrow-rs (here) and three major epics (here, here, and here) in DataFusion.

Here is a partial list of people involved in the project (I am sorry to those I forgot):

Arrow: Xiangpeng Hao (InfluxData’s amazing 2024 summer intern and UW Madison PhD), Yijun Zhao from DataBend Labs, and Raphael Taylor-Davies laid the foundation. RinChanNOW from Tencent and Andrew Duffy from SpiralDB helped push it along in the early days, and Liang-Chi Hsieh, Daniël Heres reviewed and provided guidance.
DataFusion: Xiangpeng Hao, again charted the initial path and Alex Huang, Dharan Aditya Lordworms, Jax Liu, wiedld, Tai Le Manh, yi wang, doupache, Jay Zhan, Xin Li, and Kaifeng Zheng made it real.
DataFusion String Function Migration: Trent Hauck organized the effort and set the patterns, Jax Liu made a clever testing framework, and Austin Liu, Dmitrii Bu, Tai Le Manh, Chojan Shang, WeblWabl, Lordworms, iamthinh, Bruce Ritchie , Kaifeng Zheng, and Xin Li bashed out the conversions.

Parquet

Part of DataFusion’s speed in ClickBench is reading Parquet files (really) quickly, which reflects invested effort in the Parquet reading system (see Querying Parquet with Millisecond Latency).

The DataFusion ParquetExec (built on Rust Parquet) is now the most sophisticated open source Parquet reader I know of. It has every optimization we can think of for reading Parquet, including projection pushdown, predicate pushdown (row group metadata, page index, and bloom filters), limit pushdown, parallel reading, interleaved I/O, and late materialized filtering (coming soon by default). Some recent work from June recently unblocked a remaining hurdle for enabling late materialized filtering, and conveniently Xiangpeng Hao is working on the final piece (no pressure😅).

Skipping partial aggregation when it doesn’t help

Many ClickBench queries are aggregations that summarize millions of rows, a common task for reporting and dashboarding. DataFusion uses state of the art two phase aggregation plans. Normally, two phase aggregation works well as the first phase consolidates many rows immediately after reading, while the data is still in cache. However, for certain “high cardinality” aggregate queries (that have large numbers of groups), the two phase aggregation strategy used in DataFusion was inefficient, manifesting in relatively slower performance compared to other engines for ClickBench queries such as

SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") 
FROM hits 
GROUP BY "WatchID", "ClientIP" -- **** 13M distinct Groups  ****
ORDER BY c DESC 
LIMIT 10;

For such queries, the first first aggregation phase does not significantly reduce the number of rows, which wastes significant effort. Eduard Karacharov solved this problem with a dynamic strategy to bypass the first phase when it is not working efficiently, shown in Figure 4. Figure 4: Diagram from DataFusion API docs showing when the muti-phase grouping is not effective.

Optimized multi-column grouping

Another method for improving analytic database performance is specialized (aka highly optimized) versions of operations for different data types, which the system picks at runtime based on the query. Like other systems, DataFusion has specialized code for handling different types of group columns. For example, there is special code that handles GROUP BY int_id and different specialized code that handles GROUP BY string_id.

When a query groups by multiple columns, it is tricker to apply this technique. For example GROUP BY string_id, int_id and GROUP BY int_id, string_id have different optimal structures, but it is not possible to include specialized versions for all possible combinations of group column types.

DataFusion includes a general Row based mechanism that works for any combination of column types, but this general mechanism copies each value twice as shown in Figure 5. The cost of this copy is especially high for variable length strings and binary data. Figure 5: Prior to DataFusion 43.0.0, queries with multiple group columns used Row based group storage and copied each group value twice. This copy consumes a substantial amount of the query time for queries with many distinct groups, such as several of the queries in ClickBench.

Many optimizations in Databases boil down to simply avoiding copies, and this was no exception. The trick was to figure out how to avoid copies without causing per-column comparison overhead to dominate or complexity to get out of hand. In a great example of diligent and disciplined engineering, Jay Zhan tried several different approaches until arriving at the one shipped in DataFusion 43.0.0, shown in Figure 6.

Figure 6: DataFusion 43.0.0’s new columnar group storage copies each group value exactly once, which is significantly faster when grouping by multiple columns.

Huge thanks as well to Emil Ejbyfeldt and Daniël Heres for their help reviewing and to Rachelint (kamille) for reviewing and contributing a faster vectorized append and compare for multi group which will be released in DataFusion 44. The discussion on the ticket is another great example of the power of the DataFusion community working together to build great software.

What’s next 🚀

Just as I expect the performance of other engines to improve, DataFusion has several more performance improvements lined up itself:

We are also talking about what to focus on over the next three months and are always looking for people to help! If you want to geek out (obsess??) about performance and other features with engineers from around the world, we would love you to join us.

Additional thanks

In addition to the people called out above, thanks:

Patrick McGleenon for running ClickBench and gathering this data (source).
Everyone I missed in the shoutouts—there are so many of you. We appreciate everyone.

Conclusion

I have dreamed about DataFusion being at the top of the ClickBench leaderboard for several years. I often watched with envy improvements in systems backed by large VC investments, internet companies, or world class research institutions, and doubted that we could pull off something similar in an open source project with always limited time.

The fact that we have now surpassed those other systems in query performance speaks to the power and possibility of focusing on community and aligning the collective enthusiasm and skills towards a common goal. Of course, being on the top in any particular benchmark is likely fleeting as other engines will improve, but so will DataFusion!

I love working on DataFusion—the people, the quality of the code, my interactions and the results we have achieved together far surpass my expectations as well as most of my other software development experiences. I can’t wait to see what people will build next, and hope to see you online.

Note that DuckDB is slightly faster on the ‘cold’ run.
Want to try your hand at a custom format for ClickBench fame/glory? Make DataFusion the fastest engine in ClickBench with custom file format
We have contributors from North America, South American, Europe, Asia, Africa and Australia.
Undergraduates, PhD, Junior engineers, and getting-kind-of-crotchety experienced engineers.
Thanks to Andy Pavlo, I love that nomenclature.

Navigate to:

Try InfluxDB Cloud

Stop flying blind

Apache DataFusion is Now the Fastest Single Node Engine for Querying Apache Parquet Files

By Andrew Lamb / Developer
Nov 25, 2024

Navigate to:

A strong history of performance improvements

StringView

Parquet

Skipping partial aggregation when it doesn’t help

Optimized multi-column grouping

What’s next 🚀

Additional thanks

Conclusion

Ready to get started?

InfluxDB 3 Core & Enterprise GA: The Next Generation Time Series Platform for Developers is Here

Data Lakes and Warehouses

InfluxDB for Industrial IoT:
A Live Demonstration

Time Series Databases Explained

Network Monitoring

Time Series Data Analysis: Definitions and Best Techniques in 2025

Product & Solutions

Developers

Company

Navigate to:

Try InfluxDB Cloud

Stop flying blind

Get Updates

Apache DataFusion is Now the Fastest Single Node Engine for Querying Apache Parquet Files

By Andrew Lamb / Developer Nov 25, 2024

Navigate to:

A strong history of performance improvements

StringView

Parquet

Skipping partial aggregation when it doesn’t help

Optimized multi-column grouping

What’s next 🚀

Additional thanks

Conclusion

Ready to get started?

InfluxDB 3 Core & Enterprise GA: The Next Generation Time Series Platform for Developers is Here

Data Lakes and Warehouses

InfluxDB for Industrial IoT: A Live Demonstration

Time Series Databases Explained

Network Monitoring

Time Series Data Analysis: Definitions and Best Techniques in 2025

Product & Solutions

Developers

Company

Sign up for the InfluxData newsletter

Follow Us

By Andrew Lamb / Developer
Nov 25, 2024

InfluxDB for Industrial IoT:
A Live Demonstration