2025: The Year of 1,000 DataFusion-Based Systems
By Andrew Lamb / Developer / Jan 08, 2025
Apache DataFusion has reached an inflection point. It has matured beyond early adopters and is now a viable choice for anyone building highly performant analytic systems. I predict 2025 will bring a significant acceleration in the number of systems built on DataFusion, and my focus this year is to help drive that growth.
The journey from 0 to 1,000 projects
Two years ago, when introducing DataFusion to VCs and early collaborators, I had an ambitious goal: 1,000 projects powered by DataFusion. That number was aspirational—bold enough to challenge but grounded enough to feel achievable. I think we may hit that goal in 2025.
DataFusion achieved several key milestones in 2024 as it matured from a promising technology to a building block for highly performant systems:
- Elevated to a Top-Level Project within the Apache Software Foundation (ASF).
- Hosted the first in-person meetup in Austin, Texas, followed by others in San Francisco, Seattle, Belgrade, and more.
- Published a research paper at ACM SIGMOD 2024, one of the world’s leading database conferences.
- Gained adoption by a growing number of database products and companies[1], with increased media attention.
The year closed with a major breakthrough: DataFusion 43.0.0 became the fastest engine for querying Apache Parquet files in ClickBench, marking the first time a Rust-based engine surpassed traditional C/C++ engines.
These milestones didn’t happen by chance; they are the result of eight years of relentless development from hundreds of individuals and countless engineering hours. Figure 1 shows my subjective appraisal of DataFusion’s timeline and my prediction of its acceleration over the next few years.
Figure 1: Major milestones in the DataFusion project lifetime and my estimates of project adoption. I predict 2025 will be very exciting.
2020-2023 early adopters, including InfluxDB 3
InfluxData recognized DataFusion’s potential early on and bet on it for the rebuild of InfluxDB in Rust, along with the rest of the FDAP stack—Apache Arrow Flight, Apache DataFusion, Apache Arrow, and Apache Parquet—all ASF technologies. At the time, DataFusion was still in its infancy, developed primarily by its creator, Andy Grove, during his spare time.
Creating a high-performance time series engine using well-known columnar and vectorization techniques was central to the InfluxDB 3 design. Such an engine requires significant knowledge and investment and had previously been available only to a small number of companies and elite research institutions. We believed that DataFusion's combination of being written in Rust, being an ASF project, and being part of the Arrow ecosystem would attract other users, who would both benefit from the project and help provide the engineering it needed. That bet has paid off, with over 94 individuals contributing to the most recent release.
InfluxData wasn’t alone in recognizing DataFusion’s potential. Companies like Coralogix, Greptime, and Synnada also embraced DataFusion, betting that building on its foundation and contributing to its development would allow them to deliver better products more quickly and cost-effectively than doing it entirely by themselves.
This collective investment helped grow DataFusion and its community while delivering tangible benefits to early adopters. While the journey came with challenges, the returns have been undeniably high.
Today, in InfluxDB 3, every aspect of data processing flows through a DataFusion plan after Line Protocol parsing. This includes writing and compacting Apache Parquet files and executing SQL, InfluxQL, and Flux queries. Our multi-tenant production systems alone execute tens of millions of DataFusion plans daily. Improvements from the broader DataFusion community flow directly into InfluxDB 3, with many of our bug reports or SQL feature requests from customers resolved upstream by other contributors, requiring only a version upgrade.
2023-2025: gaining momentum
Major companies with dedicated engineering teams are now building and deploying DataFusion-based systems across diverse contexts while contributing back to the project. This virtuous cycle has driven rapid innovation in performance and features, with adoption still in its early stages. The past two years have been a turning point, with engineers from leading tech companies such as Apple, eBay, Kuaishou, Airbnb, TikTok, Huawei, and Alibaba contributing significantly to DataFusion.
A key milestone came last year when Apple developers built a replacement for Spark query execution using DataFusion, which they donated to the ASF and which is now developed as Apache DataFusion Comet. This not only demonstrated Apple’s confidence in DataFusion but also inspired additional contributions from the broader open source community, accelerating its growth.
Integration into the Open Data Lake
In 2025, adoption of DataFusion is set to surge as the industry embraces Open Data Lake architectures. The data landscape is evolving into a constellation of specialized processing systems, each tailored for unique use cases, as illustrated in Figure 2.
Figure 2: Next-generation analytics: a constellation of different tools with a shared storage layer based on the open Apache Parquet and Apache Iceberg formats stored on Object Storage such as AWS S3, GCP Cloud Storage, and Azure Blob Storage.
These systems will share the same underlying data stored in the Apache Parquet open format, organized by Apache Iceberg, and tailored to different use cases. Achieving high performance in this architecture requires advanced, vectorized analytic technology—an area where DataFusion excels due to its permissive licensing, extensible design, and exceptional Parquet performance. The Rust-based implementations of Delta Lake, Apache Iceberg, and Apache Hudi, all built using DataFusion, highlight its central role in the shift toward open, modular data architectures.
To support this proliferation, I expect significant additional investment from the DataFusion community to improve the technological underpinnings of querying in this new architecture. Efforts include simplifying and accelerating remote file queries and exploring advanced caching strategies.
Streamlining Adoption for Downstream Users
Another major theme for investment in 2025 will be reducing friction for downstream users when adopting new versions of DataFusion. Recent efforts in DataFusion to complete projects such as StringView and Window Function Migration solidified its foundation, but the velocity of changes also caused challenges downstream for some upgrades.
As the ecosystem grows, ensuring the smooth adoption of updates becomes increasingly critical. We are discussing ways to improve this process, as well as to clarify the criteria for adding new features and for deciding what belongs in core DataFusion.
By balancing innovation with stability, the DataFusion community aims to maintain its rapid velocity of improvements while making it easier for users and contributors to keep pace.
Next Level Quality: Bashing Pesky Bugs
As DataFusion matures, users tend to:
- Expect greater breadth and depth of functionality (e.g., SQL and type support)
- Run increasingly complicated queries
These trends naturally expose feature gaps and bugs. For example, given that InfluxDB 3 executes tens of millions of DataFusion plans per day on InfluxData production systems, we find occasional and increasingly esoteric issues that we report and help fix.
This “hardening” phase is a natural step for any successful software on its path to maturity and widespread adoption. While fixing these bugs can be tedious, it is a straightforward task requiring focused engineering effort. I am confident in our community’s ability to drive up the quality level.
DataFusion already benefits from extensive test coverage, and I predict we will see additional focus on automated industrial testing. Examples include Bruce Ritchie’s work on running DataFusion on the SQLite test corpus and Yongting You’s efforts to run SQLancer on DataFusion. InfluxData plans to contribute significantly to this area as well, and I hope other companies using DataFusion will do the same.
Pushing the Limits of Performance
One of DataFusion’s core principles is world-class performance: applications built on DataFusion can focus mostly on their specific features and take advantage of DataFusion’s performance (like LLVM, my favorite, though very geeky, analogy).
DataFusion already has optimized most “low-hanging fruit,” so continued performance improvements require careful and focused engineering. We continue to see performance projects such as vectorized group keys and improved pruning, but the quality bar gets higher. We will need continued, ongoing help from the community to find, implement, evaluate, and verify these improvements.
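To make the pruning idea concrete, here is a minimal sketch in Rust of skipping chunks of data using min/max statistics. It is a toy illustration under stated assumptions, not DataFusion's implementation: `RowGroupStats` and `prune_row_groups` are hypothetical names, and the sketch assumes a single integer column and an inclusive range predicate.

```rust
/// Hypothetical per-row-group statistics, loosely modeled on the min/max
/// values that Parquet files keep in their metadata (an assumption made
/// for illustration only).
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// Returns the indices of row groups that might contain a value in the
/// inclusive range [lower, upper]. A row group is skipped only when its
/// [min, max] range cannot overlap the predicate range.
fn prune_row_groups(stats: &[RowGroupStats], lower: i64, upper: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| s.max >= lower && s.min <= upper)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let stats = [
        RowGroupStats { min: 0, max: 99 },
        RowGroupStats { min: 100, max: 199 },
        RowGroupStats { min: 200, max: 299 },
    ];
    // Predicate: value BETWEEN 150 AND 210.
    let kept = prune_row_groups(&stats, 150, 210);
    println!("row groups to scan: {:?}", kept); // prints "row groups to scan: [1, 2]"
}
```

The first row group is never read because its maximum (99) is below the lower bound of the predicate. Avoiding that I/O and decoding work is the payoff of pruning; a real engine has to handle many more data types, predicates, and missing statistics.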
I am particularly excited about the possibility of working with academic groups—there is a wealth of talent and focused time for low-level performance optimization among PhD students. Additional collaboration can accelerate the adoption of students’ work into real-world systems and make DataFusion faster, and I am excited to help make it happen.
The year ahead
2025 will be very exciting as more DataFusion-based systems hit the market, solidifying its place as a foundational building block for analytic and data platforms. The future of the data stack is composable, and DataFusion will be one key component. While challenges are inevitable, the community (and I) will focus on driving it forward as fast as possible while maintaining a stable foundation, leading to a thriving ecosystem.
I’ll close with my usual appeal (aka 🎣 attempt): DataFusion is an open source project driven by open contributions. We welcome and encourage contributions from everyone. Review capacity remains our most limited but most impactful resource, and I encourage companies and individuals to dedicate time to reviewing code, testing proposals, and helping maintain the project.
Finally, I want to express my gratitude to InfluxData. It was InfluxData’s vision and early recognition of DataFusion’s potential that introduced me to the project and supported my contributions over the past 4.5 years. This has allowed me to engage deeply: reviewing countless PRs, contributing more features (both directly and indirectly related to InfluxDB 3), writing many blog posts, traveling for meetups, and serving as the project’s PMC chair.
2025 will be a pivotal year for DataFusion, and I look forward to seeing the innovation this community will drive.
[1] The numbers on those lists seem modest, but they only include people who have written publicly about their use. I know of many internal projects/data systems not listed that also use DataFusion.
[2] I am a database internals developer, after all. How cool is that!!