7 Projects Building on DataFusion
By Charles Mahler / Developer / Feb 25, 2025
2024 was a huge year for Apache DataFusion, and 2025 is looking to be even bigger. The project officially became a top-level Apache Software Foundation project, achieved best-in-class performance for querying Parquet files in the ClickBench benchmark, and had a research paper accepted at SIGMOD.
As companies seek faster, more cost-effective ways to build data-intensive applications, DataFusion has become an increasingly valuable tool. It provides a powerful, embeddable, and scalable query engine out of the box. This has leveled the playing field, allowing startups and open source projects to compete with established data platforms by offering high-performance analytics without the traditional cost of developing a query engine from scratch.
In this post, we’ll explore some projects leveraging DataFusion to power everything from real-time observability to AI-driven data infrastructure.
Pydantic Logfire
Pydantic is a popular open source data validation library for Python. It uses Python type annotations to validate and parse data, ensuring that variables and configurations adhere to specified types and constraints.
The Pydantic team created Logfire to simplify observability for Python applications. A key piece is built around DataFusion, which is used to query the observability data collected and stored in Parquet files.
InfluxDB
InfluxDB is an open source time series database developed by InfluxData. Designed for high-performance handling of time-stamped data, it’s widely used for monitoring, analytics, and IoT applications and supports high write and query loads.
InfluxData was an early adopter of DataFusion, which is a key part of InfluxDB 3’s architecture. DataFusion was chosen due to the advantages of being written in Rust, the interoperability benefits of using Arrow as its in-memory data format, and its extensibility, which allows InfluxDB 3 to support querying via SQL, InfluxQL, and Flux.
OpenObserve
OpenObserve is an open source observability platform that aims to provide an alternative to tools like Datadog or Splunk. It offers features like real-time monitoring, alerting, and dashboards to help developers and operators understand and optimize their applications.
A primary reason OpenObserve chose DataFusion is that it allows them to directly query Parquet files in object storage while maintaining high performance. This provides users with cost savings by not requiring more expensive disk-based storage. It also enables easier deployment and management of OpenObserve compared to other open source platforms that rely on external databases for their storage backend.
LanceDB
LanceDB is an open source vector database that stores both vector embeddings and the actual data used to create them. This allows for data versioning and faster retrievals compared to other vector databases.
LanceDB’s support for storing multi-modal data comes from its custom columnar data format, Lance. LanceDB uses DataFusion to support SQL queries across all types of data stored in the database.
SpiceAI
SpiceAI is an open source platform designed to simplify the development of AI apps and agents. It provides features like query federation across data lakes and data warehouses, data caching, an LLM gateway compatible with all major LLM provider APIs, and the ability to create semantic data models.
DataFusion is a key part of Spice’s stack that enables them to support querying multiple data sources using a single SQL interface. DataFusion’s integration with Apache Arrow also enables their accelerated data features like locally caching RAG data to reduce latency for LLM responses. Spice is a DataFusion contributor and added query support for Postgres, MySQL, DuckDB, and SQLite.
Cube
Cube is a popular open source tool used to create semantic layers for data applications, with over 18K stars on GitHub. It aggregates data from various sources, processes it, and provides a unified API for querying and visualization. Cube is designed to handle large datasets and deliver real-time insights, making it suitable for building dashboards and reports.
Cube chose DataFusion to build its next-generation data engine, Tesseract. DataFusion improves Cube’s ability to compile, maintain, and generate more complex SQL queries, such as period-over-period comparisons, percentage of total calculations, level of detail calculations, bucketing calculations, and data blending.
Arroyo
Arroyo is a serverless, real-time stream processing framework that allows developers to process and analyze high-volume data with sub-second latency. It provides a scalable and fault-tolerant architecture for handling high-throughput data streams, making it suitable for real-time analytics, monitoring, and event processing.
Arroyo built an entirely new SQL engine on top of DataFusion and Arrow, which allowed them to achieve 3x higher throughput and 20x faster startup, and to reduce their Docker image size by 11x.
Why choose DataFusion?
As you can see from the projects above, DataFusion is versatile by design and can be utilized across a broad range of use cases. Engineers choose DataFusion for a number of reasons, but here are some of the most common ones:
- Interoperability - One of the biggest advantages of DataFusion is the tight integration with Apache Arrow and Parquet. This makes it easy to work with existing tools within the ecosystem and makes adoption by new users easier. For example, Apache DataFusion Comet, a Spark accelerator built on DataFusion, provides Spark compatibility with DataFusion’s performance.
- Rust - Another selling point for developers is that DataFusion is written in Rust. This can help make development easier by avoiding many of the memory issues that can occur when writing a query engine in C++ while still getting excellent performance.
- Extensibility - DataFusion can be extended with custom query plans for specific workloads and data types. It also supports User Defined Functions to add SQL functionality needed for your specific use case. And with custom data sources, DataFusion can be used to integrate and query different data formats or storage backends.
What will you build?
Apache DataFusion is rapidly transforming the data infrastructure landscape, enabling projects of all sizes to deliver high-performance, SQL-based analytics without the burden of building a query engine from scratch. By choosing DataFusion, these companies gain faster time to market, lower development costs, and superior performance compared to legacy systems.
The projects highlighted in this post show DataFusion’s versatility. Whether querying Parquet files in object storage, optimizing real-time analytics, or powering AI-driven applications, DataFusion continues to push the boundaries of modern data processing.
With its growing open source community, strong performance benchmarks, and continued innovation, DataFusion is positioned to become the foundation of the next generation of data-driven applications. If you’re building a data-intensive product, it might be time to see how DataFusion can enhance it.