Optimizing SQL (and DataFrames) in DataFusion: Part 1

Introduction

Sometimes Query Optimizers are seen as a sort of black magic, “the most challenging problem in computer science,” according to Father Pavlo, or some behind-the-scenes player. We believe this perception is because:

  1. One must implement the rest of a database system (data storage, transactions, SQL parser, expression evaluation, plan execution, etc.) before the optimizer becomes critical.[1]
  2. Some parts of the optimizer are tightly tied to the rest of the system (e.g., storage or indexes), so many classic optimizers are described with system-specific terminology.
  3. Some optimizer tasks, such as access path selection and join ordering, are known challenges that remain unsolved in practice; maybe they really do require black magic 🤔.

However, Query Optimizers are no more complicated in theory or practice than other parts of a database system, as we will argue in a series of posts:

Part 1:

  • Review what a Query Optimizer is, what it does, and why you need one for SQL and DataFrames.
  • Describe how industrial Query Optimizers are structured and the standard classes of optimizations they apply.

Part 2:

  • Describe the optimization categories with examples and pointers to implementations.
  • Describe Apache DataFusion’s rationale and approach to query optimization, specifically for access path selection and join ordering.

After reading these blogs, we hope people will use DataFusion to:

  1. Build their own system-specific optimizers.
  2. Perform practical academic research on optimization (especially researchers working on new optimizations / join ordering—looking at you CMU 15-799, next year).

Query Optimizer background

The key pitch for querying databases, and likely the key to the longevity of SQL (despite people’s love/hate relationship; see SQL or Death? Seminar Series – Spring 2025), is that it disconnects the WHAT you want to compute from the HOW to do it. SQL is a declarative language: it describes what answers are desired. This contrasts with an imperative language such as Python, where you describe how to do the computation, as shown in Figure 1.

Figure 1: Query Execution: Users describe the answer they want using either a DataFrame or SQL. The query planner or DataFrame API translates that description into an Initial Plan, which is correct but slow. The Query Optimizer then rewrites the initial plan to an Optimized Plan, which computes the same results but faster and more efficiently. Finally, the Execution Engine executes the optimized plan producing results.

SQL, DataFrames, LogicalPlan equivalence

Given their name, it is not surprising that Query Optimizers can improve the performance of SQL queries. However, it is under-appreciated that this also applies to DataFrame style APIs.

Classic DataFrame systems such as pandas and Polars (by default) execute eagerly and thus have limited opportunities for optimization. However, more modern APIs such as Polars’ lazy API, Apache Spark DataFrame, and DataFusion’s DataFrame are much faster as they use the design shown in Figure 1 and apply many query optimization techniques.

Example of Query Optimizer

This section motivates the value of a Query Optimizer with an example. Let’s say you have some observations of animal behavior, as illustrated in Table 1.

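| Location | Species           | Population | Observation Time     | Notes      |
|----------|-------------------|------------|----------------------|------------|
| North    | contrarian spider | 100        | 2025-02-21T10:00:00Z | Watched Me |
| ...      | ...               | ...        | ...                  | ...        |
| South    | contrarian spider | 234        | 2025-02-23T11:23:00Z | N/A        |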

Table 1: Example observational data.

If a user wants to know the average population for some species in the last month, they can write a SQL query or a DataFrame expression such as the following:

-- SQL
SELECT location, AVG(population)
FROM observations
WHERE species = 'contrarian spider' AND
  observation_time >= now() - interval '1 month'
GROUP BY location

// DataFrame (Rust-style pseudocode)
df.scan("observations")
  .filter(col("species").eq("contrarian spider"))
  .filter(col("observation_time").ge(now().sub(interval("1 month"))))
  .agg(vec![col("location")], vec![avg(col("population"))])

Within DataFusion, both the SQL and DataFrame are translated into the same LogicalPlan, a “tree of relational operators.” This is a fancy way of saying data flow graphs where the edges represent tabular data (rows + columns) and the nodes represent a transformation (see this DataFusion overview video for more details). The initial LogicalPlan for the queries above is shown in Figure 2.

Figure 2: Example initial LogicalPlan for SQL and DataFrame query. The plan is read from bottom to top, computing the results in each step.
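
As a rough illustration of the idea that a DataFrame call chain is just another way to build a tree of relational operators, here is a minimal sketch in plain Rust. The `Plan` and `LazyFrame` types are hypothetical and far simpler than DataFusion's real `LogicalPlan` and `DataFrame`, and predicates are kept as strings for brevity. Each method only wraps the current plan in a new node, so no data is read while the plan is built.

```rust
// Toy model for illustration: `Plan` and `LazyFrame` are hypothetical types,
// far simpler than DataFusion's real `LogicalPlan` and `DataFrame`.

/// A tree of relational operators; each node owns its input.
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    Scan { table: String },
    Filter { predicate: String, input: Box<Plan> },
    Aggregate { group_by: Vec<String>, aggregates: Vec<String>, input: Box<Plan> },
}

/// A lazy DataFrame: every call wraps the current plan in a new node and
/// reads no data.
pub struct LazyFrame {
    pub plan: Plan,
}

impl LazyFrame {
    pub fn scan(table: &str) -> Self {
        LazyFrame { plan: Plan::Scan { table: table.to_string() } }
    }

    pub fn filter(self, predicate: &str) -> Self {
        LazyFrame {
            plan: Plan::Filter { predicate: predicate.to_string(), input: Box::new(self.plan) },
        }
    }

    pub fn aggregate(self, group_by: &[&str], aggregates: &[&str]) -> Self {
        LazyFrame {
            plan: Plan::Aggregate {
                group_by: group_by.iter().map(|s| s.to_string()).collect(),
                aggregates: aggregates.iter().map(|s| s.to_string()).collect(),
                input: Box::new(self.plan),
            },
        }
    }
}

/// Render the tree with the root first and each input indented below it,
/// so the data flows from the bottom line to the top line.
pub fn render(plan: &Plan, depth: usize) -> String {
    let indent = "  ".repeat(depth);
    match plan {
        Plan::Scan { table } => format!("{indent}Scan: {table}\n"),
        Plan::Filter { predicate, input } => {
            format!("{indent}Filter: {predicate}\n{}", render(input, depth + 1))
        }
        Plan::Aggregate { group_by, aggregates, input } => format!(
            "{indent}Aggregate: group_by=[{}], aggregates=[{}]\n{}",
            group_by.join(", "),
            aggregates.join(", "),
            render(input, depth + 1)
        ),
    }
}

/// Build a simplified version of the example query as a plan.
pub fn example() -> Plan {
    LazyFrame::scan("observations")
        .filter("species = 'contrarian spider'")
        .aggregate(&["location"], &["avg(population)"])
        .plan
}
```

Calling `render(&example(), 0)` gives a three-line tree with the aggregate at the top and the scan at the bottom. A SQL planner targeting the same `Plan` type could produce an identical value, which is why one optimizer can serve both front ends.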

The optimizer’s job is to take this query plan and rewrite it into an alternate plan that computes the same results but faster, such as the one shown in Figure 3.

Figure 3: An example optimized plan that computes the same result as the plan in Figure 2 more efficiently. The diagram highlights where the optimizer has applied Projection Pushdown, Filter Pushdown, and Constant Evaluation. Note that this is a simplified example for explanatory purposes, and actual optimizers such as the one in DataFusion perform additional tasks such as choosing specific aggregation algorithms.
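
To make one of these rewrites concrete, the following is a minimal sketch of Constant Evaluation on a toy expression tree. The `Expr` type and `fold_constants` function are hypothetical, and timestamps are modeled as plain integers (seconds) as a simplifying assumption. The idea is that a subexpression such as `now() - interval '1 month'` has only constant inputs once `now()` is fixed for the query, so it can be computed once during planning rather than once per row.

```rust
// Toy model for illustration: `Expr` is a hypothetical expression type,
// and timestamps are plain integers (seconds) to keep the sketch small.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Literal(i64),
    Sub(Box<Expr>, Box<Expr>),
    GtEq(Box<Expr>, Box<Expr>),
}

/// Recursively replace subtractions of two literals with a single literal.
/// Anything that involves a column is left in place.
pub fn fold_constants(expr: Expr) -> Expr {
    match expr {
        Expr::Sub(left, right) => match (fold_constants(*left), fold_constants(*right)) {
            (Expr::Literal(a), Expr::Literal(b)) => match a.checked_sub(b) {
                Some(value) => Expr::Literal(value),
                // On overflow, keep the original expression unevaluated.
                None => Expr::Sub(Box::new(Expr::Literal(a)), Box::new(Expr::Literal(b))),
            },
            (l, r) => Expr::Sub(Box::new(l), Box::new(r)),
        },
        Expr::GtEq(left, right) => Expr::GtEq(
            Box::new(fold_constants(*left)),
            Box::new(fold_constants(*right)),
        ),
        other => other,
    }
}

/// `observation_time >= <now> - <one month>`, with both constants already
/// resolved to made-up integer values.
pub fn example_predicate() -> Expr {
    Expr::GtEq(
        Box::new(Expr::Column("observation_time".to_string())),
        Box::new(Expr::Sub(
            Box::new(Expr::Literal(1_740_000_000)),
            Box::new(Expr::Literal(2_592_000)),
        )),
    )
}
```

After `fold_constants(example_predicate())`, the right-hand side of the comparison is a single literal, so the execution engine compares each row against a precomputed value.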

Query Optimizer implementation

Industrial optimizers, such as those in DataFusion, ClickHouse, DuckDB, and Apache Spark, are implemented as a series of passes or rules, each of which rewrites a query plan, as shown in Figure 4. The order of the rules often matters, but we will not discuss this detail in this post.

A multi-pass design is standard because it makes it easier to:

  1. Understand, implement, and test each pass in isolation
  2. Extend the optimizer by adding new passes

Figure 4: Query Optimizers are implemented as a series of rules that each rewrite the query plan. Each rule’s algorithm is expressed as a transformation of a previous plan.
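
The multi-pass structure can be sketched in a few lines of Rust. Everything here is hypothetical and for illustration only: the `Rule` trait, the two example rules, and the two-node `Plan` type are assumptions, not DataFusion's actual optimizer interfaces. Each rule takes a plan and returns a rewritten plan, and the optimizer simply applies the rules in order.

```rust
// Toy model for illustration: the `Rule` trait and both rules are
// hypothetical, not DataFusion's real optimizer interfaces.
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    Scan { table: String, pushed_filters: Vec<String> },
    Filter { predicate: String, input: Box<Plan> },
}

/// One optimizer pass: takes a plan and returns an equivalent plan.
pub trait Rule {
    fn rewrite(&self, plan: Plan) -> Plan;
}

/// Drops filters whose predicate is the constant `true`.
pub struct RemoveTrivialFilter;

impl Rule for RemoveTrivialFilter {
    fn rewrite(&self, plan: Plan) -> Plan {
        match plan {
            Plan::Filter { predicate, input } => {
                let input = self.rewrite(*input);
                if predicate == "true" {
                    input
                } else {
                    Plan::Filter { predicate, input: Box::new(input) }
                }
            }
            scan @ Plan::Scan { .. } => scan,
        }
    }
}

/// Moves a filter that sits directly above a scan into the scan itself.
pub struct PushFilterIntoScan;

impl Rule for PushFilterIntoScan {
    fn rewrite(&self, plan: Plan) -> Plan {
        match plan {
            Plan::Filter { predicate, input } => match self.rewrite(*input) {
                Plan::Scan { table, mut pushed_filters } => {
                    pushed_filters.push(predicate);
                    Plan::Scan { table, pushed_filters }
                }
                other => Plan::Filter { predicate, input: Box::new(other) },
            },
            scan @ Plan::Scan { .. } => scan,
        }
    }
}

/// The optimizer is just the rules applied one after another.
pub fn optimize(plan: Plan, rules: &[Box<dyn Rule>]) -> Plan {
    rules.iter().fold(plan, |plan, rule| rule.rewrite(plan))
}

/// An example rule sequence; adding a pass means adding one entry here.
pub fn default_rules() -> Vec<Box<dyn Rule>> {
    vec![Box::new(RemoveTrivialFilter), Box::new(PushFilterIntoScan)]
}
```

Because each rule is a separate type with a single method, it can be unit tested with a small input plan and an expected output plan, without involving the rest of the optimizer.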

There are three major classes of optimizations in industrial optimizers:

  1. Always Optimizations: These are always good to do and thus are always applied. This class of optimization includes expression simplification, predicate pushdown, and limit pushdown. These optimizations are typically simple in theory, though they require nontrivial amounts of code and tests to implement in practice.
  2. Engine Specific Optimizations: These optimizations take advantage of specific engine features, such as how expressions are evaluated or what particular hash or join implementations are available.
  3. Access Path and Join Order Selection: These passes choose one access method per table and a join order for execution, typically using heuristics and a cost model to make tradeoffs between the options. Databases often have multiple ways to access the data (e.g., index scan or full-table scan), as well as many potential orders to combine (join) multiple tables. These methods compute the same result but can vary drastically in performance.
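
The third class can be illustrated with a deliberately tiny cost model for access path selection. The cost constants and formulas below are assumptions made up for this sketch and do not come from DataFusion or any other system. The point is only that both paths return the same rows, and the optimizer picks whichever one the model estimates to be cheaper.

```rust
// Toy cost model: the constants and formulas below are assumptions chosen
// for illustration, not values used by DataFusion or any real system.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AccessPath {
    FullScan,
    IndexScan,
}

/// Assumed cost of reading one row sequentially.
const SEQUENTIAL_ROW_COST: f64 = 1.0;
/// Assumed cost of fetching one matching row through an index.
const RANDOM_ROW_COST: f64 = 4.0;

/// Estimate the cost of a path given the table size and the fraction of
/// rows the predicate is expected to keep (`selectivity`, from 0.0 to 1.0).
pub fn estimated_cost(path: AccessPath, row_count: f64, selectivity: f64) -> f64 {
    match path {
        // A full scan reads every row, whatever the predicate keeps.
        AccessPath::FullScan => row_count * SEQUENTIAL_ROW_COST,
        // An index scan touches only matching rows, at a higher per-row cost.
        AccessPath::IndexScan => row_count * selectivity * RANDOM_ROW_COST,
    }
}

/// Pick the path with the lower estimated cost; ties go to the full scan.
pub fn choose_access_path(row_count: f64, selectivity: f64) -> AccessPath {
    let full = estimated_cost(AccessPath::FullScan, row_count, selectivity);
    let index = estimated_cost(AccessPath::IndexScan, row_count, selectivity);
    if index < full {
        AccessPath::IndexScan
    } else {
        AccessPath::FullScan
    }
}
```

With these made-up constants, a predicate that keeps 1% of a million-row table favors the index scan, while one that keeps half the table favors the full scan. Real optimizers face the harder problem of estimating the selectivity itself, which is one reason this class remains difficult in practice.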

This brings us to the end of Part 1. In Part 2, we will explain these classes of optimizations in more detail and provide examples of how they are implemented in DataFusion and other systems.


  1. And thus in academic classes, by the time you get around to an optimizer the semester is over and everyone is ready for the semester to be done. Once industrial systems mature to the point where the optimizer is a bottleneck, the shiny newness of the hype cycle has worn off and it is likely in the trough of disappointment.