Using Parquet’s Bloom Filters

Note: We updated this post on June 6, 2024 to reflect valuable community feedback.

Summary

In this article, we explore when and how to use Bloom filters in Parquet, examine their impact on written Parquet files, and measure their effectiveness when dealing with large quantities of high-cardinality [1] data.

In summary, we found that:

  • Bloom filter parameters that more closely matched the number of distinct values within each Parquet row group yielded optimal pruning efficiency when dealing with large volumes of high-cardinality clustered data.
  • Query times dropped to roughly 1/30th of the baseline when using such Bloom filters, at a cost of 2 KB to 8 KB of storage per column per row group.
  • Choosing Bloom filter parameters that match the cardinality of the data set as a whole incurs a hefty storage penalty, but this was not necessary in our experiments.

A table containing calculated Bloom filter sizes for different input parameters is available in Bloom filter sizes.

Background & motivation

When querying data contained in Apache Parquet files, there are several techniques you can use to make queries faster. One method uses Bloom filters to prune entire row/column chunks. Bloom filters are particularly useful for columns that contain many distinct pseudo-random values, e.g., identifiers or UUIDs. This differs from columns that contain ordered data, have lower cardinality, or are already pruned effectively by the min/max statistics in row group metadata. Support for Bloom filters is available in the Apache Rust [2] and Java [3] implementations of the Parquet format, as well as in the Apache DataFusion query engine.

At InfluxData we frequently work with high-cardinality data and are considering several indexing strategies. Given that Bloom filters are part of Apache Parquet and supported by the libraries we already use, they are an appealing choice. However, there was little information on how effective they are in practice or how to set their tuning knobs. Specifically, we wanted to know:

  • How much do Bloom filters increase the size of Parquet files?
  • How much do Parquet Bloom filters improve query performance for high-cardinality data, e.g., 100k to 1M distinct values?

Use case for Bloom filters

Consider the example Parquet file in Figure 1, which stores CPU usage for containers running on host servers in different cloud regions.

Figure 1: Example Parquet file with three columns ordered by increasing cardinality as well as a time and cpu column; region has a cardinality of 10, host a cardinality of 1,000, and container_id a cardinality of 1,000,000. The cardinality of the data set corresponds to that of the highest cardinality column, i.e., 1,000,000. This Parquet file consists of two row groups, each with metadata containing the min and max values for each column.

Rows in this Parquet file are ordered by region, then host, and then container_id, to match the hierarchy of the entities producing the data. In practice, there would be many rows, i.e., representing the CPU usage at different timestamps, for each unique combination of region, host, and container_id.

If we query the Parquet file in Figure 1 using region and container_id as a predicate, e.g.,

SELECT * FROM 'fig1.parquet' WHERE region = 'us-east-2' AND container_id = 'CCC';

then the query engine will prune Row Group 2, given that us-east-2 is outside the min/max values for the region column contained in Row Group 2’s metadata, and therefore will only decode rows from Row Group 1. However, if we query using only container_id, e.g.,

SELECT * FROM 'fig1.parquet' WHERE container_id = 'CCC';

then the query engine must scan both row groups, because the metadata alone cannot rule out either group based on their respective min/max of container_id values (the value CCC falls in both ranges AAA..DDD and BBB..FFF).

This is where Bloom filters come in. Applying a Bloom filter to the container_id column enables the query engine to prune row groups when that column is used as the predicate.

Parquet’s Bloom filter implementation

Parquet uses Split Block Bloom filters (SBBF). When writing Parquet data, an SBBF is configured on a per-column basis by specifying the following parameters:

  • Number of distinct values (NDV)
  • False positive probability (FPP)

Bloom filters are read from a Parquet file using their length and offset, which are stored in the metadata for each row group; DataFusion uses them to filter out entire row groups when querying a file. It seems reasonable to assume that NDV should closely match the cardinality of the column it is applied to, but the larger the NDV, the more space the Bloom filter occupies in the file.

We are interested in knowing the impact of an underestimated NDV because that would save storage space. Furthermore, we assume that the chosen FPP should correspond to the amount of pruning we will get out of the Bloom filter.
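
To make this concrete, here is a minimal sketch of enabling a per-column Bloom filter when writing Parquet with the Rust parquet crate referenced above. The builder method names are as found in recent versions of the crate, and the container_id column is just the example from Figure 1; treat both as illustrative rather than definitive.

```rust
use parquet::file::properties::WriterProperties;
use parquet::schema::types::ColumnPath;

// Sketch: enable a Split Block Bloom filter only on the high-cardinality
// column; the other columns keep relying on min/max statistics.
fn writer_props_with_bloom_filter() -> WriterProperties {
    let container_id = ColumnPath::from("container_id");
    WriterProperties::builder()
        // False positive probability for this column's Bloom filter.
        .set_column_bloom_filter_fpp(container_id.clone(), 0.01)
        // Expected number of distinct values per row group (not the
        // cardinality of the whole data set; see the experiments below).
        .set_column_bloom_filter_ndv(container_id.clone(), 1_000)
        // Turn the filter on for this column only.
        .set_column_bloom_filter_enabled(container_id, true)
        .build()
}
```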

Bloom filter sizes

Bloom filters do not come free. They are stored in the Parquet file and therefore incur a storage penalty. The desired NDV and FPP determine the size of the filter. The FPP can be calculated as [4]:

$$f = \left(1 - e^{-kn/m}\right)^{k}$$

where f is the FPP, k is the number of hash functions, n is the NDV, and m is the total number of bits in the Bloom filter. The number of bits required to achieve the desired FPP and NDV can be determined by rearranging the above:

$$m = \frac{-kn}{\ln\left(1 - f^{1/k}\right)}$$

SBBFs use eight hash functions to cleanly fit in SIMD lanes [5]. Therefore, k is set to 8. The SBBF spreads those m bits across a set of blocks that are each 256 bits, i.e., 32 bytes, in size. Both the Rust [6] and Java [7] implementations of Parquet Bloom filters determine the size in bytes as:

$$\text{size in bytes} = 2^{\left\lceil \log_2 (m/8) \right\rceil}$$

Here, m is divided by 8 to convert bits to bytes, and the result is rounded up to the next power of two.

Using the formulas above allows us to calculate sizes for various FPP and NDV [8] values:

| NDV | FPP | Size (KB) |
|---|---|---|
| 1,000 | 0.1 | 1 |
| 1,000 | 0.01 | 2 |
| 1,000 | 0.001 | 2 |
| 10,000 | 0.1 | 8 |
| 10,000 | 0.01 | 16 |
| 10,000 | 0.001 | 32 |
| 10,000 | 0.0001 | 32 |
| 100,000 | 0.1 | 128 |
| 100,000 | 0.01 | 128 |
| 100,000 | 0.001 | 256 |
| 100,000 | 0.0001 | 512 |
| 100,000 | 0.00001 | 512 |
| 1,000,000 | 0.1 | 1,024 |
| 1,000,000 | 0.01 | 2,048 |
| 1,000,000 | 0.001 | 2,048 |
| 1,000,000 | 0.0001 | 4,096 |
| 1,000,000 | 0.00001 | 4,096 |
| 1,000,000 | 0.000001 | 8,192 |
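
As a sanity check on the table above, the calculation can be reproduced in a few lines of Rust. This is a sketch that applies the formulas above with k = 8 and rounds the byte count up to the next power of two; the single-block (32-byte) minimum is included, and any implementation-specific maximum is omitted.

```rust
/// Sketch: approximate SBBF size in bytes for a desired NDV and FPP,
/// following the formulas above with k = 8 hash functions.
fn bloom_filter_size_bytes(ndv: u64, fpp: f64) -> u64 {
    let k = 8.0_f64;
    // m = -k * n / ln(1 - f^(1/k)), in bits
    let m_bits = -k * ndv as f64 / (1.0 - fpp.powf(1.0 / k)).ln();
    // Convert bits to bytes and round up to the next power of two,
    // with a single 32-byte block as the minimum.
    ((m_bits / 8.0).ceil() as u64).max(32).next_power_of_two()
}

fn main() {
    // Reproduces rows of the table above, e.g., NDV 1,000 / FPP 0.01 -> 2 KB.
    for (ndv, fpp) in [(1_000_u64, 0.01), (100_000, 0.001), (1_000_000, 0.000001)] {
        println!("NDV {ndv}, FPP {fpp}: {} KB", bloom_filter_size_bytes(ndv, fpp) / 1024);
    }
}
```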


Experiments

Setup

While the theoretical results in the previous section are informative, we wanted to validate that they match reality. To do so, we developed the parquet-bloom-filter-analysis tool.

It is a CLI, written in Rust, that generates Parquet files with a specified number of columns, each having:

  • Cardinality, i.e., the number of unique values
  • Optional Bloom filter, with FPP and NDV parameters

In addition, when generating a set of files, we specify:

  • The number of Parquet files to generate
  • Maximum row group size
  • Duplication factor

The rows generated are ordered by columns of increasing cardinality, as in the example in Figure 1 above. The duplication factor is used to generate duplicate rows for each unique combination of the generated column values, thereby clustering the data by unique combinations of column values. This is akin to there being many CPU usage metrics recorded at different timestamps for each region/host/container combination in the example from Figure 1.
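
The following sketch illustrates this generation scheme; it is not the actual parquet-bloom-filter-analysis code, and the column names and the way high-cardinality values are assigned to lower-cardinality "parent" values are assumptions chosen simply to keep the output sorted.

```rust
/// Illustrative sketch of generating ordered, clustered rows: each unique
/// combination of column values is repeated `duplication` times, and rows
/// come out sorted by columns of increasing cardinality.
fn generate_rows(cardinalities: [usize; 3], duplication: usize) -> Vec<[String; 3]> {
    let [c0, c1, c2] = cardinalities; // c0 < c1 < c2
    let mut rows = Vec::with_capacity(c2 * duplication);
    for i in 0..c2 {
        // Map each high-cardinality value onto one lower-cardinality value
        // so the output stays sorted by (col_0, col_1, col_2).
        let row = [
            format!("first-{}", i * c0 / c2),
            format!("second-{}", i * c1 / c2),
            format!("third-{i:06}"),
        ];
        // The duplication factor clusters the data, like repeated CPU
        // measurements per region/host/container in Figure 1.
        for _ in 0..duplication {
            rows.push(row.clone());
        }
    }
    rows
}
```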

We used parquet-bloom-filter-analysis to conduct the following experiments:

  1. NDV Impact on size: For data with 1M cardinality and 100k cardinality, generate a small number of files with a fixed FPP and varying NDV parameters to determine an acceptable baseline for those parameters at each respective cardinality and verify that the impact on Parquet file size matches the theory.
  2. Query performance impact: Generate a larger volume of Parquet files for both 1M and 100k cardinality to assess the query performance gains achievable with Bloom filters at a more realistic scale.

System Specifications

All experiments were run on a machine with the following specifications:

CPU Apple M3 Max
Total Number of Cores 14
Total Memory 36 GB
Storage type SSD

NDV impact on size: 1M cardinality

We generated a data set containing three columns with the following properties:

Number of files 10
Rows per file 10M
Rows per row group 1M
Duplication factor 100
Column 1 Cardinality 100
Column 2 Cardinality 10k
Column 3 Cardinality 1M

This resulted in a total of 100M rows of data, with 100 rows per unique combination of column values.

Here is a sample of the generated data:

+----------+-------------+--------------+
| col_0    | col_1       | col_2        |
+----------+-------------+--------------+
| first-52 | second-4119 | third-136666 |
| first-52 | second-4119 | third-230251 |
| first-52 | second-4119 | third-287794 |
| first-52 | second-4119 | third-360226 |
| first-52 | second-4119 | third-584614 |
| first-52 | second-4119 | third-672996 |
| first-52 | second-4119 | third-952788 |
| first-52 | second-4174 | third-002478 |
| first-52 | second-4174 | third-236114 |
| first-52 | second-4174 | third-439784 |
| first-52 | second-4174 | third-477628 |
| first-52 | second-4174 | third-540914 |
| ...                                   |
+----------+-------------+--------------+

From there, we created a Bloom filter on the highest-cardinality column, i.e., Column 3, by fixing FPP at 0.01 and varying NDV. We chose an FPP of 0.01 because the generated data contains only 100 row groups: in the best case only 1 of every 100 row groups needs to be decoded, a reduction of two orders of magnitude, and with an FPP of 0.01 we expect roughly 0.01 × 99 ≈ 1 additional row group to be decoded due to false positives.

The storage cost can be measured as the difference between the average Parquet file size with the Bloom filter and the average baseline file size. The baseline file size for this data set was determined by encoding the same data without any Bloom filters.

To see how many row groups are actually pruned, we ran queries with a predicate on the filtered column through DataFusion's CLI [9] tool, like so:

EXPLAIN ANALYZE SELECT * FROM '<data_folder>/' WHERE col_2 = '<target_value>';

Here, <data_folder> is the location of the data set generated with the FPP/NDV combination of interest, and a representative range of <target_value>s was tested.
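
The same check can also be scripted against DataFusion's Rust API rather than the CLI. The sketch below registers the generated directory as a table and prints the annotated physical plan produced by EXPLAIN ANALYZE, from which the pruned row group counts can be read. The <data_folder> path remains a placeholder, and 'third-136666' is simply one of the generated values shown in the sample above.

```rust
use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Register the directory of generated Parquet files as a table.
    ctx.register_parquet("data", "<data_folder>/", ParquetReadOptions::default())
        .await?;
    // EXPLAIN ANALYZE annotates the Parquet scan with its metrics, including
    // how many row groups were pruned.
    let df = ctx
        .sql("EXPLAIN ANALYZE SELECT * FROM data WHERE col_2 = 'third-136666'")
        .await?;
    df.show().await?;
    Ok(())
}
```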

Results

The baseline storage cost, i.e., the average Parquet file size without Bloom filters, was 647,080.5 B. Here are the results for a fixed FPP of 0.01, with varying NDV:

| FPP | NDV | Avg. File Size (B) | Delta (B) | Delta / RG (B) | RG Pruned (/100) |
|---|---|---|---|---|---|
| 0.01 | 10 | 647,610.5 | 530.0 | 53.0 | 0 |
| 0.01 | 100 | 648,590.5 | 1,510.0 | 151.0 | 0 |
| 0.01 | 500 | 657,588.0 | 10,507.5 | 1,050.8 | 0 |
| 0.01 | 1,000 | 667,790.5 | 20,710.0 | 2,071.0 | 5 |
| 0.01 | 2,500 | 688,270.5 | 41,190.0 | 4,119.0 | 45 |
| 0.01 | 5,000 | 729,250.5 | 82,170.0 | 8,217.0 | 91 |
| 0.01 | 7,500 | 811,170.5 | 164,090.0 | 16,409.0 | 99 |
| 0.01 | 10,000 | 811,170.5 | 164,090.0 | 16,409.0 | 99 |
| 0.01 | 100,000 | 1,958,116.5 | 1,311,036.0 | 131,103.6 | 99 |
| 0.01 | 1,000,000 | 21,618,939.5 | 20,971,859.0 | 2,097,185.9 | 99 |

Delta - difference between Avg. File Size and baseline file size; RG - row group.

Discussion/Key Observations

  • DataFusion prunes more than 90% of row groups at a cost of an 8 KB Bloom filter per row group (roughly 12% larger files). All non-matching row groups are pruned at a cost of 16 KB per row group (roughly 20% larger files).
  • The relative cost seems significant here but would be less so in practice. For example, the Parquet files generated by InfluxDB tend to be megabytes in size and store many more columns, including timestamp and 64-bit float columns. We therefore expect an increase of 8-16 KB per row group to result in storage growth of ~1% or less.
  • Choosing an NDV that corresponds closely with the actual cardinality of the column incurs a hefty storage cost (2 MB per row group) but does not increase pruning efficiency and therefore does not appear to be necessary.

NDV impact on size: 100k cardinality

As in the previous experiment, this experiment uses data with the following properties:

Number of files 10
Rows per file 10M
Rows per row group 1M
Duplication factor 1000
Column 1 Cardinality 100
Column 2 Cardinality 10k
Column 3 Cardinality 100k

There are two differences: the duplication factor is 1000 (vs. 100), and the cardinality of Column 3 is 100k (vs. 1M). This keeps the number of rows and Parquet files consistent with the previous experiment while keeping the generated data ordered.

File sizes and pruning efficiency were measured in the same way as in the previous experiment.

Results

The baseline storage cost, i.e., the average Parquet file size when no Bloom filter was added, was 71,965.6 B. Here are the results for a fixed FPP of 0.01, with varying NDV:

| FPP | NDV | Avg. File Size (B) | Delta (B) | Delta / RG (B) | RG Pruned (/100) |
|---|---|---|---|---|---|
| 0.01 | 100 | 73,475.6 | 1,510.0 | 151.0 | 0 |
| 0.01 | 500 | 82,435.6 | 10,470.0 | 1,047.0 | 93 |
| 0.01 | 1,000 | 92,675.6 | 20,710.0 | 2,071.0 | 99 |
| 0.01 | 2,500 | 113,155.6 | 41,190.0 | 4,119.0 | 99 |
| 0.01 | 5,000 | 154,135.6 | 82,170.0 | 8,217.0 | 99 |
| 0.01 | 7,500 | 236,055.6 | 164,090.0 | 16,409.0 | 99 |
| 0.01 | 10,000 | 236,055.6 | 164,090.0 | 16,409.0 | 99 |
| 0.01 | 100,000 | 1,382,997.6 | 1,311,032.0 | 131,103.2 | 99 |
| 0.01 | 1,000,000 | 21,043,824.6 | 20,971,859.0 | 2,097,185.9 | 99 |

Delta - difference between Avg. File Size and baseline file size; RG - row group.

Discussion/Key Observations

  • The impact on file size is the same as in the previous experiment, consistent with the theory that the Bloom filter size depends on the NDV and FPP parameters, not on the cardinality of the data being filtered.
  • DataFusion successfully prunes all non-matching row groups at an NDV of 1,000, adding only ~2 KB of overhead per row group.
  • The pruning efficiency saturates at a lower NDV (1,000) than in the previous experiment (7,500).

Query performance impact: 100k cardinality

To assess how Bloom filters perform at a more realistic scale, we generated Parquet data with the following properties:

Number of files 10K
Rows per file 10M
Rows per row group 1M
Duplication factor 1M
Column 1 Cardinality 100
Column 2 Cardinality 10k
Column 3 Cardinality 100k

This corresponds to a data set with a cardinality of 100k, spread across 10k files. The data was generated twice: once with no Bloom filters, to establish a baseline, and once with a Bloom filter on Column 3 (the 100k-cardinality column) using an FPP of 0.01 and an NDV of 1,000.

The two things we wanted to check in this scenario are:

  1. Does the pruning efficiency remain consistent with the small-scale experiments performed above?
  2. Can we measure an improvement in query performance between the files that have Bloom filters and those that do not?

Question (1) can be answered as before, by analyzing a query that selects a single value of Column 3, while (2) can be answered by running the query and measuring the response time.

Results

The experiment generated 10,000 files, with the following average file sizes:

| Data set | Average File Size (B) |
|---|---|
| Bloom filter on 100k cardinality column | 29,880.87 |
| No Bloom filters | 9,112.87 |

Based on these numbers, the calculated overhead per row group is 2,076.8 B, essentially the same as in the previous experiment; the baseline file size, however, is 9,112.87 B compared to 71,965.6 B.

Queries were performed using Column 3, i.e., the one with 100k cardinality, as the predicate, with values that span the total range of values within its cardinality (third-[00000, 99999]):

| Predicate (col_3 =) | With BF: RG Pruned (BF) | With BF: RG Pruned (Stats) | With BF: Total RG Pruned | With BF: Latency (ms) | No BF: RG Pruned (Stats) | No BF: Latency (ms) |
|---|---|---|---|---|---|---|
| third-99999 | 0 | 99,999 | 99,999 | 1,079 | 99,999 | 1,069 |
| third-90000 | 13,078 | 86,920 | 99,998 | 1,106 | 86,920 | 21,837 |
| third-80000 | 17,909 | 82,089 | 99,998 | 1,160 | 82,089 | 28,724 |
| third-70000 | 19,465 | 80,533 | 99,998 | 1,150 | 80,533 | 31,560 |
| third-60000 | 19,874 | 80,124 | 99,998 | 1,168 | 80,124 | 31,821 |
| third-50000 | 19,970 | 80,028 | 99,998 | 1,145 | 80,028 | 32,181 |
| third-40000 | 19,850 | 80,148 | 99,998 | 1,145 | 80,148 | 31,866 |
| third-30000 | 19,393 | 80,605 | 99,998 | 1,145 | 80,605 | 31,128 |
| third-20000 | 17,870 | 82,126 | 99,996 | 1,179 | 82,126 | 28,792 |
| third-10000 | 12,984 | 87,014 | 99,998 | 1,129 | 87,014 | 21,582 |
| third-00000 | 0 | 99,998 | 99,998 | 1,093 | 99,998 | 1,051 |

RG - row group; BF - Bloom filter; Stats - row group min/max statistics.

Discussion/Key Observations

  • The per-row-group overhead from Bloom filters is the same as with the larger files above, but it is a much larger fraction of these smaller files. Presumably this is because the column data is compressed while the Bloom filters in the metadata are not.
  • The Bloom filters prune the row groups that the row group statistics cannot, resulting in noticeably better query latency (roughly 1/30th of the latency without Bloom filters).
  • Predicate values at the minimum or maximum of the value range are already pruned well by row group statistics alone, so those queries are fast even without Bloom filters.

Query performance impact: 1M cardinality

This experiment used the same setup as the previous one, again generating 10,000 files; in this case, however, the cardinality of the third column was increased to 1M. The duplication factor was reduced accordingly to keep the total number of rows the same while keeping the generated data ordered:

Number of files 10K
Rows per file 10M
Rows per row group 1M
Duplication factor 100K
Column 1 Cardinality 100
Column 2 Cardinality 10k
Column 3 Cardinality 1M

The FPP and NDV parameters were also kept the same, at an FPP of 0.01 and an NDV of 1,000. This was to see whether a lower NDV, and therefore smaller Bloom filters, could be used at this scale and cardinality, despite the fact that at a smaller scale an NDV of 1,000 had little effect at 1M cardinality (see the NDV impact on size: 1M cardinality experiment above).

Results

The experiment generated 10,000 files with the following average file sizes:

| Data set | Average File Size (B) |
|---|---|
| Bloom filter on 1M cardinality column | 30,931.42 |
| No Bloom filters | 10,163.42 |

The overhead of the Bloom filters is therefore 2,076.8 B per row group.

Queries were performed using Column 3, i.e., the one with 1M cardinality, as the predicate, with values that span the total range of values within its cardinality (third-[000000, 999999]):

| Predicate (col_3 =) | With BF: RG Pruned (BF) | With BF: RG Pruned (Stats) | With BF: Total RG Pruned | With BF: Latency (ms) | No BF: RG Pruned (Stats) | No BF: Latency (ms) |
|---|---|---|---|---|---|---|
| third-999999 | 0 | 99,998 | 99,998 | 1,058 | 99,998 | 1,104 |
| third-900000 | 19,020 | 80,979 | 99,999 | 1,149 | 80,979 | 30,511 |
| third-800000 | 20,408 | 79,591 | 99,999 | 1,129 | 79,591 | 34,282 |
| third-700000 | 20,399 | 79,600 | 99,999 | 1,138 | 79,600 | 32,948 |
| third-600000 | 20,382 | 79,617 | 99,999 | 1,139 | 79,617 | 32,534 |
| third-500000 | 20,419 | 79,580 | 99,999 | 1,027 | 79,580 | 32,570 |
| third-400000 | 20,391 | 79,608 | 99,999 | 1,150 | 79,608 | 32,372 |
| third-300000 | 20,391 | 79,607 | 99,998 | 1,175 | 79,607 | 32,321 |
| third-200000 | 20,385 | 79,614 | 99,999 | 1,145 | 79,614 | 32,359 |
| third-100000 | 19,230 | 80,769 | 99,999 | 1,147 | 80,769 | 30,700 |
| third-000000 | 0 | 99,999 | 99,999 | 1,073 | 99,999 | 1,079 |

RG - row group; BF - Bloom filter; Stats - row group min/max statistics.

Discussion/Key Observations

  • Bloom filters are effective at this scale and cardinality, despite using an NDV of only 1,000.

Conclusions

When using Parquet, Bloom filters can provide substantial query performance gains at larger data volumes, albeit with additional storage costs. The good news is that more moderate Bloom filter parameters, which were quite effective with our ordered data, can mitigate the storage cost.

Some of the questions that remain open are:

  • Can even more moderate Bloom filter parameters be used, i.e., NDV < 1,000, to the same effect, thus mitigating the storage cost even further?
  • Would compressing the Bloom filters in the written Parquet files, which is not currently part of the Parquet specification, provide any benefit?
  • Would we be able to achieve the same performance gains if the columns we apply Bloom filters to were storing longer or more complex values, e.g., UUIDs?


  1. We use the term cardinality to refer to the number of distinct values in a data set as a whole, i.e., across all Parquet files and their row groups. The data generated for analysis in this post is ordered such that the number of distinct values in a given row group is less than the cardinality of the entire data set.

  2. See also the tracking issue for Apache Rust Parquet Bloom filter support

  3. See also this blog post on Bloom filter support in Apache Spark

  4. See Cache Efficient Bloom filters for Shared Memory Machines

  5. See Split Block Bloom filters

  6. See Rust Parquet implementation’s optimal number of bytes source

  7. See Java Parquet implementation’s source

  8. See Rust Playground Link

  9. See DataFusion CLI