Using Parquet’s Bloom Filters

Note: We updated this post on June 6, 2024 to reflect valuable community feedback.

Summary

In this article, we explore when and how to use Bloom filters in Parquet, examine their impact on written Parquet files, and measure their effectiveness when dealing with large quantities of high-cardinality [1] data.

In summary, we found that:

  • Bloom filter parameters that more closely matched the number of distinct values within each Parquet row group yielded optimal pruning efficiency when dealing with large volumes of high-cardinality clustered data.
  • Query times dropped to roughly 1/30th of the baseline when using such Bloom filters, at a cost of 2 KB to 8 KB of storage per column per row group.
  • Choosing Bloom filter parameters that match the cardinality of the data set as a whole incurs a hefty storage penalty, but this was not necessary in our experiments.

A table containing calculated Bloom filter sizes for different input parameters is available in Bloom filter sizes.

Background & motivation

When querying data contained in Apache Parquet files, there are several techniques you can use to make queries faster. One method uses Bloom filters to prune entire row/column chunks. Bloom filters are particularly useful for columns that contain many distinct pseudo-random values, e.g., identifiers or UUIDs. This differs from columns that contain ordered data, have lower cardinality, or are already pruned effectively by the min/max statistics in row group metadata. Support for Bloom filters is available in the Apache Rust [2] and Java [3] implementations of the Parquet format, as well as in the Apache DataFusion query engine.

At InfluxData we frequently work with high-cardinality data and are considering several indexing strategies. Given that Bloom filters are part of Apache Parquet and supported by the libraries we already use, they are an appealing choice. However, there was little information on how effective they are in practice or how to set their tuning knobs. Specifically, we wanted to know:

  • How much do Bloom filters increase the size of Parquet files?
  • How much do Parquet Bloom filters improve query performance for high-cardinality data, e.g., 100k to 1M distinct values?

Use case for Bloom filters

Consider the example Parquet file in Figure 1, which stores CPU usage for containers running on host servers in different cloud regions.

Figure 1: Example Parquet file with three columns ordered by increasing cardinality as well as a time and cpu column; region has a cardinality of 10, host a cardinality of 1,000, and container_id a cardinality of 1,000,000. The cardinality of the data set corresponds to that of the highest cardinality column, i.e., 1,000,000. This Parquet file consists of two row groups, each with metadata containing the min and max values for each column.

Rows in this Parquet file are ordered by region, then host, and then container_id, to match the hierarchy of the entities producing the data. In practice, there would be many rows, i.e., representing the CPU usage at different timestamps, for each unique combination of region, host, and container_id.

If we query the Parquet file in Figure 1 using region and container_id as a predicate, e.g.,

SELECT * FROM 'fig1.parquet' WHERE region = 'us-east-2' AND container_id = 'CCC';

then the query engine will prune Row Group 2, given that us-east-2 is outside the min/max values for the region column contained in Row Group 2’s metadata, and therefore will only decode rows from Row Group 1. However, if we query using only container_id, e.g.,

SELECT * FROM 'fig1.parquet' WHERE container_id = 'CCC';

then the query engine must scan both row groups, because the metadata alone cannot rule out either group based on their respective min/max of container_id values (the value CCC falls in both ranges AAA..DDD and BBB..FFF).

This is where Bloom filters come in. Applying a Bloom filter to the container_id column enables the query engine to prune row groups when that column is used as the predicate.

Parquet’s Bloom filter implementation

Parquet uses Split Block Bloom filters (SBBF). When writing Parquet data, an SBBF is configured on a per-column basis by specifying the following parameters:

  • Number of distinct values (NDV)
  • False positive probability (FPP)

Bloom filters are read from a Parquet file using their length and offset, which are stored in the metadata for each row group; DataFusion uses them to filter out entire row groups when querying a file. It seems reasonable to assume that NDV should closely match the cardinality of the column it is applied to, but the larger the NDV, the more space the Bloom filter occupies in the file.

We are interested in knowing the impact of an underestimated NDV because that would save storage space. Furthermore, we assume that the chosen FPP should correspond to the amount of pruning we will get out of the Bloom filter.
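
To make this concrete, here is a minimal sketch of enabling a per-column Bloom filter when writing Parquet with the Rust parquet crate referenced above. The builder method names are as found in recent versions of the crate, and the container_id column is just the example from Figure 1; treat both as illustrative rather than definitive.

```rust
use parquet::file::properties::WriterProperties;
use parquet::schema::types::ColumnPath;

// Sketch: enable a Split Block Bloom filter only on the high-cardinality
// column; the other columns keep relying on min/max statistics.
fn writer_props_with_bloom_filter() -> WriterProperties {
    let container_id = ColumnPath::from("container_id");
    WriterProperties::builder()
        // False positive probability for this column's Bloom filter.
        .set_column_bloom_filter_fpp(container_id.clone(), 0.01)
        // Expected number of distinct values per row group (not the
        // cardinality of the whole data set; see the experiments below).
        .set_column_bloom_filter_ndv(container_id.clone(), 1_000)
        // Turn the filter on for this column only.
        .set_column_bloom_filter_enabled(container_id, true)
        .build()
}
```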

Bloom filter sizes

Bloom filters do not come free. They are stored in the Parquet file and therefore incur a storage penalty. The desired NDV and FPP determine the size of the filter. The FPP can be calculated as [4]:

$$f = \left(1 - e^{-kn/m}\right)^{k}$$

where f is the FPP, k is the number of hash functions, n is the NDV, and m is the total number of bits in the Bloom filter. The number of bits required to achieve the desired FPP and NDV can be determined by rearranging the above:

$$m = \frac{-kn}{\ln\left(1 - f^{1/k}\right)}$$

SBBFs use eight hash functions to cleanly fit in SIMD lanes [5]. Therefore, k is set to 8. The SBBF spreads those m bits across a set of blocks that are each 256 bits, i.e., 32 bytes, in size. Both the Rust [6] and Java [7] implementations of Parquet Bloom filters determine the size in bytes as:

$$\text{size in bytes} = 2^{\left\lceil \log_2 (m/8) \right\rceil}$$

Here, m is divided by 8 to convert bits to bytes, and the result is rounded up to the next power of two.

Using the formulas above allows us to calculate sizes for various FPP and NDV [8] values:

| NDV | FPP | Size (KB) |
|---|---|---|
| 1,000 | 0.1 | 1 |
| 1,000 | 0.01 | 2 |
| 1,000 | 0.001 | 2 |
| 10,000 | 0.1 | 8 |
| 10,000 | 0.01 | 16 |
| 10,000 | 0.001 | 32 |
| 10,000 | 0.0001 | 32 |
| 100,000 | 0.1 | 128 |
| 100,000 | 0.01 | 128 |
| 100,000 | 0.001 | 256 |
| 100,000 | 0.0001 | 512 |
| 100,000 | 0.00001 | 512 |
| 1,000,000 | 0.1 | 1,024 |
| 1,000,000 | 0.01 | 2,048 |
| 1,000,000 | 0.001 | 2,048 |
| 1,000,000 | 0.0001 | 4,096 |
| 1,000,000 | 0.00001 | 4,096 |
| 1,000,000 | 0.000001 | 8,192 |
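
As a sanity check on the table above, the calculation can be reproduced in a few lines of Rust. This is a sketch that applies the formulas above with k = 8 and rounds the byte count up to the next power of two; the single-block (32-byte) minimum is included, and any implementation-specific maximum is omitted.

```rust
/// Sketch: approximate SBBF size in bytes for a desired NDV and FPP,
/// following the formulas above with k = 8 hash functions.
fn bloom_filter_size_bytes(ndv: u64, fpp: f64) -> u64 {
    let k = 8.0_f64;
    // m = -k * n / ln(1 - f^(1/k)), in bits
    let m_bits = -k * ndv as f64 / (1.0 - fpp.powf(1.0 / k)).ln();
    // Convert bits to bytes and round up to the next power of two,
    // with a single 32-byte block as the minimum.
    ((m_bits / 8.0).ceil() as u64).max(32).next_power_of_two()
}

fn main() {
    // Reproduces rows of the table above, e.g., NDV 1,000 / FPP 0.01 -> 2 KB.
    for (ndv, fpp) in [(1_000_u64, 0.01), (100_000, 0.001), (1_000_000, 0.000001)] {
        println!("NDV {ndv}, FPP {fpp}: {} KB", bloom_filter_size_bytes(ndv, fpp) / 1024);
    }
}
```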


Experiments

Setup

While the theoretical results in the previous section are informative, we wanted to validate that they match reality. To do so, we developed the parquet-bloom-filter-analysis tool.

It is a CLI, written in Rust, that generates Parquet files with a specified number of columns, each having:

  • Cardinality, i.e., the number of unique values
  • Optional Bloom filter, with FPP and NDV parameters

In addition, when generating a set of files, we specify:

  • The number of Parquet files to generate
  • Maximum row group size
  • Duplication factor

The rows generated are ordered by columns of increasing cardinality, as in the example in Figure 1 above. The duplication factor is used to generate duplicate rows for each unique combination of the generated column values, thereby clustering the data by unique combinations of column values. This is akin to there being many CPU usage metrics recorded at different timestamps for each region/host/container combination in the example from Figure 1.
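
The following sketch illustrates this generation scheme; it is not the actual parquet-bloom-filter-analysis code, and the column names and the way high-cardinality values are assigned to lower-cardinality "parent" values are assumptions chosen simply to keep the output sorted.

```rust
/// Illustrative sketch of generating ordered, clustered rows: each unique
/// combination of column values is repeated `duplication` times, and rows
/// come out sorted by columns of increasing cardinality.
fn generate_rows(cardinalities: [usize; 3], duplication: usize) -> Vec<[String; 3]> {
    let [c0, c1, c2] = cardinalities; // c0 < c1 < c2
    let mut rows = Vec::with_capacity(c2 * duplication);
    for i in 0..c2 {
        // Map each high-cardinality value onto one lower-cardinality value
        // so the output stays sorted by (col_0, col_1, col_2).
        let row = [
            format!("first-{}", i * c0 / c2),
            format!("second-{}", i * c1 / c2),
            format!("third-{i:06}"),
        ];
        // The duplication factor clusters the data, like repeated CPU
        // measurements per region/host/container in Figure 1.
        for _ in 0..duplication {
            rows.push(row.clone());
        }
    }
    rows
}
```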

We used parquet-bloom-filter-analysis to conduct the following experiments:

  1. NDV Impact on size: For data with 1M cardinality and 100k cardinality, generate a small number of files with a fixed FPP and varying NDV parameters to determine an acceptable baseline for those parameters at each respective cardinality and verify that the impact on Parquet file size matches the theory.
  2. Query performance impact: Generate a larger volume of Parquet files for both 1M and 100k cardinality to assess the query performance gains achievable with Bloom filters at a more realistic scale.

System Specifications

All experiments were run on a machine with the following specifications:

CPU Apple M3 Max
Total Number of Cores 14
Total Memory 36 GB
Storage type SSD

NDV impact on size: 1M cardinality

We generated a data set containing three columns with the following properties:

Number of files 10
Rows per file 10M
Rows per row group 1M
Duplication factor 100
Column 1 Cardinality 100
Column 2 Cardinality 10k
Column 3 Cardinality 1M

This resulted in a total of 100M rows of data, with 100 rows per unique combination of column values.

Here is a sample of the generated data:

+----------+-------------+--------------+
| col_0    | col_1       | col_2        |
+----------+-------------+--------------+
| first-52 | second-4119 | third-136666 |
| first-52 | second-4119 | third-230251 |
| first-52 | second-4119 | third-287794 |
| first-52 | second-4119 | third-360226 |
| first-52 | second-4119 | third-584614 |
| first-52 | second-4119 | third-672996 |
| first-52 | second-4119 | third-952788 |
| first-52 | second-4174 | third-002478 |
| first-52 | second-4174 | third-236114 |
| first-52 | second-4174 | third-439784 |
| first-52 | second-4174 | third-477628 |
| first-52 | second-4174 | third-540914 |
| ...                                   |
+----------+-------------+--------------+

From there, we created a Bloom filter on the highest-cardinality column, i.e., Column 3, by fixing FPP at 0.01 and varying NDV. We chose an FPP of 0.01 because the generated data contains only 100 row groups: in the best case only 1 of every 100 row groups needs to be decoded, a reduction of two orders of magnitude, and with an FPP of 0.01 we expect roughly 0.01 × 99 ≈ 1 additional row group to be decoded due to false positives.

The storage cost can be measured as the difference between the average Parquet file size with the Bloom filter and the average baseline file size. The baseline file size for this data set was determined by encoding the same data without any Bloom filters.

To see how many row groups are actually pruned, we ran queries with a predicate on the filtered column through DataFusion's CLI [9] tool, like so:

EXPLAIN ANALYZE SELECT * FROM '<data_folder>/' WHERE col_2 = '<target_value>';

Here, <data_folder> is the location of the data set generated with the FPP/NDV combination of interest, and a representative range of <target_value>s was tested.
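
The same check can also be scripted against DataFusion's Rust API rather than the CLI. The sketch below registers the generated directory as a table and prints the annotated physical plan produced by EXPLAIN ANALYZE, from which the pruned row group counts can be read. The <data_folder> path remains a placeholder, and 'third-136666' is simply one of the generated values shown in the sample above.

```rust
use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Register the directory of generated Parquet files as a table.
    ctx.register_parquet("data", "<data_folder>/", ParquetReadOptions::default())
        .await?;
    // EXPLAIN ANALYZE annotates the Parquet scan with its metrics, including
    // how many row groups were pruned.
    let df = ctx
        .sql("EXPLAIN ANALYZE SELECT * FROM data WHERE col_2 = 'third-136666'")
        .await?;
    df.show().await?;
    Ok(())
}
```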

Results

The baseline storage cost, i.e., the average Parquet file size without Bloom filters, was 647,080.5 B. Here are the results for a fixed FPP of 0.01, with varying NDV:

| FPP | NDV | Avg. File Size (B) | Delta (B) | Delta / RG (B) | RG Pruned (/100) |
|---|---|---|---|---|---|
| 0.01 | 10 | 647,610.5 | 530.0 | 53.0 | 0 |
| 0.01 | 100 | 648,590.5 | 1,510.0 | 151.0 | 0 |
| 0.01 | 500 | 657,588.0 | 10,507.5 | 1,050.8 | 0 |
| 0.01 | 1,000 | 667,790.5 | 20,710.0 | 2,071.0 | 5 |
| 0.01 | 2,500 | 688,270.5 | 41,190.0 | 4,119.0 | 45 |
| 0.01 | 5,000 | 729,250.5 | 82,170.0 | 8,217.0 | 91 |
| 0.01 | 7,500 | 811,170.5 | 164,090.0 | 16,409.0 | 99 |
| 0.01 | 10,000 | 811,170.5 | 164,090.0 | 16,409.0 | 99 |
| 0.01 | 100,000 | 1,958,116.5 | 1,311,036.0 | 131,103.6 | 99 |
| 0.01 | 1,000,000 | 21,618,939.5 | 20,971,859.0 | 2,097,185.9 | 99 |

Delta - difference between Avg. File Size and baseline file size; RG - row group.

Discussion/Key Observations

  • DataFusion prunes more than 90% of row groups at a cost of an 8 KB Bloom filter per row group (roughly 12% larger files). All non-matching row groups are pruned at a cost of 16 KB per row group (roughly 20% larger files).
  • The relative cost seems significant here but would be less so in practice. For example, the Parquet files generated by InfluxDB tend to be megabytes in size and store many more columns, including timestamp and 64-bit float columns. We therefore expect an increase of 8-16 KB per row group to result in storage growth of ~1% or less.
  • Choosing an NDV that corresponds closely with the actual cardinality of the column incurs a hefty storage cost (2 MB per row group) but does not increase pruning efficiency and therefore does not appear to be necessary.

NDV impact on size: 100k cardinality

As in the previous experiment, this experiment uses data with the following properties:

Number of files 10
Rows per file 10M
Rows per row group 1M
Duplication factor 1000
Column 1 Cardinality 100
Column 2 Cardinality 10k
Column 3 Cardinality 100k

There are two differences: the duplication factor is 1000 (vs. 100), and the cardinality of Column 3 is 100k (vs. 1M). This keeps the number of rows and Parquet files consistent with the previous experiment while keeping the generated data ordered.

File sizes and pruning efficiency were measured in the same way as in the previous experiment.

Results

The baseline storage cost, i.e., the average Parquet file size when no Bloom filter was added, was 71,965.6 B. Here are the results for a fixed FPP of 0.01, with varying NDV:

| FPP | NDV | Avg. File Size (B) | Delta (B) | Delta / RG (B) | RG Pruned (/100) |
|---|---|---|---|---|---|
| 0.01 | 100 | 73,475.6 | 1,510.0 | 151.0 | 0 |
| 0.01 | 500 | 82,435.6 | 10,470.0 | 1,047.0 | 93 |
| 0.01 | 1,000 | 92,675.6 | 20,710.0 | 2,071.0 | 99 |
| 0.01 | 2,500 | 113,155.6 | 41,190.0 | 4,119.0 | 99 |
| 0.01 | 5,000 | 154,135.6 | 82,170.0 | 8,217.0 | 99 |
| 0.01 | 7,500 | 236,055.6 | 164,090.0 | 16,409.0 | 99 |
| 0.01 | 10,000 | 236,055.6 | 164,090.0 | 16,409.0 | 99 |
| 0.01 | 100,000 | 1,382,997.6 | 1,311,032.0 | 131,103.2 | 99 |
| 0.01 | 1,000,000 | 21,043,824.6 | 20,971,859.0 | 2,097,185.9 | 99 |

Delta - difference between Avg. File Size and baseline file size; RG - row group.

Discussion/Key Observations

  • The impact on file size is the same as in the previous experiment, consistent with the theory that the Bloom filter size depends on the NDV and FPP parameters, not on the cardinality of the data being filtered.
  • DataFusion successfully prunes all non-matching row groups at an NDV of 1,000, adding only ~2 KB of overhead per row group.
  • The pruning efficiency saturates at a lower NDV (1,000) than in the previous experiment (7,500).

Query performance impact: 100k cardinality

To assess how Bloom filters perform at a more realistic scale, we generated Parquet data with the following properties:

Number of files 10K
Rows per file 10M
Rows per row group 1M
Duplication factor 1M
Column 1 Cardinality 100
Column 2 Cardinality 10k
Column 3 Cardinality 100k

This corresponds to a data set with a cardinality of 100k, spread across 10k files. The data was generated twice: once with no Bloom filters, to establish a baseline, and once with a Bloom filter on Column 3 (the 100k-cardinality column) using an FPP of 0.01 and an NDV of 1,000.

The two things we wanted to check in this scenario are:

  1. Does the pruning efficiency remain consistent with the small-scale experiments performed above?
  2. Can we measure an improvement in query performance between the files that have Bloom filters and those that do not?

Question (1) can be answered as before, by analyzing a query that selects a single value of Column 3, while (2) can be answered by running the query and measuring the response time.

Results

The experiment generated 10,000 files, with the following average file sizes:

| Data set | Average File Size (B) |
|---|---|
| Bloom filter on 100k cardinality column | 29,880.87 |
| No Bloom filters | 9,112.87 |

Based on these numbers, the calculated overhead per row group is 2,076.8 B, essentially the same as in the previous experiment; the baseline file size, however, is 9,112.87 B compared to 71,965.6 B.

Queries were performed using Column 3, i.e., the one with 100k cardinality, as the predicate, with values that span the total range of values within its cardinality (third-[00000, 99999]):

| Predicate (col_3 =) | With BF: RG Pruned (BF) | With BF: RG Pruned (Stats) | With BF: Total RG Pruned | With BF: Latency (ms) | No BF: RG Pruned (Stats) | No BF: Latency (ms) |
|---|---|---|---|---|---|---|
| third-99999 | 0 | 99,999 | 99,999 | 1,079 | 99,999 | 1,069 |
| third-90000 | 13,078 | 86,920 | 99,998 | 1,106 | 86,920 | 21,837 |
| third-80000 | 17,909 | 82,089 | 99,998 | 1,160 | 82,089 | 28,724 |
| third-70000 | 19,465 | 80,533 | 99,998 | 1,150 | 80,533 | 31,560 |
| third-60000 | 19,874 | 80,124 | 99,998 | 1,168 | 80,124 | 31,821 |
| third-50000 | 19,970 | 80,028 | 99,998 | 1,145 | 80,028 | 32,181 |
| third-40000 | 19,850 | 80,148 | 99,998 | 1,145 | 80,148 | 31,866 |
| third-30000 | 19,393 | 80,605 | 99,998 | 1,145 | 80,605 | 31,128 |
| third-20000 | 17,870 | 82,126 | 99,996 | 1,179 | 82,126 | 28,792 |
| third-10000 | 12,984 | 87,014 | 99,998 | 1,129 | 87,014 | 21,582 |
| third-00000 | 0 | 99,998 | 99,998 | 1,093 | 99,998 | 1,051 |

RG - row group; BF - Bloom filter; Stats - row group min/max statistics.

Discussion/Key Observations

  • The per-row-group overhead from Bloom filters is the same as with the larger files above, but it is a much larger fraction of these smaller files. Presumably this is because the column data is compressed while the Bloom filters in the metadata are not.
  • The Bloom filters prune the row groups that the row group statistics cannot, resulting in noticeably better query latency (roughly 1/30th of the latency without Bloom filters).
  • Predicate values at the minimum or maximum of the value range are already pruned well by row group statistics alone, so those queries are fast even without Bloom filters.

Query performance impact: 1M cardinality

This experiment used the same setup as the previous one, again generating 10,000 files; in this case, however, the cardinality of the third column was increased to 1M. The duplication factor was reduced accordingly to keep the total number of rows the same while keeping the generated data ordered:

Number of files 10K
Rows per file 10M
Rows per row group 1M
Duplication factor 100K
Column 1 Cardinality 100
Column 2 Cardinality 10k
Column 3 Cardinality 1M

The FPP and NDV parameters were also kept the same, at an FPP of 0.01 and an NDV of 1,000. This was to see whether a lower NDV, and therefore smaller Bloom filters, could be used at this scale and cardinality, despite the fact that at a smaller scale an NDV of 1,000 had little effect at 1M cardinality (see the NDV impact on size: 1M cardinality experiment above).

Results

The experiment generated 10,000 files with the following average file sizes:

| Data set | Average File Size (B) |
|---|---|
| Bloom filter on 1M cardinality column | 30,931.42 |
| No Bloom filters | 10,163.42 |

The overhead of the Bloom filters is therefore 2,076.8 B per row group.

Queries were performed using Column 3, i.e., the one with 1M cardinality, as the predicate, with values that span the total range of values within its cardinality (third-[000000, 999999]):

| Predicate (col_3 =) | With BF: RG Pruned (BF) | With BF: RG Pruned (Stats) | With BF: Total RG Pruned | With BF: Latency (ms) | No BF: RG Pruned (Stats) | No BF: Latency (ms) |
|---|---|---|---|---|---|---|
| third-999999 | 0 | 99,998 | 99,998 | 1,058 | 99,998 | 1,104 |
| third-900000 | 19,020 | 80,979 | 99,999 | 1,149 | 80,979 | 30,511 |
| third-800000 | 20,408 | 79,591 | 99,999 | 1,129 | 79,591 | 34,282 |
| third-700000 | 20,399 | 79,600 | 99,999 | 1,138 | 79,600 | 32,948 |
| third-600000 | 20,382 | 79,617 | 99,999 | 1,139 | 79,617 | 32,534 |
| third-500000 | 20,419 | 79,580 | 99,999 | 1,027 | 79,580 | 32,570 |
| third-400000 | 20,391 | 79,608 | 99,999 | 1,150 | 79,608 | 32,372 |
| third-300000 | 20,391 | 79,607 | 99,998 | 1,175 | 79,607 | 32,321 |
| third-200000 | 20,385 | 79,614 | 99,999 | 1,145 | 79,614 | 32,359 |
| third-100000 | 19,230 | 80,769 | 99,999 | 1,147 | 80,769 | 30,700 |
| third-000000 | 0 | 99,999 | 99,999 | 1,073 | 99,999 | 1,079 |

RG - row group; BF - Bloom filter; Stats - row group min/max statistics.

Discussion/Key Observations

  • Bloom filters are effective at this scale and cardinality, despite using an NDV of only 1,000.

Conclusions

When using Parquet, Bloom filters can provide substantial query performance gains at larger data volumes, albeit with additional storage costs. The good news is that more moderate Bloom filter parameters, which were quite effective with our ordered data, can mitigate the storage cost.

Some of the questions that remain open are:

  • Can even more moderate Bloom filter parameters be used, i.e., NDV < 1,000, to the same effect, thus mitigating the storage cost even further?
  • Would compressing the Bloom filters in the written Parquet files, which is not currently part of the Parquet specification, provide any benefit?
  • Would we be able to achieve the same performance gains if the columns we apply Bloom filters to were storing longer or more complex values, e.g., UUIDs?


  1. We use the term cardinality to refer to the number of distinct values in a data set as a whole, i.e., across all Parquet files and their row groups. The data generated for analysis in this post is ordered such that the number of distinct values in a given row group is less than the cardinality of the entire data set.

  2. See also the tracking issue for Apache Rust Parquet Bloom filter support

  3. See also this blog post on Bloom filter support in Apache Spark

  4. See Cache Efficient Bloom filters for Shared Memory Machines

  5. See Split Block Bloom filters

  6. See Rust Parquet implementation’s optimal number of bytes source

  7. See Java Parquet implementation’s source

  8. See Rust Playground Link

  9. See DataFusion CLI