Spark Bucketing Demystified

When working with massive data, shuffling is one of the costliest operations, and engineers go to great lengths to minimize it. Two data-skipping pillars help with this: partitioning and clustering.

Clustering is preferred for searches on high-cardinality fields, and lakehouse table formats such as Delta Lake, Iceberg, and Hudi provide options such as Z-ordering, liquid clustering, and ingestion-time clustering. In some situations, however, the data has to be kept outside the lakehouse, for example in plain Parquet tables (probably for downstream streaming).

Bucketing is a clustering technique that hashes the values of one or more columns into a fixed number of buckets at write time, so that later joins on those columns can avoid a shuffle.

Spark bucketing applications

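Mostly used for joins: tables are pre-shuffled at write time so that future joins can skip the shuffle.
The number of buckets must be greater than 0 and no more than 100000 (the default of spark.sql.sources.bucketing.maxBuckets).
For the shuffle to be skipped on both sides of a join, both tables should be bucketed on the join key with the same number of buckets.
Bucketing is not supported on Delta tables.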

How it works

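Spark computes a Murmur3 hash of the bucket column value and derives a bucket id from it. Rows with the same bucket id are written to the same bucket file (each write task produces its own file per bucket, so one bucket can span several files). During a join, Spark can match bucket i of one table directly with bucket i of the other, avoiding a full data shuffle.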
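The bucket-to-bucket matching can be illustrated without Spark. The sketch below is plain Python: zlib.crc32 stands in for Spark's Murmur3 hash, and the helper names (bucket_of, bucketize, bucketed_join) and sample rows are invented for illustration. It only shows why equal keys always meet in the same bucket index on both sides, so each pair of buckets can be joined independently.

```python
import zlib
from collections import defaultdict

def bucket_of(key, num_buckets):
    # Stand-in hash (CRC32), NOT Spark's Murmur3: only the idea carries over.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets

def bucketize(rows, num_buckets):
    # Group (key, value) rows by bucket id, as a bucketed write would.
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets

def bucketed_join(left, right, num_buckets):
    # Join bucket i of the left side only with bucket i of the right side.
    left_b = bucketize(left, num_buckets)
    right_b = bucketize(right, num_buckets)
    out = []
    for b in range(num_buckets):
        lookup = defaultdict(list)
        for key, value in right_b.get(b, []):
            lookup[key].append(value)
        for key, value in left_b.get(b, []):
            for other in lookup.get(key, []):
                out.append((key, value, other))
    return out

orders = [(1, "book"), (2, "pen"), (3, "lamp"), (1, "desk")]
customers = [(1, "Asha"), (2, "Ravi"), (4, "Mei")]
joined = bucketed_join(orders, customers, num_buckets=4)
```

No row ever has to be compared with a row from a different bucket index, which is the property that lets Spark drop the Exchange step when both tables use the same bucketing.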

Spark knows whether a table is bucketed from the metadata kept in the metastore, which is why bucketed data has to be written with saveAsTable.

The formula Spark uses to derive the bucket id:

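bucket_id = pmod(murmur3_hash(column_value), num_buckets)

Here pmod is the positive modulo, so a negative hash still maps to an id between 0 and num_buckets - 1. The variant (hash_function(column_value) & Integer.MAX_VALUE) % num_buckets is the Hive-style formula and can give a different id when the hash is negative.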
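A small worked example of the final modulo step, in plain Python. It assumes a signed 32-bit hash value is already available; the sample values below are made up and are not real Murmur3 output. The helper java_rem mimics the JVM's sign-keeping remainder so the correction done by a positive modulo is visible. On a real table the same id can, to my understanding, be inspected with the Spark SQL expression pmod(hash(col), num_buckets).

```python
import math

def java_rem(a, n):
    # JVM-style remainder: the result keeps the sign of the dividend.
    return int(math.fmod(a, n))

def pmod(a, n):
    # Positive modulo: the result is always in [0, n) for n > 0.
    r = java_rem(a, n)
    return r + n if r < 0 else r

num_buckets = 8
# Made-up signed 32-bit hash values, including both extremes.
sample_hashes = [42, -42, 2147483647, -2147483648]
bucket_ids = [pmod(h, num_buckets) for h in sample_hashes]
print(bucket_ids)  # [2, 6, 7, 0]
```

The hash -42 would give -2 with a plain JVM remainder, which is not a valid bucket id; the positive modulo shifts it to 6.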

Syntax
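A minimal PySpark sketch of the write syntax. The function name write_bucketed and its parameters are hypothetical; the DataFrameWriter calls (format, bucketBy, sortBy, mode, saveAsTable) are the standard ones. It assumes df is an existing DataFrame and that a metastore is available to register the table.

```python
def write_bucketed(df, table_name, num_buckets, key):
    """Write `df` as a bucketed Parquet table registered in the metastore."""
    (
        df.write
        .format("parquet")
        .bucketBy(num_buckets, key)   # hash `key` into `num_buckets` buckets
        .sortBy(key)                  # optional: sort rows inside each bucket
        .mode("overwrite")
        .saveAsTable(table_name)      # bucketBy needs saveAsTable, not save()
    )
```

A call might look like write_bucketed(orders_df, "orders_bucketed", 8, "customer_id"), where the DataFrame, table name, and column are placeholders. The SQL counterpart is CREATE TABLE ... USING parquet CLUSTERED BY (customer_id) INTO 8 BUCKETS.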

Experiment

Code: spark_bucketing_code

Broadcast joins and AQE were switched off to isolate the effect of bucketing; otherwise Spark would optimize on its own and pick a broadcast join or an AQE shuffle read.
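One way to apply those two switches, as a sketch. The names EXPERIMENT_CONF and apply_conf are hypothetical, and spark is assumed to be an existing SparkSession; the two configuration keys are standard Spark SQL settings.

```python
EXPERIMENT_CONF = {
    # -1 disables automatic broadcast joins regardless of table size.
    "spark.sql.autoBroadcastJoinThreshold": "-1",
    # Turn off Adaptive Query Execution so the plan is not rewritten at runtime.
    "spark.sql.adaptive.enabled": "false",
}

def apply_conf(spark, settings=EXPERIMENT_CONF):
    # Set each option on the running session.
    for key, value in settings.items():
        spark.conf.set(key, value)
```

These settings are meant for the experiment only; in regular workloads both features are usually worth keeping on.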

The DataFrames are created and then written as bucketed tables.

The join is then performed on DataFrames read back from those tables.

Comparing the physical plans with and without bucketing makes it clear that the Exchange (shuffle) step can be skipped when both tables are bucketed correctly.
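The plan comparison can also be done programmatically. The sketch below uses hypothetical helper names (plan_text, count_shuffles) and assumes that DataFrame.explain() prints through Python's stdout, as PySpark does, and that hash-partitioned shuffles appear in the plan text as "Exchange hashpartitioning"; the exact plan wording can vary between Spark versions.

```python
import contextlib
import io

def plan_text(df):
    # DataFrame.explain() prints the physical plan, so capture stdout.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain()
    return buf.getvalue()

def count_shuffles(plan):
    # Hash-partitioned shuffles show up as "Exchange hashpartitioning(...)".
    return plan.count("Exchange hashpartitioning")
```

With broadcast joins and AQE disabled, count_shuffles(plan_text(joined_df)) would be expected to return 0 for a join of two identically bucketed tables and a positive number for the same join on unbucketed tables, where joined_df is a placeholder for the joined DataFrame.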

Conclusion

Bucketing in Spark minimizes costly shuffles by pre-hashing data into fixed buckets, enabling faster joins. It’s ideal for static Parquet datasets where performance and shuffle avoidance matter most.

KaniniPro
