How to Improve Performance with Bucketing
Bucketing is an optimization technique in Apache Spark SQL. Data is distributed across a specified number of buckets according to a hash of the values in one or more bucketing columns. Bucketing improves performance by shuffling and sorting the data once, when the table is written, rather than before each downstream operation such as a table join. The tradeoff is the up-front cost of that shuffle and sort, but for certain data transformations the technique pays off by avoiding later shuffling and sorting.
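To make the idea concrete, the following is a minimal pure-Python sketch of bucket assignment and a bucket-by-bucket join. It is a conceptual illustration only, not Spark's implementation: the function names (bucket_id, bucketize, bucketed_join) are hypothetical, and CRC32 is used as a stand-in for the Murmur3-based hash that Spark applies to bucketing columns.

```python
import zlib
from collections import defaultdict


def bucket_id(key, num_buckets):
    # Stand-in hash (assumption): Spark uses its own Murmur3-based hash, not CRC32.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, num_buckets):
    """Group (key, value) rows by bucket and sort each bucket by key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_id(row[0], num_buckets)].append(row)
    for bucket_rows in buckets.values():
        bucket_rows.sort(key=lambda r: r[0])
    return buckets


def bucketed_join(left, right, num_buckets=4):
    """Inner-join two lists of (key, value) rows one bucket at a time."""
    left_buckets = bucketize(left, num_buckets)
    right_buckets = bucketize(right, num_buckets)
    joined = []
    for b in range(num_buckets):
        # Equal keys always hash to the same bucket, so bucket b on the left
        # only ever needs to be compared with bucket b on the right.
        lookup = defaultdict(list)
        for key, value in right_buckets.get(b, []):
            lookup[key].append(value)
        for key, left_value in left_buckets.get(b, []):
            for right_value in lookup.get(key, []):
                joined.append((key, left_value, right_value))
    return joined
```

Because both inputs are split by the same hash and the same bucket count, no row ever has to be matched against a different bucket. That is the property that allows a join of two bucketed tables to skip the shuffle.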
This technique is useful for dimension tables, which are frequently used tables containing primary keys. It is also useful when there are frequent join operations involving large and small tables.
The example notebook below shows the differences in physical plans when performing joins of bucketed and unbucketed tables.
Bucketing example notebook
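Independently of the notebook, the comparison can be sketched in a few lines of PySpark. This is a hedged sketch, not the notebook's code: the function name compare_join_plans, the table names demo_orders_bucketed and demo_customers_bucketed, the column names, and the bucket count are all assumptions chosen for illustration. The function expects an existing SparkSession with a writable catalog.

```python
def compare_join_plans(spark, num_buckets=8):
    """Print physical plans for an unbucketed join and a bucketed join.

    `spark` is an existing SparkSession. All table and column names
    below are hypothetical and chosen only for this demonstration.
    """
    # Disable broadcast joins so the small demo tables use a sort-merge join.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    orders = spark.range(1_000_000).selectExpr(
        "id AS order_id", "id % 1000 AS customer_id"
    )
    customers = spark.range(1000).selectExpr(
        "id AS customer_id", "concat('name_', CAST(id AS STRING)) AS name"
    )

    # Unbucketed join: the plan is expected to show an Exchange (shuffle)
    # on both sides of the join.
    orders.join(customers, "customer_id").explain()

    # Write both sides bucketed and sorted on the join key,
    # using the same number of buckets for each table.
    (orders.write.bucketBy(num_buckets, "customer_id")
        .sortBy("customer_id")
        .mode("overwrite")
        .saveAsTable("demo_orders_bucketed"))
    (customers.write.bucketBy(num_buckets, "customer_id")
        .sortBy("customer_id")
        .mode("overwrite")
        .saveAsTable("demo_customers_bucketed"))

    # Bucketed join: the Exchange nodes should no longer appear.
    bucketed = spark.table("demo_orders_bucketed").join(
        spark.table("demo_customers_bucketed"), "customer_id"
    )
    bucketed.explain()
    return bucketed
```

In a notebook with an active session, calling compare_join_plans(spark) prints both plans. Sort nodes may still be present in the bucketed plan depending on the Spark version and on how many files each bucket contains, so the absence of Exchange is the main thing to look for.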