Parallel Hash Join

SQL Server uses one of two different strategies to parallelize a hash join.  The more common strategy uses hash partitioning.  In some cases, we use broadcast partitioning; this strategy is often called a “broadcast hash join.”

Hash Partitioning

The more common strategy for parallelizing a hash join involves distributing the build rows (i.e., the rows from the first input) and the probe rows (i.e., the rows from the second input) among the individual hash join threads using hash partitioning.  If a build and probe row share the same key value (i.e., they will join), they are guaranteed to hash to the same hash join thread.  After the data has been hash partitioned among the threads, the hash join instances all run completely independently on their respective data sets.  The absence of any inter-thread dependencies ensures that this strategy scales extremely well as we increase the degree of parallelism (i.e., the number of threads).
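The routing guarantee can be sketched in a few lines of Python.  This is purely illustrative (it is not SQL Server code, and the thread count and row sets are made up): rows are dealt to buckets by hashing the join key, so matching build and probe rows always land on the same "thread," and each thread can then join its own partitions independently.

```python
# Illustrative sketch of hash partitioning a join; not SQL Server internals.

NUM_THREADS = 4  # hypothetical degree of parallelism (DOP)

def partition(rows, key_index):
    """Distribute rows among NUM_THREADS buckets by hashing the join key."""
    buckets = [[] for _ in range(NUM_THREADS)]
    for row in rows:
        buckets[hash(row[key_index]) % NUM_THREADS].append(row)
    return buckets

build_rows = [(k, "build data") for k in range(10)]   # key in column 0 (like T1.b)
probe_rows = [("probe data", k) for k in range(10)]   # key in column 1 (like T2.a)

build_parts = partition(build_rows, 0)
probe_parts = partition(probe_rows, 1)

# Each thread builds a hash table from its build partition and probes it with
# its probe partition -- no coordination with any other thread is needed.
results = []
for t in range(NUM_THREADS):
    hash_table = {row[0]: row for row in build_parts[t]}
    for probe in probe_parts[t]:
        if probe[1] in hash_table:
            results.append((hash_table[probe[1]], probe))

assert len(results) == 10  # every key pair joins despite the partitioning
```

Because the same hash function and modulus are applied to both inputs, no matching pair can ever be split across threads.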

As with all of my parallelism examples, I am using a large table to induce the optimizer to choose a parallel plan.  If you try these examples, it may take a few minutes to create these tables.

create table T1 (a int, b int, x char(200))

set nocount on

declare @i int

set @i = 0

while @i < 1000000
    begin
        insert T1 values(@i, @i, @i)
        set @i = @i + 1
    end

select * into T2 from T1

select * into T3 from T1

select * from T1 join T2 on T1.b = T2.a

  |--Parallelism(Gather Streams)
       |--Hash Match(Inner Join, HASH:([T1].[b])=([T2].[a]), RESIDUAL:([T2].[a]=[T1].[b]))
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([T1].[b]))
            |    |--Table Scan(OBJECT:([T1]))
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([T2].[a]))
                 |--Table Scan(OBJECT:([T2]))

Note that unlike the parallel nested loops join, we have exchanges (i.e., parallelism operators) between both table scans (both the build and the probe inputs) and the hash join.  These exchanges hash partition the data among the hash join threads.

Broadcast Partitioning

Consider what will happen if we try to parallelize a hash join using hash partitioning, but we have only a small number of rows on the build side of the hash join.  If we have fewer rows than hash join threads, some threads might receive no rows at all.  In this case, those threads would have no work to do during the probe phase of the join and would remain idle.  Even if we have more rows than threads, due to the presence of duplicate key values and/or skew in the hash function, some threads might get many more rows than others.

To eliminate the risk of skew, when the optimizer estimates that the number of build rows is relatively small, it may choose to broadcast these rows to all of the hash join threads.  Since all build rows are broadcast to all hash join threads, in a broadcast hash join, it does not matter where we send the probe rows.  Each probe row can be sent to any thread and, if it can join with any build rows, it will.
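We can extend the earlier sketch to show why probe placement stops mattering.  Again, this is an illustration with made-up data, not engine code: the entire (small) build side is copied to every thread, so whichever thread a probe row happens to reach, the matching build row is already there.

```python
# Illustrative sketch of a broadcast hash join; not SQL Server internals.

NUM_THREADS = 4  # hypothetical degree of parallelism (DOP)

build_rows = [(0, "small build side")]              # tiny build input, as with T1.a = 0
probe_rows = [("probe data", k) for k in range(8)]  # key in column 1 (like T2.a)

# Broadcast: every thread gets a complete copy of the build-side hash table.
per_thread_table = [{row[0]: row for row in build_rows} for _ in range(NUM_THREADS)]

# Probe rows can go to any thread; here we simply deal them out round robin,
# standing in for the parallel scan handing out pages to whichever thread asks.
results = []
for i, probe in enumerate(probe_rows):
    t = i % NUM_THREADS
    if probe[1] in per_thread_table[t]:
        results.append((per_thread_table[t][probe[1]], probe))

assert len(results) == 1  # the one matching key joins regardless of which thread probed it
```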

Here is an example:

select * from T1 join T2 on T1.b = T2.a where T1.a = 0

  |--Parallelism(Gather Streams)
       |--Hash Match(Inner Join, HASH:([T1].[b])=([T2].[a]), RESIDUAL:([T2].[a]=[T1].[b]))
            |--Parallelism(Distribute Streams, Broadcast Partitioning)
            |    |--Table Scan(OBJECT:([T1]), WHERE:([T1].[a]=(0)))
            |--Table Scan(OBJECT:([T2]))

Note that the exchange above the scan of T1 is now a broadcast exchange while we have completely eliminated the exchange above the scan of T2.  We do not need an exchange above T2 because the parallel scan automatically distributes the pages and rows of T2 among the hash join threads.  This is similar to how the parallel scan distributed rows among nested loops join threads for the parallel nested loops join.  Similar to the parallel nested loops join, if we have a serial zone on the probe input of a broadcast hash join (e.g., due to a top operator), we may need a round robin exchange to redistribute the rows.

So, if broadcast hash joins are so great (they do reduce the risk of skew problems), why don’t we use them in all cases?  The answer is that broadcast hash joins use more memory than their hash partitioned counterparts.  Since we send every build row to every hash join thread, if we double the number of threads, we double the amount of memory that we need.  With a hash partitioned parallel hash join, we need the same amount of memory regardless of the degree of parallelism.
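The memory tradeoff is simple arithmetic.  With a hypothetical 100 MB build side (the number is made up for illustration), the two strategies compare as follows:

```python
# Illustrative memory arithmetic; the 100 MB build side is a made-up figure.

build_memory_mb = 100

def broadcast_memory(dop):
    return build_memory_mb * dop   # every thread holds a full copy of the build rows

def hash_partitioned_memory(dop):
    return build_memory_mb         # the build rows are split up; the total is unchanged

for dop in (2, 4, 8):
    print(dop, broadcast_memory(dop), hash_partitioned_memory(dop))
```

Doubling the degree of parallelism doubles the broadcast join's memory requirement while leaving the hash partitioned join's requirement flat.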

Bitmap Filtering

Finally, suppose we have a fairly selective filter on the build input to a hash join:

select * from T1 join T2 on T1.b = T2.a where T1.a < 100000

  |--Parallelism(Gather Streams)
       |--Hash Match(Inner Join, HASH:([T1].[b])=([T2].[a]), RESIDUAL:([T2].[a]=[T1].[b]))
            |--Bitmap(HASH:([T1].[b]), DEFINE:([Bitmap1008]))
            |    |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([T1].[b]))
            |         |--Table Scan(OBJECT:([T1]), WHERE:([T1].[a]<(100000)))
            |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([T2].[a]), WHERE:(PROBE([Bitmap1008])=TRUE))
                 |--Table Scan(OBJECT:([T2]))

What’s the bitmap operator?  The predicate “T1.a < 100000” eliminates 90% of the build rows from T1.  It also indirectly eliminates 90% of the rows from T2 because they no longer join with rows from T1.  The bitmap operator provides an efficient way to apply the T1 filter directly to T2 without passing the rows all the way through the exchange to the join.  As its name suggests, it builds a bitmap.

Just like a hash join, we hash each row of T1 on the join key T1.b and set the corresponding bit in the bitmap.  Once the scan of T1 and the hash join build is complete, we transfer the bitmap to the exchange above the scan of T2 where we use it as a filter.  This time we hash each row of T2 on the join key T2.a and test the corresponding bit in the bitmap.  If the bit is set, the row may join and we pass it along to the hash join.  If the bit is not set, the row cannot join and we discard it.  For more information on bitmaps see this post from the SQL Server Query Processing Team blog.
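The mechanism is essentially a Bloom-filter-style bit array.  Here is a small sketch (illustrative only; the bitmap size and key ranges are made up, and the real operator's sizing depends on the optimizer's cardinality estimates):

```python
# Illustrative sketch of bitmap filtering; not the actual Bitmap operator.

BITMAP_BITS = 1024  # hypothetical size; real sizes depend on row count estimates

def bit_for(key):
    return hash(key) % BITMAP_BITS

# Build phase: set a bit for each join key value that survives the build-side
# filter (standing in for the T1 rows that pass "T1.a < 100000").
build_keys = range(100)
bitmap = bytearray(BITMAP_BITS // 8)
for k in build_keys:
    b = bit_for(k)
    bitmap[b // 8] |= 1 << (b % 8)

def may_join(key):
    """False means the probe row definitely cannot join, so we discard it at
    the exchange.  True means it might join; hash collisions can produce
    false positives, but never false negatives."""
    b = bit_for(key)
    return bool(bitmap[b // 8] & (1 << (b % 8)))

# Probe side: every matching key passes; most non-matching keys are discarded
# before they ever travel through the exchange to the join.
assert all(may_join(k) for k in build_keys)
```

Because the filter can only produce false positives, the hash join's own residual predicate still performs the definitive match; the bitmap merely sheds most of the doomed probe rows early.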