倾斜联接优化 Skew Join optimization

数据歪斜是指表的数据在群集中的分区之间不均匀分布的情况。Data skew is a condition in which a table’s data is unevenly distributed among partitions in the cluster. 数据偏斜会严重降低查询性能,尤其是那些具有联接的查询。Data skew can severely downgrade performance of queries, especially those with joins. 大表之间的联接要求混排数据,而倾斜可能导致群集中的工作极其不平衡。Joins between big tables require shuffling data and the skew can lead to an extreme imbalance of work in the cluster. 如果查询看起来停滞完成非常少的任务(例如,200的最后3个任务),则数据歪斜可能会影响查询。It’s likely that data skew is affecting a query if a query appears to be stuck finishing very few tasks (for example, the last 3 tasks out of 200). 要验证:To verify:

  1. 单击停滞的阶段,并验证其是否正在执行联接。Click the stage that is stuck and verify that it is doing a join.
  2. 查询完成后,查找执行联接的阶段,并检查任务持续时间分布。After the query finishes, find the stage that does a join and check the task duration distribution.
  3. 通过缩短工期对任务进行排序,并检查前几个任务。Sort the tasks by decreasing duration and check the first few tasks. 如果一项任务比其他任务花费的时间要长得多,则会发生扭曲。If one task took much longer to complete than the other tasks, there is skew.

若要寻求解决歪斜,Azure Databricks SQL 上的增量 Lake 在查询中接受_倾斜提示_。To ameliorate skew, Delta Lake on Azure Databricks SQL accepts skew hints in queries. 使用这些提示中的信息,Spark 可以构造更好的查询计划,而不会受到数据歪斜的影响。With the information from these hints, Spark can construct a better query plan, one that does not suffer from data skew.

仅关系名称Only relation name

_倾斜提示_必须至少包含具有倾斜的关系的名称。A skew hint must contain at least the name of the relation with skew. 关系是表、视图或子查询。A relation is a table, view, or a subquery. 此关系的所有联接都使用倾斜联接优化。All joins with this relation then use skew join optimization.

-- table with skew
SELECT /*+ SKEW('orders') */ * FROM orders, customers WHERE c_custId = o_custId

-- subquery with skew
SELECT /*+ SKEW('C1') */ *
  FROM (SELECT * FROM customers WHERE c_custId < 100) C1, orders
  WHERE C1.c_custId = o_custId

关系和列Relation and columns

关系上可能有多个联接,并且只有其中一些联接受到了倾斜。There might be multiple joins on a relation and only some of them will suffer from skew. 倾斜联接优化具有一些开销,因此最好仅在需要时使用。Skew join optimization has some overhead so it is better to use it only when needed. 出于此目的,_倾斜提示_接受列名称。For this purpose, the skew hint accepts column names. 仅具有这些列的联接使用倾斜联接优化。Only joins with these columns use skew join optimization.

-- single column
SELECT /*+ SKEW('orders', 'o_custId') */ *
  FROM orders, customers
  WHERE o_custId = c_custId

-- multiple columns
SELECT /*+ SKEW('orders', ('o_custId', 'o_storeRegionId')) */ *
  FROM orders, customers
  WHERE o_custId = c_custId AND o_storeRegionId = c_regionId

关系、列和倾斜值Relation, columns, and skew values

您还可以在提示中指定倾斜值。You can also specify skew values in the hint. 根据查询和数据的不同,可能会知道倾斜值(例如,因为这些值永远不会变化)或可能很容易找到。这样做可以减少斜切联接优化的开销。Depending on the query and data, the skew values might be known (for example, because they never change) or might be easy to find out. Doing this reduces the overhead of skew join optimization. 否则,Delta Lake 会自动检测它们。Otherwise, Delta Lake detects them automatically.

-- single column, single skew value
SELECT /*+ SKEW('orders', 'o_custId', 0) */ *
  FROM orders, customers
  WHERE o_custId = c_custId

-- single column, multiple skew values
SELECT /*+ SKEW('orders', 'o_custId', (0, 1, 2)) */ *
  FROM orders, customers
  WHERE o_custId = c_custId

-- multiple columns, multiple skew values
SELECT /*+ SKEW('orders', ('o_custId', 'o_storeRegionId'), ((0, 1001), (1, 1002))) */ *
  FROM orders, customers
  WHERE o_custId = c_custId AND o_storeRegionId = c_regionId