傾斜聯結優化 Skew Join optimization

資料扭曲是一種情況,其中資料表的資料會在叢集中的分割區之間分散。Data skew is a condition in which a table’s data is unevenly distributed among partitions in the cluster. 資料扭曲可能會嚴重地降級查詢的效能,特別是具有聯結的查詢效能。Data skew can severely downgrade performance of queries, especially those with joins. 大型資料表之間的聯結需要跳過資料,而扭曲可能會導致叢集中的工作極端不平衡。Joins between big tables require shuffling data and the skew can lead to an extreme imbalance of work in the cluster. 如果查詢似乎卡在完成極少的工作,可能會導致資料扭曲影響查詢 (例如,200) 的最後3項工作。It’s likely that data skew is affecting a query if a query appears to be stuck finishing very few tasks (for example, the last 3 tasks out of 200). 若要確認:To verify:

  1. 按一下停滯的階段,並確認它正在進行聯結。Click the stage that is stuck and verify that it is doing a join.
  2. 在查詢完成之後,尋找執行聯結的階段,並檢查工作持續時間分佈。After the query finishes, find the stage that does a join and check the task duration distribution.
  3. 藉由減少持續時間來排序工作,並檢查前幾項工作。Sort the tasks by decreasing duration and check the first few tasks. 如果某個工作花費的時間比其他工作花費更長的時間,則會有扭曲。If one task took much longer to complete than the other tasks, there is skew.

若要改善扭曲,Azure Databricks SQL 上的 Delta Lake 接受查詢中的 扭曲提示To ameliorate skew, Delta Lake on Azure Databricks SQL accepts skew hints in queries. 透過這些提示的資訊,Spark 可以建立更好的查詢計劃,而不會受到資料扭曲的影響。With the information from these hints, Spark can construct a better query plan, one that does not suffer from data skew.

僅限關聯名稱Only relation name

扭曲提示 必須至少包含具有扭曲之關聯性的名稱。A skew hint must contain at least the name of the relation with skew. 關聯性是資料表、視圖或子查詢。A relation is a table, view, or a subquery. 所有與此關聯的聯結都會使用扭曲聯結優化。All joins with this relation then use skew join optimization.

-- table with skew
SELECT /*+ SKEW('orders') */ * FROM orders, customers WHERE c_custId = o_custId

-- subquery with skew
SELECT /*+ SKEW('C1') */ *
  FROM (SELECT * FROM customers WHERE c_custId < 100) C1, orders
  WHERE C1.c_custId = o_custId

關聯和資料行Relation and columns

關聯上可能會有多個聯結,而且只有部分會受到扭曲的影響。There might be multiple joins on a relation and only some of them will suffer from skew. 扭曲聯結優化有一些額外負荷,因此最好只在需要時才使用它。Skew join optimization has some overhead so it is better to use it only when needed. 基於這個目的, 扭曲提示 會接受資料行名稱。For this purpose, the skew hint accepts column names. 只有具有這些資料行的聯結會使用扭曲聯結優化。Only joins with these columns use skew join optimization.

-- single column
SELECT /*+ SKEW('orders', 'o_custId') */ *
  FROM orders, customers
  WHERE o_custId = c_custId

-- multiple columns
SELECT /*+ SKEW('orders', ('o_custId', 'o_storeRegionId')) */ *
  FROM orders, customers
  WHERE o_custId = c_custId AND o_storeRegionId = c_regionId

關聯性、資料行和扭曲值Relation, columns, and skew values

您也可以在提示中指定扭曲值。You can also specify skew values in the hint. 根據查詢和資料的不同,誤差值可能是已知的 (例如,因為它們永遠不會變更) 或可能很容易找到。這樣做可減少扭曲聯結優化的額外負荷。Depending on the query and data, the skew values might be known (for example, because they never change) or might be easy to find out. Doing this reduces the overhead of skew join optimization. 否則,Delta Lake 會自動偵測它們。Otherwise, Delta Lake detects them automatically.

-- single column, single skew value
SELECT /*+ SKEW('orders', 'o_custId', 0) */ *
  FROM orders, customers
  WHERE o_custId = c_custId

-- single column, multiple skew values
SELECT /*+ SKEW('orders', 'o_custId', (0, 1, 2)) */ *
  FROM orders, customers
  WHERE o_custId = c_custId

-- multiple columns, multiple skew values
SELECT /*+ SKEW('orders', ('o_custId', 'o_storeRegionId'), ((0, 1001), (1, 1002))) */ *
  FROM orders, customers
  WHERE o_custId = c_custId AND o_storeRegionId = c_regionId