最佳作法Best practices

本文說明使用 Delta Lake 時的最佳做法。This article describes best practices when using Delta Lake.

提供資料位置提示Provide data location hints

如果您預期查詢述詞中通常會使用資料行,而且該資料行具有高基數 (也就是大量相異值) ,則使用 Z-ORDER BYIf you expect a column to be commonly used in query predicates and if that column has high cardinality (that is, a large number of distinct values), then use Z-ORDER BY. Delta Lake 會根據資料行值自動設定檔案中的資料,並在查詢時使用版面配置資訊略過不相關的資料。Delta Lake automatically lays out the data in the files based on the column values and use the layout information to skip irrelevant data while querying.

如需詳細資訊,請參閱 Z 順序 (多維度群集) For details, see Z-Ordering (multi-dimensional clustering).

選擇正確的分割資料行Choose the right partition column

您可以依資料行分割 Delta 資料表。You can partition a Delta table by a column. 最常使用的資料分割資料行是 dateThe most commonly used partition column is date. 遵循下列兩個經驗法則來決定要分割的資料行:Follow these two rules of thumb for deciding on what column to partition by:

  • 如果資料行的基數非常高,請勿使用該資料行進行資料分割。If the cardinality of a column will be very high, do not use that column for partitioning. 例如,如果您依資料行分割 userId ,而且可以有1百萬個不同的使用者識別碼,則這是不正確的資料分割策略。For example, if you partition by a column userId and if there can be 1M distinct user IDs, then that is a bad partitioning strategy.
  • 每個分割區中的資料量:如果預期該分割區中的資料至少為 1 GB,您可以依資料行分割。Amount of data in each partition: You can partition by a column if you expect data in that partition to be at least 1 GB.

壓縮檔案 Compact files

如果您持續將資料寫入 Delta 資料表,則會隨著時間累積大量的檔案,尤其是當您以小型批次新增資料時。If you continuously write data to a Delta table, it will over time accumulate a large number of files, especially if you add data in small batches. 這可能會對資料表讀取的效率造成負面影響,而且也會影響檔案系統的效能。This can have an adverse effect on the efficiency of table reads, and it can also affect the performance of your file system. 在理想的情況下,您應該定期將大量的小型檔案重寫成較小數量的檔案。Ideally, a large number of small files should be rewritten into a smaller number of larger files on a regular basis. 這稱為「壓縮」。This is known as compaction.

您可以使用 OPTIMIZE 命令壓縮資料表。You can compact a table using the OPTIMIZE command.

取代資料表的內容 或架構 Replace the content or schema of a table

有時您可能會想要取代 Delta 資料表。Sometimes you may want to replace a Delta table. 例如:For example:

  • 您發現資料表中的資料不正確,而且想要取代內容。You discover the data in the table is incorrect and want to replace the content.
  • 您想要重寫整個資料表,以進行不相容的架構變更 (卸載資料行或變更資料行類型) 。You want to rewrite the whole table to do incompatible schema changes (drop columns or change column types).

雖然您可以刪除 Delta 資料表的整個目錄,並在相同的路徑上建立新的資料表,但 不建議這樣做 ,因為:While you can delete the entire directory of a Delta table and create a new table on the same path, it’s not recommended because:

  • 刪除目錄的效率並不高。Deleting a directory is not efficient. 包含非常大型檔案的目錄可能需要數小時或甚至數天的時間才能刪除。A directory containing very large files can take hours or even days to delete.
  • 您會遺失已刪除檔案中的所有內容;如果您刪除錯誤的資料表,則很難復原。You lose all of content in the deleted files; it’s hard to recover if you delete the wrong table.
  • 刪除目錄不是不可部分完成的。The directory deletion is not atomic. 當您刪除資料表時,讀取資料表的並行查詢可能會失敗或看到部分資料表。While you are deleting the table a concurrent query reading the table can fail or see a partial table.

如果您不需要變更資料表架構,您可以 刪除 差異資料表中的資料,並插入新的資料,或 更新 資料表以修正不正確的值。If you don’t need to change the table schema, you can delete data from a Delta table and insert your new data, or update the table to fix the incorrect values.

如果您想要變更資料表架構,您可以以不可部分完成的方式取代整個資料表。If you want to change the table schema, you can replace the whole table atomically. 例如:For example:

PythonPython

dataframe.write \
  .format("delta") \
  .mode("overwrite") \
  .option("overwriteSchema", "true") \
  .partitionBy(<your-partition-columns>) \
  .saveAsTable("<your-table>") # Managed table
dataframe.write \
  .format("delta") \
  .mode("overwrite") \
  .option("overwriteSchema", "true") \
  .option("path", "<your-table-path>") \
  .partitionBy(<your-partition-columns>) \
  .saveAsTable("<your-table>") # External table

SQLSQL

REPLACE TABLE <your-table> USING DELTA PARTITIONED BY (<your-partition-columns>) AS SELECT ... -- Managed table
REPLACE TABLE <your-table> USING DELTA PARTITIONED BY (<your-partition-columns>) LOCATION "<your-table-path>" AS SELECT ... -- External table

ScalaScala

dataframe.write
  .format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .partitionBy(<your-partition-columns>)
  .saveAsTable("<your-table>") // Managed table
dataframe.write
  .format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .option("path", "<your-table-path>")
  .partitionBy(<your-partition-columns>)
  .saveAsTable("<your-table>") // External table

這種方法有多項優點:There are multiple benefits with this approach:

  • 覆寫資料表的速度更快,因為它不需要以遞迴方式列出目錄或刪除任何檔案。Overwriting a table is much faster because it doesn’t need to list the directory recursively or delete any files.
  • 舊版本的資料表仍然存在。The old version of the table still exists. 如果您刪除錯誤的資料表,則可以使用 時間移動輕鬆地取出舊的資料。If you delete the wrong table you can easily retrieve the old data using Time Travel.
  • 這是不可部分完成的作業。It’s an atomic operation. 當您刪除資料表時,並行查詢仍然可以讀取資料表。Concurrent queries can still read the table while you are deleting the table.
  • 由於 Delta Lake ACID 交易保證,如果覆寫資料表失敗,資料表將會處於先前的狀態。Because of Delta Lake ACID transaction guarantees, if overwriting the table fails, the table will be in its previous state.

此外,如果您想要在覆寫資料表之後,刪除舊檔案以節省儲存成本,您可以 使用 [清理] 將它們 刪除。In addition, if you want to delete old files to save storage cost after overwriting the table, you can use VACUUM to delete them. 它最適合用來刪除檔案,而且通常比刪除整個目錄更快。It’s optimized for file deletion and usually faster than deleting the entire directory.