Best practices

This article describes best practices when using Delta Lake.

Provide data location hints

If you expect a column to be commonly used in query predicates and if that column has high cardinality (that is, a large number of distinct values), then use Z-ORDER BY. Delta Lake automatically lays out the data in the files based on the column values and uses the layout information to skip irrelevant data while querying.

For details, see Z-Ordering (multi-dimensional clustering).
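As a minimal sketch, Z-Ordering is applied through the OPTIMIZE command; the table and column names below (events, eventType) are hypothetical placeholders, not names from this article:

```sql
-- Co-locate related data by a high-cardinality column that is
-- frequently used in query predicates.
-- `events` and `eventType` are hypothetical names for illustration.
OPTIMIZE events
ZORDER BY (eventType)
```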

Choose the right partition column

You can partition a Delta table by a column. The most commonly used partition column is date. Follow these two rules of thumb for deciding on which column to partition by:

  • Cardinality of the column: If the cardinality of a column will be very high, do not use that column for partitioning. For example, if you partition by a column userId and there can be 1M distinct user IDs, that is a bad partitioning strategy.
  • Amount of data in each partition: You can partition by a column if you expect the data in each partition to be at least 1 GB.
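The two rules above can be illustrated with a hedged sketch; the table definition and column names here are hypothetical examples, not part of the original article:

```sql
-- Partition by a low-cardinality column such as `date`,
-- not by a high-cardinality column such as `userId`.
-- All names below are hypothetical, for illustration only.
CREATE TABLE events (
  date DATE,
  userId STRING,
  eventType STRING)
USING DELTA
PARTITIONED BY (date)
```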

Compact files

If you continuously write data to a Delta table, it will over time accumulate a large number of files, especially if you add data in small batches. This can have an adverse effect on the efficiency of table reads, and it can also affect the performance of your file system. Ideally, a large number of small files should be rewritten into a smaller number of larger files on a regular basis. This is known as compaction.

You can compact a table using the OPTIMIZE command.
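For example, a hedged sketch of compaction with OPTIMIZE; the table name and predicate are hypothetical placeholders:

```sql
-- Rewrite small files into larger ones across the whole table.
-- `events` is a hypothetical table name.
OPTIMIZE events

-- Optionally restrict compaction to a subset of partitions:
OPTIMIZE events WHERE date >= '2021-01-01'
```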

Replace the content or schema of a table

Sometimes you may want to replace a Delta table. For example:

  • You discover the data in the table is incorrect and want to replace the content.
  • You want to rewrite the whole table to make incompatible schema changes (such as dropping columns or changing column types).

While you can delete the entire directory of a Delta table and create a new table on the same path, it’s not recommended because:

  • Deleting a directory is not efficient. A directory containing very large files can take hours or even days to delete.
  • You lose all of the content in the deleted files; it’s hard to recover if you delete the wrong table.
  • The directory deletion is not atomic. While you are deleting the table, a concurrent query reading the table can fail or see a partial table.

If you don’t need to change the table schema, you can delete data from a Delta table and insert your new data, or update the table to fix the incorrect values.
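As a rough sketch of such in-place fixes, assuming a hypothetical events table (all names and predicates below are illustrative, not from this article):

```sql
-- Remove incorrect rows without touching the schema.
DELETE FROM events WHERE date < '2017-01-01';

-- Fix incorrect values in place.
UPDATE events SET eventType = 'click' WHERE eventType = 'clck';
```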

If you want to change the table schema, you can replace the whole table atomically. For example:

Python

dataframe.write \
  .format("delta") \
  .mode("overwrite") \
  .option("overwriteSchema", "true") \
  .partitionBy(<your-partition-columns>) \
  .saveAsTable("<your-table>") # Managed table
dataframe.write \
  .format("delta") \
  .mode("overwrite") \
  .option("overwriteSchema", "true") \
  .option("path", "<your-table-path>") \
  .partitionBy(<your-partition-columns>) \
  .saveAsTable("<your-table>") # External table

SQL

REPLACE TABLE <your-table> USING DELTA PARTITIONED BY (<your-partition-columns>) AS SELECT ... -- Managed table
REPLACE TABLE <your-table> USING DELTA PARTITIONED BY (<your-partition-columns>) LOCATION "<your-table-path>" AS SELECT ... -- External table

Scala

dataframe.write
  .format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .partitionBy(<your-partition-columns>)
  .saveAsTable("<your-table>") // Managed table
dataframe.write
  .format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .option("path", "<your-table-path>")
  .partitionBy(<your-partition-columns>)
  .saveAsTable("<your-table>") // External table

There are multiple benefits to this approach:

  • Overwriting a table is much faster because it doesn’t need to list the directory recursively or delete any files.
  • The old version of the table still exists. If you delete the wrong table, you can easily retrieve the old data using Time Travel.
  • It’s an atomic operation. Concurrent queries can still read the table while you are overwriting it.
  • Because of Delta Lake ACID transaction guarantees, if overwriting the table fails, the table will be in its previous state.

In addition, if you want to delete old files to save storage cost after overwriting the table, you can use VACUUM to delete them. It’s optimized for file deletion and usually faster than deleting the entire directory.
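A hedged sketch of cleaning up old files with VACUUM; the table name is a hypothetical placeholder, and note that shortening retention below the default forfeits Time Travel to those versions:

```sql
-- Delete files no longer referenced by the table and older than
-- the retention threshold (default: 7 days).
-- `events` is a hypothetical table name.
VACUUM events

-- Or specify the retention period explicitly:
VACUUM events RETAIN 168 HOURS
```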