Transactional writes to cloud storage with DBIO

The Databricks DBIO package provides transactional writes to cloud storage for Apache Spark jobs. This solves a number of performance and correctness issues that occur when Spark is used in a cloud-native setting (for example, writing directly to storage services).

With DBIO transactional commit, metadata files starting with _started_<id> and _committed_<id> accompany data files created by Spark jobs. Generally you shouldn’t alter these files directly. Rather, you should use the VACUUM command to clean them up.
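The sketch below is a minimal way to see these markers for yourself: it lists a job's output directory and filters for the DBIO metadata files. The output path is a placeholder, and the snippet assumes it runs in a Databricks notebook where dbutils.fs is available.

Scala

// List a job's output directory and show only the DBIO metadata markers
// (_started_<id> and _committed_<id>). The path is a placeholder; replace it
// with your own job output location.
val outputPath = "/path/to/output/directory"

dbutils.fs.ls(outputPath)
  .map(_.name)
  .filter(name => name.startsWith("_started_") || name.startsWith("_committed_"))
  .foreach(println)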

Clean up uncommitted files

To clean up uncommitted files left over from Spark jobs, use the VACUUM command to remove them. Normally VACUUM happens automatically after Spark jobs complete, but you can also run it manually if a job is aborted.

For example, VACUUM ... RETAIN 1 HOUR removes uncommitted files older than one hour.

Important

  • Avoid vacuuming with a horizon of less than one hour. It can cause data inconsistency.

Also see Vacuum.

SQL

-- recursively vacuum an output path
VACUUM '/path/to/output/directory' [RETAIN <N> HOURS]

-- vacuum all partitions of a catalog table
VACUUM tableName [RETAIN <N> HOURS]

Scala

// recursively vacuum an output path
spark.sql("VACUUM '/path/to/output/directory' [RETAIN <N> HOURS]")

// vacuum all partitions of a catalog table
spark.sql("VACUUM tableName [RETAIN <N> HOURS]")