建立 Bloom 篩選索引 (Azure Databricks 上的 Delta Lake)Create Bloom Filter Index (Delta Lake on Azure Databricks)

CREATE BLOOMFILTER INDEX
ON [TABLE] table_name
[FOR COLUMNS(columnName1 [OPTIONS(..)], columnName2, ...)]
[OPTIONS(..)]

建立新的或重寫資料的 Bloom 篩選索引;它不 建立現有資料的 Bloom 篩選。Create a Bloom filter index for new or rewritten data; it does not create Bloom filters for existing data. 如果資料表名稱或其中一個資料行不存在,則命令會失敗。The command fails if either the table name or one of the columns does not exist. 如果已針對資料行啟用 Bloom 篩選,則新的選項會取代現有的 Bloom 篩選選項。If Bloom filtering is enabled for a column, existing Bloom filter options are replaced by the new options.

雖然無法為已經寫入的資料建立 Bloom 篩選索引,但 OPTIMIZE 命令會針對重新組織的資料更新 Bloom 篩選。While it is not possible to build a Bloom filter index for data that is already written, the OPTIMIZE command updates Bloom filters for data that is reorganized. 因此,您可以藉由 OPTIMIZE 在資料表上執行來回填 Bloom 篩選:Therefore, you can backfill a Bloom filter by running OPTIMIZE on a table:

  • 如果您先前尚未優化資料表。If you have not previously optimized the table.
  • 使用不同的檔案大小,需要重新寫入資料檔案。With a different file size, requiring that the data files be re-written.
  • 如果有 ZORDER (或不同(如果有的話) ZORDER) ,則需要重新寫入資料檔案。With a ZORDER (or a different ZORDER, if one is already present), requiring that the data files be re-written.

您可以藉由在資料行層級或資料表層級定義選項,來調整 Bloom 篩選:You can tune the Bloom filter by defining options at the column level or at the table level:

  • fpp:誤報的機率。fpp: False positive probability. 每個所寫入 Bloom 篩選準則的預期錯誤正面率。The desired false positive rate per written Bloom filter. 這會影響在 Bloom 篩選中放置單一專案所需的位數,並影響 Bloom 篩選的大小。This influences the number of bits needed to put a single item in the Bloom filter and influences the size of the Bloom filter. 值必須大於0且小於或等於1。The value must be larger than 0 and smaller than or equal to 1. 預設值為0.1,每個專案需要5個位。The default value is 0.1 which requires 5 bits per item.
  • numItems:檔案可包含的相異專案數目。numItems: Number of distinct items the file can contain. 這項設定對於篩選品質很重要,因為它會影響 Bloom 篩選中使用的總位數 * (專案數 * 每個專案的位數) 。This setting is important for the quality of filtering as it influences the total number of bits used in the Bloom filter (number of items * number of bits per item). 如果這項設定不正確,則 Bloom 篩選器可能會非常稀疏地擴展、浪費磁碟空間,以及減緩必須下載此檔案的查詢,或者它太滿,且 (更高的 FPP) 不正確。If this setting is incorrect, the Bloom filter is either very sparsely populated, wasting disk space and slowing queries that must download this file, or it is too full and is less accurate (higher FPP). 值必須大於0。The value must be larger than 0. 預設值為1000000個專案。The default is 1 million items.
  • maxExpectedFpp:預期的 FPP 臨界值,Bloom 篩選不會寫入磁片。maxExpectedFpp: The expected FPP threshold for which a Bloom filter is not written to disk. 寫入 Bloom 篩選時預期的最大預期錯誤正面機率。The maximum expected false positive probability at which a Bloom filter is written. 如果預期的 FPP 大於此閾值,則 Bloom 篩選準則的選擇性太低;使用 Bloom 濾波器所花費的時間和資源超過其實用性。If the expected FPP is larger than this threshold, the Bloom filter’s selectivity is too low; the time and resources it takes to use the Bloom filter outweighs its usefulness. 值長度必須介於 0 到 1 之間。The value must be between 0 and 1. 預設值為 1.0 (停用) 。The default is 1.0 (disabled).

只有在寫入資料時,這些選項才會扮演角色。These options play a role only when writing the data. 您可以在各種階層層級設定這些屬性:寫入作業、資料表層級和資料行層級。You can configure these properties at various hierarchical levels: write operation, table level, and column level. 資料行層級的優先順序高於資料表和作業層級,而資料表層級則優先于作業層級。The column level takes precedence over the table and operation levels, and the table level takes precedence over the operation level.

請參閱 Bloom 篩選索引See Bloom filter indexes.