在 Azure Databricks) 上创建布隆筛选器索引 (增量 LakeCreate Bloom Filter Index (Delta Lake on Azure Databricks)

CREATE BLOOMFILTER INDEX
ON [TABLE] table_name
[FOR COLUMNS(columnName1 [OPTIONS(..)], columnName2, ...)]
[OPTIONS(..)]

为新数据或重写的数据创建布隆筛选器索引;它不 为现有数据创建布隆筛选器。Create a Bloom filter index for new or rewritten data; it does not create Bloom filters for existing data. 如果表名或列之一不存在,则该命令将失败。The command fails if either the table name or one of the columns does not exist. 如果对某一列启用了布隆筛选,则现有的布隆筛选器选项将替换为新的选项。If Bloom filtering is enabled for a column, existing Bloom filter options are replaced by the new options.

尽管无法为已写入的数据生成布隆筛选器索引,但 OPTIMIZE 命令会为重新组织的数据更新布隆筛选器。While it is not possible to build a Bloom filter index for data that is already written, the OPTIMIZE command updates Bloom filters for data that is reorganized. 因此,您可以通过对表运行来回填布隆筛选器 OPTIMIZETherefore, you can backfill a Bloom filter by running OPTIMIZE on a table:

  • 如果之前未对表进行优化,则为。If you have not previously optimized the table.
  • 具有不同的文件大小,需要重新写入数据文件。With a different file size, requiring that the data files be re-written.
  • ZORDER如果已存在 (或其他 ZORDER) ,则需要重新写入数据文件。With a ZORDER (or a different ZORDER, if one is already present), requiring that the data files be re-written.

可以通过定义列级或表级的选项来优化布隆筛选器:You can tune the Bloom filter by defining options at the column level or at the table level:

  • fpp:误报概率。fpp: False positive probability. 每个书面布隆筛选器所需的假正利率。The desired false positive rate per written Bloom filter. 这会影响在布隆筛选器中放置单个项所需的位数,并影响布隆筛选器的大小。This influences the number of bits needed to put a single item in the Bloom filter and influences the size of the Bloom filter. 该值必须大于0且小于或等于1。The value must be larger than 0 and smaller than or equal to 1. 默认值为0.1,这需要每个项5位。The default value is 0.1 which requires 5 bits per item.
  • numItems:文件可包含的非重复项的数目。numItems: Number of distinct items the file can contain. 此设置对筛选质量非常重要,因为它会影响布隆筛选器中使用的总位数 (项数 * 每个项) 的位数。This setting is important for the quality of filtering as it influences the total number of bits used in the Bloom filter (number of items * number of bits per item). 如果此设置不正确,则布隆筛选器将非常稀疏地填充,从而浪费磁盘空间并降低必须下载此文件的查询速度,或者该筛选器太满,且) (更高。If this setting is incorrect, the Bloom filter is either very sparsely populated, wasting disk space and slowing queries that must download this file, or it is too full and is less accurate (higher FPP). 该值必须大于0。The value must be larger than 0. 默认值为1000000项。The default is 1 million items.
  • maxExpectedFpp:所需的 FPP 阈值,布隆筛选器不会写入磁盘。maxExpectedFpp: The expected FPP threshold for which a Bloom filter is not written to disk. 写入布隆筛选器的最大预期误报概率。The maximum expected false positive probability at which a Bloom filter is written. 如果预期 FPP 大于此阈值,则布隆筛选器的选择性太低;使用布隆筛选器所用的时间和资源超出了其有用性。If the expected FPP is larger than this threshold, the Bloom filter’s selectivity is too low; the time and resources it takes to use the Bloom filter outweighs its usefulness. 该值必须介于 0 和 1 之间。The value must be between 0 and 1. 默认值为 1.0 (禁用) 。The default is 1.0 (disabled).

这些选项仅在写入数据时才起作用。These options play a role only when writing the data. 可以在不同的层次结构级别配置这些属性:写入操作、表级和列级。You can configure these properties at various hierarchical levels: write operation, table level, and column level. 列级别优先于表和操作级别,表级别优先于操作级别。The column level takes precedence over the table and operation levels, and the table level takes precedence over the operation level.

请参阅 布隆筛选器索引See Bloom filter indexes.