How to index tables from Azure Table storage with Azure Cognitive Search

This article shows how to use Azure Cognitive Search to index data stored in Azure Table storage.

Set up Azure Table storage indexing

You can set up an Azure Table storage indexer by using any of several resources. Here we demonstrate the flow by using the REST API.

Step 1: Create a datasource

A datasource specifies which data to index, the credentials needed to access the data, and the policies that enable Azure Cognitive Search to efficiently identify changes in the data.

For table indexing, the datasource must have the following properties:

  • name is the unique name of the datasource within your search service.
  • type must be azuretable.
  • credentials contains the storage account connection string. See the Ways to specify credentials section for details.
  • container sets the table name and an optional query.
    • Specify the table name by using the name parameter.
    • Optionally, specify a query by using the query parameter.

Important

Whenever possible, use a filter on PartitionKey for better performance. Any other query does a full table scan, resulting in poor performance for large tables. See the Performance considerations section.

To create a datasource:

POST https://[service name].search.windows.net/datasources?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key]

{
    "name" : "table-datasource",
    "type" : "azuretable",
    "credentials" : { "connectionString" : "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>;" },
    "container" : { "name" : "my-table", "query" : "PartitionKey eq '123'" }
}   

For more information on the Create Datasource API, see Create Datasource.
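As an illustration of how the properties above fit together, the following Python sketch assembles the URL and JSON body for the Create Datasource request. The function name build_datasource_request and the service name my-service are hypothetical; the sketch only builds the request and leaves sending it (with the api-key header) to the HTTP client of your choice.

```python
import json

API_VERSION = "2019-05-06"  # API version used in this article's requests


def build_datasource_request(service_name, datasource_name, table_name,
                             connection_string, query=None):
    """Return (url, body) for a Create Datasource call on a table datasource."""
    url = (f"https://{service_name}.search.windows.net/datasources"
           f"?api-version={API_VERSION}")
    container = {"name": table_name}
    if query:  # the query is optional
        container["query"] = query
    body = {
        "name": datasource_name,
        "type": "azuretable",  # required value for table indexing
        "credentials": {"connectionString": connection_string},
        "container": container,
    }
    return url, body


url, body = build_datasource_request(
    "my-service",  # hypothetical search service name
    "table-datasource",
    "my-table",
    "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>;",
    query="PartitionKey eq '123'",
)
print(url)
print(json.dumps(body, indent=2))
```

Leaving out the query argument produces a container with only the table name, which indexes the whole table.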

Ways to specify credentials

You can provide the credentials for the table in one of these ways:

  • Full access storage account connection string: DefaultEndpointsProtocol=https;AccountName=<your storage account>;AccountKey=<your account key>
    You can get the connection string from the Azure portal by going to the Storage account blade > Settings > Keys (for classic storage accounts) or Settings > Access keys (for Azure Resource Manager storage accounts).
  • Storage account shared access signature connection string: TableEndpoint=https://<your account>.table.core.windows.net/;SharedAccessSignature=?sv=2016-05-31&sig=<the signature>&spr=https&se=<the validity end time>&srt=co&ss=t&sp=rl
    The shared access signature should have the list and read permissions on containers (tables in this case) and objects (table rows).
  • Table shared access signature: ContainerSharedAccessUri=https://<your storage account>.table.core.windows.net/<table name>?tn=<table name>&sv=2016-05-31&sig=<the signature>&se=<the validity end time>&sp=r
    The shared access signature should have query (read) permissions on the table.

For more information on storage shared access signatures, see Using shared access signatures.
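The three credential formats above differ only in their fixed prefix and in where the signature goes. The following Python sketch composes each one from its parts. The function names are hypothetical, and the sas_token arguments are assumed to be the signature query strings (without a leading question mark) that you generated elsewhere; the sketch does not create or validate signatures.

```python
def full_access_connection_string(account, account_key):
    """Full access storage account connection string."""
    return (f"DefaultEndpointsProtocol=https;AccountName={account};"
            f"AccountKey={account_key}")


def account_sas_connection_string(account, sas_token):
    """Storage account shared access signature connection string."""
    return (f"TableEndpoint=https://{account}.table.core.windows.net/;"
            f"SharedAccessSignature=?{sas_token}")


def table_sas_connection_string(account, table, sas_token):
    """Table shared access signature (sas_token starts with tn=<table name>)."""
    return (f"ContainerSharedAccessUri=https://{account}.table.core.windows.net/"
            f"{table}?{sas_token}")


# Placeholder values only; substitute your own account, table, and signature.
print(full_access_connection_string("<your storage account>", "<your account key>"))
print(account_sas_connection_string("<your account>", "sv=2016-05-31&sig=<the signature>"))
print(table_sas_connection_string("<your storage account>", "<table name>",
                                  "tn=<table name>&sv=2016-05-31&sig=<the signature>"))
```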

Note

If you use shared access signature credentials, you need to update the datasource credentials periodically with renewed signatures to prevent their expiration. If shared access signature credentials expire, the indexer fails with an error message similar to "Credentials provided in the connection string are invalid or have expired."

Step 2: Create an index

The index specifies the fields in a document, the attributes, and other constructs that shape the search experience.

To create an index:

POST https://[service name].search.windows.net/indexes?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key]

{
      "name" : "my-target-index",
      "fields": [
        { "name": "key", "type": "Edm.String", "key": true, "searchable": false },
        { "name": "SomeColumnInMyTable", "type": "Edm.String", "searchable": true }
      ]
}

For more information on creating indexes, see Create Index.
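When a table has several columns to index, the fields array can be generated rather than typed by hand. The following Python sketch builds an index definition shaped like the request body above. The function name build_index_definition is hypothetical, and the sketch assumes every listed column is a searchable Edm.String, which may not match your data.

```python
import json


def build_index_definition(index_name, searchable_columns):
    """Return an index definition with one key field plus one field per column."""
    # Exactly one key field of type Edm.String, as in the request above.
    fields = [{"name": "key", "type": "Edm.String", "key": True, "searchable": False}]
    for column in searchable_columns:
        # Assumption: each column is indexed as a searchable string.
        fields.append({"name": column, "type": "Edm.String", "searchable": True})
    return {"name": index_name, "fields": fields}


index = build_index_definition("my-target-index", ["SomeColumnInMyTable"])
print(json.dumps(index, indent=2))
```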

Step 3: Create an indexer

An indexer connects a datasource with a target search index and provides a schedule to automate the data refresh.

After the index and datasource are created, you're ready to create the indexer:

POST https://[service name].search.windows.net/indexers?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key]

{
  "name" : "table-indexer",
  "dataSourceName" : "table-datasource",
  "targetIndexName" : "my-target-index",
  "schedule" : { "interval" : "PT2H" }
}

This indexer runs every two hours (the schedule interval is set to "PT2H"). To run an indexer every 30 minutes, set the interval to "PT30M". The shortest supported interval is five minutes. The schedule is optional; if omitted, an indexer runs only once when it's created. However, you can run an indexer on demand at any time.
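The interval values are ISO 8601 durations. As a minimal sketch, the following Python helper turns a timedelta into such a string and rejects anything below the five-minute minimum stated above. The helper name schedule_interval is hypothetical, and the sketch handles whole minutes only.

```python
from datetime import timedelta

MIN_INTERVAL = timedelta(minutes=5)  # shortest supported interval


def schedule_interval(delta):
    """Format a timedelta as an ISO 8601 duration such as PT2H or PT30M."""
    if delta < MIN_INTERVAL:
        raise ValueError("the shortest supported interval is five minutes")
    total_minutes, seconds = divmod(int(delta.total_seconds()), 60)
    if seconds:
        raise ValueError("this sketch only handles whole minutes")
    hours, minutes = divmod(total_minutes, 60)
    result = "PT"
    if hours:
        result += f"{hours}H"
    if minutes:
        result += f"{minutes}M"
    return result


indexer = {
    "name": "table-indexer",
    "dataSourceName": "table-datasource",
    "targetIndexName": "my-target-index",
    "schedule": {"interval": schedule_interval(timedelta(hours=2))},
}
print(indexer["schedule"])
```

Omitting the schedule key from the dictionary corresponds to an indexer that runs only once when it's created.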

For more information on the Create Indexer API, see Create Indexer.

For more information about defining indexer schedules, see How to schedule indexers for Azure Cognitive Search.

Deal with different field names

Sometimes, the field names in your existing index are different from the property names in your table. You can use field mappings to map the property names from the table to the field names in your search index. To learn more about field mappings, see Azure Cognitive Search indexer field mappings bridge the differences between datasources and search indexes.
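As a rough sketch, field mappings are declared in the indexer definition as a fieldMappings array of sourceFieldName and targetFieldName pairs. The following Python helper adds such an array to an indexer body. The helper name add_field_mappings and the target field name some_index_field are hypothetical.

```python
import json


def add_field_mappings(indexer, mappings):
    """Return a copy of the indexer body with fieldMappings for {source: target}."""
    updated = dict(indexer)
    updated["fieldMappings"] = [
        {"sourceFieldName": source, "targetFieldName": target}
        for source, target in mappings.items()
    ]
    return updated


base_indexer = {
    "name": "table-indexer",
    "dataSourceName": "table-datasource",
    "targetIndexName": "my-target-index",
}
# Map the table property to a differently named index field (hypothetical name).
mapped_indexer = add_field_mappings(
    base_indexer, {"SomeColumnInMyTable": "some_index_field"}
)
print(json.dumps(mapped_indexer, indent=2))
```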

Handle document keys

In Azure Cognitive Search, the document key uniquely identifies a document. Every search index must have exactly one key field of type Edm.String. The key field is required for each document that is being added to the index. (In fact, it's the only required field.)

Because table rows have a compound key, Azure Cognitive Search generates a synthetic field called Key that is a concatenation of partition key and row key values. For example, if a row’s PartitionKey is PK1 and RowKey is RK1, then the Key field's value is PK1RK1.

Note

The Key value may contain characters that are invalid in document keys, such as dashes. You can deal with invalid characters by using the base64Encode field mapping function. If you do this, remember to also use URL-safe Base64 encoding when passing document keys in API calls such as Lookup.
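To make the key handling concrete, the following Python sketch concatenates a partition key and row key the way the synthetic Key field is described above, shows a field mapping that applies base64Encode, and encodes a key on the client with URL-safe Base64 before a Lookup call. The function names are hypothetical. Whether the service's encoded keys match Python's output exactly (for example, in how padding is written) is an assumption you should check against a key returned by your own index.

```python
import base64


def synthetic_key(partition_key, row_key):
    """Concatenate partition key and row key, as described for the Key field."""
    return partition_key + row_key


def url_safe_base64(value):
    """Encode a document key with URL-safe Base64 (uses - and _ instead of + and /)."""
    return base64.urlsafe_b64encode(value.encode("utf-8")).decode("ascii")


# Field mapping that sends the synthetic Key through base64Encode into the
# index's key field (named "key" in this article's index definition).
key_mapping = {
    "sourceFieldName": "Key",
    "targetFieldName": "key",
    "mappingFunction": {"name": "base64Encode"},
}

raw_key = synthetic_key("PK1", "RK1")
print(raw_key)
print(url_safe_base64(raw_key))
```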

Incremental indexing and deletion detection

When you set up a table indexer to run on a schedule, it reindexes only new or updated rows, as determined by a row’s Timestamp value. You don’t have to specify a change detection policy. Incremental indexing is enabled for you automatically.

To indicate that certain documents must be removed from the index, you can use a soft delete strategy. Instead of deleting a row, add a property to indicate that it's deleted, and set up a soft deletion detection policy on the datasource. For example, the following policy considers that a row is deleted if the row has a property IsDeleted with the value "true":

PUT https://[service name].search.windows.net/datasources/my-table-datasource?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key]

{
    "name" : "my-table-datasource",
    "type" : "azuretable",
    "credentials" : { "connectionString" : "<your storage connection string>" },
    "container" : { "name" : "table name", "query" : "<query>" },
    "dataDeletionDetectionPolicy" : { "@odata.type" : "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy", "softDeleteColumnName" : "IsDeleted", "softDeleteMarkerValue" : "true" }
}   
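The policy is an ordinary JSON object attached to the datasource definition. As a small sketch, the following Python helpers build that object and add it to an existing datasource body. The helper names soft_delete_policy and with_soft_delete are hypothetical; the property names and the @odata.type value are taken from the request above.

```python
import json

SOFT_DELETE_POLICY_TYPE = (
    "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy"
)


def soft_delete_policy(column_name, marker_value):
    """Build the soft deletion detection policy object."""
    return {
        "@odata.type": SOFT_DELETE_POLICY_TYPE,
        "softDeleteColumnName": column_name,
        "softDeleteMarkerValue": marker_value,
    }


def with_soft_delete(datasource, column_name="IsDeleted", marker_value="true"):
    """Return a copy of the datasource body with the policy attached."""
    updated = dict(datasource)
    updated["dataDeletionDetectionPolicy"] = soft_delete_policy(
        column_name, marker_value
    )
    return updated


datasource = {
    "name": "my-table-datasource",
    "type": "azuretable",
    "credentials": {"connectionString": "<your storage connection string>"},
    "container": {"name": "<table name>"},
}
datasource_with_policy = with_soft_delete(datasource)
print(json.dumps(datasource_with_policy, indent=2))
```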

Performance considerations

By default, Azure Cognitive Search uses the following query filter: Timestamp >= HighWaterMarkValue. Because Azure tables don’t have a secondary index on the Timestamp field, this type of query requires a full table scan and is therefore slow for large tables.
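The following Python sketch is a conceptual model of that filter, not the service's implementation. It applies Timestamp >= HighWaterMarkValue to an in-memory list of rows and counts how many rows had to be examined, which shows why the cost grows with table size when no index on Timestamp exists. The row data and the function name rows_changed_since are made up for illustration.

```python
from datetime import datetime, timezone


def rows_changed_since(rows, high_water_mark):
    """Return (changed_rows, rows_examined) for a Timestamp >= mark filter."""
    examined = 0
    changed = []
    for row in rows:  # without an index on Timestamp, every row is visited
        examined += 1
        if row["Timestamp"] >= high_water_mark:
            changed.append(row)
    return changed, examined


# Hypothetical sample rows.
rows = [
    {"PartitionKey": "A", "RowKey": "1",
     "Timestamp": datetime(2020, 1, 1, tzinfo=timezone.utc)},
    {"PartitionKey": "A", "RowKey": "2",
     "Timestamp": datetime(2020, 1, 3, tzinfo=timezone.utc)},
    {"PartitionKey": "B", "RowKey": "1",
     "Timestamp": datetime(2020, 1, 5, tzinfo=timezone.utc)},
]
mark = datetime(2020, 1, 3, tzinfo=timezone.utc)
changed, examined = rows_changed_since(rows, mark)
print(len(changed), "changed rows found after examining", examined, "rows")
```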

Here are two possible approaches for improving table indexing performance. Both of these approaches rely on using table partitions:

  • If your data can naturally be partitioned into several partition ranges, create a datasource and a corresponding indexer for each partition range. Each indexer then has to process only a specific partition range, resulting in better query performance. If the data that needs to be indexed has a small number of fixed partitions, even better: each indexer only does a partition scan. For example, to create a datasource for processing a partition range with keys from 000 to 100, use a query like this:

    "container" : { "name" : "my-table", "query" : "PartitionKey ge '000' and PartitionKey lt '100' " }
    
  • If your data is partitioned by time (for example, you create a new partition every day or week), consider the following approach:

    • Use a query of the form: (PartitionKey ge <TimeStamp>) and (other filters).
    • Monitor indexer progress by using the Get Indexer Status API, and periodically update the <TimeStamp> condition of the query based on the latest successful high-water-mark value.
    • With this approach, if you need to trigger a complete reindexing, you need to reset the datasource query in addition to resetting the indexer.
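Both approaches above come down to generating the right query string for each datasource. The following Python sketch builds one query per partition range from a list of boundaries, and builds the time-based query from a partition key value. The function names are hypothetical, the format of a time-based partition key depends on how your table is designed, and the sketch does not escape quotes inside key values.

```python
def partition_range_query(lower, upper):
    """Query for partition keys in the half-open range [lower, upper)."""
    return f"PartitionKey ge '{lower}' and PartitionKey lt '{upper}'"


def range_queries(boundaries):
    """One query per consecutive pair of boundaries (one datasource each)."""
    return [
        partition_range_query(lower, upper)
        for lower, upper in zip(boundaries, boundaries[1:])
    ]


def time_partition_query(since, other_filters=None):
    """Query of the form (PartitionKey ge <TimeStamp>) and (other filters)."""
    query = f"(PartitionKey ge '{since}')"
    if other_filters:
        query += f" and ({other_filters})"
    return query


for query in range_queries(["000", "100", "200"]):
    print({"name": "my-table", "query": query})

# Assumption: partitions are named by date; update the value as the
# high-water mark advances.
print(time_partition_query("2020-01-01", "Status eq 'active'"))
```

To trigger a complete reindexing with the time-based form, the query has to be rebuilt from the earliest partition key in addition to resetting the indexer.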

Help us make Azure Cognitive Search better

If you have feature requests or ideas for improvements, submit them on our UserVoice site.