Field mappings and transformations using Azure Cognitive Search indexers

When using Azure Cognitive Search indexers, you sometimes find that the input data doesn't quite match the schema of your target index. In those cases, you can use field mappings to reshape your data during the indexing process.

Some situations where field mappings are useful:

  • Your data source has a field named _id, but Azure Cognitive Search doesn't allow field names that start with an underscore. A field mapping lets you effectively rename a field.
  • You want to populate several fields in the index from the same data source data. For example, you might want to apply different analyzers to those fields.
  • You want to populate an index field with data from more than one data source, and the data sources each use different field names.
  • You need to Base64 encode or decode your data. Field mappings support several mapping functions, including functions for Base64 encoding and decoding.

Note

The field mapping feature of Azure Cognitive Search indexers provides a simple way to map data fields to index fields, with a few options for data conversion. More complex data might require pre-processing to reshape it into a form that's easy to index.

Microsoft Azure Data Factory is a powerful cloud-based solution for importing and transforming data. You can also write code to transform source data before indexing. For code examples, see Model relational data and Model multilevel facets.

Set up field mappings

A field mapping consists of three parts:

  1. A sourceFieldName, which represents a field in your data source. This property is required.
  2. An optional targetFieldName, which represents a field in your search index. If omitted, the same name as in the data source is used.
  3. An optional mappingFunction, which can transform your data using one of several predefined functions. The full list of functions appears later in this article, under Field mapping functions.

Field mappings are added to the fieldMappings array of the indexer definition.

Map fields using the REST API

You can add field mappings when creating a new indexer using the Create Indexer API request. You can manage the field mappings of an existing indexer using the Update Indexer API request.

For example, here's how to map a source field to a target field with a different name:


PUT https://[service name].search.windows.net/indexers/myindexer?api-version=[api-version]
Content-Type: application/json
api-key: [admin key]
{
    "dataSourceName" : "mydatasource",
    "targetIndexName" : "myindex",
    "fieldMappings" : [ { "sourceFieldName" : "_id", "targetFieldName" : "id" } ]
}
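
As an illustration of the request above, the following Python sketch builds the same PUT request with the standard library without sending it. The helper name build_indexer_request and the values "myservice", "YOUR-API-VERSION", and "YOUR-ADMIN-KEY" are placeholders invented for this example; substitute your own service name, API version, and admin key.

```python
import json
import urllib.request


def build_indexer_request(service_name, indexer_name, api_version, admin_key, definition):
    """Build (but do not send) a PUT request that creates or updates an indexer."""
    url = (
        f"https://{service_name}.search.windows.net/indexers/{indexer_name}"
        f"?api-version={api_version}"
    )
    body = json.dumps(definition).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": admin_key}
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


# Indexer definition that renames the source field "_id" to the index field "id".
definition = {
    "dataSourceName": "mydatasource",
    "targetIndexName": "myindex",
    "fieldMappings": [{"sourceFieldName": "_id", "targetFieldName": "id"}],
}

# All four string arguments below are placeholders, not real values.
request = build_indexer_request(
    "myservice", "myindexer", "YOUR-API-VERSION", "YOUR-ADMIN-KEY", definition
)
print(request.get_method(), request.full_url)
```

Passing the resulting object to urllib.request.urlopen would send it; the sketch stops short of that so it can be run without a live service.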

A source field can be referenced in multiple field mappings. The following example shows how to "fork" a field, copying the same source field to two different index fields:


"fieldMappings" : [
    { "sourceFieldName" : "text", "targetFieldName" : "textStandardEnglishAnalyzer" },
    { "sourceFieldName" : "text", "targetFieldName" : "textSoundexAnalyzer" }
]

Note

Azure Cognitive Search uses case-insensitive comparison to resolve the field and function names in field mappings. This is convenient (you don't have to get all the casing right), but it means that your data source or index cannot have fields that differ only by case.
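
To make the renaming, forking, and case-insensitive rules concrete, here is a small local model in Python. It is only a sketch of the semantics described in this section, not the service's implementation: the function apply_field_mappings is hypothetical, and it returns only the fields that are explicitly mapped.

```python
def apply_field_mappings(source_doc, field_mappings):
    """Local model: copy source fields to target fields as the mappings describe."""
    # Resolve source field names case-insensitively, as the note above describes.
    by_lower = {name.lower(): value for name, value in source_doc.items()}
    result = {}
    for mapping in field_mappings:
        source_name = mapping["sourceFieldName"]  # required
        # targetFieldName is optional and defaults to the source field name.
        target_name = mapping.get("targetFieldName", source_name)
        key = source_name.lower()
        if key not in by_lower:
            raise KeyError(f"source field not found: {source_name}")
        result[target_name] = by_lower[key]
    return result


source_doc = {"_id": "42", "Text": "hello world"}
field_mappings = [
    {"sourceFieldName": "_id", "targetFieldName": "id"},  # rename
    {"sourceFieldName": "text", "targetFieldName": "textStandardEnglishAnalyzer"},  # fork
    {"sourceFieldName": "text", "targetFieldName": "textSoundexAnalyzer"},  # fork
]
mapped = apply_field_mappings(source_doc, field_mappings)
print(mapped)
```

Note that the mapping refers to "text" while the source document spells the field "Text"; the lookup still succeeds because names are compared without regard to case.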

Map fields using the .NET SDK

You define field mappings in the .NET SDK using the FieldMapping class, which has the properties SourceFieldName and TargetFieldName, and an optional MappingFunction reference.

You can specify field mappings when constructing the indexer, or later by directly setting the Indexer.FieldMappings property.

The following C# example sets the field mappings when constructing an indexer.

  List<FieldMapping> map = new List<FieldMapping> {
    // removes a leading underscore from a field name
    new FieldMapping("_custId", "custId"),
    // Base64-encodes a field for use as the index key
    new FieldMapping("docPath", "docId", FieldMappingFunction.Base64Encode() )
  };

  Indexer sqlIndexer = new Indexer(
    name: "azure-sql-indexer",
    dataSourceName: sqlDataSource.Name,
    targetIndexName: index.Name,
    fieldMappings: map,
    schedule: new IndexingSchedule(TimeSpan.FromDays(1)));

  await searchService.Indexers.CreateOrUpdateAsync(sqlIndexer);

Field mapping functions

A field mapping function transforms the contents of a field before it's stored in the index. The following mapping functions are currently supported:

base64Encode function

Performs URL-safe Base64 encoding of the input string. Assumes that the input is UTF-8 encoded.

Example - document key lookup

Only URL-safe characters can appear in an Azure Cognitive Search document key (because customers must be able to address the document using the Lookup API). If the source field for your key contains URL-unsafe characters, you can use the base64Encode function to convert it at indexing time.

When you retrieve the encoded key at search time, you can then use the base64Decode function to get the original key value, and use that to retrieve the source document.


"fieldMappings" : [
  {
    "sourceFieldName" : "SourceKey",
    "targetFieldName" : "IndexKey",
    "mappingFunction" : {
      "name" : "base64Encode",
      "parameters" : { "useHttpServerUtilityUrlTokenEncode" : false }
    }
  }]

If you don't include a parameters property for your mapping function, it defaults to the value {"useHttpServerUtilityUrlTokenEncode" : true}.

Azure Cognitive Search supports two different Base64 encodings. You should use the same parameters when encoding and decoding the same field. For more information, see base64 encoding options to decide which parameters to use.

base64Decode function

Performs Base64 decoding of the input string. The input is assumed to be a URL-safe Base64-encoded string.

Example - decode blob metadata or URLs

Your source data might contain Base64-encoded strings, such as blob metadata strings or web URLs, that you want to make searchable as plain text. You can use the base64Decode function to turn the encoded data back into regular strings when populating your search index.


"fieldMappings" : [
  {
    "sourceFieldName" : "Base64EncodedMetadata",
    "targetFieldName" : "SearchableMetadata",
    "mappingFunction" : { 
      "name" : "base64Decode", 
      "parameters" : { "useHttpServerUtilityUrlTokenDecode" : false }
    }
  }]

If you don't include a parameters property, it defaults to the value {"useHttpServerUtilityUrlTokenDecode" : true}.

Azure Cognitive Search supports two different Base64 encodings. You should use the same parameters when encoding and decoding the same field. For more details, see base64 encoding options to decide which parameters to use.

base64 encoding options

Azure Cognitive Search supports URL-safe base64 encoding and normal base64 encoding. A string that is base64 encoded during indexing should be decoded later with the same encoding options, or else the result won't match the original.

If the useHttpServerUtilityUrlTokenEncode or useHttpServerUtilityUrlTokenDecode parameters for encoding and decoding respectively are set to true, then base64Encode behaves like HttpServerUtility.UrlTokenEncode and base64Decode behaves like HttpServerUtility.UrlTokenDecode.
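
For readers without the .NET Framework at hand, the following Python sketch approximates HttpServerUtility.UrlTokenEncode and HttpServerUtility.UrlTokenDecode. It assumes the .NET Framework behavior of swapping + and / for - and _ and replacing the trailing = padding with a single digit that counts the padding characters. The function names url_token_encode and url_token_decode are invented for this example, and the sketch is not taken from the service itself.

```python
import base64


def url_token_encode(data: bytes) -> str:
    """Approximate HttpServerUtility.UrlTokenEncode (assumed behavior, see lead-in)."""
    if not data:
        return ""
    encoded = base64.b64encode(data).decode("ascii")
    stripped = encoded.rstrip("=")
    pad_count = len(encoded) - len(stripped)  # 0, 1, or 2
    # Use URL-safe characters and append the padding count as a digit.
    return stripped.replace("+", "-").replace("/", "_") + str(pad_count)


def url_token_decode(token: str) -> bytes:
    """Approximate HttpServerUtility.UrlTokenDecode (assumed behavior, see lead-in)."""
    if not token:
        return b""
    pad_count = int(token[-1])
    body = token[:-1].replace("-", "+").replace("_", "/")
    return base64.b64decode(body + "=" * pad_count)


token = url_token_encode("00>00?00".encode("utf-8"))
print(token)
print(url_token_decode(token).decode("utf-8"))
```

Because the trailing digit is part of the token, a value encoded this way has to be decoded with the matching routine rather than with a plain URL-safe Base64 decoder.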

Warning

If base64Encode is used to produce key values, useHttpServerUtilityUrlTokenEncode must be set to true. Only URL-safe base64 encoding can be used for key values. See Naming rules (Azure Cognitive Search) for the full set of restrictions on characters in key values.

The .NET libraries in Azure Cognitive Search assume the full .NET Framework, which provides built-in encoding. The useHttpServerUtilityUrlTokenEncode and useHttpServerUtilityUrlTokenDecode options use this built-in functionality. If you are using .NET Core or another framework, we recommend setting those options to false and calling your framework's encoding and decoding functions directly.

The following table compares different base64 encodings of the string 00>00?00. To determine the required additional processing (if any) for your base64 functions, apply your library encode function on the string 00>00?00 and compare the output with the expected output MDA-MDA_MDA.

Encoding | Base64 encode output | Additional processing after library encoding | Additional processing before library decoding
Base64 with padding | MDA+MDA/MDA= | Use URL-safe characters and remove padding | Use standard base64 characters and add padding
Base64 without padding | MDA+MDA/MDA | Use URL-safe characters | Use standard base64 characters
URL-safe base64 with padding | MDA-MDA_MDA= | Remove padding | Add padding
URL-safe base64 without padding | MDA-MDA_MDA | None | None
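
The comparison in the table can be reproduced with Python's standard base64 module, which falls into the "with padding" rows. The sketch below shows the extra processing needed to reach the expected output MDA-MDA_MDA and to reverse it; the two helper names are invented for this example.

```python
import base64


def encode_url_safe_no_padding(text: str) -> str:
    """URL-safe Base64 of the UTF-8 bytes, with the '=' padding removed."""
    raw = text.encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def decode_url_safe_no_padding(encoded: str) -> str:
    """Restore the padding the library expects, then decode."""
    padding = "=" * (-len(encoded) % 4)
    return base64.urlsafe_b64decode(encoded + padding).decode("utf-8")


sample = "00>00?00"
standard = base64.b64encode(sample.encode("utf-8")).decode("ascii")
url_safe = base64.urlsafe_b64encode(sample.encode("utf-8")).decode("ascii")
unpadded = encode_url_safe_no_padding(sample)

print(standard)   # MDA+MDA/MDA=
print(url_safe)   # MDA-MDA_MDA=
print(unpadded)   # MDA-MDA_MDA
print(decode_url_safe_no_padding(unpadded))  # 00>00?00
```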

extractTokenAtPosition function

Splits a string field using the specified delimiter, and picks the token at the specified position in the resulting split.

This function uses the following parameters:

  • delimiter: a string to use as the separator when splitting the input string.
  • position: an integer zero-based position of the token to pick after the input string is split.

For example, if the input is Jane Doe, the delimiter is " " (space), and the position is 0, the result is Jane; if the position is 1, the result is Doe. If the position refers to a token that doesn't exist, an error is returned.

Example - extract a name

Your data source contains a PersonName field, and you want to index it as two separate FirstName and LastName fields. You can use this function to split the input using the space character as the delimiter.


"fieldMappings" : [
  {
    "sourceFieldName" : "PersonName",
    "targetFieldName" : "FirstName",
    "mappingFunction" : { "name" : "extractTokenAtPosition", "parameters" : { "delimiter" : " ", "position" : 0 } }
  },
  {
    "sourceFieldName" : "PersonName",
    "targetFieldName" : "LastName",
    "mappingFunction" : { "name" : "extractTokenAtPosition", "parameters" : { "delimiter" : " ", "position" : 1 } }
  }]
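
The splitting rule can be modeled locally in a few lines of Python. This is a sketch of the behavior described above, not the service's code; extract_token_at_position is a hypothetical name, and details the text doesn't specify (such as how consecutive delimiters are treated) simply follow Python's str.split.

```python
def extract_token_at_position(value: str, delimiter: str, position: int) -> str:
    """Split value on delimiter and return the token at the zero-based position."""
    tokens = value.split(delimiter)
    if position < 0 or position >= len(tokens):
        # Mirrors the documented rule: a missing token is an error.
        raise IndexError(f"no token at position {position} in {value!r}")
    return tokens[position]


first_name = extract_token_at_position("Jane Doe", " ", 0)
last_name = extract_token_at_position("Jane Doe", " ", 1)
print(first_name, last_name)
```

A single-word name such as "Cher" has no token at position 1, so a mapping to LastName would fail for that document; source data with irregular names may need pre-processing.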

jsonArrayToStringCollection function

Transforms a string formatted as a JSON array of strings into a string array that can be used to populate a Collection(Edm.String) field in the index.

For example, if the input string is ["red", "white", "blue"], then the target field of type Collection(Edm.String) will be populated with the three values red, white, and blue. For input values that cannot be parsed as JSON string arrays, an error is returned.

Example - populate collection from relational data

Azure SQL Database doesn't have a built-in data type that naturally maps to Collection(Edm.String) fields in Azure Cognitive Search. To populate string collection fields, you can pre-process your source data as a JSON string array and then use the jsonArrayToStringCollection mapping function.


"fieldMappings" : [
  {
    "sourceFieldName" : "tags", 
    "mappingFunction" : { "name" : "jsonArrayToStringCollection" }
  }]

For a detailed example that transforms relational data into index collection fields, see Model relational data.
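
As a rough illustration of both halves of this workflow, the Python sketch below first serializes a list of tags into the JSON string you might store in a relational column, then parses such a string the way the mapping function is described as doing. The function name json_array_to_string_collection is invented for the example, and the validation shown is an assumption about what "cannot be parsed as JSON string arrays" covers.

```python
import json


def json_array_to_string_collection(value: str) -> list:
    """Parse a JSON array of strings; reject anything else."""
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {value!r}") from exc
    if not isinstance(parsed, list) or not all(isinstance(item, str) for item in parsed):
        raise ValueError(f"not a JSON array of strings: {value!r}")
    return parsed


# Pre-processing step: turn related rows (tags) into one JSON string column value.
tags_column = json.dumps(["red", "white", "blue"])
print(tags_column)

# What the mapping function is described as producing for that column value.
tags = json_array_to_string_collection(tags_column)
print(tags)
```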

urlEncode function

This function can be used to encode a string so that it is "URL safe". When used with a string that contains characters that are not allowed in a URL, this function converts those "unsafe" characters into their percent-encoded equivalents (for example, %3c for <). This function uses the UTF-8 encoding format.

Example - document key lookup

The urlEncode function can be used as an alternative to the base64Encode function when only URL-unsafe characters need to be converted and other characters should be kept as-is.

For example, if the input string is <hello>, the target field of type Edm.String is populated with the value %3chello%3e.

When you retrieve the encoded key at search time, you can then use the urlDecode function to get the original key value, and use that to retrieve the source document.


"fieldMappings" : [
  {
    "sourceFieldName" : "SourceKey",
    "targetFieldName" : "IndexKey",
    "mappingFunction" : {
      "name" : "urlEncode"
    }
  }]
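
To see what this kind of encoding does to a key, you can try it with Python's urllib.parse. This is an approximation rather than the service's encoder: the exact set of characters the service leaves untouched isn't stated here, and Python emits uppercase hex digits (%3C) where the example above shows lowercase (%3c). Both forms decode to the same character.

```python
from urllib.parse import quote, unquote


def url_encode(value: str) -> str:
    """Percent-encode the UTF-8 bytes of every character outside the unreserved set."""
    return quote(value, safe="")


encoded_key = url_encode("<hello>")
print(encoded_key)          # %3Chello%3E
print(encoded_key.lower())  # %3chello%3e

# Decoding accepts either hex-digit case and restores the original key.
original_key = unquote("%3chello%3e")
print(original_key)         # <hello>
```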

urlDecode function

This function converts a URL-encoded string into a decoded string using UTF-8 encoding format.

Example - decode blob metadata

Some Azure storage clients automatically URL-encode blob metadata if it contains non-ASCII characters. To make such metadata searchable as plain text, you can use the urlDecode function to turn the encoded data back into regular strings when populating your search index.


"fieldMappings" : [
 {
   "sourceFieldName" : "UrlEncodedMetadata",
   "targetFieldName" : "SearchableMetadata",
   "mappingFunction" : {
     "name" : "urlDecode"
   }
 }]
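
The effect on non-ASCII metadata can be previewed locally with Python's urllib.parse.unquote, which also decodes percent-encoded UTF-8. The metadata names and values below are made up for the illustration.

```python
from urllib.parse import unquote

# Hypothetical blob metadata as a storage client might have URL-encoded it.
encoded_metadata = {
    "author": "Ren%C3%A9e",
    "title": "Caf%C3%A9%20menu",
}

# Decode each value back to plain text, as urlDecode is described as doing.
searchable_metadata = {name: unquote(value) for name, value in encoded_metadata.items()}
print(searchable_metadata)
```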