
How to index JSON blobs using a Blob indexer in Azure Cognitive Search

This article shows you how to configure an Azure Cognitive Search blob indexer to extract structured content from JSON documents in Azure Blob storage and make it searchable in Azure Cognitive Search. This workflow creates an Azure Cognitive Search index and loads it with existing text extracted from JSON blobs.

You can use the portal, REST APIs, or .NET SDK to index JSON content. Common to all approaches is that JSON documents are located in a blob container in an Azure Storage account. For guidance on pushing JSON documents from other non-Azure platforms, see Data import in Azure Cognitive Search.

JSON blobs in Azure Blob storage are typically either a single JSON document (parsing mode is json) or a collection of JSON entities. For collections, the blob could have an array of well-formed JSON elements (parsing mode is jsonArray). Blobs could also be composed of multiple individual JSON entities separated by a newline (parsing mode is jsonLines). The parsingMode parameter on the request determines the output structures.

Note

For more information about indexing multiple search documents from a single blob, see One-to-many indexing.

Use the portal

The easiest method for indexing JSON documents is to use a wizard in the Azure portal. By parsing metadata in the Azure blob container, the Import data wizard can create a default index, map source fields to target index fields, and load the index in a single operation. Depending on the size and complexity of source data, you could have an operational full text search index in minutes.

We recommend using the same region or location for both Azure Cognitive Search and Azure Storage for lower latency and to avoid bandwidth charges.

1 - Prepare source data

Sign in to the Azure portal and create a Blob container to contain your data. The Public Access Level can be set to any of its valid values.

You will need the storage account name, container name, and an access key to retrieve your data in the Import data wizard.

2 - Start Import data wizard

In the Overview page of your search service, you can start the wizard from the command bar.

Import data command in portal

3 - 设置数据源3 - Set the data source

在“数据源”页中,源必须是“Azure Blob 存储”,其规范如下:In the data source page, the source must be Azure Blob Storage, with the following specifications:

  • “要提取的数据”应是“内容和元数据”。Data to extract should be Content and metadata. 选择此选项可让向导推断索引架构并映射要导入的字段。Choosing this option allows the wizard to infer an index schema and map the fields for import.

  • 分析模式应该设置为jsonjson 数组json 行Parsing mode should be set to JSON, JSON array or JSON lines.

    “JSON”将每个 Blob 表达为单个搜索文档,在搜索结果中作为独立的项显示。JSON articulates each blob as a single search document, showing up as an independent item in search results.

    Json 数组适用于包含格式正确的 json 数据的 blob,格式正确的 json 对应于对象的数组,或具有一个属性,该属性是一个对象数组,您希望每个元素都被表述为独立的独立搜索文档。JSON array is for blobs that contain well-formed JSON data - the well-formed JSON corresponds to an array of objects, or has a property which is an array of objects and you want each element to be articulated as a standalone, independent search document. 如果 Blob 比较复杂,而且你未选择“JSON 数组”,则会将整个 Blob 作为单个文档引入。If blobs are complex, and you don't choose JSON array the entire blob is ingested as a single document.

    JSON 行用于由多个以新行分隔的 JSON 实体组成的 blob,你希望每个实体被表述为独立的独立搜索文档。JSON lines is for blobs composed of multiple JSON entities separated by a new-line, where you want each entity to be articulated as a standalone independent search document. 如果 blob 是复杂的,并且没有选择JSON 行分析模式,则整个 blob 引入为单个文档。If blobs are complex, and you don't choose JSON lines parsing mode, then the entire blob is ingested as a single document.

  • “存储容器”必须指定你的存储帐户和容器,或指定解析成容器的连接字符串。Storage container must specify your storage account and container, or a connection string that resolves to the container. 可在 Blob 服务门户页上获取连接字符串。You can get connection strings on the Blob service portal page.

    Blob 数据源定义

4 - Skip the "Enrich content" page in the wizard

Adding cognitive skills (or enrichment) is not an import requirement. Unless you have a specific need to add AI enrichment to your indexing pipeline, you should skip this step.

To skip the step, click the blue buttons at the bottom of the page for "Next" and "Skip".

5 - Set index attributes

In the Index page, you should see a list of fields with a data type and a series of checkboxes for setting index attributes. The wizard can generate a fields list based on metadata and by sampling the source data.

You can bulk-select attributes by clicking the checkbox at the top of an attribute column. Choose Retrievable and Searchable for every field that should be returned to a client app and subject to full text search processing. You'll notice that integers are not full text or fuzzy searchable (numbers are evaluated verbatim and are often useful in filters).

Review the description of index attributes and language analyzers for more information.

Take a moment to review your selections. Once you run the wizard, physical data structures are created and you won't be able to edit these fields without dropping and recreating all objects.

Blob index definition

6 - Create indexer

Fully specified, the wizard creates three distinct objects in your search service. A data source object and index object are saved as named resources in your Azure Cognitive Search service. The last step creates an indexer object. Naming the indexer allows it to exist as a standalone resource, which you can schedule and manage independently of the index and data source objects created in the same wizard sequence.

If you are not familiar with indexers, an indexer is a resource in Azure Cognitive Search that crawls an external data source for searchable content. The output of the Import data wizard is an indexer that crawls your JSON data source, extracts searchable content, and imports it into an index on Azure Cognitive Search.

Blob indexer definition

Click OK to run the wizard and create all objects. Indexing commences immediately.

You can monitor data import in the portal pages. Progress notifications indicate indexing status and how many documents are uploaded.

When indexing is complete, you can use Search explorer to query your index.

Note

If you don't see the data you expect, you might need to set more attributes on more fields. Delete the index and indexer you just created, and step through the wizard again, modifying your selections for index attributes in step 5.

Use REST APIs

You can use the REST API to index JSON blobs, following a three-part workflow common to all indexers in Azure Cognitive Search: create a data source, create an index, create an indexer. Data extraction from blob storage occurs when you submit the Create Indexer request. After this request is finished, you will have a queryable index.

You can review REST example code at the end of this section that shows how to create all three objects. This section also contains details about JSON parsing modes, single blobs, JSON arrays, and nested arrays.

For code-based JSON indexing, use Postman and the REST API to create the data source, index, and indexer.

The data source and index must exist before you create the indexer. In contrast with the portal workflow, a code approach requires an available index to accept the JSON documents sent through the Create Indexer request.

JSON blobs in Azure Blob storage are typically either a single JSON document or a JSON "array". The blob indexer in Azure Cognitive Search can parse either construction, depending on how you set the parsingMode parameter on the request.

JSON document | parsingMode | Description | Availability
One per blob | json | Parses JSON blobs as a single chunk of text. Each JSON blob becomes a single Azure Cognitive Search document. | Generally available in both REST API and .NET SDK.
Multiple per blob | jsonArray | Parses a JSON array in the blob, where each element of the array becomes a separate Azure Cognitive Search document. | Generally available in both REST API and .NET SDK.
Multiple per blob | jsonLines | Parses a blob that contains multiple JSON entities (an "array") separated by a newline, where each entity becomes a separate Azure Cognitive Search document. | Generally available in both REST API and .NET SDK.

1 - Assemble inputs for the request

For each request, you must provide the service name (in the URL) and admin key (in the request header) for Azure Cognitive Search, and the storage account name and key for blob storage. You can use Postman to send HTTP requests to Azure Cognitive Search.

Copy the following four values into Notepad so that you can paste them into a request:

  • Azure Cognitive Search service name
  • Azure Cognitive Search admin key
  • Azure storage account name
  • Azure storage account key

You can find these values in the portal:

  1. In the portal pages for Azure Cognitive Search, copy the search service URL from the Overview page.

  2. In the left navigation pane, click Keys and then copy either the primary or secondary key (they are equivalent).

  3. Switch to the portal pages for your storage account. In the left navigation pane, under Settings, click Access Keys. This page provides both the account name and key. Copy the storage account name and one of the keys to Notepad.

2 - Create a data source

This step provides data source connection information used by the indexer. The data source is a named object in Azure Cognitive Search that persists the connection information. The data source type, azureblob, determines which data extraction behaviors are invoked by the indexer.

Substitute valid values for service name, admin key, storage account, and account key placeholders.

POST https://[service name].search.windows.net/datasources?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key for Azure Cognitive Search]

{
    "name" : "my-blob-datasource",
    "type" : "azureblob",
    "credentials" : { "connectionString" : "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>;" },
    "container" : { "name" : "my-container", "query" : "optional, my-folder" }
}   
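
For readers who would rather script this step than use Postman, the same request can be assembled in Python. This is a minimal sketch that uses only the standard library. The helper name build_datasource_request and the placeholder values are hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request

API_VERSION = "2019-05-06"  # same API version as the REST request above


def build_datasource_request(service_name, admin_key, account_name,
                             account_key, container, folder=None):
    """Build (but do not send) a Create Data Source request."""
    connection_string = (
        "DefaultEndpointsProtocol=https;"
        f"AccountName={account_name};AccountKey={account_key};"
    )
    container_def = {"name": container}
    if folder:
        # "query" is optional; the REST example uses it for a folder name.
        container_def["query"] = folder
    body = {
        "name": "my-blob-datasource",
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": container_def,
    }
    url = (f"https://{service_name}.search.windows.net/datasources"
           f"?api-version={API_VERSION}")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method="POST",
    )


# Placeholder values; substitute your own before sending.
request = build_datasource_request(
    "my-service", "<admin key>", "<account name>", "<account key>",
    "my-container", folder="my-folder")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would submit it. Keeping construction separate from sending makes it easy to inspect the body first.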

3 - Create a target search index

Indexers are paired with an index schema. If you are using the API (rather than the portal), prepare an index in advance so that you can specify it on the indexer operation.

The index stores searchable content in Azure Cognitive Search. To create an index, provide a schema that specifies the fields in a document, attributes, and other constructs that shape the search experience. If you create an index that has the same field names and data types as the source, the indexer will match the source and destination fields, saving you the work of having to explicitly map the fields.

The following example shows a Create Index request. The index will have a searchable content field to store the text extracted from blobs:

POST https://[service name].search.windows.net/indexes?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key for Azure Cognitive Search]

{
      "name" : "my-target-index",
      "fields": [
        { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
        { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false }
      ]
}
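
Because the indexer matches source properties to identically named and typed index fields, one way to draft a schema is to start from a sample document. The sketch below is illustrative only: edm_type and draft_index_fields are hypothetical helpers, not part of any Azure SDK, and the type table covers just a few simple cases.

```python
import json


def edm_type(value):
    """Map a Python value to an EDM type name (a rough, partial mapping)."""
    if isinstance(value, bool):   # test bool first: bool is a subclass of int
        return "Edm.Boolean"
    if isinstance(value, int):
        return "Edm.Int64"
    if isinstance(value, float):
        return "Edm.Double"
    if isinstance(value, str):
        return "Edm.String"
    if isinstance(value, list) and all(isinstance(v, str) for v in value):
        return "Collection(Edm.String)"
    raise ValueError(f"no simple EDM mapping for {value!r}")


def draft_index_fields(sample, key_name):
    """Produce one field definition per top-level property of the sample."""
    fields = []
    for name, value in sample.items():
        field = {"name": name, "type": edm_type(value)}
        if name == key_name:
            field["key"] = True  # the key field must be of type Edm.String
        fields.append(field)
    return fields


sample = json.loads('{"id": "1", "text": "example 1", "tags": ["search", "howto"]}')
index = {"name": "my-target-index",
         "fields": draft_index_fields(sample, key_name="id")}
print(json.dumps(index, indent=2))
```

Treat the output as a starting point. Date strings come out as Edm.String and would need to be changed to Edm.DateTimeOffset by hand, and attributes such as searchable or filterable still have to be chosen per field.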

4 - Configure and run the indexer

As with an index and a data source, an indexer is also a named object that you create and reuse on an Azure Cognitive Search service. A fully specified request to create an indexer might look as follows:

POST https://[service name].search.windows.net/indexers?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key for Azure Cognitive Search]

{
  "name" : "my-json-indexer",
  "dataSourceName" : "my-blob-datasource",
  "targetIndexName" : "my-target-index",
  "schedule" : { "interval" : "PT2H" },
  "parameters" : { "configuration" : { "parsingMode" : "json" } }
}

Indexer configuration is in the body of the request. It requires a data source and an empty target index that already exists in Azure Cognitive Search.

Schedule and parameters are optional. If you omit them, the indexer runs immediately, using json as the parsing mode.

This particular indexer does not include field mappings. Within the indexer definition, you can leave out field mappings if the properties of the source JSON document match the fields of your target search index.
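
To make the optional parts concrete, here is a small Python sketch that builds an indexer definition and leaves out "schedule" and "parameters" unless they are requested. The function name make_indexer is hypothetical; the property names are the ones used in the request above.

```python
import json


def make_indexer(name, data_source, target_index,
                 interval=None, parsing_mode=None):
    """Build an indexer definition, omitting optional sections that are unset."""
    definition = {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": target_index,
    }
    if interval is not None:
        # ISO 8601 duration, for example "PT2H" for every two hours.
        definition["schedule"] = {"interval": interval}
    if parsing_mode is not None:
        definition["parameters"] = {
            "configuration": {"parsingMode": parsing_mode}
        }
    return definition


# Minimal form: relies on the defaults described above.
minimal = make_indexer("my-json-indexer", "my-blob-datasource", "my-target-index")

# Fully specified form, equivalent to the request shown earlier.
scheduled = make_indexer("my-json-indexer", "my-blob-datasource",
                         "my-target-index", interval="PT2H", parsing_mode="json")

print(json.dumps(minimal))
print(json.dumps(scheduled))
```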

REST Example

This section is a recap of all the requests used for creating objects. For a discussion of component parts, see the previous sections in this article.

Data source request

All indexers require a data source object that provides connection information to existing data.

POST https://[service name].search.windows.net/datasources?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key for Azure Cognitive Search]

{
    "name" : "my-blob-datasource",
    "type" : "azureblob",
    "credentials" : { "connectionString" : "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>;" },
    "container" : { "name" : "my-container", "query" : "optional, my-folder" }
}  

Index request

All indexers require a target index that receives the data. The body of the request defines the index schema, consisting of fields that are attributed to support the desired behaviors in a searchable index. This index should be empty when you run the indexer.

POST https://[service name].search.windows.net/indexes?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key for Azure Cognitive Search]

{
      "name" : "my-target-index",
      "fields": [
        { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
        { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false }
      ]
}

Indexer request

This request shows a fully specified indexer. It includes field mappings, which were omitted in previous examples. Recall that "schedule", "parameters", and "fieldMappings" are optional as long as there is an available default. Omitting "schedule" causes the indexer to run immediately. Omitting "parsingMode" causes the indexer to use the "json" default.

Creating the indexer on Azure Cognitive Search triggers data import. It runs immediately, and thereafter on a schedule if you've provided one.

POST https://[service name].search.windows.net/indexers?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key for Azure Cognitive Search]

{
  "name" : "my-json-indexer",
  "dataSourceName" : "my-blob-datasource",
  "targetIndexName" : "my-target-index",
  "schedule" : { "interval" : "PT2H" },
  "parameters" : { "configuration" : { "parsingMode" : "json" } },
  "fieldMappings" : [
    { "sourceFieldName" : "/article/text", "targetFieldName" : "text" },
    { "sourceFieldName" : "/article/datePublished", "targetFieldName" : "date" },
    { "sourceFieldName" : "/article/tags", "targetFieldName" : "tags" }
    ]
}

Use .NET SDK

The .NET SDK has full parity with the REST API. We recommend that you review the previous REST API section to learn concepts, workflow, and requirements. You can then refer to the .NET API reference documentation to implement a JSON indexer in managed code.

Parsing modes

JSON blobs can assume multiple forms. The parsingMode parameter on the JSON indexer determines how JSON blob content is parsed and structured in an Azure Cognitive Search index:

parsingMode | Description
json | Index each blob as a single document. This is the default.
jsonArray | Choose this mode if your blobs consist of JSON arrays, and you need each element of the array to become a separate document in Azure Cognitive Search.
jsonLines | Choose this mode if your blobs consist of multiple JSON entities that are separated by a newline, and you need each entity to become a separate document in Azure Cognitive Search.

You can think of a document as a single item in search results. If you want each element in the array to show up in search results as an independent item, then use the jsonArray or jsonLines option as appropriate.
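
The difference between the three modes is easiest to see by counting the documents each one would produce. The following Python sketch is a local approximation for experimenting with sample data. The function split_blob is hypothetical, and the service's own parser may differ in details such as error handling.

```python
import json


def split_blob(text, parsing_mode="json"):
    """Return the documents a blob would yield under a given parsing mode."""
    if parsing_mode == "json":
        return [json.loads(text)]      # the whole blob is one document
    if parsing_mode == "jsonArray":
        elements = json.loads(text)
        if not isinstance(elements, list):
            # Arrays nested in a property need a document root; not handled here.
            raise ValueError("this sketch expects a top-level JSON array")
        return elements                # one document per array element
    if parsing_mode == "jsonLines":
        # One document per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    raise ValueError(f"unknown parsing mode: {parsing_mode}")


single_blob = '{"article": {"text": "one document"}}'
array_blob = '[{"id": "1", "text": "example 1"}, {"id": "2", "text": "example 2"}]'
lines_blob = '{"id": "1", "text": "example 1"}\n{"id": "2", "text": "example 2"}\n'

print(len(split_blob(single_blob, "json")))       # 1
print(len(split_blob(array_blob, "jsonArray")))   # 2
print(len(split_blob(lines_blob, "jsonLines")))   # 2
```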

Within the indexer definition, you can optionally use field mappings to choose which properties of the source JSON document are used to populate your target search index. For jsonArray parsing mode, if the array exists as a lower-level property, you can set a document root indicating where the array is placed within the blob.

Important

When you use json, jsonArray, or jsonLines parsing mode, Azure Cognitive Search assumes that all blobs in your data source contain JSON. If you need to support a mix of JSON and non-JSON blobs in the same data source, let us know on our UserVoice site.

Parse single JSON blobs

By default, the Azure Cognitive Search blob indexer parses JSON blobs as a single chunk of text. Often, you want to preserve the structure of your JSON documents. For example, assume you have the following JSON document in Azure Blob storage:

{
    "article" : {
        "text" : "A hopefully useful article explaining how to parse JSON blobs",
        "datePublished" : "2016-04-13",
        "tags" : [ "search", "storage", "howto" ]    
    }
}

The blob indexer parses the JSON document into a single Azure Cognitive Search document. The indexer loads an index by matching "text", "datePublished", and "tags" from the source against identically named and typed target index fields.

As noted, field mappings are not required. Given an index with "text", "datePublished", and "tags" fields, the blob indexer can infer the correct mapping without a field mapping present in the request.

Parse JSON arrays

Alternatively, you can use the JSON array option. This option is useful when blobs contain an array of well-formed JSON objects, and you want each element to become a separate Azure Cognitive Search document. For example, given the following JSON blob, you can populate your Azure Cognitive Search index with three separate documents, each with "id" and "text" fields.

[
    { "id" : "1", "text" : "example 1" },
    { "id" : "2", "text" : "example 2" },
    { "id" : "3", "text" : "example 3" }
]

For a JSON array, the indexer definition should look similar to the following example. Notice that the parsingMode parameter specifies the jsonArray parser. Specifying the right parser and having the right data input are the only two array-specific requirements for indexing JSON blobs.

POST https://[service name].search.windows.net/indexers?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key]

{
  "name" : "my-json-indexer",
  "dataSourceName" : "my-blob-datasource",
  "targetIndexName" : "my-target-index",
  "schedule" : { "interval" : "PT2H" },
  "parameters" : { "configuration" : { "parsingMode" : "jsonArray" } }
}

Again, notice that field mappings can be omitted. Assuming an index with identically named "id" and "text" fields, the blob indexer can infer the correct mapping without an explicit field mapping list.

Parse nested arrays

For JSON arrays having nested elements, you can specify a documentRoot to indicate a multi-level structure. For example, if your blobs look like this:

{
    "level1" : {
        "level2" : [
            { "id" : "1", "text" : "Use the documentRoot property" },
            { "id" : "2", "text" : "to pluck the array you want to index" },
            { "id" : "3", "text" : "even if it's nested inside the document" }  
        ]
    }
}

Use this configuration to index the array contained in the level2 property:

{
    "name" : "my-json-array-indexer",
    ... other indexer properties
    "parameters" : { "configuration" : { "parsingMode" : "jsonArray", "documentRoot" : "/level1/level2" } }
}
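
To preview which elements a given documentRoot would select, you can walk the path locally. The Python sketch below assumes the path is a simple slash-separated list of property names, as in the example above. The function documents_at_root is a hypothetical helper, not the service's implementation.

```python
import json


def documents_at_root(blob_text, document_root):
    """Follow a slash-separated path and return the array found there."""
    node = json.loads(blob_text)
    for part in document_root.strip("/").split("/"):
        if part:                      # an empty path means the blob root
            node = node[part]
    if not isinstance(node, list):
        raise ValueError(f"{document_root} does not point at an array")
    return node


blob = """
{
    "level1" : {
        "level2" : [
            { "id" : "1", "text" : "Use the documentRoot property" },
            { "id" : "2", "text" : "to pluck the array you want to index" },
            { "id" : "3", "text" : "even if it's nested inside the document" }
        ]
    }
}
"""

for doc in documents_at_root(blob, "/level1/level2"):
    print(doc["id"], doc["text"])
```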

Parse blobs separated by newlines

If your blob contains multiple JSON entities separated by a newline, and you want each element to become a separate Azure Cognitive Search document, you can opt for the JSON lines option. For example, given the following blob (where there are three different JSON entities), you can populate your Azure Cognitive Search index with three separate documents, each with "id" and "text" fields.

{ "id" : "1", "text" : "example 1" }
{ "id" : "2", "text" : "example 2" }
{ "id" : "3", "text" : "example 3" }

For JSON lines, the indexer definition should look similar to the following example. Notice that the parsingMode parameter specifies the jsonLines parser.

POST https://[service name].search.windows.net/indexers?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key]

{
  "name" : "my-json-indexer",
  "dataSourceName" : "my-blob-datasource",
  "targetIndexName" : "my-target-index",
  "schedule" : { "interval" : "PT2H" },
  "parameters" : { "configuration" : { "parsingMode" : "jsonLines" } }
}

Again, notice that field mappings can be omitted, similar to the jsonArray parsing mode.

Add field mappings

When source and target fields are not perfectly aligned, you can define a field mapping section in the request body for explicit field-to-field associations.

Currently, Azure Cognitive Search cannot index arbitrary JSON documents directly because it supports only primitive data types, string arrays, and GeoJSON points. However, you can use field mappings to pick parts of your JSON document and "lift" them into top-level fields of the search document. To learn about field mappings basics, see Field mappings in Azure Cognitive Search indexers.

Revisiting our example JSON document:

{
    "article" : {
        "text" : "A hopefully useful article explaining how to parse JSON blobs",
        "datePublished" : "2016-04-13",
        "tags" : [ "search", "storage", "howto" ]
    }
}

Assume a search index with the following fields: text of type Edm.String, date of type Edm.DateTimeOffset, and tags of type Collection(Edm.String). Notice the discrepancy between "datePublished" in the source and the date field in the index. To map your JSON into the desired shape, use the following field mappings:

"fieldMappings" : [
    { "sourceFieldName" : "/article/text", "targetFieldName" : "text" },
    { "sourceFieldName" : "/article/datePublished", "targetFieldName" : "date" },
    { "sourceFieldName" : "/article/tags", "targetFieldName" : "tags" }
  ]

The source field names in the mappings are specified using the JSON Pointer notation. You start with a forward slash to refer to the root of your JSON document, then pick the desired property (at an arbitrary level of nesting) by using a forward slash-separated path.

You can also refer to individual array elements by using a zero-based index. For example, to pick the first element of the "tags" array from the above example, use a field mapping like this:

{ "sourceFieldName" : "/article/tags/0", "targetFieldName" : "firstTag" }
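
The lifting behavior can be reproduced locally to check what a set of mappings would produce. The Python sketch below resolves simple JSON Pointer paths and builds a flat search document. Both resolve_pointer and apply_field_mappings are hypothetical helpers, and the escape sequences defined by RFC 6901 (~0 and ~1) are not handled.

```python
def resolve_pointer(document, pointer):
    """Resolve a pointer such as "/article/tags/0"; return None if absent."""
    node = document
    for token in pointer.split("/")[1:]:
        if isinstance(node, list):
            # Array elements are addressed by zero-based index.
            if not token.isdigit() or int(token) >= len(node):
                return None
            node = node[int(token)]
        elif isinstance(node, dict) and token in node:
            node = node[token]
        else:
            return None
    return node


def apply_field_mappings(document, field_mappings):
    """Lift mapped source properties into top-level fields."""
    search_doc = {}
    for mapping in field_mappings:
        value = resolve_pointer(document, mapping["sourceFieldName"])
        if value is not None:   # simplification: null values are skipped too
            search_doc[mapping["targetFieldName"]] = value
    return search_doc


source = {
    "article": {
        "text": "A hopefully useful article explaining how to parse JSON blobs",
        "datePublished": "2016-04-13",
        "tags": ["search", "storage", "howto"],
    }
}
field_mappings = [
    {"sourceFieldName": "/article/text", "targetFieldName": "text"},
    {"sourceFieldName": "/article/datePublished", "targetFieldName": "date"},
    {"sourceFieldName": "/article/tags", "targetFieldName": "tags"},
    {"sourceFieldName": "/article/tags/0", "targetFieldName": "firstTag"},
]
result = apply_field_mappings(source, field_mappings)
print(result["date"], result["firstTag"])  # 2016-04-13 search
```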

Note

If a source field name in a field mapping path refers to a property that doesn't exist in JSON, that mapping is skipped without an error. This is done so that we can support documents with a different schema (which is a common use case). Because there is no validation, you need to take care to avoid typos in your field mapping specification.

See also