How to create a skillset in an AI enrichment pipeline in Azure Cognitive Search

AI enrichment extracts and enriches data to make it searchable in Azure Cognitive Search. The extraction and enrichment steps are called cognitive skills, and they are combined into a skillset that is referenced during indexing. A skillset can use built-in skills or custom skills (see Example: Creating a custom skill in an AI enrichment pipeline for more information).

In this article, you learn how to create an enrichment pipeline for the skills you want to use. A skillset is attached to an Azure Cognitive Search indexer. One part of pipeline design, covered in this article, is constructing the skillset itself.

Note

Another part of pipeline design is specifying an indexer, covered in the next step. An indexer definition includes a reference to the skillset, plus field mappings used for connecting inputs to outputs in the target index.

Key points to remember:

  • You can only have one skillset per indexer.
  • A skillset must have at least one skill.
  • You can create multiple skills of the same type (for example, variants of an image analysis skill).

Begin with the end in mind

A recommended initial step is deciding which data to extract from your raw data and how you want to use that data in a search solution. Creating an illustration of the entire enrichment pipeline can help you identify the necessary steps.

Suppose you are interested in processing a set of financial analyst comments. For each file, you want to extract company names and the general sentiment of the comments. You might also want to write a custom enricher that uses the Bing Entity Search service to find additional information about the company, such as what kind of business the company is engaged in. Essentially, you want to extract information like the following, indexed for each document:

record-text | companies | sentiment | company descriptions
sample-record | ["Microsoft", "LinkedIn"] | 0.99 | ["Microsoft Corporation is an American multinational technology company ...", "LinkedIn is a business- and employment-oriented social networking..."]

The following diagram illustrates a hypothetical enrichment pipeline:

Figure: A hypothetical enrichment pipeline

Once you have a fair idea of what you want in the pipeline, you can express the skillset that provides these steps. Functionally, the skillset is expressed when you upload your indexer definition to Azure Cognitive Search. To learn more about how to upload your indexer, see the indexer documentation.

In the diagram, the document cracking step happens automatically. Essentially, Azure Cognitive Search knows how to open well-known file formats and creates a content field containing the text extracted from each document. The white boxes are built-in enrichers, and the dotted "Bing Entity Search" box represents a custom enricher that you are creating. As illustrated, the skillset contains three skills.

Skillset definition in REST

A skillset is defined as an array of skills. Each skill defines the source of its inputs and the name of the outputs produced. Using the Create Skillset REST API, you can define a skillset that corresponds to the previous diagram:

PUT https://[servicename].search.windows.net/skillsets/[skillset name]?api-version=2019-05-06
api-key: [admin key]
Content-Type: application/json
{
  "description": 
  "Extract sentiment from financial records, extract company names, and then find additional information about each company mentioned.",
  "skills":
  [
    {
      "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
      "context": "/document",
      "categories": [ "Organization" ],
      "defaultLanguageCode": "en",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "organizations",
          "targetName": "organizations"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.SentimentSkill",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "score",
          "targetName": "mySentiment"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "description": "Calls an Azure function, which in turn calls Bing Entity Search",
      "uri": "https://indexer-e2e-webskill.azurewebsites.net/api/InvokeTextAnalyticsV3?code=foo",
      "httpHeaders": {
          "Ocp-Apim-Subscription-Key": "foobar"
      },
      "context": "/document/organizations/*",
      "inputs": [
        {
          "name": "query",
          "source": "/document/organizations/*"
        }
      ],
      "outputs": [
        {
          "name": "description",
          "targetName": "companyDescription"
        }
      ]
    }
  ]
}
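As a rough illustration of how the request above could be assembled programmatically, the following Python sketch builds the PUT request without sending it. The service name, skillset name, and the helper build_skillset_request are hypothetical placeholders introduced for this example, and the body is abbreviated to the sentiment skill only.

```python
import json
import urllib.request

# Hypothetical placeholders; substitute your own service, skillset name, and admin key.
SERVICE_NAME = "my-search-service"
SKILLSET_NAME = "financial-analyst-skillset"
ADMIN_KEY = "<admin key>"
API_VERSION = "2019-05-06"


def build_skillset_request(service, name, admin_key, definition):
    """Return an unsent PUT request for the Create Skillset REST API."""
    url = (
        f"https://{service}.search.windows.net/skillsets/{name}"
        f"?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(definition).encode("utf-8"),
        headers={"api-key": admin_key, "Content-Type": "application/json"},
        method="PUT",
    )


# Abbreviated body: only the sentiment skill from the definition above.
definition = {
    "description": "Extract sentiment from financial records.",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Text.SentimentSkill",
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "score", "targetName": "mySentiment"}],
        }
    ],
}

request = build_skillset_request(SERVICE_NAME, SKILLSET_NAME, ADMIN_KEY, definition)
print(request.get_method(), request.full_url)
# To actually send it, pass the request to urllib.request.urlopen(request).
```

Keeping the request construction separate from the send step makes it easy to inspect the URL and body before anything reaches the service.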

Create a skillset

While creating a skillset, you can provide a description to make the skillset self-documenting. A description is optional, but useful for keeping track of what a skillset does. Because a skillset is a JSON document, which does not allow comments, you must use a description element for this.

{
  "description": 
  "This is our first skillset. It extracts sentiment from financial records, extracts company names, and then finds additional information about each company mentioned.",
  ...
}

The next piece in the skillset is an array of skills. You can think of each skill as a primitive of enrichment. Each skill performs a small task in this enrichment pipeline. Each one takes an input (or a set of inputs), and returns some outputs. The next few sections focus on how to specify built-in and custom skills, chaining skills together through input and output references. Inputs can come from source data or from another skill. Outputs can be mapped to a field in a search index or used as an input to a downstream skill.
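To make the idea of chaining concrete, here is a minimal Python sketch that checks whether every input in a skills array refers either to source data or to an output produced by some skill. The helper names output_paths and find_unresolved_inputs are hypothetical, the fallback from targetName to name is an assumption of this sketch, and the check is a simplification for illustration, not how the enrichment engine itself resolves references.

```python
def output_paths(skills):
    """Collect the paths at which each skill's outputs are assumed to appear."""
    paths = set()
    for skill in skills:
        # If "context" is not set, the default context is the document.
        context = skill.get("context", "/document")
        for output in skill["outputs"]:
            # Assumption of this sketch: fall back to "name" when no targetName is given.
            name = output.get("targetName", output["name"])
            paths.add(f"{context}/{name}")
    return paths


def find_unresolved_inputs(skills, source_paths=("/document/content",)):
    """Return input sources that match neither source data nor any skill output."""
    available = set(source_paths) | output_paths(skills)
    unresolved = []
    for skill in skills:
        for skill_input in skill["inputs"]:
            source = skill_input["source"]
            # A trailing "/*" addresses each element of an array, so check the array path.
            base = source[:-2] if source.endswith("/*") else source
            if base not in available:
                unresolved.append(source)
    return unresolved


# The three skills from this article, reduced to context, inputs, and outputs.
skills = [
    {
        "context": "/document",
        "inputs": [{"name": "text", "source": "/document/content"}],
        "outputs": [{"name": "organizations", "targetName": "organizations"}],
    },
    {
        "inputs": [{"name": "text", "source": "/document/content"}],
        "outputs": [{"name": "score", "targetName": "mySentiment"}],
    },
    {
        "context": "/document/organizations/*",
        "inputs": [{"name": "query", "source": "/document/organizations/*"}],
        "outputs": [{"name": "description", "targetName": "companyDescription"}],
    },
]

print(find_unresolved_inputs(skills))  # → []
```

An empty result means every input can be traced back to the content field or to another skill's output.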

Add built-in skills

Let's look at the first skill, which is the built-in entity recognition skill:

    {
      "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
      "context": "/document",
      "categories": [ "Organization" ],
      "defaultLanguageCode": "en",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "organizations",
          "targetName": "organizations"
        }
      ]
    }
  • Every built-in skill has @odata.type, inputs, and outputs properties. Skill-specific properties provide additional information applicable to that skill. For entity recognition, categories specifies which entity types to extract, chosen from the fixed set of entity types that the pretrained model can recognize.

  • Each skill should have a "context". The context represents the level at which operations take place. In the skill above, the context is the whole document, meaning that the entity recognition skill is called once per document. Outputs are also produced at that level. More specifically, "organizations" are generated as a member of "/document". In downstream skills, you can refer to this newly created information as "/document/organizations". If the "context" field is not explicitly set, the default context is the document.

  • The skill has one input called "text", with a source input set to "/document/content". The skill (entity recognition) operates on the content field of each document, which is a standard field created by the Azure blob indexer.

  • The skill has one output called "organizations". Outputs exist only during processing. To chain this output to a downstream skill's input, reference the output as "/document/organizations".

  • For a particular document, the value of "/document/organizations" is an array of organizations extracted from the text. For example:

    ["Microsoft", "LinkedIn"]
    

Some situations call for referencing each element of an array separately. For example, suppose you want to pass each element of "/document/organizations" separately to another skill (such as the custom Bing entity search enricher). You can refer to each element of the array by adding an asterisk to the path: "/document/organizations/*"
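The difference between the array path and the asterisk path can be sketched with a toy resolver in Python. The resolve function and the numeric indexes in the returned paths are conventions of this sketch only; they are meant to show the fan-out behavior, not the enrichment engine's actual implementation.

```python
def resolve(document, path):
    """Return (concrete path, value) pairs for a path that may contain '*'."""
    segments = path.strip("/").split("/")
    if segments[0] != "document":
        raise ValueError("paths are expected to start with /document")
    matches = [("/document", document)]
    for segment in segments[1:]:
        expanded = []
        for prefix, node in matches:
            if segment == "*":
                # Fan out: one match per element of the array at this level.
                expanded.extend(
                    (f"{prefix}/{index}", item) for index, item in enumerate(node)
                )
            else:
                expanded.append((f"{prefix}/{segment}", node[segment]))
        matches = expanded
    return matches


document = {
    "content": "Microsoft logged $1.1 billion in revenue from LinkedIn ...",
    "organizations": ["Microsoft", "LinkedIn"],
}

# Without the asterisk, the path refers to the array as a whole: one match.
print(len(resolve(document, "/document/organizations")))
# With the asterisk, each element is addressed separately: one match per organization.
for concrete_path, value in resolve(document, "/document/organizations/*"):
    print(concrete_path, value)
```

A skill whose context is "/document/organizations/*" is therefore invoked once per match, which for this document means twice.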

The second skill for sentiment extraction follows the same pattern as the first enricher. It takes "/document/content" as input, and returns a sentiment score for each content instance. Since you did not set the "context" field explicitly, the output (mySentiment) is now a child of "/document".

    {
      "@odata.type": "#Microsoft.Skills.Text.SentimentSkill",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "score",
          "targetName": "mySentiment"
        }
      ]
    }

Add a custom skill

Recall the structure of the custom Bing entity search enricher:

    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "description": "This skill calls an Azure function, which in turn calls Bing Entity Search",
      "uri": "https://indexer-e2e-webskill.azurewebsites.net/api/InvokeTextAnalyticsV3?code=foo",
      "httpHeaders": {
          "Ocp-Apim-Subscription-Key": "foobar"
      },
      "context": "/document/organizations/*",
      "inputs": [
        {
          "name": "query",
          "source": "/document/organizations/*"
        }
      ],
      "outputs": [
        {
          "name": "description",
          "targetName": "companyDescription"
        }
      ]
    }

This definition is a custom skill that calls a web API as part of the enrichment process. For each organization identified by entity recognition, this skill calls a web API to find the description of that organization. The orchestration of when to call the web API and how to flow the information received is handled internally by the enrichment engine. However, the initialization necessary for calling this custom API must be provided in the JSON (such as uri, httpHeaders, and the inputs expected). For guidance in creating a custom web API for the enrichment pipeline, see How to define a custom interface.

Notice that the "context" field is set to "/document/organizations/*" with an asterisk, meaning the enrichment step is called for each organization under "/document/organizations".

Output, in this case a company description, is generated for each organization identified. Because the output's targetName is "companyDescription", a downstream step (for example, key phrase extraction) would refer to the description using the path "/document/organizations/*/companyDescription".
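For orientation, the following Python sketch shows what the web API behind such a custom skill might do with one batch of records. It assumes the request and response envelope of the custom skill interface (a "values" array whose records carry "recordId" and "data", with "errors" and "warnings" in the response); check the custom interface documentation for the authoritative shape. The DESCRIPTIONS dictionary and handle_request are hypothetical stand-ins, and no call to Bing Entity Search is made.

```python
# Hypothetical stand-in for a Bing Entity Search lookup; a real function would call the service.
DESCRIPTIONS = {
    "Microsoft": "Microsoft Corporation is an American multinational technology company ...",
    "LinkedIn": "LinkedIn is a business- and employment-oriented social networking...",
}


def handle_request(body, lookup=DESCRIPTIONS.get):
    """Map each incoming record's "query" input to a "description" output."""
    results = []
    for record in body["values"]:
        query = record["data"]["query"]
        description = lookup(query)
        # Each result echoes the recordId so it can be matched to its input record.
        result = {
            "recordId": record["recordId"],
            "data": {},
            "errors": [],
            "warnings": [],
        }
        if description is None:
            result["warnings"].append(
                {"message": f"No description found for '{query}'"}
            )
        else:
            result["data"]["description"] = description
        results.append(result)
    return {"values": results}


# One record per organization, because the skill's context is "/document/organizations/*".
request_body = {
    "values": [
        {"recordId": "0", "data": {"query": "Microsoft"}},
        {"recordId": "1", "data": {"query": "Contoso"}},
    ]
}
response = handle_request(request_body)
print(response["values"][0]["data"]["description"])
print(response["values"][1]["warnings"])
```

The "query" and "description" keys correspond to the input and output names declared in the skill definition above.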

Add structure

The skillset generates structured information out of unstructured data. Consider the following example:

"In its fourth quarter, Microsoft logged $1.1 billion in revenue from LinkedIn, the social networking company it bought last year. The acquisition enables Microsoft to combine LinkedIn capabilities with its CRM and Office capabilities. Stockholders are excited with the progress so far."

A likely outcome would be a generated structure similar to the following illustration:

Figure: Sample output structure

Until now, this structure has been internal-only, memory-only, and used only in Azure Cognitive Search indexes. The addition of a knowledge store gives you a way to save shaped enrichments for use outside of search.

Add a knowledge store

Knowledge store is a preview feature in Azure Cognitive Search for saving your enriched document. A knowledge store that you create, backed by an Azure storage account, is the repository where your enriched data lands.

A knowledge store definition is added to a skillset. For a walkthrough of the entire process, see Create a knowledge store in REST.

"knowledgeStore": {
  "storageConnectionString": "<an Azure storage connection string>",
  "projections" : [
    {
      "tables": [ ]
    },
    {
      "objects": [
        {
          "storageContainer": "containername",
          "source": "/document/EnrichedShape/",
          "key": "/document/Id"
        }
      ]
    }
  ]
}

You can choose to save the enriched documents as tables with hierarchical relationships preserved or as JSON documents in blob storage. Output from any of the skills in the skillset can be sourced as the input for the projection. If you are looking to project the data into a specific shape, the updated shaper skill can now model complex types for you to use.
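As a small sketch of how the fragment above fits into a skillset, the Python below attaches a knowledgeStore element alongside the skills array. The helper with_knowledge_store is hypothetical, and the container name, source path, and key simply mirror the placeholder values in the fragment.

```python
import copy
import json


def with_knowledge_store(skillset, connection_string, container, source, key):
    """Return a copy of the skillset with an object projection added."""
    updated = copy.deepcopy(skillset)  # leave the caller's definition untouched
    updated["knowledgeStore"] = {
        "storageConnectionString": connection_string,
        "projections": [
            {"tables": []},
            {
                "objects": [
                    {"storageContainer": container, "source": source, "key": key}
                ]
            },
        ],
    }
    return updated


skillset = {"description": "Financial analyst comments", "skills": []}
updated = with_knowledge_store(
    skillset,
    connection_string="<an Azure storage connection string>",
    container="containername",
    source="/document/EnrichedShape/",
    key="/document/Id",
)
print(json.dumps(updated["knowledgeStore"], indent=2))
```

The skills array is left empty here only to keep the sketch short; a real skillset needs at least one skill.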

Next steps

Now that you are familiar with the enrichment pipeline and skillsets, continue with How to reference annotations in a skillset or How to map outputs to fields in an index.