How to create a skillset in an enrichment pipeline

Cognitive search extracts and enriches data to make it searchable in Azure Search. Extraction and enrichment steps are called cognitive skills. Skills are combined into a skillset, which is referenced during indexing. A skillset can use predefined skills or custom skills (see Example: create a custom skill for more information).

In this article, you learn how to create an enrichment pipeline for the skills you want to use. A skillset is attached to an Azure Search indexer. One part of pipeline design, covered in this article, is constructing the skillset itself.

Note

Another part of pipeline design is specifying an indexer, covered in the next step. An indexer definition includes a reference to the skillset, plus field mappings used for connecting inputs to outputs in the target index.

Key points to remember:

  • You can only have one skillset per indexer.
  • A skillset must have at least one skill.
  • You can create multiple skills of the same type (for example, variants of an image analysis skill).

Begin with the end in mind

A recommended initial step is deciding which data to extract from your raw data and how you want to use that data in a search solution. Creating an illustration of the entire enrichment pipeline can help you identify the necessary steps.

Suppose you are interested in processing a set of financial analyst comments. For each file, you want to extract company names and the general sentiment of the comments. You might also want to write a custom enricher that uses the Bing Entity Search service to find additional information about the company, such as what kind of business the company is engaged in. Essentially, you want to extract information like the following, indexed for each document:

record-text   | companies                 | sentiment | company descriptions
sample-record | ["Microsoft", "LinkedIn"] | 0.99      | ["Microsoft Corporation is an American multinational technology company ...", "LinkedIn is a business- and employment-oriented social networking..."]

The following diagram illustrates a hypothetical enrichment pipeline:

A hypothetical enrichment pipeline

Once you have a fair idea of what you want in the pipeline, you can express the skillset that provides these steps. Functionally, the skillset is expressed when you upload your indexer definition to Azure Search. To learn more about how to upload your indexer, see the indexer documentation.

In the diagram, the document cracking step happens automatically. Essentially, Azure Search knows how to open well-known files and creates a content field containing the text extracted from each document. The white boxes are built-in enrichers, and the dotted "Bing Entity Search" box represents a custom enricher that you are creating. As illustrated, the skillset contains three skills.

Skillset definition in REST

A skillset is defined as an array of skills. Each skill defines the source of its inputs and the name of the outputs produced. Using the Create Skillset REST API, you can define a skillset that corresponds to the previous diagram:

PUT https://[servicename].search.windows.net/skillsets/[skillset name]?api-version=2017-11-11-Preview
api-key: [admin key]
Content-Type: application/json

{
  "description": "Extract sentiment from financial records, extract company names, and then find additional information about each company mentioned.",
  "skills":
  [
    {
      "@odata.type": "#Microsoft.Skills.Text.NamedEntityRecognitionSkill",
      "context": "/document",
      "categories": [ "Organization" ],
      "defaultLanguageCode": "en",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "organizations",
          "targetName": "organizations"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.SentimentSkill",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "score",
          "targetName": "mySentiment"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "description": "Calls an Azure function, which in turn calls Bing Entity Search",
      "uri": "https://indexer-e2e-webskill.azurewebsites.net/api/InvokeTextAnalyticsV3?code=foo",
      "httpHeaders": {
          "Ocp-Apim-Subscription-Key": "foobar"
      },
      "context": "/document/organizations/*",
      "inputs": [
        {
          "name": "query",
          "source": "/document/organizations/*"
        }
      ],
      "outputs": [
        {
          "name": "description",
          "targetName": "companyDescription"
        }
      ]
    }
  ]
}
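
The same request can be assembled programmatically. The following is a minimal Python sketch, using only the standard library, of how the PUT request above might be built. The helper name build_skillset_request and the placeholder service name, skillset name, and key are assumptions for illustration, and the skillset is shortened to a single skill. The request is constructed but not sent.

```python
import json
import urllib.request

API_VERSION = "2017-11-11-Preview"  # the version used in the request above


def build_skillset_request(service_name, skillset_name, admin_key, skillset):
    """Build (but do not send) a PUT request for the Create Skillset REST API."""
    url = (
        f"https://{service_name}.search.windows.net"
        f"/skillsets/{skillset_name}?api-version={API_VERSION}"
    )
    body = json.dumps(skillset).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"api-key": admin_key, "Content-Type": "application/json"},
    )


# Shortened skillset: only the sentiment skill from the definition above.
skillset = {
    "description": "Extract sentiment from financial records.",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Text.SentimentSkill",
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "score", "targetName": "mySentiment"}],
        }
    ],
}

# Placeholder values (hypothetical); substitute your own service, name, and key.
request = build_skillset_request("my-service", "my-skillset", "<admin key>", skillset)
print(request.get_method(), request.full_url)
# To send the request: urllib.request.urlopen(request)
```

Keeping construction separate from sending makes it possible to inspect the URL, headers, and body before anything reaches the service.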

Create a skillset

While creating a skillset, you can provide a description to make the skillset self-documenting. A description is optional, but useful for keeping track of what a skillset does. Because a skillset is a JSON document, and JSON does not allow comments, use the description element for this purpose.

{
  "description": "This is our first skillset. It extracts sentiment from financial records, extracts company names, and then finds additional information about each company mentioned.",
  ...
}

The next piece in the skillset is an array of skills. You can think of each skill as a primitive of enrichment. Each skill performs a small task in this enrichment pipeline. Each one takes an input (or a set of inputs), and returns some outputs. The next few sections focus on how to specify predefined and custom skills, chaining skills together through input and output references. Inputs can come from source data or from another skill. Outputs can be mapped to a field in a search index or used as an input to a downstream skill.
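
Because skills are chained through paths, a quick consistency check can catch a mistyped source before the skillset is uploaded. The following Python sketch is one way to do that, under stated assumptions: the function names are hypothetical, "/document/content" is treated as the only field coming from source data, and each output path is modeled as the skill's context followed by its targetName. It is not part of Azure Search and does not reproduce the service's own validation.

```python
def output_paths(skill):
    """Return the enrichment paths a skill produces: context + "/" + targetName."""
    context = skill.get("context", "/document")  # default context is the document
    return {f"{context}/{out['targetName']}" for out in skill["outputs"]}


def unresolved_inputs(skills, source_paths=("/document/content",)):
    """List (skill index, source) pairs that match no source field or skill output."""
    available = set(source_paths)
    for skill in skills:
        available |= output_paths(skill)

    problems = []
    for index, skill in enumerate(skills):
        for inp in skill["inputs"]:
            source = inp["source"]
            # A trailing "/*" addresses the elements of an array, so check
            # that the array itself is available.
            base = source[:-2] if source.endswith("/*") else source
            if base not in available:
                problems.append((index, source))
    return problems


# The three skills from the definition above, reduced to the parts used here.
skills = [
    {
        "context": "/document",
        "inputs": [{"name": "text", "source": "/document/content"}],
        "outputs": [{"name": "organizations", "targetName": "organizations"}],
    },
    {
        "inputs": [{"name": "text", "source": "/document/content"}],
        "outputs": [{"name": "score", "targetName": "mySentiment"}],
    },
    {
        "context": "/document/organizations/*",
        "inputs": [{"name": "query", "source": "/document/organizations/*"}],
        "outputs": [{"name": "description", "targetName": "companyDescription"}],
    },
]

print(unresolved_inputs(skills))  # an empty list means every input resolves
```

The check ignores the order of skills in the array and only asks whether each source is produced somewhere in the skillset.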

Add predefined skills

Let's look at the first skill, which is the predefined named entity recognition skill:

    {
      "@odata.type": "#Microsoft.Skills.Text.NamedEntityRecognitionSkill",
      "context": "/document",
      "categories": [ "Organization" ],
      "defaultLanguageCode": "en",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "organizations",
          "targetName": "organizations"
        }
      ]
    }

  • Every predefined skill has @odata.type, inputs, and outputs properties. Skill-specific properties provide additional information applicable to that skill. For entity recognition, categories lists the entity types to extract, chosen from a fixed set of types that the pretrained model can recognize.

  • Each skill should have a "context". The context represents the level at which operations take place. In the skill above, the context is the whole document, meaning that the named entity recognition skill is called once per document. Outputs are also produced at that level. More specifically, "organizations" are generated as a member of "/document". In downstream skills, you can refer to this newly created information as "/document/organizations". If the "context" field is not explicitly set, the default context is the document.

  • The skill has one input called "text", whose source is set to "/document/content". The skill (named entity recognition) operates on the content field of each document, which is a standard field created by the Azure blob indexer.

  • The skill has one output called "organizations". Outputs exist only during processing. To chain this output to a downstream skill's input, reference the output as "/document/organizations".

  • For a particular document, the value of "/document/organizations" is an array of organizations extracted from the text. For example:

    ["Microsoft", "LinkedIn"]
    

Some situations call for referencing each element of an array separately. For example, suppose you want to pass each element of "/document/organizations" separately to another skill (such as the custom Bing entity search enricher). You can refer to each element of the array by adding an asterisk to the path: "/document/organizations/*".
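
To make the difference between the two paths concrete, the following Python sketch resolves a path against a small, hand-written stand-in for an enriched document. The resolve function and the dictionary layout are assumptions made for this illustration. The actual enrichment structure is internal to Azure Search.

```python
def resolve(tree, path):
    """Return the list of values that an enrichment path selects."""
    parts = path.strip("/").split("/")
    if parts[0] != "document":
        raise ValueError("paths in this sketch must start at /document")

    nodes = [tree]
    for part in parts[1:]:
        next_nodes = []
        for node in nodes:
            if part == "*":
                next_nodes.extend(node)        # fan out over array elements
            else:
                next_nodes.append(node[part])  # descend into a named child
        nodes = next_nodes
    return nodes


# Stand-in for one enriched document (hypothetical layout).
document = {
    "content": "In its fourth quarter, Microsoft logged $1.1 billion in revenue from LinkedIn ...",
    "organizations": ["Microsoft", "LinkedIn"],
}

print(resolve(document, "/document/organizations"))    # [['Microsoft', 'LinkedIn']]
print(resolve(document, "/document/organizations/*"))  # ['Microsoft', 'LinkedIn']
```

Without the asterisk the path selects one value, the whole array. With the asterisk it selects two values, one per organization, which is why a skill whose context ends in "/*" runs once per element.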

The second skill for sentiment extraction follows the same pattern as the first enricher. It takes "/document/content" as input, and returns a sentiment score for each content instance. Since you did not set the "context" field explicitly, the output (mySentiment) is now a child of "/document".

    {
      "@odata.type": "#Microsoft.Skills.Text.SentimentSkill",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "score",
          "targetName": "mySentiment"
        }
      ]
    }

Add a custom skill

Recall the structure of the custom Bing entity search enricher:

    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "description": "This skill calls an Azure function, which in turn calls Bing Entity Search",
      "uri": "https://indexer-e2e-webskill.azurewebsites.net/api/InvokeTextAnalyticsV3?code=foo",
      "httpHeaders": {
          "Ocp-Apim-Subscription-Key": "foobar"
      },
      "context": "/document/organizations/*",
      "inputs": [
        {
          "name": "query",
          "source": "/document/organizations/*"
        }
      ],
      "outputs": [
        {
          "name": "description",
          "targetName": "companyDescription"
        }
      ]
    }

This definition is a custom skill that calls a web API as part of the enrichment process. For each organization identified by named entity recognition, this skill calls a web API to find the description of that organization. The orchestration of when to call the web API and how to flow the information received is handled internally by the enrichment engine. However, the initialization necessary for calling this custom API must be provided in the JSON (such as uri, httpHeaders, and the inputs expected). For guidance in creating a custom web API for the enrichment pipeline, see How to define a custom interface.

Notice that the "context" field is set to "/document/organizations/*" with an asterisk, meaning the enrichment step is called for each organization under "/document/organizations".

Output, in this case a company description, is generated for each organization identified. Because the output's targetName is "companyDescription", a downstream step (for example, key phrase extraction) refers to the description with the path "/document/organizations/*/companyDescription".
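
The per-element behavior can be sketched in Python as follows. This is only an illustration of the fan-out, not the enrichment engine: run_per_element and fake_entity_search are hypothetical names, the lookup function is a stand-in that makes no web call, and the numeric indexes in the result paths are a notation chosen for this sketch.

```python
def run_per_element(tree, array_name, target_name, skill_fn):
    """Call skill_fn once per element of tree[array_name]; collect outputs by path."""
    annotations = {}
    for index, element in enumerate(tree[array_name]):
        path = f"/document/{array_name}/{index}/{target_name}"
        annotations[path] = skill_fn(element)
    return annotations


def fake_entity_search(query):
    """Hypothetical stand-in for the web API; a real skill would call the uri."""
    return f"(description of {query})"


document = {"organizations": ["Microsoft", "LinkedIn"]}

descriptions = run_per_element(
    document, "organizations", "companyDescription", fake_entity_search
)
for path, value in descriptions.items():
    print(path, "->", value)
```

Each organization gets its own companyDescription, stored next to the element it was computed from, which is what the path with the asterisk addresses collectively.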

Enrichments create structure out of unstructured information

The skillset generates structured information out of unstructured data. Consider the following example:

"In its fourth quarter, Microsoft logged $1.1 billion in revenue from LinkedIn, the social networking company it bought last year. The acquisition enables Microsoft to combine LinkedIn capabilities with its CRM and Office capabilities. Stockholders are excited with the progress so far."

A likely outcome would be a generated structure similar to the following illustration:

Sample output structure

Recall that this structure is internal. You cannot actually retrieve this graph in code.

Next steps

Now that you are familiar with the enrichment pipeline and skillsets, continue with How to reference annotations in a skillset or How to map outputs to fields in an index.