Entity Recognition cognitive skill (v2)

The Entity Recognition skill (v2) extracts entities of different types from text. This skill uses the machine learning models provided by Text Analytics in Azure AI services.

Important

The Entity Recognition skill (v2) (Microsoft.Skills.Text.EntityRecognitionSkill) is now discontinued and replaced by Microsoft.Skills.Text.V3.EntityRecognitionSkill. Follow the recommendations in Deprecated skills to migrate to a supported skill.

Note

As you expand scope by increasing the frequency of processing, adding more documents, or adding more AI algorithms, you will need to attach a billable Azure AI services resource. Charges accrue when calling APIs in Azure AI services, and for image extraction as part of the document-cracking stage in Azure AI Search. There are no charges for text extraction from documents.

Execution of built-in skills is charged at the existing Azure AI services pay-as-you-go price. Image extraction pricing is described on the Azure AI Search pricing page.

@odata.type

Microsoft.Skills.Text.EntityRecognitionSkill

Data limits

The maximum size of a record should be 50,000 characters as measured by String.Length. If you need to break up your data before sending it to the entity recognition skill, consider using the Text Split skill. If you do use the Text Split skill, set the page length to 5000 for the best performance.
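
To make the limit concrete, the following is a minimal Python sketch of checking a record against the 50,000-character limit and cutting it into 5000-character pages. It is not the Text Split skill: the helper names utf16_length and split_into_pages are hypothetical, and the splitting is a naive character cut that ignores sentence boundaries. Because String.Length in .NET counts UTF-16 code units, the sketch measures length the same way rather than using Python's len, which counts code points.

```python
def utf16_length(text: str) -> int:
    """Length in UTF-16 code units, which is what .NET String.Length reports."""
    return len(text.encode("utf-16-le")) // 2


def split_into_pages(text: str, page_length: int = 5000) -> list[str]:
    """Naively split text into pages of at most page_length UTF-16 code units.

    Hypothetical helper for illustration only; it never splits a single
    code point across two pages, but it does not respect sentence boundaries.
    """
    pages, current, current_len = [], [], 0
    for ch in text:
        ch_len = utf16_length(ch)
        # Start a new page when adding this character would exceed the limit.
        if current and current_len + ch_len > page_length:
            pages.append("".join(current))
            current, current_len = [], 0
        current.append(ch)
        current_len += ch_len
    if current:
        pages.append("".join(current))
    return pages


MAX_RECORD_LENGTH = 50_000  # record limit stated in this section

record = "a" * 12_000
print(utf16_length(record) <= MAX_RECORD_LENGTH)  # True
print([utf16_length(page) for page in split_into_pages(record)])  # [5000, 5000, 2000]
```

In an actual skillset, the Text Split skill would normally do this work upstream of the entity recognition skill; the sketch only illustrates how the two numbers in this section relate.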

Skill parameters

Parameters are case-sensitive and are all optional.

Parameter name Description
categories Array of categories that should be extracted. Possible category types: "Person", "Location", "Organization", "Quantity", "Datetime", "URL", "Email". If no category is provided, all types are returned.
defaultLanguageCode Language code of the input text. The following languages are supported: ar, cs, da, de, en, es, fi, fr, hu, it, ja, ko, nl, no, pl, pt-BR, pt-PT, ru, sv, tr, zh-hans. Not all entity categories are supported for all languages; see note below.
minimumPrecision A value between 0 and 1. If the confidence score (in the namedEntities output) is lower than this value, the entity is not returned. The default is 0.
includeTypelessEntities Set to true if you want to recognize well-known entities that don't fit the current categories. Recognized entities are returned in the entities complex output field. For example, "Windows 10" is a well-known entity (a product), but since "Products" is not a supported category, this entity would be included in the entities output field. The default is false.
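
The effect of categories and minimumPrecision can be illustrated with a short Python sketch. This is an assumption about the observable behavior described in the table above, not the service's implementation: the function filter_named_entities is hypothetical, and the second entity with its 0.42 confidence is made-up illustrative data.

```python
def filter_named_entities(named_entities, categories=None, minimum_precision=0.0):
    """Keep entities in the requested categories whose confidence is not
    lower than minimum_precision. Hypothetical helper for illustration."""
    kept = []
    for entity in named_entities:
        # An empty or missing categories list means all types are returned.
        if categories and entity["category"] not in categories:
            continue
        # Entities with a confidence lower than the threshold are dropped.
        if entity["confidence"] < minimum_precision:
            continue
        kept.append(entity)
    return kept


named_entities = [
    {"category": "Person", "value": "John Smith", "offset": 35, "confidence": 0.98},
    # Hypothetical low-confidence entity, included only to show the filter.
    {"category": "Organization", "value": "Contoso", "offset": 0, "confidence": 0.42},
]

for entity in filter_named_entities(named_entities, minimum_precision=0.5):
    print(entity["value"], entity["confidence"])  # John Smith 0.98
```

With the default minimumPrecision of 0, nothing is filtered by confidence, so both entities in this example would be kept.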

Skill inputs

Input name Description
languageCode Optional. Default is "en".
text The text to analyze.

Skill outputs

Note

Not all entity categories are supported for all languages. The "Person", "Location", and "Organization" entity category types are supported for the full list of languages above. Only de, en, es, fr, and zh-hans support extraction of "Quantity", "Datetime", "URL", and "Email" types. For more information, see Language and region support for the Text Analytics API.
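
The language rules in this note can be captured in a small lookup, sketched here in Python. The language and category sets are copied from this article; the function supported_categories is a hypothetical helper, not part of any SDK, and the authoritative list remains the Text Analytics language support page.

```python
# Language codes listed for defaultLanguageCode in this article.
ALL_LANGUAGES = {
    "ar", "cs", "da", "de", "en", "es", "fi", "fr", "hu", "it", "ja",
    "ko", "nl", "no", "pl", "pt-BR", "pt-PT", "ru", "sv", "tr", "zh-hans",
}
# Languages that also support the extended categories.
EXTENDED_LANGUAGES = {"de", "en", "es", "fr", "zh-hans"}

BASIC_CATEGORIES = {"Person", "Location", "Organization"}
EXTENDED_CATEGORIES = {"Quantity", "Datetime", "URL", "Email"}


def supported_categories(language_code: str) -> set[str]:
    """Return the entity categories this article lists for a language code.

    An empty set means the language is not in the supported list.
    """
    if language_code not in ALL_LANGUAGES:
        return set()
    if language_code in EXTENDED_LANGUAGES:
        return BASIC_CATEGORIES | EXTENDED_CATEGORIES
    return set(BASIC_CATEGORIES)


requested = {"Person", "Email"}
print(sorted(requested & supported_categories("en")))  # ['Email', 'Person']
print(sorted(requested & supported_categories("ja")))  # ['Person']
```

A check like this could run before a skillset is created, to flag a categories list that the chosen language cannot fully satisfy.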

Output name Description
persons An array of strings where each string represents the name of a person.
locations An array of strings where each string represents a location.
organizations An array of strings where each string represents an organization.
quantities An array of strings where each string represents a quantity.
dateTimes An array of strings where each string represents a DateTime (as it appears in the text) value.
urls An array of strings where each string represents a URL.
emails An array of strings where each string represents an email.
namedEntities An array of complex types that contains the following fields:
  • category
  • value (The actual entity name)
  • offset (The location where it was found in the text)
  • confidence (A higher value means it's more likely to be a real entity)
entities An array of complex types that contains rich information about the entities extracted from text, with the following fields:
  • name (the actual entity name. This represents a "normalized" form)
  • wikipediaId
  • wikipediaLanguage
  • wikipediaUrl (a link to Wikipedia page for the entity)
  • bingId
  • type (the category of the entity recognized)
  • subType (available only for certain categories, this gives a more granular view of the entity type)
  • matches (a complex collection that contains the following fields for each occurrence)
    • text (the raw text for the entity)
    • offset (the location where it was found)
    • length (the length of the raw entity text)

Sample definition

  {
    "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
    "categories": [ "Person", "Email"],
    "defaultLanguageCode": "en",
    "includeTypelessEntities": true,
    "minimumPrecision": 0.5,
    "inputs": [
      {
        "name": "text",
        "source": "/document/content"
      }
    ],
    "outputs": [
      {
        "name": "persons",
        "targetName": "people"
      },
      {
        "name": "emails",
        "targetName": "contact"
      },
      {
        "name": "entities"
      }
    ]
  }

Sample input

{
    "values": [
      {
        "recordId": "1",
        "data":
           {
             "text": "Contoso corporation was founded by John Smith. They can be reached at contact@contoso.com",
             "languageCode": "en"
           }
      }
    ]
}

Sample output

{
  "values": [
    {
      "recordId": "1",
      "data" : 
      {
        "persons": [ "John Smith"],
        "emails":["contact@contoso.com"],
        "namedEntities": 
        [
          {
            "category":"Person",
            "value": "John Smith",
            "offset": 35,
            "confidence": 0.98
          }
        ],
        "entities":  
        [
          {
            "name":"John Smith",
            "wikipediaId": null,
            "wikipediaLanguage": null,
            "wikipediaUrl": null,
            "bingId": null,
            "type": "Person",
            "subType": null,
            "matches": [{
                "text": "John Smith",
                "offset": 35,
                "length": 10
            }]
          },
          {
            "name": "contact@contoso.com",
            "wikipediaId": null,
            "wikipediaLanguage": null,
            "wikipediaUrl": null,
            "bingId": null,
            "type": "Email",
            "subType": null,
            "matches": [
            {
                "text": "contact@contoso.com",
                "offset": 70,
                "length": 19
            }]
          },
          {
            "name": "Contoso",
            "wikipediaId": "Contoso",
            "wikipediaLanguage": "en",
            "wikipediaUrl": "https://en.wikipedia.org/wiki/Contoso",
            "bingId": "349f014e-7a37-e619-0374-787ebb288113",
            "type": null,
            "subType": null,
            "matches": [
            {
                "text": "Contoso",
                "offset": 0,
                "length": 7
            }]
          }
        ]
      }
    }
  ]
}

The offsets returned for entities in the output of this skill come directly from the Text Analytics API. If you use them to index into the original string, use the StringInfo class in .NET to extract the correct content. More details can be found here.
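
As a quick check of how offset and length relate to the input, the following Python sketch slices the sample input using the matches from the sample output. The helper extract_match is hypothetical. Plain slicing works here only because the sample text is entirely ASCII; Python slices by code point, so for text containing combining characters or emoji the result may not line up with the offsets returned by the API, and a text-element-aware approach such as StringInfo in .NET would be needed.

```python
def extract_match(text: str, offset: int, length: int) -> str:
    """Slice text by offset and length. Assumes simple text where one
    code point corresponds to one text element (true for ASCII)."""
    return text[offset:offset + length]


# Text from the sample input.
text = (
    "Contoso corporation was founded by John Smith. "
    "They can be reached at contact@contoso.com"
)

# Matches copied from the sample output.
matches = [
    {"text": "John Smith", "offset": 35, "length": 10},
    {"text": "contact@contoso.com", "offset": 70, "length": 19},
    {"text": "Contoso", "offset": 0, "length": 7},
]

for match in matches:
    found = extract_match(text, match["offset"], match["length"])
    print(found == match["text"], found)
```

Each line printed starts with True, confirming that the offsets in the sample output point at the expected spans of the sample input.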

Warning cases

If the language code for the document is unsupported, a warning is returned and no entities are extracted.

See also