Document Extraction cognitive skill

Important

This skill is currently in public preview. Preview functionality is provided without a service level agreement, and is not recommended for production workloads. For more information, see Supplemental Terms of Use for Microsoft Azure Previews. The REST API version 2019-05-06-Preview provides preview features. There is currently no portal or .NET SDK support.

The Document Extraction skill extracts content from a file within the enrichment pipeline. This lets you apply the document extraction step, which normally runs before skillset execution, to files that may be generated by other skills.

Note

As you expand scope by increasing the frequency of processing, adding more documents, or adding more AI algorithms, you will need to attach a billable Cognitive Services resource. Charges accrue when calling APIs in Cognitive Services, and for image extraction as part of the document-cracking stage in indexing. There are no charges for text extraction from documents.

Execution of built-in skills is charged at the existing Cognitive Services pay-as-you-go price. Image extraction pricing is described on the pricing page.

@odata.type

Microsoft.Skills.Util.DocumentExtractionSkill

Skill parameters

Parameters are case-sensitive.

Parameter name, allowed values, and description:

  • parsingMode (allowed values: default, text, json). Set to default for document extraction from files that are not pure text or json. Set to text to improve performance on plain text files. Set to json to extract structured content from json files. If parsingMode is not defined explicitly, it will be set to default.

  • dataToExtract (allowed values: contentAndMetadata, allMetadata). Set to contentAndMetadata to extract all metadata and textual content from each file. Set to allMetadata to extract only the content-type specific metadata (for example, metadata unique to just .png files). If dataToExtract is not defined explicitly, it will be set to contentAndMetadata.

  • configuration (allowed values: see below). A dictionary of optional parameters that adjust how the document extraction is performed. See the list below for descriptions of supported configuration properties.

Configuration parameter, allowed values, and description:

  • imageAction (allowed values: none, generateNormalizedImages, generateNormalizedImagePerPage). Set to none to ignore embedded images or image files in the data set. This is the default.

    For image analysis using cognitive skills, set to generateNormalizedImages to have the skill create an array of normalized images as part of document cracking. This action requires that parsingMode is set to default and dataToExtract is set to contentAndMetadata. A normalized image refers to additional processing resulting in uniform image output, sized and rotated to promote consistent rendering when you include images in visual search results (for example, same-size photographs in a graph control as seen in the JFK demo). This information is generated for each image when you use this option.

    If you set to generateNormalizedImagePerPage, PDF files will be treated differently in that instead of extracting embedded images, each page will be rendered as an image and normalized accordingly. Non-PDF file types will be treated the same as if generateNormalizedImages was set.

  • normalizedImageMaxWidth (allowed values: any integer between 50-10000). The maximum width (in pixels) for normalized images generated. The default is 2000.

  • normalizedImageMaxHeight (allowed values: any integer between 50-10000). The maximum height (in pixels) for normalized images generated. The default is 2000.
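
The constraints listed above can be checked before a skillset is submitted. The following is a minimal Python sketch, not part of the service or any SDK: the function name validate_skill_parameters is hypothetical, and it encodes only the rules stated in this section (allowed values, the 50-10000 range, and the requirement that generateNormalizedImages be paired with parsingMode default and dataToExtract contentAndMetadata). Whether the same pairing requirement applies to generateNormalizedImagePerPage is not stated here, so the sketch does not enforce it.

```python
ALLOWED_PARSING_MODES = {"default", "text", "json"}
ALLOWED_DATA_TO_EXTRACT = {"contentAndMetadata", "allMetadata"}
ALLOWED_IMAGE_ACTIONS = {
    "none",
    "generateNormalizedImages",
    "generateNormalizedImagePerPage",
}


def validate_skill_parameters(skill: dict) -> list:
    """Return a list of problems found in a skill definition (hypothetical helper)."""
    problems = []

    # Defaults follow the documented behavior when a parameter is omitted.
    parsing_mode = skill.get("parsingMode", "default")
    data_to_extract = skill.get("dataToExtract", "contentAndMetadata")
    config = skill.get("configuration", {})
    image_action = config.get("imageAction", "none")

    if parsing_mode not in ALLOWED_PARSING_MODES:
        problems.append(f"parsingMode '{parsing_mode}' is not an allowed value")
    if data_to_extract not in ALLOWED_DATA_TO_EXTRACT:
        problems.append(f"dataToExtract '{data_to_extract}' is not an allowed value")
    if image_action not in ALLOWED_IMAGE_ACTIONS:
        problems.append(f"imageAction '{image_action}' is not an allowed value")

    # Only the generateNormalizedImages requirement is stated in the text.
    if image_action == "generateNormalizedImages" and (
        parsing_mode != "default" or data_to_extract != "contentAndMetadata"
    ):
        problems.append(
            "generateNormalizedImages requires parsingMode 'default' "
            "and dataToExtract 'contentAndMetadata'"
        )

    for name in ("normalizedImageMaxWidth", "normalizedImageMaxHeight"):
        value = config.get(name, 2000)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 50 <= value <= 10000:
            problems.append(f"{name} must be an integer between 50 and 10000")

    return problems
```

An empty list means no problem was detected by these checks; it does not guarantee that the service will accept the definition.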

Note

The default of 2000 pixels for the normalized images maximum width and height is based on the maximum sizes supported by the OCR skill and the image analysis skill. The OCR skill supports a maximum width and height of 4200 for non-English languages, and 10000 for English. If you increase the maximum limits, processing could fail on larger images depending on your skillset definition and the language of the documents.
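
As a rough illustration of the note above, the following Python sketch flags configurations whose maximum dimensions exceed the OCR limits quoted in this section (4200 for non-English languages, 10000 for English). The function names are hypothetical, and treating any language code that starts with "en" as English is an assumption made for the example.

```python
def ocr_size_limit(language_code: str) -> int:
    """Return the OCR maximum width/height quoted in this section.

    Assumption: language codes beginning with "en" denote English.
    """
    return 10000 if language_code.lower().startswith("en") else 4200


def may_exceed_ocr_limit(max_width: int, max_height: int, language_code: str) -> bool:
    """True if normalized images could be larger than the OCR skill supports."""
    limit = ocr_size_limit(language_code)
    return max_width > limit or max_height > limit
```

For example, a configuration of 5000 by 5000 pixels stays within the quoted limit for English documents but could fail for French ones.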

Skill inputs

Input name and description:

  • file_data: The file that content should be extracted from.

The "file_data" input must be an object defined as follows:

{
  "$type": "file",
  "data": "BASE64 encoded string of the file"
}

This file reference object can be generated in one of three ways:

  • Setting the allowSkillsetToReadFileData parameter on your indexer definition to "true". This will create a path /document/file_data that is an object representing the original file data downloaded from your blob data source. This parameter only applies to data in Blob storage.

  • Setting the imageAction parameter on your indexer definition to a value other than none. This creates an array of images that follows the required convention for input to this skill if passed individually (i.e. /document/normalized_images/*).

  • Having a custom skill return a JSON object defined exactly as above. The $type parameter must be set to exactly file, and the data parameter must be the Base64-encoded byte array data of the file content.
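
For the custom skill option, the file reference object can be built with a few lines of Python. This is a minimal sketch using only the standard library; the function name make_file_reference is hypothetical and not part of any SDK.

```python
import base64


def make_file_reference(file_bytes: bytes) -> dict:
    """Wrap raw file bytes in the object shape required by the file_data input."""
    return {
        "$type": "file",  # must be exactly "file"
        "data": base64.b64encode(file_bytes).decode("ascii"),
    }
```

Calling make_file_reference(b"hello") produces a data value of "aGVsbG8=", the same string used in the sample input later in this article.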

Skill outputs

Output name and description:

  • content: The textual content of the document.

  • normalized_images: When imageAction is set to a value other than none, the new normalized_images field will contain an array of images. See the documentation for image extraction for more details on the output format of each image.

Sample definition

{
  "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
  "parsingMode": "default",
  "dataToExtract": "contentAndMetadata",
  "configuration": {
    "imageAction": "generateNormalizedImages",
    "normalizedImageMaxWidth": 2000,
    "normalizedImageMaxHeight": 2000
  },
  "context": "/document",
  "inputs": [
    {
      "name": "file_data",
      "source": "/document/file_data"
    }
  ],
  "outputs": [
    {
      "name": "content",
      "targetName": "content"
    },
    {
      "name": "normalized_images",
      "targetName": "normalized_images"
    }
  ]
}
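
If skill definitions are generated programmatically, the same structure can be assembled in Python and serialized to JSON. The sketch below assumes a hypothetical helper named build_document_extraction_skill; its parameter names are illustrative, and only the JSON keys and values come from the sample definition above.

```python
import json


def build_document_extraction_skill(
    source: str = "/document/file_data",
    image_action: str = "none",
    max_width: int = 2000,
    max_height: int = 2000,
) -> dict:
    """Assemble a skill definition shaped like the sample in this section."""
    return {
        "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
        "parsingMode": "default",
        "dataToExtract": "contentAndMetadata",
        "configuration": {
            "imageAction": image_action,
            "normalizedImageMaxWidth": max_width,
            "normalizedImageMaxHeight": max_height,
        },
        "context": "/document",
        "inputs": [{"name": "file_data", "source": source}],
        "outputs": [
            {"name": "content", "targetName": "content"},
            {"name": "normalized_images", "targetName": "normalized_images"},
        ],
    }


if __name__ == "__main__":
    skill = build_document_extraction_skill(image_action="generateNormalizedImages")
    print(json.dumps(skill, indent=2))
```

The source argument makes it easy to point the skill at a file reference produced by another skill instead of /document/file_data.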

Sample input

{
  "values": [
    {
      "recordId": "1",
      "data":
      {
        "file_data": {
          "$type": "file",
          "data": "aGVsbG8="
        }
      }
    }
  ]
}

Sample output

{
  "values": [
    {
      "recordId": "1",
      "data": {
        "content": "hello",
        "normalized_images": []
      }
    }
  ]
}
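
To make the relationship between the sample input and sample output concrete, the following Python sketch reproduces it locally for the simplest case. It is a stand-in for illustration only and does not reflect how the service performs extraction: it assumes the file is UTF-8 plain text and that no images are generated. The function name simulate_text_extraction is hypothetical.

```python
import base64


def simulate_text_extraction(request: dict) -> dict:
    """Decode each record's file_data as UTF-8 text (local illustration only)."""
    results = []
    for record in request["values"]:
        file_ref = record["data"]["file_data"]
        if file_ref.get("$type") != "file":
            raise ValueError("file_data must have $type set to 'file'")
        text = base64.b64decode(file_ref["data"]).decode("utf-8")
        results.append(
            {
                "recordId": record["recordId"],
                # No image handling in this sketch, so the array stays empty.
                "data": {"content": text, "normalized_images": []},
            }
        )
    return {"values": results}


sample_request = {
    "values": [
        {
            "recordId": "1",
            "data": {"file_data": {"$type": "file", "data": "aGVsbG8="}},
        }
    ]
}
response = simulate_text_extraction(sample_request)
```

With the sample input, response matches the sample output shown above, with a content value of "hello".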
