Tutorial: Extract text and structure from Azure blobs using REST APIs (Azure Cognitive Search)

If you have unstructured text or image content in Azure Blob storage, an AI enrichment pipeline can help you extract information and create new content that is useful for full-text search or knowledge mining scenarios. Although a pipeline can process image files (JPG, PNG, TIFF), this tutorial focuses on text-based content, applying language detection and text analytics to create new fields and information that you can use in queries, facets, and filters.

In this tutorial, you will:

  • Start with whole documents (unstructured text) such as PDF, MD, DOCX, and PPTX in Azure Blob storage.
  • Define a pipeline that extracts text, detects language, recognizes entities, and detects key phrases.
  • Define an index to store the output (raw content, plus pipeline-generated name-value pairs).
  • Execute the pipeline to start transformations and analysis, and to create and load the index.
  • Explore results using full text search and a rich query syntax.

You'll need several services to complete this walkthrough, plus the Postman desktop app or another Web testing tool to make REST API calls.

If you don't have an Azure subscription, open a free account before you begin.

Download files

  1. Open this OneDrive folder and on the top-left corner, click Download to copy the files to your computer.

  2. Right-click the zip file and select Extract All. There are 14 files of various types. You'll use 7 for this exercise.

1 - Create services

This walkthrough uses Azure Cognitive Search for indexing and queries, Cognitive Services for AI enrichment, and Azure Blob storage to provide the data. If possible, create all three services in the same region and resource group for proximity and manageability. In practice, your Azure Storage account can be in any region.

Start with Azure Storage

  1. Sign in to the Azure portal and click + Create Resource.

  2. Search for storage account and select Microsoft's Storage Account offering.

    Create Storage account

  3. In the Basics tab, the following items are required. Accept the defaults for everything else.

    • Resource group. Select an existing one or create a new one, but use the same group for all services so that you can manage them collectively.

    • Storage account name. If you think you might have multiple resources of the same type, use the name to disambiguate by type and region, for example blobstoragewestus.

    • Location. If possible, choose the same location used for Azure Cognitive Search and Cognitive Services. A single location avoids bandwidth charges.

    • Account Kind. Choose the default, StorageV2 (general purpose v2).

  4. Click Review + Create to create the service.

  5. Once it's created, click Go to resource to open the Overview page.

  6. Click Blobs service.

  7. Click + Container to create a container and name it cog-search-demo.

  8. Select cog-search-demo and then click Upload to open the folder where you saved the download files. Select all of the non-image files. You should have 7 files. Click OK to upload.

    Upload sample files

  9. Before you leave Azure Storage, get a connection string so that you can formulate a connection in Azure Cognitive Search.

    1. Browse back to the Overview page of your storage account (we used blobstoragewestus as an example).

    2. In the left navigation pane, select Access keys and copy one of the connection strings.

    The connection string is a URL similar to the following example:

    DefaultEndpointsProtocol=https;AccountName=cogsrchdemostorage;AccountKey=<your account key>;EndpointSuffix=core.windows.net
    
  10. Save the connection string to Notepad. You'll need it later when setting up the data source connection.

Cognitive Services

AI enrichment is backed by Cognitive Services, including Text Analytics and Computer Vision for natural language and image processing. If your objective was to complete an actual prototype or project, you would at this point provision Cognitive Services (in the same region as Azure Cognitive Search) so that you can attach it to indexing operations.

For this exercise, however, you can skip resource provisioning because Azure Cognitive Search can connect to Cognitive Services behind the scenes and give you 20 free transactions per indexer run. Since this tutorial uses 7 transactions, the free allocation is sufficient. For larger projects, plan on provisioning Cognitive Services at the pay-as-you-go S0 tier. For more information, see Attach Cognitive Services.

Azure Cognitive Search

The third component is Azure Cognitive Search, which you can create in the portal. You can use the Free tier to complete this walkthrough.

As with Azure Blob storage, take a moment to collect the access key. Further on, when you begin structuring requests, you will need to provide the endpoint and admin api-key used to authenticate each request.

  1. Sign in to the Azure portal, and in your search service Overview page, get the name of your search service. You can confirm your service name by reviewing the endpoint URL. If your endpoint URL were https://mydemo.search.windows.net, your service name would be mydemo.

  2. In Settings > Keys, get an admin key for full rights on the service. There are two interchangeable admin keys, provided for business continuity in case you need to roll one over. You can use either the primary or secondary key on requests for adding, modifying, and deleting objects.

    Get the query key as well. It's a best practice to issue query requests with read-only access.

Get the service name and admin and query keys

Every request sent to your service requires an api-key in the header. A valid key establishes trust, on a per-request basis, between the application sending the request and the service that handles it.

2 - Set up Postman

Start Postman and set up an HTTP request. If you are unfamiliar with this tool, see Explore Azure Cognitive Search REST APIs using Postman.

The request methods used in this tutorial are POST, PUT, and GET. You'll use these methods to make four API calls to your search service, creating a data source, a skillset, an index, and an indexer.

In Headers, set "Content-type" to application/json and set api-key to the admin api-key of your Azure Cognitive Search service. Once you set the headers, you can use them for every request in this exercise.

Postman request URL and header
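
If you prefer to script these calls instead of using Postman, the same headers can be set once and reused for every request. The following is a minimal sketch in Python using the requests library; the service name and admin api-key are placeholders to replace with your own values. Later sketches in this tutorial reuse the session, ENDPOINT, and API_VERSION defined here.

    import requests

    # Placeholders: substitute your search service name and admin api-key.
    SERVICE = "YOUR-SERVICE-NAME"
    ADMIN_API_KEY = "YOUR-ADMIN-API-KEY"
    ENDPOINT = f"https://{SERVICE}.search.windows.net"
    API_VERSION = "2019-05-06"

    # A session applies the Content-Type and api-key headers to every request,
    # mirroring the Postman header setup described above.
    session = requests.Session()
    session.headers.update({
        "Content-Type": "application/json",
        "api-key": ADMIN_API_KEY,
    })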

3 - Create the pipeline

In Azure Cognitive Search, AI processing occurs during indexing (or data ingestion). This part of the walkthrough creates four objects: a data source, a skillset, an index, and an indexer.

Step 1: Create a data source

A data source object provides the connection string to the Blob container containing the files.

  1. Use POST and the following URL, replacing YOUR-SERVICE-NAME with the actual name of your service.

    https://[YOUR-SERVICE-NAME].search.windows.net/datasources?api-version=2019-05-06
    
  2. In request Body, copy the following JSON definition, replacing the connectionString with the actual connection of your storage account.

    Remember to edit the container name as well. We suggested "cog-search-demo" for the container name in an earlier step.

    {
      "name" : "cog-search-demo-ds",
      "description" : "Demo files to demonstrate cognitive search capabilities.",
      "type" : "azureblob",
      "credentials" :
      { "connectionString" :
        "DefaultEndpointsProtocol=https;AccountName=<YOUR-STORAGE-ACCOUNT>;AccountKey=<YOUR-ACCOUNT-KEY>;"
      },
      "container" : { "name" : "<YOUR-BLOB-CONTAINER-NAME>" }
    }
    
  3. Send the request. You should see a status code of 201 confirming success.

If you got a 403 or 404 error, check the request construction: api-version=2019-05-06 should be on the endpoint, api-key should be in the Header after Content-Type, and its value must be valid for a search service. You might want to run the JSON document through an online JSON validator to make sure the syntax is correct.
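
If you're scripting instead of using Postman, the same data source request can be sent as a POST, reusing the session and constants from the setup sketch above. The connection string and container name are placeholders.

    # Sketch only: create the data source programmatically (equivalent to the Postman request above).
    datasource = {
        "name": "cog-search-demo-ds",
        "description": "Demo files to demonstrate cognitive search capabilities.",
        "type": "azureblob",
        "credentials": {"connectionString": "<YOUR-STORAGE-CONNECTION-STRING>"},
        "container": {"name": "cog-search-demo"},
    }

    response = session.post(f"{ENDPOINT}/datasources?api-version={API_VERSION}", json=datasource)
    print(response.status_code)  # 201 confirms the data source was created
    response.raise_for_status()  # a 403 or 404 surfaces here as an exception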

Step 2: Create a skillset

A skillset object is a set of enrichment steps applied to your content.

  1. Use PUT and the following URL, replacing YOUR-SERVICE-NAME with the actual name of your service.

    https://[YOUR-SERVICE-NAME].search.windows.net/skillsets/cog-search-demo-ss?api-version=2019-05-06
    
  2. In request Body, copy the JSON definition below. This skillset consists of the following built-in skills.

    • Entity Recognition - Extracts the names of people, organizations, and locations from content in the blob container.
    • Language Detection - Detects the content's language.
    • Text Split - Breaks large content into smaller chunks before calling the key phrase extraction skill. Key phrase extraction accepts inputs of 50,000 characters or less. A few of the sample files need splitting up to fit within this limit.
    • Key Phrase Extraction - Pulls out the top key phrases.

    Each skill executes on the content of the document. During processing, Azure Cognitive Search cracks each document to read content from different file formats. Found text originating in the source file is placed into a generated content field, one for each document. As such, the input becomes "/document/content".

    For key phrase extraction, because we use the Text Split skill to break larger files into pages, the context for the key phrase extraction skill is "/document/pages/*" (for each page in the document) instead of "/document/content".

    {
      "description": "Extract entities, detect language and extract key-phrases",
      "skills":
      [
        {
          "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
          "categories": [ "Person", "Organization", "Location" ],
          "defaultLanguageCode": "en",
          "inputs": [
            { "name": "text", "source": "/document/content" }
          ],
          "outputs": [
            { "name": "persons", "targetName": "persons" },
            { "name": "organizations", "targetName": "organizations" },
            { "name": "locations", "targetName": "locations" }
          ]
        },
        {
          "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
          "inputs": [
            { "name": "text", "source": "/document/content" }
          ],
          "outputs": [
            { "name": "languageCode", "targetName": "languageCode" }
          ]
        },
        {
          "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
          "textSplitMode" : "pages",
          "maximumPageLength": 4000,
          "inputs": [
            { "name": "text", "source": "/document/content" },
            { "name": "languageCode", "source": "/document/languageCode" }
          ],
          "outputs": [
            { "name": "textItems", "targetName": "pages" }
          ]
        },
        {
          "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
          "context": "/document/pages/*",
          "inputs": [
            { "name": "text", "source": "/document/pages/*" },
            { "name":"languageCode", "source": "/document/languageCode" }
          ],
          "outputs": [
            { "name": "keyPhrases", "targetName": "keyPhrases" }
          ]
        }
      ]
    }
    

    A graphical representation of the skillset is shown below.

    Understand a skillset

  3. Send the request. Postman should return a status code of 201 confirming success.

Note

Outputs can be mapped to an index, used as input to a downstream skill, or both as is the case with language code. In the index, a language code is useful for filtering. As an input, language code is used by text analysis skills to inform the linguistic rules around word breaking. For more information about skillset fundamentals, see How to define a skillset.
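
If you're scripting these steps rather than using Postman, the skillset, index, and indexer can all be created with the same PUT pattern shown above. The sketch below reuses the session, ENDPOINT, and API_VERSION from the earlier setup sketch; the local JSON file name is hypothetical.

    import json

    # Helper: PUT an object definition (skillset, index, or indexer) to the service.
    def put_definition(collection, name, body):
        url = f"{ENDPOINT}/{collection}/{name}?api-version={API_VERSION}"
        response = session.put(url, json=body)
        response.raise_for_status()
        return response.status_code  # 201 on create; updates typically return 204

    # Example: load the skillset JSON shown above from a local file and send it.
    with open("cog-search-demo-ss.json") as f:  # hypothetical file holding the JSON body
        skillset_body = json.load(f)
    print(put_definition("skillsets", "cog-search-demo-ss", skillset_body))

The same helper works for the index definition in Step 3 and the indexer definition in Step 4.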

Step 3: Create an index

An index provides the schema used to create the physical expression of your content in inverted indexes and other constructs in Azure Cognitive Search. The largest component of an index is the fields collection, where data type and attributes determine contents and behaviors in Azure Cognitive Search.

  1. Use PUT and the following URL, replacing YOUR-SERVICE-NAME with the actual name of your service, to name your index.

    https://[YOUR-SERVICE-NAME].search.windows.net/indexes/cog-search-demo-idx?api-version=2019-05-06
    
  2. In request Body, copy the following JSON definition. The content field stores the document itself. Additional fields for languageCode, keyPhrases, and organizations represent new information (fields and values) created by the skillset.

    {
      "fields": [
        {
          "name": "id",
          "type": "Edm.String",
          "key": true,
          "searchable": true,
          "filterable": false,
          "facetable": false,
          "sortable": true
        },
        {
          "name": "metadata_storage_name",
          "type": "Edm.String",
          "searchable": false,
          "filterable": false,
          "facetable": false,
          "sortable": false
        },
        {
          "name": "content",
          "type": "Edm.String",
          "sortable": false,
          "searchable": true,
          "filterable": false,
          "facetable": false
        },
        {
          "name": "languageCode",
          "type": "Edm.String",
          "searchable": true,
          "filterable": false,
          "facetable": false
        },
        {
          "name": "keyPhrases",
          "type": "Collection(Edm.String)",
          "searchable": true,
          "filterable": false,
          "facetable": false
        },
        {
          "name": "persons",
          "type": "Collection(Edm.String)",
          "searchable": true,
          "sortable": false,
          "filterable": true,
          "facetable": true
        },
        {
          "name": "organizations",
          "type": "Collection(Edm.String)",
          "searchable": true,
          "sortable": false,
          "filterable": true,
          "facetable": true
        },
        {
          "name": "locations",
          "type": "Collection(Edm.String)",
          "searchable": true,
          "sortable": false,
          "filterable": true,
          "facetable": true
        }
      ]
    }
    
  3. Send the request. Postman should return a status code of 201 confirming success.
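
If you're scripting, one way to confirm the new index is to retrieve its definition with a GET request and check the fields collection. A minimal sketch, reusing the earlier session and constants:

    # Retrieve the index definition and list its field names.
    response = session.get(f"{ENDPOINT}/indexes/cog-search-demo-idx?api-version={API_VERSION}")
    response.raise_for_status()
    index_definition = response.json()
    print([field["name"] for field in index_definition["fields"]])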

Step 4: Create and run an indexer

An indexer drives the pipeline. The three components you have created thus far (data source, skillset, index) are inputs to an indexer. Creating the indexer on Azure Cognitive Search is the event that puts the entire pipeline into motion.

  1. Use PUT and the following URL, replacing YOUR-SERVICE-NAME with the actual name of your service, to name your indexer.

    https://[YOUR-SERVICE-NAME].search.windows.net/indexers/cog-search-demo-idxr?api-version=2019-05-06
    
  2. In request Body, copy the JSON definition below. Notice the field mapping elements; these mappings are important because they define the data flow.

    The fieldMappings are processed before the skillset, sending content from the data source to target fields in an index. You'll use field mappings to send existing, unmodified content to the index. If field names and types are the same at both ends, no mapping is required.

    The outputFieldMappings are for fields created by skills, and thus processed after the skillset has run. The references to sourceFieldNames in outputFieldMappings don't exist until document cracking or enrichment creates them. The targetFieldName is a field in an index, defined in the index schema.

    {
      "name":"cog-search-demo-idxr",	
      "dataSourceName" : "cog-search-demo-ds",
      "targetIndexName" : "cog-search-demo-idx",
      "skillsetName" : "cog-search-demo-ss",
      "fieldMappings" : [
        {
          "sourceFieldName" : "metadata_storage_path",
          "targetFieldName" : "id",
          "mappingFunction" :
            { "name" : "base64Encode" }
        },
        {
          "sourceFieldName" : "metadata_storage_name",
          "targetFieldName" : "metadata_storage_name",
          "mappingFunction" :
            { "name" : "base64Encode" }
        },
        {
          "sourceFieldName" : "content",
          "targetFieldName" : "content"
        }
      ],
      "outputFieldMappings" :
      [
        {
          "sourceFieldName" : "/document/persons",
          "targetFieldName" : "persons"
        },
        {
          "sourceFieldName" : "/document/organizations",
          "targetFieldName" : "organizations"
        },
        {
          "sourceFieldName" : "/document/locations",
          "targetFieldName" : "locations"
        },
        {
          "sourceFieldName" : "/document/pages/*/keyPhrases/*",
          "targetFieldName" : "keyPhrases"
        },
        {
          "sourceFieldName": "/document/languageCode",
          "targetFieldName": "languageCode"
        }
      ],
      "parameters":
      {
        "maxFailedItems":-1,
        "maxFailedItemsPerBatch":-1,
        "configuration":
        {
          "dataToExtract": "contentAndMetadata",
          "parsingMode": "default",
          "firstLineContainsHeaders": false,
          "delimitedTextDelimiter": ","
        }
      }
    }
    
  3. Send the request. Postman should return a status code of 201 confirming successful processing.

    Expect this step to take several minutes to complete. Even though the data set is small, analytical skills are computation-intensive.

Note

Creating an indexer invokes the pipeline. If there are problems reaching the data, mapping inputs and outputs, or order of operations, they appear at this stage. To re-run the pipeline with code or script changes, you might need to drop objects first. For more information, see Reset and re-run.

About indexer parameters

The script sets "maxFailedItems" to -1, which instructs the indexing engine to ignore errors during data import. This is acceptable because there are so few documents in the demo data source. For a larger data source, you would set the value to greater than 0.

The "dataToExtract":"contentAndMetadata" statement tells the indexer to automatically extract the content from different file formats as well as metadata related to each file.

When content is extracted, you can set imageAction to extract text from images found in the data source. The "imageAction":"generateNormalizedImages" configuration, combined with the OCR Skill and Text Merge Skill, tells the indexer to extract text from the images (for example, the word "stop" from a traffic Stop sign), and embed it as part of the content field. This behavior applies to both the images embedded in the documents (think of an image inside a PDF), as well as images found in the data source, for instance a JPG file.
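
As an illustration of where that setting goes, the optional imageAction value belongs in the same configuration block as dataToExtract and parsingMode. The sketch below shows only the parameters object; the OCR and Text Merge skills would also have to be added to the skillset, which this tutorial doesn't cover.

    # Illustrative parameters block for an image-enabled scenario (not used in this tutorial).
    indexer_parameters = {
        "maxFailedItems": -1,
        "maxFailedItemsPerBatch": -1,
        "configuration": {
            "dataToExtract": "contentAndMetadata",
            "parsingMode": "default",
            # Uncomment to generate normalized images for OCR-based enrichment:
            # "imageAction": "generateNormalizedImages",
        },
    }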

4 - Monitor indexing

Indexing and enrichment commence as soon as you submit the Create Indexer request. Depending on which cognitive skills you defined, indexing can take a while. To find out whether the indexer is still running, send the following request to check the indexer status.

  1. Use GET and the following URL, replacing YOUR-SERVICE-NAME with the actual name of your service, to check the indexer status.

    https://[YOUR-SERVICE-NAME].search.windows.net/indexers/cog-search-demo-idxr/status?api-version=2019-05-06
    
  2. Review the response to learn whether the indexer is running, or to view error and warning information.

If you are using the Free tier, the following message is expected: "Could not extract content or metadata from your document. Truncated extracted text to '32768' characters". This message appears because blob indexing on the Free tier has a 32K limit on character extraction. You won't see this message for this data set on higher tiers.
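
If you're scripting, you can poll the status URL until the most recent execution finishes. The sketch below reuses the earlier session and constants, and assumes the response includes a lastResult object whose status reads "inProgress" while a run is underway.

    import time

    # Poll the indexer status endpoint until the latest run is no longer in progress.
    status_url = f"{ENDPOINT}/indexers/cog-search-demo-idxr/status?api-version={API_VERSION}"
    while True:
        status = session.get(status_url).json()
        last_run = status.get("lastResult") or {}
        print("indexer:", status.get("status"), "| last run:", last_run.get("status"))
        if last_run.get("status") not in (None, "inProgress"):
            break
        time.sleep(10)  # wait before checking again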

Note

Warnings are common in some scenarios and do not always indicate a problem. For example, if a blob container includes image files, and the pipeline doesn't handle images, you'll get a warning stating that images were not processed.

5 - Search

Now that you've created new fields and information, let's run some queries to understand the value of cognitive search as it relates to a typical search scenario.

Recall that we started with blob content, where the entire document is packaged into a single content field. You can search this field and find matches to your queries.

  1. Use GET and the following URL, replacing YOUR-SERVICE-NAME with the actual name of your service, to query the index, returning the content field and a count of all matching documents.

    https://[YOUR-SERVICE-NAME].search.windows.net/indexes/cog-search-demo-idx/docs?search=*&$count=true&$select=content&api-version=2019-05-06
    

    The results of this query return document contents, which is the same result you would get if you used the blob indexer without the cognitive search pipeline. This field is searchable, but unworkable if you want to use facets, filters, or autocomplete.

    Content field output

  2. For the second query, return some of the new fields created by the pipeline (persons, organizations, locations, languageCode). We're omitting keyPhrases for brevity, but you should include it if you want to see those values.

    https://[YOUR-SERVICE-NAME].search.windows.net/indexes/cog-search-demo-idx/docs?search=*&$count=true&$select=metadata_storage_name,persons,organizations,locations,languageCode&api-version=2019-05-06
    

    The fields in the $select statement contain new information created from the natural language processing capabilities of Cognitive Services. As you might expect, there is some noise in the results and variation across documents, but in many instances, the analytical models produce accurate results.

    The following image shows results for Satya Nadella's open letter upon assuming the CEO role at Microsoft.

    Pipeline output

  3. To see how you might take advantage of these fields, add a facet parameter to return an aggregation of matching documents by location.

    https://[YOUR-SERVICE-NAME].search.windows.net/indexes/cog-search-demo-idx/docs?search=*&facet=locations&api-version=2019-05-06
    

    In this example, for each location, there are 2 or 3 matches.

    Facet output

  4. In this final example, apply a filter on the organizations collection, returning two matches for filter criteria based on NASDAQ.

    https://[YOUR-SERVICE-NAME].search.windows.net/indexes/cog-search-demo-idx/docs?search=*&$filter=organizations/any(organizations: organizations eq 'NASDAQ')&$select=metadata_storage_name,organizations&$count=true&api-version=2019-05-06
    

These queries illustrate a few of the ways you can work with query syntax and filters on new fields created by cognitive search. For more query examples, see Examples in Search Documents REST API, Simple syntax query examples, and Full Lucene query examples.
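
These queries can also be issued from a script. The following sketch sends the filter query from the last example as query parameters, reusing the earlier session and constants; a read-only query key would also be sufficient for search requests.

    # Run the NASDAQ filter query and print the matching documents.
    docs_url = f"{ENDPOINT}/indexes/cog-search-demo-idx/docs"
    params = {
        "api-version": API_VERSION,
        "search": "*",
        "$count": "true",
        "$select": "metadata_storage_name,organizations",
        "$filter": "organizations/any(organizations: organizations eq 'NASDAQ')",
    }
    results = session.get(docs_url, params=params).json()
    print("count:", results.get("@odata.count"))
    for doc in results["value"]:
        print(doc["metadata_storage_name"], doc["organizations"])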

Reset and rerun

In the early experimental stages of pipeline development, the most practical approach for design iterations is to delete the objects from Azure Cognitive Search and allow your code to rebuild them. Resource names are unique. Deleting an object lets you recreate it using the same name.

To reindex your documents with the new definitions:

  1. Delete the indexer, index, and skillset.
  2. Modify objects.
  3. Recreate on your service to run the pipeline.

You can use the portal to delete indexes, indexers, and skillsets, or use DELETE and provide URLs to each object. The following command deletes an indexer.

DELETE https://[YOUR-SERVICE-NAME].search.windows.net/indexers/cog-search-demo-idxr?api-version=2019-05-06

Status code 204 is returned on successful deletion.
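
If you're scripting the reset, the same DELETE pattern applies to all three objects. A minimal sketch, reusing the earlier session and constants:

    # Delete the indexer, index, and skillset so they can be recreated under the same names.
    for collection, name in [
        ("indexers", "cog-search-demo-idxr"),
        ("indexes", "cog-search-demo-idx"),
        ("skillsets", "cog-search-demo-ss"),
    ]:
        response = session.delete(f"{ENDPOINT}/{collection}/{name}?api-version={API_VERSION}")
        print(collection, name, response.status_code)  # expect 204 on success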

As your code matures, you might want to refine a rebuild strategy. For more information, see How to rebuild an index.

Takeaways

This tutorial demonstrates the basic steps for building an enriched indexing pipeline through the creation of component parts: a data source, skillset, index, and indexer.

Built-in skills were introduced, along with skillset definition and the mechanics of chaining skills together through inputs and outputs. You also learned that outputFieldMappings in the indexer definition is required for routing enriched values from the pipeline into a searchable index on an Azure Cognitive Search service.

Finally, you learned how to test results and reset the system for further iterations. You learned that issuing queries against the index returns the output created by the enriched indexing pipeline.

Clean up resources

The fastest way to clean up after a tutorial is by deleting the resource group containing the Azure Cognitive Search service and Azure Blob service. Assuming you put both services in the same group, delete the resource group now to permanently delete everything in it, including the services and any stored content that you created for this tutorial. In the portal, the resource group name is on the Overview page of each service.

Next steps

Customize or extend the pipeline with custom skills. Creating a custom skill and adding it to a skillset allows you to onboard text or image analysis that you write yourself.