Tutorial: Search semi-structured data in Azure cloud storage

In this two-part tutorial series, you learn how to search semi-structured and unstructured data using Azure Search. Part 1 walked you through search over unstructured data and also covered important prerequisites for this tutorial, such as creating the storage account.

In Part 2, the focus shifts to semi-structured data, such as JSON, stored in Azure blobs. Semi-structured data contains tags or markings that separate content within the data. It splits the difference between unstructured data, which must be indexed holistically, and formally structured data that adheres to a data model, such as a relational database schema, and can be crawled on a per-field basis.

In Part 2, learn how to:

  • Configure an Azure Search data source for an Azure blob container
  • Create and populate an Azure Search index and indexer to crawl the container and extract searchable content
  • Search the index you just created

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites

  • Completion of the previous tutorial, which provides the storage account and search service used here.

  • Installation of a REST client and an understanding of how to construct an HTTP request. For the purposes of this tutorial, we are using Postman. Feel free to use a different REST client if you're already comfortable with a particular one.

Note

This tutorial relies on JSON array support, which is currently a preview feature in Azure Search. It is not available in the portal. For this reason, we're using the preview REST API, which provides this feature, and a REST client tool to call the API.

Set up Postman

Start Postman and set up an HTTP request. If you are unfamiliar with this tool, see Explore Azure Search REST APIs using Fiddler or Postman for more information.

The request method for every call in this tutorial is POST. The header keys are "Content-type" and "api-key", and their values are "application/json" and your admin key (the primary key of your search service), respectively. The body holds the actual content of your call. Depending on the client you're using, the way you construct the request may vary, but those are the basics.
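
If you prefer a script to a REST client, the same request shape can be sketched in Python with the standard library. This is a minimal sketch, not part of the original tutorial: the helper name build_request, the service name example-service, and the sample body are hypothetical placeholders, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_request(url: str, admin_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the two headers named above."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/json", "api-key": admin_key},
    )


# Hypothetical values: replace the service name and key with your own.
req = build_request(
    "https://example-service.search.windows.net/datasources?api-version=2016-09-01-Preview",
    "<admin key>",
    {"name": "example"},
)
print(req.get_method(), req.full_url)
```

Passing the built request to urllib.request.urlopen would send it; that step is left out so the sketch can be run without a live service.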

The REST calls covered in this tutorial require your search api-key. You can find your api-key under Keys inside your search service. This api-key must be in the header of every API call this tutorial directs you to make, in place of the "admin key" placeholder. Retain the key since you need it for each call.

Download the sample data

A sample data set has been prepared for you. Download clinical-trials-json.zip and unzip it to its own folder.

Contained in the sample are example JSON files, which were originally text files obtained from clinicaltrials.gov. We have converted them to JSON for your convenience.

Sign in to Azure

Sign in to the Azure portal.

Upload the sample data

In the Azure portal, navigate back to the storage account created in the previous tutorial. Then open the data container, and click Upload.

Click Advanced, enter "clinical-trials-json" as the folder name, and then upload all of the JSON files you downloaded.

After the upload completes, the files should appear in their own subfolder inside the data container.

Connect your search service to your container

We are using Postman to make three API calls to your search service in order to create a data source, an index, and an indexer. The data source includes a pointer to your storage account and your JSON data. Your search service makes the connection when loading the data.

The query string must contain api-version=2016-09-01-Preview, and each call should return 201 Created. The generally available API version can't yet parse JSON as a jsonArray; currently only the preview API version can.

Execute the following three API calls from your REST client.
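
The three endpoints differ only in the collection name, so they can be generated from one template. The following is a rough sketch under that assumption; the names endpoint and check_created and the service name example-service are hypothetical and used for illustration only.

```python
API_VERSION = "2016-09-01-Preview"  # the preview version this tutorial requires


def endpoint(service_name: str, collection: str) -> str:
    """Return the URL for one of the three collections used in this tutorial."""
    return (
        f"https://{service_name}.search.windows.net/"
        f"{collection}?api-version={API_VERSION}"
    )


def check_created(status_code: int) -> None:
    """Raise if a create call did not return 201 Created."""
    if status_code != 201:
        raise RuntimeError(f"expected 201 Created, got {status_code}")


# Hypothetical service name: replace with your own.
urls = [endpoint("example-service", c) for c in ("datasources", "indexes", "indexers")]
for url in urls:
    print(url)
```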

Create a data source

A data source specifies what data to index.

The endpoint of this call is https://[service name].search.windows.net/datasources?api-version=2016-09-01-Preview. Replace [service name] with the name of your search service.

For this call, you need the name of your storage account and your storage account key. The storage account key can be found in the Azure portal inside your storage account's Access Keys. The location is shown in the following image:

[Image: storage account Access Keys in the Azure portal]

Make sure to replace the [storage account name] and [storage account key] in the body of your call before executing the call.

{
    "name" : "clinical-trials-json",
    "type" : "azureblob",
    "credentials" : { "connectionString" : "DefaultEndpointsProtocol=https;AccountName=[storage account name];AccountKey=[storage account key];" },
    "container" : { "name" : "data", "query" : "clinical-trials-json" }
}

The response should look like:

{
    "@odata.context": "https://exampleurl.search.windows.net/$metadata#datasources/$entity",
    "@odata.etag": "\"0x8D505FBC3856C9E\"",
    "name": "clinical-trials-json",
    "description": null,
    "type": "azureblob",
    "subtype": null,
    "credentials": {
        "connectionString": "DefaultEndpointsProtocol=https;AccountName=[mystorageaccounthere];AccountKey=[myaccountkeyhere];"
    },
    "container": {
        "name": "data",
        "query": "clinical-trials-json"
    },
    "dataChangeDetectionPolicy": null,
    "dataDeletionDetectionPolicy": null
}
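
To avoid hand-editing the placeholders, the request body can also be assembled in code. This is a sketch only: the function name datasource_body and the values examplestorage and example-key are hypothetical stand-ins for your own account name and key, which you would normally read from a secure location rather than hard-code.

```python
import json


def datasource_body(account_name: str, account_key: str,
                    container: str = "data",
                    folder: str = "clinical-trials-json") -> dict:
    """Assemble the data source definition shown above for a given storage account."""
    connection_string = (
        "DefaultEndpointsProtocol=https;"
        f"AccountName={account_name};"
        f"AccountKey={account_key};"
    )
    return {
        "name": "clinical-trials-json",
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        # "query" restricts indexing to the folder the JSON files were uploaded to.
        "container": {"name": container, "query": folder},
    }


# Hypothetical credentials for illustration only.
body = datasource_body("examplestorage", "example-key")
print(json.dumps(body, indent=4))
```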

Create an index

The second API call creates an index. An index defines all the fields and their attributes.

The URL for this call is https://[service name].search.windows.net/indexes?api-version=2016-09-01-Preview. Replace [service name] with the name of your search service.

First replace the URL. Then copy and paste the following code into the request body and send the request.

{
  "name": "clinical-trials-json-index",  
  "fields": [
  {"name": "FileName", "type": "Edm.String", "searchable": false, "retrievable": true, "facetable": false, "filterable": false, "sortable": true},
  {"name": "Description", "type": "Edm.String", "searchable": true, "retrievable": false, "facetable": false, "filterable": false, "sortable": false},
  {"name": "MinimumAge", "type": "Edm.Int32", "searchable": false, "retrievable": true, "facetable": true, "filterable": true, "sortable": true},
  {"name": "Title", "type": "Edm.String", "searchable": true, "retrievable": true, "facetable": false, "filterable": true, "sortable": true},
  {"name": "URL", "type": "Edm.String", "searchable": false, "retrievable": false, "facetable": false, "filterable": false, "sortable": false},
  {"name": "MyURL", "type": "Edm.String", "searchable": false, "retrievable": true, "facetable": false, "filterable": false, "sortable": false},
  {"name": "Gender", "type": "Edm.String", "searchable": false, "retrievable": true, "facetable": true, "filterable": true, "sortable": false},
  {"name": "MaximumAge", "type": "Edm.Int32", "searchable": false, "retrievable": true, "facetable": true, "filterable": true, "sortable": true},
  {"name": "Summary", "type": "Edm.String", "searchable": true, "retrievable": true, "facetable": false, "filterable": false, "sortable": false},
  {"name": "NCTID", "type": "Edm.String", "key": true, "searchable": true, "retrievable": true, "facetable": false, "filterable": true, "sortable": true},
  {"name": "Phase", "type": "Edm.String", "searchable": false, "retrievable": true, "facetable": true, "filterable": true, "sortable": false},
  {"name": "Date", "type": "Edm.String", "searchable": false, "retrievable": true, "facetable": false, "filterable": false, "sortable": true},
  {"name": "OverallStatus", "type": "Edm.String", "searchable": false, "retrievable": true, "facetable": true, "filterable": true, "sortable": false},
  {"name": "OrgStudyId", "type": "Edm.String", "searchable": true, "retrievable": true, "facetable": false, "filterable": true, "sortable": false},
  {"name": "HealthyVolunteers", "type": "Edm.String", "searchable": false, "retrievable": true, "facetable": true, "filterable": true, "sortable": false},
  {"name": "Keywords", "type": "Collection(Edm.String)", "searchable": true, "retrievable": true, "facetable": true, "filterable": false, "sortable": false},
  {"name": "metadata_storage_last_modified", "type":"Edm.DateTimeOffset", "searchable": false, "retrievable": true, "filterable": true, "sortable": false},
  {"name": "metadata_storage_size", "type":"Edm.String", "searchable": false, "retrievable": true, "filterable": true, "sortable": false},
  {"name": "metadata_content_type", "type":"Edm.String", "searchable": true, "retrievable": true, "filterable": true, "sortable": false}
  ],
  "suggesters": [
  {
    "name": "sg",
    "searchMode": "analyzingInfixMatching",
    "sourceFields": ["Title"]
  }
  ]
}

The response should look like:

{
    "@odata.context": "https://exampleurl.search.windows.net/$metadata#indexes/$entity",
    "@odata.etag": "\"0x8D505FC00EDD5FA\"",
    "name": "clinical-trials-json-index",
    "fields": [
        {
            "name": "FileName",
            "type": "Edm.String",
            "searchable": false,
            "filterable": false,
            "retrievable": true,
            "sortable": true,
            "facetable": false,
            "key": false,
            "indexAnalyzer": null,
            "searchAnalyzer": null,
            "analyzer": null,
            "synonymMaps": []
        },
        {
            "name": "Description",
            "type": "Edm.String",
            "searchable": true,
            "filterable": false,
            "retrievable": false,
            "sortable": false,
            "facetable": false,
            "key": false,
            "indexAnalyzer": null,
            "searchAnalyzer": null,
            "analyzer": null,
            "synonymMaps": []
        },
        ...
    ],
    "scoringProfiles": [],
    "defaultScoringProfile": null,
    "corsOptions": null,
    "suggesters": [],
    "analyzers": [],
    "tokenizers": [],
    "tokenFilters": [],
    "charFilters": []
}
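
Before sending an index definition, it can help to check it locally for a few common mistakes. The sketch below is an illustration, not a service feature: validate_index is a hypothetical helper, the index shown is an abbreviated version of the one above, and the checks cover only three rules (exactly one key field, a key of type Edm.String, and suggester source fields that exist in the index).

```python
from typing import List


def validate_index(index: dict) -> List[str]:
    """Return a list of problems found in an index definition (empty if none)."""
    problems = []
    fields = index.get("fields", [])
    names = [f["name"] for f in fields]

    # An index needs exactly one key field, and it must be a string.
    keys = [f for f in fields if f.get("key")]
    if len(keys) != 1:
        problems.append(f"expected exactly one key field, found {len(keys)}")
    for f in keys:
        if f.get("type") != "Edm.String":
            problems.append(f"key field {f['name']} must be Edm.String")

    # Every suggester source field must be defined in the index.
    for suggester in index.get("suggesters", []):
        for source in suggester.get("sourceFields", []):
            if source not in names:
                problems.append(f"suggester field {source} is not in the index")
    return problems


# Abbreviated version of the tutorial's index, for illustration.
index = {
    "name": "clinical-trials-json-index",
    "fields": [
        {"name": "NCTID", "type": "Edm.String", "key": True, "searchable": True},
        {"name": "Title", "type": "Edm.String", "searchable": True},
        {"name": "MinimumAge", "type": "Edm.Int32", "filterable": True},
    ],
    "suggesters": [
        {"name": "sg", "searchMode": "analyzingInfixMatching", "sourceFields": ["Title"]}
    ],
}
problems = validate_index(index)
print(problems or "no problems found")
```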

Create an indexer

An indexer connects the data source to the target search index and optionally provides a schedule to automate the data refresh.

The URL for this call is https://[service name].search.windows.net/indexers?api-version=2016-09-01-Preview. Replace [service name] with the name of your search service.

First replace the URL. Then copy and paste the following code into the request body and send the request.

{
  "name" : "clinical-trials-json-indexer",
  "dataSourceName" : "clinical-trials-json",
  "targetIndexName" : "clinical-trials-json-index",
  "parameters" : { "configuration" : { "parsingMode" : "jsonArray" } }
}

The response should look like:

{
    "@odata.context": "https://exampleurl.search.windows.net/$metadata#indexers/$entity",
    "@odata.etag": "\"0x8D505FDE143D164\"",
    "name": "clinical-trials-json-indexer",
    "description": null,
    "dataSourceName": "clinical-trials-json",
    "targetIndexName": "clinical-trials-json-index",
    "schedule": null,
    "parameters": {
        "batchSize": null,
        "maxFailedItems": null,
        "maxFailedItemsPerBatch": null,
        "base64EncodeKeys": null,
        "configuration": {
            "parsingMode": "jsonArray"
        }
    },
    "fieldMappings": [],
    "enrichers": [],
    "disabled": null
}
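
The jsonArray parsing mode is what makes each element of a JSON array its own search document, whereas the json mode treats a whole blob as one document. The following local sketch imitates that difference so you can see the effect; it is not the service's implementation, and the function name documents_from_blob and the sample records are hypothetical.

```python
import json


def documents_from_blob(blob_text: str, parsing_mode: str = "jsonArray") -> list:
    """Imitate how a blob's JSON content maps to search documents."""
    parsed = json.loads(blob_text)
    if parsing_mode == "jsonArray":
        # One search document per array element.
        if not isinstance(parsed, list):
            raise ValueError("jsonArray mode expects a top-level JSON array")
        return parsed
    if parsing_mode == "json":
        # The whole blob becomes a single search document.
        return [parsed]
    raise ValueError(f"unsupported parsing mode: {parsing_mode}")


# Hypothetical blob content with two records.
blob = '[{"NCTID": "EXAMPLE-1", "Title": "First"}, {"NCTID": "EXAMPLE-2", "Title": "Second"}]'
docs = documents_from_blob(blob)
for doc in docs:
    print(doc["NCTID"], doc["Title"])
```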

Search your JSON files

Now that your search service has been connected to your data container, you can begin searching your files.

Open the Azure portal and navigate back to your search service, just like you did in the previous tutorial.

As before, the data can be queried in a number of ways: full text search, system properties, or user-defined metadata. Both system properties and user-defined metadata can be returned with the $select parameter only if they were marked as retrievable when the target index was created. Fields in the index can't be altered once they are created, but additional fields can be added.

An example of a basic query is $select=Gender,metadata_storage_size, which limits the results to those two fields.

An example of a more complex query is $filter=MinimumAge ge 30 and MaximumAge lt 75, which returns only results where MinimumAge is greater than or equal to 30 and MaximumAge is less than 75.

If you'd like to experiment, try a few more queries yourself. You can use logical operators (and, or, not) and comparison operators (eq, ne, gt, lt, ge, le). String comparisons are case-sensitive.

The $filter parameter works only with fields that were marked filterable when the index was created.
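
The same $select and $filter expressions can be sent as query string parameters on a GET request to the index's docs collection instead of being typed into the portal. The sketch below only builds the URLs. It assumes the same preview api-version as the rest of this tutorial, and the helper name search_url and the service name example-service are hypothetical.

```python
from urllib.parse import quote, urlencode


def search_url(service_name: str, index_name: str, params: dict) -> str:
    """Build a query URL for an index, percent-encoding spaces in expressions."""
    query = {"api-version": "2016-09-01-Preview", "search": "*"}
    query.update(params)
    # Keep "$", "," and "*" readable; spaces become %20.
    encoded = urlencode(query, quote_via=quote, safe="$,*")
    return f"https://{service_name}.search.windows.net/indexes/{index_name}/docs?{encoded}"


# Hypothetical service name: replace with your own.
select_url = search_url(
    "example-service",
    "clinical-trials-json-index",
    {"$select": "Gender,metadata_storage_size"},
)
filter_url = search_url(
    "example-service",
    "clinical-trials-json-index",
    {"$filter": "MinimumAge ge 30 and MaximumAge lt 75"},
)
print(select_url)
print(filter_url)
```

A request to either URL still needs the api-key header described earlier.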

Clean up resources

The fastest way to clean up after a tutorial is to delete the resource group containing the Azure Search service. You can delete the resource group now to permanently delete everything in it. In the portal, the resource group name is on the Overview page of the Azure Search service.

Next steps

You can attach AI-powered algorithms to an indexer pipeline. As a next step, continue on with the following tutorial.