Create Indexer (Azure Cognitive Search REST API)

An indexer automates indexing from supported Azure data sources, such as Azure Storage, Azure SQL Database, and Azure Cosmos DB. Indexers use a predefined data source and index to establish an indexing pipeline that extracts and serializes source data, passing it to a search service for data ingestion. For AI enrichment of images and unstructured text, indexers can also accept a skillset that defines AI processing.

Creating an indexer adds it to your search service and runs it. If the request is successful, the index will be populated with searchable content from the data source.

You can use either POST or PUT on the request. For either one, the JSON document in the request body provides the object definition.

POST https://[service name].search.windows.net/indexers?api-version=[api-version]
    Content-Type: application/json  
    api-key: [admin key]  

Alternatively, you can use PUT and specify the indexer name on the URI.

PUT https://[service name].search.windows.net/indexers/[indexer name]?api-version=[api-version]
    Content-Type: application/json  
    api-key: [admin key]    

HTTPS is required for all service requests. If the indexer doesn't exist, it is created. If it already exists, it is updated to the new definition, but you must issue a Run Indexer request if you want it to execute.
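
As a minimal sketch of the PUT form, the following Python builds (but does not send) a create-or-update request using only the standard library. The helper name build_put_request, the service name, the indexer name, and the key are placeholders for illustration and are not part of any SDK.

```python
import json
import urllib.request

API_VERSION = "2020-06-30"  # version named in this article; adjust as needed


def build_put_request(service_name, indexer_name, admin_key, definition):
    """Build a create-or-update (PUT) request for an indexer without sending it."""
    url = (
        f"https://{service_name}.search.windows.net"
        f"/indexers/{indexer_name}?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(definition).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method="PUT",
    )


# Placeholder values for illustration only.
request = build_put_request(
    "my-service",
    "myindexer",
    "<admin key>",
    {"dataSourceName": "ordersds", "targetIndexName": "orders"},
)
print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would send the request; the sketch stops short of that so it can be run without a search service.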

Indexer configuration varies based on the type of data source. For data-platform-specific guidance on creating indexers, start with Indexers overview, which includes the complete list of related articles.

URI Parameters

Parameter Description
service name Required. Set this to the unique, user-defined name of your search service.
indexer name Required on the URI if using PUT. The name must be lower case, start with a letter or number, have no slashes or dots, and be less than 128 characters. After starting the name with a letter or number, the rest of the name can include any letter, number and dashes, as long as the dashes are not consecutive.
api-version Required. The current version is api-version=2020-06-30. See API versions in Azure Cognitive Search for a list of available versions.
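
The naming rules for the indexer name can be checked on the client before a request is sent. The sketch below encodes only the rules stated in the table (it does not, for example, reject a trailing dash, because the table does not mention one); is_valid_indexer_name is a hypothetical helper, and the service remains the authority on what it accepts.

```python
import re

# A lowercase letter or digit first, then lowercase letters, digits,
# or dashes, where a dash may not be followed by another dash.
_NAME_PATTERN = re.compile(r"[a-z0-9](?:[a-z0-9]|-(?!-))*")


def is_valid_indexer_name(name: str) -> bool:
    """Check a name against the rules listed for the indexer name parameter."""
    return len(name) < 128 and _NAME_PATTERN.fullmatch(name) is not None


for candidate in ("my-blob-indexer", "My-Indexer", "my--indexer", "my.indexer"):
    print(candidate, is_valid_indexer_name(candidate))
```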

Request Headers

The following table describes the required and optional request headers.

Fields Description
Content-Type Required. Set this to application/json
api-key Required. The api-key is used to authenticate the request to your Search service. It is a string value, unique to your service. Create requests must include an api-key header set to your admin key (as opposed to a query key).

You can get the api-key from your service dashboard in the Azure portal. For more information, see Find existing keys.

Request Body

A data source, index, and skillset are part of an indexer definition, but each is an independent component that can be used in different combinations. For example, you could use the same data source with multiple indexers, or have multiple indexers write to a single index.

The following JSON is a high-level representation of the main parts of the definition.

{
    "name" : (optional on PUT; required on POST) "Name of the indexer",
    "description" : (optional) "Anything you want, or nothing at all",
    "dataSourceName" : (required) "Name of an existing data source",
    "targetIndexName" : (required) "Name of an existing index",
    "skillsetName" : (required for AI enrichment) "Name of an existing skillset",
    "schedule" : (optional but runs once immediately if unspecified) { ... },
    "parameters" : (optional) { ... },
    "fieldMappings" : (optional) [ ... ],
    "outputFieldMappings" : (required for AI enrichment) [ ... ],
    "encryptionKey" : (optional) { ... },
    "disabled" : (optional) Boolean value indicating whether the indexer is disabled. False by default.
}

Request contains the following properties:

Property Description
name Required. The name must be lower case, start with a letter or number, have no slashes or dots, and be less than 128 characters. After starting the name with a letter or number, the rest of the name can include any letter, number and dashes, as long as the dashes are not consecutive.
dataSourceName Required. Name of an existing data source.
targetIndexName Required. Name of an existing index.
skillsetName Required for AI enrichment. Name of an existing skillset.
schedule Optional, but runs once immediately if unspecified.
parameters Optional. Properties for modifying runtime behavior.
fieldMappings Optional. Used when source and destination fields have different names.
outputFieldMappings Required for AI enrichment. Maps output from a skillset to an index or projection.
encryptionKey Optional. Used to encrypt indexer data at rest with your own keys, managed in your Azure Key Vault. To learn more, see Azure Cognitive Search encryption using customer-managed keys in Azure Key Vault.
disabled Optional. Boolean value indicating whether the indexer is disabled. False by default.

"dataSourceName"

A data source definition often includes properties that an indexer can use to exploit source platform characteristics. As such, the data source you pass to the indexer determines the availability of certain properties and parameters, such as content type filtering in Azure blobs or query timeout for Azure SQL Database.

"targetIndexName"

An index schema defines the fields collection containing searchable, filterable, retrievable, and other attributes that determine how each field is used. During indexing, the indexer crawls the data source, optionally cracks documents and extracts information, serializes the results to JSON, and indexes the payload based on the schema defined for your index.

"skillsetName"

AI enrichment refers to natural language and image processing capabilities in Azure Cognitive Search, applied during data ingestion to extract entities, key phrases, language, information from images, and so forth. Transformations are applied to content through skills, which you combine into a single skillset, one per indexer. As with data sources and indexes, a skillset is an independent component that you attach to an indexer. You can reuse a skillset with other indexers, but each indexer can only use one skillset at a time.

"schedule"

An indexer can optionally specify a schedule. Without a schedule, the indexer runs immediately when you send the request: it connects to, crawls, and indexes the data source. For some scenarios, including long-running indexing jobs, schedules are used to extend the processing window beyond the 24-hour maximum. If a schedule is present, the indexer runs periodically according to that schedule. The scheduler is built in; you cannot use an external scheduler. A schedule has the following attributes:

  • interval: Required. A duration value that specifies an interval or period for indexer runs. The smallest allowed interval is five minutes; the longest is one day. It must be formatted as an XSD "dayTimeDuration" value (a restricted subset of an ISO 8601 duration value). The pattern for this is: "P[nD][T[nH][nM]]". Examples: PT15M for every 15 minutes, PT2H for every 2 hours.

  • startTime: Optional. A UTC datetime when the indexer should start running.
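
The interval format can be produced from a Python timedelta. This is a sketch under the constraints stated for interval (five minutes to one day, pattern "P[nD][T[nH][nM]]"); to_schedule_interval is a hypothetical helper name, and rejecting values that are not whole minutes is an assumption that follows from the pattern having no seconds component.

```python
from datetime import timedelta

MIN_INTERVAL = timedelta(minutes=5)  # smallest allowed interval
MAX_INTERVAL = timedelta(days=1)     # longest allowed interval


def to_schedule_interval(period: timedelta) -> str:
    """Format a timedelta as a "P[nD][T[nH][nM]]" duration string."""
    if not MIN_INTERVAL <= period <= MAX_INTERVAL:
        raise ValueError("interval must be between five minutes and one day")
    total_seconds = int(period.total_seconds())
    if period.microseconds or total_seconds % 60:
        raise ValueError("interval must be a whole number of minutes")
    days, remainder = divmod(total_seconds // 60, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    text = "P"
    if days:
        text += f"{days}D"
    if hours or minutes:
        text += "T"
        if hours:
            text += f"{hours}H"
        if minutes:
            text += f"{minutes}M"
    return text


schedule = {
    "interval": to_schedule_interval(timedelta(minutes=15)),  # "PT15M"
    "startTime": "2018-01-01T00:00:00Z",
}
print(schedule)
```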

Note

If a scheduled indexer repeatedly fails on the same document each time it runs, it begins running at a less frequent interval (but at least once every 24 hours) until it makes progress again. If you believe you have fixed the issue that caused the indexer to get stuck, you can run the indexer on demand. If that run makes progress, the indexer returns to its scheduled interval.

"parameters"

An indexer can optionally take configuration parameters that modify runtime behaviors. Configuration parameters are comma-delimited on the indexer request.

  {
    "name" : "my-blob-indexer-for-cognitive-search",
    ... other indexer properties
    "parameters" : 
      {
      "maxFailedItems" : 15,
      "batchSize" : 100,
      "configuration" : 
          { 
          "parsingMode" : "json", 
          "indexedFileNameExtensions" : ".json, .jpg, .png",
          "imageAction" : "generateNormalizedImages",
          "dataToExtract" : "contentAndMetadata" ,
          "executionEnvironment": "Standard"
          } 
      }
  }

General parameters for all indexers

Parameter: "batchSize"
Type and allowed values: Integer. Default is source-specific (1000 for Azure SQL Database and Azure Cosmos DB, 10 for Azure Blob Storage).
Usage: Specifies the number of items that are read from the data source and indexed as a single batch in order to improve performance.

Parameter: "maxFailedItems"
Type and allowed values: Integer. Default is 0.
Usage: Number of errors to tolerate before an indexer run is considered a failure. Set to -1 if you don't want any errors to stop the indexing process. You can retrieve information about failed items using Get Indexer Status.

Parameter: "maxFailedItemsPerBatch"
Type and allowed values: Integer. Default is 0.
Usage: Number of errors to tolerate in each batch before an indexer run is considered a failure. Set to -1 if you don't want any errors to stop the indexing process.

Parameter: "executionEnvironment"
Type and allowed values: String. Valid values are case-insensitive and consist of [null or unspecified], Standard (default), or Private.
Usage: Overrides the execution environment chosen by internal system processes. Explicitly setting the execution environment to Private is required if indexers are accessing external resources over private endpoint connections. For data ingestion, this setting is valid only for services that are provisioned as Basic or Standard (S1, S2, S3). For AI enrichment content processing, this setting is valid for S2 and S3 only. This setting is located in the "configuration" section.
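
The placement of these settings can be sketched in Python: the three integer parameters sit directly under "parameters", while "executionEnvironment" goes inside the nested "configuration" section. The helper build_parameters is hypothetical, and the validation reflects only the allowed values listed above.

```python
import json

_EXECUTION_ENVIRONMENTS = {"standard", "private"}  # compared case-insensitively


def build_parameters(batch_size=None, max_failed_items=0,
                     max_failed_items_per_batch=0, execution_environment=None):
    """Assemble the "parameters" object of an indexer definition."""
    parameters = {
        "maxFailedItems": max_failed_items,
        "maxFailedItemsPerBatch": max_failed_items_per_batch,
    }
    if batch_size is not None:
        # Omitting batchSize leaves the source-specific default in effect.
        parameters["batchSize"] = batch_size
    if execution_environment is not None:
        if execution_environment.lower() not in _EXECUTION_ENVIRONMENTS:
            raise ValueError("executionEnvironment must be Standard or Private")
        # executionEnvironment belongs in the "configuration" section.
        parameters["configuration"] = {
            "executionEnvironment": execution_environment
        }
    return parameters


print(json.dumps(
    build_parameters(batch_size=100, max_failed_items=15,
                     execution_environment="Private"),
    indent=2,
))
```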

Blob configuration parameters

Several parameters are exclusive to a particular indexer, such as Azure blob indexing.

Parameter: "parsingMode"
Type and allowed values: String. "text", "delimitedText", "json", "jsonArray", "jsonLines".
Usage: For Azure blobs, set to text to improve indexing performance on plain text files in blob storage.
For CSV blobs, set to delimitedText when blobs are plain CSV files.
For JSON blobs, set to json to extract structured content or to jsonArray to extract individual elements of an array as separate documents in Azure Cognitive Search. Use jsonLines to extract individual JSON entities, separated by a new line, as separate documents in Azure Cognitive Search.

Parameter: "excludedFileNameExtensions"
Type and allowed values: String. Comma-delimited list, user-defined.
Usage: For Azure blobs, ignore any file types in the list. For example, you could exclude ".png, .mp4" to skip over those files during indexing.

Parameter: "indexedFileNameExtensions"
Type and allowed values: String. Comma-delimited list, user-defined.
Usage: For Azure blobs, selects blobs if the file extension is in the list. For example, you could focus indexing on specific application files ".docx, .pptx, .msg" to specifically include those file types.

Parameter: "failOnUnsupportedContentType"
Type and allowed values: Boolean. true, false (default).
Usage: For Azure blobs, set to false if you want to continue indexing when an unsupported content type is encountered, and you don't know all the content types (file extensions) in advance.

Parameter: "failOnUnprocessableDocument"
Type and allowed values: Boolean. true, false (default).
Usage: For Azure blobs, set to false if you want to continue indexing if a document fails indexing.

Parameter: "indexStorageMetadataOnlyForOversizedDocuments"
Type and allowed values: Boolean. true, false (default).
Usage: For Azure blobs, set this property to true to still index storage metadata for blob content that is too large to process. Oversized blobs are treated as errors by default. For limits on blob size, see Service Limits.

Parameter: "delimitedTextHeaders"
Type and allowed values: String. Comma-delimited list, user-defined.
Usage: For CSV blobs, specifies a comma-delimited list of column headers, useful for mapping source fields to destination fields in an index.

Parameter: "delimitedTextDelimiter"
Type and allowed values: String. Single character, user-defined.
Usage: For CSV blobs, specifies the end-of-line delimiter for CSV files where each line starts a new document (for example, "|").

Parameter: "firstLineContainsHeaders"
Type and allowed values: Boolean. true (default), false.
Usage: For CSV blobs, indicates that the first (non-blank) line of each blob contains headers.

Parameter: "documentRoot"
Type and allowed values: String. User-defined path.
Usage: For JSON arrays, given a structured or semi-structured document, you can specify a path to the array using this property.

Parameter: "dataToExtract"
Type and allowed values: String. "storageMetadata", "allMetadata", "contentAndMetadata" (default).
Usage: For Azure blobs:
Set to "storageMetadata" to index just the standard blob properties and user-specified metadata.
Set to "allMetadata" to extract the metadata provided by the Azure blob storage subsystem as well as content-type specific metadata (for example, metadata unique to just .png files).
Set to "contentAndMetadata" to extract all metadata and textual content from each blob.

For image-analysis in AI enrichment, when "imageAction" is set to a value other than "none", the "dataToExtract" setting tells the indexer which data to extract from image content. Applies to embedded image content in a .PDF or other application, or image files such as .jpg and .png, in Azure blobs.

Parameter: "imageAction"
Type and allowed values: String. "none", "generateNormalizedImages", "generateNormalizedImagePerPage".
Usage: For Azure blobs, set to "none" to ignore embedded images or image files in the data set. This is the default.

For image-analysis in AI enrichment, set to "generateNormalizedImages" to extract text from images (for example, the word "stop" from a traffic Stop sign), and embed it as part of the content field. During image analysis, the indexer creates an array of normalized images as part of document cracking, and embeds the generated information into the content field. This action requires that "dataToExtract" is set to "contentAndMetadata". A normalized image refers to additional processing resulting in uniform image output, sized and rotated to promote consistent rendering when you include images in visual search results (for example, same-size photographs in a graph control as seen in the JFK demo). This information is generated for each image when you use this option.

If you set "imageAction" to "generateNormalizedImagePerPage", PDF files are treated differently: instead of extracting embedded images, each page is rendered as an image and normalized accordingly. Non-PDF file types are treated the same as if "generateNormalizedImages" was set.

Setting the "imageAction" configuration to any value other than "none" requires that a skillset also be attached to that indexer.

Parameter: "allowSkillsetToReadFileData"
Type and allowed values: Boolean. true, false (default).
Usage: Setting the "allowSkillsetToReadFileData" parameter to true will create a path /document/file_data that is an object representing the original file data downloaded from your blob data source. This allows you to pass the original file data to a custom skill for processing within the enrichment pipeline, or to the Document Extraction skill. The object generated will be defined as follows: { "$type": "file", "data": "BASE64 encoded string of the file" }

Setting the "allowSkillsetToReadFileData" parameter to true requires that a skillset be attached to that indexer, that the "parsingMode" parameter is set to "default", "text" or "json", and the "dataToExtract" parameter is set to "contentAndMetadata" or "allMetadata".

Parameter: "pdfTextRotationAlgorithm"
Type and allowed values: String. "none" (default), "detectAngles".
Usage: Setting the "pdfTextRotationAlgorithm" parameter to "detectAngles" may help produce better and more readable text extraction from PDF files that have rotated text within them. Note that there may be a small performance impact when this parameter is used. This parameter only applies to PDF files, and only to PDFs with embedded text. If the rotated text appears within an embedded image in the PDF, this parameter does not apply.

Setting the "pdfTextRotationAlgorithm" parameter to "detectAngles" requires that the "parsingMode" parameter is set to "default".
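
Several blob parameters depend on one another ("imageAction", "allowSkillsetToReadFileData", and "pdfTextRotationAlgorithm" each state prerequisites). The following sketch checks those stated prerequisites against an indexer definition held as a Python dict. The helper check_blob_configuration is hypothetical, and treating an omitted "parsingMode" as "default" is an assumption; the other defaults are the ones listed for each parameter.

```python
def check_blob_configuration(indexer: dict) -> list:
    """Return a list of unmet prerequisites among blob configuration settings."""
    config = indexer.get("parameters", {}).get("configuration", {})
    has_skillset = bool(indexer.get("skillsetName"))
    parsing_mode = config.get("parsingMode", "default")  # assumed default
    data_to_extract = config.get("dataToExtract", "contentAndMetadata")
    image_action = config.get("imageAction", "none")
    problems = []

    if image_action != "none" and not has_skillset:
        problems.append("imageAction other than none requires a skillset")
    if (image_action == "generateNormalizedImages"
            and data_to_extract != "contentAndMetadata"):
        problems.append(
            "generateNormalizedImages requires dataToExtract=contentAndMetadata")

    if config.get("allowSkillsetToReadFileData", False):
        if not has_skillset:
            problems.append("allowSkillsetToReadFileData requires a skillset")
        if parsing_mode not in ("default", "text", "json"):
            problems.append(
                "allowSkillsetToReadFileData requires parsingMode default, text, or json")
        if data_to_extract not in ("contentAndMetadata", "allMetadata"):
            problems.append(
                "allowSkillsetToReadFileData requires dataToExtract "
                "contentAndMetadata or allMetadata")

    if (config.get("pdfTextRotationAlgorithm", "none") == "detectAngles"
            and parsing_mode != "default"):
        problems.append("detectAngles requires parsingMode=default")
    return problems


indexer = {
    "name": "demoindexer",
    "parameters": {"configuration": {"imageAction": "generateNormalizedImages"}},
}
print(check_blob_configuration(indexer))
```

An empty list means none of the stated prerequisites are violated; it does not guarantee that the service will accept the definition.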

Other configuration parameters

The following parameters are specific to Azure SQL Database.

Parameter: "queryTimeout"
Type and allowed values: String. Format is "hh:mm:ss"; default is "00:05:00".
Usage: For Azure SQL Database, set this parameter to increase the timeout beyond the 5-minute default.

"fieldMappings"

Indexer definitions contain field associations for mapping a source field to a destination field in an Azure Cognitive Search index. There are two types of associations depending on whether the content transfer follows a direct or enriched path:

  • fieldMappings are optional, applied when source-destination field names do not match, or when you want to specify a function.
  • outputFieldMappings are required if you are building an enrichment pipeline. In an enrichment pipeline, the output field is a construct defined during the enrichment process. For example, the output field might be a compound structure built during enrichment from two separate fields in the source document.

In the following example, consider a source table with a field _id. Azure Cognitive Search doesn't allow a field name starting with an underscore, so the field must be renamed. This can be done using the fieldMappings property of the indexer as follows:

"fieldMappings" : [ { "sourceFieldName" : "_id", "targetFieldName" : "id" } ]

You can specify multiple field mappings:

"fieldMappings" : [
    { "sourceFieldName" : "_id", "targetFieldName" : "id" },
    { "sourceFieldName" : "_timestamp", "targetFieldName" : "timestamp" }
]

Both source and target field names are case-insensitive.

To learn about scenarios where field mappings are useful, see Search Indexer Field Mappings.
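
To illustrate the effect of fieldMappings, the sketch below renames the fields of a single source row locally in Python. It is only a model of the renaming described here, not how the service is implemented. It assumes unmapped fields keep their own names and that each source field is mapped at most once; apply_field_mappings is a hypothetical helper.

```python
def apply_field_mappings(row: dict, field_mappings: list) -> dict:
    """Rename source fields per fieldMappings; unmapped fields keep their names."""
    # Source field names are matched case-insensitively.
    targets = {
        mapping["sourceFieldName"].lower(): mapping["targetFieldName"]
        for mapping in field_mappings
    }
    return {targets.get(name.lower(), name): value for name, value in row.items()}


field_mappings = [
    {"sourceFieldName": "_id", "targetFieldName": "id"},
    {"sourceFieldName": "_timestamp", "targetFieldName": "timestamp"},
]
row = {"_id": 7, "_Timestamp": "2018-01-01T00:00:00Z", "total": 3}
print(apply_field_mappings(row, field_mappings))
```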

"outputFieldMappings"

In AI enrichment scenarios in which a skillset is bound to an indexer, you must add outputFieldMappings to associate any output of an enrichment step that provides content to a searchable field in the index.

  "outputFieldMappings" : [
        {
          "sourceFieldName" : "/document/organizations", 
          "targetFieldName" : "organizations"
        },
        {
          "sourceFieldName" : "/document/pages/*/keyPhrases/*", 
          "targetFieldName" : "keyphrases"
        },
        {
            "sourceFieldName": "/document/languageCode",
            "targetFieldName": "language",
            "mappingFunction": null
        }      
   ],

Field mapping functions

Field mappings can also be used to transform source field values using field mapping functions. For example, an arbitrary string value can be base64-encoded so it can be used to populate a document key field.

To learn more about when and how to use field mapping functions, see Field Mapping Functions.
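
As a rough illustration of why encoding helps with document keys, the sketch below shows a field mapping that names a mapping function and a local Python approximation of URL-safe base64 encoding. The function name "base64Encode", the source field name "path", and the choice of the URL-safe variant are assumptions here; confirm the exact names and encoding options in Field Mapping Functions.

```python
import base64

# Hypothetical mapping: "path" is an illustrative source field name, and
# "base64Encode" is assumed to be the mapping function's name.
key_mapping = {
    "sourceFieldName": "path",
    "targetFieldName": "id",
    "mappingFunction": {"name": "base64Encode"},
}


def url_safe_key(value: str) -> str:
    """Local approximation: URL-safe base64 of the UTF-8 bytes of value."""
    return base64.urlsafe_b64encode(value.encode("utf-8")).decode("ascii")


encoded = url_safe_key("https://example.invalid/container/report 2018.pdf")
print(encoded)
```

The encoded value contains no slashes, spaces, or plus signs, which is what makes an arbitrary string such as a URL usable as a key in this sketch.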

"disabled"

The disabled property is an optional Boolean value to indicate whether the indexer is disabled. It is set to false by default. To stop an indexer run, set disabled to true.

Response

201 Created for a successful request.

Examples

The first example creates an indexer that copies data from the table referenced by the ordersds data source to the orders index on a schedule that starts on Jan 1, 2018 UTC and runs hourly. Each indexer invocation will be successful if no more than 5 items fail to be indexed in each batch, and no more than 10 items fail to be indexed in total.

{
    "name" : "myindexer",  
    "description" : "a cool indexer",  
    "dataSourceName" : "ordersds",  
    "targetIndexName" : "orders",  
    "schedule" : { "interval" : "PT1H", "startTime" : "2018-01-01T00:00:00Z" },  
    "parameters" : { "maxFailedItems" : 10, "maxFailedItemsPerBatch" : 5 }  
}
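
The failure tolerance described for this example can be modeled in a few lines of Python. This is a sketch of the rule as stated in this article (per-batch and total limits, with -1 meaning no limit), not the service's actual implementation; run_is_successful is a hypothetical name.

```python
def run_is_successful(failures_per_batch, max_failed_items=0,
                      max_failed_items_per_batch=0):
    """Return True if failure counts stay within both tolerances (-1 = no limit)."""
    within_batches = max_failed_items_per_batch == -1 or all(
        count <= max_failed_items_per_batch for count in failures_per_batch
    )
    within_total = (
        max_failed_items == -1 or sum(failures_per_batch) <= max_failed_items
    )
    return within_batches and within_total


# Limits from the example: 10 failures in total, 5 per batch.
print(run_is_successful([5, 5], max_failed_items=10, max_failed_items_per_batch=5))
print(run_is_successful([6, 0], max_failed_items=10, max_failed_items_per_batch=5))
```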

The second example demonstrates an AI enrichment, indicated by the reference to a skillset and outputFieldMappings. Skillsets are high-level resources, defined separately. This example is an abbreviation of the indexer definition in the AI enrichment tutorial.

{
  "name" : "demoindexer",
  "dataSourceName" : "demodata",
  "targetIndexName" : "demoindex",
  "skillsetName" : "demoskillset",
  "fieldMappings" : [
    {
      "sourceFieldName" : "content",
      "targetFieldName" : "content"
    }
  ],
  "outputFieldMappings" : [
    {
      "sourceFieldName" : "/document/organizations",
      "targetFieldName" : "organizations"
    }
  ],
  "parameters" : {
    "maxFailedItems" : -1,
    "configuration" : {
      "dataToExtract" : "contentAndMetadata",
      "imageAction" : "generateNormalizedImages"
    }
  }
}

"encryptionKey"

While indexers are encrypted by default using service-managed keys, you can also encrypt them with your own keys, managed in your Azure Key Vault. The indexer execution status will also be encrypted with the same key. To learn more, see Azure Cognitive Search encryption using customer-managed keys in Azure Key Vault.

"encryptionKey": (optional) {
  "keyVaultKeyName": "Name of the Azure Key Vault key used for encryption",
  "keyVaultKeyVersion": "Version of the Azure Key Vault key",
  "keyVaultUri": "URI of Azure Key Vault, also referred to as DNS name, that provides the key. An example URI might be https://my-keyvault-name.vault.azure.net",
  "accessCredentials": (optional, only if not using managed system identity) {
    "applicationId": "Azure Active Directory Application ID that was granted access permissions to your specified Azure Key Vault",
    "applicationSecret": "Authentication key of the specified Azure AD application"
  }
}
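
The optional nesting of "accessCredentials" can be sketched as a small Python builder. The helper build_encryption_key and all argument values are placeholders; requiring the application ID and secret to be supplied together is an assumption made for this sketch.

```python
def build_encryption_key(key_name, key_version, vault_uri,
                         application_id=None, application_secret=None):
    """Assemble an "encryptionKey" object; credentials are added only if given."""
    if (application_id is None) != (application_secret is None):
        raise ValueError("provide both applicationId and applicationSecret, or neither")
    encryption_key = {
        "keyVaultKeyName": key_name,
        "keyVaultKeyVersion": key_version,
        "keyVaultUri": vault_uri,
    }
    if application_id is not None:
        # Only needed when a managed system identity is not used.
        encryption_key["accessCredentials"] = {
            "applicationId": application_id,
            "applicationSecret": application_secret,
        }
    return encryption_key


# Placeholder values for illustration only.
print(build_encryption_key(
    "my-key", "<key version>", "https://my-keyvault-name.vault.azure.net"))
```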

Note

Encryption with customer-managed keys is not available for free services. For billable services, it is only available for search services created on or after 2019-01-01.

See also