Field mappings and transformations using Azure Cognitive Search indexers


When using Azure Cognitive Search indexers, the indexer will automatically map fields in a data source to fields in a target index, assuming field names and types are compatible. When input data doesn't quite match the schema of your target index, you can define field mappings to specifically set the data path.

Field mappings address the following scenarios:

  • Mismatched field names. Suppose your data source has a field named _id. Given that Azure Cognitive Search doesn't allow field names that start with an underscore, a field mapping lets you effectively rename a field.

  • One field to many fields. You can populate several index fields from the same data source field. For example, you might want to apply a different analyzer to each index field.

  • Many fields to one field. You want to populate an index field with data from more than one data source, and the data sources each use different field names.

  • Base64 encoding or decoding of data. Field mappings support several mapping functions, including functions for Base64 encoding and decoding.

  • Splitting strings or recasting a JSON array into a string collection. Field mapping functions provide this capability.

Field mappings in indexers are a simple way to map data fields to index fields, with some ability for lightweight data conversion. More complex data might require pre-processing to reshape it into a form that's conducive to indexing. One option you might consider is Azure Data Factory.

Note

Field mappings apply to search indexes only. For indexers that also create knowledge stores, data shapes and projections determine field associations, and any field mappings and output field mappings in the indexer are ignored.

Set up field mappings

A field mapping consists of three parts:

  • "sourceFieldName", which represents a field in your data source. This property is required.
  • An optional "targetFieldName", representing a field in your search index. If omitted, the value of "sourceFieldName" is used for the target.
  • An optional "mappingFunction", which can transform your data using one of several predefined functions. This can be applied on both input and output field mappings.

Field mappings are added to the "fieldMappings" array of the indexer definition.

Note

If no field mappings are present, indexers assume data source fields should be mapped to index fields with the same name. Adding a field mapping overrides these default field mappings for the source and target field. Some indexers, such as the blob storage indexer, add default field mappings for the index key field.

Map fields using REST

You can add field mappings when creating a new indexer using the Create Indexer API request. You can manage the field mappings of an existing indexer using the Update Indexer API request.

For example, here's how to map a source field to a target field with a different name:

PUT https://[service name].search.windows.net/indexers/myindexer?api-version=[api-version]
Content-Type: application/json
api-key: [admin key]
{
    "dataSourceName" : "mydatasource",
    "targetIndexName" : "myindex",
    "fieldMappings" : [ { "sourceFieldName" : "_id", "targetFieldName" : "id" } ]
}
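
The same request can also be assembled from a script. The following Python sketch builds (but doesn't send) the PUT request using only the standard library. The service name my-service, the API version 2020-06-30, and the build_indexer_request helper are illustrative assumptions, not names defined by the service.

```python
import json
import urllib.request


def build_indexer_request(service, indexer, admin_key, definition,
                          api_version="2020-06-30"):
    """Build a Create or Update Indexer request without sending it."""
    url = (f"https://{service}.search.windows.net/indexers/{indexer}"
           f"?api-version={api_version}")
    return urllib.request.Request(
        url,
        data=json.dumps(definition).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method="PUT",
    )


# Same indexer definition as the REST example above.
definition = {
    "dataSourceName": "mydatasource",
    "targetIndexName": "myindex",
    "fieldMappings": [{"sourceFieldName": "_id", "targetFieldName": "id"}],
}

# "my-service" and "<admin key>" are placeholders for your own values.
request = build_indexer_request("my-service", "myindexer", "<admin key>", definition)
print(request.get_method(), request.full_url)
```

To actually submit the request, pass it to urllib.request.urlopen with a real service name and admin key.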

A source field can be referenced in multiple field mappings. The following example shows how to "fork" a field, copying the same source field to two different index fields:


"fieldMappings" : [
    { "sourceFieldName" : "text", "targetFieldName" : "textStandardEnglishAnalyzer" },
    { "sourceFieldName" : "text", "targetFieldName" : "textSoundexAnalyzer" }
]
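
To make the routing concrete, here is a simplified Python model of how a set of field mappings might distribute source values across index fields. It is only a sketch: the apply_field_mappings helper is hypothetical, it ignores mapping functions, index schema checks, and case-insensitive name matching, and it assumes that a source field named in any mapping is no longer copied under its own name by default.

```python
def apply_field_mappings(source_doc, field_mappings):
    """Route source values to target fields (simplified conceptual model)."""
    mapped_sources = {m["sourceFieldName"] for m in field_mappings}

    # Default behavior: source fields without an explicit mapping keep their name.
    result = {k: v for k, v in source_doc.items() if k not in mapped_sources}

    # Explicit mappings: the target name falls back to the source name if omitted.
    for mapping in field_mappings:
        source = mapping["sourceFieldName"]
        target = mapping.get("targetFieldName", source)
        result[target] = source_doc[source]
    return result


source_doc = {"id": "1", "text": "The quick brown fox"}
field_mappings = [
    {"sourceFieldName": "text", "targetFieldName": "textStandardEnglishAnalyzer"},
    {"sourceFieldName": "text", "targetFieldName": "textSoundexAnalyzer"},
]
index_doc = apply_field_mappings(source_doc, field_mappings)
print(index_doc)
```

In this model, the single text value ends up in both target fields, while id is carried over unchanged because no mapping mentions it.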

Note

Azure Cognitive Search uses case-insensitive comparison to resolve the field and function names in field mappings. This is convenient (you don't have to get all the casing right), but it means that your data source or index cannot have fields that differ only by case.
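
Before defining mappings, it can help to check a schema for names that differ only by case. The following Python sketch does this with a hypothetical find_case_collisions helper. It uses str.casefold as an approximation of case-insensitive comparison, which is an assumption and might not match the service's rules for every character.

```python
from collections import defaultdict


def find_case_collisions(field_names):
    """Return groups of names that are identical when case is ignored."""
    groups = defaultdict(list)
    for name in field_names:
        groups[name.casefold()].append(name)
    return [names for names in groups.values() if len(names) > 1]


collisions = find_case_collisions(["HotelId", "hotelid", "Tags", "Description"])
print(collisions)  # [['HotelId', 'hotelid']]
```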

Map fields using .NET

You can define field mappings in the .NET SDK using the FieldMapping class, which has the properties "SourceFieldName" and "TargetFieldName", and an optional "MappingFunction" reference.

You can specify field mappings when constructing the indexer, or later by directly setting SearchIndexer.FieldMappings.

The following C# example sets the field mappings when constructing an indexer.

var indexer = new SearchIndexer("hotels-sql-idxr", dataSource.Name, searchIndex.Name)
{
    Description = "SQL data indexer",
    Schedule = schedule,
    Parameters = parameters,
    FieldMappings =
    {
        new FieldMapping("_id") { TargetFieldName = "HotelId", MappingFunction = new FieldMappingFunction("base64Encode") },
        new FieldMapping("Amenities") {TargetFieldName = "Tags"}
    }
};

await indexerClient.CreateOrUpdateIndexerAsync(indexer);

Field mapping functions

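A field mapping function transforms the contents of a field before it's stored in the index. The following mapping functions are currently supported:

  • base64Encode
  • base64Decode
  • extractTokenAtPosition
  • jsonArrayToStringCollection
  • urlEncode
  • urlDecode
  • fixedLengthEncode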

base64Encode function

Performs URL-safe Base64 encoding of the input string. Assumes that the input is UTF-8 encoded.

Example - Base64-encode a document key

Only URL-safe characters can appear in an Azure Cognitive Search document key (so that you can address the document using the Lookup API). If the source field for your key contains URL-unsafe characters, such as / and \, use the base64Encode function to convert it at indexing time.

The following example specifies the base64Encode function on "metadata_storage_name" to handle unsupported characters.

PUT /indexers/[indexer name]?api-version=2020-06-30
{
  "dataSourceName" : "my-blob-datasource",
  "targetIndexName" : "my-search-index",
  "fieldMappings" : [
    { 
        "sourceFieldName" : "metadata_storage_name", 
        "targetFieldName" : "key", 
        "mappingFunction" : { 
            "name" : "base64Encode",
            "parameters" : { "useHttpServerUtilityUrlTokenEncode" : false }
        } 
    }
  ]
}

A document key (both before and after conversion) can't be longer than 1,024 characters. When you retrieve the encoded key at search time, use the base64Decode function to get the original key value, and use that to retrieve the source document.
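
Because Base64 turns every 3 bytes into 4 characters, a source value well under the limit can still exceed it after encoding. The following Python sketch estimates this ahead of time. It assumes URL-safe Base64 without padding; the encoded_key and fits_key_limit helpers and the storage account name are illustrative only.

```python
import base64

MAX_KEY_LENGTH = 1024  # limit stated above, applied before and after conversion


def encoded_key(value: str) -> str:
    """URL-safe Base64 of the UTF-8 bytes, with '=' padding removed."""
    raw = base64.urlsafe_b64encode(value.encode("utf-8"))
    return raw.rstrip(b"=").decode("ascii")


def fits_key_limit(value: str) -> bool:
    """True if both the original and the encoded value fit the key limit."""
    return (len(value) <= MAX_KEY_LENGTH
            and len(encoded_key(value)) <= MAX_KEY_LENGTH)


# Hypothetical blob paths of different lengths.
short_path = "https://account.blob.core.windows.net/docs/" + "a" * 100
long_path = "https://account.blob.core.windows.net/docs/" + "a" * 800

print(fits_key_limit(short_path))  # True
print(fits_key_limit(long_path))   # False
```

Under these assumptions, 768 bytes of input is the largest value that still encodes to 1,024 characters.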

Example - make a Base64-encoded field searchable

There are times when you need to use an encoded version of a field like "metadata_storage_path" as the key, but also need an un-encoded version for full text search. To support both scenarios, you can map "metadata_storage_path" to two fields: one for the key (encoded), and a second for a path field that we can assume is attributed as "searchable" in the index schema.

PUT /indexers/blob-indexer?api-version=2020-06-30
{
    "dataSourceName" : "blob-datasource",
    "targetIndexName" : "my-target-index",
    "schedule" : { "interval" : "PT2H" },
    "fieldMappings" : [
        { "sourceFieldName" : "metadata_storage_path", "targetFieldName" : "key", "mappingFunction" : { "name" : "base64Encode" } },
        { "sourceFieldName" : "metadata_storage_path", "targetFieldName" : "path" }
      ]
}

Example - preserve original values

The blob storage indexer automatically adds a field mapping from metadata_storage_path, the URI of the blob, to the index key field if no field mapping is specified. This value is Base64 encoded so it's safe to use as an Azure Cognitive Search document key. The following example shows how to simultaneously map a URL-safe Base64-encoded version of metadata_storage_path to an index_key field and preserve the original value in a metadata_storage_path field:

"fieldMappings": [
  {
    "sourceFieldName": "metadata_storage_path",
    "targetFieldName": "metadata_storage_path"
  },
  {
    "sourceFieldName": "metadata_storage_path",
    "targetFieldName": "index_key",
    "mappingFunction": {
       "name": "base64Encode"
    }
  }
]

If you don't include a parameters property for your mapping function, it defaults to the value {"useHttpServerUtilityUrlTokenEncode" : true}.

Azure Cognitive Search supports two different Base64 encodings. You should use the same parameters when encoding and decoding the same field. For more information, see base64 encoding options to decide which parameters to use.

base64Decode function

Performs Base64 decoding of the input string. The input is assumed to be a URL-safe Base64-encoded string.

Example - decode blob metadata or URLs

Your source data might contain Base64-encoded strings, such as blob metadata strings or web URLs, that you want to make searchable as plain text. You can use the base64Decode function to turn the encoded data back into regular strings when populating your search index.

"fieldMappings" : [
  {
    "sourceFieldName" : "Base64EncodedMetadata",
    "targetFieldName" : "SearchableMetadata",
    "mappingFunction" : { 
      "name" : "base64Decode", 
      "parameters" : { "useHttpServerUtilityUrlTokenDecode" : false }
    }
  }]

If you don't include a parameters property, it defaults to the value {"useHttpServerUtilityUrlTokenDecode" : true}.

Azure Cognitive Search supports two different Base64 encodings. You should use the same parameters when encoding and decoding the same field. For more details, see base64 encoding options to decide which parameters to use.

base64 encoding options

Azure Cognitive Search supports URL-safe base64 encoding and normal base64 encoding. A string that is base64 encoded during indexing should be decoded later with the same encoding options, or else the result won't match the original.

If the useHttpServerUtilityUrlTokenEncode or useHttpServerUtilityUrlTokenDecode parameters for encoding and decoding respectively are set to true, then base64Encode behaves like HttpServerUtility.UrlTokenEncode and base64Decode behaves like HttpServerUtility.UrlTokenDecode.
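
If you need to reproduce this token format outside the .NET Framework, the following Python sketch emulates it. It rests on an assumption about how HttpServerUtility.UrlTokenEncode behaves: URL-safe Base64 in which the trailing "=" padding is replaced by a single digit giving the number of padding characters. The url_token_encode and url_token_decode names are hypothetical; verify the output against values produced by your own indexer before relying on it.

```python
import base64


def url_token_encode(data: bytes) -> str:
    """Emulate HttpServerUtility.UrlTokenEncode (assumed behavior)."""
    if not data:
        return ""
    encoded = base64.urlsafe_b64encode(data).decode("ascii")
    stripped = encoded.rstrip("=")
    pad_count = len(encoded) - len(stripped)
    # The padding characters are replaced by one digit: 0, 1, or 2.
    return stripped + str(pad_count)


def url_token_decode(token: str) -> bytes:
    """Reverse url_token_encode by restoring the '=' padding."""
    if not token:
        return b""
    pad_count = int(token[-1])
    return base64.urlsafe_b64decode(token[:-1] + "=" * pad_count)


token = url_token_encode("00>00?00".encode("utf-8"))
print(token)                                   # MDA-MDA_MDA1
print(url_token_decode(token).decode("utf-8"))  # 00>00?00
```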

Warning

If base64Encode is used to produce key values, useHttpServerUtilityUrlTokenEncode must be set to true. Only URL-safe base64 encoding can be used for key values. See Naming rules for the full set of restrictions on characters in key values.

The .NET libraries in Azure Cognitive Search assume the full .NET Framework, which provides built-in encoding. The useHttpServerUtilityUrlTokenEncode and useHttpServerUtilityUrlTokenDecode options leverage this built-in functionality. If you are using .NET Core or another framework, we recommend setting those options to false and calling your framework's encoding and decoding functions directly.

The following table compares different base64 encodings of the string 00>00?00. To determine the required additional processing (if any) for your base64 functions, apply your library encode function on the string 00>00?00 and compare the output with the expected output MDA-MDA_MDA.

Encoding | Base64 encode output | Additional processing after library encoding | Additional processing before library decoding
Base64 with padding | MDA+MDA/MDA= | Use URL-safe characters and remove padding | Use standard base64 characters and add padding
Base64 without padding | MDA+MDA/MDA | Use URL-safe characters | Use standard base64 characters
URL-safe base64 with padding | MDA-MDA_MDA= | Remove padding | Add padding
URL-safe base64 without padding | MDA-MDA_MDA | None | None
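
As one way to run this comparison, the following Python sketch encodes 00>00?00 with the standard library's base64 module and then applies the additional processing needed to reach the unpadded URL-safe form. The helper names to_unpadded_url_safe and from_unpadded_url_safe are illustrative; other languages need the equivalent steps for whichever row matches their library.

```python
import base64

data = "00>00?00".encode("utf-8")

# What Python's library produces for the padded rows of the table.
standard_padded = base64.b64encode(data).decode("ascii")          # MDA+MDA/MDA=
url_safe_padded = base64.urlsafe_b64encode(data).decode("ascii")  # MDA-MDA_MDA=


def to_unpadded_url_safe(encoded: str) -> str:
    """Post-process library output: URL-safe characters, no padding."""
    return encoded.replace("+", "-").replace("/", "_").rstrip("=")


def from_unpadded_url_safe(token: str) -> bytes:
    """Pre-process before decoding: restore padding, then decode."""
    padding = "=" * (-len(token) % 4)
    return base64.urlsafe_b64decode(token + padding)


print(to_unpadded_url_safe(standard_padded))                 # MDA-MDA_MDA
print(from_unpadded_url_safe("MDA-MDA_MDA").decode("utf-8"))  # 00>00?00
```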

extractTokenAtPosition function

Splits a string field using the specified delimiter, and picks the token at the specified position in the resulting split.

This function uses the following parameters:

  • delimiter: a string to use as the separator when splitting the input string.
  • position: an integer zero-based position of the token to pick after the input string is split.

For example, if the input is Jane Doe, the delimiter is " " (space), and the position is 0, the result is Jane; if the position is 1, the result is Doe. If the position refers to a token that doesn't exist, an error is returned.

Example - extract a name

Your data source contains a PersonName field, and you want to index it as two separate FirstName and LastName fields. You can use this function to split the input using the space character as the delimiter.

"fieldMappings" : [
  {
    "sourceFieldName" : "PersonName",
    "targetFieldName" : "FirstName",
    "mappingFunction" : { "name" : "extractTokenAtPosition", "parameters" : { "delimiter" : " ", "position" : 0 } }
  },
  {
    "sourceFieldName" : "PersonName",
    "targetFieldName" : "LastName",
    "mappingFunction" : { "name" : "extractTokenAtPosition", "parameters" : { "delimiter" : " ", "position" : 1 } }
  }]
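
To preview what these mappings would produce for your own data, you can approximate the function locally. The following Python sketch is a hypothetical stand-in, not the service's implementation; in particular, how the service treats repeated or leading delimiters is not covered here and might differ from str.split.

```python
def extract_token_at_position(value: str, delimiter: str, position: int) -> str:
    """Split on the delimiter and return the token at a zero-based position."""
    tokens = value.split(delimiter)
    if not 0 <= position < len(tokens):
        # Mirrors the documented behavior: a missing token is an error.
        raise IndexError(f"no token at position {position} in {value!r}")
    return tokens[position]


first_name = extract_token_at_position("Jane Doe", " ", 0)
last_name = extract_token_at_position("Jane Doe", " ", 1)
print(first_name, last_name)  # Jane Doe

# Names with more than two parts don't split cleanly into first and last.
print(extract_token_at_position("Jane Q. Doe", " ", 1))  # Q.
```

The last call shows a limitation worth checking in your data: with a middle initial, position 1 no longer holds the last name.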

jsonArrayToStringCollection function

Transforms a string formatted as a JSON array of strings into a string array that can be used to populate a Collection(Edm.String) field in the index.

For example, if the input string is ["red", "white", "blue"], then the target field of type Collection(Edm.String) will be populated with the three values red, white, and blue. For input values that cannot be parsed as JSON string arrays, an error is returned.

Example - populate collection from relational data

Azure SQL Database doesn't have a built-in data type that naturally maps to Collection(Edm.String) fields in Azure Cognitive Search. To populate string collection fields, you can pre-process your source data as a JSON string array and then use the jsonArrayToStringCollection mapping function.

"fieldMappings" : [
  {
    "sourceFieldName" : "tags", 
    "mappingFunction" : { "name" : "jsonArrayToStringCollection" }
  }]
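
The pre-processing step and the conversion can both be sketched in Python. Here, json.dumps stands in for whatever produces the JSON text stored in the source column, and json_array_to_string_collection is a hypothetical local approximation of the mapping function, useful for validating source values before indexing.

```python
import json
from typing import List


def json_array_to_string_collection(value: str) -> List[str]:
    """Parse a JSON array of strings; reject anything else."""
    parsed = json.loads(value)  # raises ValueError on malformed JSON
    if not isinstance(parsed, list) or not all(isinstance(x, str) for x in parsed):
        raise ValueError(f"not a JSON array of strings: {value!r}")
    return parsed


# Pre-processing side: store the tags as JSON text in a string column.
tags_column = json.dumps(["red", "white", "blue"])
print(tags_column)  # ["red", "white", "blue"]

# Indexing side: the string becomes a collection of three values.
print(json_array_to_string_collection(tags_column))  # ['red', 'white', 'blue']
```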

urlEncode function

This function encodes a string so that it is "URL safe". When applied to a string that contains characters that aren't allowed in a URL, it converts those "unsafe" characters into their percent-encoded equivalents. The function uses the UTF-8 encoding format.

Example - document key lookup

The urlEncode function can be used as an alternative to the base64Encode function when you only need to convert URL-unsafe characters and want to keep other characters as-is.

For example, if the input string is <hello>, the target field of type Edm.String is populated with the value %3chello%3e.

When you retrieve the encoded key at search time, you can then use the urlDecode function to get the original key value, and use that to retrieve the source document.

"fieldMappings" : [
  {
    "sourceFieldName" : "SourceKey",
    "targetFieldName" : "IndexKey",
    "mappingFunction" : {
      "name" : "urlEncode"
    }
  }]
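
If you need to compute the encoded key on the client, for example to look up a document, a rough equivalent in Python is urllib.parse.quote. This is a sketch under assumptions: Python emits uppercase hex digits (%3C rather than %3c), and the exact set of characters the service leaves unescaped might differ, so compare against keys produced by your indexer.

```python
from urllib.parse import quote

# safe="" asks quote to escape every reserved character, including "/".
encoded = quote("<hello>", safe="")
print(encoded)  # %3Chello%3E

# Percent-encoding hex digits are case-insensitive, so lowercasing the
# escapes yields the form shown in the example above.
print(encoded.lower())  # %3chello%3e
```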

urlDecode function

This function converts a URL-encoded string into a decoded string using UTF-8 encoding format.

Example - decode blob metadata

Some Azure storage clients automatically URL-encode blob metadata if it contains non-ASCII characters. However, if you want to make such metadata searchable (as plain text), you can use the urlDecode function to turn the encoded data back into regular strings when populating your search index.

"fieldMappings" : [
 {
   "sourceFieldName" : "UrlEncodedMetadata",
   "targetFieldName" : "SearchableMetadata",
   "mappingFunction" : {
     "name" : "urlDecode"
   }
 }]
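
To check what a given metadata value would look like after decoding, you can approximate the function locally. The following Python sketch uses urllib.parse.unquote, which decodes percent-escapes as UTF-8 by default; the sample metadata value is made up for illustration.

```python
from urllib.parse import unquote

# A metadata value as a storage client might have URL-encoded it.
encoded_metadata = "R%C3%A9sum%C3%A9%202023"

decoded_metadata = unquote(encoded_metadata)
print(decoded_metadata)  # Résumé 2023
```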

fixedLengthEncode function

This function converts a string of any length to a fixed-length string.

Example - map document keys that are too long

If you get errors because a document key is longer than 1,024 characters, you can apply this function to reduce the length of the key.


"fieldMappings" : [
 {
   "sourceFieldName" : "metadata_storage_path",
   "targetFieldName" : "your key field",
   "mappingFunction" : {
     "name" : "fixedLengthEncode"
   }
 }]