Analyzers for text processing in Azure Search

An analyzer is a component of the full text search engine responsible for processing text in query strings and indexed documents. Different analyzers manipulate text in different ways depending on the scenario. Language analyzers process text using linguistic rules in order to improve search quality, while other analyzers perform more basic tasks like converting characters to lower case, for example.

Language analyzers are the most frequently used, and a default analyzer is assigned to every searchable field in an Azure Search index. The following language transformations are typical during text analysis:

  • Non-essential words (stopwords) and punctuation are removed.
  • Phrases and hyphenated words are broken down into component parts.
  • Upper-case words are lower-cased.
  • Words are reduced to root forms so that a match can be found regardless of tense.

Language analyzers convert a text input into primitive or root forms that are efficient for information storage and retrieval. Conversion occurs during indexing, when the index is built, and then again during search when the index is read. You are more likely to get the search results you expect if you use the same analyzer for both operations.

Default analyzer

Azure Search uses the Apache Lucene Standard analyzer (standard.lucene) as the default. It breaks text into elements following the "Unicode Text Segmentation" rules and converts all characters to their lowercase form. Both indexed documents and search terms go through text analysis during indexing and query processing.

The default analyzer is used automatically on every searchable field. You can override it on a field-by-field basis. An alternative analyzer can be a language analyzer, a custom analyzer, or a predefined analyzer from the list of available analyzers.

Types of analyzers

The following list describes the categories of analyzers available in Azure Search.

Standard Lucene analyzer
Default. No specification or configuration is required. This general-purpose analyzer performs well for most languages and scenarios.

Predefined analyzers
Offered as a finished product intended to be used as-is. There are two types: specialized and language. What makes them "predefined" is that you reference them by name, with no configuration or customization.

  • Specialized (language-agnostic) analyzers are used when text inputs require specialized processing or minimal processing. Non-language predefined analyzers include Asciifolding, Keyword, Pattern, Simple, Stop, and Whitespace.

  • Language analyzers are used when you need rich linguistic support for individual languages. Azure Search supports 35 Lucene language analyzers and 50 Microsoft natural language processing analyzers.

Custom analyzers
A user-defined combination of existing elements, consisting of one tokenizer (required) and optional filters (char or token).

A few predefined analyzers, such as Pattern or Stop, support a limited set of configuration options. To set these options, you effectively create a custom analyzer, consisting of the predefined analyzer and one of the alternative options documented in the Predefined Analyzer Reference. As with any custom configuration, give your new configuration a name, such as myPatternAnalyzer, to distinguish it from the Lucene Pattern analyzer.
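For example, the following fragment is a minimal sketch of a custom-configured pattern analyzer; the name myPatternAnalyzer and the choice of a semicolon pattern are hypothetical, and the full set of options is documented in the Predefined Analyzer Reference:

  "analyzers":[
     {
        "name":"myPatternAnalyzer",
        "@odata.type":"#Microsoft.Azure.Search.PatternAnalyzer",
        "lowercase":true,
        "pattern":";"
     }
  ]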

How to specify analyzers

  1. (for custom analyzers only) Create a named analyzer section in the index definition. For more information, see Create Index and also Add custom analyzers.

  2. On a field definition in the index, set the field's analyzer property to the name of a target analyzer (for example, "analyzer": "keyword"), as shown in the fragment after these steps. Valid values include the name of a predefined analyzer, a language analyzer, or a custom analyzer defined in the same index schema. Plan on assigning analyzers during the index definition phase, before the index is created in the service.

  3. Optionally, instead of one analyzer property, you can set different analyzers for indexing and querying using the indexAnalyzer and searchAnalyzer field parameters. You would use different analyzers for data preparation and retrieval if one of those activities required a specific transformation not needed by the other.
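To make step 2 concrete, here is a minimal sketch of a field definition that assigns the predefined keyword analyzer by name; the field name tags is a hypothetical example:

  {
     "name":"tags",
     "type":"Edm.String",
     "searchable":true,
     "analyzer":"keyword"
  }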

Assigning analyzer or indexAnalyzer to a field that has already been physically created is not allowed. If any of this is unclear, review the following breakdown of which actions require a rebuild and why.

Add a new field (impact: minimal)
If the field doesn't exist yet in the schema, there is no field revision to make because the field does not yet have a physical presence in your index. You can use Update Index to add the new field to an existing index, and mergeOrUpload to populate it.

Add an analyzer or indexAnalyzer to an existing indexed field (impact: rebuild)
The inverted index for that field must be recreated from the ground up, and the content for those fields must be reindexed.

For indexes under active development, delete and re-create the index to pick up the new field definition.

For indexes in production, you can defer a rebuild by creating a new field that provides the revised definition and starting to use it in place of the old one. Use Update Index to incorporate the new field and mergeOrUpload to populate it, as shown in the sketch after this list. Later, as part of planned index servicing, you can clean up the index to remove obsolete fields.
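As a sketch of that production path, suppose a hypothetical replacement field text2 has been added with Update Index; a mergeOrUpload request to populate it might look like the following (service name and api-version are placeholders):

  POST https://[service name].search.windows.net/indexes/myindex/docs/index?api-version=[api-version]
  {
     "value":[
        {
           "@search.action":"mergeOrUpload",
           "id":"1",
           "text2":"content reindexed under the new analyzer"
        }
     ]
  }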

When to add analyzers

The best time to add and assign analyzers is during active development, when dropping and recreating indexes is routine.

As an index definition solidifies, you can append new analysis constructs to an index, but you will need to pass the allowIndexDowntime flag to Update Index if you want to avoid this error:

"Index update not allowed because it would cause downtime. In order to add new analyzers, tokenizers, token filters, or character filters to an existing index, set the 'allowIndexDowntime' query parameter to 'true' in the index update request. Note that this operation will put your index offline for at least a few seconds, causing your indexing and query requests to fail. Performance and write availability of the index can be impaired for several minutes after the index is updated, or longer for very large indexes."

The same holds true when assigning an analyzer to a field. An analyzer is an integral part of the field's definition, so you can only add it when the field is created. If you want to add analyzers to existing fields, you'll have to drop and rebuild the index, or add a new field with the analyzer you want.

As noted, an exception is the searchAnalyzer variant. Of the three ways to specify analyzers (analyzer, indexAnalyzer, searchAnalyzer), only the searchAnalyzer attribute can be changed on an existing field.
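For example, assuming the text field from the mixed-analyzer example later in this article, an Update Index request could swap its searchAnalyzer from simple to the predefined stop analyzer while leaving indexAnalyzer untouched. A minimal sketch of the revised field definition:

  {
     "name":"text",
     "type":"Edm.String",
     "searchable":true,
     "indexAnalyzer":"whitespace",
     "searchAnalyzer":"stop"
  }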

Recommendations for working with analyzers

This section offers advice on how to work with analyzers.

One analyzer for read-write unless you have specific requirements

Azure Search lets you specify different analyzers for indexing and search via additional indexAnalyzer and searchAnalyzer field parameters. If unspecified, the analyzer set with the analyzer property is used for both indexing and searching. If analyzer is unspecified, the default Standard Lucene analyzer is used.

A general rule is to use the same analyzer for both indexing and querying, unless specific requirements dictate otherwise, and to test thoroughly. When text processing differs at search time and indexing time, you run the risk of a mismatch between query terms and indexed terms.

Test during active development

Overriding the standard analyzer requires an index rebuild. If possible, decide on which analyzers to use during active development, before rolling an index into production.

Inspect tokenized terms

If a search fails to return expected results, the most likely cause is a discrepancy between the terms produced from the query input and the tokenized terms in the index. If the tokens aren't the same, matches fail to materialize. To inspect tokenizer output, we recommend using the Analyze API as an investigation tool. The response consists of tokens, as generated by a specific analyzer.
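As a sketch, assuming an index named myindex and placeholder service name and api-version, an Analyze API request looks like this:

  POST https://[service name].search.windows.net/indexes/myindex/analyze?api-version=[api-version]
  {
     "text":"Search-as-a-service",
     "analyzer":"standard.lucene"
  }

A representative response follows; the standard analyzer breaks on the hyphens and lower-cases each element:

  {
     "tokens":[
        { "token":"search", "startOffset":0, "endOffset":6, "position":0 },
        { "token":"as", "startOffset":7, "endOffset":9, "position":1 },
        { "token":"a", "startOffset":10, "endOffset":11, "position":2 },
        { "token":"service", "startOffset":12, "endOffset":19, "position":3 }
     ]
  }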

Compare English analyzers

The Search Analyzer Demo is a third-party demo app showing a side-by-side comparison of the standard Lucene analyzer, Lucene's English language analyzer, and Microsoft's English natural language processor. The index is fixed; it contains text from a popular story. For each search input you provide, results from each analyzer are displayed in adjacent panes, giving you a sense of how each analyzer processes the same string.

Examples

The examples below show analyzer definitions for a few key scenarios.

Custom analyzer example

This example illustrates an analyzer definition with custom options. Custom options for char filters, tokenizers, and token filters are specified separately as named constructs, and then referenced in the analyzer definition. Predefined elements are used as-is and simply referenced by name.

Walking through this example:

  • An analyzer is assigned through a property on the field definition of a searchable field.
  • A custom analyzer is part of an index definition. It might be lightly customized (for example, customizing a single option in one filter) or customized in multiple places.
  • In this case, the custom analyzer is "my_analyzer", which in turn uses a customized standard tokenizer "my_standard_tokenizer" and two token filters: the predefined lowercase filter and a customized asciifolding filter, "my_asciifolding".
  • It also defines two custom char filters, "map_dash" and "remove_whitespace". The first replaces all dashes with underscores, while the second removes all spaces. Spaces must be UTF-8 encoded in the mapping rules. The char filters are applied before tokenization and affect the resulting tokens (the standard tokenizer breaks on dashes and spaces, but not on underscores).
  {
     "name":"myindex",
     "fields":[
        {
           "name":"id",
           "type":"Edm.String",
           "key":true,
           "searchable":false
        },
        {
           "name":"text",
           "type":"Edm.String",
           "searchable":true,
           "analyzer":"my_analyzer"
        }
     ],
     "analyzers":[
        {
           "name":"my_analyzer",
           "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
           "charFilters":[
              "map_dash",
              "remove_whitespace"
           ],
           "tokenizer":"my_standard_tokenizer",
           "tokenFilters":[
              "my_asciifolding",
              "lowercase"
           ]
        }
     ],
     "charFilters":[
        {
           "name":"map_dash",
           "@odata.type":"#Microsoft.Azure.Search.MappingCharFilter",
           "mappings":["-=>_"]
        },
        {
           "name":"remove_whitespace",
           "@odata.type":"#Microsoft.Azure.Search.MappingCharFilter",
           "mappings":["\\u0020=>"]
        }
     ],
     "tokenizers":[
        {
           "name":"my_standard_tokenizer",
           "@odata.type":"#Microsoft.Azure.Search.StandardTokenizerV2",
           "maxTokenLength":20
        }
     ],
     "tokenFilters":[
        {
           "name":"my_asciifolding",
           "@odata.type":"#Microsoft.Azure.Search.AsciiFoldingTokenFilter",
           "preserveOriginal":true
        }
     ]
  }

Per-field analyzer assignment example

The Standard analyzer is the default. Suppose you want to replace the default with a different predefined analyzer, such as the pattern analyzer. If you are not setting custom options, you only need to specify it by name in the field definition.

The "analyzer" element overrides the Standard analyzer on a field-by-field basis. There is no global override. In this example, text1 uses the pattern analyzer and text2, which doesn't specify an analyzer, uses the default.

  {
     "name":"myindex",
     "fields":[
        {
           "name":"id",
           "type":"Edm.String",
           "key":true,
           "searchable":false
        },
        {
           "name":"text1",
           "type":"Edm.String",
           "searchable":true,
           "analyzer":"pattern"
        },
        {
           "name":"text2",
           "type":"Edm.String",
           "searchable":true
        }
     ]
  }

Mixing analyzers for indexing and search operations

The APIs include additional field attributes for specifying different analyzers for indexing and search. The searchAnalyzer and indexAnalyzer attributes must be specified as a pair, replacing the single analyzer attribute.

  {
     "name":"myindex",
     "fields":[
        {
           "name":"id",
           "type":"Edm.String",
           "key":true,
           "searchable":false
        },
        {
           "name":"text",
           "type":"Edm.String",
           "searchable":true,
           "indexAnalyzer":"whitespace",
           "searchAnalyzer":"simple"
        }
     ]
  }

Language analyzer example

Fields containing strings in different languages can use a language analyzer, while other fields retain the default (or use some other predefined or custom analyzer). If you use a language analyzer, it must be used for both indexing and search operations. Fields that use a language analyzer cannot have different analyzers for indexing and search.

  {
     "name":"myindex",
     "fields":[
        {
           "name":"id",
           "type":"Edm.String",
           "key":true,
           "searchable":false
        },
        {
           "name":"text",
           "type":"Edm.String",
           "searchable":true,
           "indexAnalyzer":"whitespace",
           "searchAnalyzer":"simple"
        },
        {
           "name":"text_fr",
           "type":"Edm.String",
           "searchable":true,
           "analyzer":"fr.lucene"
        }
     ]
  }

See also

  • Search Documents REST API
  • Simple query syntax
  • Full Lucene query syntax
  • Handle search results