Custom analyzers in Azure Search

Note

Analyzers are a specific component of search technology. If you came to this page looking for information on how to analyze log or traffic data in Azure Search, please see Enabling and using Search Traffic Analytics instead.

Overview

The role of a full-text search engine, in simple terms, is to process and store documents in a way that enables efficient querying and retrieval. At a high level, it all comes down to extracting important words from documents, putting them in an index, and then using the index to find documents that match words of a given query. The process of extracting words from documents and search queries is called lexical analysis. Components that perform lexical analysis are called analyzers.

In Azure Search you can choose from a set of predefined language-agnostic analyzers in the Analyzers table and language-specific analyzers listed in Language analyzers (Azure Search Service REST API). You also have the option to define your own custom analyzers.

A custom analyzer allows you to take control over the process of converting text into indexable and searchable tokens. It’s a user-defined configuration consisting of a single predefined tokenizer, one or more token filters, and one or more char filters. The tokenizer is responsible for breaking text into tokens, and the token filters for modifying tokens emitted by the tokenizer. Char filters are applied to prepare input text before it is processed by the tokenizer. For instance, a char filter can replace certain characters or symbols.

Popular scenarios enabled by custom analyzers include:

  • Phonetic search. Add a phonetic filter to enable searching based on how a word sounds, not how it’s spelled. A sketch of this configuration appears at the end of this overview.

  • Disable lexical analysis. Use the Keyword analyzer to create searchable fields that are not analyzed.

  • Fast prefix/suffix search. Add the Edge N-gram token filter to index prefixes of words to enable fast prefix matching. Combine it with the Reverse token filter to do suffix matching.

  • Custom tokenization. For example, use the Whitespace tokenizer to break sentences into tokens using whitespace as a delimiter.

  • ASCII folding. Add the Standard ASCII folding filter to normalize diacritics like ö or ê in search terms.

You can define multiple custom analyzers to vary the combination of filters, but each field can only use one analyzer for indexing analysis and one for search analysis.

This page provides a list of supported analyzers, tokenizers, token filters, and char filters. You will also find a description of changes to the index definition with a usage example. For more background about the underlying technology leveraged in the Azure Search implementation, see Analysis package summary (Lucene).
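As an illustration of the first scenario (phonetic search), a custom analyzer might look like the following sketch. The analyzer name my_phonetic_analyzer is illustrative; the standard_v2 tokenizer and the lowercase and phonetic token filters are predefined components referenced by name (see the tables later in this page).

"analyzers":[
   {
      "name":"my_phonetic_analyzer",
      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer":"standard_v2",
      "tokenFilters":[
         "lowercase",
         "phonetic"
      ]
   }
]

The analyzer is then assigned to a searchable field with "analyzer":"my_phonetic_analyzer", as shown in the Examples section.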

Default analyzer

The default analyzer is the Apache Lucene StandardAnalyzer (standard lucene).

It's used automatically on every searchable field unless you explicitly override it with another analyzer within the field definition. The alternative can be a custom analyzer or a different predefined analyzer from the list of available Analyzers below.

Validation rules

Names of analyzers, tokenizers, token filters, and char filters have to be unique and cannot be the same as any of the predefined analyzers, tokenizers, token filters, or char filters. See the Property Reference for names already in use.

Create a custom analyzer

You can define custom analyzers at index creation time. The syntax for specifying a custom analyzer is described in this section. You can also familiarize yourself with the syntax by reviewing sample definitions in the Examples section further on.

An analyzer definition includes a name, a type, one or more char filters, a maximum of one tokenizer, and one or more token filters for post-tokenization processing. Char filters are applied before tokenization. Token filters and char filters are applied in the order listed, from left to right.

The tokenizer_name is the name of a tokenizer, token_filter_name_1 and token_filter_name_2 are the names of token filters, and char_filter_name_1 and char_filter_name_2 are the names of char filters (see the Tokenizers, Token filters and Char filters tables for valid values).

The analyzer definition is a part of the larger index. See Create Index preview API for information about the rest of the index.

"analyzers":(optional)[  
   {  
      "name":"name of analyzer",  
      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",  
      "charFilters":[  
         "char_filter_name_1",  
         "char_filter_name_2"  
      ],  
      "tokenizer":"tokenizer_name",  
      "tokenFilters":[  
         "token_filter_name_1",  
         "token_filter_name_2"  
      ]  
   },  
   {  
      "name":"name of analyzer",  
      "@odata.type":"#analyzer_type",  
      "option1":value1,  
      "option2":value2,  
      ...  
   }  
],  
"charFilters":(optional)[  
   {  
      "name":"char_filter_name",  
      "@odata.type":"#char_filter_type",  
      "option1":value1,  
      "option2":value2,  
      ...  
   }  
],  
"tokenizers":(optional)[  
   {  
      "name":"tokenizer_name",  
      "@odata.type":"#tokenizer_type",  
      "option1":value1,  
      "option2":value2,  
      ...  
   }  
],  
"tokenFilters":(optional)[  
   {  
      "name":"token_filter_name",  
      "@odata.type":"#token_filter_type",  
      "option1":value1,  
      "option2":value2,  
      ...  
   }  
]  
Note

Custom analyzers that you create are not exposed in the Azure portal. The only way to add a custom analyzer is through code that makes calls to the API when defining an index.

Within an index definition, you can place this section anywhere in the body of a create index request but usually it goes at the end:

{  
  "name": "name_of_index",  
  "fields": [ ],  
  "suggesters": [ ],  
  "scoringProfiles": [ ],  
  "defaultScoringProfile": (optional) "...",  
  "corsOptions": (optional) { },  
  "analyzers":(optional)[ ],  
  "charFilters":(optional)[ ],  
  "tokenizers":(optional)[ ],  
  "tokenFilters":(optional)[ ]  
}  

Definitions for char filters, tokenizers, and token filters are added to the index only if you are setting custom options. To use an existing filter or tokenizer as-is, you can simply specify it by name in the analyzer definition.
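For example, the following sketch defines an analyzer built entirely from predefined components (the whitespace tokenizer and the lowercase and asciifolding token filters), so no charFilters, tokenizers, or tokenFilters sections are required. The name my_simple_analyzer is illustrative.

"analyzers":[
   {
      "name":"my_simple_analyzer",
      "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer":"whitespace",
      "tokenFilters":[
         "lowercase",
         "asciifolding"
      ]
   }
]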

Test a custom analyzer

You can use the Test Analyzer operation in the Preview REST API to see how an analyzer breaks given text into tokens.

Request

  POST https://[search service name].search.windows.net/indexes/[index name]/analyze?api-version=[api-version]
  Content-Type: application/json
  api-key: [admin key]

  {
     "analyzer":"my_analyzer",
     "text": "Vis-à-vis means Opposite"
  }

Response

  {
    "tokens": [
      {
        "token": "vis_a_vis",
        "startOffset": 0,
        "endOffset": 9,
        "position": 0
      },
      {
        "token": "vis_à_vis",
        "startOffset": 0,
        "endOffset": 9,
        "position": 0
      },
      {
        "token": "means",
        "startOffset": 10,
        "endOffset": 15,
        "position": 1
      },
      {
        "token": "opposite",
        "startOffset": 16,
        "endOffset": 24,
        "position": 2
      }
    ]
  }

Update a custom analyzer

Once an analyzer, a tokenizer, a token filter or a char filter is defined, it cannot be modified. New ones can be added to an existing index only if the allowIndexDowntime flag is set to true in the index update request:

PUT https://[search service name].search.windows.net/indexes/[index name]?api-version=[api-version]&allowIndexDowntime=true

Note that this operation will take your index offline for at least a few seconds, causing your indexing and query requests to fail. Performance and write availability of the index can be impaired for several minutes after the index is updated, or longer for very large indexes, but these effects are temporary and will eventually resolve on their own.
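As a sketch, a request that adds a new token filter to an existing index might look like the following. The index name myindex and filter name my_asciifolding are illustrative, and the body must contain the complete index definition with the new component included.

PUT https://[search service name].search.windows.net/indexes/myindex?api-version=[api-version]&allowIndexDowntime=true
Content-Type: application/json
api-key: [admin key]

{
   "name":"myindex",
   "fields":[ ... ],
   "tokenFilters":[
      {
         "name":"my_asciifolding",
         "@odata.type":"#Microsoft.Azure.Search.AsciiFoldingTokenFilter",
         "preserveOriginal":true
      }
   ]
}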

Index Attribute Reference

The tables below list the configuration properties for the analyzers, tokenizers, token filters and char filter section of an index definition. The structure of an analyzer, tokenizer, or filter in your index is composed of these attributes. For value assignment information, see the Property Reference.

Analyzers

For analyzers, index attributes vary depending on whether you're using predefined or custom analyzers.

Predefined Analyzers

Name: Must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
Type: Analyzer type from the list of supported analyzers. See the analyzer_type column in the Analyzers table below.
Options: Must be valid options of a predefined analyzer listed in the Analyzers table below.

Custom Analyzers

Name: Must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
Type: Must be "#Microsoft.Azure.Search.CustomAnalyzer".
CharFilters: Each char filter is either one of the predefined char filters listed in the Char Filters table or a custom char filter defined in the index definition.
Tokenizer: Required. Must be one of the predefined tokenizers listed in the Tokenizers table below or a custom tokenizer defined in the index definition.
TokenFilters: Each token filter is either one of the predefined token filters listed in the Token filters table or a custom token filter defined in the index definition.

Char Filters

A char filter is used to prepare input text before it is processed by the tokenizer. For instance, it can replace certain characters or symbols. You can have multiple char filters in a custom analyzer. Char filters run in the order in which they are listed.

Name: Must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
Type: Char filter type from the list of supported char filters. See the char_filter_type column in the Char Filters table below.
Options: Must be valid options of a given char filter type.

Tokenizers

A tokenizer divides continuous text into a sequence of tokens, such as breaking a sentence into words.

You can specify exactly one tokenizer per custom analyzer. If you need more than one tokenizer, you can create multiple custom analyzers and assign them on a field-by-field basis in your index schema.
A custom analyzer can use a predefined tokenizer with either default or customized options.

Name: Must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
Type: Tokenizer type from the list of supported tokenizers. See the tokenizer_type column in the Tokenizers table below.
Options: Must be valid options of a given tokenizer type listed in the Tokenizers table below.

Token filters

A token filter is used to filter out or modify the tokens generated by a tokenizer. For example, you can specify a lowercase filter that converts all characters to lowercase.
You can have multiple token filters in a custom analyzer. Token filters run in the order in which they are listed.

Name: Must only contain letters, digits, spaces, dashes or underscores, can only start and end with alphanumeric characters, and is limited to 128 characters.
Type: Token filter type from the list of supported token filters. See the token_filter_type column in the Token filters table below.
Options: Must be valid options of a given token filter type.

Property reference

This section provides the valid values for attributes specified in the definition of a custom analyzer, tokenizer, char filter, or token filter in your index. Analyzers, tokenizers, and filters that are implemented using Apache Lucene have links to the Lucene API documentation.

Analyzers

analyzer_name analyzer_type 1 Description and Options
keyword (type applies only when options are available) Treats the entire content of a field as a single token. This is useful for data like zip codes, ids, and some product names.
pattern PatternAnalyzer Flexibly separates text into terms via a regular expression pattern.

Options

lowercase (type: bool) - Determines whether terms are lowercased. The default is true.

pattern (type: string) - A regular expression pattern to match token separators. The default is \w+.

flags (type: string) - Regular expression flags. The default is an empty string. Allowed values: CANON_EQ, CASE_INSENSITIVE, COMMENTS, DOTALL, LITERAL, MULTILINE, UNICODE_CASE, UNIX_LINES

stopwords (type: string array) - A list of stopwords. The default is an empty list.
simple (type applies only when options are available) Divides text at non-letters and converts them to lower case.
standard
(Also referred to as standard.lucene)
StandardAnalyzer Standard Lucene analyzer, composed of the standard tokenizer, lowercase filter and stop filter.

Options

maxTokenLength (type: int) - The maximum token length. The default is 255. Tokens longer than the maximum length are split. Maximum token length that can be used is 300 characters.

stopwords (type: string array) - A list of stopwords. The default is an empty list.
standardasciifolding.lucene (type applies only when options are available) Standard analyzer with the ASCII folding filter.
stop StopAnalyzer Divides text at non-letters, applies the lowercase and stopword token filters.

Options

stopwords (type: string array) - A list of stopwords. The default is an empty list.
whitespace (type applies only when options are available) An analyzer that uses the whitespace tokenizer. Tokens that are longer than 255 characters are split.

1 Analyzer Types are always prefixed in code with "#Microsoft.Azure.Search" such that "PatternAnalyzer" would actually be specified as "#Microsoft.Azure.Search.PatternAnalyzer". We removed the prefix to reduce the width of the table, but please remember to include it in your code. Note that analyzer_type is only provided for analyzers that can be customized. If there are no options, as is the case with the keyword analyzer, there is no associated #Microsoft.Azure.Search type.
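For example, the following sketch configures the pattern analyzer with custom options so that it splits terms on whitespace only. The name my_pattern_whitespace is illustrative; the lowercase and pattern options are described in the table above.

"analyzers":[
   {
      "name":"my_pattern_whitespace",
      "@odata.type":"#Microsoft.Azure.Search.PatternAnalyzer",
      "lowercase":true,
      "pattern":"\\s+"
   }
]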

Char Filters

In the table below, the character filters that are implemented using Apache Lucene are linked to the Lucene API documentation.

char_filter_name char_filter_type 1 Description and Options
html_strip (type applies only when options are available) A char filter that attempts to strip out HTML constructs.
mapping MappingCharFilter A char filter that applies mappings defined with the mappings option. Matching is greedy (longest pattern matching at a given point wins). Replacement is allowed to be the empty string.

Options

mappings (type: string array) - A list of mappings of the following format: "a=>b" (all occurrences of the character "a" will be replaced with character "b"). Required.
pattern_replace PatternReplaceCharFilter A char filter that replaces characters in the input string. It uses a regular expression to identify character sequences to preserve and a replacement pattern to identify characters to replace. For example, input text = "aa bb aa bb", pattern="(aa)\\s+(bb)" replacement="$1#$2", result = "aa#bb aa#bb".

Options

pattern (type: string) - Required.

replacement (type: string) - Required.

1 Char Filter Types are always prefixed in code with "#Microsoft.Azure.Search" such that "MappingCharFilter" would actually be specified as "#Microsoft.Azure.Search.MappingCharFilter". We removed the prefix to reduce the width of the table, but please remember to include it in your code. Note that char_filter_type is only provided for filters that can be customized. If there are no options, as is the case with html_strip, there is no associated #Microsoft.Azure.Search type.
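For example, the following sketch defines a pattern_replace char filter that collapses runs of whitespace into a single space before tokenization. The name my_collapse_whitespace is illustrative.

"charFilters":[
   {
      "name":"my_collapse_whitespace",
      "@odata.type":"#Microsoft.Azure.Search.PatternReplaceCharFilter",
      "pattern":"\\s+",
      "replacement":" "
   }
]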

Tokenizers

In the table below, the tokenizers that are implemented using Apache Lucene are linked to the Lucene API documentation.

tokenizer_name tokenizer_type 1 Description and Options
classic ClassicTokenizer Grammar based tokenizer that is suitable for processing most European-language documents.

Options

maxTokenLength (type: int) - The maximum token length. Default: 255, maximum: 300. Tokens longer than the maximum length are split.
edgeNGram EdgeNGramTokenizer Tokenizes the input from an edge into n-grams of given size(s).

Options

minGram (type: int) - Default: 1, maximum: 300.

maxGram (type: int) - Default: 2, maximum: 300. Must be greater than minGram.

tokenChars (type: string array) - Character classes to keep in the tokens. Allowed values:
"letter", "digit", "whitespace", "punctuation", "symbol". Defaults to an empty array - keeps all characters.
keyword_v2 KeywordTokenizerV2 Emits the entire input as a single token.

Options

maxTokenLength (type: int) - The maximum token length. Default: 256, maximum: 300. Tokens longer than the maximum length are split.
letter (type applies only when options are available) Divides text at non-letters. Tokens that are longer than 255 characters are split.
lowercase (type applies only when options are available) Divides text at non-letters and converts them to lower case. Tokens that are longer than 255 characters are split.
microsoft_language_tokenizer MicrosoftLanguageTokenizer Divides text using language-specific rules.

Options

maxTokenLength (type: int) - The maximum token length, default: 255, maximum: 300. Tokens longer than the maximum length are split. Tokens longer than 300 characters are first split into tokens of length 300 and then each of those tokens is split based on the maxTokenLength set.

isSearchTokenizer (type: bool) - Set to true if used as the search tokenizer, set to false if used as the indexing tokenizer.

language (type: string) - Language to use, default "english". Allowed values include:
"bangla", "bulgarian", "catalan", "chineseSimplified", "chineseTraditional", "croatian", "czech", "danish", "dutch", "english", "french", "german", "greek", "gujarati", "hindi", "icelandic", "indonesian", "italian", "japanese", "kannada", "korean", "malay", "malayalam", "marathi", "norwegianBokmaal", "polish", "portuguese", "portugueseBrazilian", "punjabi", "romanian", "russian", "serbianCyrillic", "serbianLatin", "slovenian", "spanish", "swedish", "tamil", "telugu", "thai", "ukrainian", "urdu", "vietnamese"
microsoft_language_stemming_tokenizer MicrosoftLanguageStemmingTokenizer Divides text using language-specific rules and reduces words to their base forms.

Options

maxTokenLength (type: int) - The maximum token length, default: 255, maximum: 300. Tokens longer than the maximum length are split. Tokens longer than 300 characters are first split into tokens of length 300 and then each of those tokens is split based on the maxTokenLength set.

isSearchTokenizer (type: bool) - Set to true if used as the search tokenizer, set to false if used as the indexing tokenizer.

language (type: string) - Language to use, default "english". Allowed values include:
"arabic", "bangla", "bulgarian", "catalan", "croatian", "czech", "danish", "dutch", "english", "estonian", "finnish", "french", "german", "greek", "gujarati", "hebrew", "hindi", "hungarian", "icelandic", "indonesian", "italian", "kannada", "latvian", "lithuanian", "malay", "malayalam", "marathi", "norwegianBokmaal", "polish", "portuguese", "portugueseBrazilian", "punjabi", "romanian", "russian", "serbianCyrillic", "serbianLatin", "slovak", "slovenian", "spanish", "swedish", "tamil", "telugu", "turkish", "ukrainian", "urdu"
nGram NGramTokenizer Tokenizes the input into n-grams of the given size(s).

Options

minGram (type: int) - Default: 1, maximum: 300.

maxGram (type: int) - Default: 2, maximum: 300. Must be greater than minGram.

tokenChars (type: string array) - Character classes to keep in the tokens. Allowed values: "letter", "digit", "whitespace", "punctuation", "symbol". Defaults to an empty array - keeps all characters.
path_hierarchy_v2 PathHierarchyTokenizerV2 Tokenizer for path-like hierarchies.

Options

delimiter (type: string) - Default: '/'.

replacement (type: string) - If set, replaces the delimiter character. Default same as the value of delimiter.

maxTokenLength (type: int) - The maximum token length. Default: 300, maximum: 300. Paths longer than maxTokenLength are ignored.

reverse (type: bool) - If true, generates tokens in reverse order. Default: false.

skip (type: int) - Number of initial tokens to skip. The default is 0.
pattern PatternTokenizer This tokenizer uses regex pattern matching to construct distinct tokens.

Options

pattern (type: string) - Regular expression pattern. The default is \w+.

flags (type: string) - Regular expression flags. The default is an empty string. Allowed values: CANON_EQ, CASE_INSENSITIVE, COMMENTS, DOTALL, LITERAL, MULTILINE, UNICODE_CASE, UNIX_LINES

group (type: int) - Which group to extract into tokens. The default is -1 (split).
standard_v2 StandardTokenizerV2 Breaks text following the Unicode Text Segmentation rules.

Options

maxTokenLength (type: int) - The maximum token length. Default: 255, maximum: 300. Tokens longer than the maximum length are split.
uax_url_email UaxUrlEmailTokenizer Tokenizes URLs and emails as one token.

Options

maxTokenLength (type: int) - The maximum token length. Default: 255, maximum: 300. Tokens longer than the maximum length are split.
whitespace (type applies only when options are available) Divides text at whitespace. Tokens that are longer than 255 characters are split.

1 Tokenizer Types are always prefixed in code with "#Microsoft.Azure.Search" such that "ClassicTokenizer" would actually be specified as "#Microsoft.Azure.Search.ClassicTokenizer". We removed the prefix to reduce the width of the table, but please remember to include it in your code. Note that tokenizer_type is only provided for tokenizers that can be customized. If there are no options, as is the case with the letter tokenizer, there is no associated #Microsoft.Azure.Search type.
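For example, the following sketch defines a customized edgeNGram tokenizer that emits prefixes of 2 to 10 characters built from letters and digits. The name my_edge_tokenizer is illustrative.

"tokenizers":[
   {
      "name":"my_edge_tokenizer",
      "@odata.type":"#Microsoft.Azure.Search.EdgeNGramTokenizer",
      "minGram":2,
      "maxGram":10,
      "tokenChars":[ "letter", "digit" ]
   }
]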

Token filters

In the table below, the token filters that are implemented using Apache Lucene are linked to the Lucene API documentation.

token_filter_name token_filter_type 1 Description and Options
arabic_normalization (type applies only when options are available) A token filter that applies the Arabic normalizer to normalize the orthography.
apostrophe (type applies only when options are available) Strips all characters after an apostrophe (including the apostrophe itself).
asciifolding AsciiFoldingTokenFilter Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.

Options

preserveOriginal (type: bool) - If true, the original token will be kept. The default is false.
cjk_bigram CjkBigramTokenFilter Forms bigrams of CJK terms that are generated from StandardTokenizer.

Options

ignoreScripts (type: string array) - Scripts to ignore. Allowed values include: "han", "hiragana", "katakana", "hangul". The default is an empty list.

outputUnigrams (type: bool) - Set to true if you always want to output both unigrams and bigrams. The default is false.
cjk_width (type applies only when options are available) Normalizes CJK width differences. Folds full-width ASCII variants into the equivalent Basic Latin and half-width Katakana variants into the equivalent kana.
classic (type applies only when options are available) Removes English possessives and dots from acronyms.
common_grams CommonGramTokenFilter Constructs bigrams for frequently occurring terms while indexing. Single terms are still indexed too, with bigrams overlaid.

Options

commonWords (type: string array) - The set of common words. The default is an empty list. Required.

ignoreCase (type: bool) - If true, common words matching will be case insensitive. The default is false.

queryMode (type: bool) - Generates bigrams then removes common words and single terms followed by a common word. The default is false.
dictionary_decompounder DictionaryDecompounderTokenFilter Decomposes compound words found in many Germanic languages.

Options

wordList (type: string array) - The list of words to match against. The default is an empty list. Required.

minWordSize (type: int) - Only words longer than this get processed. The default is 5.

minSubwordSize (type: int) - Only subwords longer than this are outputted. The default is 2.

maxSubwordSize (type: int) - Only subwords shorter than this are outputted. The default is 15.

onlyLongestMatch (type: bool) - Add only the longest matching subword to output. The default is false.
edgeNGram_v2 EdgeNGramTokenFilterV2 Generates n-grams of the given size(s) starting from the front or the back of an input token.

Options

minGram (type: int) - Default: 1, maximum: 300.

maxGram (type: int) - Default: 2, maximum 300. Must be greater than minGram.

side (type: string) - Specifies which side of the input the n-gram should be generated from. Allowed values: "front", "back"
elision ElisionTokenFilter Removes elisions. For example, "l'avion" (the plane) will be converted to "avion" (plane).

Options

articles (type: string array) - A set of articles to remove. The default is an empty list. If there is no list of articles set, by default all French articles will be removed.
german_normalization (type applies only when options are available) Normalizes German characters according to the heuristics of the German2 snowball algorithm.
hindi_normalization (type applies only when options are available) Normalizes text in Hindi to remove some differences in spelling variations.
indic_normalization IndicNormalizationTokenFilter Normalizes the Unicode representation of text in Indian languages.
keep KeepTokenFilter A token filter that only keeps tokens with text contained in a specified list of words.

Options

keepWords (type: string array) - A list of words to keep. The default is an empty list. Required.

keepWordsCase (type: bool) - If true, lower case all words first. The default is false.
keyword_marker KeywordMarkerTokenFilter Marks terms as keywords.

Options

keywords (type: string array) - A list of words to mark as keywords. The default is an empty list. Required.

ignoreCase (type: bool) - If true, lower case all words first. The default is false.
keyword_repeat (type applies only when options are available) Emits each incoming token twice, once as a keyword and once as a non-keyword.
kstem (type applies only when options are available) A high-performance kstem filter for English.
length LengthTokenFilter Removes words that are too long or too short.

Options

min (type: int) - The minimum number. Default: 0, maximum: 300.

max (type: int) - The maximum number. Default: 300, maximum: 300.
limit LimitTokenFilter Limits the number of tokens while indexing.

Options

maxTokenCount (type: int) - Max number of tokens to produce. The default is 1.

consumeAllTokens (type: bool) - Whether all tokens from the input must be consumed even if maxTokenCount is reached. The default is false.
lowercase (type applies only when options are available) Normalizes token text to lower case.
nGram_v2 NGramTokenFilterV2 Generates n-grams of the given size(s).

Options

minGram (type: int) - Default: 1, maximum: 300.

maxGram (type: int) - Default: 2, maximum 300. Must be greater than minGram.
pattern_capture PatternCaptureTokenFilter Uses Java regexes to emit multiple tokens, one for each capture group in one or more patterns.

Options

patterns (type: string array) - A list of patterns to match against each token. Required.

preserveOriginal (type: bool) - Set to true to return the original token even if one of the patterns matches, default: true
pattern_replace PatternReplaceTokenFilter A token filter which applies a pattern to each token in the stream, replacing match occurrences with the specified replacement string.

Options

pattern (type: string) - Required.

replacement (type: string) - Required.
persian_normalization (type applies only when options are available) Applies normalization for Persian.
phonetic PhoneticTokenFilter Create tokens for phonetic matches.

Options

encoder (type: string) - Phonetic encoder to use. Allowed values include: "metaphone", "doubleMetaphone", "soundex", "refinedSoundex", "caverphone1", "caverphone2", "cologne", "nysiis", "koelnerPhonetik", "haasePhonetik", "beiderMorse". Default: "metaphone".

See encoder for more information.

replace (type: bool) - True if encoded tokens should replace original tokens, false if they should be added as synonyms. The default is true.
porter_stem (type applies only when options are available) Transforms the token stream as per the Porter stemming algorithm.
reverse (type applies only when options are available) Reverses the token string.
scandinavian_normalization (type applies only when options are available) Normalizes use of the interchangeable Scandinavian characters.
scandinavian_folding (type applies only when options are available) Folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o. It also discriminates against use of double vowels aa, ae, ao, oe and oo, leaving just the first one.
shingle ShingleTokenFilter Creates combinations of tokens as a single token.

Options

maxShingleSize (type: int) - Defaults to 2.

minShingleSize (type: int) - Defaults to 2.

outputUnigrams (type: bool) - if true, the output stream will contain the input tokens (unigrams) as well as shingles. The default is true.

outputUnigramsIfNoShingles (type: bool) - If true, override the behavior of outputUnigrams==false for those times when no shingles are available. The default is false.

tokenSeparator (type: string) - The string to use when joining adjacent tokens to form a shingle. The default is " ".

filterToken (type: string) - The string to insert for each position at which there is no token. The default is "_".
snowball SnowballTokenFilter Snowball Token Filter.

Options

language (type: string) - Allowed values include: "armenian", "basque", "catalan", "danish", "dutch", "english", "finnish", "french", "german", "german2", "hungarian", "italian", "kp", "lovins", "norwegian", "porter", "portuguese", "romanian", "russian", "spanish", "swedish", "turkish"
sorani_normalization SoraniNormalizationTokenFilter Normalizes the Unicode representation of Sorani text.

Options

None.
stemmer StemmerTokenFilter Language specific stemming filter.

Options

language (type: string) - Allowed values include:
- "arabic"
- "armenian"
- "basque"
- "brazilian"
- "bulgarian"
- "catalan"
- "czech"
- "danish"
- "dutch"
- "dutchKp"
- "english"
- "lightEnglish"
- "minimalEnglish"
- "possessiveEnglish"
- "porter2"
- "lovins"
- "finnish"
- "lightFinnish"
- "french"
- "lightFrench"
- "minimalFrench"
- "galician"
- "minimalGalician"
- "german"
- "german2"
- "lightGerman"
- "minimalGerman"
- "greek"
- "hindi"
- "hungarian"
- "lightHungarian"
- "indonesian"
- "irish"
- "italian"
- "lightItalian"
- "sorani"
- "latvian"
- "norwegian"
- "lightNorwegian"
- "minimalNorwegian"
- "lightNynorsk"
- "minimalNynorsk"
- "portuguese"
- "lightPortuguese"
- "minimalPortuguese"
- "portugueseRslp"
- "romanian"
- "russian"
- "lightRussian"
- "spanish"
- "lightSpanish"
- "swedish"
- "lightSwedish"
- "turkish"
stemmer_override StemmerOverrideTokenFilter Any dictionary-stemmed terms are marked as keywords so that they will not be stemmed with stemmers down the chain. Must be placed before any stemming filters.

Options

rules (type: string array) - Stemming rules in the following format "word => stem" e.g. "ran => run". The default is an empty list. Required.
stopwords StopwordsTokenFilter Removes stop words from a token stream.

Options

stopwords (type: string array) - A list of stopwords. The default is an empty list. Cannot be specified if stopwordsList is specified.

stopwordsList (type: string) - A predefined list of stopwords. Cannot be specified if stopwords is specified. Allowed values include: "arabic", "armenian", "basque", "brazilian", "bulgarian", "catalan", "czech", "danish", "dutch", "english", "finnish", "french", "galician", "german", "greek", "hindi", "hungarian", "indonesian", "irish", "italian", "latvian", "norwegian", "persian", "portuguese", "romanian", "russian", "sorani", "spanish", "swedish", "thai", "turkish". Default: "english".

ignoreCase (type: bool) - If true, all words are lower cased first. The default is false.

removeTrailing (type: bool) - If true, ignore the last search term if it's a stop word. The default is true.
synonym SynonymTokenFilter Matches single or multi word synonyms in a token stream.

Options

synonyms (type: string array) - Required. List of synonyms in one of the following two formats:

- incredible, unbelievable, fabulous => amazing - all terms on the left side of the => symbol will be replaced with all terms on its right side.

- incredible, unbelievable, fabulous, amazing - a comma-separated list of equivalent words. Set the expand option to change how this list is interpreted.

ignoreCase (type: bool) - Case-folds input for matching. The default is false.

expand (type: bool) - If true, all words in the list of synonyms (if => notation is not used) will map to one another.
The following list: incredible, unbelievable, fabulous, amazing is equivalent to: incredible, unbelievable, fabulous, amazing => incredible, unbelievable, fabulous, amazing

- If false, the following list: incredible, unbelievable, fabulous, amazing will be equivalent to: incredible, unbelievable, fabulous, amazing => incredible.
trim (type applies only when options are available) Trims leading and trailing whitespace from tokens.
truncate TruncateTokenFilter Truncates the terms into a specific length.

Options

length (type: int) - Default: 300, maximum: 300. Required.
unique UniqueTokenFilter Filters out tokens with same text as the previous token.

Options

onlyOnSamePosition (type: bool) - If set, removes duplicates only at the same position. The default is true.
uppercase (type applies only when options are available) Normalizes token text to upper case.
word_delimiter WordDelimiterTokenFilter Splits words into subwords and performs optional transformations on subword groups.

Options

generateWordParts (type: bool) - Causes parts of words to be generated, e.g. "AzureSearch" becomes "Azure" "Search". The default is true.

generateNumberParts (type: bool) - Causes number subwords to be generated. The default is true.

catenateWords (type: bool) - Causes maximum runs of word parts to be catenated, e.g. "Azure-Search" becomes "AzureSearch". The default is false.

catenateNumbers (type: bool) - Causes maximum runs of number parts to be catenated, e.g. "1-2" becomes "12". The default is false.

catenateAll (type: bool) - Causes all subword parts to be catenated, e.g. "Azure-Search-1" becomes "AzureSearch1". The default is false.

splitOnCaseChange (type: bool) - If true, splits words on case change, e.g. "AzureSearch" becomes "Azure" "Search". The default is true.

preserveOriginal (type: bool) - Causes original words to be preserved and added to the subword list. The default is false.

splitOnNumerics (type: bool) - If true, splits on numbers, e.g. "Azure1Search" becomes "Azure" "1" "Search". The default is true.

stemEnglishPossessive (type: bool) - Causes trailing "'s" to be removed for each subword. The default is true.

protectedWords (type: string array) - Tokens to protect from being delimited. The default is an empty list.

1 Token Filter Types are always prefixed in code with "#Microsoft.Azure.Search" such that "ArabicNormalizationTokenFilter" would actually be specified as "#Microsoft.Azure.Search.ArabicNormalizationTokenFilter". We removed the prefix to reduce the width of the table, but please remember to include it in your code.
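For example, the following sketch defines a customized edgeNGram_v2 token filter for the fast prefix search scenario mentioned earlier. The name my_prefix_filter is illustrative.

"tokenFilters":[
   {
      "name":"my_prefix_filter",
      "@odata.type":"#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
      "minGram":2,
      "maxGram":20,
      "side":"front"
   }
]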

Note

It's required that you configure your custom analyzer not to produce tokens longer than 300 characters. Indexing will fail for documents with such tokens. To trim or ignore them, use the TruncateTokenFilter and the LengthTokenFilter respectively.
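As a sketch, either of the following definitions keeps tokens within that limit: the first truncates long tokens, the second drops them. The names are illustrative.

"tokenFilters":[
   {
      "name":"my_truncate",
      "@odata.type":"#Microsoft.Azure.Search.TruncateTokenFilter",
      "length":300
   },
   {
      "name":"my_length_limit",
      "@odata.type":"#Microsoft.Azure.Search.LengthTokenFilter",
      "min":0,
      "max":300
   }
]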

Examples

The examples below show analyzer definitions for a few key scenarios.

Example 1: Custom options

This example illustrates an analyzer definition with custom options. Custom options for char filters, tokenizers, and token filters are specified separately as named constructs, and then referenced in the analyzer definition. Predefined elements are used as-is and simply referenced by name.

Walking through this example:

  • Analyzers are a property of the field class for a searchable field.
  • A custom analyzer is part of an index definition. It might be minimally customized (for example, customizing a single option in one filter) or customized in multiple places.
  • In this case, the custom analyzer is "my_analyzer", which in turn uses a customized standard tokenizer "my_standard_tokenizer" and two token filters: the predefined lowercase filter and a customized asciifolding filter "my_asciifolding".
  • It also defines a custom "map_dash" char filter to replace all dashes with underscores before tokenization (the standard tokenizer breaks on dash but not on underscore).
  {
     "name":"myindex",
     "fields":[
        {
           "name":"id",
           "type":"Edm.String",
           "key":true,
           "searchable":false
        },
        {
           "name":"text",
           "type":"Edm.String",
           "searchable":true,
           "analyzer":"my_analyzer"
        }
     ],
     "analyzers":[
        {
           "name":"my_analyzer",
           "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
           "charFilters":[
              "map_dash"
           ],
           "tokenizer":"my_standard_tokenizer",
           "tokenFilters":[
              "my_asciifolding",
              "lowercase"
           ]
        }
     ],
     "charFilters":[
        {
           "name":"map_dash",
           "@odata.type":"#Microsoft.Azure.Search.MappingCharFilter",
           "mappings":["-=>_"]
        }
     ],
     "tokenizers":[
        {
           "name":"my_standard_tokenizer",
           "@odata.type":"#Microsoft.Azure.Search.StandardTokenizer",
           "maxTokenLength":20
        }
     ],
     "tokenFilters":[
        {
           "name":"my_asciifolding",
           "@odata.type":"#Microsoft.Azure.Search.AsciiFoldingTokenFilter",
           "preserveOriginal":true
        }
     ]
  }

Example 2: Override the default analyzer

The Standard analyzer is the default. Suppose you want to replace the default with a different predefined analyzer, such as the pattern analyzer. If you are not setting custom options, you only need to specify it by name in the field definition.

The "analyzer" element overrides the Standard analyzer on a field-by-field basis. There is no global override. In this example, text1 uses the pattern analyzer and text2, which doesn't specify an analyzer, uses the default.

  {
     "name":"myindex",
     "fields":[
        {
           "name":"id",
           "type":"Edm.String",
           "key":true,
           "searchable":false
        },
        {
           "name":"text1",
           "type":"Edm.String",
           "searchable":true,
           "analyzer":"pattern"
        },
        {
           "name":"text2",
           "type":"Edm.String",
           "searchable":true
        }
     ]
  }

Example 3: Different analyzers for indexing and search operations

The preview APIs include additional index attributes for specifying different analyzers for indexing and search. The searchAnalyzer and indexAnalyzer attributes must be specified as a pair, replacing the single analyzer attribute.

  {
     "name":"myindex",
     "fields":[
        {
           "name":"id",
           "type":"Edm.String",
           "key":true,
           "searchable":false
        },
        {
           "name":"text",
           "type":"Edm.String",
           "searchable":true,
           "indexAnalyzer":"whitespace",
           "searchAnalyzer":"simple"
        }
     ]
  }

Example 4: Language analyzer

Fields containing strings in different languages can use a language analyzer, while other fields retain the default (or use some other predefined or custom analyzer). If you use a language analyzer, it must be used for both indexing and search operations. Fields that use a language analyzer cannot have different analyzers for indexing and search.

  {
     "name":"myindex",
     "fields":[
        {
           "name":"id",
           "type":"Edm.String",
           "key":true,
           "searchable":false
        },
        {
           "name":"text",
           "type":"Edm.String",
           "searchable":true,
           "indexAnalyzer":"whitespace",
           "searchAnalyzer":"simple"
        },
        {
           "name":"text_fr",
           "type":"Edm.String",
           "searchable":true,
           "analyzer":"fr.lucene"
        }
     ]
  }

See Also

Azure Search Service REST
Create Index (Azure Search Service REST API)