Indexes - Analyze

Shows how an analyzer breaks text into tokens.

POST {endpoint}/indexes('{indexName}')/search.analyze?api-version=2023-11-01

URI Parameters

| Name | In | Required | Type | Description |
|------|----|----------|------|-------------|
| endpoint | path | True | string | The endpoint URL of the search service. |
| indexName | path | True | string | The name of the index for which to test an analyzer. |
| api-version | query | True | string | Client API version. |

Request Header

| Name | Required | Type | Description |
|------|----------|------|-------------|
| x-ms-client-request-id | | string (uuid) | The tracking ID sent with the request to help with debugging. |

Request Body

| Name | Required | Type | Description |
|------|----------|------|-------------|
| text | True | string | The text to break into tokens. |
| analyzer | | LexicalAnalyzerName | The name of the analyzer to use to break the given text. If this parameter is not specified, you must specify a tokenizer instead. The tokenizer and analyzer parameters are mutually exclusive. |
| charFilters | | CharFilterName[] | An optional list of character filters to use when breaking the given text. This parameter can only be set when using the tokenizer parameter. |
| tokenFilters | | TokenFilterName[] | An optional list of token filters to use when breaking the given text. This parameter can only be set when using the tokenizer parameter. |
| tokenizer | | LexicalTokenizerName | The name of the tokenizer to use to break the given text. If this parameter is not specified, you must specify an analyzer instead. The tokenizer and analyzer parameters are mutually exclusive. |
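
For reference, a body that exercises the tokenizer path rather than the analyzer path might look like the sketch below, which combines the whitespace tokenizer with the html_strip character filter and the lowercase token filter. The component choices are illustrative; any names from the CharFilterName, LexicalTokenizerName, and TokenFilterName tables under Definitions can be combined the same way.

{
  "text": "<p>Text to analyze</p>",
  "tokenizer": "whitespace",
  "charFilters": [ "html_strip" ],
  "tokenFilters": [ "lowercase" ]
}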

Responses

| Name | Type | Description |
|------|------|-------------|
| 200 OK | AnalyzeResult | |
| Other Status Codes | SearchError | Error response. |

Examples

SearchServiceIndexAnalyze

Sample Request

POST https://myservice.search.windows.net/indexes('hotels')/search.analyze?api-version=2023-11-01

{
  "text": "Text to analyze",
  "analyzer": "standard.lucene"
}

Sample Response

{
  "tokens": [
    {
      "token": "text",
      "startOffset": 0,
      "endOffset": 4,
      "position": 0
    },
    {
      "token": "to",
      "startOffset": 5,
      "endOffset": 7,
      "position": 1
    },
    {
      "token": "analyze",
      "startOffset": 8,
      "endOffset": 15,
      "position": 2
    }
  ]
}
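
The sample request above omits transport headers for brevity. A complete exchange also sets Content-Type and supplies authentication; a minimal sketch, assuming key-based authentication with a placeholder admin key:

POST https://myservice.search.windows.net/indexes('hotels')/search.analyze?api-version=2023-11-01
Content-Type: application/json
api-key: <your-admin-api-key>

{
  "text": "Text to analyze",
  "analyzer": "standard.lucene"
}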

Definitions

| Name | Description |
|------|-------------|
| AnalyzedTokenInfo | Information about a token returned by an analyzer. |
| AnalyzeRequest | Specifies some text and analysis components used to break that text into tokens. |
| AnalyzeResult | The result of testing an analyzer on text. |
| CharFilterName | Defines the names of all character filters supported by the search engine. |
| LexicalAnalyzerName | Defines the names of all text analyzers supported by the search engine. |
| LexicalTokenizerName | Defines the names of all tokenizers supported by the search engine. |
| SearchError | Describes an error condition for the API. |
| TokenFilterName | Defines the names of all token filters supported by the search engine. |

AnalyzedTokenInfo

Information about a token returned by an analyzer.

| Name | Type | Description |
|------|------|-------------|
| endOffset | integer | The index of the last character of the token in the input text. |
| position | integer | The position of the token in the input text relative to other tokens. The first token in the input text has position 0, the next has position 1, and so on. Depending on the analyzer used, some tokens might have the same position, for example if they are synonyms of each other. |
| startOffset | integer | The index of the first character of the token in the input text. |
| token | string | The token returned by the analyzer. |
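
To see how these fields map onto input text, line up the sample response above against its input "Text to analyze" (a worked illustration; note that in that sample, endOffset lands one past the token's last character):

token "text":    startOffset 0, endOffset 4,  position 0  ->  characters 0-3
token "to":      startOffset 5, endOffset 7,  position 1  ->  characters 5-6
token "analyze": startOffset 8, endOffset 15, position 2  ->  characters 8-14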

AnalyzeRequest

Specifies some text and analysis components used to break that text into tokens.

| Name | Type | Description |
|------|------|-------------|
| analyzer | LexicalAnalyzerName | The name of the analyzer to use to break the given text. If this parameter is not specified, you must specify a tokenizer instead. The tokenizer and analyzer parameters are mutually exclusive. |
| charFilters | CharFilterName[] | An optional list of character filters to use when breaking the given text. This parameter can only be set when using the tokenizer parameter. |
| text | string | The text to break into tokens. |
| tokenFilters | TokenFilterName[] | An optional list of token filters to use when breaking the given text. This parameter can only be set when using the tokenizer parameter. |
| tokenizer | LexicalTokenizerName | The name of the tokenizer to use to break the given text. If this parameter is not specified, you must specify an analyzer instead. The tokenizer and analyzer parameters are mutually exclusive. |

AnalyzeResult

The result of testing an analyzer on text.

| Name | Type | Description |
|------|------|-------------|
| tokens | AnalyzedTokenInfo[] | The list of tokens returned by the analyzer specified in the request. |

CharFilterName

Defines the names of all character filters supported by the search engine.

| Name | Type | Description |
|------|------|-------------|
| html_strip | string | A character filter that attempts to strip out HTML constructs. See https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/charfilter/HTMLStripCharFilter.html |

LexicalAnalyzerName

Defines the names of all text analyzers supported by the search engine.

| Name | Type | Description |
|------|------|-------------|
| ar.lucene | string | Lucene analyzer for Arabic. |
| ar.microsoft | string | Microsoft analyzer for Arabic. |
| bg.lucene | string | Lucene analyzer for Bulgarian. |
| bg.microsoft | string | Microsoft analyzer for Bulgarian. |
| bn.microsoft | string | Microsoft analyzer for Bangla. |
| ca.lucene | string | Lucene analyzer for Catalan. |
| ca.microsoft | string | Microsoft analyzer for Catalan. |
| cs.lucene | string | Lucene analyzer for Czech. |
| cs.microsoft | string | Microsoft analyzer for Czech. |
| da.lucene | string | Lucene analyzer for Danish. |
| da.microsoft | string | Microsoft analyzer for Danish. |
| de.lucene | string | Lucene analyzer for German. |
| de.microsoft | string | Microsoft analyzer for German. |
| el.lucene | string | Lucene analyzer for Greek. |
| el.microsoft | string | Microsoft analyzer for Greek. |
| en.lucene | string | Lucene analyzer for English. |
| en.microsoft | string | Microsoft analyzer for English. |
| es.lucene | string | Lucene analyzer for Spanish. |
| es.microsoft | string | Microsoft analyzer for Spanish. |
| et.microsoft | string | Microsoft analyzer for Estonian. |
| eu.lucene | string | Lucene analyzer for Basque. |
| fa.lucene | string | Lucene analyzer for Persian. |
| fi.lucene | string | Lucene analyzer for Finnish. |
| fi.microsoft | string | Microsoft analyzer for Finnish. |
| fr.lucene | string | Lucene analyzer for French. |
| fr.microsoft | string | Microsoft analyzer for French. |
| ga.lucene | string | Lucene analyzer for Irish. |
| gl.lucene | string | Lucene analyzer for Galician. |
| gu.microsoft | string | Microsoft analyzer for Gujarati. |
| he.microsoft | string | Microsoft analyzer for Hebrew. |
| hi.lucene | string | Lucene analyzer for Hindi. |
| hi.microsoft | string | Microsoft analyzer for Hindi. |
| hr.microsoft | string | Microsoft analyzer for Croatian. |
| hu.lucene | string | Lucene analyzer for Hungarian. |
| hu.microsoft | string | Microsoft analyzer for Hungarian. |
| hy.lucene | string | Lucene analyzer for Armenian. |
| id.lucene | string | Lucene analyzer for Indonesian. |
| id.microsoft | string | Microsoft analyzer for Indonesian (Bahasa). |
| is.microsoft | string | Microsoft analyzer for Icelandic. |
| it.lucene | string | Lucene analyzer for Italian. |
| it.microsoft | string | Microsoft analyzer for Italian. |
| ja.lucene | string | Lucene analyzer for Japanese. |
| ja.microsoft | string | Microsoft analyzer for Japanese. |
| keyword | string | Treats the entire content of a field as a single token. This is useful for data like zip codes, IDs, and some product names. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html |
| kn.microsoft | string | Microsoft analyzer for Kannada. |
| ko.lucene | string | Lucene analyzer for Korean. |
| ko.microsoft | string | Microsoft analyzer for Korean. |
| lt.microsoft | string | Microsoft analyzer for Lithuanian. |
| lv.lucene | string | Lucene analyzer for Latvian. |
| lv.microsoft | string | Microsoft analyzer for Latvian. |
| ml.microsoft | string | Microsoft analyzer for Malayalam. |
| mr.microsoft | string | Microsoft analyzer for Marathi. |
| ms.microsoft | string | Microsoft analyzer for Malay (Latin). |
| nb.microsoft | string | Microsoft analyzer for Norwegian (Bokmål). |
| nl.lucene | string | Lucene analyzer for Dutch. |
| nl.microsoft | string | Microsoft analyzer for Dutch. |
| no.lucene | string | Lucene analyzer for Norwegian. |
| pa.microsoft | string | Microsoft analyzer for Punjabi. |
| pattern | string | Flexibly separates text into terms via a regular expression pattern. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/PatternAnalyzer.html |
| pl.lucene | string | Lucene analyzer for Polish. |
| pl.microsoft | string | Microsoft analyzer for Polish. |
| pt-BR.lucene | string | Lucene analyzer for Portuguese (Brazil). |
| pt-BR.microsoft | string | Microsoft analyzer for Portuguese (Brazil). |
| pt-PT.lucene | string | Lucene analyzer for Portuguese (Portugal). |
| pt-PT.microsoft | string | Microsoft analyzer for Portuguese (Portugal). |
| ro.lucene | string | Lucene analyzer for Romanian. |
| ro.microsoft | string | Microsoft analyzer for Romanian. |
| ru.lucene | string | Lucene analyzer for Russian. |
| ru.microsoft | string | Microsoft analyzer for Russian. |
| simple | string | Divides text at non-letters and converts them to lower case. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/SimpleAnalyzer.html |
| sk.microsoft | string | Microsoft analyzer for Slovak. |
| sl.microsoft | string | Microsoft analyzer for Slovenian. |
| sr-cyrillic.microsoft | string | Microsoft analyzer for Serbian (Cyrillic). |
| sr-latin.microsoft | string | Microsoft analyzer for Serbian (Latin). |
| standard.lucene | string | Standard Lucene analyzer. |
| standardasciifolding.lucene | string | Standard ASCII Folding Lucene analyzer. See https://docs.microsoft.com/rest/api/searchservice/Custom-analyzers-in-Azure-Search#Analyzers |
| stop | string | Divides text at non-letters; applies the lowercase and stopword token filters. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/StopAnalyzer.html |
| sv.lucene | string | Lucene analyzer for Swedish. |
| sv.microsoft | string | Microsoft analyzer for Swedish. |
| ta.microsoft | string | Microsoft analyzer for Tamil. |
| te.microsoft | string | Microsoft analyzer for Telugu. |
| th.lucene | string | Lucene analyzer for Thai. |
| th.microsoft | string | Microsoft analyzer for Thai. |
| tr.lucene | string | Lucene analyzer for Turkish. |
| tr.microsoft | string | Microsoft analyzer for Turkish. |
| uk.microsoft | string | Microsoft analyzer for Ukrainian. |
| ur.microsoft | string | Microsoft analyzer for Urdu. |
| vi.microsoft | string | Microsoft analyzer for Vietnamese. |
| whitespace | string | An analyzer that uses the whitespace tokenizer. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/WhitespaceAnalyzer.html |
| zh-Hans.lucene | string | Lucene analyzer for Chinese (Simplified). |
| zh-Hans.microsoft | string | Microsoft analyzer for Chinese (Simplified). |
| zh-Hant.lucene | string | Lucene analyzer for Chinese (Traditional). |
| zh-Hant.microsoft | string | Microsoft analyzer for Chinese (Traditional). |

LexicalTokenizerName

Defines the names of all tokenizers supported by the search engine.

| Name | Type | Description |
|------|------|-------------|
| classic | string | Grammar-based tokenizer that is suitable for processing most European-language documents. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html |
| edgeNGram | string | Tokenizes the input from an edge into n-grams of the given size(s). See https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenizer.html |
| keyword_v2 | string | Emits the entire input as a single token. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/KeywordTokenizer.html |
| letter | string | Divides text at non-letters. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/LetterTokenizer.html |
| lowercase | string | Divides text at non-letters and converts them to lower case. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/LowerCaseTokenizer.html |
| microsoft_language_stemming_tokenizer | string | Divides text using language-specific rules and reduces words to their base forms. |
| microsoft_language_tokenizer | string | Divides text using language-specific rules. |
| nGram | string | Tokenizes the input into n-grams of the given size(s). See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenizer.html |
| path_hierarchy_v2 | string | Tokenizer for path-like hierarchies. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/path/PathHierarchyTokenizer.html |
| pattern | string | Tokenizer that uses regex pattern matching to construct distinct tokens. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/pattern/PatternTokenizer.html |
| standard_v2 | string | Breaks text following the Unicode Text Segmentation rules. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html |
| uax_url_email | string | Tokenizes URLs and emails as one token. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizer.html |
| whitespace | string | Divides text at whitespace. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/WhitespaceTokenizer.html |

SearchError

Describes an error condition for the API.

| Name | Type | Description |
|------|------|-------------|
| code | string | One of a server-defined set of error codes. |
| details | SearchError[] | An array of details about specific errors that led to this reported error. |
| message | string | A human-readable representation of the error. |
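
A minimal sketch of the error shape implied by the fields above (the code and message values here are hypothetical, and details is empty in this sketch):

{
  "code": "InvalidRequestParameter",
  "message": "The request is invalid.",
  "details": []
}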

TokenFilterName

Defines the names of all token filters supported by the search engine.

| Name | Type | Description |
|------|------|-------------|
| apostrophe | string | Strips all characters after an apostrophe (including the apostrophe itself). See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/tr/ApostropheFilter.html |
| arabic_normalization | string | A token filter that applies the Arabic normalizer to normalize the orthography. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/ar/ArabicNormalizationFilter.html |
| asciifolding | string | Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if such equivalents exist. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html |
| cjk_bigram | string | Forms bigrams of CJK terms that are generated from the standard tokenizer. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html |
| cjk_width | string | Normalizes CJK width differences. Folds fullwidth ASCII variants into the equivalent basic Latin, and half-width Katakana variants into the equivalent Kana. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html |
| classic | string | Removes English possessives and dots from acronyms. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/standard/ClassicFilter.html |
| common_grams | string | Constructs bigrams for frequently occurring terms while indexing. Single terms are still indexed too, with bigrams overlaid. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html |
| edgeNGram_v2 | string | Generates n-grams of the given size(s) starting from the front or the back of an input token. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html |
| elision | string | Removes elisions. For example, "l'avion" (the plane) will be converted to "avion" (plane). See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/util/ElisionFilter.html |
| german_normalization | string | Normalizes German characters according to the heuristics of the German2 snowball algorithm. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html |
| hindi_normalization | string | Normalizes text in Hindi to remove some differences in spelling variations. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/hi/HindiNormalizationFilter.html |
| indic_normalization | string | Normalizes the Unicode representation of text in Indian languages. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/in/IndicNormalizationFilter.html |
| keyword_repeat | string | Emits each incoming token twice, once as keyword and once as non-keyword. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html |
| kstem | string | A high-performance kstem filter for English. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/en/KStemFilter.html |
| length | string | Removes words that are too long or too short. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html |
| limit | string | Limits the number of tokens while indexing. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilter.html |
| lowercase | string | Normalizes token text to lower case. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html |
| nGram_v2 | string | Generates n-grams of the given size(s). See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html |
| persian_normalization | string | Applies normalization for Persian. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/fa/PersianNormalizationFilter.html |
| phonetic | string | Creates tokens for phonetic matches. See https://lucene.apache.org/core/4_10_3/analyzers-phonetic/org/apache/lucene/analysis/phonetic/package-tree.html |
| porter_stem | string | Uses the Porter stemming algorithm to transform the token stream. See http://tartarus.org/~martin/PorterStemmer |
| reverse | string | Reverses the token string. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html |
| scandinavian_folding | string | Folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o. It also discriminates against use of double vowels aa, ae, ao, oe and oo, leaving just the first one. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilter.html |
| scandinavian_normalization | string | Normalizes use of the interchangeable Scandinavian characters. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.html |
| shingle | string | Creates combinations of tokens as a single token. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html |
| snowball | string | A filter that stems words using a Snowball-generated stemmer. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/snowball/SnowballFilter.html |
| sorani_normalization | string | Normalizes the Unicode representation of Sorani text. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/ckb/SoraniNormalizationFilter.html |
| stemmer | string | Language-specific stemming filter. See https://docs.microsoft.com/rest/api/searchservice/Custom-analyzers-in-Azure-Search#TokenFilters |
| stopwords | string | Removes stop words from a token stream. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/StopFilter.html |
| trim | string | Trims leading and trailing whitespace from tokens. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/TrimFilter.html |
| truncate | string | Truncates the terms to a specific length. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilter.html |
| unique | string | Filters out tokens with the same text as the previous token. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilter.html |
| uppercase | string | Normalizes token text to upper case. See http://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/core/UpperCaseFilter.html |
| word_delimiter | string | Splits words into subwords and performs optional transformations on subword groups. |