Languages Supported by Windows Search

This topic describes how Windows Search supports multiple languages.

Tokenization, Wordbreakers, and Language Resources

Windows Search is language-independent, but the accuracy of search across languages may vary because of the way wordbreakers tokenize text. Wordbreakers implement various tokenization rules for languages and break text into individual tokens, or words, to be indexed or searched.

Both the indexed text and the query string are broken into tokens. Because tokenization rules vary by language, there is a separate wordbreaker for each language or family of languages. If the query language and the indexed language do not match, the two are tokenized differently and the results can be unpredictable.

Windows Search ships with a well-defined set of wordbreakers. Classic wordbreaker and stemmer components are supported in Windows Vista and later. If the language of a document cannot be determined, Windows Search falls back to a system language to identify the most appropriate wordbreaker. It first calls the GetSystemPreferredUILanguages function to determine the first Multilingual User Interface (MUI) language, which is typically the system UI language unless MUI language packs are installed. If that call succeeds, the wordbreaker for the first MUI language is used. If the call fails, Windows Search retrieves the system locale by calling the GetSystemDefaultLCID function and uses the wordbreaker associated with that locale.
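The fallback order described above can be modeled in a few lines of Python. This is only a sketch of the decision logic, not the Windows Search implementation: the function name pick_wordbreaker_lcid and both parameters are hypothetical stand-ins for the results of GetSystemPreferredUILanguages and GetSystemDefaultLCID.

```python
from typing import Optional, Sequence


def pick_wordbreaker_lcid(
    preferred_ui_lcids: Optional[Sequence[int]],
    system_default_lcid: int,
) -> int:
    """Return the LCID whose wordbreaker would be chosen as a fallback.

    preferred_ui_lcids stands in for the languages reported by
    GetSystemPreferredUILanguages; None or an empty sequence models a
    failed call. system_default_lcid stands in for GetSystemDefaultLCID.
    """
    if preferred_ui_lcids:
        # The call succeeded: use the first MUI language.
        return preferred_ui_lcids[0]
    # The call failed: fall back to the system default locale.
    return system_default_lcid
```

For example, pick_wordbreaker_lcid([0x0413, 0x0409], 0x0409) selects Dutch (0x0413), while pick_wordbreaker_lcid(None, 0x0407) falls back to German (0x0407).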

If no wordbreaker is installed for a language, Windows Search breaks on white space by using the Neutral wordbreaker.
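As a rough illustration of what breaking on white space means, the following Python sketch splits text on runs of whitespace. The function name neutral_break is hypothetical, and the real Neutral wordbreaker may apply additional rules; the sketch only shows why a language written without spaces benefits from a dedicated wordbreaker.

```python
from typing import List


def neutral_break(text: str) -> List[str]:
    """Split text into tokens on runs of whitespace (simplified model)."""
    # str.split() with no arguments splits on any whitespace and
    # discards empty strings, so repeated spaces or tabs are harmless.
    return text.split()


# Space-delimited text yields one token per word.
english_tokens = neutral_break("quick  brown\tfox")

# Thai is written without spaces between words, so a whitespace-only
# breaker returns the whole phrase as a single token.
thai_tokens = neutral_break("สวัสดีครับ")
```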

You can remove a language through the registry, as illustrated in the following example.

HKEY_LOCAL_MACHINE
   SYSTEM
      CurrentControlSet
         Control
            ContentIndex
               Language
                  Dutch_Dutch
                     (Default)
                     Locale
                     NoiseFile
                     StemmerClass = CLSID
                     WBreakerClass = CLSID

Tip

If you make changes to the registry, restart Windows Search.

When Windows Search requires a new wordbreaker, it reads the class identifier (CLSID) from the registry, instantiates the wordbreaker, and caches the instance.

You can create a custom wordbreaker for a language by implementing the IWordBreaker interface. Windows Search then calls the IWordBreaker methods when it builds content indexes and runs queries.

Locale information for indexed content is retrieved from the source of the content. If the source implementer does not know the locale of the indexed content, it should set the locale to LOCALE_NEUTRAL.

For example, if you implement a filter handler (an implementation of the IFilter interface), property handler, or protocol handler, you should set the locale for indexed content to LOCALE_NEUTRAL unless you have specific locale information and are confident of its accuracy.
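The rule above amounts to a simple guard. In the following Python sketch, the function name locale_for_content and its parameters are hypothetical; the only fact taken from the Windows headers is that LOCALE_NEUTRAL has the value 0.

```python
from typing import Optional

LOCALE_NEUTRAL = 0x0000  # neutral language, neutral sublanguage, default sort


def locale_for_content(known_lcid: Optional[int], confident: bool) -> int:
    """Return the locale a handler would report for a piece of content.

    known_lcid is whatever locale information the handler has (None if
    it has none); confident says whether that information is trusted.
    """
    if known_lcid is not None and confident:
        return known_lcid
    # Unknown or unreliable locale: report LOCALE_NEUTRAL instead of guessing.
    return LOCALE_NEUTRAL
```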

Tip

If an index query is based on user input, the locale should match the language in which the user is typing. You can determine this locale by calling the GetKeyboardLayout function.
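A minimal Python sketch of this lookup follows. It relies on the documented behavior that the low word of the handle returned by GetKeyboardLayout is the language identifier of the input language. The helper names langid_from_hkl and current_input_langid are hypothetical, and the ctypes call is guarded so that the module still imports on platforms other than Windows.

```python
import ctypes
import sys
from typing import Optional


def langid_from_hkl(hkl: int) -> int:
    """Extract the language identifier (low word) from an HKL value."""
    return hkl & 0xFFFF


def current_input_langid() -> Optional[int]:
    """Return the input language identifier of the calling thread.

    Returns None on platforms other than Windows.
    """
    if sys.platform != "win32":
        return None
    user32 = ctypes.WinDLL("user32")
    user32.GetKeyboardLayout.argtypes = [ctypes.c_uint32]  # DWORD idThread
    user32.GetKeyboardLayout.restype = ctypes.c_void_p     # HKL
    hkl = user32.GetKeyboardLayout(0) or 0  # 0 means the current thread
    return langid_from_hkl(hkl)
```

For instance, an HKL value of 0x04090409 yields the language identifier 0x0409, which corresponds to English (United States) in the table below.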

Languages Supported by Wordbreakers

Windows Search includes wordbreakers to support the following languages.

Registry key          Language (sublanguage)                              LCID
Arabic_SaudiArabia    Arabic (Neutral)                                    0x0001
Bengali_Default       Bangla (Neutral)                                    0x0045
Bulgarian_Default     Bulgarian (Bulgaria)                                0x0402
Catalan_Default       Catalan (Catalan)                                   0x0403
Chinese_HongKong      Chinese (Hong Kong SAR, PRC)                        0x0C04
Chinese_Simplified    Chinese (Simplified)                                0x0804
Chinese_Traditional   Chinese (Traditional)                               0x0404
Croatian_Default      Croatian (Croatia)                                  0x041A
Czech_Default         Czech (Czech Republic)                              0x0405
Danish_Default        Danish (Denmark)                                    0x0406
Dutch_Dutch           Dutch (Netherlands)                                 0x0413
English_UK            English (United Kingdom)                            0x0809
English_US            English (United States)                             0x0409
Finnish_Default       Finnish (Finland)                                   0x040B
French_French         French (France)                                     0x040C
German_German         German (Germany)                                    0x0407
Greek_Default         Greek (Greece)                                      0x0408
Gujarati_Default      Gujarati (India)                                    0x0447
Hebrew_Default        Hebrew (Neutral)                                    0x000D
Hindi_Default         Hindi (India)                                       0x0439
Hungarian_Default     Hungarian (Hungary)                                 0x040E
Icelandic_Default     Icelandic (Iceland)                                 0x040F
Indonesian_Default    Indonesian (Indonesia)                              0x0421
Italian_Italian       Italian (Italy)                                     0x0410
Japanese_Default      Japanese (Japan)                                    0x0411
Kannada_Default       Kannada (India)                                     0x044B
Korean_Default        Korean (Korea)                                      0x0412
Latvian_Default       Latvian (Latvia)                                    0x0426
Lithuanian_Default    Lithuanian (Lithuanian)                             0x0427
Malay_Malaysia        Malay (Malaysia)                                    0x043E
Malayalam_Default     Malayalam (Neutral)                                 0x004C
Marathi_Default       Marathi (India)                                     0x044E
Norwegian_Bokmal      Norwegian (Bokmål, Norway)                          0x0414
Polish_Default        Polish (Poland)                                     0x0415
Portuguese_Portugal   Portuguese (Portugal)                               0x0816
Portuguese_Brazil     Portuguese (Brazil)                                 0x0416
Punjabi_Default       Punjabi (India)                                     0x0446
Romanian_Default      Romanian (Romania)                                  0x0418
Russian_Default       Russian (Neutral)                                   0x0019
Serbian_Cyrillic      Serbian (Serbia and Montenegro, Former, Cyrillic)   0x0C1A
Serbian_Latin         Serbian (Serbia and Montenegro, Former, Latin)      0x081A
Slovak_Default        Slovak (Slovakia)                                   0x041B
Slovenian_Default     Slovenian (Slovenia)                                0x0424
Spanish_Modern        Spanish (Spain, Modern Sort)                        0x0C0A
Swedish_Default       Swedish (Sweden)                                    0x041D
Tamil_Default         Tamil (India)                                       0x0449
Telugu_Default        Telugu (India)                                      0x044A
Thai_Default          Thai (Thailand)                                     0x041E
Turkish_Default       Turkish (Turkey)                                    0x041F
Ukrainian_Default     Ukrainian (Ukraine)                                 0x0422
Urdu_Default          Urdu (Pakistan)                                     0x0420
Vietnamese_Default    Vietnamese (Vietnam)                                0x042A

Note

LCIDs for some languages in the table are generated using the language identifier, sublanguage identifier, and sort identifier.
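The composition can be checked with a little bit arithmetic. The following Python sketch mirrors the layout used by the MAKELANGID and MAKELCID macros: the primary language identifier occupies the low 10 bits, the sublanguage identifier the next 6 bits, and the sort identifier the 4 bits above that. The Python function names are illustrative only.

```python
from typing import Tuple


def make_langid(primary: int, sublang: int) -> int:
    """Combine a primary language ID and a sublanguage ID (as MAKELANGID does)."""
    return (sublang << 10) | primary


def make_lcid(langid: int, sort_id: int = 0) -> int:
    """Combine a language ID and a sort ID (as MAKELCID does)."""
    return (sort_id << 16) | langid


def split_lcid(lcid: int) -> Tuple[int, int, int]:
    """Return (primary language, sublanguage, sort ID) for an LCID."""
    return lcid & 0x3FF, (lcid >> 10) & 0x3F, (lcid >> 16) & 0xF


# Spanish (Spain, Modern Sort): primary 0x0A, sublanguage 0x03, default sort.
spanish_modern = make_lcid(make_langid(0x0A, 0x03))

# English (United States): primary 0x09, sublanguage 0x01, default sort.
english_us = make_lcid(make_langid(0x09, 0x01))
```

Here spanish_modern evaluates to 0x0C0A and english_us to 0x0409, matching the table. A neutral entry such as Arabic (0x0001) is simply the primary language identifier with a sublanguage of 0.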

For more information about languages and associated identifiers, see Language Identifier Constants and Strings.

Note

There is no guarantee that all of these language registry keys are present on any given machine. The wordbreaker for a given language may or may not be installed on the machine, depending on user settings.
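Because the set of keys varies by machine, it can be useful to check what is actually present. The following read-only Python sketch enumerates the subkeys under the Language key shown earlier in this topic, using the standard winreg module. The function name installed_language_keys is hypothetical, and the sketch returns an empty list on platforms other than Windows or when the key cannot be opened.

```python
import sys
from typing import List

# Path under HKEY_LOCAL_MACHINE, taken from the registry layout shown earlier.
LANGUAGE_KEY = r"SYSTEM\CurrentControlSet\Control\ContentIndex\Language"


def installed_language_keys() -> List[str]:
    """List the language subkeys present on this machine (read-only)."""
    if sys.platform != "win32":
        return []
    import winreg  # available only on Windows

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, LANGUAGE_KEY) as key:
            # QueryInfoKey returns (subkey count, value count, last modified).
            subkey_count = winreg.QueryInfoKey(key)[0]
            return [winreg.EnumKey(key, i) for i in range(subkey_count)]
    except OSError:
        # The key is missing or not readable by the current user.
        return []
```

A caller could then test for a specific entry, for example "Dutch_Dutch" in installed_language_keys(), before assuming that the corresponding wordbreaker is registered.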

Beginning in Windows 8.1, the preferred way to use wordbreakers is through the WordsSegmenter class in the Windows Runtime (WinRT) API.

Additional Resources

Windows Search Overview

Windows Search as a Development Platform

Using Managed Code with Shell Data and Windows Search