How to: Customize the Thesaurus in SharePoint Search and Search Server
The thesaurus is an XML file that lets queries be automatically expanded or rewritten to include synonyms, acronyms, and similar alternatives. For example, in a chemical company, product ID 1234, oxygen, O2 and LOX could all refer to the same item.
A SharePoint Search administrator can modify the thesaurus file to substitute all these words at search query time. This document explains how to set up a thesaurus and where to find the relevant files.
Supported Thesaurus Syntax:
To use the sample files provided with the product, remove the comment opening (<!--) and closing (-->) lines from the XML file.
Explanation of terms:
|thesaurus||Marks the beginning (and end) of the thesaurus|
|diacritics_sensitive||Diacritics are marks, such as accents, that are added to letters to change their pronunciation. For example, an acute accent over an e gives é. 0: ignore diacritics; 1: respect diacritics|
|expansion||A list of alternative forms, each marked by the sub keyword|
|sub||One of several alternatives in an expansion|
|replacement||One or more patterns that will be replaced with a substitution|
|pat||A pattern to be replaced|
|sub||Item to be substituted|
<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <diacritics_sensitive>0</diacritics_sensitive>
    <expansion>
      <sub>Internet Explorer</sub>
      <sub>IE</sub>
      <sub>IE5</sub>
    </expansion>
    <replacement>
      <pat>NT5</pat>
      <pat>W2K</pat>
      <sub>Windows 2000</sub>
    </replacement>
  </thesaurus>
</XML>
The example means:
- Diacritics (accents and similar marks) are ignored in the thesaurus.
- A query containing any one of the <sub> terms in the expansion (for example, IE) is expanded to include the others as well (“Internet Explorer” and “IE5”).
- If a query contains the terms “NT5” or “W2K”, they will be replaced by “Windows 2000”.
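The behavior described above can be modeled in a few lines of Python. This is only an illustrative sketch, not how SharePoint Search implements the thesaurus: the function names `load_rules` and `rewrite_term` are hypothetical, matching is assumed to be case-insensitive, and real word breaking and diacritics handling are left out.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <diacritics_sensitive>0</diacritics_sensitive>
    <expansion>
      <sub>Internet Explorer</sub>
      <sub>IE</sub>
      <sub>IE5</sub>
    </expansion>
    <replacement>
      <pat>NT5</pat>
      <pat>W2K</pat>
      <sub>Windows 2000</sub>
    </replacement>
  </thesaurus>
</XML>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def load_rules(xml_text):
    """Return (expansions, replacements) read from a thesaurus document."""
    expansions = []    # each entry is one list of interchangeable forms
    replacements = {}  # lowercased pattern -> substitute text
    for node in ET.fromstring(xml_text).iter():
        name = local(node.tag)
        subs = [c.text.strip() for c in node if local(c.tag) == "sub" and c.text]
        if name == "expansion":
            expansions.append(subs)
        elif name == "replacement" and subs:
            for child in node:
                if local(child.tag) == "pat" and child.text:
                    replacements[child.text.strip().lower()] = subs[0]
    return expansions, replacements


def rewrite_term(term, expansions, replacements):
    """Return the phrases a single query term would be searched as (simplified model)."""
    key = term.lower()  # assumption: matching ignores case
    if key in replacements:
        return [replacements[key]]
    for group in expansions:
        if key in (s.lower() for s in group):
            return list(group)
    return [term]


expansions, replacements = load_rules(SAMPLE)
print(rewrite_term("IE", expansions, replacements))       # ['Internet Explorer', 'IE', 'IE5']
print(rewrite_term("W2K", expansions, replacements))      # ['Windows 2000']
print(rewrite_term("browser", expansions, replacements))  # ['browser']
```

Note the difference the model makes visible: an expansion keeps the original term alongside its alternatives, while a replacement drops the original pattern entirely.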
How to Customize the Thesaurus:
Find the appropriate thesaurus file in the config folder under the path given by the registry value: [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager] "DefaultApplicationsPath"
For each appropriate language, update the thesaurus file(s) with the desired <expansion> and <replacement> entries.
Replace the file(s) on each index, query and web frontend server for each search application path:
%programfiles%\Microsoft Office Servers\12.0\Data\Office Server\Applications\[GUID]\Config
Note: index propagation does not synchronize these files across the servers in the farm, so each server must be updated separately.
Stop and restart the search service (this is needed to load the new thesaurus files). For example, in a console window, run “net stop osearch & net start osearch” without the quotes, or open Programs\Administrative Tools\Services, right-click Office SharePoint Search Service, and choose Restart.
See “Finding Important Files” below for a summary of where to find the key files to manage your thesaurus.
(optional) If you want to have the same thesaurus files apply to all newly created SSPs, put your thesaurus files under the main config folder
(e.g., %programfiles%\Microsoft Office Servers\12.0\Data\config).
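The file-replacement step can be scripted. The sketch below is a minimal example under several assumptions: it does not read the registry or restart the service, the caller supplies the folder that holds the [GUID] subfolders, the function name `copy_to_applications` is hypothetical, and the file name `tsENU.xml` is only a placeholder. It would still need to be run on each index, query and web frontend server. The demo at the bottom works on a temporary folder tree, so it is safe to run anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def copy_to_applications(thesaurus_file, applications_root):
    """Copy one thesaurus file into every <GUID>/Config folder under applications_root."""
    source = Path(thesaurus_file)
    written = []
    for app_dir in sorted(Path(applications_root).iterdir()):
        config_dir = app_dir / "Config"
        if config_dir.is_dir():  # skip anything that is not a search application folder
            target = config_dir / source.name
            shutil.copy2(source, target)
            written.append(target)
    return written


# Demo against a throwaway folder tree that mimics Applications\[GUID]\Config.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "Applications"
    for guid in ("guid-1", "guid-2"):  # placeholder folder names, not real GUIDs
        (root / guid / "Config").mkdir(parents=True)
    source = Path(tmp) / "tsENU.xml"   # placeholder file name
    source.write_text('<XML ID="Microsoft Search Thesaurus"></XML>')
    copied = copy_to_applications(source, root)
    copied_names = [p.parent.parent.name for p in copied]
    print(copied_names)
```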
If there is a syntax error in the thesaurus file, all expansions and replacements will be ignored.
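Because a single syntax error disables the whole file, it can be worth checking that an edited thesaurus is at least well-formed XML before copying it to the servers. The sketch below uses Python's standard `xml.etree.ElementTree` parser; the helper name `check_thesaurus` is hypothetical, and the check covers well-formedness only, not conformance to the thesaurus schema.

```python
import xml.etree.ElementTree as ET


def check_thesaurus(xml_text):
    """Return None if xml_text is well-formed XML, else a short error description."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # where the parser gave up
        return f"line {line}, column {column}: {err}"
    return None


GOOD = """<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <expansion><sub>IE</sub><sub>Internet Explorer</sub></expansion>
  </thesaurus>
</XML>"""

# Same content with the closing </XML> tag left off.
BAD = GOOD.replace("</XML>", "")

print(check_thesaurus(GOOD))  # None
print(check_thesaurus(BAD))   # a description of the parse error
```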
If a word in the thesaurus file matches a stop word in the stop word file, it will be ignored. To avoid this, remove it from the appropriate stop word file.
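A quick way to spot such collisions is to compare the words used in the thesaurus with the contents of the stop word file. The sketch below assumes the stop word file lists one word per line and uses a simple regular expression as a rough stand-in for the real word breaker; both are assumptions, and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def thesaurus_words(xml_text):
    """Collect the lowercased words that appear in <pat> and <sub> elements."""
    words = set()
    for node in ET.fromstring(xml_text).iter():
        if node.tag.rsplit("}", 1)[-1] in ("pat", "sub") and node.text:
            words.update(re.findall(r"\w+", node.text.lower()))
    return words


def stop_word_conflicts(xml_text, noise_text):
    """Return thesaurus words that also appear in the stop word list."""
    stop_words = {w.strip().lower() for w in noise_text.splitlines() if w.strip()}
    return sorted(thesaurus_words(xml_text) & stop_words)


THESAURUS = """<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <expansion><sub>IT</sub><sub>information technology</sub></expansion>
  </thesaurus>
</XML>"""

NOISE = "a\nit\nthe\n"  # made-up stop word list for illustration

print(stop_word_conflicts(THESAURUS, NOISE))  # ['it']
```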
Thesaurus terms are broken into words at query time. Add any words that you do not want broken up to the custom dictionary file customLANG.lex (see Finding Important Files for more details).
When stemming is turned on, search first applies the thesaurus and then expands words into their alternate forms. Take care to avoid expanding into too many unnecessary forms, as this may harm search performance and accuracy.
The “All words” option on the Advanced Search page might no longer work when using multiple term substitution with the thesaurus. This is because an implicit “+” is used between every term. For example, with the example thesaurus above, typing “browser ie” in the “All words” field searches for “+browser +ie”, which no longer allows “Internet Explorer”.
Ambiguous replacements will stop the thesaurus from working (this will be noted in the appropriate logs, but will not be obvious to the user). For example, replacing a with b and also a with c is an error. Some admins add large thesauri which are automatically populated with items such as “replace a b with c” and “replace a,b with c”. After word breaking, these two expressions look exactly the same. Please check the logs for this kind of problem if you are building a large thesaurus.
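For a generated thesaurus, a pre-check for patterns that collide can save a trip through the logs. The sketch below flags every replacement pattern that appears more than once after a crude normalization (lowercasing and splitting on non-word characters). That normalization is only an approximation of the real word breaker, the function names are hypothetical, and flagging duplicates that map to the same substitute is a deliberately conservative assumption.

```python
import re
from collections import defaultdict
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def normalize(text):
    """Rough stand-in for word breaking: lowercase and split on non-word characters."""
    return " ".join(re.findall(r"\w+", text.lower()))


def duplicate_patterns(xml_text):
    """Map each normalized pattern seen more than once to its (pattern, substitute) pairs."""
    seen = defaultdict(list)
    for node in ET.fromstring(xml_text).iter():
        if local(node.tag) != "replacement":
            continue
        subs = [c.text.strip() for c in node if local(c.tag) == "sub" and c.text]
        sub = subs[0] if subs else ""
        for child in node:
            if local(child.tag) == "pat" and child.text:
                seen[normalize(child.text)].append((child.text.strip(), sub))
    return {pat: pairs for pat, pairs in seen.items() if len(pairs) > 1}


THESAURUS = """<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <replacement><pat>a b</pat><sub>c</sub></replacement>
    <replacement><pat>a,b</pat><sub>c</sub></replacement>
    <replacement><pat>NT5</pat><sub>Windows 2000</sub></replacement>
  </thesaurus>
</XML>"""

print(duplicate_patterns(THESAURUS))  # {'a b': [('a b', 'c'), ('a,b', 'c')]}
```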
There is a 10,000 term limit per language in the thesaurus.
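A rough size check can warn before the limit is reached. How the product counts a “term” is not spelled out here, so the sketch below simply counts every <pat> and <sub> element; treat that as an assumption, and the helper name `count_terms` as hypothetical.

```python
import xml.etree.ElementTree as ET

TERM_LIMIT = 10_000  # per-language limit stated above


def count_terms(xml_text):
    """Count <pat> and <sub> elements (an assumed reading of 'term')."""
    return sum(
        1
        for node in ET.fromstring(xml_text).iter()
        if node.tag.rsplit("}", 1)[-1] in ("pat", "sub")
    )


THESAURUS = """<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <expansion><sub>Internet Explorer</sub><sub>IE</sub><sub>IE5</sub></expansion>
    <replacement><pat>NT5</pat><pat>W2K</pat><sub>Windows 2000</sub></replacement>
  </thesaurus>
</XML>"""

total = count_terms(THESAURUS)
print(total, "terms;", "over limit" if total > TERM_LIMIT else "within limit")
```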
Finding Important Files:
The following are the most important files used to manage your thesaurus.
There are 50 default stop word files and 48 thesaurus sample files for the languages we support.
The search service install path can be located by examining the registry value [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager] "DefaultApplicationsPath"
The default location of the thesaurus files (for each index, query and web frontend server) is:
%programfiles%\Microsoft Office Servers\12.0\Data\Office Server
When a search application is created, a copy of the thesaurus file will also be placed under: %programfiles%\Microsoft Office Servers\12.0\Data\Office Server\Applications\[GUID]\Config
Stop word files for each language are named noiseLANG.txt, where LANG is the three-letter acronym for that language. For example, the US English file is noiseENU.txt, and the language-neutral list is noiseNEU.txt.
To find the appropriate acronym for your language(s), please look them up under: http://www.microsoft.com/globaldev/nlsweb/default.mspx.
|Ping Lin Senior Test Lead Microsoft Corp.||Victor Poznanski Senior Program Manager Microsoft Corp.|