Configuring Thesaurus Files
[This topic is pre-release documentation and is subject to change in future releases. Blank topics are included as placeholders.]
All thesaurus files that are included with SQL Server 2008 are formatted as follows.
<XML ID="Microsoft Search Thesaurus">
<!-- Commented out
    <thesaurus xmlns="x-schema:tsSchema.xml">
    <diacritics = false/>
        <expansion>
            <sub>Internet Explorer</sub>
            <sub>IE</sub>
            <sub>IE5</sub>
        </expansion>
        <replacement>
            <pat>NT5</pat>
            <pat>W2K</pat>
            <sub>Windows 2000</sub>
        </replacement>
        <expansion>
            <sub>run</sub>
            <sub>jog</sub>
        </expansion>
    </thesaurus>
-->
</XML>
Each thesaurus file has one or more of the following sections:
- Expansion set
An expansion set contains a group of synonyms. These synonyms are identified in code by "substitution" tags (<sub> and </sub>). Queries that contain matches in one substitution are expanded to include all other substitutions in the expansion set.
- Replacement set
A replacement set contains a text pattern to be replaced by a substitution set. For an example, see the section "Replacement Set" later in this topic.
Additionally, the thesaurus file includes a <diacritics = false/> tag. The value false indicates that the terms specified in the expansion and replacement sets are accent-insensitive. To make searches that use the thesaurus accent-sensitive, change this tag to <diacritics = true/>. For example, suppose you specify the pattern "café" to be replaced by other patterns in a Full-Text Search query. If the thesaurus file is accent-insensitive, Full-Text Search replaces both the patterns "café" and "cafe". If the thesaurus file is accent-sensitive, Full-Text Search replaces only the pattern "café". This setting can be specified only once in the file and applies to all search patterns in the file. It cannot be specified for individual patterns.
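The difference between the two settings can be illustrated with a minimal Python sketch. It is not how Full-Text Search is implemented. It assumes that accent-insensitive matching behaves like comparing terms after their combining marks are removed, and the function names strip_accents and pattern_matches are hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining marks removed, so "café" becomes "cafe"."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def pattern_matches(pattern: str, term: str, diacritics_sensitive: bool) -> bool:
    """Compare a thesaurus pattern with a query term.

    Assumption: accent-insensitive matching is modeled as equality
    after accent stripping. The real engine may fold characters differently.
    """
    if diacritics_sensitive:
        return pattern == term
    return strip_accents(pattern) == strip_accents(term)
```

With diacritics_sensitive set to False, the pattern "café" matches both "café" and "cafe". With it set to True, only "café" matches.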
When you edit thesaurus files in a text editor, you must save the files in Unicode format and include a byte order mark (BOM).
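If you generate or modify a thesaurus file from a script instead of a text editor, the same requirement applies. The following Python sketch assumes that "Unicode format" means UTF-16 little-endian, which is what Windows text editors usually label "Unicode". The function names and the file path passed to save_thesaurus are hypothetical.

```python
import codecs
from pathlib import Path


def encode_thesaurus(xml_text: str) -> bytes:
    """Encode thesaurus XML as UTF-16 LE, prefixed with a byte order mark.

    Assumption: "Unicode format" is taken to mean UTF-16 little-endian.
    """
    return codecs.BOM_UTF16_LE + xml_text.encode("utf-16-le")


def save_thesaurus(path: str, xml_text: str) -> None:
    """Write the encoded thesaurus to disk (the path is supplied by the caller)."""
    Path(path).write_bytes(encode_thesaurus(xml_text))
```

Writing the BOM explicitly, rather than relying on a platform default, makes the byte order of the saved file predictable.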
Expansion Set
Each expansion set is enclosed in an <expansion> element. Within the expansion element, you specify one or more substitutions, each enclosed in <sub> tags. The substitutions in an expansion set are treated as synonyms of each other.
For example, you can edit the expansion section to treat the substitutions "writer", "author", and "journalist" as synonyms. Full-Text Search queries that contain matches in one substitution are expanded to include all other substitutions specified in the expansion set. Therefore, in this example, when you issue a FORMSOF(THESAURUS, ...) or FREETEXT query for the word "author", Full-Text Search also returns search results containing the words "writer" and "journalist".
This is what the expansion set section would look like for the above example:
<expansion>
    <sub>writer</sub>
    <sub>author</sub>
    <sub>journalist</sub>
</expansion>
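To make the expansion behavior concrete, the following Python sketch parses an expansion set and expands a query term. It only models the behavior described in this topic and is not the Full-Text Search implementation. The names EXPANSION_XML and expand_term are hypothetical, and matching is simplified to exact, case-sensitive comparison.

```python
import xml.etree.ElementTree as ET

EXPANSION_XML = """
<expansion>
    <sub>writer</sub>
    <sub>author</sub>
    <sub>journalist</sub>
</expansion>
"""


def expand_term(term: str, expansion_xml: str) -> list[str]:
    """Return every substitution in the set if the term matches one of them.

    Simplification: exact, case-sensitive comparison on single terms.
    """
    root = ET.fromstring(expansion_xml.strip())
    subs = [sub.text.strip() for sub in root.findall("sub")]
    # A match on any substitution expands the query to the whole set,
    # including the original term.
    return subs if term in subs else [term]
```

Calling expand_term("author", EXPANSION_XML) yields all three synonyms, while a term that is not in the set is returned unchanged.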
Replacement Set
Each replacement set is enclosed in a <replacement> element. Within each replacement element, you specify one or more patterns, each enclosed in <pat> tags, and one or more substitutions, each enclosed in <sub> tags. A query term that matches a pattern is replaced by the substitutions in the set. Patterns and substitutions can contain a single word or a sequence of words.
For example, suppose you want queries for the pattern "W2K" to be replaced by the substitutions "Windows 2000" or "XP". If you run a full-text query for "W2K", Full-Text Search returns only search results containing "Windows 2000" or "XP". It does not return results containing "W2K", because the pattern "W2K" has been replaced by the substitutions "Windows 2000" and "XP".
This is what the replacement set section would look like for the above example:
<replacement>
    <pat>W2K</pat>
    <sub>Windows 2000</sub>
    <sub>XP</sub>
</replacement>
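The key difference from an expansion set is that the matched pattern itself is dropped from the search. The following Python sketch models that behavior under the same simplifications as before (exact, case-sensitive matching). The names REPLACEMENT_XML and replace_term are hypothetical, and the code is an illustration, not the engine's implementation.

```python
import xml.etree.ElementTree as ET

REPLACEMENT_XML = """
<replacement>
    <pat>W2K</pat>
    <sub>Windows 2000</sub>
    <sub>XP</sub>
</replacement>
"""


def replace_term(term: str, replacement_xml: str) -> list[str]:
    """Return the substitutions if the term matches a pattern, else the term.

    Unlike an expansion set, the matched pattern is not kept in the result.
    """
    root = ET.fromstring(replacement_xml.strip())
    patterns = [pat.text.strip() for pat in root.findall("pat")]
    subs = [sub.text.strip() for sub in root.findall("sub")]
    return subs if term in patterns else [term]
```

For the term "W2K", the result contains "Windows 2000" and "XP" but not "W2K", which mirrors why documents containing only "W2K" are not returned.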
If two replacement sets have patterns that match overlapping text in a query, the longer pattern takes precedence. For example, if you run a FORMSOF(THESAURUS, ...) query for "Internet Explorer online community" and you have the following replacement sets, the "Internet Explorer" replacement set takes precedence over the "Internet" replacement set. The query is therefore processed as "IE online community" or "IE 5 online community".
<replacement>
    <pat>Internet</pat>
    <sub>intranet</sub>
</replacement>
<replacement>
    <pat>Internet Explorer</pat>
    <sub>IE</sub>
    <sub>IE 5</sub>
</replacement>
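The precedence rule can be sketched in Python by trying patterns from longest to shortest. This is only an illustration of the rule described above. It assumes plain substring matching rather than the word breaking that Full-Text Search performs, and the names REPLACEMENTS and rewrite_query are hypothetical.

```python
# Patterns mapped to their substitutions, mirroring the two replacement sets.
REPLACEMENTS = {
    "Internet": ["intranet"],
    "Internet Explorer": ["IE", "IE 5"],
}


def rewrite_query(query: str, replacements: dict[str, list[str]]) -> list[str]:
    """Rewrite a query using the longest pattern that occurs in it.

    Assumption: substring matching stands in for the engine's word-based
    matching, and only the first (longest) matching pattern is applied.
    """
    for pattern in sorted(replacements, key=len, reverse=True):
        if pattern in query:
            # One rewritten query per substitution of the winning pattern.
            return [query.replace(pattern, sub) for sub in replacements[pattern]]
    return [query]
```

Because "Internet Explorer" is tried before "Internet", the query "Internet Explorer online community" is rewritten with "IE" and "IE 5" and never with "intranet".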