MOSS Search Word Stemming - Part 2

So how Does MOSS Expand Search Query Terms to Related Words?


Here is how this works in MOSS:


In MOSS, stemming is used in combination with the word breaker component which determines where word boundaries are. The word breaker is used at both index and query time while the stemmer is used only at query time for most languages (the exceptions currently are Arabic and Hebrew) to perform both morphological analysis and morphological generation. In the case of Arabic and Hebrew, stemming is restricted to morphological analysis at both query and index time. A stemmer links word forms to their base form. For example, ”running,” ”ran,” and ”runs“ are all variants of the verb ”to run.” Stemming is currently turned off by default for some languages including English. Stemmers are only available for languages which have significant morphological variation among their word forms. This means that for languages where stemmers are not available (such as Vietnamese) turning on this feature in the Search Result Page (CoreResult Web Part) will not have any effect, since in such languages exact match is all that is needed.


Word Stemming is NOT the same thing as Wild Card Searching, which our engine supports as well. Wild Card searching has to do with doing searches with * in the query. This means you are asking the search engine to find you all words that start with the text string and end with anything, since * means match any character any number of times until you reach the end of the word which in most languages (excluding most East Asian languages) is indicated by a white space. So a search query using * such as "Share*" will return results including "SharePoint", while a search query using morphological processing would bring back "sharing", which is an inflectional variant of Share. Wild Card searching and Word Stemming are often used to refer to the same thing but they are in fact separate and different mechanisms which can return different results.


Word Stemming would bring back words closely related to the query terms (usually inflectional variants for most languages, but for some languages derivational variants as well).


 For example, for the following queries, here are some sample results

  • If you type in "run" --> in addition to exact matches on “run”, it will bring back matches on "runs", "ran" and "running"
  • If you type in "page" --> in addition to exact matches on “page” it will bring back matches on "pages", "paged" and "paging"
  • If you type in "basket" --> in addition to exact matches on “basket” it will find "baskets", but it will not find "basketball". A wild card search for “basket*” would find basketball, which our engine supports and I will discuss this in another article. Word Stemming does not handle this currently because we have focused on matching inflectional variants of words only rather than derivational variants.

However this option is turned off by default out of the box for English and some other languages. You can turn this on by going to the Search Results Web Part, and then Options and turn on this feature which is called “Enable Search Term Stemming”.


Thanks for Ian Johnson from the Natural Language Group at Microsoft for providing his feedback on this.


Hope that helps