Search: Fine Tuning search relevancy in Microsoft SharePoint Server 2007: Getting the search results your user expects
Relevance is about how closely the search results match what the user wanted to find.
To improve the search results that MOSS Search returns, we need to understand how search results are ranked:
SharePoint performs two types of ranking: dynamic ranking and static ranking. Dynamic ranking happens on the query servers at query time and depends on query and term matching.
Static ranking is query independent and is computed at index time. Let's dive deeper into each of these:
Dynamic ranking looks at the content or property values of a content item, such as:
Anchor text: this evaluates the text that describes a link target, e.g. <A href=http://portal/site> Company Name Enterprise Gateway Portal</A>
- Search harvests anchor text from HTML anchor elements, WSS link lists, SPS 2003 listings, and Word/Excel/PowerPoint 2007 files (files using the Office Open XML file formats)
- Any other File Types handled by installed 3rd Party iFilter components
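To make the idea of harvesting anchor text concrete, here is a minimal sketch that pulls (target URL, anchor text) pairs out of an HTML fragment with a regular expression. It only illustrates the concept: the class and method names are hypothetical, and this is not how the MOSS crawler or its IFilters are implemented.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only (not a MOSS API).
public static class AnchorTextHarvester
{
    // Matches <a ... href=target ...>text</a>, with or without quotes around the target.
    private static readonly Regex AnchorPattern = new Regex(
        "<a\\s[^>]*?href\\s*=\\s*[\"']?([^\"'\\s>]+)[\"']?[^>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns one (target URL, anchor text) pair per anchor element found.
    public static List<KeyValuePair<string, string>> Harvest(string html)
    {
        List<KeyValuePair<string, string>> result =
            new List<KeyValuePair<string, string>>();
        foreach (Match match in AnchorPattern.Matches(html))
        {
            string target = match.Groups[1].Value;
            // Strip any nested tags and surrounding whitespace from the link text.
            string text = Regex.Replace(match.Groups[2].Value, "<[^>]+>", "").Trim();
            result.Add(new KeyValuePair<string, string>(target, text));
        }
        return result;
    }
}
```

Calling AnchorTextHarvester.Harvest on the example anchor above yields the target http://portal/site paired with the text "Company Name Enterprise Gateway Portal", which is the text a ranker can then associate with the target page.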
Property weighting assumes that a match on a specific property value can be more relevant than a match on other property values or in the document's body.
- MOSS 2007 automatically extracts and enhances metadata
- MOSS 2007 automatic tuning
- Index time implementation (occurs on index server)
- Weight is part of property definition
- Managed properties considered in ranking (weights can only be changed through object model); New Relevance Object Model in Microsoft.Office.Server.Search.Administration Namespace
- Configure Managed Property (managedProperty.Weight = newWeight;) or set ranking parameter on predefined documents (RankingParameter.Value)
// Namespaces: System, Microsoft.SharePoint, Microsoft.Office.Server.Search.Administration
string strURL = "http://<SiteName>";
using (SPSite site = new SPSite(strURL))
{
    SearchContext srchContext = SearchContext.GetContext(site);
    Ranking ranking = new Ranking(srchContext);
    // Print the name and current value of every ranking parameter
    foreach (RankingParameter param in ranking.RankingParameters)
    {
        RankingParameter lookedup = ranking.RankingParameters[param.Name];
        Console.WriteLine(lookedup.Name + ": " + lookedup.Value);
    }
}
- Unmanaged properties NOT considered in ranking.
Title is a very important property for ranking, but titles are often wrong (e.g. “Slide 1” or “Word Template Name”).
MOSS 2007 has an intelligent way of overcoming this problem: it uses a text extraction algorithm that generates a shadow title. How does it find a shadow title if a proper title does not exist? It uses the headings inside your document, which are normally marked with text formatting such as Heading 1 or Heading 2.
Please note that this only works for Office file types; in other words, it depends on the Office IFilter that MOSS 2007 search uses to pick up this information.
The name of a website is a common type of query. MOSS Search matches the site name to its URL equivalent.
Static ranking describes the ranking that is not impacted by the content or property values of a content item.
File Type Biasing
In most search scenarios, certain file types are more relevant than others. This affects how MOSS Search calculates relevance.
- Order of relevancy: HTML Web pages, PowerPoint presentations, Word documents, XML Files, Excel Spreadsheets, Plain Text files, List Items
- See Object Model : RankingParameter.Value
- IMPORTANT: You cannot add and/or remove File Types
Automatic Language Detection
Foreign language results are less relevant than results in the user's language.
- Index time: documents are tagged with their likely language.
- Query time: MOSS Search determines the user's language from the browser's Accept-Language header.
- Advanced Search: the user can override this default behaviour by choosing a different language.
- Exception: ENGLISH is always considered as relevant as user’s language.
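The query-time rule above, including the English exception, can be sketched as follows. This is a simplified illustration, not MOSS code: the class and method names are hypothetical, the header is assumed to list the preferred language first, and falling back to English when no header is present is an assumption of this sketch.

```csharp
using System;

// Hypothetical helper, for illustration only (not a MOSS API).
public static class LanguageRelevance
{
    // Returns the primary language of the first Accept-Language entry,
    // e.g. "fr" for "fr-FR,fr;q=0.9,en;q=0.8".
    public static string PreferredLanguage(string acceptLanguage)
    {
        // Assumption of this sketch: treat a missing header as English.
        if (string.IsNullOrEmpty(acceptLanguage)) return "en";
        string first = acceptLanguage.Split(',')[0];
        first = first.Split(';')[0].Trim();
        return first.Split('-')[0].ToLowerInvariant();
    }

    // True when a document in documentLanguage should be treated as less relevant.
    public static bool IsPenalized(string documentLanguage, string acceptLanguage)
    {
        string doc = documentLanguage.Split('-')[0].ToLowerInvariant();
        // English is always considered as relevant as the user's language.
        if (doc == "en") return false;
        return doc != PreferredLanguage(acceptLanguage);
    }
}
```

With a header of "fr-FR,fr;q=0.9,en;q=0.8", French and English documents are not penalized, while a German document is.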
Click Distance from authoritative pages
NOTE: Click Distance is different from URL Depth. Click distance is not based on URL depth but on the path the user takes through pages to get to information.
Authoritative Pages (Configured in SharePoint Central Administration):
Sites linked from authoritative pages have a higher relevance score.
Click distance can be improved by configuring authoritative pages in search admin. This effectively “bumps up” an “X number of clicks” site to a one-click site.
Authoritative pages have 3 levels of importance and are maintained by an administrator.
Pages linked from authoritative pages are MORE relevant than pages that are not, and the rank of every page is adjusted according to its “click distance” from the authoritative pages.
Administrators CAN demote relevance of sites.
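One way to picture click distance is as a breadth-first walk over the link graph, starting from the authoritative pages. The sketch below is only an illustration under that assumption: the class name, method name and the link graph are hypothetical, it ignores the 3 levels of importance, and it is not the algorithm MOSS actually runs.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only (not a MOSS API).
public static class ClickDistance
{
    // links maps each page to the pages it links to.
    // Returns the minimum number of clicks from any authoritative page to each reachable page.
    public static Dictionary<string, int> FromAuthoritativePages(
        IDictionary<string, List<string>> links,
        IEnumerable<string> authoritativePages)
    {
        Dictionary<string, int> distance = new Dictionary<string, int>();
        Queue<string> queue = new Queue<string>();
        foreach (string page in authoritativePages)
        {
            if (!distance.ContainsKey(page))
            {
                distance[page] = 0; // authoritative pages are the starting points
                queue.Enqueue(page);
            }
        }
        while (queue.Count > 0)
        {
            string current = queue.Dequeue();
            List<string> targets;
            if (!links.TryGetValue(current, out targets)) continue;
            foreach (string target in targets)
            {
                if (distance.ContainsKey(target)) continue; // already reached by a shorter path
                distance[target] = distance[current] + 1;
                queue.Enqueue(target);
            }
        }
        return distance;
    }
}
```

In a graph where portal links to hr, hr links to policies, and policies links to leave, the leave page is 3 clicks away when only portal is authoritative. Adding policies as a second authoritative page turns leave into a one-click page, which is the “bump up” effect described above.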
URL Depth
Items with shorter URLs are more relevant than items at longer URLs, e.g. http://msw/ vs http://portal/divisionalsite/ProjectSite1/MeetingSite/. Short URLs are like prime real estate, and organisations tend to allocate them to the most important content.
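As a rough illustration of URL depth, the sketch below counts the path segments of a URL. Treating depth as the number of path segments is an assumption of this sketch, and the class and method names are hypothetical; MOSS does not expose its own calculation this way.

```csharp
using System;

// Hypothetical helper, for illustration only (not a MOSS API).
public static class UrlDepth
{
    // Counts the non-empty path segments of an absolute URL.
    public static int Of(string url)
    {
        Uri uri = new Uri(url);
        string[] segments = uri.AbsolutePath.Split(
            new char[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
        return segments.Length;
    }
}
```

For the two example URLs, http://msw/ has a depth of 0 and http://portal/divisionalsite/ProjectSite1/MeetingSite/ has a depth of 3.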
· Precision@N: Avg. number of relevant documents in the top N results (N = 5, 10, etc.)
· Mean Average Precision: Avg. of Precision@N for N = 1 to R
· Reciprocal Rank: 1/rank of the top relevant document
· Normalized Discounted Cumulative Gain (NDCG) : Represents ratio of current ranking to ideal
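The metrics above can be computed from a list of relevance judgments for a single query, as in the sketch below. It uses common textbook formulations, which are assumptions here rather than the exact formulas Microsoft uses: precision is expressed as a fraction of the top N, average precision averages the precision at each rank where a relevant result appears, and NDCG uses a gain of 2^grade - 1 with a log2 discount. All names are hypothetical. Averaging AveragePrecision over many queries gives Mean Average Precision.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helpers, for illustration only (not a MOSS API).
public static class RelevanceMetrics
{
    // Precision@N: fraction of the top N results that were judged relevant.
    public static double PrecisionAt(IList<bool> relevant, int n)
    {
        if (n <= 0) return 0.0;
        int limit = Math.Min(n, relevant.Count);
        int hits = 0;
        for (int i = 0; i < limit; i++)
        {
            if (relevant[i]) hits++;
        }
        return (double)hits / n;
    }

    // Average precision: mean of the precision at each rank holding a relevant result.
    // Assumes every relevant document appears somewhere in the list.
    public static double AveragePrecision(IList<bool> relevant)
    {
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < relevant.Count; i++)
        {
            if (relevant[i])
            {
                hits++;
                sum += (double)hits / (i + 1);
            }
        }
        return hits == 0 ? 0.0 : sum / hits;
    }

    // Reciprocal rank: 1 / rank of the first relevant result, or 0 if there is none.
    public static double ReciprocalRank(IList<bool> relevant)
    {
        for (int i = 0; i < relevant.Count; i++)
        {
            if (relevant[i]) return 1.0 / (i + 1);
        }
        return 0.0;
    }

    // Discounted cumulative gain for graded judgments (0 = not relevant).
    public static double Dcg(IList<int> grades)
    {
        double dcg = 0.0;
        for (int i = 0; i < grades.Count; i++)
        {
            dcg += (Math.Pow(2, grades[i]) - 1) / Math.Log(i + 2, 2);
        }
        return dcg;
    }

    // NDCG: DCG of the current ranking divided by the DCG of the ideal ordering.
    public static double Ndcg(IList<int> grades)
    {
        List<int> ideal = new List<int>(grades);
        ideal.Sort();
        ideal.Reverse();
        double idealDcg = Dcg(ideal);
        return idealDcg == 0.0 ? 0.0 : Dcg(grades) / idealDcg;
    }
}
```

A ranking whose grades are already in descending order scores an NDCG of 1, and any other ordering of the same grades scores lower, which is the "ratio of current ranking to ideal" idea.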
User’s Perceived Relevance
· Summarization and Highlighting: Query-dependent summarization and highlighting of hits within the summary.
· Duplicate removal: Near-duplicate documents are detected across the index and removed at query time; can be disabled by admin
· Best Bets: Best Bets promotion IS NO LONGER PART OF ranking algorithm
· Did you mean?: Index-informed spell checker; only available for English, Spanish, French and one other language (not sure of the last language).
· First crawl your content :)
· Manage authoritative pages and demoted sites carefully
· Mine query logs to identify keywords
· Review list of descriptions, keywords, and best bets periodically as content prioritization can change over time
· Use admin object model CAREFULLY to change weight given to properties
· Features in the ranking formula can also be added using the object model to personalize ranking criteria: