Relevance in SharePoint Search
Before reading this post, please check the following links, which will help you better understand search relevance in SharePoint.
Enterprise Search Architecture: http://msdn.microsoft.com/en-us/library/ms570748.aspx
Building Search Queries: http://msdn.microsoft.com/en-us/library/ms470199.aspx
SharePoint Search SQL Syntax: http://msdn.microsoft.com/en-us/library/ms443580.aspx
Now, let's try to understand how relevance works in MOSS Search.
When a search query is executed, the query engine passes the query through a language-specific wordbreaker. If there is no wordbreaker for the query language, the neutral wordbreaker is used; it breaks words and phrases wherever whitespace occurs. After wordbreaking, the resulting words are passed through a stemmer to generate language-specific inflected forms of each word. Using the wordbreaker and stemmer in both the crawling and query processes makes search more effective because more relevant alternatives to the user's query phrasing are generated.
When the query engine executes a property value query, the content index is checked first to get a list of possible matches. The properties for the matching documents are loaded from the property store, and the properties in the query are checked again to ensure that there was a match. The result of the query is a list of all matching results, ordered according to their relevance to the query words. Relevance is about how closely the search results that are returned to the user match what the user wanted to find. Ideally, the results that are returned on the first page are the most relevant, so the user does not have to look through several pages of results to find the best matches for their search.
The relevance ranking engine is based on information retrieval algorithms adapted from Stephen Robertson’s BM25F algorithm, and it is specifically tuned for the unique requirements of searching enterprise content. This approach orders results by decreasing probability of relevance to the query. Terms describe both the document and the query, and the ranking is built from statistics about the terms and the result: the document length, the number of occurrences of the term in the document, and the number of documents in which each term occurs at all (this is repeated for each property). The model is further enhanced by tracking body text and properties, such as title or author, individually. Each enhancement to the model that adds features and facts about the document or the query contributes to better results.
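To make those statistics concrete, here is a minimal Python sketch of two BM25-style building blocks: length normalization and a term rarity weight. It is not MOSS's actual implementation. The function names are made up for illustration, the plain log(N / n) rarity weight is an assumed simplification, and the mapping of the inputs to the DL, AVDL, B, TF, W, n and N attributes of the rankdetail sample at the end of this post is my own reading of that output.

```python
import math

def length_norm(dl, avdl, b):
    """BM25-style length normalization: 1 - b + b * (dl / avdl)."""
    return 1.0 - b + b * (dl / avdl)

def normalized_tf(tf, weight, dl, avdl, b):
    """Weighted term frequency of one property, divided by the length normalization."""
    return (tf * weight) / length_norm(dl, avdl, b)

def term_weight(docs_with_term, docs_total):
    """Rarity weight as plain log(N / n); an assumed simplification of BM25's idf."""
    return math.log(docs_total / docs_with_term)

# Inputs copied from the rankdetail sample (DL, AVDL, B, TF, W, n, N).
dlnorm = length_norm(dl=971.0, avdl=905.0, b=0.6)
tfw = normalized_tf(tf=1.0, weight=100.0, dl=971.0, avdl=905.0, b=0.6)
rw = term_weight(docs_with_term=87, docs_total=527)

print(dlnorm)  # compare with DLNORM="1.043757" in the sample
print(tfw)     # compare with the Term-level TFW="95.807739"
print(rw)      # compare with RW="1.801292"
```

The three values land very close to the DLNORM, TFW and RW figures in the sample, which suggests the engine follows this general shape, but treat that as an observation rather than a specification.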
Search queries return integer relevance values in the column named "rank".
The following rules apply to rank values:
- Rank values returned by the query are integers ranging from 0 to 1000.
- Higher rank values indicate documents that match the search conditions better than others.
- Rank values apply only to the current query, so they cannot be compared for results across queries.
- Rank values are relative to the other documents matching the query. Therefore, the rank value of a particular document depends on the other documents that also match the query.
- The rank value for items matching a purely relational predicate is 1000.
Ranking of documents can be improved by the following methods:
- Include keywords, key phrases, and a description that reflects the page’s content.
- Use the Robots Exclusion Standard, for example, robots.txt (see About /robots.txt).
- Use proper semantic code (appropriate HTML elements) such as headlines. For example, use headline elements tagged as <h1> to <h6>. Use proper HTML list items; for example, ordered, unordered, or definition lists tagged with <ol>, <ul>, or <dl> tags. Use alt and title attributes with image (<img>) tags. Use a Favorites icon for bookmarking and to keep your error log clean. For more information, see How to Add a Shortcut Icon to a Web Page.
- Use the Platform for Internet Content Selection (PICS) specification (see W3C and ICRA tools) to provide a signal to search engines and filter programs that your Web site is safe for children.
- Consider keyword density. Because of a general misuse of HTML keyword <meta> tags, current crawlers compare the number of times a word or a phrase appears in an HTML page to the number of times it appears in its <meta> tag to properly determine its relevancy. Using too few or too many keywords can have a negative effect.
- Use descriptive text in your hyperlinks.
- Consider the levels of directories or subsites in a path to a Web page, or the URL depth.
- Use descriptive page titles.
- Build some automated process in Office SharePoint Server 2007 to create or update the sitemap file for a WCM site. The sitemap file is a simple XML file that contains URLs for the site’s pages and some metadata to help search engines crawl or discover pages in a site (for more information, see What are Sitemaps?). You can achieve this by using custom workflows or by customizing a default SharePoint publishing workflow.
- Use valid HTML or XHTML. Even if most ASP.NET 2.0 controls are XHTML-compliant, Office SharePoint Server 2007 and SharePoint controls are not XHTML-compliant. However, considering XHTML when you are designing master pages and page layouts will always help. If you need to get compliant output from controls that are not compliant, you can refer to Scott Guthrie's article CSS Control Adapter Toolkit for ASP.NET 2.0 for possible options. You can also code all custom Web Parts and field controls to be XHTML-compliant.
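As a rough illustration of the sitemap suggestion above, the following Python sketch builds a sitemap file from a list of pages. The page list and the function name are hypothetical; in a real deployment the URLs would come from the site itself (for example, from a workflow or a scheduled job), which is not shown here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Return sitemap XML for an iterable of (url, last_modified) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical page list used only for this example.
pages = [
    ("http://someserver.com/pages/default.aspx", "2009-01-15"),
    ("http://someserver.com/subsite/pages/somepage.aspx", "2009-02-01"),
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```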
Try to avoid the following in Web pages:
- Naming all pages in the site with the same page title.
- Including a specific keyword, set of keywords, or phrase too often in the <meta> tags or content of a Web page, also called keyword stuffing. In this scenario, the crawler might determine that these keywords or phrases are suspect, and they might be discarded when the search engine is calculating relevance.
- Using hidden text to fill a page with keywords that a search engine can recognize but that are not visible to a visitor.
- Using complex URLs, in which a page is multiple levels deep in a site (for example, http://someserver.com/subsite/pages/somepage.aspx). Such pages might not be easily crawled. You can use a combination of a URL rewriter and a sitemap file to address this in Office SharePoint Server. In addition, this unwanted behavior on the part of the crawler highlights the importance of using proper site structure (which is part of the information architecture).
- Using temporary redirects (this can be a significant issue with a SharePoint landing page). For more information, see Welcome Page Redirect.
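To get a feel for keyword density and keyword stuffing, here is a small Python sketch that measures what fraction of a page's words is a given keyword. The function name and the sample text are made up for illustration, and how a particular crawler tokenizes text or what density it treats as suspect is not something this post specifies.

```python
import re

def keyword_density(text, keyword):
    """Fraction of the words in text that equal keyword (case-insensitive)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)

# Hypothetical page text: 10 words, 3 of which are "search".
page_text = "Search relevance tips: improve search results with better search metadata."
density = keyword_density(page_text, "search")
print(f"{density:.2f}")
```

Running the same check over a page's body text and its <meta> keywords gives a quick way to spot pages where one term dominates.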
Components involved in Search Ranking
SharePoint performs two types of ranking: dynamic ranking and static ranking. Dynamic ranking happens on the query servers and depends on the query and term matching, whereas static ranking is query independent and is computed at index time. Let's dive deeper into each of these:
Dynamic ranking looks at the content or property values for a content item, such as:
Anchor text: this evaluates the text that describes a link target, e.g. <a href="http://portal/site">Company Name Enterprise Gateway Portal</a>
- Search harvests anchor text from HTML anchor elements, WSS link lists, SPS 2003 listings, and Word/Excel/PowerPoint 2007 files (files using the Office Open XML file formats)
- Any other file types handled by installed third-party IFilter components
Property weighting means that matches on a specific property value can be more relevant than matches on other property values or in the document’s body.
- MOSS 2007 automatically enhances and extracts metadata
- MOSS 2007 automatic tuning
- Index time implementation (occurs on index server)
- Weight is part of property definition
- Managed properties considered in ranking (weights can only be changed through object model); New Relevance Object Model in Microsoft.Office.Server.Search.Administration Namespace
- Configure Managed Property (managedProperty.Weight = newWeight;) or set ranking parameter on predefined documents (RankingParameter.Value)
using System;
using Microsoft.SharePoint;
using Microsoft.Office.Server.Search.Administration;

// List every ranking parameter and its current value.
string strURL = "http://<SiteName>";
using (SPSite site = new SPSite(strURL))
{
    SearchContext myContext = SearchContext.GetContext(site);
    Ranking ranking = new Ranking(myContext);
    foreach (RankingParameter param in ranking.RankingParameters)
    {
        Console.WriteLine(param.Name + ": " + param.Value);
    }
}
- Unmanaged properties NOT considered in ranking.
Title is a very important property for ranking, but titles are often wrong (e.g. “Slide 1” or “Word Template Name”). MOSS 2007 has an intelligent way of overcoming this problem: it uses a text extraction algorithm that generates a shadow title. How does it find a shadow title if a good one does not exist? It uses the headings inside your document, which are normally marked with text formatting such as Heading 1 or Heading 2.
Please note that this only works for Office file types; in other words, it relies on the Office IFilter that MOSS 2007 search uses to pick up this information.
The name of a website is a common type of query. MOSS Search matches the site name to its URL equivalent.
Static ranking describes the ranking that is not impacted by the content or property values for a content item.
File Type Biasing
In most search scenarios, certain file types are more relevant than others. This affects the ranks that MOSS Search calculates.
- Order of relevancy: HTML Web pages, PowerPoint presentations, Word documents, XML Files, Excel Spreadsheets, Plain Text files, List Items
- See Object Model : RankingParameter.Value
- IMPORTANT: You cannot add and/or remove File Types
Automatic Language Detection
Foreign-language results are treated as less relevant than results in the user’s language.
- Index time: documents are tagged with their likely language.
- Query time: MOSS Search determines the user's language from the browser's headers (Accept-Language).
- Advanced Search: the user can override this default behaviour by choosing a different language.
- Exception: ENGLISH is always considered as relevant as user’s language.
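The query-time step can be pictured with a short Python sketch that reads an Accept-Language header and decides whether a document's language counts as a match. This is only an illustration of the idea described above: the function names are hypothetical, and comparing only the primary language subtag (for example "es" for "es-ES") is an assumption, not documented MOSS behaviour.

```python
def preferred_language(accept_language):
    """Return the primary subtag of the entry with the highest q-value, or None."""
    best_tag, best_q = None, -1.0
    for part in accept_language.split(","):
        pieces = part.strip().split(";")
        tag = pieces[0].strip().lower()
        if not tag:
            continue
        q = 1.0  # per HTTP, a missing q-value means 1.0
        for param in pieces[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if q > best_q:
            best_tag, best_q = tag, q
    return best_tag.split("-")[0] if best_tag else None

def language_matches(document_language, accept_language):
    """True if the document is in the user's language, or in English (the exception)."""
    return document_language in (preferred_language(accept_language), "en")

header = "es-ES,es;q=0.8,en;q=0.5"
print(preferred_language(header))
print(language_matches("de", header))
```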
Click Distance from authoritative pages
NOTE the difference between Click Distance and URL Depth: click distance is not based on URL depth but rather on the path the user takes through pages to get to information.
Authoritative Pages (Configured in SharePoint Central Administration):
- Sites linked to authoritative pages have a higher relevance score.
- Click distance can be improved by configuring authoritative pages in search admin. This effectively “bumps up” an “X number of clicks” site to a one-click site.
- There are 3 levels of importance for authoritative pages, and they are maintained by an administrator.
- Pages linked to authoritative pages are MORE relevant than pages that are not, and the rank of every page is influenced by its “click distance” to authoritative pages.
- Administrators CAN demote relevance of sites.
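Click distance can be thought of as a shortest-path problem over the link graph. The Python sketch below uses a breadth-first search to count the fewest clicks from any authoritative page to every other page. The link graph, URLs and function name are hypothetical, and this only illustrates the concept; it is not how MOSS computes or stores click distance.

```python
from collections import deque

def click_distances(links, authoritative_pages):
    """Breadth-first search: fewest clicks from any authoritative page to each page."""
    distances = {page: 0 for page in authoritative_pages}
    queue = deque(authoritative_pages)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in distances:
                distances[target] = distances[page] + 1
                queue.append(target)
    return distances

# Hypothetical link graph: page -> pages it links to.
links = {
    "http://portal/": ["http://portal/hr/", "http://portal/it/"],
    "http://portal/hr/": ["http://portal/hr/policies/"],
    "http://portal/hr/policies/": ["http://portal/hr/policies/leave.aspx"],
}

# With only the portal home page as authoritative, leave.aspx is 3 clicks away.
distances = click_distances(links, ["http://portal/"])
# Marking the policies page as authoritative turns leave.aspx into a one-click page.
promoted = click_distances(links, ["http://portal/", "http://portal/hr/policies/"])
print(distances["http://portal/hr/policies/leave.aspx"])
print(promoted["http://portal/hr/policies/leave.aspx"])
```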
URL depth: items with shorter URLs are more relevant than items placed at longer URLs, e.g. http://msw/ vs http://portal/divisionalsite/ProjectSite1/MeetingSite/. Short URLs are like prime real estate, and organisations tend to allocate them to the most important content.
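As a quick illustration, URL depth can be approximated by counting the path segments of a URL. The Python sketch below does exactly that; the function name is made up, and counting non-empty path segments is an assumption about how depth might be measured, not the documented MOSS calculation.

```python
from urllib.parse import urlparse

def url_depth(url):
    """Number of non-empty path segments in the URL."""
    return len([segment for segment in urlparse(url).path.split("/") if segment])

shallow = url_depth("http://msw/")
deep = url_depth("http://portal/divisionalsite/ProjectSite1/MeetingSite/")
print(shallow, deep)
```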
The SharePoint object model supports modifying relevance parameters through the Ranking class. The Ranking class has a RankingParameters property, which represents the collection of all ranking parameters for the SSP. The parameter values can be edited, but parameters cannot be added, deleted or renamed.
Analysis of the Parameters that affect Search Relevance:
Crawled properties are discovered by the search service indexer while it crawls content. Administrators then map these crawled properties to managed properties so they can be used in search. Both crawled and managed properties have their own configurable settings, which affect search relevance ranking. As mentioned earlier, when a query that uses a managed property is executed, it first goes to the content index to get a list of possible matches, which is used to rebuild the query that is then run to get the search results. However, if the managed property is not scoped, the query is directed to SQL rather than to the content index, which hurts the performance of the search query; queries that bypass SQL and hit the content index directly are much faster. We can also modify a property's Weight or LengthNormalization for relevance tuning. The parameters that affect how relevance is calculated are listed below:
Types of parameters to customize
- Saturation constant for term frequency. Term frequency relates to how many times the query term occurs in the item.
- Saturation constant for click distance.
- Weight of click distance for calculating relevance.
- Saturation constant for URL depth.
- Weight of URL depth for calculating relevance.
- Weight for ranking applied to content in a language that does not match the language of the user.
- Weight of HTML content type for calculating relevance.
- Weight of Microsoft Office Word content type for calculating relevance.
- Weight of Microsoft Office PowerPoint content type for calculating relevance.
- Weight of Microsoft Office Excel content type for calculating relevance.
- Weight of XML content type for calculating relevance.
- Weight of plain text content type for calculating relevance.
- Weight of list item content type for calculating relevance.
- Weight of Microsoft Office Outlook e-mail message content type for calculating relevance.
Using special Search predicates to influence relevance:
The CONTAINS predicate is used for exact matches, whereas the FREETEXT predicate is used for finding items containing combinations of the search words. Queries that contain only the CONTAINS predicate can return results with unexpected rank ordering.
We can indicate a column or a column group on which FREETEXT will test. However, to get better relevance when we do not want to specify a particular column, it is recommended that we use DefaultProperties as the column. When we specify DefaultProperties, all indexed text properties that have non-zero weight are searched. Some non-zero-weighted properties are not retrievable and cannot be modified, yet they are still taken into consideration when ranking an item.
If we combine a Boolean restriction clause with a FREETEXT clause by using the AND operator in a query, we reduce the number of possible matches to a query without affecting the rank values that are calculated based on the FREETEXT clause of the query. If we do not specify a column reference, only the Contents column, which contains the body of the item, is searched, which in turn might filter out certain results that the user might otherwise have been interested in.
Each FREETEXT clause represents a separate query, and ranking is done separately for each. Hence, it is not recommended to use more than one FREETEXT clause in a single query, as in:
e.g. ...WHERE FREETEXT(DefaultProperties, 'hello') AND FREETEXT(DefaultProperties, 'world')...
To specify multiple query terms in a FREETEXT clause, add all the terms to a single string:
e.g. ...WHERE FREETEXT(DefaultProperties, 'hello world')...
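If the query text is assembled in code, the same advice can be sketched as a small Python helper that joins all terms into one FREETEXT clause. The helper name and the default column list are hypothetical, and escaping single quotes by doubling them is an assumption about the SQL syntax that you should verify against the SharePoint Search SQL Syntax reference linked at the top.

```python
def build_freetext_query(terms, columns=("Title", "Path", "Rank")):
    """Build one search SQL statement with a single FREETEXT clause for all terms."""
    # Assumption: a single quote inside a term is escaped by doubling it.
    phrase = " ".join(term.replace("'", "''") for term in terms)
    return (
        f"SELECT {', '.join(columns)} FROM SCOPE() "
        f"WHERE FREETEXT(DefaultProperties, '{phrase}') "
        "ORDER BY Rank DESC"
    )

query = build_freetext_query(["hello", "world"])
print(query)
```

Keeping every term in one clause means the engine ranks the item once against the whole phrase instead of ranking each clause separately.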
The LIKE keyword is used to perform a pattern-matching comparison on the specified column. This keyword only works with single-valued fields, not multi-valued ones. Also, the column should be available in the property store. The wildcards that can be used with the LIKE keyword are “%”, “_”, “[ ]” and “[^ ]”. We can also use multiple wildcards in a match string.
NEAR specifies that two content search terms must be located relatively close to one another to be recognized as matching by the CONTAINS predicate. When the words in the query joined by NEAR are found within approximately 50 words of one another in the column that is being searched, the NEAR term returns a match. The closer together the two words are, the higher the calculated rank for the NEAR term. The farther apart the two words are, the lower the rank. The number of words is approximate; it can be less than 50. If the match words specified with the NEAR term are both found in the column being searched, but are farther apart than 50, the result is still returned but has a rank of 0.
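The proximity idea behind NEAR can be sketched in a few lines of Python. This is a toy model only: the function names are made up, splitting on whitespace stands in for the real wordbreaker, and the hard cutoff at exactly 50 words is an assumption, since the text above says the number is approximate.

```python
def near_distance(text, first, second):
    """Smallest gap in word positions between the two words, or None if either is missing."""
    words = text.lower().split()
    first_positions = [i for i, word in enumerate(words) if word == first.lower()]
    second_positions = [i for i, word in enumerate(words) if word == second.lower()]
    if not first_positions or not second_positions:
        return None
    return min(abs(i - j) for i in first_positions for j in second_positions)

def is_near(text, first, second, window=50):
    """True when both words occur within `window` words of each other."""
    distance = near_distance(text, first, second)
    return distance is not None and distance <= window

close_text = "search relevance in SharePoint"
far_text = "search " + "filler " * 60 + "relevance"
print(is_near(close_text, "search", "relevance"))
print(is_near(far_text, "search", "relevance"))
```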
FORMSOF is used within the CONTAINS predicate to match other linguistic forms of a word. There are two types of FORMSOF word generation. INFLECTIONAL chooses alternative inflection forms for the match words: if the word is a verb, alternative tenses are used, and if the word is a noun, the singular, plural, and possessive forms are used to detect matches. THESAURUS chooses words that have the same meaning, taken from a thesaurus.
ISABOUT matches columns against a group of one or more search terms. An ISABOUT term can have one or more components, separated by commas. The columns specified in the CONTAINS predicate are tested against each component, and the document is included in the results if at least one of the components matches.
If you want to retrieve the ranking values as computed by the server at run time, use the rankdetail property. Modify the query to "select rank, rankdetail, title, description, size, path, hithighlightedSummary" and you will receive XML-formatted text like the following:
<QIR WID="4" URLDepth="4" ClickDistance="5.333332" Language="1034"
FileType="0" External="0" IsDemoted="No"></QIR>
<Rank DocId="4" Score="13165" NodeType="Prob" OriginalScore="23.714941"
<Term ID="0" Score="1.801292" TFW="95.807739" n="87" N="527"
RW="1.801292" TFK="1.000000" FFK="1.000000">
<Prop Pid="1" W="100.000000" TF="1.000000" TFW="100.000000"
DL="971.000000" AVDL="905.000000" DL_AVDL="1.072928" B="0.600000" DLNORM="1.043757"