Enterprise Search Relevance Architecture Overview
In Search, relevance is about how closely the search results that are returned to the user match what the user wanted to find. Ideally, the results that are returned on the first page are the most relevant, so the user does not have to look through several pages of results to find the best matches for their search.
Enterprise Search in Microsoft Office SharePoint Server 2007 includes a revamped ranking engine developed in collaboration with Microsoft Research. It is specifically tuned for the unique requirements of searching enterprise content.
Understanding Static and Dynamic Ranking
Two types of ranking formula components are used in the relevance calculation: static and dynamic. They differ in whether the calculated rank depends on the query terms and on the actual content and text in the properties of a content item.
Dynamic ranking describes the ranking that is affected by the content or property values for a content item; this is also known as query-dependent ranking.
The following sections provide an overview of the components used for the dynamic ranking algorithm used in the Enterprise Search relevance calculation.
Anchor text is the text that is included with a hyperlink to describe the target content of that hyperlink. When Enterprise Search crawls the content item, this text is included in the index for that content. Anchor text only influences rank, and is not the determining factor for including a content item in the result set. For example, if all the query terms are found only in the anchor text and not in the actual content of the item, the link may be obsolete, so the content item is not included in the results.
Search indexes the anchor text from the following elements:
HTML anchor elements
Microsoft Windows SharePoint Services link lists
Microsoft Office SharePoint Portal Server 2003 listings
Microsoft Office Word 2007, Microsoft Office Excel 2007, and Microsoft Office PowerPoint 2007 hyperlinks (only for files using the new Office Open XML Formats)
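The rule that anchor text influences rank without deciding inclusion can be sketched as follows. This is only an illustration, not the actual Enterprise Search algorithm: the `Item` type, the `score` function, and the weights 1.0 and 0.5 are hypothetical, and the sketch assumes an item qualifies as soon as at least one query term appears in its body.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    url: str
    body: str
    # Text of the hyperlinks that point at this item.
    anchor_texts: List[str] = field(default_factory=list)


def score(item: Item, query_terms: List[str]) -> Optional[float]:
    """Return a toy score, or None if the item is left out of the results."""
    body_words = set(item.body.lower().split())
    anchor_words = {w for text in item.anchor_texts for w in text.lower().split()}
    terms = [t.lower() for t in query_terms]

    body_hits = sum(t in body_words for t in terms)
    if body_hits == 0:
        # Terms found only in anchor text: the link may be obsolete, so exclude.
        return None
    anchor_hits = sum(t in anchor_words for t in terms)
    # Hypothetical weights: anchor text boosts rank but never decides inclusion.
    return 1.0 * body_hits + 0.5 * anchor_hits
```

With an item whose body is "quarterly sales report" and whose anchor text is "sales figures", a query for "sales" is boosted by the anchor match, while a query for "figures" returns no result for that item.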
Some properties are more important to calculating relevance than others. This is called property weighting. Enterprise Search lets you modify the weight of individual properties so that the more important ones count more heavily in the relevance calculation. You must use the Search Administration object model to do this. For a code sample demonstrating how to do this, see How to: Change the Weight Setting for a Managed Property.
Changing property weights arbitrarily can have an adverse effect on the overall relevance of the system, so we do not recommend that you do this without properly evaluating the changes and how they affect the accuracy of search results.
The Microsoft Office SharePoint Portal Server 2003 version of the SQL search syntax supported query-time column weighting. The Enterprise Search version of the SQL search syntax in Microsoft Office SharePoint Server 2007 does not support column weighting. If search queries migrated to Office SharePoint Server 2007 contain column weighting, the queries still work, but the column weighting values are ignored.
Property Length Normalization
A content item can have many different properties of varying length. If the values in these properties are treated equally regardless of their size during relevance calculation, it can have a negative impact on the calculated rank. Length normalization adjusts the rank of a content item, based on the length of the property, and the length normalization setting. You must use the Search Administration object model to perform property length normalization.
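The text above does not state which formula Enterprise Search uses. As an illustration of the general idea only, the sketch below uses the well-known BM25-style normalization of a term frequency; the function name, the `b` setting, and the average property length are all assumptions and are not part of the Search Administration object model.

```python
def normalized_tf(tf: float, length: int, avg_length: float, b: float) -> float:
    """BM25-style length normalization of a term frequency (illustrative only).

    tf         -- how often the term occurs in the property
    length     -- length of this property value, for example in words
    avg_length -- average length of that property across all items
    b          -- normalization setting: 0 disables it, 1 applies it fully
    """
    if not 0.0 <= b <= 1.0:
        raise ValueError("b must be between 0 and 1")
    # Longer-than-average values are scaled down, shorter ones scaled up.
    return tf / ((1.0 - b) + b * (length / avg_length))
```

Under this scheme, a single match in a 400-word body counts for less than a single match in a 50-word body, which is the effect length normalization is meant to achieve.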
URL matching is the process by which Enterprise Search checks content item URLs for a direct match with the specified search terms.
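A minimal sketch of such a check is shown below. What counts as a "direct match" is not defined here, so the sketch assumes that every query term must equal one of the words in the host name or path of the URL; the helper names `url_tokens` and `url_matches` are hypothetical.

```python
import re
from typing import Iterable, Set
from urllib.parse import unquote, urlsplit


def url_tokens(url: str) -> Set[str]:
    """Split the host name and path of a URL into lowercase words."""
    parts = urlsplit(url)
    text = unquote(parts.netloc + " " + parts.path).lower()
    return set(re.findall(r"[a-z0-9]+", text))


def url_matches(url: str, query_terms: Iterable[str]) -> bool:
    """True if every query term appears as a word in the URL (assumed rule)."""
    tokens = url_tokens(url)
    return all(term.lower() in tokens for term in query_terms)
```

For example, the URL `http://intranet/sites/hr/Vacation%20Policy.docx` matches the query terms "vacation" and "policy" but not "payroll".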
Title extraction, or using the title value in the relevance calculation, can help return highly relevant content, if the content item is appropriately named. However, there are scenarios where the value in the title property does not accurately reflect the content. For example, the following titles do not provide valuable information about their content:
Slide 1 (the default name of the first slide in a PowerPoint presentation file, which PowerPoint uses as the presentation file name if it is not changed)
Document 1 (the default name of a Word document file, which Word uses as the document file name if it is not changed)
The previous title examples provide no valuable information about the content of those files, so they are not relevant for Search. To work around this issue, Enterprise Search detects another candidate for title within the body of the content item, and includes this value with the actual title when calculating relevance.
This process is performed only on Microsoft Office files.
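The fallback described above might look like the following sketch. The placeholder patterns and the choice of the first non-empty body line as the candidate title are assumptions made for illustration, as is triggering the fallback only for placeholder-looking titles; the actual detection logic in Enterprise Search is not described here.

```python
import re
from typing import List

# Hypothetical pattern for default titles that say nothing about the content.
_PLACEHOLDER = re.compile(r"^(slide|document)\s*\d+$", re.IGNORECASE)


def titles_for_ranking(title: str, body_lines: List[str]) -> List[str]:
    """Return the title values to feed into the relevance calculation."""
    stripped = title.strip()
    titles = [stripped] if stripped else []
    if not stripped or _PLACEHOLDER.match(stripped):
        # Assumption: the first non-empty line of the body is the candidate title.
        candidate = next((line.strip() for line in body_lines if line.strip()), None)
        if candidate and candidate not in titles:
            # The candidate is used together with the actual title, not instead of it.
            titles.append(candidate)
    return titles
```

A presentation titled "Slide 1" whose first body line is "Q3 Sales Review" would then be ranked using both values, while a file titled "Budget 2007" keeps only its own title.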
Static ranking describes the ranking that is not impacted by the content or property values for a content item; this is also known as query-independent ranking.
The following sections provide an overview of the components used for the static ranking algorithm used in the Enterprise Search relevance calculation.
You link a document, Web page, list, or other item to another content item because, more than likely, the linked item contains information that is related to, and enhances the value of, the item that contains the link. Therefore, information about the hyperlinks to a specific content item, such as how many there are or where they are located, is helpful in determining relevance.
Click distance refers to the number of links between a content item and an "expert" page linking to the content item. For calculating search relevance, the starting point is an authoritative page, as described in Authoritative Pages and Demoted Sites. The more links that the crawler must travel from an authoritative page to the content item, the lower the relevance score. If there are multiple paths to a content item, relevance is calculated based on the shortest path, that is, the one with the fewest links from the authoritative page to the content item.
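The shortest-path rule corresponds to a breadth-first search that starts from the authoritative pages. The sketch below assumes the link graph is available as a dictionary mapping each page to the pages it links to; the function name and data layout are hypothetical, and how the resulting distance feeds into the relevance score is not shown.

```python
from collections import deque
from typing import Dict, Iterable, List


def click_distances(links: Dict[str, List[str]],
                    authoritative: Iterable[str]) -> Dict[str, int]:
    """Fewest links from any authoritative page to each reachable page."""
    distance = {page: 0 for page in authoritative}
    queue = deque(distance)
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in distance:
                # The first visit in a breadth-first search is the shortest path.
                distance[target] = distance[page] + 1
                queue.append(target)
    return distance
```

If page "a" links to both "b" and "c", and "b" also links to "c", the click distance of "c" is 1, because the direct link is the shortest path. Pages that cannot be reached from any authoritative page do not appear in the result.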
Important or relevant content is often located closer to the top of a site's hierarchy, instead of in a location several levels deep in the site. As a result, the content has a shorter URL, so it is more easily remembered and accessed by the user. Enterprise Search makes use of this fact by reviewing URL depth, which refers to how many levels deep within a site the content item is found. The level is determined by reviewing the number of slash ("/") characters in the URL; the greater the number of slash characters in the URL path, the deeper the URL is for that content item. As a consequence, a large URL depth number can lower the relevance of that content.
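Counting slashes in the URL path takes only a few lines. The sketch below assumes that the two slashes after the scheme (as in `http://`) and any slashes in the query string are not counted; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def url_depth(url: str) -> int:
    """Count the slash characters in the path portion of a URL."""
    return urlsplit(url).path.count("/")
```

With this definition, `http://intranet/default.aspx` has a depth of 1 and `http://intranet/sites/hr/docs/policy.docx` has a depth of 4, so the second item would be treated as deeper in the site.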
Automatic Language Detection
Users are more likely to be looking for content in their own language than in other languages. Enterprise Search determines the user's language from the Accept-Language header sent by the browser; this is automatic language detection. When calculating relevance, content in the user's language is considered more relevant than content in other languages, with one exception: English language content is considered as relevant as content in the user's language.
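The language rule can be illustrated with a small parser for the Accept-Language header. This is a simplified sketch: it assumes the user's language is the entry with the highest quality value, compares only primary language codes (so "de-DE" is treated as "de"), and uses hypothetical function names. It does not show how much the match changes the relevance score.

```python
from typing import Optional


def user_language(accept_language: str) -> Optional[str]:
    """Return the primary code of the most preferred language, if any."""
    best_tag, best_q = None, -1.0
    for part in accept_language.split(","):
        tag, _, params = part.strip().partition(";")
        tag = tag.strip().lower()
        if not tag or tag == "*":
            continue
        q = 1.0  # a missing quality value means full preference
        params = params.strip().lower()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                continue  # skip malformed entries
        if q > best_q:  # strict comparison keeps the earliest entry on ties
            best_tag, best_q = tag, q
    return best_tag.split("-")[0] if best_tag else None


def is_preferred_language(content_language: str, accept_language: str) -> bool:
    """True if content counts as being in the user's language for ranking."""
    lang = content_language.strip().lower().split("-")[0]
    # Per the text, English content is treated like content in the user's language.
    return lang == "en" or lang == user_language(accept_language)
```

For a browser that sends `de-DE,de;q=0.9`, German and English content would both be favored, while French content would not.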
File Type Biasing
In most search scenarios, certain file types are more relevant than others. For example, HTML pages and Word documents are usually more relevant to a user's search than an Excel spreadsheet or a plain text file.
Enterprise Search's relevance calculation includes a ranking algorithm that ranks some file types higher than other file types. This applies to the following file types, listed in default ranking order in Enterprise Search, starting with the highest:
HTML Web pages
Plain text files