Crawling content with SharePoint 2013 Search

Before we get any further with search, let’s discuss in more detail how content is gathered, and that’s via crawling.  Crawling is simply the process of gathering documents from various sources/repositories, making sure they comply with various rules, and sending them off to the Content Processing Component for further processing.

Let’s take a more in-depth look at how the SharePoint crawl works.

[Figure: Crawling in-depth architecture]

Architecture:

There are two processes that you should be aware of when working with the SharePoint crawler/gatherer: MSSearch.exe and MSSDmn.exe.

  1. The MSSearch.exe process is responsible for crawling content from various repositories, such as SharePoint sites, HTTP sites, file shares, Exchange Server and more.
  2. When a request is issued to crawl a ‘Content Source’, MSSearch.exe invokes a ‘Filter Daemon’ process called MSSDmn.exe, which loads the protocol handlers and filters necessary to connect to, fetch, and parse the content.  Another way to think of MSSDmn.exe is as a child process of MSSearch.exe that hosts the protocol handlers.

The figure above should give you a feel for how the Crawl Component operates: it uses MSSearch.exe and MSSDmn.exe to load the necessary protocol handlers and gather documents from the various supported repositories, and then sends the crawled content via a Content Plug-In API to the Content Processing Component.  There is one temporary location worth calling out in the figure (there is more than one): a network location where the crawler stores document blobs for the Content Processing Component to pick up.  It is a temporary on-disk location managed based on callbacks the crawler’s Content Plug-In receives from the indexer.

The last part of this architecture is the Crawl Store database.  It is used by the crawler/gatherer to manage crawl operations and to store crawl history, URLs, deletes, error data, and so on.
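
If you want to see how these pieces are wired up in your own farm, PowerShell is the quickest way to look.  Below is a minimal sketch (run from the SharePoint 2013 Management Shell, and assuming a single Search SSA) that lists the crawl components and the crawl store database(s) of the active topology:

    # Load the SharePoint snap-in if the shell doesn't have it already
    Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

    $ssa      = Get-SPEnterpriseSearchServiceApplication
    $topology = Get-SPEnterpriseSearchTopology -SearchApplication $ssa -Active

    # Crawl components: the servers running MSSearch.exe/MSSDmn.exe for this SSA
    Get-SPEnterpriseSearchComponent -SearchTopology $topology |
        Where-Object { $_.GetType().Name -eq "CrawlComponent" } |
        Format-Table Name, ServerName

    # Crawl store database(s) used by the gatherer for crawl history and queues
    Get-SPEnterpriseSearchCrawlDatabase -SearchApplication $ssa |
        Format-Table Name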

Major Changes from SP2010 Crawler:

- The crawler is no longer responsible for parsing documents, extracting properties, or other tasks such as linguistic processing, as was the case with previous SharePoint Search versions.  Its job is now much closer to that of the FAST Search for SharePoint 2010 crawler: it is really just a gatherer of documents, tasked with shipping them off to the Content Processing Component for further processing.  This also means there is no more Property Store database.

- Crawl component and crawl database relationship.  As of SharePoint 2013, a crawl component will automatically communicate with all crawl databases if there is more than one (for a single host).  Previously, the mapping of crawl components to crawl databases could result in big differences in database sizes.

- Coming from FAST Search for SharePoint 2010, there is now a single Search service application (SSA) that handles both content and people crawls.  There is no longer a need for a FAST Content SSA to crawl documents and a FAST Query SSA to crawl people data.

- Crawl Reports.  Better clarity from a troubleshooting perspective.

Protocol Handlers:

A protocol handler is the component used for each type of crawl target.  Here are the target types supported by the SharePoint 2013 crawler:

•  HTTP Protocol Handler:  accessing websites, public Exchange folders and SP sites (http:// and https://)

•  File Protocol Handler: accessing file shares (file://)

•  BCS Protocol Handler: accessing Business Connectivity Services (bdc://)

•  STS3 Protocol Handler:  accessing SharePoint Server 2007 sites (sts3://)

•  STS4 Protocol Handler: accessing SharePoint Server 2010 and 2013 sites (sts4://)

•  SPS3 Protocol Handler: accessing people profiles in SharePoint 2007 and 2010 (sps3://)

Note that only the STS4 Protocol Handler will crawl SharePoint sites as true SharePoint sites.  If the HTTP protocol handler is used, SharePoint sites will still be crawled, but only as regular web sites.
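
To make the STS4-versus-HTTP distinction concrete, here is a hedged sketch that registers the same (placeholder) start address twice: once as a SharePoint-type content source, which crawls it through the STS4 handler, and once as a Web-type content source, which treats it as a regular web site.  The contoso URL and content source names are illustrative only:

    $ssa = Get-SPEnterpriseSearchServiceApplication

    # SharePoint-type content source: crawled via the STS4 protocol handler
    New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Name "Intranet - SharePoint" -Type SharePoint `
        -StartAddresses "http://intranet.contoso.com"

    # Web-type content source: same URL, but crawled as a plain web site
    New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Name "Intranet - Web" -Type Web `
        -StartAddresses "http://intranet.contoso.com"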

 

Crawl Modes:

  • Full Crawl Mode – Discover and crawl every possible document in the specified content source.
  • Incremental Crawl Mode – Discover and crawl only documents that have changed since the last full or incremental crawl.  (Either mode can also be started on demand; see the sketch after this list.)
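
For reference, here is a minimal sketch that starts either kind of crawl on demand, using the default ‘Local SharePoint sites’ content source as a placeholder:

    $ssa = Get-SPEnterpriseSearchServiceApplication
    $cs  = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
               -Identity "Local SharePoint sites"

    $cs.StartFullCrawl()            # discover and crawl everything in the source
    # $cs.StartIncrementalCrawl()   # or: only what changed since the last crawl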

Both are defined on a per-content-source basis and are sequential and dedicated.  This means they cannot run in parallel and that they process changes from the content source ‘change log’ in a top-down fashion.  This presents the following challenge.  Let’s say the figure below shows what we expect from an incremental crawl in terms of the changes processed and the amount of time it should take.

[Figure: Incremental crawl – expected timeline]

However, there is a tendency for occasional “deep change” spikes (say, a wide security update) to alter this timeline and cause incremental crawls to take longer than expected.  Since incremental crawls are sequential, a subsequent crawl cannot start until the previous crawl has completed, so the schedule set by the administrator is missed.  The figure below shows the impact:

[Figure: Incremental crawl – actual timeline with deep-change spikes]

What is the best way for a search administrator to deal with this?  Enter the new Continuous Crawl mode:

  • Continuous Crawl Mode – Enables a continuous crawl of a content source.  It eliminates the need to define crawl schedules and automatically kicks off crawl sessions as needed to process the latest changes and keep the index fresh.  Note that Continuous mode only works for SharePoint-type content sources.  The figure below shows how Continuous crawl mode, with its parallel sessions, keeps the index fresh even with unexpected deep content changes:

[Figure: Continuous crawl with parallel sessions]

There are a couple of things to keep in mind regarding Continuous crawls:

- Each SSA will have only one Continuous Crawl running.

- New continuous crawl sessions are automatically spun up every 15 minutes (the interval can be changed with PowerShell; see the sketch below).

- You cannot pause or resume a Continuous crawl; it can only be stopped.
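
Here is a sketch of switching a SharePoint-type content source over to continuous crawl and adjusting the interval.  The ‘ContinuousCrawlInterval’ SSA property and the 15-minute default are as I understand them from the documentation, so verify in your own farm before relying on this:

    $ssa = Get-SPEnterpriseSearchServiceApplication
    $cs  = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
               -Identity "Local SharePoint sites"

    # Switch the content source from scheduled crawls to continuous crawl
    $cs.EnableContinuousCrawls = $true
    $cs.Update()

    # Check / change how often new continuous crawl sessions are started (minutes)
    $ssa.GetProperty("ContinuousCrawlInterval")
    $ssa.SetProperty("ContinuousCrawlInterval", 5)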

Scaling/Performance/Load:

Some notes here:

- Add crawl components for both fault tolerance and, potentially, better throughput (depending on the use case).  The number of crawl components also factors into the calculation of how many sessions each crawl component opens with a Content Processing Component.  A sketch of adding a crawl component follows below.
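
As an example, adding a second crawl component is done by cloning the active topology; a rough sketch, with ‘SearchServer2’ standing in for whatever server should host it:

    $ssa    = Get-SPEnterpriseSearchServiceApplication
    $active = Get-SPEnterpriseSearchTopology -SearchApplication $ssa -Active
    $clone  = New-SPEnterpriseSearchTopology -SearchApplication $ssa `
                  -Clone -SearchTopology $active

    # The target server must be running the search service instance
    $si = Get-SPEnterpriseSearchServiceInstance -Identity "SearchServer2"
    Start-SPEnterpriseSearchServiceInstance -Identity $si

    New-SPEnterpriseSearchCrawlComponent -SearchTopology $clone -SearchServiceInstance $si

    # Activating the clone applies the new topology to the SSA
    Set-SPEnterpriseSearchTopology -Identity $clone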

- Continuous crawls increase the load on the crawler and on crawl targets.  For each large content source for which you enable continuous crawls, it is recommended that you configure one or more front-end web servers as dedicated targets for crawling. For more information, take a look at https://technet.microsoft.com/en-us/library/dd335962(v=office.14).aspx

- There is a global setting that controls how many worker threads each crawl component will use to target a host (see the sketch after the list below).  The default setting is High, which is a change from SharePoint 2010 Search, where it was set to Partially Reduced.  The reason for the change is that the crawler is now far less resource-intensive than in the past, since much of the functionality has moved to the Content Processing Component.  The Microsoft support team recommends adjusting Crawler Impact Rules rather than this setting, mainly because Crawler Impact Rules are host-based and not global.

  1. Reduced = 1 thread per CPU.
  2. Partially Reduced = number of CPUs + 4 threads, but at low thread priority, meaning another thread can make them wait for CPU time.
  3. High = number of CPUs + 4 threads at normal thread priority.  This is the default setting.
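
If you do need to change the global level, it is exposed on the Search service in PowerShell.  A small sketch; note that the cmdlet’s enum names (Reduced, PartlyReduced, Maximum) do not match the labels above one-for-one, with Maximum corresponding to the default High level as far as I can tell:

    # View the current global crawler performance level
    Get-SPEnterpriseSearchService | Select-Object PerformanceLevel

    # Lower the thread count/priority globally; per-host tuning is usually better
    # handled with Crawler Impact Rules, as recommended above
    Set-SPEnterpriseSearchService -PerformanceLevel PartlyReduced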

 

- Crawler Impact Rules:  These are not global; they are host-based.  For a given host you can either specify a number of simultaneous requests (different from the default shown below), or limit the crawler to one request at a time with a wait of a specified number of seconds between requests.

Choosing one request at a time with a wait between requests will most likely result in a pretty slow crawl.

[Figure: Crawler Impact Rules settings]

We will tackle other search components in future posts, hopefully providing a very clear view of how all these components interact with each other.