Manage crawler impact rules (Office SharePoint Server 2007 for Search)

Crawler impact rules are particularly important when you crawl external content sources, because crawling consumes resources on the servers being crawled. Because crawling can place a significant load on those servers, we recommend that you use crawler impact rules to throttle how aggressively your system crawls slower servers. Otherwise, if you place too much of a burden on the servers that you crawl, the administrators of those servers might revoke your permission to crawl their sites.

Crawler impact rules enable administrators to manage the impact that the crawler has on the servers being crawled. For each crawler impact rule, you can specify a single URL, or you can use wildcard characters in the URL path so that the rule applies to a block of URLs. You can then either specify how many simultaneous requests for pages are made to the specified URL, or request only one document at a time and wait a specified number of seconds between requests.
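The two throttling modes described above can be sketched as follows. This is an illustrative model only, not SharePoint's implementation; the RequestThrottle class and its parameter names are hypothetical.

```python
import threading
import time

class RequestThrottle:
    """Illustrative model of a crawler impact rule (hypothetical class,
    not SharePoint's code).

    Mode 1: allow up to `simultaneous_requests` requests at once.
    Mode 2: one request at a time (simultaneous_requests=1), waiting
            `wait_seconds` between consecutive requests.
    """

    def __init__(self, simultaneous_requests=1, wait_seconds=0.0):
        self._slots = threading.Semaphore(simultaneous_requests)
        self._wait_seconds = wait_seconds
        self._lock = threading.Lock()
        self._last_request = 0.0

    def acquire(self):
        """Block until the crawler may issue its next request."""
        self._slots.acquire()
        if self._wait_seconds > 0:
            with self._lock:
                # Sleep until `wait_seconds` have passed since the last request.
                delay = self._last_request + self._wait_seconds - time.monotonic()
                if delay > 0:
                    time.sleep(delay)
                self._last_request = time.monotonic()

    def release(self):
        """Mark the request as finished, freeing a slot."""
        self._slots.release()
```

A crawler thread would call `acquire()` before fetching a page from a host covered by the rule and `release()` afterward; a rule for a slow external server would use `simultaneous_requests=1` with a nonzero `wait_seconds`.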

The following table shows the wildcard characters that you can use in the site name when adding a crawler impact rule.

Use	To

* as the site name

Apply the rule to all sites.

*.* as the site name

Apply the rule to sites with dots in the name.

*.site_name.com as the site name

Apply the rule to all sites in the site_name.com domain (for example, *.adventure-works.com applies the rule to all sites in the adventure-works.com domain).

*.top-level_domain_name (such as *.com or *.net) as the site name

Apply the rule to all sites that end with a specific top-level domain name (for example, .com or .net).

? in the site name

Replace a single character in a rule. For example, *.adventure-works?.com applies the rule to all sites in the domains adventure-works1.com, adventure-works2.com, and so on.
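The patterns in the table above follow ordinary * and ? glob semantics (where * also matches dots), so their behavior can be checked with Python's fnmatch module. The site names below are illustrative only.

```python
from fnmatch import fnmatchcase

rules = [
    "*.com",                   # all sites ending in .com
    "*.adventure-works?.com",  # ? matches exactly one character
]

sites = [
    "www.adventure-works1.com",
    "www.adventure-works2.com",
    "www.example.net",
]

for site in sites:
    # Collect every rule whose pattern matches this site name.
    matched = [rule for rule in rules if fnmatchcase(site, rule)]
    print(site, "->", matched)
    # www.adventure-works1.com -> ['*.com', '*.adventure-works?.com']
    # www.adventure-works2.com -> ['*.com', '*.adventure-works?.com']
    # www.example.net -> []
```

Note that `*.com` matches `www.adventure-works1.com` because `*` matches across dots, which is exactly the behavior the table describes for top-level-domain rules.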

You can create a crawler impact rule for *.com that applies to all Internet sites whose addresses end in .com. For example, an administrator of a portal might add a content source for a particular site in the .com domain. The rule for *.com applies to that site unless you add a crawler impact rule specifically for that site's URL.
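The precedence described above, where a rule for a specific site overrides a broader rule such as *.com, can be sketched as follows. The specificity heuristic (more literal characters wins) and the adventure-works site name are assumptions for this sketch; the article states only that a site-specific rule takes precedence.

```python
from fnmatch import fnmatchcase

def pick_rule(site, rules):
    """Return the most specific crawler impact rule matching `site`.

    Heuristic (an assumption for this sketch): among matching patterns,
    the one with the most literal (non-wildcard) characters wins, so an
    exact site name beats a broad pattern such as "*.com".
    """
    matches = [r for r in rules if fnmatchcase(site, r)]
    if not matches:
        return None
    return max(matches, key=lambda r: len(r.replace("*", "").replace("?", "")))

rules = ["*.com", "www.adventure-works.com"]
print(pick_rule("www.adventure-works.com", rules))  # www.adventure-works.com
print(pick_rule("www.example.com", rules))          # *.com
```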

For content within your organization that other administrators are crawling, you can coordinate with those administrators to set crawler impact rules based on the performance and capacity of the servers. For most external sites, by contrast, this coordination is not possible. If your crawls request too much content from external servers, or make requests too frequently, the administrators of those sites might limit your future access. Therefore, the best practice is to crawl too little rather than too much, which mitigates the risk that you will lose access to the relevant content.

During initial deployment, set crawler impact rules to have as little impact on other servers as possible while still crawling enough content, frequently enough, to ensure the freshness of the crawled content.

Later, during the operations phase, you can adjust crawler impact rules based on your experiences and data from crawl logs.

To manage a crawler impact rule, you can perform the following procedures: