Plan to crawl content (Search Server 2008)

Applies To: Microsoft Search Server 2008

 

Topic Last Modified: 2008-09-15

In this article:

  • About crawling and indexing content

  • Identify the sources of content that you want to crawl

  • Plan content sources

  • Plan for authentication

  • Plan protocol handlers

  • Plan to manage the impact of crawling

  • Plan crawl rules

  • Plan search settings that are managed at the farm level

  • Indexing content in different languages

Note

Unless otherwise noted, the information in this article applies to both Microsoft Search Server 2008 and Microsoft Search Server 2008 Express.

The purpose of this article is to help search services administrators plan to crawl content by helping them understand how Microsoft Search Server 2008 crawls and indexes content. For more information, see Add or remove a search services administrator (Search Server 2008).

Before end users can take advantage of the enterprise search functionality in Search Server 2008, the content that they will run queries against must first be crawled.

For the purpose of this article, content is an item that can be crawled, such as a Web page, a Microsoft Office Word document, or an e-mail message file.

When planning to crawl content, consider the following questions:

  • Where is the content physically located?

  • Is the content stored in different sources, such as file shares, SharePoint sites, Web sites, or other places?

  • Do you want to crawl all of the content stored at the source or a portion of it?

  • What types of files do you want to crawl?

  • When and how often will you crawl content?

  • How is the content secured?

Use the information in this article to help you answer these questions and make the necessary planning decisions about the content you want to crawl and how and when you want to crawl that content.

About crawling and indexing content

Crawling and indexing content is the process by which the system accesses and parses content and its properties, sometimes called metadata, to build a content index from which search queries can be served.

The result of successfully crawling content is that the individual files or pieces of content are accessed and read by the crawler. The keywords and metadata for those files are stored in the content index, sometimes called the index. The index consists of the keywords that are stored in the file system of the index server and the metadata that is stored in the search database. The system maintains a mapping between the keywords, the metadata, and the URL of the source from which the content was crawled.
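The mapping described above can be pictured with a toy model. The following is only a conceptual sketch, not the storage format that Search Server 2008 uses. The function names, the in-memory dictionaries, and the sample URLs are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(crawled_items):
    """Build a toy keyword index and metadata store.

    crawled_items: iterable of (url, text, metadata_dict) tuples.
    Both stores are plain dictionaries here (an assumption); the product
    keeps keywords on the index server and metadata in the search database.
    """
    keyword_index = defaultdict(set)  # keyword -> URLs of items that contain it
    metadata_store = {}               # URL -> properties (metadata) of the item
    for url, text, metadata in crawled_items:
        metadata_store[url] = dict(metadata)
        for word in text.lower().split():
            keyword_index[word].add(url)
    return keyword_index, metadata_store


def search(keyword_index, metadata_store, keyword):
    """Return (url, metadata) pairs for items that contain the keyword."""
    urls = sorted(keyword_index.get(keyword.lower(), set()))
    return [(url, metadata_store[url]) for url in urls]
```

A query is answered from the two stores alone, without contacting the host servers again.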

The search service is associated with the Shared Services Provider (SSP) and is assigned a specific server to index content. Unlike the server products in the 2007 Office release, which can have multiple SSPs and therefore more than one content index, Search Server 2008 is limited to one SSP and therefore has only one content index.

Note

The crawler does not change the files on the host servers. Instead, the crawler accesses and reads the files, and then sends the text and metadata to the index server. Some host servers change the date on the files after the crawler accesses them. The crawler does not do this.

Identify the sources of content that you want to crawl

In many cases, the needs of your organization might only require that you crawl all the content contained by the SharePoint sites in your server farm. In this case, you might not need to identify the sources of content you want to crawl because all site collections in a server farm can be crawled by using the default content source. For more information about the default content source, see Plan content sources later in this article.

Many organizations also need to crawl content that is external to the server farm, such as file shares or Web sites on the Internet. Search Server 2008 can crawl and index content that is hosted on other Windows SharePoint Services farms, Web sites, file shares, Microsoft Exchange public folders, and IBM Lotus Notes servers. This greatly increases the amount of content that is available to search queries.

In many cases, however, you might not want to crawl every site collection in your server farm, because content stored in some site collections might not be relevant in search results. In this case, you must do one or both of the following:

  • Note the URLs of the site collections that you do not want to crawl. If you decide to use the default content source, you must ensure that the start addresses for the site collections that you do not want to crawl are not listed in the default content source.

  • Note the start addresses of the site collections that you do want to crawl. If you decide to create additional content sources to use to crawl this content, you need to know these start addresses. Information about when to use one or more content sources is provided in the Plan content sources section of this article.

Tip

With Search Server, there are two ways to process search queries in order to return search results to users. You can query the Search Server content index, or you can use federated search. There are advantages to each approach. For a comparison of these two approaches to processing search queries, see Federated Search Overview (https://go.microsoft.com/fwlink/?LinkID=122651). For a list and brief description of Search Server articles about understanding and using federation, see Working with federation (Search Server 2008).

Plan content sources

Before you can crawl content, you must determine where the content is located and on what types of servers the content is hosted. After this information is gathered, a search services administrator can create one or more content sources. These content sources provide the following information to the crawler:

  • The type of content to crawl — for example, a SharePoint site or a file share.

  • The start address from which to start crawling.

  • What type of behavior to use when crawling — for example, how deep to crawl from the start address, or how many server hops to allow.

  • How frequently to crawl.

Note

Crawling content by using a particular content source is sometimes called "crawling the content source."

This section helps you plan for the content sources needed by your organization.

The default content source is called Local Office SharePoint Server sites. Search services administrators can use this content source to crawl and index all content in the server farm. By default, Search Server 2008 adds the start address (in this case a URL) of the top-level site of each site collection in the farm to the default content source.

For some organizations, simply using the default content source to crawl all sites in their site collections satisfies their search requirements. However, many organizations need additional content sources.

Reasons for creating additional content sources include the need to:

  • Crawl different types of content.

  • Crawl some content on different schedules than other content.

  • Limit or increase the quantity of content that is crawled.

Search services administrators can create up to 500 content sources and each content source can contain up to 500 start addresses. To keep administration as simple as possible, you should create only as many content sources as you need.

Crawl different types of content

You can only crawl one type of content per content source. That is, you can create a content source that contains URLs for SharePoint sites and another that contains URLs for file shares, but you cannot create a single content source that contains URLs to both SharePoint sites and file shares. The following table lists the types of content sources that can be configured.

This type of content source Includes this type of content

SharePoint sites

  • SharePoint sites from the same farm or different Office SharePoint Server 2007, Windows SharePoint Services 3.0, or Search Server 2008 farms

  • SharePoint sites from Microsoft Office SharePoint Portal Server 2003 or Windows SharePoint Services 2.0

    Note

    Unlike crawling SharePoint sites on Office SharePoint Server 2007, Windows SharePoint Services 3.0, or Search Server 2008, the crawler cannot automatically crawl all subsites in a site collection from previous versions of SharePoint Products and Technologies. Therefore, when crawling SharePoint sites from previous versions, you must specify the URL of each top-level site and each subsite that you want to crawl.
    Sites listed in the Site Directory of Microsoft Office SharePoint Portal Server 2003 farms are crawled when the portal site is crawled. For more information about the Site Directory, see About the Site Directory (https://go.microsoft.com/fwlink/?LinkId=88227&clcid=0x409).

Web sites

  • Other Web content in your organization not found in SharePoint sites

  • Content on Web sites on the Internet

File shares

Content on file shares within your organization

Lotus Notes

E-mail messages stored in Lotus Notes databases

Note

Unlike all other types of content sources, the Lotus Notes content source option does not appear in the user interface until you have installed and configured the appropriate prerequisite software. For more information, see Configure Search Server to crawl Lotus Notes (Search Server 2008).

Exchange public folders

Exchange Server content

Crawl content on different schedules

Search services administrators often must decide whether some content is crawled more frequently than other content. The larger the volume of content that you crawl, the more likely it is that you are crawling content from different sources. These different sources might or might not be of the same type and might be hosted on servers of varying speeds in relation to one another.

These factors make it more likely that you need additional content sources to crawl those different sources of content at different times.

Primary reasons that content is crawled on different schedules are:

  • To accommodate downtimes and periods of peak usage.

  • To more frequently crawl the content that is updated more frequently.

  • To crawl content hosted on slower host servers separately from content hosted on faster host servers.

In many cases, not all of this information can be known until after Search Server 2008 is deployed and running for some time. Instead, some of these decisions are made during the operations phase. However, it is a good idea to consider these factors during planning so that you can plan your crawl schedules based on the information at hand.

The following two sections provide more information about crawling content on different schedules.

Downtimes and periods of peak usage

Consider the downtimes and peak usage times of the servers that host the content you want to crawl. For example, if you are crawling content hosted by many different servers outside your server farm, it is likely that these servers are backed up on different schedules and have different peak usage times. The administration of servers outside your server farm is typically out of your control. Therefore, we recommend that you coordinate your crawls with the administrators of the servers that host the content you want to crawl to ensure you do not attempt to crawl content on their servers during a downtime or peak usage time.

A common scenario involves content outside the control of your organization that relates to the content on your SharePoint sites. You can add the start addresses for this content to an existing content source or create a new content source for external content. Because availability of external sites varies widely, it is helpful to add separate content sources for different external content. In this way, the content sources for external content can be crawled at different times than your other content sources. You can then update external content on a crawl schedule that accounts for the availability of each site.

Content that is updated frequently

When planning crawl schedules, consider that some sources of content are typically updated more frequently than others. For example, if you know that content on some site collections or external sources is updated only on Fridays, it would be a waste of resources to crawl that content more frequently than once each week. However, your server farm might contain other site collections that are continually updated Monday through Friday, but not typically updated on Saturdays and Sundays. In this case, you might want to crawl several times each weekday, but only once or twice on weekends.

The way in which content is stored across the site collections in your environment can guide you to create additional content sources for each of your site collections in each of your Web applications. For example, if a site collection stores only archived information, you may not need to crawl that content as frequently as you crawl a site collection that stores frequently updated content. In this case, you might want to crawl these two site collections using different content sources so that they can be crawled on different schedules without having to crawl the archive sites as frequently as the other content.

Full and incremental crawl schedules

Search services administrators can configure the crawl schedules independently for each content source. For each content source, they can specify a time to do full crawls and a separate time to do incremental crawls. Note that you must run a full crawl for a particular content source before you can run an incremental crawl. If you choose an incremental crawl for content that has not yet been crawled, the system performs a full crawl.

We recommend that you plan crawl schedules based on the availability, performance, and bandwidth considerations of the servers running the search service and the servers hosting the crawled content.

When you plan crawl schedules, consider the following best practices:

  • Group start addresses into content sources based on similar availability of the servers that host the content, and so that the overall resource usage on those servers is acceptable.

  • Schedule incremental crawls for each content source during times when the servers that host the content are available and when there is low demand on the resources of the server.

  • Stagger crawl schedules so that the load on the servers in your farm is distributed over time.

  • Schedule full crawls only when necessary for the reasons listed in the next section. We recommend that you schedule full crawls less frequently than incremental crawls.

  • Schedule administration changes that require a full crawl to occur shortly before the planned schedule for full crawls. For example, we recommend that you create a new crawl rule shortly before the next scheduled full crawl so that an additional full crawl is not necessary.

  • Base simultaneous crawls on the capacity of the index server to crawl them. For best performance, we recommend that you stagger your crawl schedules so that the index server does not crawl multiple content sources at the same time. The performance of the index server and the servers hosting the content determines the extent to which crawls can overlap. You can develop a crawl scheduling strategy over time as you become familiar with the typical crawl duration for each content source.
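The staggering practice in the preceding list can be sketched as a simple calculation. This is a planning aid only, and it makes several assumptions: the function name, the content source names, and the expected durations are hypothetical, and crawl schedules themselves are configured in the administration pages, not through code like this.

```python
from datetime import datetime, timedelta


def stagger_crawls(window_start, durations):
    """Assign back-to-back start times so that no two crawls overlap.

    durations: list of (content_source_name, expected_minutes) pairs,
    where expected_minutes is an estimate gathered during operations.
    Returns a list of (name, start, end) tuples in the order given.
    """
    schedule = []
    current = window_start
    for name, minutes in durations:
        end = current + timedelta(minutes=minutes)
        schedule.append((name, current, end))
        current = end  # the next crawl starts when this one is expected to finish
    return schedule


# Hypothetical example: two crawls in a window that opens at 10:00 P.M.
plan = stagger_crawls(
    datetime(2008, 9, 15, 22, 0),
    [("Intranet sites", 90), ("File shares", 45)],
)
```

As the typical duration of each crawl becomes known, the estimates can be revised and the schedule recalculated.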

Reasons to do a full crawl

Reasons for a search services administrator to do a full crawl include:

  • One or more hotfixes or service packs were installed on servers in the farm. See the instructions for the hotfix or service pack for more information.

  • A search services administrator added a new managed property.

  • To re-index ASPX pages on Windows SharePoint Services 3.0 sites.

    Note

    The crawler cannot discover when ASPX pages on Windows SharePoint Services 3.0 sites have changed. Because of this, incremental crawls do not re-index views or home pages when individual list items are deleted. We recommend that you periodically do full crawls of sites that contain ASPX files to ensure that these pages are re-indexed.

  • To detect security changes that were made on a file share after the last full crawl of the file share.

  • To resolve consecutive incremental crawl failures. In rare cases, if an incremental crawl fails one hundred consecutive times at any level in a repository, the index server removes the affected content from the index.

  • Crawl rules have been added, deleted, or modified.

  • To repair a corrupted index.

  • The search services administrator has created one or more server name mappings.

  • The account assigned to the default content access account or crawl rule has changed.

The system does a full crawl even when an incremental crawl is requested under the following circumstances:

  • A search services administrator stopped the previous crawl.

  • A content database was restored.

    Note

    If you are running the Infrastructure Update for Microsoft Office Servers, you can use the restore operation of the Stsadm command-line tool to change whether a content database restore causes a full crawl.

  • A farm administrator has detached and reattached a content database.

  • A full crawl of the site has never been done.

  • The change log does not contain entries for the addresses that are being crawled. Without entries in the change log for the items being crawled, incremental crawls cannot occur.

  • The account assigned to the default content access account or crawl rule has changed.

  • Corruption is detected in the index.

    Depending on the severity of the corruption, the system might attempt to perform a full crawl.
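The circumstances in the preceding list can be summarized as a decision function. The following is a minimal sketch for planning discussions, not the logic that the product runs. The class, the field names, and the function are hypothetical, and the sketch treats every listed condition as always forcing a full crawl, which simplifies the cases that the list describes as conditional.

```python
from dataclasses import dataclass


@dataclass
class CrawlState:
    """Hypothetical flags for one content source, one per listed circumstance."""
    full_crawl_done: bool = True
    previous_crawl_stopped: bool = False
    content_db_restored: bool = False      # can be configurable; see the note in the list
    content_db_reattached: bool = False
    change_log_has_entries: bool = True
    access_account_changed: bool = False
    index_corruption_detected: bool = False


def effective_crawl_type(requested: str, state: CrawlState) -> str:
    """Return the crawl type that runs when 'full' or 'incremental' is requested."""
    if requested == "full":
        return "full"
    forces_full = (
        not state.full_crawl_done
        or state.previous_crawl_stopped
        or state.content_db_restored
        or state.content_db_reattached
        or not state.change_log_has_entries
        or state.access_account_changed
        or state.index_corruption_detected
    )
    return "full" if forces_full else "incremental"
```

For example, requesting an incremental crawl for a content source that has never been fully crawled yields a full crawl in this model.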

You can adjust schedules after the initial deployment based on the performance and capacity of servers in the farm and the servers hosting content.

Limit or increase the quantity of content that is crawled

For each content source, you can select how extensively to crawl the start addresses in that content source. You also specify the behavior of the crawl, sometimes called the crawl settings. The options you can choose for a particular content source vary somewhat based on the content source type that you select. However, most options determine how many levels deep in the hierarchy from each start address listed in the content source are crawled. Note that this behavior is applied to all start addresses in a particular content source. If you need to crawl some sites at deeper levels, you can create additional content sources that encompass those sites.

The options available in the properties for each content source vary depending on the content source type that is selected. The following table describes the crawl settings options for each content source type.

Content source type Crawl settings options

SharePoint sites

  • Everything under the host name for each start address

  • Only the SharePoint site of each start address

Web sites

  • Only within the server of each start address

  • Only the first page of each start address

  • Custom — Specify page depth and number of server hops

    Note

    The default setting for this option is unlimited page depths and server hops.

File shares

  • The folder and all subfolders of each start address

  • Only the folder of each start address

Exchange public folders

  • The folder and all subfolders of each start address

  • Only the folder of each start address

As the preceding table shows, search services administrators can use crawl setting options to limit or increase the quantity of content that is crawled.

The following table describes best practices when configuring crawl setting options.

For this content source type If this pertains Use this crawl setting option

SharePoint sites

You want to include the content on the site itself

-or-

You do not want to include the content available on subsites, or you want to crawl them on a different schedule

Crawl only the SharePoint site of each start address

SharePoint sites

You want to include the content on the site itself

-or-

You want to crawl all content under the start address on the same schedule

Crawl everything under the host name of each start address

Web sites

Content on the site itself is relevant

-or-

Content available on linked sites is not likely to be relevant

Crawl only within the server of each start address

Web sites

Relevant content is on only the first page

Crawl only the first page of each start address

Web sites

You want to limit how deep to crawl the links on the start addresses

Custom — Specify the number of pages deep and number of server hops to crawl

Note

We recommend that you start with a small number on a highly connected site because specifying more than three pages deep or more than three server hops can crawl the entire Internet.

File shares

Exchange public folders

Content available in the subfolders is not likely to be relevant

Crawl only the folder of each start address

File shares

Exchange public folders

Content in the subfolders is likely to be relevant

Crawl the folder and all subfolders of each start address
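To see why small values for page depth and server hops matter, it can help to model the Custom option for Web sites on a small link graph. The following sketch assumes that page depth counts the links followed from the start address and that a server hop is counted each time a link leads to a different host name. These definitions, the function name, and the sample URLs are assumptions made for illustration and might not match the product's exact counting rules.

```python
from collections import deque
from urllib.parse import urlsplit


def pages_in_scope(start, links, max_depth, max_hops):
    """Return the URLs reachable from start within the depth and hop limits.

    links: dict that maps a URL to the list of URLs it links to.
    Simplification: a page is kept at the depth and hop count of the first
    path that reaches it, even if another path would use fewer hops.
    """
    seen = {start}
    queue = deque([(start, 0, 0)])  # (url, page depth, server hops so far)
    while queue:
        url, depth, hops = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links from pages at the depth limit
        for target in links.get(url, []):
            if target in seen:
                continue
            changed_host = urlsplit(target).netloc != urlsplit(url).netloc
            new_hops = hops + (1 if changed_host else 0)
            if new_hops > max_hops:
                continue
            seen.add(target)
            queue.append((target, depth + 1, new_hops))
    return seen
```

With max_depth set to 0, only the first page of the start address is in scope. Each additional level of depth or hop can multiply the number of pages on a highly connected site.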

Plan file-type inclusions and IFilters

Content is crawled only if the relevant file name extension is included in the file-type inclusions list and an IFilter that supports that file type is installed on the index server. Several file types are included automatically during initial installation. When you plan for content sources in your initial deployment, determine whether the content you want to crawl uses file types that are not included. If it does, you must add those file types on the Manage File Types page during deployment and ensure that an IFilter is installed and registered to support each of those file types.

Search Server 2008 provides several IFilters, and more are available from Microsoft and third-party vendors. For more information about how to install and register additional IFilters that are available from Microsoft, see How to register Microsoft Filter Pack with SharePoint Server 2007 and with Search Server 2008. If necessary, software developers can create IFilters for new file types.

On the other hand, if you want to exclude certain file types from being crawled, you can delete the file name extension for that file type from the file type inclusions list. Doing so excludes file names that have that extension from being crawled.
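The two conditions described in this section can be expressed as a simple check. The following sketch is a literal reading of the rule stated above and is intended only as a planning aid. The function name and the two sets are hypothetical; the sample values are a few of the defaults listed in the table in this section.

```python
def is_crawled(extension, included_extensions, ifilter_extensions):
    """Return True only if the extension is included AND has an IFilter."""
    ext = extension.lower().lstrip(".")
    return ext in included_extensions and ext in ifilter_extensions


# A few default values from the table in this section (not the full list).
included_by_default = {"doc", "tiff", "txt"}
ifilter_by_default = {"doc", "css", "txt"}

needs_attention = [
    ext for ext in ("doc", "css", "tiff", "one")
    if not is_crawled(ext, included_by_default, ifilter_by_default)
]
```

In this example, needs_attention lists the extensions for which you would have to add a file-type inclusion, install an IFilter, or both before the content can be crawled.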

The following table lists which file types are supported by the IFilters that are installed by default and which file types are enabled on the Manage File Types page by default.

File name extension    Default IFilter support    Default file type inclusions
ascx                   Yes                        Yes
asm                    Yes                        No
asp                    Yes                        Yes
aspx                   Yes                        Yes
bat                    Yes                        No
c                      Yes                        No
cmd                    Yes                        No
cpp                    Yes                        No
css                    Yes                        No
cxx                    Yes                        No
def                    Yes                        No
dic                    Yes                        No
doc                    Yes                        Yes
docm                   Yes                        Yes
docx                   Yes                        Yes
dot                    Yes                        Yes
eml                    Yes                        Yes
exch                   No                         Yes
h                      Yes                        No
hhc                    Yes                        No
hht                    Yes                        No
hpp                    Yes                        No
hta                    Yes                        No
htm                    Yes                        Yes
html                   Yes                        Yes
htw                    Yes                        No
htx                    Yes                        No
jhtml                  No                         Yes
jsp                    No                         Yes
lnk                    Yes                        No
mht                    Yes                        Yes
mhtml                  Yes                        Yes
mpx                    Yes                        No
msg                    Yes                        Yes
mspx                   No                         Yes
nsf                    No                         Yes
odc                    Yes                        Yes
one                    No                         No
php                    No                         Yes
pot                    Yes                        No
pps                    Yes                        No
ppt                    Yes                        Yes
pptm                   Yes                        Yes
pptx                   Yes                        Yes
pub                    Yes                        Yes
stm                    Yes                        No
tif                    Yes                        Yes
tiff                   No                         Yes
trf                    Yes                        No
txt                    Yes                        Yes
url                    No                         Yes
vdx                    No                         Yes
vsd                    No                         Yes
vss                    No                         Yes
vst                    No                         Yes
vsx                    No                         Yes
vtx                    No                         Yes
xlb                    Yes                        No
xlc                    Yes                        No
xls                    Yes                        Yes
xlsm                   Yes                        Yes
xlsx                   Yes                        Yes
xlt                    Yes                        No
xml                    Yes                        Yes

IFilters and Microsoft Office OneNote

An IFilter is not provided for the .one file name extension used by Microsoft Office OneNote. If you want users to be able to search content in Office OneNote files, you must install an IFilter for OneNote. To do this, you must do one of the following:

  • Install the Microsoft Office OneNote 2007 client application on the index server.

    The IFilter provided by Office OneNote 2007 can be used to crawl both Office OneNote 2003 and Office OneNote 2007 files. The IFilter installed by Office OneNote 2003 can crawl Office OneNote 2003 files only.

  • Install and register the Microsoft Filter Pack.

    The OneNote IFilter provided by this filter pack can be used to crawl Office OneNote 2007 files only. For more information, see How to register Microsoft Filter Pack with SharePoint Server 2007 and with Search Server 2008.

Limit or exclude content by using crawl rules

When you add a start address to a content source and accept the default behavior, all subsites or folders below that start address are crawled unless you exclude them by using one or more crawl rules.

For more information about crawl rules, see Plan crawl rules later in this article.

Other considerations when planning content sources

You cannot crawl the same addresses using multiple content sources. For example, if you use a particular content source to crawl a site collection and all of its subsites, you cannot use a different content source to crawl one of those subsites separately on a different schedule. To accommodate this restriction, you might need to crawl some of these sites separately. Consider the following scenario:

An administrator at Contoso wants to crawl http://contoso, which contains the subsites http://contoso/sites/site1 and http://contoso/sites/site2. The administrator wants to crawl http://contoso/sites/site2 on a different schedule than the other sites. To achieve this, the administrator adds the addresses http://contoso and http://contoso/sites/site1 to one content source and selects the setting called Crawl only the SharePoint site of each start address. The subsite http://contoso/sites/site2 is then added to a separate content source with a different crawl schedule.
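A plan like the Contoso scenario can be checked for overlapping addresses before any content sources are created. The following sketch is a simplified model: it assumes that a content source covers a URL if the URL equals one of its start addresses or, when subsites are included, if the URL falls under a start address. The function names and the plan structure are hypothetical.

```python
def covers(start, include_subsites, url):
    """Return True if a start address with the given setting covers the URL."""
    start = start.rstrip("/")
    url = url.rstrip("/")
    if url == start:
        return True
    return include_subsites and url.startswith(start + "/")


def find_overlaps(sources):
    """Find start addresses that another content source already covers.

    sources: dict of name -> (list of start addresses, include_subsites).
    Returns (covering_source, covering_address, other_source, other_address).
    """
    overlaps = []
    for name_a, (starts_a, subsites_a) in sources.items():
        for name_b, (starts_b, _) in sources.items():
            if name_a == name_b:
                continue
            for start_a in starts_a:
                for start_b in starts_b:
                    if covers(start_a, subsites_a, start_b):
                        overlaps.append((name_a, start_a, name_b, start_b))
    return overlaps


# The Contoso plan: site-only crawling for the first content source.
contoso_plan = {
    "Main": (["http://contoso", "http://contoso/sites/site1"], False),
    "Site2": (["http://contoso/sites/site2"], True),
}
```

For the plan shown, find_overlaps returns an empty list. If the first content source were set to crawl everything under each start address, http://contoso/sites/site2 would be reported as covered twice.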

In addition to crawl schedules, there are other things to consider when planning content sources. For example, whether you group start addresses in a single content source or create additional content sources to crawl those start addresses depends largely on administration considerations. Administrators often make changes that require a full update of a particular content source. Changes to a content source require a full crawl of that content source. To make administration easier, organize content sources in such a way that updating content sources, crawl rules, and crawling content is convenient for administrators.

Content sources summary

Consider the following when planning your content sources:

  • A particular content source can be used to crawl only one of the following content types: SharePoint sites, Web sites that are not SharePoint sites, file shares, Exchange public folders, and Lotus Notes databases.

  • Search services administrators can create up to 500 content sources, and each content source can contain up to 500 start addresses. To keep administration as simple as possible, create only as many content sources as you absolutely need.

  • Each URL in a particular content source must be of the same content source type.

  • For a particular content source, you can choose how deep to crawl from the start addresses. These configuration settings apply to all start addresses in the content source. The available choices on how deep you can crawl the start addresses differ depending on the content source type that is selected.

  • You can schedule when to perform either a full or incremental crawl for the entire content source. For more information about scheduling crawls, see Crawl content on different schedules earlier in this article.

  • Search services administrators can modify the default content source, create additional content sources for crawling other content, or both. For example, they can configure the default content source to also crawl content on a different server farm or they can create a new content source to crawl other content.

  • To effectively crawl all the content needed by your organization, use as many content sources as make sense for the types of sources you want to crawl, and for the frequency at which you plan to crawl them.
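The limits in this summary can be checked against a draft plan before deployment. The following is a minimal sketch under stated assumptions: the function name and the plan structure are hypothetical, and only the limits and content source types named in this article are encoded.

```python
MAX_CONTENT_SOURCES = 500
MAX_START_ADDRESSES = 500
CONTENT_SOURCE_TYPES = {
    "SharePoint sites",
    "Web sites",
    "File shares",
    "Exchange public folders",
    "Lotus Notes",
}


def validate_plan(plan):
    """Return a list of problems found in a draft content source plan.

    plan: dict of content source name -> (content source type, start addresses).
    Each content source has exactly one type by construction.
    """
    problems = []
    if len(plan) > MAX_CONTENT_SOURCES:
        problems.append(
            f"{len(plan)} content sources exceeds the limit of {MAX_CONTENT_SOURCES}"
        )
    for name, (source_type, start_addresses) in plan.items():
        if source_type not in CONTENT_SOURCE_TYPES:
            problems.append(f"{name}: unknown content source type {source_type!r}")
        if len(start_addresses) > MAX_START_ADDRESSES:
            problems.append(
                f"{name}: {len(start_addresses)} start addresses exceeds "
                f"the limit of {MAX_START_ADDRESSES}"
            )
    return problems
```

An empty result means that the draft stays within the documented limits; it does not mean that the plan is otherwise well organized.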

Plan for authentication

When the crawler accesses the start addresses that are listed in content sources, the crawler must be authenticated by and granted access to the servers that host that content. This means that the domain account used by the crawler must have at least read permission to the content.

The default content access account is the account that is used by default when crawling content sources. This account is specified by the search services administrator. Alternatively, you can use crawl rules to specify a different content access account to use when crawling particular content. Regardless of whether you use the default content access account or a different content access account specified by a crawl rule, the content access account that you use must have read access to all content that is crawled, or the content is not crawled and is not available to queries.

We recommend that you select a default content access account that has the broadest access to most of your crawled content, and only use other content access accounts when security considerations require separate content access accounts. For information about creating separate content access accounts to crawl content that cannot be read by using the default content access account, see Plan crawl rules later in this article.

For each content source you plan, identify the start addresses that cannot be accessed by the default content access account and plan to add crawl rules for URL patterns that encompass those start addresses.
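This identification step can be done with a short script once you have gathered which accounts can read each start address. The following sketch assumes that you collect that information yourself; the function name, the account names, and the addresses are hypothetical.

```python
def addresses_needing_crawl_rules(readers_by_address, default_account):
    """Return the start addresses that the default account cannot read.

    readers_by_address: dict of start address -> set of accounts with
    read permission, gathered manually during planning (an assumption).
    """
    return sorted(
        address
        for address, readers in readers_by_address.items()
        if default_account not in readers
    )


# Hypothetical inventory for one content source.
inventory = {
    "http://contoso/sites/hr": {"CONTOSO\\hrcrawl"},
    "http://contoso/sites/news": {"CONTOSO\\crawl", "CONTOSO\\hrcrawl"},
}
to_plan = addresses_needing_crawl_rules(inventory, "CONTOSO\\crawl")
```

Each address in the result is a candidate for a crawl rule that specifies a different content access account.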

Note

Ensure that the domain account used for the default content access account or any other content access account is not the same domain account that is used by an application pool associated with any Web application you crawl. Doing so can cause unpublished content in SharePoint sites and minor versions of files (history) in SharePoint sites to be crawled and indexed.

For more information about the planning considerations for content access accounts, see Plan crawl rules later in this article.

Another important consideration is that the crawler must use the same authentication method as the host server. By default, the crawler attempts to authenticate using NTLM authentication. You can configure the crawler to use a different authentication method, if necessary. For more information, see "Authentication requirements for crawling content" in Plan authentication methods (Office SharePoint Server). This article also pertains to Search Server 2008.

Plan protocol handlers

All content that is crawled requires the use of a protocol handler to gain access to that content. Search Server 2008 provides protocol handlers for all common Internet protocols. However, if you want to crawl content that requires a protocol handler that is not installed with Search Server 2008, you must install the third-party or custom protocol handler before you can crawl that content.

The following table shows the protocol handlers that are installed by default.

Protocol handler    Used to crawl
File                File shares
http                Web sites
https               Web sites over Secure Sockets Layer (SSL)
Notes               Lotus Notes databases
Rb                  Exchange public folders
Rbs                 Exchange public folders over SSL
Sps                 People profiles from Windows SharePoint Services 2.0 server farms
Sps3                People profile crawls of Windows SharePoint Services 3.0 server farms only
Sps3s               People profile crawls from Windows SharePoint Services 3.0 server farms only over SSL
Spsimport           People profile import
Spss                People profile import from Windows SharePoint Services 2.0 server farms over SSL
Sts                 Windows SharePoint Services 3.0 root URLs (internal protocol)
Sts2                Windows SharePoint Services 2.0 sites
Sts2s               Windows SharePoint Services 2.0 sites over SSL
Sts3                Windows SharePoint Services 3.0 sites
Sts3s               Windows SharePoint Services 3.0 sites over SSL
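When you review a list of planned start addresses, it can help to check which ones are covered by a default protocol handler. The following is a minimal planning sketch in Python, not part of Search Server 2008: the function name handler_for is hypothetical, and the lookup simply compares the URL scheme of each address with the table above.

```python
from urllib.parse import urlsplit

# Default protocol handlers, copied from the table above (keys lowercased).
DEFAULT_HANDLERS = {
    "file": "File shares",
    "http": "Web sites",
    "https": "Web sites over Secure Sockets Layer (SSL)",
    "notes": "Lotus Notes databases",
    "rb": "Exchange public folders",
    "rbs": "Exchange public folders over SSL",
    "sps": "People profiles from Windows SharePoint Services 2.0 server farms",
    "sps3": "People profile crawls of Windows SharePoint Services 3.0 server farms only",
    "sps3s": "People profile crawls from Windows SharePoint Services 3.0 server farms only over SSL",
    "spsimport": "People profile import",
    "spss": "People profile import from Windows SharePoint Services 2.0 server farms over SSL",
    "sts": "Windows SharePoint Services 3.0 root URLs (internal protocol)",
    "sts2": "Windows SharePoint Services 2.0 sites",
    "sts2s": "Windows SharePoint Services 2.0 sites over SSL",
    "sts3": "Windows SharePoint Services 3.0 sites",
    "sts3s": "Windows SharePoint Services 3.0 sites over SSL",
}


def handler_for(start_address):
    """Return (scheme, description) for a start address, or None when no
    default handler is listed for its scheme (hypothetical helper)."""
    scheme = urlsplit(start_address).scheme.lower()
    if scheme in DEFAULT_HANDLERS:
        return scheme, DEFAULT_HANDLERS[scheme]
    return None
```

An address for which handler_for returns None is a candidate for a third-party or custom protocol handler.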

Plan to manage the impact of crawling

Crawling content can significantly decrease the performance of the servers that host the content. The impact that this has on a particular server varies depending on the load that the host server is experiencing and whether the server has sufficient resources (particularly CPU and RAM) to maintain service level agreements under normal or peak usage.

Crawler impact rules enable farm administrators to manage the impact your crawler has on the servers being crawled. For each crawler impact rule, you can specify a single URL or use wildcard characters in the URL path to include a block of URLs to which the rule applies. You can then specify how many simultaneous requests for pages are made to the specified URL or choose to request only one document at a time and wait a number of seconds that you choose between requests.

Crawler impact rules reduce or increase the rate at which the crawler requests content from a particular start address or range of start addresses (sometimes called a site name), regardless of the content source used to crawl those addresses. The following table shows the wildcard characters that you can use in the site name when adding a rule.

Wildcard to use                            Result
* as the site name                         Applies the rule to all sites.
*.* as the site name                       Applies the rule to sites with dots in the name.
*.site_name.com as the site name           Applies the rule to all sites in the site_name.com domain (for example, *.adventure-works.com).
*.top-level_domain_name as the site name   Applies the rule to all sites that end with a specific top-level domain name, for example, *.com or *.net.
?                                          Replaces a single character in a rule. For example, *.adventure-works?.com applies to all sites in the domains adventure-works1.com, adventure-works2.com, and so on.
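To try out a site-name pattern before you add a rule, you can approximate the wildcard behavior described in the table. This sketch uses the fnmatch module from the Python standard library, in which * matches any run of characters and ? matches exactly one character. It is an approximation for planning only: site_matches is a hypothetical name, fnmatch also gives square brackets a special meaning, and the matching that Search Server 2008 performs might differ in edge cases.

```python
from fnmatch import fnmatchcase


def site_matches(pattern, site_name):
    """Approximate the * and ? wildcards of a crawler impact rule.

    Host names are not case-sensitive, so both values are lowercased
    before they are compared (hypothetical helper, for planning only).
    """
    return fnmatchcase(site_name.lower(), pattern.lower())


# Patterns taken from the table above; the site names are examples.
candidates = [
    ("*", "intranet"),
    ("*.*", "intranet"),
    ("*.adventure-works.com", "www.adventure-works.com"),
    ("*.adventure-works?.com", "www.adventure-works1.com"),
]
for pattern, site in candidates:
    print(pattern, site, site_matches(pattern, site))
```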

You can create a crawler impact rule that applies to all sites within a particular top-level domain. For example, *.com applies to all Internet sites with addresses that end in .com. Suppose that an administrator adds a content source for example.microsoft.com. The rule for *.com applies to this site unless you add a crawler impact rule specifically for example.microsoft.com.
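The way a specific rule takes the place of a general one can be sketched as follows. This is an illustrative model in Python, not the product's implementation. The ImpactRule class, the select_rule function, and the request values are hypothetical. The only behavior taken from this article is that a rule added specifically for a site is used instead of a wildcard rule such as *.com. Choosing the first matching wildcard rule when several match is an assumption.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class ImpactRule:
    site: str                       # site name, may contain * and ?
    simultaneous_requests: int = 0  # > 0: request this many documents at once
    delay_seconds: int = 0          # > 0: one document at a time, then wait


def select_rule(rules, site_name):
    """Pick the rule for a site (hypothetical model).

    A rule whose site name equals the site exactly wins. Otherwise the
    first wildcard rule that matches is used (assumption). Returns None
    when no rule applies.
    """
    name = site_name.lower()
    for rule in rules:
        if rule.site.lower() == name:
            return rule
    for rule in rules:
        if fnmatchcase(name, rule.site.lower()):
            return rule
    return None


# Hypothetical values: crawl .com sites slowly, but allow more requests
# for one site whose administrators have agreed to a higher rate.
rules = [
    ImpactRule("*.com", delay_seconds=5),
    ImpactRule("example.microsoft.com", simultaneous_requests=8),
]
```

With these rules, select_rule(rules, "example.microsoft.com") returns the specific rule, and any other .com site falls back to the *.com rule.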

For content within your organization that other administrators are crawling, you can coordinate with those administrators to set crawler impact rules based on the performance and capacity of the servers. For most external sites, this coordination is not possible. Requesting too much content from external servers, or making requests too frequently, can lead the administrators of those sites to limit your future access because your crawls use too many resources or too much bandwidth. Therefore, the best practice is to crawl external sites more slowly. In this way, you can mitigate the risk of losing access to crawl the relevant content.

During initial deployment, set the crawler impact rules to make as small an impact on other servers as possible while still crawling enough content frequently enough to ensure the freshness of the crawled content.

During the operations phase, you can adjust crawler impact rules based on your experiences and data from crawl logs.
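When you weigh a low-impact rule against content freshness, a rough estimate of crawl duration can help. The following Python sketch is back-of-the-envelope arithmetic under stated assumptions, not a model of the crawler: both function names are hypothetical, every request is assumed to take the same time, and simultaneous requests are assumed to complete in equal-sized batches.

```python
import math


def parallel_crawl_seconds(documents, seconds_per_request, simultaneous_requests):
    """Estimate for a rule that allows several simultaneous requests.

    Assumes requests complete in batches of `simultaneous_requests`.
    """
    batches = math.ceil(documents / simultaneous_requests)
    return batches * seconds_per_request


def delayed_crawl_seconds(documents, seconds_per_request, delay_seconds):
    """Estimate for a rule that requests one document at a time and
    waits `delay_seconds` between requests."""
    if documents == 0:
        return 0.0
    return documents * seconds_per_request + (documents - 1) * delay_seconds


# Hypothetical site: 1,000 documents, half a second per request.
print(parallel_crawl_seconds(1000, 0.5, 8))   # 8 requests at a time
print(delayed_crawl_seconds(1000, 0.5, 2.0))  # one at a time, 2-second wait
```

Comparing the two estimates with the interval between scheduled crawls shows whether a given rule can keep the content fresh.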

Plan crawl rules

Crawl rules apply to a particular URL or set of URLs represented by wildcards (also referred to as the path affected by the rule). You use crawl rules to do the following things:

  • Avoid crawling irrelevant content by excluding one or more URLs. This also helps to reduce the use of server resources and network traffic, and to increase the relevance of search results.

  • Crawl links on the URL without crawling the URL itself. This option is useful for sites in which a page links to relevant content but does not itself contain relevant information.

  • Enable complex URLs to be crawled. This option crawls URLs that contain a query parameter specified with a question mark. Depending on the site, these URLs might or might not include relevant content. Because complex URLs can often redirect to irrelevant sites, it is a good idea to enable this option on only sites where the content available from complex URLs is known to be relevant.

  • Enable content on SharePoint sites to be crawled as HTTP pages. This option enables the index server to crawl SharePoint sites that are behind a firewall or in scenarios in which the site being crawled restricts access to the Web service used by the crawler.

  • Specify whether to use the default content access account, a different content access account, or a client certificate for crawling the specified URL.

Note

Crawl rules apply simultaneously to all content sources.
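To estimate how many URLs the complex URL option described in the preceding list would affect, you can count the URLs in a sample (for example, from a site map) that carry a query string. A minimal Python sketch follows. The function name is_complex_url is hypothetical, and the test relies only on the description in this article of a complex URL as one that contains a query parameter specified with a question mark.

```python
from urllib.parse import urlsplit


def is_complex_url(url):
    """Return True when the URL has a non-empty query string after '?'."""
    return bool(urlsplit(url).query)


# Hypothetical sample of URLs gathered during planning.
sample = [
    "http://server1/products/list.aspx?category=7",
    "http://server1/products/overview.htm",
]
complex_urls = [url for url in sample if is_complex_url(url)]
```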

Often, most of the content for a particular site address is relevant, but not a specific subsite or range of sites below that site address. By selecting a focused combination of URLs for which to create crawl rules that exclude unneeded items, search services administrators can maximize the relevance of the content in the index while minimizing the impact on crawling performance and the size of search databases. Creating crawl rules to exclude URLs is particularly useful when planning start addresses for external content, because your organization does not control how crawling affects resource usage on those servers.

When creating a crawl rule, you can use standard wildcard characters in the path. For example:

  • http://server1/folder* includes all Web resources with a URL that starts with http://server1/folder.

  • *://*.txt includes every document with the .txt file name extension.
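A planned set of crawl rules can be checked against sample URLs before deployment. The sketch below is an approximation in Python, not the crawler's own logic. CrawlRule and should_crawl are hypothetical names. Three behaviors are assumptions that this article does not state: rules are evaluated in list order with the first match deciding, a URL that matches no rule is crawled, and matching is not case-sensitive.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class CrawlRule:
    path: str      # URL path affected by the rule, may contain wildcards
    include: bool  # True to include matching URLs, False to exclude them


def should_crawl(rules, url):
    """Return the decision of the first rule whose path matches the URL.

    First-match ordering and the default of True are assumptions.
    """
    for rule in rules:
        if fnmatchcase(url.lower(), rule.path.lower()):
            return rule.include
    return True


# Paths based on the examples above; the archive folder is hypothetical.
rules = [
    CrawlRule("http://server1/folder/archive*", include=False),
    CrawlRule("http://server1/folder*", include=True),
    CrawlRule("*://*.txt", include=False),
]
```

Placing the narrower archive rule first lets it exclude a subfolder of a path that is otherwise included.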

Because crawling content consumes resources and bandwidth, it is better to include a smaller amount of content that you know is relevant than a larger amount of content that might be irrelevant. After the initial deployment, you can review the query and crawl logs and adjust content sources and crawl rules to be more relevant and include more content.

Specify a different content access account

For crawl rules that include content, administrators have the option of changing the content access account for the rule. The default content access account is used unless another account is specified in a crawl rule. The main reason to use a different content access account for a crawl rule is that the default content access account does not have access to all start addresses. For those start addresses, you can create a crawl rule and specify an account that does have access.

Note

Ensure that the domain account used for the default content access account or any other content access account is not the same domain account that is used by an application pool associated with any Web application you crawl. Using the same account can cause unpublished content and minor versions of files (history) in SharePoint sites to be crawled and indexed.
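The account planning described in this section can be captured in a small worksheet script. The following Python sketch is illustrative only. The function names and the CONTOSO account names are hypothetical, and the first-match ordering of rules is an assumption. It shows the default account being used unless a crawl rule names another account, and it flags any content access account that is also used by an application pool.

```python
from fnmatch import fnmatchcase


def account_for(url, default_account, rule_accounts):
    """Return the content access account planned for a URL.

    `rule_accounts` is a list of (path pattern, account) pairs. The first
    matching pattern wins (assumption). Otherwise the default is used.
    """
    for path, account in rule_accounts:
        if fnmatchcase(url.lower(), path.lower()):
            return account
    return default_account


def shared_with_app_pools(content_access_accounts, app_pool_accounts):
    """Return content access accounts that are also application pool
    accounts. Account names are compared without regard to case."""
    pools = {name.lower() for name in app_pool_accounts}
    return {name for name in content_access_accounts if name.lower() in pools}


# Hypothetical plan: a separate account for a site the default cannot read.
default_account = "CONTOSO\\svc-crawl"
rule_accounts = [("http://hr/*", "CONTOSO\\svc-crawl-hr")]
```

A non-empty result from shared_with_app_pools identifies accounts to replace before the first crawl.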

Plan search settings that are managed at the farm level

In addition to the settings that are configured at the Search Administration level, several settings that are managed at the farm level affect how content is crawled. Consider the following farm-level search settings while planning for crawling:

  • Contact e-mail address   Crawling content affects the resources of the servers that are being crawled. Before you can crawl content, you must provide in the configuration settings the e-mail address of the person in your organization whom administrators can contact in the event that the crawl adversely affects their servers. This e-mail address appears in logs for administrators of the servers being crawled so that those administrators can contact someone if the impact of crawling on their performance and bandwidth is too great, or if other issues occur.

The contact e-mail address should belong to a person who has the necessary expertise and availability to respond quickly to requests. Alternatively, you can use a closely monitored distribution-list alias as the contact e-mail address. Regardless of whether the content crawled is stored internally to the organization or not, quick response time is important.

  • Proxy server settings   You can choose whether to use a proxy server when crawling content. The proxy server to use depends on the topology of your Search Server 2008 deployment and the architecture of other servers in your organization.

  • Time-out settings   The time-out settings are used to limit the time that the search server waits while connecting to other services.

  • SSL setting   The Secure Sockets Layer (SSL) setting determines whether the SSL certificate must exactly match to crawl content.

Indexing content in different languages

When crawling content, the crawler determines each individual word in the content it finds. Languages that have words separated by spaces make it relatively easy for the crawler to distinguish each word. In other languages, finding the boundary between words can be more complex.

Search Server 2008 provides word breakers and stemmers by default to help crawl and index content in many languages. Word breakers find word boundaries in full-text indexed data, while stemmers relate the inflected forms of a word to one another (for example, by conjugating verbs).
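The difference between the two components can be illustrated with a toy example. The Python sketch below is not how Search Server 2008 works: break_words splits only on non-word characters, and naive_stem strips a few English suffixes. Both functions are hypothetical and deliberately simplistic.

```python
import re


def break_words(text):
    """Toy word breaker: lowercases the text and splits on non-word
    characters. Adequate only for languages that separate words with
    spaces or punctuation."""
    return re.findall(r"\w+", text.lower())


def naive_stem(word):
    """Toy stemmer: strips a few English suffixes so that inflected
    forms of a verb share one stem."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = break_words("Crawling, crawled, crawls")
stems = {naive_stem(word) for word in words}
```

In this example all three words reduce to the single stem crawl. Text written without spaces between words comes back from break_words as one token, which is why language-specific word breakers are needed.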

If you are crawling any of the languages listed in the following table, Search Server 2008 automatically uses the appropriate word breaker and stemmer for that language. An asterisk (*) indicates that the stemming feature is on by default.

Language supported by default    Language supported by default
Arabic                           Lithuanian*
Bengali                          Malay
Bulgarian*                       Malayalam*
Catalan                          Marathi
Croatian                         Norwegian_Bokmaal
Czech*                           Polish*
Danish                           Portuguese
Dutch                            Portuguese_Brazilian
English                          Punjabi
Finnish*                         Romanian*
French*                          Russian*
German*                          Serbian_Cyrillic*
Greek*                           Serbian_Latin*
Gujarati                         Slovak*
Hebrew                           Slovenian*
Hindi                            Spanish*
Hungarian*                       Swedish
Icelandic*                       Tamil*
Indonesian                       Telugu*
Italian                          Thai
Japanese                         Turkish*
Kannada*                         Ukrainian*
Korean                           Urdu*
Latvian*                         Vietnamese

When the crawler indexes content for a language that is not supported, the neutral word breaker is used. If the neutral word breaker does not give you the results you expect, you can try third-party solutions that work with Search Server 2008.
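If your content inventory records the language of each source, you can check it against the table of languages supported by default. The following Python sketch is a planning aid only. The data is transcribed from that table, and the function names are hypothetical. For a language that is not listed, the sketch reports the neutral word breaker and returns None for stemming, because this article does not describe stemming for unsupported languages.

```python
# Languages marked with an asterisk in the table (stemming on by default).
STEMMING_ON = {
    "Bulgarian", "Czech", "Finnish", "French", "German", "Greek",
    "Hungarian", "Icelandic", "Kannada", "Latvian", "Lithuanian",
    "Malayalam", "Polish", "Romanian", "Russian", "Serbian_Cyrillic",
    "Serbian_Latin", "Slovak", "Slovenian", "Spanish", "Tamil", "Telugu",
    "Turkish", "Ukrainian", "Urdu",
}

# Languages listed in the table without an asterisk.
STEMMING_NOT_MARKED = {
    "Arabic", "Bengali", "Catalan", "Croatian", "Danish", "Dutch",
    "English", "Gujarati", "Hebrew", "Hindi", "Indonesian", "Italian",
    "Japanese", "Korean", "Malay", "Marathi", "Norwegian_Bokmaal",
    "Portuguese", "Portuguese_Brazilian", "Punjabi", "Swedish", "Thai",
    "Vietnamese",
}

SUPPORTED = STEMMING_ON | STEMMING_NOT_MARKED


def word_breaker_for(language):
    """Return 'language-specific' for a listed language, else 'neutral'."""
    return "language-specific" if language in SUPPORTED else "neutral"


def stemming_on_by_default(language):
    """Return True or False for a listed language, or None when the
    table has no entry for the language."""
    if language not in SUPPORTED:
        return None
    return language in STEMMING_ON
```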

See Also

Concepts

Working with federation (Search Server 2008)