Manage crawl rules (Search Server 2010)

 

Applies to: Search Server 2010

Topic Last Modified: 2011-05-23

You can add a crawl rule to include or exclude specific paths when you crawl content. When you include a path, you can optionally provide alternative account credentials to crawl it. In addition to creating or editing crawl rules, you can test, delete, or reorder existing crawl rules.

Use crawl rules to perform the following:

  • Prevent content on a site from being crawled. For example, if you created a content source to crawl https://www.contoso.com, but you do not want the search system to crawl content from the subdirectory https://www.contoso.com/downloads, create a crawl rule to exclude content from that subdirectory.

  • Crawl content on a site that would be excluded otherwise. For example, if you excluded content from https://www.contoso.com/downloads from being crawled, but you want content in the subdirectory https://www.contoso.com/downloads/content to be crawled, create a crawl rule to include content from that subdirectory.

  • Specify authentication credentials. If a site to be crawled requires different credentials than those of the default content access account, create a crawl rule to specify the authentication credentials.

You can use the asterisk (*) as a wildcard character in crawl rules. For example, to exclude JPEG files from crawls on https://www.contoso.com, create a crawl rule to exclude https://www.contoso.com/*.jpg.
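
The wildcard behavior can be illustrated with a small Python sketch. It is not the matching code that the search system uses. It assumes that an asterisk matches any run of characters and that matching ignores case unless a rule asks for exact capitalization. The names wildcard_to_regex and path_matches are hypothetical.

```python
import re


def wildcard_to_regex(pattern, match_case=False):
    """Compile a path pattern in which * matches any run of characters.

    Assumption: every character other than * is treated literally.
    """
    # Escape the whole pattern, then turn each escaped asterisk into ".*".
    body = re.escape(pattern).replace(r"\*", ".*")
    flags = 0 if match_case else re.IGNORECASE
    return re.compile(body, flags)


def path_matches(pattern, url, match_case=False):
    """Return True if the whole URL matches the wildcard pattern."""
    return wildcard_to_regex(pattern, match_case).fullmatch(url) is not None


rule = "https://www.contoso.com/*.jpg"
print(path_matches(rule, "https://www.contoso.com/images/logo.jpg"))  # True
print(path_matches(rule, "https://www.contoso.com/images/logo.png"))  # False
```

Under these assumptions, the rule path from the example matches any URL on the site that ends in .jpg, regardless of how many folders lie between the host name and the file name.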

Rule order is important, because the first rule that matches a particular set of content is the one that is applied.
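
To see why order matters, consider the following Python sketch of first-match evaluation. It reuses the include and exclude examples from the list above. The CrawlRule type and the matches and first_match functions are hypothetical, and the wildcard handling is a simplifying assumption rather than the behavior of the product.

```python
import re
from typing import NamedTuple


class CrawlRule(NamedTuple):
    path: str      # may contain * wildcards
    include: bool  # True = include the path, False = exclude it


def matches(rule, url):
    """Assumed semantics: * matches any characters, case is ignored."""
    regex = re.escape(rule.path).replace(r"\*", ".*")
    return re.fullmatch(regex, url, re.IGNORECASE) is not None


def first_match(rules, url):
    """Return the first rule in list order that matches the URL, or None."""
    for rule in rules:
        if matches(rule, url):
            return rule
    return None


include_content = CrawlRule("https://www.contoso.com/downloads/content/*", True)
exclude_downloads = CrawlRule("https://www.contoso.com/downloads/*", False)
url = "https://www.contoso.com/downloads/content/guide.htm"

# Include rule listed first: the URL is crawled.
print(first_match([include_content, exclude_downloads], url).include)  # True
# Exclude rule listed first: the same URL is skipped.
print(first_match([exclude_downloads, include_content], url).include)  # False
```

Both rules match the URL, so the outcome depends only on which rule appears first. In this sketch, the more specific include rule has to be placed above the broader exclude rule for the subdirectory to be crawled.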

In this article:

  • To create or edit a crawl rule

  • To test a crawl rule on a URL

  • To delete a crawl rule

  • To reorder crawl rules

To create or edit a crawl rule

  1. Verify that the user account that is performing this procedure is a service application administrator for the Search service application.

  2. In Central Administration, in the Application Management section, click Manage Service Applications.

  3. On the Manage Service Applications page, in the list of service applications, click Search Service Application.

  4. On the Search Administration page, in the Quick Launch, click Crawl Rules. The Manage Crawl Rules page appears.

  5. To create a new crawl rule, click New Crawl Rule. To edit an existing crawl rule, in the list of crawl rules, point to the name of the crawl rule that you want to edit, click the arrow that appears, and then click Edit.

  6. On the Add Crawl Rule page, in the Path section:

    • In the Path box, type the path to which the crawl rule will apply. You can use standard wildcard characters in the path.

    • Select the Follow regular expression syntax when matching this rule check box to use regular expressions instead of wildcard characters.

    • Select the Match case check box if you want the capitalization in the provided path to exactly match the capitalization of the actual path.

  7. In the Crawl Configuration section, select one of the following options:

    • Exclude all items in this path. Select this option if you want to exclude all items in the specified path from crawls. If you select this option, you can refine the exclusion by selecting the following:

      • Exclude complex URLs (URLs that contain question marks (?)). Select this option if you want to exclude URLs that contain parameters that use the question mark (?) notation.

    • Include all items in this path. Select this option if you want all items in the path to be crawled. If you select this option, you can further refine the inclusion by selecting any combination of the following:

      • Follow links on the URL without crawling the URL itself. Select this option if you want to crawl the links found at the URL, but not the content at the starting URL itself.

      • Crawl complex URLs (URLs that contain a question mark (?)). Select this option if you want to crawl URLs that contain parameters that use the question mark (?) notation.

      • Crawl SharePoint content as HTTP pages. Normally, SharePoint sites are crawled by using a special protocol. Select this option if you want SharePoint sites to be crawled as HTTP pages instead. When the content is crawled by using the HTTP protocol, item permissions are not stored.

  8. In the Specify Authentication section, perform one of the following actions:

    Note

    This option is not available unless the Include all items in this path option is selected in the Crawl Configuration section.

    • To use the default content access account, select Use the default content access account.

    • If you want to use a different account, select Specify a different content access account and then perform the following actions:

      1. In the Account box, type the user account name that can access the paths that are defined in this crawl rule.

      2. In the Password and Confirm Password boxes, type the password for this user account.

      3. To prevent Basic authentication from being used, select the Do not allow Basic Authentication check box. The server first attempts to use NTLM authentication. If NTLM authentication fails, the server falls back to Basic authentication unless this check box is selected.

    • To use a client certificate for authentication, select Specify client certificate, expand the Certificate menu, and then select a certificate.

    • To use form credentials for authentication, select Specify form credentials, type the form URL (the location of the page that accepts credentials information) in the Form URL box, and then click Enter Credentials. When the logon prompt from the remote server opens in a new window, type the form credentials with which you want to log on. A message indicates whether the logon was successful. If the logon was successful, the credentials that are required for authentication are stored on the remote site.

    • To use cookies, select Use cookie for crawling, and then select either of the following options:

      • Obtain cookie from a URL. Select this option to obtain a cookie from a Web site or server.

      • Specify cookie for crawling. Select this option to import a cookie from your local file system or a file share. You can optionally specify error pages in the Error pages (semi-colon delimited) box.

  9. Click OK.
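
The way the options in this procedure depend on one another can be summarized in a short Python sketch. It is only a model of the page for illustration and does not call any product API. The CrawlRuleSettings and Auth names, the field names, and the message text are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Auth(Enum):
    DEFAULT_ACCOUNT = "default content access account"
    DIFFERENT_ACCOUNT = "different content access account"
    CLIENT_CERTIFICATE = "client certificate"
    FORM_CREDENTIALS = "form credentials"
    COOKIE = "cookie"


@dataclass
class CrawlRuleSettings:
    path: str
    include: bool                           # False = exclude all items in the path
    use_regex: bool = False                 # regular expressions instead of wildcards
    match_case: bool = False
    exclude_complex_urls: bool = False      # refinement for exclude rules
    follow_links_only: bool = False         # refinement for include rules
    crawl_complex_urls: bool = False        # refinement for include rules
    crawl_sharepoint_as_http: bool = False  # refinement for include rules
    auth: Auth = Auth.DEFAULT_ACCOUNT       # here, the default also means "not set"

    def problems(self) -> List[str]:
        """List the option combinations that the procedure does not allow."""
        found = []
        if not self.path:
            found.append("a path is required")
        if self.include:
            if self.exclude_complex_urls:
                found.append("'exclude complex URLs' applies only to exclude rules")
        else:
            if (self.follow_links_only or self.crawl_complex_urls
                    or self.crawl_sharepoint_as_http):
                found.append("include refinements require an include rule")
            if self.auth is not Auth.DEFAULT_ACCOUNT:
                found.append("authentication options require an include rule")
        return found


rule = CrawlRuleSettings(
    path="https://www.contoso.com/downloads/*",
    include=False,
    auth=Auth.FORM_CREDENTIALS,
)
print(rule.problems())
```

The last check mirrors the note in step 8: authentication settings are meaningful only when Include all items in this path is selected.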

To test a crawl rule on a URL

  1. Verify that the user account that is performing this procedure is a service application administrator for the Search service application.

  2. In Central Administration, in the Application Management section, click Manage Service Applications.

  3. On the Manage Service Applications page, in the list of service applications, click Search Service Application.

  4. On the Search Administration page, in the Quick Launch, click Crawl Rules.

  5. On the Manage Crawl Rules page, in the Type a URL and then click test to find out whether it matches a rule box, type the URL that you want to test.

  6. Click Test. The result of the test appears below the Type a URL and then click test to find out whether it matches a rule box.

To delete a crawl rule

  1. Verify that the user account that is performing this procedure is a service application administrator for the Search service application.

  2. In Central Administration, in the Application Management section, click Manage Service Applications.

  3. On the Manage Service Applications page, in the list of service applications, click Search Service Application.

  4. On the Search Administration page, in the Quick Launch, click Crawl Rules.

  5. On the Manage Crawl Rules page, in the list of crawl rules, point to the name of the crawl rule that you want to delete, click the arrow that appears, and then click Delete.

  6. Click OK to confirm that you want to delete this crawl rule.

To reorder crawl rules

  1. Verify that the user account that is performing this procedure is a service application administrator for the Search service application.

  2. In Central Administration, in the Application Management section, click Manage Service Applications.

  3. On the Manage Service Applications page, in the list of service applications, click Search Service Application.

  4. On the Search Administration page, in the Quick Launch, click Crawl Rules.

  5. On the Manage Crawl Rules page, in the list of crawl rules, in the Order column, specify the position that you want the rule to occupy. The positions of the other rules shift accordingly.
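
The effect of changing a value in the Order column can be pictured with the following Python sketch. It assumes that positions are numbered from 1 and that each rule path appears only once in the list. The move_rule function is hypothetical and is not part of the product.

```python
def move_rule(rules, path, new_position):
    """Return a new list with `path` moved to the 1-based `new_position`.

    The remaining rules keep their relative order and shift to make room.
    """
    if not 1 <= new_position <= len(rules):
        raise ValueError("position is out of range")
    reordered = list(rules)
    reordered.remove(path)  # raises ValueError if the rule is not in the list
    reordered.insert(new_position - 1, path)
    return reordered


rules = [
    "https://www.contoso.com/downloads/*",
    "https://www.contoso.com/*.jpg",
    "https://www.contoso.com/downloads/content/*",
]

# Move the include rule for the content subdirectory to the top of the list.
for position, path in enumerate(
    move_rule(rules, "https://www.contoso.com/downloads/content/*", 1), start=1
):
    print(position, path)
```

Because the first matching rule is the one that is applied, moving a narrow rule above a broad one is typically how an exception to the broad rule takes effect.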