Manage crawl rules (FAST Search Web crawler)

 

Applies to: FAST Search Server 2010

The FAST Search Web crawler supports a variety of rules that control which content is crawled. The rules consist of URL-based and host name-based include and exclude rules, file extension excludes, MIME type restrictions, and URL scheme restrictions. For additional guidance, see Determine what Web content to crawl.

How to include content in the crawl

Include rules are configured with the include_domains and include_uris settings, and specify what Web content should be included. They do not define where the FAST Search Web crawler starts the crawl; that is specified in the start URL list.

The include_domains setting specifies which host names the Web crawler is allowed to crawl; the include_uris setting is the URL equivalent. The following table shows some examples of how these rules can be used. For a complete overview, see include_domains and include_uris in the Web Crawler XML configuration file reference topic.

Setting          Rule type  Description
include_uris     exact      Include only the page https://www.contoso.com/
include_uris     prefix     Include all content under https://www.contoso.com/public/
include_uris     suffix     Include all URLs ending with “.html”
include_uris     regexp     Include URLs matching the pattern “.*contoso.*”
include_uris     file       Include <path to local file>. This file contains the rule set to use.
include_domains  exact      Include the host name “www.contoso.com”.
include_domains  prefix     Include host names that begin with “www.contoso”.
include_domains  suffix     Include host names ending with “.contoso.com”.
include_domains  regexp     Include all host names starting with www and ending with com: “www\..*\.com”.
include_domains  file       Include <path to local file>. This file contains the rule set to use.
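
The exact, prefix, suffix and regexp rule types in the table can be illustrated with a short Python sketch. It shows the matching semantics only and is not the crawler's implementation. The function name matches_rule is hypothetical, and the sketch assumes that a regexp rule must match the entire URL or host name, which the table examples suggest but do not state. The file rule type is left out.

```python
import re


def matches_rule(rule_type: str, pattern: str, value: str) -> bool:
    """Return True if value (a URL or host name) satisfies a single rule."""
    if rule_type == "exact":
        return value == pattern
    if rule_type == "prefix":
        return value.startswith(pattern)
    if rule_type == "suffix":
        return value.endswith(pattern)
    if rule_type == "regexp":
        # Assumption: the pattern has to match the whole value.
        return re.fullmatch(pattern, value) is not None
    raise ValueError(f"unsupported rule type: {rule_type}")


# Host name checks mirroring the include_domains rows of the table.
print(matches_rule("suffix", ".contoso.com", "mail.contoso.com"))
print(matches_rule("regexp", r"www\..*\.com", "www.contoso.org"))
```

The first call reports a match because the host name ends with ".contoso.com"; the second does not match because the host name ends with "org" rather than "com".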

If neither include_uris nor include_domains is specified, the FAST Search Web crawler treats all Web sites and URLs as eligible for crawling unless they are explicitly excluded by other rules.

How to exclude content from the crawl

Important

Exclude rules override include rules. This means that a Web site or URL specified in the exclude rules will not be crawled, even if it was included by an include rule.

Exclude certain domains and/or sites

Exclude rules are configured with the exclude_domains and exclude_uris settings, and specify what Web content should not be crawled. The rules are configured similarly to the include rules.
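
The precedence described in the Important note can be sketched in Python as follows. This is a simplified model, not the crawler's own logic: it uses only prefix rules on URLs, the function name is_eligible is hypothetical, and it does not model how include_uris and include_domains combine with each other.

```python
def is_eligible(url: str, include_prefixes: list[str], exclude_prefixes: list[str]) -> bool:
    """Decide whether a URL may be crawled under prefix-only rules."""
    # Exclude rules are checked first because they override include rules.
    if any(url.startswith(prefix) for prefix in exclude_prefixes):
        return False
    # With no include rules, everything that is not excluded is eligible.
    if not include_prefixes:
        return True
    return any(url.startswith(prefix) for prefix in include_prefixes)


include = ["https://www.contoso.com/"]
exclude = ["https://www.contoso.com/private/"]

# Matches an include rule, but the exclude rule wins.
print(is_eligible("https://www.contoso.com/private/report.html", include, exclude))
```

Checking the exclude rules before the include rules keeps the override explicit: a URL that matches both lists is never crawled.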

Exclude file types

Content can also be excluded based on the file name extension of the Web item. A common example is graphical content, where JPEG, GIF and bitmap files are excluded. Links to audio or video content are typically excluded to avoid downloading large amounts of data with no text content, although special multimedia crawls may choose to include this content for further processing.

These restrictions can be implemented either with the exclude_exts setting, which excludes specific file types by their extension (for example, any Web item that ends with ".jpg"), or with the allowed_types setting, which restricts the MIME types that are crawled (for example, "image/jpeg"). Note that MIME type screening requires the crawler to actually fetch part of the document, whereas an extension exclude can be performed without any network access, simply by examining the hyperlink itself.
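
The difference between the two checks can be sketched in Python. The sketch is illustrative and rests on assumptions: the function names are hypothetical, the extension is taken from the path component of the URL, and MIME types are compared as exact strings. The real exclude_exts and allowed_types settings may behave differently in these details.

```python
import posixpath
from urllib.parse import urlsplit


def excluded_by_extension(url: str, exclude_exts: set[str]) -> bool:
    """Inspect only the hyperlink text; no network access is needed."""
    path = urlsplit(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in exclude_exts


def allowed_by_mime_type(content_type: str, allowed_types: set[str]) -> bool:
    """Inspect a Content-Type header value, which exists only after a fetch has begun."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime_type = content_type.split(";", 1)[0].strip().lower()
    return mime_type in allowed_types


print(excluded_by_extension("https://www.contoso.com/images/logo.jpg", {".jpg", ".gif"}))
print(allowed_by_mime_type("text/html; charset=utf-8", {"text/html"}))
```

The first function needs nothing but the link, so it can discard URLs before any request is made. The second needs a response header, which is why MIME type screening costs at least a partial download.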

Exclude URL schemes

The FAST Search Web crawler can be restricted to crawl only certain URL schemes by using the allowed_schemes setting. For example, this setting can restrict the Web crawler to HTTP and HTTPS content only.
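
A scheme restriction amounts to comparing the scheme component of each URL against an allowed set. The following Python sketch illustrates the idea. The function name scheme_allowed is hypothetical and the code is not taken from the crawler.

```python
from urllib.parse import urlsplit


def scheme_allowed(url: str, allowed_schemes: set[str]) -> bool:
    """Return True if the URL's scheme is in the allowed set."""
    # urlsplit normalizes the scheme to lowercase.
    return urlsplit(url).scheme in allowed_schemes


allowed = {"http", "https"}
print(scheme_allowed("https://www.contoso.com/", allowed))
print(scheme_allowed("ftp://ftp.contoso.com/file.txt", allowed))
```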

See Also

Concepts

Web Crawler XML configuration reference
Determine crawl schedules