2.2.4.3 CrawlRuleInternal

The CrawlRuleInternal type represents the properties for a crawl rule.

 <s:complexType name="CrawlRuleInternal">
   <s:sequence>
     <s:element name="path" type="s:string" minOccurs="0"/>
     <s:element name="type" type="s:int"/>
     <s:element name="authenticationType" type="s:int"/>
     <s:element name="accountName" type="s:string" minOccurs="0"/>
     <s:element name="contentClass" type="s:string" minOccurs="0"/>
     <s:element name="suppressIndexing" type="s:boolean"/>
     <s:element name="followComplexUrls" type="s:boolean"/>
     <s:element name="crawlAsHttp" type="s:boolean"/>
     <s:element name="enabled" type="s:boolean"/>
     <s:element name="pluggableSecurityTrimmerId" type="s:int"/>
     <s:element name="authUrl" type="s:string" minOccurs="0"/>
     <s:element name="authData" type="s:string" minOccurs="0"/>
     <s:element name="miscData" type="tns:ArrayOfString" minOccurs="0"/>
     <s:element name="accountLastModified" type="s:dateTime"/>
   </s:sequence>
 </s:complexType>
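As an illustration of how an instance of this type might be read, the following is a minimal Python sketch, not a normative procedure. It assumes an element named CrawlRuleInternal with no namespace and assumes that the items of miscData are serialized as child elements named string; neither detail is stated in this section. The function name parse_crawl_rule and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Element names taken from the schema above, grouped by XSD type.
INT_FIELDS = {"type", "authenticationType", "pluggableSecurityTrimmerId"}
BOOL_FIELDS = {"suppressIndexing", "followComplexUrls", "crawlAsHttp", "enabled"}
# Elements declared without minOccurs="0" are required.
REQUIRED = INT_FIELDS | BOOL_FIELDS | {"accountLastModified"}


def parse_crawl_rule(xml_text):
    """Parse one CrawlRuleInternal element into a plain dict (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rule = {}
    for child in root:
        name = child.tag.split("}")[-1]  # drop a namespace prefix if one is present
        text = child.text or ""
        if name in INT_FIELDS:
            rule[name] = int(text)
        elif name in BOOL_FIELDS:
            # xs:boolean accepts the lexical forms true, false, 1 and 0.
            rule[name] = text.strip() in ("true", "1")
        elif name == "miscData":
            # Assumption: each array item is a child element of miscData.
            rule[name] = [item.text or "" for item in child]
        else:
            rule[name] = text  # strings and the raw xs:dateTime text
    missing = REQUIRED - set(rule)
    if missing:
        raise ValueError("missing required elements: " + ", ".join(sorted(missing)))
    return rule


# Hypothetical sample instance; all values are made up for illustration.
SAMPLE = """\
<CrawlRuleInternal>
  <path>http://example.com/*</path>
  <type>0</type>
  <authenticationType>4</authenticationType>
  <suppressIndexing>false</suppressIndexing>
  <followComplexUrls>true</followComplexUrls>
  <crawlAsHttp>false</crawlAsHttp>
  <enabled>true</enabled>
  <pluggableSecurityTrimmerId>0</pluggableSecurityTrimmerId>
  <authUrl>http://example.com/login</authUrl>
  <miscData>
    <string>http://example.com/error</string>
  </miscData>
  <accountLastModified>2010-01-01T00:00:00</accountLastModified>
</CrawlRuleInternal>
"""

rule = parse_crawl_rule(SAMPLE)
```

The sketch keeps accountLastModified as text rather than converting it, because the time zone handling for that element is not described here.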

path: A crawl rule path expression. MUST be present, and the length MUST be greater than 0 and less than 2048 characters. MUST be either a Universal Naming Convention (UNC) path or a URL, with the following characters allowed: ‘*’ and ‘?’.
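The length and form constraints on path could be checked along the following lines. This is a simplified sketch: the function name is_valid_rule_path is hypothetical, and the tests used to recognize a UNC path (a leading double backslash) and a URL (a non-empty scheme followed by "://") are assumptions made for illustration, not rules taken from this section.

```python
MAX_PATH_LENGTH = 2048  # exclusive upper bound stated for path


def is_valid_rule_path(path):
    """Return True if path satisfies the length rule and looks like a UNC path or URL."""
    if path is None or not (0 < len(path) < MAX_PATH_LENGTH):
        return False
    if path.startswith("\\\\"):
        return True  # assumption: a leading \\ marks a UNC path
    # Assumption: anything of the form scheme://rest is treated as a URL.
    scheme, sep, rest = path.partition("://")
    return bool(scheme) and bool(sep) and bool(rest)
```

Wildcard characters need no special handling in this check because they are ordinary characters as far as the length and prefix tests are concerned.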

type: The crawl rule type. MUST be one of the following values:

Value    Meaning
0        Inclusion rule. URLs matching the path are included in the crawl.
1        Exclusion rule. URLs matching the path are not included in the crawl.
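The effect of the two rule types on a single URL can be sketched as follows. The matching semantics are assumptions, since this section does not define them: the sketch treats ‘*’ as any sequence of characters, ‘?’ as exactly one character, and compares case-insensitively. The names CrawlRuleType, path_to_regex and is_included are hypothetical.

```python
import re
from enum import IntEnum


class CrawlRuleType(IntEnum):
    INCLUSION = 0
    EXCLUSION = 1


def path_to_regex(path):
    """Translate a rule path into a regex (assumed wildcard semantics)."""
    parts = []
    for ch in path:
        if ch == "*":
            parts.append(".*")  # assumption: any run of characters
        elif ch == "?":
            parts.append(".")   # assumption: exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def is_included(rule_type, path, url):
    """Return True or False if the rule matches url, or None if it does not apply."""
    kind = CrawlRuleType(rule_type)  # raises ValueError for values other than 0 and 1
    if path_to_regex(path).fullmatch(url) is None:
        return None
    return kind is CrawlRuleType.INCLUSION
```

Returning None for a non-matching URL leaves the decision to whatever other rules or defaults the crawler applies, which this sketch does not model.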

authenticationType: The authentication type for accessing matching URLs. MUST be one of the following values:

Value    Meaning
0        Default access.
1        Integrated authentication.
2        Basic authentication. The user name and password are required for this authentication type.
3        Authentication using a certificate. A valid client certificate name is required for this authentication type.
4        Forms authentication. A valid URL for HTTP POST or HTTP GET, public and private parameters, and a list of error pages are used by this authentication type.
5        Cookie-based authentication. Private parameters and a list of error pages are used by this authentication type.

Default access implies integrated authentication using credentials of the default crawl account for the crawler application.
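One way to represent the authenticationType values in code is an enumeration that rejects anything outside the listed range. This is only a sketch: the member names and the function parse_authentication_type are hypothetical and are not defined by this section.

```python
from enum import IntEnum


class AuthenticationType(IntEnum):
    # Member names are illustrative; only the numeric values come from the table.
    DEFAULT = 0
    INTEGRATED = 1
    BASIC = 2
    CERTIFICATE = 3
    FORMS = 4
    COOKIE = 5


def parse_authentication_type(value):
    """Convert an integer to AuthenticationType, rejecting unlisted values."""
    try:
        return AuthenticationType(value)
    except ValueError:
        raise ValueError(f"authenticationType MUST be 0 through 5, got {value!r}") from None
```

Because the class derives from IntEnum, a parsed value still compares equal to the plain integer carried in the message.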

accountName: If present, the length MUST be less than 256 characters. This element MUST be interpreted differently depending on the value of authenticationType. The following table specifies interpretation and restrictions for this element:

authenticationType value    accountName interpretation
0                           MUST be ignored.
1                           The user name for integrated authentication.
2                           The user name for basic authentication.
3                           The X.509 certificate name.
4                           MUST be ignored.
5                           MUST be ignored.
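The interpretation table for accountName can be expressed as a lookup keyed on authenticationType. The following sketch uses the hypothetical function interpret_account_name and applies the length limit whenever the element is present, including the cases in which the value is then ignored; that ordering is an assumption.

```python
MAX_ACCOUNT_NAME_LENGTH = 256  # exclusive upper bound stated for accountName

# Only the authenticationType values for which accountName is interpreted.
ACCOUNT_NAME_MEANING = {
    1: "user name for integrated authentication",
    2: "user name for basic authentication",
    3: "X.509 certificate name",
}


def interpret_account_name(authentication_type, account_name):
    """Return (meaning, value), or None when accountName is absent or ignored."""
    if authentication_type not in range(6):
        raise ValueError("authenticationType MUST be 0 through 5")
    if account_name is not None and len(account_name) >= MAX_ACCOUNT_NAME_LENGTH:
        raise ValueError("accountName MUST be shorter than 256 characters")
    meaning = ACCOUNT_NAME_MEANING.get(authentication_type)
    if meaning is None or account_name is None:
        return None  # types 0, 4 and 5 ignore the element
    return (meaning, account_name)
```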

contentClass: Arbitrary metadata for the crawl rule. If present, the length MUST be less than 1024 characters.

suppressIndexing: MUST be one of the following values:

Value    Meaning
false    The protocol server MUST crawl any URLs that match the URL specified for the crawl rule’s path.
true     The protocol server MUST NOT crawl any URLs that match the URL specified for the crawl rule’s path.

followComplexUrls: Specifies the crawl behavior on the matched URLs with a query component. MUST be one of the following values:

Value    Meaning
false    Matching URLs with a query component MUST be discarded during the crawl.
true     Matching URLs with a query component MUST be followed during the crawl.

crawlAsHttp: Specifies whether to use the HTTP protocol to crawl matching links with the http: scheme. MUST be one of the following values:

Value    Meaning
false    The protocol server crawls matching URLs with http: scheme using any protocol available for the repository.
true     The protocol server crawls matching URLs with http: scheme using the HTTP protocol.

enabled: Specifies whether the crawl rule is enabled or disabled. Disabled crawl rules are ignored by the protocol server. MUST be one of the following values:

Value    Meaning
false    The crawl rule is disabled.
true     The crawl rule is enabled.

pluggableSecurityTrimmerId: The security trimmer identifier.

authUrl: If authenticationType is 4, authUrl contains the URL for the forms authentication type. If present, the length MUST be less than 2048 characters. The forms authentication type implies an HTTP GET or an HTTP POST to the authentication URL, as specified in [RFC2616], to obtain the authorization cookie. If authenticationType is set to 0, 1, 2, 3, or 5, this element MUST NOT be interpreted.

authData: If authenticationType is 4, this element represents the public authentication parameters for authUrl, according to the format of the HTTP form in authUrl. If authenticationType is set to 0, 1, 2, 3, or 5, this element MUST NOT be interpreted.

miscData: If authenticationType is set to 4 or to 5, this element represents a collection of error page URLs. When requests to retrieve content get redirected by item repositories during a crawl, these error page URLs are used to determine authentication errors. If authenticationType is set to 0, 1, 2, or 3, this element MUST NOT be interpreted.
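The rules for authUrl, authData, and miscData amount to a small table of which elements are interpreted for each authenticationType. The sketch below assumes a crawl rule held as a plain dictionary keyed by element name; the function name interpreted_auth_fields and that representation are hypothetical.

```python
MAX_AUTH_URL_LENGTH = 2048  # exclusive upper bound stated for authUrl

# Elements interpreted for each authenticationType, per the three descriptions above.
INTERPRETED_FIELDS = {
    0: (),
    1: (),
    2: (),
    3: (),
    4: ("authUrl", "authData", "miscData"),
    5: ("miscData",),
}


def interpreted_auth_fields(rule):
    """Return only the authUrl, authData and miscData values that apply to the rule."""
    auth_type = rule["authenticationType"]
    if auth_type not in INTERPRETED_FIELDS:
        raise ValueError("authenticationType MUST be 0 through 5")
    auth_url = rule.get("authUrl")
    if auth_url is not None and len(auth_url) >= MAX_AUTH_URL_LENGTH:
        raise ValueError("authUrl MUST be shorter than 2048 characters")
    # Elements that are present but not listed for this type are left uninterpreted.
    return {name: rule[name] for name in INTERPRETED_FIELDS[auth_type] if name in rule}
```

For example, a rule with authenticationType 5 that also carries an authUrl yields only its miscData entry, because the other elements are not interpreted for cookie-based authentication.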

accountLastModified: The latest time credentials were set or modified for this rule.