3.1.1.4 Content Source

10/13/2020

The portal content project contains a collection of zero or more content source objects. Each object represents a content source that can be used to start a crawl on the index server.

id: The unique identifier of the content source in the collection. Assigned by the protocol server when a new content source is added.

type: The content source type. This type is used by the crawler as a hint to determine which technology to use to crawl the repository pointed to by the start addresses. MUST be one of the following values:

Value	Meaning
0	Web sites
1	Sites
2	Lotus Notes database
3	File shares
4	Exchange public folders
5	Custom
6	Business Data Connectivity (BDC)
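
The table above can be modeled as an enumeration so that out-of-range values are rejected early. The following is a minimal sketch, not part of the protocol: the names `ContentSourceType` and `parse_content_source_type` are hypothetical, and only the numeric values and their meanings come from the table.

```python
from enum import IntEnum


class ContentSourceType(IntEnum):
    """Hypothetical member names; the numeric values come from the table above."""
    WEB_SITES = 0
    SITES = 1
    LOTUS_NOTES_DATABASE = 2
    FILE_SHARES = 3
    EXCHANGE_PUBLIC_FOLDERS = 4
    CUSTOM = 5
    BUSINESS_DATA_CONNECTIVITY = 6


def parse_content_source_type(value: int) -> ContentSourceType:
    """Return the matching member, or raise ValueError for a value not in the table."""
    try:
        return ContentSourceType(value)
    except ValueError:
        raise ValueError(f"unsupported content source type: {value}") from None
```

A caller could use `parse_content_source_type(3)` to obtain `ContentSourceType.FILE_SHARES`, while a value such as 7 raises `ValueError`.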

systemCreated: If true, the content source was created during the initial system configuration and cannot be deleted by the protocol client. Any content source added by the protocol client has systemCreated set to false.

name: The content source name. This is the label intended to be displayed in the user interface.

wssCrawlStyle: The type of crawl performed while crawling sites. If 0, the entire Web applications pointed to by the start addresses are crawled. If 1, only the specific sites pointed to by the start addresses are crawled, without enumerating all sites in the Web application.

metadata: The arbitrary metadata associated by the protocol client with the content source. The value of this property is ignored by the protocol server, but can be interpreted by the protocol client to associate arbitrary metadata with the collection of content sources.

followDirectories: If true, only links provided by the repository being crawled are followed during the crawl, and links discovered within items are discarded. If false, only links discovered within items are followed.

pageDepth: The maximum permitted depth of the URL space traversal, including traversal within a single site or across different sites. Whenever the index server follows a link during the crawl, the depth counter is incremented. The depth counter cannot increase beyond the pageDepth of the content source. For example, if the pageDepth is 1 and Page A links to Page B, which links to Pages C and D, then neither Page C nor Page D will be crawled, because following either link would make the depth counter exceed pageDepth.
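
The pageDepth example can be reproduced with a small breadth-first traversal. This is an illustrative sketch only, not the index server's algorithm. It assumes that start addresses sit at depth 0 (which is consistent with the example, where Page B is crawled at pageDepth 1), and the function name `crawl_within_page_depth` and the `links` dictionary are hypothetical.

```python
from collections import deque


def crawl_within_page_depth(links, start_addresses, page_depth):
    """Return a dict of reached pages and the depth counter value at each one.

    links: hypothetical dict mapping each page to the pages it links to.
    Assumption: start addresses are at depth 0.
    """
    depth_of = {address: 0 for address in start_addresses}
    queue = deque(start_addresses)
    while queue:
        page = queue.popleft()
        next_depth = depth_of[page] + 1  # following a link increments the counter
        if next_depth > page_depth:
            continue  # links from this page would exceed pageDepth
        for target in links.get(page, []):
            if target not in depth_of:  # skip pages already reached
                depth_of[target] = next_depth
                queue.append(target)
    return depth_of


# Page A links to Page B, which links to Pages C and D.
links = {"A": ["B"], "B": ["C", "D"]}
reached = crawl_within_page_depth(links, ["A"], page_depth=1)
```

With `page_depth=1`, `reached` contains only A and B; raising the limit to 2 would also admit C and D.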

siteDepth: The maximum permitted depth of the URL space traversal in terms of authority hops. This is analogous to pageDepth, but at the server domain level. A server domain hop is made when a link points to a URL in a different server domain. Whenever the index server follows a link during the crawl to a different host (or item repository server), the site depth counter is incremented. The site depth counter cannot exceed the siteDepth of the content source.
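
The site depth counter can be sketched in the same spirit. The example below assumes that two URLs belong to the same host when their hostnames match, which is a simplification; the protocol does not define how the index server compares hosts or item repository servers. The function names `next_site_depth` and `may_follow` are hypothetical.

```python
from urllib.parse import urlsplit


def next_site_depth(current_url, target_url, site_depth_counter):
    """Return the counter after following a link; it grows only on a host change."""
    current_host = urlsplit(current_url).hostname
    target_host = urlsplit(target_url).hostname
    if current_host != target_host:
        return site_depth_counter + 1  # hop to a different host
    return site_depth_counter


def may_follow(current_url, target_url, site_depth_counter, site_depth):
    """Return True if following the link keeps the counter within siteDepth."""
    return next_site_depth(current_url, target_url, site_depth_counter) <= site_depth
```

For instance, with a siteDepth of 0, a link from `http://a.example/x` to `http://a.example/y` may be followed, but a link to `http://b.example/` may not.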

startAddresses: A list of start address URLs. The first step of starting the crawl is to add the start address URLs to the crawl queue. The crawl then begins by following links from these start addresses.

fullCrawlTrigger: Defines the full crawl schedule. A crawl can be started either by an explicit request from the protocol client or automatically, at specified points in time, according to the schedule.

incCrawlTrigger: Defines the incremental crawl schedule.

crawlStatus: Identifies whether a crawl for this content source is idle, paused, stopped, or running. Also identifies the type of crawl (full crawl or incremental crawl).

crawlStarted: The timestamp of when the most recent crawl was started for this content source.

crawlCompleted: The timestamp of when the most recent crawl was finished for this content source.

errorCount: The number of items that the crawler attempted to crawl during the most recent crawl but did not succeed in crawling.

crawlSuccesses: The number of items successfully crawled during the most recent crawl.

throttleStart: This property is not interpreted by the protocol server. It can be set and retrieved by the protocol client.

throttleDuration: This property is not interpreted by the protocol server. It can be set and retrieved by the protocol client.
