1.3 Protocol Overview (Synopsis)

The purpose of this protocol is to support site crawling by an external index server. The purpose, techniques, and internal data structures used by an index server are beyond the scope of this protocol. However, this protocol does depend on the structure of the site content and the behavior of the protocol server.

This protocol supports an index server, or a similar client application, that follows the site crawling process to traverse a site whose content conforms to the site structure.

Site Structure

This protocol requires the site to present content in a hierarchical structure. It requires the protocol server to track the details about common content types such as pages, lists, list items, documents, and document libraries.

The index server or any other client can use the protocol’s methods to traverse the site structure, explore the list item fields, or retrieve document library items or list item attachments.
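As an illustration only, the hierarchy described above can be modeled with a few nested record types. The following is a minimal sketch, not the protocol's data model: the class names, the `is_document_library` flag, the `"Name"` field key, and the `retrievable_files` helper are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ListItem:
    # Field names and values are illustrative; real fields depend on the list.
    fields: Dict[str, str]
    attachments: List[str] = field(default_factory=list)


@dataclass
class SiteList:
    title: str
    # True when the list is a document library (hypothetical flag).
    is_document_library: bool = False
    items: List[ListItem] = field(default_factory=list)


@dataclass
class Site:
    url: str
    lists: List[SiteList] = field(default_factory=list)
    subsites: List["Site"] = field(default_factory=list)


def retrievable_files(site: Site) -> List[str]:
    """Collect document library items and list item attachments, depth-first."""
    found: List[str] = []
    for site_list in site.lists:
        for item in site_list.items:
            if site_list.is_document_library:
                # "Name" is an assumed key used only in this sketch.
                found.append(item.fields["Name"])
            found.extend(item.attachments)
    for child in site.subsites:
        found.extend(retrievable_files(child))
    return found
```

A client that holds such a tree can walk it to explore list item fields or to decide which documents and attachments to retrieve.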

Site Crawling Process

This protocol is designed to support an index server or similar client application that conducts a scan of the site content following the recommended site crawling process. This process is described in detail in section 3.1.4.

The site crawling process assumes that the index server is configured with the URL of the site to index and that the site supports this protocol.

First, the index server establishes the crawling context. Details of context establishment are discussed in section 3.1.4.

Next, the index server traverses the site. Traversal requires identifying all subsites, identifying all lists and document libraries, and scanning all list items and documents to examine their content. This protocol exposes many operations that support traversal, including GetSite, GetWeb, and GetListCollection. These operations are described in detail in section 3.1.4.
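The traversal order described above can be sketched as a breadth-first walk. This is a hedged illustration under stated assumptions: `FakeSiteClient` is an in-memory stand-in, and its methods `get_web`, `get_list_collection`, and `get_list_items` are hypothetical helpers that only echo the operation names; they do not reproduce the actual operation signatures, which are defined in section 3.1.4.

```python
from collections import deque


class FakeSiteClient:
    """In-memory stand-in for a protocol server; not the real wire API."""

    def __init__(self, webs):
        # webs maps a site URL to {"subsites": [urls], "lists": {name: [items]}}
        self._webs = webs

    def get_web(self, url):
        return self._webs[url]

    def get_list_collection(self, url):
        # Sorted so that the traversal order is deterministic.
        return sorted(self._webs[url]["lists"])

    def get_list_items(self, url, list_name):
        return list(self._webs[url]["lists"][list_name])


def crawl(client, root_url):
    """Visit every site, then every list in it, then every item in each list."""
    visited = []
    seen = {root_url}
    queue = deque([root_url])
    while queue:
        url = queue.popleft()
        web = client.get_web(url)
        for list_name in client.get_list_collection(url):
            for item in client.get_list_items(url, list_name):
                visited.append((url, list_name, item))
        # Queue subsites not seen before so each site is scanned once.
        for child in web["subsites"]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return visited
```

The `seen` set is a design choice for this sketch: it guarantees that each site is scanned once even if the stand-in data lists a subsite more than once.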

Note that the site can store documents in a variety of proprietary formats and in many languages. This protocol allows the index server to enumerate site objects and to retrieve the document content.

When the crawling process completes the traversal of all site content, it reaches the "complete" state.

The site content might change before the index server completes a traversal of the site content. In this case, the index might already be out of date with respect to the content.