Introduction to Protocol Handlers

With Microsoft Office SharePoint Portal Server 2003, you can add external content sources to the portal site to be included in the content crawled by Microsoft SharePoint Portal Server Search (SharePointPSSearch). Protocol handlers are software components running in a Filter Daemon process that implement the protocol for accessing a content source in its native format. This exposes the content source to SharePointPSSearch for crawling. The following figure shows the protocol handler architecture and the data flow during the crawl process.

Crawls are initiated within the search component's process for a portal site. The search component receives a URL for content that must be crawled from the portal site. The URL can be the start address for a content source, a link stored from a previous crawl, or a notification from a portal site. The search component checks the URL against the crawl restrictions set for this portal site.
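The restriction check described above can be pictured with a minimal sketch. The rule format (ordered pattern and include/exclude pairs, first match wins, default exclude) and the example URLs are assumptions made for illustration, not the actual SharePoint Portal Server rule model.

```python
from fnmatch import fnmatchcase

# Hypothetical crawl restrictions: (pattern, include) pairs checked in order.
RULES = [
    ("http://intranet.example.com/private/*", False),
    ("http://intranet.example.com/*", True),
]


def is_crawl_allowed(url, rules=RULES):
    """Return the include flag of the first matching rule; exclude by default."""
    normalized = url.lower()
    for pattern, include in rules:
        if fnmatchcase(normalized, pattern):
            return include
    return False
```

In this sketch, the more specific exclusion is listed first so that it takes effect before the broader inclusion rule.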

When the crawling of a content source starts, a crawler or robot thread in the search component gives the crawling request to the Filter Daemon. The robot thread allocates a Filter object from a pool. When the Filter object is allocated, it is associated with a Filter thread object. Each document being filtered corresponds to one Filter thread object in the Filter Daemon. The Filter Daemon runs in a separate process from the search component so that the Filter Daemon can be terminated if it encounters problems. The Filter Daemon process for SharePoint Portal Server 2003 is Mssdmn.exe. The Filter Daemon and the search component communicate by using pipes of shared memory.

The Filter thread object receives the URL for content to be filtered in addition to the last time the content was crawled. The Filter thread object determines and invokes the appropriate protocol handler for the URL item. The protocol handler creates a UrlAccessor object that will control the filtering of this item.

The URL items passed to the Filter thread object can either be the start address for a content source, or the URL of an item inside the content source. In process 1, as shown in the preceding figure, the search component provides the Filter thread object with the start address for a content source. The protocol handler produces an enumeration of items inside the content source. The search component acts on this enumeration of items and determines which items need to be filtered. It associates a URL with each item, and queues the items for further examination by the Filter Daemon.
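Process 1 might be sketched as follows. The in-memory listing stands in for a protocol handler's enumeration, and the rule used to decide which items need filtering (modified since the last crawl) is an assumption; the text above does not state the criterion the search component applies.

```python
from collections import deque


def enumerate_items(start_address, listing):
    """Stand-in for the protocol handler: yield (name, last_modified) pairs."""
    yield from listing.get(start_address, [])


def queue_changed_items(start_address, listing, last_crawl_time):
    """Associate a URL with each item that changed and queue it for filtering."""
    queue = deque()
    base = start_address.rstrip("/")
    for name, modified in enumerate_items(start_address, listing):
        # Assumed criterion: only items modified after the last crawl are queued.
        if modified > last_crawl_time:
            queue.append(f"{base}/{name}")
    return queue
```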

If the URL points to a specific item within the content source, data from the UrlAccessor object can follow one of two paths from the protocol handler to the Filter thread object. These paths are labeled 2a and 2b in the preceding figure.

In process 2a, the Filter thread object is issued a URL for an item in the content source. The protocol handler processes the data in one of two ways. It can pass the contents of the item in the stream it has open for the Filter thread object in the Filter Daemon. This happens through the BindToStream method on the UrlAccessor. In this case, the Filter thread object invokes an appropriate IFilter on the Stream object created for the document. Alternatively, the protocol handler returns the file name of the item that the URL points to. The Filter thread object uses the file name to access the file directly and chooses an appropriate IFilter.
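The two alternatives in process 2a can be modeled roughly as below. The Python classes and method names are hypothetical stand-ins for a UrlAccessor, and trying the stream first is an assumption about ordering rather than documented behavior.

```python
import io


class InMemoryAccessor:
    """Hypothetical accessor that hands the item content back as a stream."""

    def __init__(self, data):
        self._data = data

    def bind_to_stream(self):
        return io.BytesIO(self._data)

    def get_file_name(self):
        return None


class LocalFileAccessor:
    """Hypothetical accessor that only reports a file name for the item."""

    def __init__(self, path):
        self._path = path

    def bind_to_stream(self):
        return None

    def get_file_name(self):
        return self._path


def open_item(accessor):
    """Return ("stream", stream) or ("file", path), whichever the accessor offers."""
    stream = accessor.bind_to_stream()
    if stream is not None:
        return "stream", stream
    path = accessor.get_file_name()
    if path is not None:
        return "file", path
    raise LookupError("accessor offers neither a stream nor a file name")
```

In either case the caller would then pick an IFilter and apply it to the stream or to the file it opens from the returned path.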

In process 2b, the UrlAccessor uses the ProtocolHandlerSite object to query the Filter Daemon for the appropriate IFilter to use for the URL item. The choice of IFilter is based on the file extension, on a class ID that identifies the file's content in the registry, or on the MIME content type. The UrlAccessor object applies this IFilter to the URL item and returns the filtered data to the Filter thread object.
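A minimal sketch of that lookup follows. The lookup tables, filter names, and placeholder class ID are invented for illustration, and the precedence shown (extension, then class ID, then MIME type) is an assumption; the text above lists the three criteria without ranking them.

```python
import posixpath

# Hypothetical registrations; real ones would live in the registry.
FILTERS_BY_EXTENSION = {".doc": "OfficeFilter", ".htm": "HtmlFilter"}
FILTERS_BY_CLASS_ID = {"{HYPOTHETICAL-CLASS-ID}": "OfficeFilter"}
FILTERS_BY_MIME = {"text/html": "HtmlFilter", "text/plain": "PlainTextFilter"}


def choose_filter(file_name=None, class_id=None, mime_type=None):
    """Return the name of a registered filter, or None if nothing matches."""
    if file_name:
        extension = posixpath.splitext(file_name)[1].lower()
        if extension in FILTERS_BY_EXTENSION:
            return FILTERS_BY_EXTENSION[extension]
    if class_id in FILTERS_BY_CLASS_ID:
        return FILTERS_BY_CLASS_ID[class_id]
    if mime_type:
        return FILTERS_BY_MIME.get(mime_type.lower())
    return None
```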

After the Filter thread object has established a connection to the IFilter for the item being accessed, the filtering process is the same for both protocol handler data paths. The IFilter process enters a loop of reading from the URL item and producing filtered data that is returned to the search component process. The IFilter first extracts metadata that corresponds to properties that are marked retrievable in the SharePoint Portal Server XML Schema, such as title, file size, and last modified date. Then it breaks the item content into chunks of text.
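The order of output in the filtering loop (retrievable properties first, then chunks of text) can be illustrated with a small generator. The tuple shapes and the fixed chunk size are assumptions for the sketch; a real IFilter decides chunk boundaries itself.

```python
def filter_item(properties, text, retrievable, chunk_size=16):
    """Yield retrievable properties first, then the content in text chunks."""
    for name in retrievable:
        if name in properties:
            yield ("property", name, properties[name])
    # Fixed-size slicing is a simplification of how content is chunked.
    for start in range(0, len(text), chunk_size):
        yield ("text", text[start:start + chunk_size])
```

Properties that are not marked retrievable are skipped in this sketch.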

All errors that occur during this process are flagged in the Gatherer log. If an error occurs, the filtering process may be terminated. The search component re-queues, for later filtering, the items that were being filtered when the Filter Daemon was terminated.
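The re-queueing behavior can be sketched with a queue that tracks in-flight items. The class and method names are hypothetical, and placing re-queued items at the back of the queue is an assumption.

```python
from collections import deque


class CrawlQueue:
    """Hypothetical queue that remembers which items are being filtered."""

    def __init__(self, urls):
        self.pending = deque(urls)
        self.in_flight = []

    def take(self):
        """Hand the next URL to the Filter Daemon and mark it in flight."""
        url = self.pending.popleft()
        self.in_flight.append(url)
        return url

    def complete(self, url):
        """Record that filtering of this URL finished."""
        self.in_flight.remove(url)

    def daemon_terminated(self):
        """Put every unfinished item back in the queue for later filtering."""
        self.pending.extend(self.in_flight)
        self.in_flight.clear()
```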

Protocol Handler Initialization

The Filter Daemon starts and initializes all protocol handlers that are registered. After the CoCreate method is called for the protocol handler, initialization is performed by a call to the Init method of the ISearchProtocol interface.

Protocol Handler Selection

When a crawl for a content source is initiated, the search component determines the URL for the start address of the content source and passes this URL to the Filter Daemon. The Filter Daemon determines the appropriate protocol handler for the content source. Content source types are distinguished by their URL prefixes. For example, the prefix can be the protocol name of the URL: for the URL http://www.microsoft.com, the Filter Daemon uses the protocol handler associated with the HTTP protocol.
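Selection by protocol name can be sketched as a lookup keyed on the URL scheme. The handler names and the registration table are hypothetical; real protocol handlers are registered components, not strings.

```python
from urllib.parse import urlsplit

# Hypothetical registrations mapping a URL scheme to a handler name.
HANDLERS = {"http": "HttpHandler", "https": "HttpHandler", "file": "FileHandler"}


def select_handler(url, handlers=HANDLERS):
    """Return the handler registered for the URL's protocol name."""
    scheme = urlsplit(url).scheme.lower()
    if scheme not in handlers:
        raise LookupError(f"no protocol handler registered for {scheme!r}")
    return handlers[scheme]
```

With this table, select_handler("http://www.microsoft.com") resolves to the handler registered for HTTP.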

Protocol Handler Security

Security for the search component is implemented through Microsoft Windows Security Descriptors. Protocol handlers must, therefore, use domain groups, and SharePointPSSearch must exist in a trusted domain.

Protocol handlers receive security credentials for the content source they are accessing from the Filter Daemon. You specify these credentials, in addition to the authentication method, when you create and configure the content source for the portal site.

A default access account is configured for SharePointPSSearch, and the protocol handler runs as this account if no access account is configured for the content source.
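The fallback to the default access account amounts to a simple rule, sketched here. The Account type and the account names are placeholders invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Account:
    """Hypothetical holder for an access account name."""
    user: str


# Placeholder for the default access account configured for search.
DEFAULT_ACCESS_ACCOUNT = Account("EXAMPLE\\svc_search")


def effective_account(content_source_account: Optional[Account]) -> Account:
    """Use the content source's account if one is configured, else the default."""
    if content_source_account is not None:
        return content_source_account
    return DEFAULT_ACCESS_ACCOUNT
```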

Types of Protocol Handlers

SharePoint Portal Server provides support for hierarchical and link-based protocol handlers. Hierarchical protocol handlers work with structured content sources, such as file shares, which include structures such as directories or folders that must be traversed. Link-based protocol handlers work with content sources such as Web sites, where links within the content indicate how the source is traversed.
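The difference between the two traversal styles can be sketched as follows. Both functions operate on in-memory stand-ins (a nested dictionary for folders, a link table for pages), which are assumptions made to keep the example self-contained.

```python
from collections import deque


def crawl_hierarchy(folder, path=""):
    """Hierarchical style: walk folders recursively and yield item paths."""
    for name, child in folder.items():
        full_path = f"{path}/{name}"
        if isinstance(child, dict):
            # A nested dictionary models a subfolder that must be traversed.
            yield from crawl_hierarchy(child, full_path)
        else:
            yield full_path


def crawl_links(start, links):
    """Link-based style: follow links breadth-first, visiting each page once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order
```

The visited set in the link-based walk matters because links can form cycles, whereas a folder tree has none.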