Content Crawling and Search Overview
SharePoint Portal Server provides extensive and extensible content crawling and search features that support full-text searching and a query grammar based on Structured Query Language (SQL).
The SharePoint Portal Server Search service can crawl content and its associated properties stored in internal and external Web sites, local and network file systems, Web Storage Systems, Microsoft Exchange 5.5 and Exchange 2000 Server, other SharePoint Portal Server computers, and Lotus Notes databases.
The SharePoint Portal Server Search service and extended SQL query language support a broad range of simple and complex queries over multiple document sources, and can mix property-based filtering with full-text, linguistically-enabled content matching. Search results from these content sources are merged together.
Figure 5: SharePoint Portal Server Content Crawling and Search Architecture
The preceding figure illustrates the components of the SharePoint Portal Server Search architecture. The following list describes each component.
- Search Engine. Component of the Search service that runs queries written in the SharePoint Portal Server extended SQL syntax against the full-text index.
- Index Engine. Component of the Search service that processes chunks of text and properties filtered from content sources, and determines which properties are written to the full-text index.
- Gatherer. Component of the Search service that manages the content crawling process and has rules that determine what content is crawled.
- Wordbreakers. Components shared by the Search and Index engines that break up compound words and phrases.
- Stemmers. Components shared by the Search and Index engines that generate inflected forms for a word.
- Filter Daemon. Component that handles requests from the Gatherer. Uses protocol handlers to access content sources, and IFilters to filter files. Provides Gatherer with a stream of data containing filtered chunks and properties.
- Protocol Handlers. Open content sources in their native protocol and expose documents and other items to be filtered.
- IFilters. Open documents and other content source items in their native format and filter into chunks of text and properties.
- Content sources. Collection of data the Search service must crawl, and specific rules for crawling items in that content source. Items in content sources are identified by URLs. What distinguishes different types of content sources is the protocol portion of the URL.
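As a rough illustration of how the protocol portion of a start address could select a protocol handler, the following Python sketch maps URL schemes to handlers. The handler names, the `notes` scheme, and the `pick_handler` function are hypothetical and chosen for illustration only; they are not SharePoint Portal Server identifiers.

```python
from urllib.parse import urlsplit

# Hypothetical handler names keyed by URL scheme (assumptions, not product names).
HANDLERS = {
    "http": "web-handler",
    "https": "web-handler",
    "file": "file-system-handler",
    "notes": "lotus-notes-handler",
}


def pick_handler(start_address: str) -> str:
    """Return the handler registered for the protocol portion of the URL."""
    scheme = urlsplit(start_address).scheme.lower()
    if scheme not in HANDLERS:
        # Without a matching protocol handler the content source cannot be crawled.
        raise LookupError(f"no protocol handler registered for {scheme!r}")
    return HANDLERS[scheme]
```

A start address whose scheme has no registered handler raises `LookupError`, which mirrors the requirement that every content source have a protocol handler able to read its protocol.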
Each SharePoint Portal Server workspace has an associated Gatherer process and its own full-text index, called the workspace catalog. Each Gatherer process has its own set of parameters, restrictions, and plug-in components, and keeps its own logs and performance statistics. The content crawling process is started by a manual or scheduled instruction to crawl content, or by a notification from a file store (for example, a SharePoint Portal Server document store or a file share using NTFS) that tells the Search service that content has changed. The Gatherer is given the URL of the start address for a content source, and a crawl is initiated.
The Gatherer uses a pipe of shared memory to request that the Filter Daemon begin filtering the content source. For the crawl process to be successful, the content source must have an associated protocol handler that can read its protocol. The Filter Daemon invokes the appropriate protocol handler for the content source based on the start address provided by the Gatherer. The Filter Daemon uses protocol handlers and IFilters to extract and filter individual items from the content source. Appropriate IFilters for each document are applied, and the Filter Daemon passes the extracted text and metadata to the Gatherer through the pipe.
The Gatherer runs the data through a series of internal components, such as the Persistent Query Service (PQS) component, which matches crawled documents against subscriptions stored in the system, before relaying the data to the Index engine. At this time, the Gatherer's index component saves document properties to a property store separate from the SharePoint Portal Server document store. The property store consists of a table of properties and their values. Properties in this store can be retrieved and sorted. In addition, simple queries against properties are supported by the store. Each row in the table corresponds to a separate document in the full-text index. The index itself can be used for content queries. The property store also maintains and enforces document-level security information that is gathered when a document is crawled.
The data is then passed to the Index engine. The Index engine uses wordbreakers and stemmers to further process the text and properties received from the Gatherer. The wordbreaker component is used to further break the text into words and phrases. The stemming component is used to generate inflected forms of a given word. The Index engine also removes noise words and creates inverted indexes for full-text searching. These indexes are saved to disk.
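The indexing steps described above (word breaking, noise-word removal, stemming, and building an inverted index) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Index engine's implementation: the noise-word list is illustrative, and the `stem` function reduces a word to a crude stem instead of generating inflected forms, which is a deliberate simplification.

```python
import re
from collections import defaultdict

# Illustrative noise-word list; real lists are language-specific.
NOISE_WORDS = {"a", "an", "and", "of", "the", "to"}


def break_words(text):
    """Toy wordbreaker: lowercase the text and split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(word):
    """Toy stemmer: strip a few English suffixes (a simplification)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_inverted_index(documents):
    """Map each stemmed, non-noise word to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in break_words(text):
            if word not in NOISE_WORDS:
                index[stem(word)].add(url)
    return index
```

Because both "crawling" and "crawled" reduce to the same stem here, a document containing either form ends up under one index entry.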
Search Query Execution
Users can search for content on the dashboard site. When a search query is executed, the Search engine passes the query through a language-specific wordbreaker. If there is no wordbreaker for the query language, the neutral wordbreaker is used. After word breaking, the resulting words are passed through a stemmer so that language-specific inflected forms of a given word are generated. Using a wordbreaker and stemmer in both the crawling and query processes makes search more effective, because more relevant alternatives to a user's query phrasing are generated. When a property value query is executed, the index is checked first to get a list of possible matches. The properties for the matching documents are then loaded from the property store, and the properties in the query are checked again to confirm the match. The result of the query is a list of all matching results, ordered according to their relevance to the query words. If the user does not have permission to read a matching document, the Search service filters that document out of the returned list.
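The query flow described above (candidate lookup in the index, property re-check against the property store, security filtering, and relevance ordering) might be modeled as follows. This is a minimal sketch: the data layout, the `run_query` function, and the relevance score (a count of matched query words) are all assumptions made for illustration and do not reflect how the Search service actually ranks results.

```python
def run_query(words, wanted_props, user, index, property_store):
    """Return URLs of matching documents the user may read, most relevant first."""
    # Step 1: consult the index for candidates; score = number of matched words.
    scores = {}
    for word in words:
        for url in index.get(word, ()):
            scores[url] = scores.get(url, 0) + 1

    results = []
    for url, score in scores.items():
        row = property_store[url]
        # Step 2: load the properties and re-check them against the query.
        if any(row["props"].get(k) != v for k, v in wanted_props.items()):
            continue
        # Step 3: drop documents the user has no permission to read.
        if user not in row["readers"]:
            continue
        results.append((score, url))

    # Step 4: order by descending score, then by URL for a stable result.
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in results]
```

Here `index` maps words to sets of URLs, and each `property_store` row holds a `props` dictionary and a `readers` set; both shapes are hypothetical.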
Writing Search Queries
The SharePoint Portal Server Search engine can be accessed using Microsoft ActiveX Data Objects (ADO), through the OLE DB Provider for Internet Publishing, or using the XMLHTTP COM object and the WebDAV/DASL protocol.
When developing server-side applications with XMLHTTP, you must use the ServerXMLHTTP object. Using the XMLHTTP object in server-side code will jeopardize the stability of your server.
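For readers working outside COM, the following Python sketch shows one way the XML body of a WebDAV/DASL search request could be assembled. It assumes the request wraps the SQL text in a `searchrequest` element containing an `sql` element, both in the `DAV:` namespace; the `build_search_body` function is hypothetical, and the query text must follow the extended SQL grammar documented for the product. The sketch only builds the body and does not send it.

```python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"  # WebDAV namespace URI


def build_search_body(sql: str) -> bytes:
    """Wrap a SQL query string in a DAV: searchrequest document (assumed shape)."""
    ET.register_namespace("a", DAV_NS)
    root = ET.Element(f"{{{DAV_NS}}}searchrequest")
    # ElementTree escapes &, < and > in the query text automatically.
    ET.SubElement(root, f"{{{DAV_NS}}}sql").text = sql
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

The resulting bytes would be sent as the body of an HTTP `SEARCH` request with an XML content type, for example through `http.client.HTTPConnection.request("SEARCH", ...)`.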
For more information on writing Search Applications, see Searching SharePoint Portal Server.
Extending Content Indexing and Search
For file types and formats that SharePoint Portal Server cannot crawl out of the box, you can create custom indexing filters (IFilters). SharePoint Portal Server provides enhanced registration and loading methods for IFilters.
If a content source must be accessed using a network or access protocol that is not already supported, SharePoint Portal Server provides extensibility interfaces that allow you to create custom protocol handlers.