Indexing with Microsoft Index Server

 

Krishna Nareddy
Windows NT Query Team
Microsoft Corporation

January 30, 1998

Introduction

This is the third of a series of articles to help you understand and effectively deploy Microsoft's search solutions on your Web sites and intranets. The first article, "Anatomy of a Search Solution," helped you understand what to expect of a search solution to meet your site's needs. The second article, "Introduction to Microsoft Index Server," introduced you to the features and capabilities of Index Server. This article is designed to help you understand, manage, and fine-tune the indexer. It helps to have Microsoft® Index Server documentation handy for quick cross-reference.

An Index Server catalog encapsulates all the aspects of indexing infrastructure. We will start with the catalog to understand the infrastructure. Next we will delve into the indexing process. As you examine each phase of the process, you will be introduced to the details needed to control and customize it. Then you will be introduced to the tools that help you administer Index Server and monitor its status and performance. When you are done with this article you will have a better understanding of the indexing process, how to tune its performance, and how to diagnose and resolve common problems.

Information contained in this article applies to Index Server 2.0 shipped with Microsoft Windows NT® Option Pack 4.0. Most of it also applies to Index Server 3.0 scheduled to ship with Windows 2000, but differences do exist in their behavior. Those differences will be noted in a later update to this article after a final release of Index Server 3.0.

The Catalog

An Index Server catalog encapsulates all the details needed to access and index your document corpus (collection of documents) along with the index of the corpus. It consists of a set of source directories that point to the corpus, a content index to store the compiled full text index, a property cache to store document properties, and a set of control attributes used to fine-tune the indexing process. All the information about a catalog is stored in the registry under the following key. All registry parameters, unless otherwise specified, will be associated with this key. Many parameters can be overridden by a catalog under the Catalogs\<catalog> subkey of the following ContentIndex key:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex

Upon successful installation, Index Server 2.0 creates a default catalog named "Web".

Source Directories

This is the collection of directories whose contents are to be included (or excluded) from the corpus. Directories can be physical paths on a local disk or remote paths following the Universal Naming Convention (UNC). At query time, directories may be used to restrict the scope of a query and help complete a full path name (or UNC name) for matching documents so the document browser can locate them. Because directories are used to define the scope of an Index Server query, Index Server documents often use the term scopes to mean directories.

All the directories associated with an index are listed under the Catalogs\<catalog>\Scopes subkey of the ContentIndex key. Each value name under the Scopes subkey is the directory to be indexed, and its data takes the form fixup,domain\user,flags. A fixup is a path prefix that is substituted for the directory when a remote client sends a query. The domain\user is used to log on to the directory when Index Server indexes files in a remote directory. The flags field indicates whether the directory should be included or excluded and whether it is a virtual or a physical directory. Set the flags field to a combination of the values listed below. For example, if a physical directory should be indexed, the flags field should be set to 5 (0x1 combined with 0x4).

0 = directory is not indexed (excluded)

1 = directory is indexed (included)

2 = directory is a virtual root

4 = directory is a physical directory
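The bit arithmetic behind a flags value such as 5 can be sketched as follows. The constant names are illustrative, not part of Index Server; only the numeric values come from the list above.

```python
# Illustrative helper for combining the Scopes flags bits listed above.
INDEXED = 0x1       # directory is indexed (included)
VIRTUAL_ROOT = 0x2  # directory is a virtual root
PHYSICAL = 0x4      # directory is a physical directory

def scope_flags(indexed: bool, virtual_root: bool, physical: bool) -> int:
    """Combine the individual bits into the flags field of a Scopes value."""
    flags = 0
    if indexed:
        flags |= INDEXED
    if virtual_root:
        flags |= VIRTUAL_ROOT
    if physical:
        flags |= PHYSICAL
    return flags

# An indexed physical directory: 0x1 combined with 0x4 gives 5, as above.
print(scope_flags(indexed=True, virtual_root=False, physical=True))  # 5
```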

Property Cache

The property cache is an on-disk store optimized to speed up retrieval of frequently accessed properties. The properties stored in the cache fall into the following categories:

  • Index Server-defined properties for internal use only. You have no direct control over these properties.
  • Index Server-defined frequently used properties such as Path and Filename. These properties are attributes of the document file extracted during the document-gathering process.
  • Index Server-defined properties that are extracted from a document or created from a document during the filtering process. Examples of such properties are DocTitle and DocAuthor (for HTML and Microsoft Office documents).
  • User-defined properties that are extracted from a document. Having custom properties in a document is not sufficient to retrieve them in response to queries. Custom properties of interest should be added to the property cache so they can be retrieved at result fetch time. The only custom properties that can be fetched directly from a document are OLE properties, which Index Server can extract without using a document filter. In the interest of efficiency, however, you should consider caching OLE properties along with non-OLE custom properties. Details on how to view the complete list of available properties and modify the set of cached properties are presented below in the section on monitoring status and performance.

Because it contains properties from each indexed document, the property cache is a fairly large physical entity comparable in size to the content index. It is sufficiently large that it usually cannot be loaded into main memory in its entirety. So parts of it have to be paged into memory on demand. Each part is 64 kilobytes in size. You can control the maximum number of such parts that can be simultaneously loaded into memory through the PropertyStoreMappedCache registry parameter. The more parts you can load, the better the performance. Of course, you are paying for that with physical memory (RAM).

The property cache is modified every time a document is added, deleted, or modified. All modifications happen on the parts that are loaded into memory and the property store will remain dirty until these parts are flushed to disk. If Index Server is terminated abruptly, it will be unable to flush the property store to disk. When Index Server is restarted, it may find a property cache that is inconsistent with the content index. If that happens, the cache will be restored to a last known good state. The information needed to restore it is stored in a property cache backup file. The size of the backup file partly determines how often the property cache pages are flushed to disk. The larger the backup file, the less often the property cache is flushed to disk. The less often the property cache is flushed, the faster indexing can proceed. The size of the backup file is measured in Operating System (OS) pages and can be controlled through the PropertyStoreBackupSize registry parameter. The OS page size depends on the processor architecture and is defined by Windows NT. Since OS page size differs between processors, the same backup size parameter causes creation of files of different sizes on different processors.
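Because PropertyStoreBackupSize is expressed in OS pages, the resulting backup file size is simple arithmetic. The page sizes below are typical examples for illustration, not an exhaustive list:

```python
# Illustrative arithmetic only: PropertyStoreBackupSize is measured in OS
# pages, so the same registry value yields different file sizes on
# processors with different page sizes.
def backup_file_bytes(backup_size_pages: int, os_page_bytes: int) -> int:
    """Size in bytes of the property cache backup file."""
    return backup_size_pages * os_page_bytes

# The same parameter value on a 4 KB-page processor vs. an 8 KB-page one:
print(backup_file_bytes(1024, 4096))  # 4194304 bytes
print(backup_file_bytes(1024, 8192))  # 8388608 bytes
```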

Content Index

The content index contains all the full-text information extracted from the documents, which is compiled for efficient matching at query time. It is distributed among several files and placed in a special directory, catalog.wci. This directory cannot be split among multiple drives and should exist on a fixed (that is, not removable) local drive. The location of this directory should be specified when the catalog is defined. This will cause the full directory path to be created. If you choose to specify the location by directly modifying the location parameter under the Catalogs\<catalog> registry subkey, make sure that the specified directory contains the catalog.wci directory at the lowest level. The catalog.wci directory should not contain any directories beneath it because they will not be indexed. More importantly, directories created under catalog.wci could be deleted when the catalog is deleted through an administrative tool.

The content index, in one form or another, contains a complete summary of your corpus. Anyone with access to that directory may be able to glean bits and pieces of information from the index files and can potentially reconstruct documents that are inaccessible to them through Windows NT file access mechanisms. Protect your catalog.wci directory with appropriate security permissions to prevent such abuse.

Control Attributes

Index Server supports the creation and use of multiple catalogs. While each catalog is different, they do share many common control attributes. Duplicating this commonality is wasteful and error-prone. Therefore, all control attributes that affect the operation of all Index Server catalogs are available in a central location. Catalogs that choose to differ in certain respects may do so by duplicating attributes of specific interest. For example, if a catalog chooses not to store a summary of each document, it may set the relevant attribute in its sphere of influence. When Index Server needs to know if a given catalog supports document summaries, it first checks with the catalog. If that attribute is not explicitly available from the catalog, it uses the global attribute value. The global attribute values are associated with the ContentIndex key while the catalog specific overrides are associated with the Catalogs\<catalog> subkey under the ContentIndex key.

Control attributes can be classified into the following groups. Only indexing-related registry parameters are grouped here for convenience. Index Server documentation provides a detailed description of these registry parameters along with default values and min/max range where applicable.

  • Filtering-related parameters control various aspects of the filtering process. They are DaemonResponseTimeout, FilterContents, FilterDelayInterval, FilterDirectories, FilterFilesWithUnknownExtensions, FilterRemainingThreshold, FilterRetries, FilterRetryInterval, MaxFilesizeFiltered, and MaxFilesizeMultiplier.
  • Language-related parameters list the resources specific to each installed language. The InstalledLangs registry parameter lists the set of languages installed. Each string in the InstalledLangs value names a subkey below the ContentIndex\Language key. Beneath each language key, the available parameters are ISAPIDefaultErrorFile, ISAPIHTXErrorFile, ISAPIIDQErrorFile, ISAPIRestrictionErrorFile, Locale, NoiseFile, StemmerClass, and WBreakerClass.
  • Index Merge-related parameters control the process that builds the master index. They are MasterMergeCheckpointInterval, MasterMergeTime, MaxFreshCount, MaxIdealIndexes, MaxIndexes, MaxMergeInterval, MaxWordlistSize, MinDiskFreeForceMerge, MinMergeIdleTime, MinSizeMergeWordlists, and MinWordlistMemory.
  • Property Cache-related parameters control the memory available to the cache and the frequency of commits to disk. They are PropertyStoreBackupSize and PropertyStoreMappedCache.
  • CPU management parameters control the amount of CPU available to perform specific tasks. They are ThreadClassFilter, ThreadPriorityFilter, and ThreadPriorityMerge.
  • Miscellaneous parameters are indexing-related parameters that don't fall in any of the above categories. They are EventLogFlags, GenerateCharacterization, IsapiDefaultCatalogDirectory, IsIndexingNNTPSvc, IsIndexingW3Svc, MaxCharacterization, NNTPSvcInstance, and W3SvcInstance.

The Indexing Process

An enumeration mechanism identifies all the indexable files in the included directories and appends them to a queue. A document filter opens each queued file and emits properties and content of the document contained therein. The stream of text emitted by the filter is fed to a word breaker, which recognizes features such as words and numbers contained in the stream. Features that survive the stop list (noise word list) are eventually compiled into a master index that is used to resolve queries.

The creation of the master index is a multistage process in which the words extracted from a document progressively move from temporary in-memory word lists to an intermediate persistent shadow index and eventually to a permanent master index designed to efficiently resolve queries. This multistage process allows for instant availability of filtered documents to the query processor as they gradually graduate to the permanent master index. The collection of the word lists, the shadow indexes, and the master index is referred to as the content index. The catalyst that converts the intermediate data structures to a final form by combining several source indexes into a target index is called merging. It is a time-intensive and disk I/O-intensive process, but is necessary because the resulting target is more efficient than the sources it replaces. Index Server provides several ways of controlling the merging process. More about that later.
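The merging idea can be modeled in miniature. This toy sketch assumes each index is just a mapping from word to the set of documents containing it; the real on-disk formats are far more elaborate, and the names here are illustrative only:

```python
# Toy model of merging: combine several source indexes into one target
# index, as happens in shadow merges (word lists -> shadow index) and
# master merges (shadow indexes + old master -> new master).
from collections import defaultdict

def merge(*indexes):
    """Combine several source indexes into a single target index."""
    target = defaultdict(set)
    for index in indexes:
        for word, docs in index.items():
            target[word] |= docs
    return dict(target)

word_list_1 = {"index": {"a.htm"}, "server": {"a.htm"}}
word_list_2 = {"index": {"b.doc"}, "catalog": {"b.doc"}}

shadow = merge(word_list_1, word_list_2)       # shadow merge of word lists
master = merge(shadow, {"query": {"c.txt"}})   # master merge with old data
print(sorted(master["index"]))  # ['a.htm', 'b.doc']
```

The target is more efficient to query than the sources it replaces because one consolidated structure is consulted instead of many small ones.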

Besides full text content, filters also extract properties from documents. These properties can be stored in the property cache, which is optimized for efficient access. Index Server uses the property cache and the content index to resolve a query and retrieves the requested properties of the matching documents from the property store.

In this section we will detail and describe how to control each step of the indexing process.

Document Gathering

Index Server gathers documents for indexing through scanning and notifications. Scanning is the process of recursively walking through all the included directories to determine which documents should be indexed. Windows NT sends notifications whenever files under its control are modified. Whenever possible, Index Server relies on notifications because that mechanism is more efficient than an explicit scan.

Index Server performs two types of scanning—full scanning and incremental scanning. A full scan takes complete inventory of all the documents and is performed when the directory is first added. The only other time a full scan is performed is as part of recovery from a serious failure.

Index Server will not be able to track changes to documents when it is shut down. When it is restarted, it needs to know what documents were modified when it was inactive, so it can update its index. An incremental scan provides that capability and is capable of detecting all documents that need to be filtered and indexed again. On startup, Index Server performs an incremental scan on all the directories. An incremental scan may also be performed if Index Server loses change notifications. This can happen if the update rate of documents is very high and the buffer used to get change notifications from Microsoft Windows NT overflows.

You can force either a full scan or an incremental scan on any of the indexed directories. You should force a full scan after installing a new filter, removing a filter, or repairing a filter's registration information. You may also force a scan for any other reason, but be warned that a full scan results in all the files in a directory being re-indexed, which can take a long time. Details on how to initiate a scan are presented in the section on monitoring status and performance.

During normal operation of Index Server, all changes to the documents in the directories are automatically tracked if the indexed directories are on computers running Microsoft Windows NT. Recall that a directory can point to a network directory. Those network drives may be running under systems such as Novell NetWare or Microsoft Windows® 95 file server, which do not support change notification. To handle such directories, Index Server does periodic scans of that share. You can control the frequency of these periodic scans through the registry parameter ForcedNetPathScanInterval.

Document Filtering

Documents are composed in a wide variety of formats. Index Server cannot possibly be aware of all document formats or restrict itself to a few well-known formats. So the indexing model allows for pluggable programs called filters to extract content from a wide variety of formats. The process of extracting textual content from a document is called filtering. The default filters shipped with Index Server can handle Microsoft Office format documents (Microsoft Excel, Microsoft Word, and Microsoft PowerPoint®), HTML 3.0 or lower, text documents, and binary files. Additional filters may be added, or existing filters replaced through the registry. Index Server documentation has detailed instructions on how to modify the registry to change filter DLLs. Use the attached tool, FILTREG (no parameters are needed), to get a list of filters and associated extensions as listed in the registry. Use the attached tool, FILTDUMP (-? for usage) to see what the filter (associated with that file extension) would report to Index Server.

When Index Server is ready to filter a file, it can determine the file format by examining the file extension. The registry contains associations between file extensions and filter DLLs. Index Server uses this association to determine which DLL should be used for a given file. Not all file extensions can possibly be listed in the registry. How can Index Server deal with files with unknown extensions? You can control that through the registry parameter FilterFilesWithUnknownExtensions. When set, this parameter causes Index Server to filter the document with the default plain text filter.
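The dispatch just described can be sketched as a table lookup with a fallback. The table entries and DLL names below are illustrative; the real associations live in the registry:

```python
# Sketch of filter selection: look up the file extension in a
# registry-like table of filter DLLs, falling back to the plain text
# filter when FilterFilesWithUnknownExtensions is set. Table contents
# are illustrative, not Index Server's actual registry data.
FILTER_TABLE = {".htm": "html filter", ".doc": "office filter", ".txt": "text filter"}

def pick_filter(filename: str, filter_unknown: bool):
    ext = filename[filename.rfind("."):].lower() if "." in filename else ""
    if ext in FILTER_TABLE:
        return FILTER_TABLE[ext]
    # Unknown extension: use the default plain text filter, or skip the file.
    return "text filter" if filter_unknown else None

print(pick_filter("readme.htm", filter_unknown=False))  # html filter
print(pick_filter("data.xyz", filter_unknown=True))     # text filter
```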

Your corpus is likely to have several "binary" files. In the context of Index Server filtering, a binary file is one that contains no useful textual information to be indexed. You can identify such files and cause them to be filtered by a dummy filter that ignores the contents. It only extracts file attributes such as size and filename, so you still can find the binary file by searching for its attributes. You can identify all such binary files in your corpus and declare their extensions in the registry. For example, to associate the extension ".nul" with the binary file type, add a .nul subkey with a default string value of BinaryFile as shown below.

HKEY_LOCAL_MACHINE\Software\Classes\.nul = REG_SZ BinaryFile

If a filter is unable to process a file, it makes several attempts to filter it. The registry parameter FilterRetries controls the maximum number of filtering attempts. If a file can't be filtered within those attempts, it will be considered unfiltered. Files may also be left unfiltered because they are corrupt. When a filter detects corruption in a file, it causes an event to be written to the event log. You can open the Index Server administration page and issue a query for unfiltered pages. This query lists all the unfiltered files. Note that files that are password protected cannot be filtered. You can control the generation of filtering-related event log messages using the EventLogFlags value in the registry. Index Server documentation has details on how to configure this parameter.
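The retry policy amounts to a bounded loop. This is a hedged sketch of the behavior described above, not Index Server's actual implementation; the callable and its failure pattern are made up for illustration:

```python
# Sketch of the FilterRetries policy: attempt to filter a file up to
# filter_retries times; if every attempt fails, the file is left
# unfiltered (and, in Index Server, an event is logged).
def filter_with_retries(try_filter, path: str, filter_retries: int) -> bool:
    """try_filter is any callable returning True on success, False on failure."""
    for _ in range(filter_retries):
        if try_filter(path):
            return True   # filtered successfully
    return False          # considered unfiltered

attempts = []
def flaky(path):          # hypothetical filter that fails twice, then succeeds
    attempts.append(path)
    return len(attempts) >= 3

print(filter_with_retries(flaky, "report.doc", filter_retries=4))  # True
```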

Index Server employs a child process, CiDaemon, to filter documents. This process boundary protects the Index Server process, CiSvc, from an error-prone or malicious filter DLL that could otherwise take down the process that loads it. Index Server forwards a list of files to filter to the child process. The filter process filters those files and forwards the contents to the Index Server process. If the filtering process terminates for any reason, the parent process, CiSvc, automatically restarts it.

Index Server also protects itself against malicious filters by discontinuing filtering of a document that emits too much data compared to its file size. How much data is too much? You can control that through the registry parameter MaxFilesizeMultiplier.
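The safeguard is a simple proportionality check. A minimal sketch, assuming the limit is expressed as a multiple of the on-disk file size as described above:

```python
# Sketch of the MaxFilesizeMultiplier safeguard: filtering is abandoned
# once a document emits more data than the multiplier times its file size.
def exceeds_output_limit(file_size: int, bytes_emitted: int,
                         max_multiplier: int) -> bool:
    return bytes_emitted > file_size * max_multiplier

print(exceeds_output_limit(1000, 3500, 4))  # False: within 4x the file size
print(exceeds_output_limit(1000, 4500, 4))  # True: filtering would stop
```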

You can control the pace at which filtering proceeds using the registry parameters ThreadClassFilter and ThreadPriorityFilter.

Lexical Analysis

Text extracted from a document can be in any language. The languages supported by Index Server 2.0 are English, Chinese, French, German, Korean, Spanish, Italian, Dutch, Swedish, and Japanese. For each of the supported languages, Index Server provides all the tools discussed below. You may install only those languages you are interested in. Index Server uses locale information to identify the language in which the document is written and chooses lexical tools appropriate for the language. By default, the locale of a document is the locale of the server where the document resides. Individual HTML files may override the default locale using the MS.Locale meta tag. Index Server documentation has more details on this meta tag. Other file formats may have their own way of encoding the language. Index Server relies on the filter to detect and report the correct language.

The text extracted from a document is processed by a word breaker, which identifies words from the stream of text. Unfortunately, Index Server 2.0 does not allow you to plug in your own word breaker. If it did, you could control the words, phrases, numbers, and other features recognized by the word breaker.

Documents generally contain several frequently occurring words that are not of much use in discriminating one document from the other. The whole idea behind specifying specific words in a query is to separate documents that contain those words (and therefore are of potential interest to the user) from documents that don't. If a frequently used word such as "this" exists in a query, it is likely to match all the documents in the corpus. Therefore, "this" has little discriminatory value and is considered to be a "noise" word. Most search solutions allow you to eliminate such noise words from the index. Lists of noise words are also called stop lists because they stop noise words from seeping into the index. But what is a set of acceptable noise words? You should be able to define that based on your user's needs and the subject domain of the corpus. For example, a site containing C++ code files would probably not want to place the word "this" in the stop list because it has a special meaning in the domain of C++ programming. If you are not sure whether a given word ought to be a noise word, you may want to err on the side of caution and not include it in the stop list.

A judicious selection of noise words improves the quality of the retrieved document set, thereby increasing user satisfaction with your search solution. Because noise words typically occur frequently, eliminating them from the index significantly reduces index size. A smaller index increases Index Server performance. It is important to note that this increased performance should only be viewed as a desirable side effect of a stop list. Your user's experience with your search solution should be the primary goal.

Elimination of noise words occurs only when a file is filtered. If you change your stop list when an index is already built, it will only affect documents filtered after Index Server has been restarted. You will have to rescan all your directories to benefit fully from the modified stop list.

Index Server keeps all the noise words in a file named by the NoiseFile registry parameter under Language\<language>, a subkey of the ContentIndex key. You can modify the file with any text editor; the appropriate word breaker will process it and extract the noise words. The default list provided with Index Server is a conservatively chosen list of commonly occurring words. Each language has a different set of common words, so a separate file exists for each supported language.
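The effect of a stop list at filtering time can be illustrated in a few lines. The word breaker here is a naive whitespace split and the stop list is made up; real word breakers are language-specific:

```python
# Toy illustration of stop-list elimination: words found in the noise
# list never reach the index.
NOISE_WORDS = {"a", "an", "the", "this", "of"}   # illustrative stop list

def index_terms(text: str, noise=NOISE_WORDS):
    """Return the words that would survive the stop list."""
    return [w for w in text.lower().split() if w not in noise]

print(index_terms("The anatomy of a search solution"))
# ['anatomy', 'search', 'solution']
```

Note that on a C++ site, removing "this" from NOISE_WORDS would let queries for that keyword match again, which is exactly the domain-specific judgment call discussed above.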

Creation of the Word Lists

As soon as a document is filtered and processed by a word breaker, the resulting words are stored in a word list. Word lists are temporary, in-memory indexes used to cache data for a small number of documents. At any given time, there can be several word lists in memory. Word lists let you engage in a classic memory vs. speed tradeoff. Two parameters, MaxWordlists and MaxWordlistSize, control the maximum memory that can be consumed by word lists. MaxWordlists is the maximum number of word lists Index Server can sustain in memory before initiating a shadow merge to persist the data into a shadow index. MaxWordlistSize is the maximum amount of memory available to hold a word list. Increasing the memory available to word lists reduces the number of disk-based shadow merges Index Server has to perform; decreasing it results in more disk-based operations. Two other parameters, MinWordlistMemory and MinSizeMergeWordlists, help control the memory consumed by word lists. MinWordlistMemory is the minimum amount of free memory that must be available for word list creation. MinSizeMergeWordlists is the minimum combined size of word lists that triggers a shadow merge.

Creation of the Shadow Indexes

When the number of word lists exceeds MaxWordlists or the total memory consumed by word lists exceeds MinSizeMergeWordlists, it is time to shadow merge the data. Because they are in-memory data compiled as quickly as possible, word lists are not well compressed. They also do not survive a shutdown and restart of Index Server. Persistent data solves both problems. The first step in this direction is the creation of shadow indexes. A shadow index is a persistent index created by merging word lists, and sometimes other shadow indexes, into a single index.

The process of creating shadow indexes is called shadow merging. This usually quick operation persists the word lists and frees memory occupied by them. The source indexes for a shadow merge are usually word lists. However, if the total number of shadow indexes exceeds MaxIndexes, some of them are also used as source indexes to a shadow merge.

A special kind of shadow merge, called the annealing merge, is performed when the system has been idle for a certain length of time and the total number of persistent indexes exceeds MaxIndexes. The registry parameter MinMergeIdleTime specifies the percentage of processor time that must be idle during a time period (controlled by the registry parameter MaxMergeInterval) to trigger an annealing merge. An annealing merge improves query performance and disk space usage by reducing the number of shadow indexes.
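The shadow-merge trigger conditions described above can be sketched as a predicate. The default values in the signature are invented for illustration and are not Index Server's actual defaults:

```python
# Sketch of the shadow-merge trigger: merge when either the count of
# word lists or their combined size crosses its threshold. Parameter
# defaults are illustrative only.
def should_shadow_merge(word_list_count: int, total_word_list_kb: int,
                        max_wordlists: int = 20,
                        min_size_merge_wordlists_kb: int = 1024) -> bool:
    return (word_list_count > max_wordlists or
            total_word_list_kb > min_size_merge_wordlists_kb)

print(should_shadow_merge(word_list_count=25, total_word_list_kb=512))  # True
print(should_shadow_merge(word_list_count=5, total_word_list_kb=256))   # False
```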

Creation of the Master Index

A master index is the final destination of all the word lists created by Index Server. It is a well-compressed persistent data structure designed to resolve queries efficiently. The master index is created from all of the existing shadow indexes and the current master index in a process called the master merge. A master merge is a very time- and disk I/O-intensive operation. When the merge completes, the resources are freed and the intermediate shadow indexes are deleted. As a result, queries execute faster than before.

Because it is a resource-intensive process, a master merge has to be robust and keep you in control. You can control the pace at which merging proceeds through the registry parameter ThreadPriorityMerge. If you don't like its current pace, you can stop Index Server while a master merge is in progress and change this parameter; the merge will continue when Index Server restarts. A master merge can also survive unexpected events such as a full disk or an abrupt system shutdown. After a restart, the master merge picks up where it left off. Index Server writes events to the event log whenever a master merge is started, restarted, or paused.

You can trigger the start of a master merge by controlling various parameters. A master merge is started for the following reasons.

  • Daily maintenance master merge. This can be done at a specified time every day. The registry parameter, MasterMergeTime, is the number of minutes after midnight when the merge should happen. By default, the daily master merge happens at midnight. This value should be adjusted to reflect the time when the load on the server is lowest.
  • The number of documents that have been changed since the last master merge is called the FreshCount. The larger the FreshCount, the higher the memory usage in the form of word lists. You can control the maximum FreshCount through the registry parameter MaxFreshCount. When FreshCount exceeds MaxFreshCount, a master merge is performed to reduce FreshCount to zero, thereby reducing the amount of memory used by Index Server. Adjust MaxFreshCount based on the amount of memory you have. The higher the value of MaxFreshCount, assuming there is sufficient memory to support it, the faster indexing can proceed.
  • While word lists take up space in memory, shadow indexes take up disk space. A site with a large or dynamic corpus can have a significant amount of disk space temporarily consumed by shadow indexes. To avoid a disk-full condition, you can control the amount of disk space consumed by shadow indexes through the registry parameter MinDiskFreeForceMerge. When the disk space remaining on the catalog drive is less than MinDiskFreeForceMerge and the cumulative space occupied by the shadow indexes exceeds the registry parameter MaxShadowFreeForceMerge, a master merge is initiated.
  • When the total disk space occupied by shadow indexes exceeds the registry parameter MaxShadowIndexSize, a master merge is initiated. This condition has higher precedence than the previous condition.
  • Finally, you can force a master merge through an administrative tool. You may want to do this whenever you are expecting a high query load. Although a master merge is resource-intensive when it is in progress, the end result is improved query response time.
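The triggers above can be combined into a single predicate. This is only a sketch of the decision logic as described; the argument names, units, and thresholds are invented, and the real precedence among conditions inside Index Server may differ:

```python
# Sketch of the master-merge triggers listed above. All names, units
# (MB, minutes), and thresholds are illustrative.
def should_master_merge(minutes_after_midnight, master_merge_time,
                        fresh_count, max_fresh_count,
                        disk_free_mb, min_disk_free_force_merge_mb,
                        shadow_mb, max_shadow_free_force_merge_mb,
                        max_shadow_index_size_mb, forced=False):
    if forced:                                        # administrator-initiated
        return True
    if minutes_after_midnight == master_merge_time:   # daily maintenance merge
        return True
    if fresh_count > max_fresh_count:                 # too many changed docs
        return True
    if shadow_mb > max_shadow_index_size_mb:          # shadow indexes too large
        return True
    if (disk_free_mb < min_disk_free_force_merge_mb and
            shadow_mb > max_shadow_free_force_merge_mb):
        return True                                   # low disk space
    return False

# Midnight (minutes_after_midnight == MasterMergeTime == 0) triggers a merge:
print(should_master_merge(0, 0, 100, 5000, 900, 100, 50, 200, 1000))  # True
```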

Monitoring Status and Performance

There are three ways to administer Index Server: using the Index Server snap-in for the Microsoft Management Console (MMC); using the Index Server administration page (available through the Index Server program group); and directly editing the registry with a registry editor such as RegEdit. Using the MMC snap-in is probably the easiest of the three. It is also the recommended tool because most future Windows NT-based administrative tools will be MMC-based snap-ins, so you can make your life easy and at the same time get a head start. All three tools allow you to administer Index Server running on a remote server. We will only discuss the Index Server MMC snap-in in this article.

Using the snap-in you can perform the following tasks:

  • Create and delete catalogs
  • Start and stop the service
  • Monitor status of all the catalogs
  • Add and remove directories as well as initiate a scan on a directory
  • Modify the set of properties to be cached in the property cache

On successful installation, Index Server creates an entry labeled "Index Server Manager" in the "Microsoft Index Server" program group under the Windows NT Option Pack program group. You may launch the Index Server snap-in using this entry. You may also load the Index Server snap-in into MMC separately using the "Add/Remove Snap-in..." menu item under the Console main menu. Click the Add... button in the Add Standalone Snap-in dialog box and choose Index Server from the list.

Catalog Management

Catalog creation through the snap-in is simple: you only need to provide a name for the catalog and specify a location for the index files. Later you can add directories and modify the property cache. The snap-in saves all the details of the catalog in the registry and creates a physical directory named catalog.wci at the specified location. Catalog deletion is even simpler: right-click the catalog icon and choose to delete it. The snap-in then removes all the entries in the registry and deletes the catalog.wci directory. If Index Server is running when a catalog is deleted, the snap-in waits for it to stop before physically deleting the entries and files.

Directory Management

Once you have created a catalog, you can define your corpus. Index Server 2.0 supports content that is managed by the Web server, the Network News Transfer Protocol (NNTP) server, and the Windows NT operating system. You can include content from any of these sources using the snap-in.

To include content managed by the Web and NNTP servers, open the Properties dialog box of the catalog and select the Web tab. Check Track Virtual Roots if you want to index a Web site and select the virtual server you want to track. Check Track NNTP Roots if you want to index an NNTP site and choose the NNTP server you want to track.

To include content managed by Windows NT file systems, expand the catalog folder and find the Directories subfolder. Add directories through the Add Directory dialog box, which can be opened by right-clicking on the Directories subfolder. This dialog box also allows you to specify remote directories.

Property Cache Management

If you have custom properties in your documents that you want to retrieve into your result set or use in property value queries, they should be made known to the property cache. You can view all the known properties along with their definitions and add or remove properties from the cache using the Index Server snap-in. Index Server should be running to enable enumeration of the known properties.

All the cached properties have a nonzero value in the Cached Size column. Properties with an empty column or with a zero value are not cached. Open the "Properties" dialog of the property of interest to you. To cache the property, check the "Cached" check box and provide a size for the property. Most data types, with the exception of strings, have a fixed size, so specifying the size is easy. For string properties that exceed the specified size, Index Server handles the overflow. This means you do not need to set the size of a string property type to the maximum possible value. Instead, choose a median value. Choosing a larger value wastes space and results in runtime inefficiency. Choosing a smaller value doesn't waste space, but causes inefficiency because of the overhead of handling overflows.
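To make that tradeoff concrete, here is a small illustrative sketch. The function name and the sampled lengths are hypothetical, nothing here is exposed by Index Server; the idea is simply to tally wasted slot space against overflow count for a candidate fixed size, using string lengths sampled from your own corpus:

```python
# Illustrative only: estimate the tradeoff for a candidate cached-string size.
# "sample" would come from measuring your own corpus; the values are made up.

def cache_size_tradeoff(lengths, cached_size):
    """Return (wasted_bytes, overflow_count) for a fixed cache slot size."""
    wasted = sum(cached_size - n for n in lengths if n <= cached_size)
    overflows = sum(1 for n in lengths if n > cached_size)
    return wasted, overflows

sample = [12, 30, 45, 60, 200]   # byte lengths of a sampled string property
print(cache_size_tradeoff(sample, 64))   # → (109, 1)
```

Running this for a few candidate sizes lets you pick the median-style value the paragraph above recommends: small enough to avoid large per-document waste, large enough to keep overflows rare.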

The discussion about string size is all the more important because the HTML filter shipped with Index Server 2.0 can only report values of HTML meta tags as strings. A typical Web site's corpus is likely to be dominated by HTML documents, so the custom properties included in HTML documents will be reported as strings. A judicious choice of size for each string could make a significant difference in performance.

After you make all the changes to the set of cached properties, commit these changes using the Commit Changes menu item. That menu item is under the Task menu, which is part of the context-sensitive popup menu that shows up when you right-click on the "Properties" subfolder. Commit causes all the changes to take effect. Index Server creates a new property cache with space for each cached property and copies the existing value of each cached property for every already indexed document to the new cache. This is a time-consuming process, so minimize the number of property cache commits. You can do this by batching all your changes and committing them all in a single session.

A document filter extracts properties during filtering. Therefore, when you add a property to the set of cached properties, all previously filtered documents will have an empty value for the property. Only documents filtered in the future will have appropriate values extracted by the filter. This may result in incorrect result sets because property values are empty where the user expects them not to be. To avoid this, you must initiate a full scan on all the directories so they can be re-filtered. A scan can be initiated through the Task menu item of the context-sensitive popup menu that shows up when you right-click on the chosen directory. If your commit only caused existing properties to be removed from the cache, you have nothing else to do. The unnecessary properties have already been eliminated from the cache and values of deleted properties extracted in the future will not be cached.

Monitoring Performance

Index Server provides performance counters for both the filtering process and the indexing and searching process. These counters can be used with the Windows NT performance monitor, perfmon.exe.

The filtering-related counters are split between the two processes. The counters tracking file processing are under the Content Index object. They are # of documents filtered, Files to be filtered, and Total # of documents. The counters directly related to filtering are under the Content Index Filter object. They are Binding Time in milliseconds, Filter Speed in Megabytes per hour, and Total Filter Speed in Megabytes per hour.

The indexing and merging-related counters are under the Content Index object. They are Index Size in Megabytes, number of Persistent Indexes, percentage of Merge Progress, and the number of Word lists in memory.

Event Log Messages

Index Server system errors are reported in the Windows NT application event log under the Ci Filter Service and Ci Service categories. Errors reported here include filtering problems, out-of-resource conditions, index corruption, and so on. Index Server documentation contains an exhaustive list of all the messages along with appropriate action.

Catalog Design

Creating and deleting catalogs through the Index Server snap-in is a snap. It is deceptively simple. Unless you are creating a prototype search solution or working with a small document corpus, you should spend some time designing your catalog and consider issues such as usability, performance, size, and maintenance. The following discussion on choosing hardware, deciding between single and multiple catalogs, and catalog growth covers various issues specific to Index Server. General issues that might apply to any Windows NT-based server, though important, are not discussed here. You may find a discussion of those issues in other parts of the MSDN Library, notably in the Windows Resource Kits.

Hardware Issues

Index Server can effectively use multiple processors and lots of RAM. The more resources at its disposal, the merrier it goes about its work. Index Server is disk I/O-intensive. In a typical configuration, disk I/O is more likely to throttle overall indexing performance than any other factor. The faster the disk drives and the faster the bus they sit on, the faster indexing can proceed. You can also employ disk striping to increase I/O throughput. Use Windows NT File System (NTFS) for your corpus and the catalog. It offers better scalability, performance, and security than the FAT file system.

What about disk mirroring? The index generated by Index Server can be fully regenerated from the registry information and the corpus. So disk mirroring is not as critical to Index Server catalog as it is to the corpus and the Windows NT registry. Considering that mirroring might sacrifice throughput to gain redundancy-based reliability, it may not be worth the price. On the other hand, mirroring does lessen the risk of losing your index data. That may be a good enough reason in situations where server downtime is not tolerable. After all, it is faster to rescue an index from a mirror disk than regenerate it from the corpus.

One Catalog or Multiple Catalogs?

Whenever possible, deploy one catalog instead of deploying multiple catalogs. There are several advantages to doing so. There are some situations where multiple catalogs appear to be a logical choice, but a closer look may persuade you to go with a single catalog. We will first discuss the advantages of dealing with a single catalog. Then we will consider situations where multiple catalogs might make sense.

The major advantage of deploying a single catalog is the ease of administration. Initial installation, configuration, and daily maintenance of a catalog don't demand much administrative attention. But little insignificant tasks do add up to a significant mass if you have to do them several times. Dealing with a single catalog versus multiple catalogs is no different.

Another advantage to deploying a single catalog is improved performance and lower overhead. Each active catalog has to be loaded into memory when Index Server is up and running. If the same content that is distributed between multiple catalogs is unified into a single catalog, Index Server will be able to concentrate its resources on the single catalog, resulting in improved performance.

Finally, if you need to expose your corpus as a single entity, there is no better way of doing it than deploying a single catalog. Index Server 2.0 is not capable of automatically merging result sets from multiple catalogs. You can merge result sets from different catalogs and derive a single result set using a script, but that increases your query turnaround time. If you have only a single catalog, Index Server handles sorting and merging internally. That would be more efficient than script-based sorting and merging.
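As a sketch of the script-based approach described above, the following Python merges per-catalog result sets into one ranked list. The hit fields and the sample data are invented for illustration and do not reflect Index Server's actual output format:

```python
# Sketch of a script-side merge: each catalog returns its own result set,
# already sorted by descending rank; we interleave them into one list.
import heapq

def merge_results(*result_sets):
    """Merge per-catalog hit lists (each sorted by descending rank)."""
    return list(heapq.merge(*result_sets, key=lambda hit: -hit["rank"]))

eng = [{"path": "/eng/spec.doc", "rank": 850}, {"path": "/eng/faq.htm", "rank": 400}]
legal = [{"path": "/legal/nda.doc", "rank": 600}]
merged = merge_results(eng, legal)
print([hit["path"] for hit in merged])
# → ['/eng/spec.doc', '/legal/nda.doc', '/eng/faq.htm']
```

Even this trivial merge adds a full extra pass over both result sets after the slower catalog responds, which is exactly the turnaround-time cost the paragraph warns about.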

You will be unable to use a single catalog if you have a huge corpus that refuses to perform reasonably well on a single server. In that case, a larger-scale solution, such as Microsoft SharePoint Portal Server, might serve you better than a collection of catalogs glued together by ad hoc scripts.

A typical organization, large or small, needs to serve the needs of a diverse set of user groups. For example, your organization may have technical documents used only by the engineering department and legal documents used only by the legal department. The document sets of these two groups are disjoint. It certainly appears that you have a case for two catalogs—one for each department. But wait! Index Server allows you to partition your corpus using directories. You can use a single catalog with two different directories—one for each set of documents—and provide two different sets of very similar search scripts, each pointing to the appropriate directory. If security is involved, you probably already have the right permissions set on the documents. Index Server honors Windows NT security and doesn't allow access to otherwise inaccessible documents. Need more convincing? The same organization may have a third set of documents, say news articles, that are of interest to both groups. With a multiple catalog approach, you will either have to create a third catalog or index the same articles twice, once into each catalog. With a single catalog, however, you only need a third directory that is included in both sets of scripts.

If for some reason you need to have multiple catalogs on a server, it may help to know that you can make a catalog inactive when you don't need it. Set the CatalogInactive registry parameter in the Catalogs\<catalog> subkey to 1 to inactivate the catalog. The next time Index Server starts up, the inactive catalogs won't be loaded.
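For example, a catalog named Sources (a placeholder name) could be inactivated with a registry file like the one below. The full path shown is the usual location of the ContentIndex key, and the DWORD type is an assumption; verify both against your installation before merging:

```
REGEDIT4

; "Sources" is a placeholder catalog name; substitute your own.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex\Catalogs\Sources]
"CatalogInactive"=dword:00000001
```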

If you have a corpus that is too large for your existing server, consider a bigger behemoth before deciding to split the corpus across multiple smaller servers. The cost of procuring a capable server may be lower than that of dealing with multiple servers.

Catalog Growth

Chances are that your corpus is steadily growing in size, and the Index Server-generated index grows along with it. A significant limitation of Index Server 2.0 is that the index can only reside on a single drive. If that drive is full, all the free gigabytes you have on the other drives cannot help the index grow. Choose a disk with sufficient space for the index to grow in the foreseeable future. How much disk space do you need? It depends on how much textual content your corpus has and how many properties you store in the property cache. For a default configuration of Index Server, a general rule of thumb is to expect the index to be about 40 percent of the corpus size.
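A quick planning sketch based on that rule of thumb follows. The growth-rate and horizon inputs are made-up planning assumptions, not measurements, and the 40 percent ratio is only the default-configuration estimate given above:

```python
# Back-of-the-envelope disk planning using the ~40 percent rule of thumb.

def index_disk_needed(corpus_mb, annual_growth=0.25, years=2, index_ratio=0.40):
    """Estimate MB of index space after projected corpus growth."""
    projected = corpus_mb * (1 + annual_growth) ** years
    return projected * index_ratio

# A 10 GB corpus growing 25 percent a year for two years:
print(round(index_disk_needed(10_000)))  # → 6250
```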

Another aspect of corpus growth is the possible introduction of a new document format. Before introducing a new document format to your corpus, consider the availability of a document filter for that format.

Troubleshooting Tips

The most common symptom of trouble appears when you search for documents you know exist and contain the keywords you used in the query. If you don't find the expected set of documents, follow this troubleshooting algorithm. Read all the steps before trying them out. A later step may be more applicable to your situation.

  1. Look in the application event log for CI-generated events. Address errors if necessary.
  2. Do the files have extensions that cause them not to be filtered? Use the filter enumeration tool, FILTREG, to see what Index Server sees in the registry.
  3. Are the files password protected? A filter cannot access them if they are.
  4. Is Index Server still scanning? Chances are the missing documents are not yet indexed. Wait for scanning to be complete (look in the catalog status reported by the MMC snap-in).
  5. Does Index Server return a message that the "Index is out of date" in response to a query? If so, chances are the missing documents are not yet filtered. Wait for all files to be filtered.
  6. Query for unfiltered documents using the administration page. This is a list of documents that Index Server failed to filter. The file may have an unrecognized extension and thus was not filtered. Or the filter may not have been able to filter that file because it couldn't understand the format or the file is corrupt. Index Server also protects itself against malicious filters by discontinuing filtering of a document that emits too much data compared to its file size.
  7. Are all the files from the same directory? If yes, it is possible that the directory wasn't scanned. Check to make sure the directory is covered by the set of included directories. Note that directories under catalog.wci will not be filtered even if they are covered. If a directory is covered, check to see whether it is visible to programs such as Windows Explorer and Dir. If all is well, try forcing a scan on the affected directories.
  8. Is the directory that contains the missing files a remote directory? If so, it may not have been indexed due to an incorrect user name. When specifying the logon ID for a remote directory, type both the domain name and the user name using the domain\username format. Note that the domain name may actually be the name of the computer, if the account is local to that computer.
  9. Use FILTDUMP on the files and examine the output. Are all the expected keywords present in the dump? If they should be, but are not, check with the filter provider.
  10. If the words are in the filter's output, it is possible that they weren't included in the index because they were considered noise words. Check the noise word list and change it if necessary.
  11. Were different languages used to filter the file(s) and issue the query? Lexical analysis is a language-dependent process. When a query is issued from one language and the document is filtered in another, query results can be unpredictable. The locale used to filter a document is dependent on the filter. Some file formats, such as Microsoft Word, mark documents with a language; this mark is used during filtering. Other formats, such as plain text, contain no language specifier. Most filters default to the system locale for these files. The locale for the query is specified by use of the CiLocale variable. If CiLocale is not specified, the locale of the browser is used (if available) or the default locale of the server.
  12. It may be possible that everything is indexed correctly, but the expected documents were not retrieved because you do not have the right permissions.
  13. Possibly the query timed out or was deemed too complex by Index Server. In both cases you should have received errors to that effect. Refer to the Index Server documentation about recognizing when such errors occur and what to do to avoid them.
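The self-protection heuristic mentioned in step 6 can be illustrated as follows. The 4x threshold is an invented value for illustration, not Index Server's actual limit:

```python
# Sketch of a guard against runaway filters: stop filtering a document that
# emits far more text than its file size would suggest.

def should_stop_filtering(emitted_bytes, file_size, max_ratio=4):
    """Return True when a filter's output looks suspiciously large."""
    return emitted_bytes > file_size * max_ratio

print(should_stop_filtering(50_000, 10_000))  # → True
print(should_stop_filtering(12_000, 10_000))  # → False
```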

Acknowledgements

I would like to thank David Lee, Kyle Peltonen, and Susan Dumais for their valuable feedback.