Troubleshooting Language Resources and Best Practices

This topic provides best practices and suggestions for validating and troubleshooting your IWordBreaker and IStemmer implementations.

Best Practices

  • Ensure that the threading model for language resources is set to "Both" in the registry (the ThreadingModel value under the component's InprocServer32 key).
  • Where possible, put language data in a resource in your DLL rather than in a separate file. This makes the DLL easier to install and more secure. Additionally, putting language data in a resource will result in improved performance for that language resource component.
  • Minimize the system resources that language resource components use. For example, if each instance of a language resource object needs read-only access to a lexicon, consider sharing the lexicon across all instances.
  • Consider using the neutral word breaker to handle text that is not in the language or locale for your word breaker implementation. This will help ensure that text is processed consistently across all languages.
  • Check all return codes and return them from functions like IStemmer::GenerateWordForms and IWordBreaker::BreakText. If indexing fails, it is important to pass the error back to the caller so that the user is notified which documents were not indexed.
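
The return-code advice above can be sketched as follows. This is a minimal, portable illustration rather than a real word breaker: HRESULT, S_OK, E_FAIL and FAILED are local stand-ins for the Windows definitions, and WordSinkSketch, EmitWords and FailingSink are hypothetical names that are not part of the language resource interfaces.

```cpp
#include <cstdint>
#include <string>

// Stand-ins for the Windows HRESULT definitions so the sketch builds anywhere.
using HRESULT = std::int32_t;
constexpr HRESULT S_OK = 0;
constexpr HRESULT E_FAIL = static_cast<HRESULT>(0x80004005u);
constexpr bool FAILED(HRESULT hr) { return hr < 0; }

// Hypothetical sink, loosely modeled on a PutWord-style callback.
struct WordSinkSketch {
    virtual ~WordSinkSketch() = default;
    virtual HRESULT PutWord(const std::wstring& word) = 0;
};

// Splits text on spaces and forwards each word to the sink.
// Any failure reported by the sink is returned unchanged to the caller.
HRESULT EmitWords(const std::wstring& text, WordSinkSketch& sink) {
    std::wstring current;
    for (wchar_t ch : text) {
        if (ch != L' ') {
            current.push_back(ch);
            continue;
        }
        if (!current.empty()) {
            const HRESULT hr = sink.PutWord(current);
            if (FAILED(hr)) return hr;  // do not swallow the error
            current.clear();
        }
    }
    if (!current.empty()) {
        const HRESULT hr = sink.PutWord(current);
        if (FAILED(hr)) return hr;
    }
    return S_OK;
}

// Test double: fails on the Nth call (never fails when failAtCall is 0).
struct FailingSink : WordSinkSketch {
    explicit FailingSink(int failAtCall) : failAt(failAtCall) {}
    HRESULT PutWord(const std::wstring&) override {
        ++calls;
        return calls == failAt ? E_FAIL : S_OK;
    }
    int failAt;
    int calls = 0;
};
```

Stopping at the first failed sink call and returning its code means the caller sees the same error the sink reported, instead of a success code that hides a partially processed document.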

Testing Stemmer Consistency

We recommend that you monitor the performance of an IStemmer implementation for consistency under the following conditions:

  • The stemmer performs consistently across multiple calls to IStemmer::Init. The stemmer reinitializes with the same parameters as in the previous initialization, without releasing the parameters.
  • Given the same test corpus, and repetitions of the same query, IStemmer::GenerateWordForms produces the identical output and makes identical calls to the methods of the IWordFormSink object.
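
One way to check the second condition is to record every sink call and compare the recordings across repeated runs. The sketch below assumes a simplified, portable setup: WordFormSinkSketch is a hypothetical stand-in for IWordFormSink, StemFn wraps whatever calls the stemmer under test, and ToyStem is a placeholder stemmer used only to exercise the check.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Stand-ins for the Windows HRESULT definitions so the sketch builds anywhere.
using HRESULT = std::int32_t;
constexpr HRESULT S_OK = 0;
constexpr bool FAILED(HRESULT hr) { return hr < 0; }

// Hypothetical stand-in for a sink with PutAltWord and PutWord methods.
struct WordFormSinkSketch {
    virtual ~WordFormSinkSketch() = default;
    virtual HRESULT PutAltWord(const wchar_t* pwc, unsigned long cwc) = 0;
    virtual HRESULT PutWord(const wchar_t* pwc, unsigned long cwc) = 0;
};

// Records every sink call, in order, so that two runs can be compared.
struct RecordingSink : WordFormSinkSketch {
    std::vector<std::wstring> calls;
    HRESULT PutAltWord(const wchar_t* pwc, unsigned long cwc) override {
        calls.push_back(L"PutAltWord:" + std::wstring(pwc, cwc));
        return S_OK;
    }
    HRESULT PutWord(const wchar_t* pwc, unsigned long cwc) override {
        calls.push_back(L"PutWord:" + std::wstring(pwc, cwc));
        return S_OK;
    }
};

using StemFn = std::function<HRESULT(const std::wstring&, WordFormSinkSketch&)>;

// Returns true if every word yields the same return code and the same
// sequence of sink calls on each repetition.
bool IsConsistent(const StemFn& stem, const std::vector<std::wstring>& corpus,
                  int repetitions) {
    for (const std::wstring& word : corpus) {
        RecordingSink baseline;
        const HRESULT expected = stem(word, baseline);
        for (int i = 1; i < repetitions; ++i) {
            RecordingSink repeat;
            if (stem(word, repeat) != expected || repeat.calls != baseline.calls) {
                return false;
            }
        }
    }
    return true;
}

// Placeholder stemmer: emits the word itself and a naive plural form.
HRESULT ToyStem(const std::wstring& word, WordFormSinkSketch& sink) {
    const HRESULT hr =
        sink.PutAltWord(word.c_str(), static_cast<unsigned long>(word.size()));
    if (FAILED(hr)) return hr;
    const std::wstring plural = word + L"s";
    return sink.PutWord(plural.c_str(), static_cast<unsigned long>(plural.size()));
}
```

Comparing the ordered call log, not just the set of emitted forms, also catches a stemmer that produces the same forms in a different order or through a different sink method.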

Testing for Invalid Input in the Stemmer

We recommend that you monitor how the IStemmer methods handle all errors related to invalid parameters. In addition, we recommend that you ensure that the stemmer methods do not raise unhandled exceptions. The stemmer should handle the following errors:

  • Call to IStemmer::Init with the pfLicense parameter set to NULL. IStemmer::Init fails and does not result in an access violation.
  • Call to IStemmer::GetLicenseToUse with the ppwcsLicense parameter set to NULL. IStemmer::GetLicenseToUse does not result in an access violation.
  • Call to IStemmer::GenerateWordForms with the pwcInBuf parameter set to NULL. IStemmer::GenerateWordForms fails (returns E_FAIL) and does not result in an access violation.
  • Call to IStemmer::GenerateWordForms with the cwc parameter equal to 0. IStemmer::GenerateWordForms returns successfully (returns S_OK) and does not result in an access violation.
  • Call to IStemmer::GenerateWordForms with the pwcInBuf parameter set to NULL and the cwc parameter equal to 0. IStemmer::GenerateWordForms fails (returns E_FAIL) and does not result in an access violation.
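
The three GenerateWordForms cases above depend on the order of the parameter checks: the NULL test must come before the zero-length test so that a NULL buffer with cwc equal to 0 still fails. The following is a minimal sketch of that ordering, not a real stemmer. GenerateWordFormsSketch is a hypothetical free function, a std::vector stands in for the sink, and the HRESULT values are local stand-ins for the Windows definitions.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Stand-ins for the Windows HRESULT definitions so the sketch builds anywhere.
using HRESULT = std::int32_t;
constexpr HRESULT S_OK = 0;
constexpr HRESULT E_FAIL = static_cast<HRESULT>(0x80004005u);

// Validates input in the order the checklist requires, then "stems" the
// word by emitting it unchanged (placeholder for real stemming logic).
HRESULT GenerateWordFormsSketch(const wchar_t* pwcInBuf, unsigned long cwc,
                                std::vector<std::wstring>& forms) {
    if (pwcInBuf == nullptr) {
        return E_FAIL;  // NULL buffer fails, even when cwc is 0
    }
    if (cwc == 0) {
        return S_OK;    // empty input succeeds with nothing emitted
    }
    forms.emplace_back(pwcInBuf, cwc);
    return S_OK;
}
```

Because the buffer is never dereferenced before the NULL check, none of the three invalid-input calls can cause an access violation in this sketch.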

Testing Word Breaker Consistency

We recommend that you ensure that the IWordBreaker implementation performs consistently under the following conditions:

  • The word breaker performs consistently across multiple calls to its IWordBreaker::Init method. The word breaker reinitializes with the same parameters as in the previous initialization, without releasing the parameters.
  • Given the same test corpus, and repetitions of the same query, the IWordBreaker::BreakText method produces the identical output and makes identical calls to the methods of the IWordSink and IPhraseSink objects.

Testing for Invalid Input in the Word Breaker

We recommend that you ensure that the IWordBreaker methods handle all errors related to invalid parameters. In addition, we recommend that you ensure that the word breaker methods do not raise unhandled exceptions. The word breaker should perform the following functions and handle the following errors:

  • Call to IWordBreaker::Init must return either LANGUAGE_E_DATABASE_NOT_FOUND or S_OK.
  • Call to IWordBreaker::Init successfully initializes the pfLicense parameter to FALSE, calls IWordBreaker::GetLicenseToUse, and does not result in an access violation.
  • The word breaker does not read past the end of the awcBuffer buffer (a member of the TEXT_SOURCE structure passed to the IWordBreaker::BreakText method).
  • Call to IWordBreaker::BreakText with the pwcInBuf parameter set to NULL. IWordBreaker::BreakText fails (returns E_FAIL) and does not result in an access violation.
  • Call to IWordBreaker::BreakText with the cwc parameter equal to 0. IWordBreaker::BreakText returns successfully (returns S_OK) and does not result in an access violation.
  • Call to the IWordBreaker::BreakText method with the pwcInBuf parameter set to NULL and the cwc parameter equal to 0. IWordBreaker::BreakText fails (returns E_FAIL) and does not result in an access violation.
  • Phrases generated during index creation contain the same number of words.
  • Phrases are generated during index creation through successive calls to the IWordSink::PutWord and IWordSink::PutAltWord methods. The word breaker uses only the IPhraseSink object during query time.
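
The buffer-bounds and invalid-input items above can be illustrated with a bounded scanning loop. The sketch below is an assumption-laden simplification: TextSourceSketch is a hypothetical structure that keeps only the awcBuffer, iEnd and iCur fields, iEnd is treated as one past the last valid character, refilling the buffer is left out, and a std::vector stands in for the word sink. The NULL-input case from the list is mapped here to a NULL source or a NULL buffer pointer.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Stand-ins for the Windows HRESULT definitions so the sketch builds anywhere.
using HRESULT = std::int32_t;
constexpr HRESULT S_OK = 0;
constexpr HRESULT E_FAIL = static_cast<HRESULT>(0x80004005u);

// Hypothetical, reduced text source: a buffer plus the valid range.
struct TextSourceSketch {
    const wchar_t* awcBuffer;  // characters to break
    unsigned long iEnd;        // assumed exclusive end of valid data
    unsigned long iCur;        // next character to process
};

// Breaks on spaces while reading only positions in [iCur, iEnd).
HRESULT BreakTextSketch(TextSourceSketch* src, std::vector<std::wstring>& words) {
    if (src == nullptr || src->awcBuffer == nullptr) {
        return E_FAIL;  // invalid input fails without touching memory
    }
    std::wstring current;
    while (src->iCur < src->iEnd) {  // never index at or beyond iEnd
        const wchar_t ch = src->awcBuffer[src->iCur++];
        if (ch != L' ') {
            current.push_back(ch);
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        words.push_back(current);
    }
    return S_OK;  // also reached when the valid range is empty
}
```

A useful test for this kind of loop is to hand it a buffer that physically contains more text than iEnd allows and confirm that the extra text never appears in the output.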

Related Topics

Extending Language Resources

Understanding Language Resource Components

Implementing a Word Breaker and Stemmer

Linguistic and Unicode Considerations