Configure search and analytics settings in eDiscovery (Premium)


Microsoft 365 compliance is now called Microsoft Purview and the solutions within the compliance area have been rebranded. For more information about Microsoft Purview, see the blog announcement and the What is Microsoft Purview? article.

You can configure settings for each Microsoft Purview eDiscovery (Premium) case to control the following functionality.

  • Near duplicates and email threading

  • Themes

  • Autogenerated review set query

  • Ignore text

  • Optical character recognition

To configure search and analytics settings for a case:

  1. On the eDiscovery (Premium) page, select the case.

  2. On the Settings tab, under Search & analytics, click Select.

    The case settings page is displayed. These settings are applied to all review sets in a case.


Near duplicates and email threading

In this section, you can set parameters for duplicate detection, near duplicate detection, and email threading. For more information, see Near duplicate detection and Email threading.

  • Near duplicates/email threading: When turned on, duplicate detection, near duplicate detection, and email threading are included as part of the workflow when you run analytics on the data in a review set.

  • Document and email similarity threshold: If the similarity level for two documents is above the threshold, both documents are put in the same near duplicate set.

  • Minimum/maximum number of words: Near duplicate and email threading analyses are performed only on documents that contain at least the minimum number of words and at most the maximum number of words.
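To make the interaction between the similarity threshold and the word-count limits concrete, here is a minimal Python sketch. It is not the algorithm eDiscovery (Premium) uses, which this article does not describe. The sketch assumes Jaccard similarity over three-word shingles and a simple "compare against the first document in each set" grouping, and every function name and default value in it is hypothetical.

```python
import re

def tokenize(text):
    # Lowercased word tokens; stands in for whatever tokenizer the service uses.
    return re.findall(r"\w+", text.lower())

def shingles(words, size=3):
    # Overlapping word n-grams; very short documents yield a single shingle.
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    # Jaccard similarity (assumed measure): shared shingles / all shingles.
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def near_duplicate_sets(docs, threshold=0.65, min_words=10, max_words=500_000):
    # Illustrative defaults only. Documents outside the word-count window
    # are skipped, mirroring the minimum/maximum number of words settings.
    sets = []  # each entry: (shingles of the set's first document, [doc ids])
    for doc_id, text in docs.items():
        words = tokenize(text)
        if not (min_words <= len(words) <= max_words):
            continue
        doc_shingles = shingles(words)
        for pivot, members in sets:
            # "Above the threshold" puts the document in the same set.
            if jaccard(pivot, doc_shingles) > threshold:
                members.append(doc_id)
                break
        else:
            sets.append((doc_shingles, [doc_id]))
    return [members for _, members in sets]
```

In this sketch, raising the threshold produces more, smaller sets, while raising the minimum number of words drops short items such as one-line replies from the analysis entirely.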


Themes

In this section, you can set parameters for themes. For more information, see Themes.

  • Themes: When turned on, themes clustering is performed as part of the workflow when you run analytics on the data in a review set.

  • Maximum number of themes: Specifies the maximum number of themes that can be generated when you run analytics on the data in a review set.

  • Include numbers in themes: When turned on, numbers (that identify a theme) are included when generating themes.

  • Adjust maximum number of themes dynamically: In certain situations, there may not be enough documents in a review set to produce the desired number of themes. When this setting is enabled, eDiscovery (Premium) adjusts the maximum number of themes dynamically rather than attempting to enforce the maximum number of themes.

Review set query

If you select the Automatically create a For Review saved search after analytics checkbox, eDiscovery (Premium) autogenerates a review set query named For Review.


This query filters out duplicate items so that you can review only the unique items in the review set. The query is created only when you run analytics for a review set in the case. For more information about review set queries, see Query the data in a review set.

Ignore text

Certain text can diminish the quality of analytics, such as lengthy disclaimers that are added to email messages regardless of the content of the email. If you know of text that should be ignored, you can exclude it from analytics by specifying the text string and the analytics functionality (Near-duplicates, Email threading, Themes, and Relevance) from which the text should be excluded. Using regular expressions (RegEx) as ignored text is also supported.
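As an illustration of what ignoring text amounts to, the following Python sketch strips a literal string and a regular-expression match from a message body before it would be analyzed. It uses Python's `re` module purely for demonstration. The regular-expression dialect that eDiscovery (Premium) accepts is not specified in this article, and the function name, sample disclaimer, and pattern are all hypothetical.

```python
import re

def strip_ignored_text(body, literals=(), patterns=()):
    # Remove exact text strings first, then anything matching a pattern.
    for literal in literals:
        body = body.replace(literal, "")
    for pattern in patterns:
        # DOTALL lets ".*?" span line breaks inside a multi-line disclaimer.
        body = re.sub(pattern, "", body, flags=re.DOTALL)
    return body.strip()

# Hypothetical message with boilerplate appended by a mail gateway.
message = (
    "Please review the attached contract.\n\n"
    "CONFIDENTIALITY NOTICE: This message is for the intended recipient.\n"
    "Sent from my phone"
)

cleaned = strip_ignored_text(
    message,
    literals=["Sent from my phone"],
    patterns=[r"CONFIDENTIALITY NOTICE:.*?intended recipient\."],
)
```

With the boilerplate removed, two messages that differ only in their substantive first line no longer look similar merely because they share a long disclaimer.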

Optical character recognition (OCR)

When this setting is turned on, OCR processing will be run on image files. OCR processing is run in the following situations:

  • When custodians and non-custodial data sources are added to a case. OCR makes the text in image files searchable during a collection. OCR processing is performed during the Advanced indexing process and runs only on items that Advanced indexing processes. For example, if a large PDF file that was partially indexed or had other indexing errors is processed during Advanced indexing, OCR is also applied to that file. As a result, when custodians are added to a case, some email attachments might not be processed for OCR because those files aren't processed during Advanced indexing.

  • When content from other data sources (that aren't associated with a custodian and added to the case in a non-custodial data source) is added to a review set.

After data is added to a review set, image text can be reviewed, searched, tagged, and analyzed. You can view the extracted text in the Text viewer of the selected image file in the review set. For more information, see: