Create a keyword dictionary

Data loss prevention (DLP) can identify, monitor, and protect your sensitive items. Identifying sensitive items sometimes requires looking for keywords, particularly when identifying generic content (such as healthcare-related communication) or inappropriate or explicit language. Although you can create keyword lists in sensitive information types, keyword lists are limited in size and require modifying XML to create or edit them. Keyword dictionaries are simpler to manage and work at a much larger scale: a dictionary supports up to 1 MB of terms (after compression) in any language. The tenant limit is also 1 MB after compression, which means that all dictionaries in a tenant combined can hold close to 1 million characters.

Keyword dictionary limits

You can create at most 50 keyword dictionary based sensitive information types per tenant. To find out how many you have in your tenant, connect using the procedures in Connect to the Security & Compliance Center PowerShell, and then run this PowerShell script.

$rawFile = $env:TEMP + "\rule.xml"

# Export the rule packages to a temporary file and read them back as UTF-16 text.
$kd = Get-DlpKeywordDictionary
$ruleCollections = Get-DlpSensitiveInformationTypeRulePackage
Set-Content -Path $rawFile -Encoding Byte -Value $ruleCollections.SerializedClassificationRuleCollection
$unicodeEncoding = New-Object System.Text.UnicodeEncoding
$fileContent = [System.IO.File]::ReadAllText((Resolve-Path $rawFile), $unicodeEncoding)

if ($kd.Count -gt 0)
{
    # For each entity, add the matches of the first dictionary identity found in it.
    $count = 0
    $entities = $fileContent -split "Entity id"
    for ($j = 1; $j -lt $entities.Count; $j++)
    {
        for ($i = 0; $i -lt $kd.Count; $i++)
        {
            # $found is used instead of $Matches, which is a PowerShell automatic variable.
            $found = Select-String -InputObject $entities[$j] -Pattern $kd[$i].Identity -AllMatches
            $count = $found.Matches.Count + $count
            if ($found.Matches.Count -gt 0) { break }
        }
    }
    Write-Output "Total Keyword Dictionary SIT:"
    $count
}
else
{
    $found = Select-String -InputObject $fileContent -Pattern $kd.Identity -AllMatches
    Write-Output "Total Keyword Dictionary SIT:"
    $found.Matches.Count
}

Remove-Item $rawFile
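
To make the counting logic easier to follow, here is a minimal offline sketch of the same idea on made-up data. The sample XML, the entity IDs, and the GUIDs are hypothetical, and unlike the script above, this version counts each matching entity exactly once.

```powershell
# Hypothetical sample text standing in for the exported rule package.
$sampleXml = @'
<Entity id="e1"><Pattern><IdMatch idRef="11111111-1111-1111-1111-111111111111" /></Pattern></Entity>
<Entity id="e2"><Pattern><IdMatch idRef="Regex_example" /></Pattern></Entity>
<Entity id="e3"><Pattern><IdMatch idRef="22222222-2222-2222-2222-222222222222" /></Pattern></Entity>
'@

# Hypothetical dictionary identities (what Get-DlpKeywordDictionary would return).
$dictionaryIds = @(
    '11111111-1111-1111-1111-111111111111',
    '22222222-2222-2222-2222-222222222222'
)

$count = 0
# Element 0 of the split is the text before the first entity, so skip it.
$entities = $sampleXml -split 'Entity id'
foreach ($entity in ($entities | Select-Object -Skip 1)) {
    foreach ($id in $dictionaryIds) {
        # Count the entity once as soon as any dictionary identity appears in it.
        if ($entity.Contains($id)) { $count++; break }
    }
}

# Compare the result with the per-tenant limit of 50.
'{0} of 50 dictionary-based SITs used' -f $count
```

With this sample data, two of the three entities reference a dictionary, so the count is 2.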

Basic steps to creating a keyword dictionary

The keywords for your dictionary can come from various sources: most commonly from a file (such as a .csv or .txt list) imported in the service or with a PowerShell cmdlet, from a list you enter directly in the PowerShell cmdlet, or from an existing dictionary. Whatever the source, you follow the same core steps to create a keyword dictionary:

  1. Use the Security & Compliance Center or connect to Security & Compliance Center PowerShell.

  2. Define or load your keywords from your intended source. The wizard and the cmdlet both accept a comma-separated list of keywords to create a custom keyword dictionary, so this step will vary slightly depending on where your keywords come from. Once loaded, they're encoded and converted to a byte array before they're imported.

  3. Create your dictionary. Choose a name and description and create your dictionary.
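
For a list entered directly in PowerShell, the encoding in step 2 can be sketched as follows. The snippet assumes UTF-16 (Unicode) encoding with one keyword per line and does not add a byte order mark; the dictionary name, description, and sample keywords are illustrative only. The cmdlet call is guarded so the snippet also runs in a session that isn't connected to Security & Compliance Center PowerShell.

```powershell
# Illustrative keywords; replace with your own list.
$keywords = @("aarskog's syndrome", "abandonment", "abasia")

# Encode each keyword as UTF-16 LE bytes, one keyword per line.
[byte[]]$encodedKeywords = $keywords | ForEach-Object {
    [System.Text.Encoding]::Unicode.GetBytes($_ + "`r`n")
}

# Create the dictionary only when the cmdlet is available (connected session).
if (Get-Command New-DlpKeywordDictionary -ErrorAction SilentlyContinue) {
    New-DlpKeywordDictionary -Name "Diseases" -Description "Sample disease names" -FileData $encodedKeywords
}
```

Keeping the encoding step separate from the cmdlet call makes it easy to inspect the byte array before anything is uploaded.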

Create a keyword dictionary using the Security & Compliance Center

Use the following steps to create and import keywords for a custom dictionary:

  1. Connect to the Security & Compliance Center.

  2. Navigate to Classifications > Sensitive info types.

  3. Select Create and enter a Name and Description for your sensitive info type, then select Next.

  4. Select Add an element, then select Dictionary (Large keywords) in the Detect content containing drop-down list.

  5. Select Add a dictionary.

  6. Under the Search control, select You can create new keyword dictionaries here.

  7. Enter a Name for your custom dictionary.

  8. Select Import, and select either From text or From csv depending on your keyword file type.

  9. In the file dialog, select the keyword file from your local PC or network file share, then select Open.

  10. Select Save, then select your custom dictionary from the Keyword dictionaries list.

  11. Select Add, then select Next.

  12. Review and finalize your sensitive info type selections, then select Finish.

Create a keyword dictionary from a file using PowerShell

When you need to create a large dictionary, the keywords often come from a file or from a list exported from some other source. In this example, you'll create a keyword dictionary containing a list of inappropriate language to screen in external email. You must first Connect to Security & Compliance Center PowerShell.

  1. Copy the keywords into a text file and make sure that each keyword is on a separate line.

  2. Save the text file with Unicode encoding. In Notepad, select Save As > Encoding > Unicode.

  3. Read the file into a variable by running this cmdlet:

    $fileData = Get-Content <filename> -Encoding Byte -ReadCount 0

  4. Create the dictionary by running this cmdlet:

    New-DlpKeywordDictionary -Name <name> -Description <description> -FileData $fileData
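
Steps 2 and 3 can also be done without Notepad. The sketch below uses a hypothetical file name in the temp folder and sample keywords. It writes the list as UTF-16 with a byte order mark and reads it back with [System.IO.File]::ReadAllBytes. That call yields the file's raw bytes, just as Get-Content -Encoding Byte -ReadCount 0 does, and it also works in PowerShell 6 and later, where the Byte encoding value was replaced by the -AsByteStream parameter.

```powershell
# Hypothetical file location and sample keywords for illustration.
$keywordFile = Join-Path ([System.IO.Path]::GetTempPath()) 'keywords.txt'
$keywords = @('keyword one', 'keyword two', 'keyword three')

# -Encoding Unicode writes UTF-16 LE with a BOM, one keyword per line.
Set-Content -Path $keywordFile -Value $keywords -Encoding Unicode

# Read the raw bytes; the first two should be the UTF-16 LE BOM (FF FE).
$fileData = [System.IO.File]::ReadAllBytes($keywordFile)
'{0:X2} {1:X2}' -f $fileData[0], $fileData[1]
```

The resulting $fileData can then be passed to the -FileData parameter as in step 4.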

Modifying an existing keyword dictionary

You might need to modify keywords in one of your keyword dictionaries, or modify one of the built-in dictionaries. Currently, you can only update a custom keyword dictionary using PowerShell.

In this example, you'll retrieve a dictionary's terms in PowerShell, remove some of them, save the terms locally where you can modify them in an editor, and then update the dictionary in place.

First, retrieve the dictionary object:

$dict = Get-DlpKeywordDictionary -Name "Diseases"

Printing $dict shows its properties. The keywords themselves are stored in an object on the backend, but $dict.KeywordDictionary contains a string representation of them, which you'll use to modify the dictionary.

Before you modify the dictionary, you need to turn the string of terms back into an array using the .split(',') method. Then you'll clean up the unwanted spaces between the keywords with the .trim() method, leaving just the keywords to work with.

$terms = $dict.KeywordDictionary.split(',').trim()
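
To try the split-and-trim step without a tenant connection, the following sketch applies it to a short sample string. The string is made up for illustration, but it follows the same comma-separated form as $dict.KeywordDictionary.

```powershell
# Sample string in the same comma-separated form as $dict.KeywordDictionary.
$keywordString = "aarskog's syndrome, abandonment, abasia, abduction contracture"

# Split on commas, then trim the leading space left on each term.
$terms = $keywordString.Split(',').Trim()

$terms.Count   # 4
$terms[3]      # abduction contracture
```

Calling .Trim() on the array works because PowerShell 3.0 and later apply a method to each element when the array itself doesn't define it.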

Now you'll remove some terms from the dictionary. Because the example dictionary has only a few keywords, you could as easily skip to exporting the dictionary and editing it in Notepad, but dictionaries generally contain a large amount of text, so you'll first learn this way to edit them easily in PowerShell.

In the last step, you saved the keywords to an array. There are several ways to remove items from an array, but as a straightforward approach, you'll create an array of the terms you want to remove from the dictionary, and then copy into a new array only the dictionary terms that aren't in that list.

Run the command $terms to show the current list of terms. The output of the command looks like this:

aarskog's syndrome
abandonment
abasia
abderhalden-kaufmann-lignac
abdominalgia
abduction contracture
abetalipoproteinemia
abiotrophy
ablatio
ablation
ablepharia
abocclusion
abolition
aborter
abortion
abortus
aboulomania
abrami's disease

Run this command to specify the terms that you want to remove:

$termsToRemove = @('abandonment', 'ablatio')

Run this command to actually remove the terms from the list:

$updatedTerms = $terms | Where-Object{ $_ -notin $termsToRemove }

Run the command $updatedTerms to show the updated list of terms. The output of the command looks like this (the specified terms have been removed):

aarskog's syndrome
abasia
abderhalden-kaufmann-lignac
abdominalgia
abduction contracture
abetalipoproteinemia
abiotrophy
ablation
ablepharia
abocclusion
abolition
aborter
abortion
abortus
aboulomania
abrami's disease

Now save the dictionary locally and add a few more terms. You could add the terms right here in PowerShell, but you'll still need to export the file locally to ensure it's saved with Unicode encoding and contains the BOM.

Save the dictionary locally by running the following:

Set-Content -Path "C:\myPath\terms.txt" -Value $updatedTerms
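
As a rough alternative to the manual edit described next, the new terms can be appended in PowerShell and the file written as UTF-16 in one pass. This is only a sketch: the sample terms, the added terms, and the temp-folder path are hypothetical stand-ins for your own data.

```powershell
# Sample data standing in for $updatedTerms from the previous step.
$updatedTerms = @("aarskog's syndrome", 'abasia', 'ablation')
$termsToAdd   = @('abulia', 'acalculia', 'abasia')   # illustrative additions

# Append the new terms and drop duplicates, keeping first occurrences in order.
$finalTerms = @($updatedTerms + $termsToAdd | Select-Object -Unique)

# Hypothetical output path; -Encoding Unicode writes UTF-16 LE with a BOM.
$termsFile = Join-Path ([System.IO.Path]::GetTempPath()) 'terms.txt'
Set-Content -Path $termsFile -Value $finalTerms -Encoding Unicode
```

Select-Object -Unique compares strings case-sensitively, so terms that differ only in case are kept as separate entries.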

Now open the file, add your other terms, and save with Unicode encoding (UTF-16). Next, upload the updated terms and update the dictionary in place.

Set-DlpKeywordDictionary -Identity "Diseases" -FileData (Get-Content -Path "C:\myPath\terms.txt" -Encoding Byte -ReadCount 0)

Now the dictionary has been updated in place. The Identity parameter takes the name of the dictionary. To also change the name of your dictionary with Set-DlpKeywordDictionary, add the -Name parameter with your new dictionary name to the command above.

Using keyword dictionaries in custom sensitive information types and DLP policies

Keyword dictionaries can be used as part of the match requirements for a custom sensitive information type, or as a sensitive information type themselves. Both require you to create a custom sensitive information type. Follow the instructions in the linked article to create a sensitive information type. Once you have the XML, you'll need the dictionary's GUID identifier to reference the dictionary from it.

<Entity id="9e5382d0-1b6a-42fd-820e-44e0d3b15b6e" patternsProximity="300" recommendedConfidence="75">
    <Pattern confidenceLevel="75">
        <IdMatch idRef=". . ."/>
    </Pattern>
</Entity>

To get the identity of your dictionary, run this command and copy the Identity property value:

Get-DlpKeywordDictionary -Name "Diseases"

The output of the command looks like this:

RunspaceId        : 138e55e7-ea1e-4f7a-b824-79f2c4252255
Identity          : 8d2d44b0-91f4-41f2-94e0-21c1c5b5fc9f
Name              : Diseases
Description       : Names of diseases and injuries from ICD-10-CM lexicon
KeywordDictionary : aarskog's syndrome, abandonment, abasia, abderhalden-kaufmann-lignac, abdominalgia, abduction contracture, abetalipoproteinemia, abiotrophy, ablatio, ablation, ablepharia, abocclusion, abolition, aborter, abortion, abortus, aboulomania, abrami's disease, abramo
IsValid           : True
ObjectState       : Unchanged

Paste the identity into the idRef attribute of the IdMatch element in your custom sensitive information type's XML, and then upload it. Your dictionary now appears in your list of sensitive information types, and you can use it directly in your policy, specifying how many keywords are required to match.

<Entity id="d333c6c2-5f4c-4131-9433-db3ef72a89e8" patternsProximity="300" recommendedConfidence="85">
      <Pattern confidenceLevel="85">
        <IdMatch idRef="8d2d44b0-91f4-41f2-94e0-21c1c5b5fc9f" />
      </Pattern>
    </Entity>
    <LocalizedStrings>
      <Resource idRef="d333c6c2-5f4c-4131-9433-db3ef72a89e8">
        <Name default="true" langcode="en-us">Diseases</Name>
        <Description default="true" langcode="en-us">Detects various diseases</Description>
      </Resource>
    </LocalizedStrings>
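
Instead of pasting the identity by hand, you could set it with PowerShell's XML support. The sketch below is one possible approach, not a required step: it works on a standalone Entity fragment with an empty idRef placeholder and reuses the sample identity from the output above. A real rule package contains more elements than this fragment.

```powershell
# Identity value copied from the sample Get-DlpKeywordDictionary output.
$dictionaryId = '8d2d44b0-91f4-41f2-94e0-21c1c5b5fc9f'

# Entity fragment with an empty idRef placeholder (same shape as the example).
[xml]$fragment = @'
<Entity id="d333c6c2-5f4c-4131-9433-db3ef72a89e8" patternsProximity="300" recommendedConfidence="85">
  <Pattern confidenceLevel="85">
    <IdMatch idRef="" />
  </Pattern>
</Entity>
'@

# Fail early if the value is not a well-formed GUID.
$null = [guid]::Parse($dictionaryId)

# Write the dictionary identity into the IdMatch element.
$fragment.Entity.Pattern.IdMatch.SetAttribute('idRef', $dictionaryId)
$fragment.OuterXml
```

Parsing the value as a GUID first catches copy-and-paste mistakes before the XML is uploaded.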


Microsoft 365 Information Protection supports, in preview, double byte character set languages for:

  • Chinese (simplified)
  • Chinese (traditional)
  • Korean
  • Japanese

This support is available for sensitive information types. See Information protection support for double byte character sets release notes (preview) for more information.