Create a keyword dictionary

Data loss prevention (DLP) can identify, monitor, and protect your sensitive items. Identifying sensitive items sometimes requires looking for keywords, particularly when identifying generic content (such as healthcare-related communication) or inappropriate or explicit language. Although you can create keyword lists in sensitive information types, keyword lists are limited in size and require modifying XML to create or edit them. Keyword dictionaries provide simpler keyword management at a much larger scale: a dictionary supports up to 1 MB of terms (after compression) in any language. The tenant limit is also 1 MB after compression, which means that all dictionaries in a tenant combined can hold close to 1 million characters.

Note

Microsoft 365 Information Protection now supports, in preview, double-byte character set languages for:

  • Chinese (simplified)
  • Chinese (traditional)
  • Korean
  • Japanese

This support is available for sensitive information types. For more information, see Information protection support for double byte character sets release notes (preview).

Basic steps to creating a keyword dictionary

The keywords for your dictionary can come from a variety of sources, most commonly from a file (such as a .csv or .txt list) imported in the service or by PowerShell cmdlet, from a list you enter directly in the PowerShell cmdlet, or from an existing dictionary. When you create a keyword dictionary, you follow the same core steps:

  1. Use the Security & Compliance Center (https://protection.office.com) or connect to Security & Compliance Center PowerShell.

  2. Define or load your keywords from your intended source. The wizard and the cmdlet both accept a comma-separated list of keywords to create a custom keyword dictionary, so this step varies slightly depending on where your keywords come from. Once loaded, the keywords are encoded and converted to a byte array before they're imported.

  3. Create your dictionary. Choose a name and description and create your dictionary.
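The encoding part of step 2 can be sketched in PowerShell as follows. This is a minimal illustration, not the service's own implementation: it assumes the keywords start as a comma-separated string and that UTF-16 (which PowerShell calls Unicode) with a byte order mark is the target encoding, matching the file-based procedure in this article. The variable names and sample keywords are illustrative only.

```powershell
# Hypothetical sample input: a comma-separated keyword list.
$keywordList = 'aarskog''s syndrome, abandonment, abasia'

# Split on commas and trim stray whitespace so each keyword stands alone.
$keywords = $keywordList.Split(',').Trim()

# Put one keyword per line, as in a keyword text file.
$text = ($keywords -join "`r`n") + "`r`n"

# Encode as UTF-16 LE and prepend the byte order mark (FF FE).
$encoding = [System.Text.Encoding]::Unicode
[byte[]]$fileData = $encoding.GetPreamble() + $encoding.GetBytes($text)

# $fileData is now a byte array of the kind passed to -FileData.
'{0} keywords, {1} bytes' -f $keywords.Count, $fileData.Length
```

Building the byte array in memory avoids a temporary file, but saving to a file first makes it easier to inspect the keyword list before importing it.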

Create a keyword dictionary using the Security & Compliance Center

Use the following steps to create and import keywords for a custom dictionary:

  1. Connect to the Security & Compliance Center (https://protection.office.com).

  2. Navigate to Classifications > Sensitive info types.

  3. Select Create and enter a Name and Description for your sensitive info type, then select Next.

  4. Select Add an element, then select Dictionary (Large keywords) in the Detect content containing drop-down list.

  5. Select Add a dictionary.

  6. Under the Search control, select You can create new keyword dictionaries here.

  7. Enter a Name for your custom dictionary.

  8. Select Import, and select either From text or From csv depending on your keyword file type.

  9. In the file dialog, select the keyword file from your local PC or network file share, then select Open.

  10. Select Save, then select your custom dictionary from the Keyword dictionaries list.

  11. Select Add, then select Next.

  12. Review and finalize your sensitive info type selections, then select Finish.

Create a keyword dictionary from a file using PowerShell

When you need to create a large dictionary, the keywords often come from a file or from a list exported from another source. In this example, you'll create a keyword dictionary containing a list of inappropriate language to screen for in external email. You must first connect to Security & Compliance Center PowerShell.

  1. Copy the keywords into a text file and make sure that each keyword is on a separate line.

  2. Save the text file with Unicode encoding. In Notepad, select Save As > Encoding > Unicode.

  3. Read the file into a variable by running this cmdlet:

    $fileData = Get-Content <filename> -Encoding Byte -ReadCount 0
    
  4. Create the dictionary by running this cmdlet:

    New-DlpKeywordDictionary -Name <name> -Description <description> -FileData $fileData
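
`Get-Content -Encoding Byte` works in Windows PowerShell 5.1; PowerShell 6 and later replaced it with the `-AsByteStream` switch. The sketch below shows one edition-independent way to produce the same byte array, assuming .NET's `[System.IO.File]::ReadAllBytes` is acceptable in your environment. The file path and sample keywords are hypothetical.

```powershell
# Hypothetical keyword file in the temp directory; one keyword per line.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'keywords.txt'
Set-Content -Path $path -Value @('keyword one', 'keyword two') -Encoding Unicode

# ReadAllBytes needs a full path and returns the raw bytes, BOM included.
[byte[]]$fileData = [System.IO.File]::ReadAllBytes($path)

# Equivalent cmdlet forms, depending on the PowerShell edition:
#   Windows PowerShell 5.1: Get-Content $path -Encoding Byte -ReadCount 0
#   PowerShell 6 and later: Get-Content $path -AsByteStream -ReadCount 0

# The first two bytes should be the UTF-16 LE byte order mark, FF FE.
'{0:X2} {1:X2}' -f $fileData[0], $fileData[1]
```

Checking the first two bytes is a quick way to confirm that the file really was saved with Unicode encoding before you pass `$fileData` to the cmdlet.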
    

Modifying an existing keyword dictionary

You might need to modify keywords in one of your keyword dictionaries, or modify one of the built-in dictionaries. Currently, you can only update a custom keyword dictionary using PowerShell.

In this example, you'll modify some terms in PowerShell, save the terms locally where you can modify them in an editor, and then update the previous terms in place.

First, retrieve the dictionary object:

$dict = Get-DlpKeywordDictionary -Name "Diseases"

Printing $dict shows its properties. The keywords themselves are stored in an object on the backend, but $dict.KeywordDictionary contains a string representation of them, which you'll use to modify the dictionary.

Before you modify the dictionary, you need to turn the string of terms back into an array using the .split(',') method. Then you'll clean up the unwanted spaces around the keywords with the .trim() method, leaving just the keywords to work with.

$terms = $dict.KeywordDictionary.split(',').trim()

Now you'll remove some terms from the dictionary. Because the example dictionary has only a few keywords, you could just as easily skip to exporting the dictionary and editing it in Notepad, but dictionaries generally contain a large amount of text, so you'll first learn how to edit them in PowerShell.

In the previous step, you saved the keywords to an array. There are several ways to remove items from an array, but as a straightforward approach, you'll create an array of the terms you want to remove from the dictionary, and then copy into a new array only the dictionary terms that aren't in that list.

Run the command $terms to show the current list of terms. The output of the command looks like this:

aarskog's syndrome
abandonment
abasia
abderhalden-kaufmann-lignac
abdominalgia
abduction contracture
abetalipoproteinemia
abiotrophy
ablatio
ablation
ablepharia
abocclusion
abolition
aborter
abortion
abortus
aboulomania
abrami's disease

Run this command to specify the terms that you want to remove:

$termsToRemove = @('abandonment', 'ablatio')

Run this command to remove the terms from the list:

$updatedTerms = $terms | Where-Object{ $_ -notin $termsToRemove }

Run the command $updatedTerms to show the updated list of terms. The output of the command looks like this (the specified terms have been removed):

aarskog's syndrome
abasia
abderhalden-kaufmann-lignac
abdominalgia
abduction contracture
abetalipo proteinemia
abiotrophy
ablation
ablepharia
abocclusion
abolition
aborter
abortion
abortus
aboulomania
abrami's disease


Now save the dictionary locally and add a few more terms. You could add the terms right here in PowerShell, but you'll still need to export the file locally to ensure it's saved with Unicode encoding and contains the BOM.
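
If you do prefer to add terms in PowerShell before exporting, one possible approach is sketched below. It repeats the split, trim, and filter steps on a small stand-in string so that it runs on its own; the sample terms and the added values are illustrative, and de-duplicating with `Sort-Object -Unique` (which also sorts the list) is an assumption about what you want rather than a requirement of the service.

```powershell
# Sample string standing in for $dict.KeywordDictionary (hypothetical subset).
$keywordString = 'abandonment, abasia, ablatio, ablation'

# Rebuild the array and drop the unwanted terms, as in the steps above.
$terms = $keywordString.Split(',').Trim()
$termsToRemove = @('abandonment', 'ablatio')
$updatedTerms = @($terms | Where-Object { $_ -notin $termsToRemove })

# Append new terms (illustrative values), then remove duplicates.
$updatedTerms += @('abasia', 'abscess')
$updatedTerms = @($updatedTerms | Sort-Object -Unique)

$updatedTerms
```

Wrapping the pipeline results in `@( )` keeps `$updatedTerms` an array even when only one term survives the filter.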
  
Save the dictionary locally by running the following:
  
```powershell
# Write one term per line, encoded as UTF-16 (Unicode) with a BOM.
Set-Content -Path "C:\myPath\terms.txt" -Value $updatedTerms -Encoding Unicode
```

Now open the file, add your additional terms, and save it with Unicode encoding (UTF-16). Then upload the updated terms to update the dictionary in place:

Set-DlpKeywordDictionary -Identity "Diseases" -FileData (Get-Content -Path "C:\myPath\terms.txt" -Encoding Byte -ReadCount 0)

Now the dictionary has been updated in place. Note that the Identity parameter takes the name of the dictionary. If you also want to change the name of the dictionary, add the -Name parameter with the new dictionary name to the Set-DlpKeywordDictionary command above.

Using keyword dictionaries in custom sensitive information types and DLP policies

Keyword dictionaries can be used as part of the match requirements for a custom sensitive information type, or as a sensitive information type themselves. Both require you to create a custom sensitive information type. Follow the instructions for creating a custom sensitive information type. Once you have the XML, you'll need the GUID identifier of the dictionary to use it.

<Entity id="9e5382d0-1b6a-42fd-820e-44e0d3b15b6e" patternsProximity="300" recommendedConfidence="75">
    <Pattern confidenceLevel="75">
        <IdMatch idRef=". . ."/>
    </Pattern>
</Entity>

To get the identity of your dictionary, run this command and copy the Identity property value:

Get-DlpKeywordDictionary -Name "Diseases"

The output of the command looks like this:

RunspaceId        : 138e55e7-ea1e-4f7a-b824-79f2c4252255
Identity          : 8d2d44b0-91f4-41f2-94e0-21c1c5b5fc9f
Name              : Diseases
Description       : Names of diseases and injuries from ICD-10-CM lexicon
KeywordDictionary : aarskog's syndrome, abandonment, abasia, abderhalden-kaufmann-lignac, abdominalgia, abduction contracture, abetalipo proteinemia, abiotrophy, ablatio, ablation, ablepharia, abocclusion, abolition, aborter, abortion, abortus, aboulomania, abrami's disease, abramo
IsValid           : True
ObjectState       : Unchanged

Paste the identity into your custom sensitive information type's XML and upload it. Your dictionary then appears in your list of sensitive information types, and you can use it directly in your policy, specifying how many keywords are required to match.

<Entity id="d333c6c2-5f4c-4131-9433-db3ef72a89e8" patternsProximity="300" recommendedConfidence="85">
    <Pattern confidenceLevel="85">
        <IdMatch idRef="8d2d44b0-91f4-41f2-94e0-21c1c5b5fc9f" />
    </Pattern>
</Entity>
<LocalizedStrings>
    <Resource idRef="d333c6c2-5f4c-4131-9433-db3ef72a89e8">
        <Name default="true" langcode="en-us">Diseases</Name>
        <Description default="true" langcode="en-us">Detects various diseases</Description>
    </Resource>
</LocalizedStrings>
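
Copying the GUID by hand is easy to get wrong, so the paste step can also be scripted. The sketch below is one possible approach, not part of the documented procedure: it assumes the Identity value is available as a GUID string (here a literal stands in for the value returned by Get-DlpKeywordDictionary), and the `DICTIONARY_ID` placeholder and variable names are hypothetical.

```powershell
# Stand-in for the Identity value copied from Get-DlpKeywordDictionary.
$dictionaryId = '8d2d44b0-91f4-41f2-94e0-21c1c5b5fc9f'

# Entity fragment with a placeholder where the dictionary GUID belongs.
$template = @'
<Entity id="d333c6c2-5f4c-4131-9433-db3ef72a89e8" patternsProximity="300" recommendedConfidence="85">
  <Pattern confidenceLevel="85">
    <IdMatch idRef="DICTIONARY_ID" />
  </Pattern>
</Entity>
'@

# Confirm the value is a well-formed GUID before splicing it in.
$guid = [guid]::Parse($dictionaryId)
$entityXml = $template.Replace('DICTIONARY_ID', $guid.ToString())

# Parse the result to verify that it is still well-formed XML.
$parsed = [xml]$entityXml
$parsed.Entity.Pattern.IdMatch.idRef
```

Parsing the GUID and the XML before uploading catches a truncated identifier or a broken tag locally instead of at upload time.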