Create custom sensitive information types with Exact Data Match based classification

Overview

Custom sensitive information types are used to help prevent inadvertent or inappropriate sharing of sensitive information. As an administrator, you can use the Security & Compliance Center or PowerShell to define a custom sensitive information type based on patterns, evidence (keywords such as employeebadgeID, and so on), character proximity (how close evidence is to characters in a particular pattern), and confidence levels. Such custom sensitive information types meet business needs for many organizations.

But what if you wanted a custom sensitive information type that uses exact data values, instead of matching only with generic patterns? With Exact Data Match (EDM)-based classification, you can create a custom sensitive information type that is designed to:

  • be dynamic and refreshable;
  • be more scalable;
  • result in fewer false positives;
  • work with structured sensitive data;
  • handle sensitive information more securely; and
  • be used with several Microsoft cloud services.

EDM-based classification

EDM-based classification enables you to create custom sensitive information types that refer to exact values in a database of sensitive information. The database can be refreshed daily or weekly, and it can contain up to 10 million rows of data. So as employees, patients, or clients come and go, and records change, your custom sensitive information types remain current and applicable. You can also use EDM-based classification with policies, such as data loss prevention (DLP) policies or Microsoft Cloud App Security file policies.

Required licenses and permissions

You must be a global admin, compliance administrator, or Exchange Online administrator to perform the tasks described in this article. To learn more about DLP permissions, see Permissions.

When generally available, EDM-based classification will be included in these subscriptions:

  • Office 365 E5
  • Microsoft 365 E5
  • Microsoft 365 Information Protection and Compliance
  • Office 365 Advanced Compliance

The workflow at a glance

Phase: Part 1: Set up EDM-based classification
(As needed: edit the database schema or remove the schema)
What's needed:
  • Read access to the sensitive data
  • Database schema in .xml format (example provided)
  • Rule package in .xml format (example provided)
  • Admin permissions to the Security & Compliance Center (using PowerShell)

Phase: Part 2: Index and upload the sensitive data
(As needed: refresh the data)
What's needed:
  • Custom security group and user account
  • Local admin access to machine with EDM Upload Agent
  • Read access to the sensitive data
  • Process and schedule for refreshing the data

Phase: Part 3: Use EDM-based classification with your Microsoft cloud services
What's needed:
  • Office 365 subscription with DLP
  • EDM-based classification feature enabled

Part 1: Set up EDM-based classification

Setting up and configuring EDM-based classification involves saving sensitive data in .csv format, defining a schema for your database of sensitive information, creating a rule package, and then uploading the schema and rule package.

Define the schema for your database of sensitive information

  1. Identify the sensitive information you want to use. Export the data to an app, such as Microsoft Excel, and save the file in .csv format. The data file can include up to:

    • 10 million rows of sensitive data
    • 32 columns (fields) per data source
    • 5 columns (fields) marked as searchable
  2. Structure the sensitive data in the .csv file such that the first row includes the names of the fields used for EDM-based classification. In your .csv file, you might have field names, such as "ssn", "birthdate", "firstname", "lastname", and so on. As an example, our .csv file is called PatientRecords.csv, and its columns include PatientID, MRN, LastName, FirstName, SSN, and more.

  3. Define the schema for the database of sensitive information in .xml format (similar to our example below). Name this schema file edm.xml, and configure it such that for each column in the database, there is a line that uses the following syntax:

    <Field name="" searchable=""/>

    • Use column names for Field name values.
    • Use searchable="true" for the fields that you want to be searchable, up to a maximum of 5 fields. You must designate a minimum of one field as searchable.

    As an example, the following .xml file defines the schema for a patient records database, with five fields specified as searchable: PatientID, MRN, SSN, Phone, and DOB.

    (You can copy, modify, and use our example.)

    <EdmSchema xmlns="http://schemas.microsoft.com/office/2018/edm">
          <DataStore name="PatientRecords" description="Schema for patient records" version="1">
                <Field name="PatientID" searchable="true" />
                <Field name="MRN" searchable="true" />
                <Field name="FirstName" />
                <Field name="LastName" />
                <Field name="SSN" searchable="true" />
                <Field name="Phone" searchable="true" />
                <Field name="DOB" searchable="true" />
                <Field name="Gender" />
                <Field name="Address" />
          </DataStore>
    </EdmSchema>
    
  4. Connect to Office 365 Security & Compliance Center PowerShell.

  5. To upload the database schema, run the following cmdlets, one at a time:

    $edmSchemaXml=Get-Content .\edm.xml -Encoding Byte -ReadCount 0
    New-DlpEdmSchema -FileData $edmSchemaXml -Confirm:$true
    

    You will be prompted to confirm, as follows:

    Confirm

    Are you sure you want to perform this action?

    New EDM Schema for the data store 'patientrecords' will be imported.

    [Y] Yes [A] Yes to All [N] No [L] No to All [?] Help (default is "Y"):

Tip

If you want your changes to occur without confirmation, in Step 5, use this cmdlet instead: New-DlpEdmSchema -FileData $edmSchemaXml

Note

It can take between 10 and 60 minutes to update the EDMSchema with additions. The update must complete before you execute steps that use the additions.
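The schema file from step 3 mirrors the header row of the .csv file, so it can be generated instead of typed by hand. The following is a minimal sketch under stated assumptions: New-EdmSchemaXml is a hypothetical helper (it is not part of any Microsoft module), the header is a plain comma-separated row with no commas inside field names, and field names need no XML escaping. It enforces the column and searchable-field limits listed in step 1.

```powershell
# Sketch: build the text of edm.xml from the header row of a .csv file.
# New-EdmSchemaXml is an illustrative name, not a Microsoft cmdlet.
function New-EdmSchemaXml {
    param(
        [Parameter(Mandatory)] [string]   $CsvPath,
        [Parameter(Mandatory)] [string]   $DataStoreName,
        [Parameter(Mandatory)] [string[]] $SearchableFields,
        [string] $Description = ''
    )

    # Read only the first row; assumes a simple comma-separated header.
    $header = Get-Content -Path $CsvPath -TotalCount 1
    $fields = @($header.Split(',') | ForEach-Object { $_.Trim().Trim('"') })

    # Limits stated in this article: 32 columns, 1 to 5 searchable fields.
    if ($fields.Count -gt 32) {
        throw "Too many columns: $($fields.Count) (maximum is 32)."
    }
    if ($SearchableFields.Count -lt 1 -or $SearchableFields.Count -gt 5) {
        throw 'Designate between 1 and 5 searchable fields.'
    }
    $unknown = @($SearchableFields | Where-Object { $fields -notcontains $_ })
    if ($unknown.Count -gt 0) {
        throw "Searchable fields not found in header: $($unknown -join ', ')"
    }

    # Emit one <Field> line per column, in header order.
    $lines = @('<EdmSchema xmlns="http://schemas.microsoft.com/office/2018/edm">')
    $lines += '  <DataStore name="{0}" description="{1}" version="1">' -f $DataStoreName, $Description
    foreach ($f in $fields) {
        if ($SearchableFields -contains $f) {
            $lines += '    <Field name="{0}" searchable="true" />' -f $f
        } else {
            $lines += '    <Field name="{0}" />' -f $f
        }
    }
    $lines += '  </DataStore>'
    $lines += '</EdmSchema>'
    $lines -join [Environment]::NewLine
}
```

For example, `New-EdmSchemaXml -CsvPath .\PatientRecords.csv -DataStoreName PatientRecords -SearchableFields PatientID,MRN,SSN,Phone,DOB | Set-Content .\edm.xml` would write a schema file for the sample data. Review the generated file before uploading it.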

Now that the schema for your database of sensitive information is defined, the next step is to set up a rule package. Proceed to the section Set up a rule package.

Editing the schema for EDM-based classification

If you want to make changes to your edm.xml file, such as changing which fields are used for EDM-based classification, follow these steps:

  1. Edit your edm.xml file (this is the file discussed in the Define the schema section of this article).

  2. Connect to Office 365 Security & Compliance Center PowerShell.

  3. To update your database schema, run the following cmdlets, one at a time:

    $edmSchemaXml=Get-Content .\edm.xml -Encoding Byte -ReadCount 0
    Set-DlpEdmSchema -FileData $edmSchemaXml -Confirm:$true
    

    You will be prompted to confirm, as follows:

    Confirm

    Are you sure you want to perform this action?

    EDM Schema for the data store 'patientrecords' will be updated.

    [Y] Yes [A] Yes to All [N] No [L] No to All [?] Help (default is "Y"):

    Tip

    If you want your changes to occur without confirmation, in Step 3, use this cmdlet instead: Set-DlpEdmSchema -FileData $edmSchemaXml

    Note

    It can take between 10 and 60 minutes to update the EDMSchema with additions. The update must complete before you execute steps that use the additions.
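Because each schema update can take a while to propagate, it can help to check an edited edm.xml locally before running Set-DlpEdmSchema. The following sketch is only a local sanity check under the limits stated in this article (at most 32 fields, between one and five searchable fields). Test-EdmSchemaXml is a hypothetical helper, and the service may apply additional validation that this check does not cover.

```powershell
# Sketch: local sanity check of schema text before uploading it.
# Test-EdmSchemaXml is an illustrative name, not a Microsoft cmdlet.
function Test-EdmSchemaXml {
    param([Parameter(Mandatory)] [string] $XmlText)

    # A failed [xml] cast means the text is not well-formed XML.
    try { $doc = [xml]$XmlText }
    catch { return "Not well-formed XML: $($_.Exception.Message)" }

    $problems = @()
    # local-name() lets the query ignore the schema's default namespace.
    $fields     = $doc.SelectNodes("//*[local-name()='Field']")
    $searchable = @($fields | Where-Object { $_.GetAttribute('searchable') -eq 'true' })
    $duplicates = @($fields | Group-Object { $_.GetAttribute('name') } |
                    Where-Object { $_.Count -gt 1 })

    if ($fields.Count -lt 1)     { $problems += 'No Field elements found.' }
    if ($fields.Count -gt 32)    { $problems += "Too many fields: $($fields.Count) (maximum is 32)." }
    if ($searchable.Count -lt 1) { $problems += 'At least one field must be searchable.' }
    if ($searchable.Count -gt 5) { $problems += "Too many searchable fields: $($searchable.Count) (maximum is 5)." }
    foreach ($d in $duplicates)  { $problems += "Duplicate field name: $($d.Name)" }

    # An empty result means no problems were found.
    $problems
}
```

A possible use is `@(Test-EdmSchemaXml -XmlText (Get-Content .\edm.xml -Raw))`, which returns an empty array when the file passes these checks.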

Removing the schema for EDM-based classification

(As needed) If you want to remove the schema you're using for EDM-based classification, follow these steps:

  1. Connect to Office 365 Security & Compliance Center PowerShell.

  2. Run the following PowerShell cmdlet, replacing the data store name "patientrecords" with the name of the data store you want to remove:

    Remove-DlpEdmSchema -Identity patientrecords
    

    You will be prompted to confirm, as follows:

    Confirm

    Are you sure you want to perform this action?

    EDM Schema for the data store 'patientrecords' will be removed.

    [Y] Yes [A] Yes to All [N] No [L] No to All [?] Help (default is "Y"):

    Tip

    If you want your changes to occur without confirmation, in Step 2, use this cmdlet instead: Remove-DlpEdmSchema -Identity patientrecords -Confirm:$false

Set up a rule package

  1. Create a rule package in .xml format (with Unicode encoding), similar to the following example. (You can copy, modify, and use our example.)

    When you set up your rule package, make sure to correctly reference your .csv file and edm.xml file. In this sample .xml file, the following fields need to be customized to create your EDM sensitive type:

    • RulePack id & ExactMatch id: Use New-GUID to generate a GUID.

    • Datastore: This field specifies the EDM lookup data store to be used. You provide the data source name of a configured EDM schema.

    • idMatch: This field points to the primary element for EDM.

      • Matches: Specifies the field to be used in the exact lookup. You provide a searchable field name from the EDM schema for the DataStore.
      • Classification: This field specifies the sensitive type match that triggers the EDM lookup. You can provide the name or GUID of an existing built-in or custom classification.
    • Match: This field points to additional evidence found in proximity of idMatch.

      • Matches: You provide any field name from the EDM schema for the DataStore.
    • Resource: This section specifies the name and description for the sensitive type in multiple locales.

      • idRef: You provide the GUID of the ExactMatch id.
      • Name & descriptions: Customize as required.
    <RulePackage xmlns="http://schemas.microsoft.com/office/2018/edm">
      <RulePack id="fd098e03-1796-41a5-8ab6-198c93c62b11">
        <Version build="0" major="2" minor="0" revision="0" />
        <Publisher id="eb553734-8306-44b4-9ad5-c388ad970528" />
        <Details defaultLangCode="en-us">
          <LocalizedDetails langcode="en-us">
            <PublisherName>IP DLP</PublisherName>
            <Name>Health Care EDM Rulepack</Name>
            <Description>This rule package contains the EDM sensitive type for health care sensitive types.</Description>
          </LocalizedDetails>
        </Details>
      </RulePack>
      <Rules>
        <ExactMatch id = "E1CC861E-3FE9-4A58-82DF-4BD259EAB371" patternsProximity = "300" dataStore ="PatientRecords" recommendedConfidence = "65" >
          <Pattern confidenceLevel="65">
            <idMatch matches = "SSN" classification = "U.S. Social Security Number (SSN)" />
          </Pattern>
          <Pattern confidenceLevel="75">
            <idMatch matches = "SSN" classification = "U.S. Social Security Number (SSN)" />
            <Any minMatches ="3" maxMatches ="100">
              <match matches="PatientID" />
              <match matches="MRN"/>
              <match matches="FirstName"/>
              <match matches="LastName"/>
              <match matches="Phone"/>
              <match matches="DOB"/>
            </Any>
          </Pattern>
        </ExactMatch>
        <LocalizedStrings>
          <Resource idRef="E1CC861E-3FE9-4A58-82DF-4BD259EAB371">
            <Name default="true" langcode="en-us">Patient SSN Exact Match.</Name>
            <Description default="true" langcode="en-us">EDM Sensitive type for detecting Patient SSN.</Description>
          </Resource>
        </LocalizedStrings>
      </Rules>
    </RulePackage>
    
  2. Upload the rule package by running the following PowerShell cmdlets, one at a time:

    $rulepack=Get-Content .\rulepack.xml -Encoding Byte -ReadCount 0
    New-DlpSensitiveInformationTypeRulePackage -FileData $rulepack
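If you start from the sample rule package, the RulePack id and ExactMatch id need fresh GUIDs, and each Resource idRef has to keep pointing at its ExactMatch id. The following sketch shows one way to do that consistently before the upload. Set-EdmRulePackGuids is a hypothetical helper; it uses New-Guid as the list above suggests, leaves the Publisher id untouched, and does not handle the Unicode encoding of the saved file.

```powershell
# Sketch: give a copied rule package fresh GUIDs while keeping idRef in sync.
# Set-EdmRulePackGuids is an illustrative name, not a Microsoft cmdlet.
function Set-EdmRulePackGuids {
    param([Parameter(Mandatory)] [string] $RulePackXmlText)

    $doc = [xml]$RulePackXmlText

    # New id for the rule package itself.
    $rulePack = $doc.SelectSingleNode("//*[local-name()='RulePack']")
    $rulePack.SetAttribute('id', (New-Guid).ToString())

    # New id for each ExactMatch, mirrored into every Resource that referenced the old id.
    foreach ($exactMatch in $doc.SelectNodes("//*[local-name()='ExactMatch']")) {
        $oldId = $exactMatch.GetAttribute('id')
        $newId = (New-Guid).ToString()
        $exactMatch.SetAttribute('id', $newId)

        foreach ($resource in $doc.SelectNodes("//*[local-name()='Resource']")) {
            if ($resource.GetAttribute('idRef') -eq $oldId) {
                $resource.SetAttribute('idRef', $newId)
            }
        }
    }

    # Return the updated XML as text; saving and encoding are left to the caller.
    $doc.OuterXml
}
```

The function returns the updated XML as a string, so you can inspect it before saving it as rulepack.xml.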
    

At this point, you have set up EDM-based classification. The next step is to index the sensitive data, and then upload the indexed data.

Recall from the previous procedure that our PatientRecords schema defines five fields as searchable: PatientID, MRN, SSN, Phone, and DOB. Our example rule package includes those fields and references the data store defined in the database schema file (edm.xml). Consider the following ExactMatch item:

<ExactMatch id = "E1CC861E-3FE9-4A58-82DF-4BD259EAB371" patternsProximity = "300" dataStore ="PatientRecords" recommendedConfidence = "65" >
      <Pattern confidenceLevel="65">
        <idMatch matches = "SSN" classification = "U.S. Social Security Number (SSN)" />
      </Pattern>
      <Pattern confidenceLevel="75">
        <idMatch matches = "SSN" classification = "U.S. Social Security Number (SSN)" />
        <Any minMatches ="3" maxMatches ="100">
          <match matches="PatientID" />
          <match matches="MRN"/>
          <match matches="FirstName"/>
          <match matches="LastName"/>
          <match matches="Phone"/>
          <match matches="DOB"/>
        </Any>
      </Pattern>
    </ExactMatch>

In this example, note the following:

  • The dataStore name references the data store defined in the database schema, which in our example has the same name as the .csv file we created earlier: dataStore = "PatientRecords".

  • The idMatch value references a searchable field that is listed in the database schema file: idMatch matches = "SSN".

  • The classification value references an existing or custom sensitive information type: classification = "U.S. Social Security Number (SSN)". (In this case, we use the existing sensitive information type of U.S. Social Security Number.)

Note

It can take between 10 and 60 minutes to update the EDMSchema with additions. The update must complete before you execute steps that use the additions.
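The three references noted above (dataStore, idMatch, and match) can be cross-checked against the schema before you upload the rule package. The following is a sketch under stated assumptions: Test-EdmRulePackReferences is a hypothetical helper, the schema contains a single DataStore element, and names are compared without regard to case. It checks only that names line up; it does not verify the classification value.

```powershell
# Sketch: check that a rule package only references names defined in the schema.
# Test-EdmRulePackReferences is an illustrative name, not a Microsoft cmdlet.
function Test-EdmRulePackReferences {
    param(
        [Parameter(Mandatory)] [string] $SchemaXmlText,
        [Parameter(Mandatory)] [string] $RulePackXmlText
    )

    $schema   = [xml]$SchemaXmlText
    $rulePack = [xml]$RulePackXmlText
    $problems = @()

    # Collect the data store name, all field names, and the searchable subset.
    $storeName  = $schema.SelectSingleNode("//*[local-name()='DataStore']").GetAttribute('name')
    $allFields  = @($schema.SelectNodes("//*[local-name()='Field']") |
                    ForEach-Object { $_.GetAttribute('name') })
    $searchable = @($schema.SelectNodes("//*[local-name()='Field'][@searchable='true']") |
                    ForEach-Object { $_.GetAttribute('name') })

    foreach ($exactMatch in $rulePack.SelectNodes("//*[local-name()='ExactMatch']")) {
        $ds = $exactMatch.GetAttribute('dataStore')
        if ($ds -ne $storeName) {
            $problems += "dataStore '$ds' does not match schema data store '$storeName'."
        }
        # idMatch must name a searchable field.
        foreach ($idMatch in $exactMatch.SelectNodes(".//*[local-name()='idMatch']")) {
            $name = $idMatch.GetAttribute('matches')
            if ($searchable -notcontains $name) {
                $problems += "idMatch '$name' is not a searchable field in the schema."
            }
        }
        # match may name any field in the schema.
        foreach ($match in $exactMatch.SelectNodes(".//*[local-name()='match']")) {
            $name = $match.GetAttribute('matches')
            if ($allFields -notcontains $name) {
                $problems += "match '$name' is not a field in the schema."
            }
        }
    }

    # An empty result means every reference resolved.
    $problems
}
```

An empty result means every reference in the rule package resolves to a name in the schema.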

Part 2: Index and upload the sensitive data

During this phase, you set up a custom security group and user account, and set up the EDM Upload Agent tool. Then, you use the tool to index the sensitive data, and upload the indexed data.

Set up the security group and user account

  1. As a global administrator, go to the admin center (https://admin.microsoft.com) and create a security group called EDM_DataUploaders.

  2. Add one or more users to the EDM_DataUploaders security group. (These users will manage the database of sensitive information.)

  3. Make sure each user who is managing the sensitive data is a local admin on the machine used for the EDM Upload Agent.

Set up the EDM Upload Agent

Note

Before you begin this procedure, make sure that you are a member of the EDM_DataUploaders security group and a local admin on your machine.

  1. Download and install the EDM Upload Agent. By default, the installation location should be C:\Program Files\Microsoft\EdmUploadAgent.

    Tip

    To get a list of the supported command parameters, run the agent with no arguments. For example: EdmUploadAgent.exe

  2. To authorize the EDM Upload Agent, open Windows Command Prompt (as an administrator), and then run the following command:

    EdmUploadAgent.exe /Authorize

  3. Sign in with your work or school account for Office 365.

The next step is to use the EDM Upload Agent to index the sensitive data, and then upload the indexed data.

Index and upload the sensitive data

Save the sensitive data file (recall our example is PatientRecords.csv) to the local drive on the machine. (We saved our example PatientRecords.csv file to C:\Edm\Data.)

To index and upload the sensitive data, run the following command in Windows Command Prompt:

EdmUploadAgent.exe /UploadData /DataStoreName <DataStoreName> /DataFile <DataFilePath> /HashLocation <HashedFileLocation>

Example: EdmUploadAgent.exe /UploadData /DataStoreName PatientRecords /DataFile C:\Edm\Data\PatientRecords.csv /HashLocation C:\Edm\Hash

To index the sensitive data in an isolated environment, run the index and upload steps separately.

To index the sensitive data, run the following command in Windows Command Prompt:

EdmUploadAgent.exe /CreateHash /DataFile <DataFilePath> /HashLocation <HashedFileLocation>

For example:

EdmUploadAgent.exe /CreateHash /DataFile C:\Edm\Data\PatientRecords.csv /HashLocation C:\Edm\Hash

To upload the indexed data, run the following command in Windows Command Prompt:

EdmUploadAgent.exe /UploadHash /DataStoreName <DataStoreName> /HashFile <HashedSourceFilePath>

For example:

EdmUploadAgent.exe /UploadHash /DataStoreName PatientRecords /HashFile C:\Edm\Hash\PatientRecords.EdmHash

To verify your sensitive data has been uploaded, run the following command in Windows Command Prompt:

EdmUploadAgent.exe /GetDataStore

You'll see a list of data stores and when they were last updated.
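The separate index and upload commands in this section share the same three inputs: the data store name, the .csv path, and the hash location. The following sketch derives both argument strings from those inputs. Get-EdmUploadArguments is a hypothetical helper, and the hash file name (the .csv base name with an .EdmHash extension) is an assumption taken from the example above rather than documented behavior. Paths that contain spaces would need additional quoting.

```powershell
# Sketch: build the argument strings for the separate index and upload steps.
# Get-EdmUploadArguments is an illustrative name, not part of the EDM Upload Agent.
function Get-EdmUploadArguments {
    param(
        [Parameter(Mandatory)] [string] $DataStoreName,
        [Parameter(Mandatory)] [string] $DataFile,
        [Parameter(Mandatory)] [string] $HashLocation
    )

    # Assumption from the article's example: PatientRecords.csv is hashed
    # to PatientRecords.EdmHash in the hash location.
    $fileName = ($DataFile -split '[\\/]')[-1]
    $baseName = $fileName -replace '\.csv$', ''
    $hashFile = $HashLocation.TrimEnd('\') + '\' + $baseName + '.EdmHash'

    [pscustomobject]@{
        CreateHash = "/CreateHash /DataFile $DataFile /HashLocation $HashLocation"
        UploadHash = "/UploadHash /DataStoreName $DataStoreName /HashFile $hashFile"
    }
}
```

Each string can then be passed to EdmUploadAgent.exe, for example with Start-Process and its -ArgumentList parameter, running the CreateHash step first and the UploadHash step second.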

Proceed to set up your process and schedule for Refreshing your sensitive information database.

At this point, you are ready to use EDM-based classification with your Microsoft cloud services. For example, you can set up a DLP policy using EDM-based classification.

Refreshing your sensitive information database

You can refresh your sensitive information database daily or weekly, and the EDM Upload Agent can reindex the sensitive data and then reupload the indexed data.

  1. Determine your process and frequency (daily or weekly) for refreshing the database of sensitive information.

  2. Re-export the sensitive data to an app, such as Microsoft Excel, and save the file in .csv format. Keep the same file name and location you used when you followed the steps described in Index and upload the sensitive data.

    Note

    If there are no changes to the structure (field names) of the .csv file, you won't need to make any changes to your database schema file when you refresh the data. But if you must make changes, make sure to edit the database schema and your rule package accordingly.

  3. Use Task Scheduler to automate the index and upload commands in the Index and upload the sensitive data procedure. You can schedule tasks using several methods:

    • Windows PowerShell: See the ScheduledTasks documentation and the example PowerShell script in this article.
    • Task Scheduler API: See the Task Scheduler documentation.
    • Windows user interface: In Windows, click Start, and type Task Scheduler. Then, in the list of results, right-click Task Scheduler, and choose Run as administrator.
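The note in step 2 depends on the .csv field names still matching the schema. A scheduled refresh can check this before reindexing. The following sketch compares the header row of the re-exported .csv file with the Field names in edm.xml. Compare-EdmCsvHeaderToSchema is a hypothetical helper, and it assumes a plain comma-separated header row.

```powershell
# Sketch: detect drift between the .csv header row and the schema's Field names.
# Compare-EdmCsvHeaderToSchema is an illustrative name, not a Microsoft cmdlet.
function Compare-EdmCsvHeaderToSchema {
    param(
        [Parameter(Mandatory)] [string] $CsvPath,
        [Parameter(Mandatory)] [string] $SchemaPath
    )

    # Field names from the first row of the .csv file.
    $header    = Get-Content -Path $CsvPath -TotalCount 1
    $csvFields = @($header.Split(',') | ForEach-Object { $_.Trim().Trim('"') })

    # Field names from the schema; local-name() ignores the default namespace.
    $schema       = [xml](Get-Content -Path $SchemaPath -Raw)
    $schemaFields = @($schema.SelectNodes("//*[local-name()='Field']") |
                      ForEach-Object { $_.GetAttribute('name') })

    [pscustomobject]@{
        MissingFromCsv    = @($schemaFields | Where-Object { $csvFields -notcontains $_ })
        MissingFromSchema = @($csvFields | Where-Object { $schemaFields -notcontains $_ })
    }
}
```

If both lists come back empty, the structure is unchanged and the schema and rule package can stay as they are. Otherwise, edit the schema and rule package before uploading the refreshed data.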

Example PowerShell script for Task Scheduler

This section includes an example PowerShell script you can use to schedule your tasks for indexing data and uploading the indexed data:

To schedule index and upload in a combined step

param([string]$dataStoreName,[string]$fileLocation)
# Assuming current user is also the user context to run the task
$user = "$env:USERDOMAIN\$env:USERNAME"
$edminstallpath = 'C:\Program Files\Microsoft\EdmUploadAgent\'
$edmuploader = $edminstallpath + 'EdmUploadAgent.exe'
$csvext = '.csv'
# Assuming CSV file name is same as data store name
$dataFile = "$fileLocation\$dataStoreName$csvext"
# Assuming location to store hash file is same as the location of csv file
$hashLocation = $fileLocation
$uploadDataArgs = '/UploadData /DataStoreName ' + $dataStoreName + ' /DataFile ' + $dataFile + ' /HashLocation ' + $hashLocation
# Set up actions associated with the task
$actions = @()
$actions += New-ScheduledTaskAction -Execute $edmuploader -Argument $uploadDataArgs -WorkingDirectory $edminstallpath
# Set up trigger for the task
$trigger = New-ScheduledTaskTrigger -Weekly -DaysOfWeek Sunday -At 2am
# Set up task settings
$principal = New-ScheduledTaskPrincipal -UserId $user -LogonType S4U -RunLevel Highest
$settings = New-ScheduledTaskSettingsSet -RunOnlyIfNetworkAvailable -StartWhenAvailable -WakeToRun
# Create the scheduled task
$scheduledTask = New-ScheduledTask -Action $actions -Principal $principal -Trigger $trigger -Settings $settings
# Get credentials to run the task
$creds = Get-Credential -UserName $user -Message "Enter credentials to run the task"
$password=[Runtime.InteropServices.Marshal]::PtrToStringAuto([Runtime.InteropServices.Marshal]::SecureStringToBSTR($creds.Password))
# Register the scheduled task
$taskName = 'EDMUpload_' + $dataStoreName
Register-ScheduledTask -TaskName $taskName -InputObject $scheduledTask -User $user -Password $password

To schedule index and upload as separate steps

param([string]$dataStoreName,[string]$fileLocation)
# Assuming current user is also the user context to run the task
$user = "$env:USERDOMAIN\$env:USERNAME"
$edminstallpath = 'C:\Program Files\Microsoft\EdmUploadAgent\'
$edmuploader = $edminstallpath + 'EdmUploadAgent.exe'
$csvext = '.csv'
$edmext = '.EdmHash'
# Assuming CSV file name is same as data store name
$dataFile = "$fileLocation\$dataStoreName$csvext"
$hashFile = "$fileLocation\$dataStoreName$edmext"
# Assuming location to store hash file is same as the location of csv file
$hashLocation = $fileLocation
$createHashArgs = '/CreateHash' + ' /DataFile ' + $dataFile + ' /HashLocation ' + $hashLocation
$uploadHashArgs = '/UploadHash /DataStoreName ' + $dataStoreName + ' /HashFile ' + $hashFile
# Set up actions associated with the task
$actions = @()
$actions += New-ScheduledTaskAction -Execute $edmuploader -Argument $createHashArgs -WorkingDirectory $edminstallpath
$actions += New-ScheduledTaskAction -Execute $edmuploader -Argument $uploadHashArgs -WorkingDirectory $edminstallpath
# Set up trigger for the task
$trigger = New-ScheduledTaskTrigger -Weekly -DaysOfWeek Sunday -At 2am
# Set up task settings
$principal = New-ScheduledTaskPrincipal -UserId $user -LogonType S4U -RunLevel Highest
$settings = New-ScheduledTaskSettingsSet -RunOnlyIfNetworkAvailable -StartWhenAvailable -WakeToRun
# Create the scheduled task
$scheduledTask = New-ScheduledTask -Action $actions -Principal $principal -Trigger $trigger -Settings $settings
# Get credentials to run the task
$creds = Get-Credential -UserName $user -Message "Enter credentials to run the task"
$password=[Runtime.InteropServices.Marshal]::PtrToStringAuto([Runtime.InteropServices.Marshal]::SecureStringToBSTR($creds.Password))
# Register the scheduled task
$taskName = 'EDMUpload_' + $dataStoreName
Register-ScheduledTask -TaskName $taskName -InputObject $scheduledTask -User $user -Password $password

Part 3: Use EDM-based classification with your Microsoft cloud services

Office 365 DLP for Exchange Online (email), OneDrive for Business (files), Microsoft Teams (conversations) and Microsoft Cloud App Security DLP policies will support EDM sensitive information types.

EDM sensitive information types for the following scenarios are currently in development, but not yet available:

  • Office 365 DLP for SharePoint (files)
  • Auto-classification of sensitivity labels and retention labels

To create a DLP policy with EDM

  1. Go to the Security & Compliance Center (https://protection.office.com).

  2. Choose Data loss prevention > Policy.

  3. Choose Create a policy > Custom > Next.

  4. On the Name your policy tab, specify a name and description, and then choose Next.

  5. On the Choose locations tab, select Let me choose specific locations, and then choose Next.

  6. In the Status column, select Exchange email, OneDrive accounts, Teams chat and channel message, and then choose Next. (Note: EDM is currently not supported for SharePoint sites, so a DLP policy will not detect EDM matches in SharePoint files.)

  7. On the Policy settings tab, choose Use advanced settings, and then choose Next.

  8. Choose + New rule.

  9. In the Name section, specify a name and description for the rule.

  10. In the Conditions section, in the + Add a condition list, choose Content contains sensitive type.

  11. Search for the sensitive information type you created when you set up your rule package, and then choose + Add. Then choose Done.

  12. Finish selecting options for your rule, such as User notifications, User overrides, Incident reports, and so on, and then choose Save.

  13. On the Policy settings tab, review your rules, and then choose Next.

  14. Specify whether to turn on the policy right away, test it out, or keep it turned off. Then choose Next.

  15. On the Review your settings tab, review your policy. Make any needed changes. When you're ready, choose Create.

Note

Allow approximately one hour for your new DLP policy to work its way through your data center.

Built-in sensitive information types and what they look for

Custom sensitive information types

Overview of DLP policies

Microsoft Cloud App Security

New-DlpEdmSchema
