Create an EDM SIT schema and rule package (New experience)

Article
12/11/2023

In the new experience, you can create both the exact data match (EDM) schema and the EDM sensitive information type (SIT) in a single workflow in the Microsoft Purview compliance portal.

Tip

If you're not an E5 customer, use the 90-day Microsoft Purview solutions trial to explore how additional Purview capabilities can help your organization manage data security and compliance needs. Start now at the Microsoft Purview compliance portal trials hub. Learn details about signing up and trial terms.

Applies to

New experience

If you want to create an EDM SIT using the classic experience, see Create an EDM SIT (Classic experience). If you need help deciding which experience to use, see Choosing the right EDM SIT creation experience for you.

Before you begin

Make sure to complete the steps in the following articles before beginning the procedures described in this article.

If you aren't familiar with EDM-based SITs or their implementation, it's essential to familiarize yourself with the concepts in the following articles:

Permissions

You must have Global admin or Compliance admin permissions to create, test, and deploy a custom sensitive information type through the compliance portal. For more information about roles and permissions, see About admin roles in Office 365.

Important considerations

Take the following considerations into account when creating your EDM schema and EDM-based SITs.

Before selecting your primary elements, review the built-in SITs to get an idea of which ones best meet your needs.
For each EDM SIT, you must select at least one primary element and no more than 10 primary elements. If you have a multi-token corroborative data field, choose one of the following options:
1. Select Multi-token matching.
2. Map the elements in your multi-token fields to a SIT that can detect that content. (The more fields with unique values that you map, the more accurate your EDM SIT will be. Mapping multiple fields also helps improve performance, reduce the processing load, and avoid system timeouts.)
When selecting the primary elements for your SIT, select fields that ensure each row in your data table is unique. For example, don't use fields like FirstName or DateOfBirth. Why? Because first names and dates of birth are likely to be duplicated throughout your sensitive data table. Instead, use fields with unique values, such as SocialSecurityNumber or BankAccountNumber.
Recommendation: Build your EDM schema from a sample data file. In following this recommendation, make sure that your sample data file adheres to the following requirements:
- Data must be organized as a table, with columns and rows. Use your field names for the column headers. (The rows in your table correspond with your individual data items.)
- Field names can include only alphanumeric characters.
- Field names must start with a letter and must consist of at least three alphanumeric characters.
- If these naming requirements are not met, errors may occur when uploading your sample data file.
If you use a sample file of sensitive information to configure your EDM SIT, the system suggests, for each field, the existing SIT that best detects the uploaded data, if such a SIT is available. Microsoft Purview defaults to single-token matching for detecting sensitive content, so if no existing SIT can detect the data for a field in your EDM schema, the single-token matching mode is applied to that field. Verify that the SIT suggested for each element detects the exact string you want to monitor:
1. Make sure that the suggested SIT doesn't contain any surrounding characters that differ from the content you want to detect.
2. Make sure that the suggested SIT doesn't exclude any valid portion of the string as stored in your sensitive information table.
3. Make sure that the SIT you use closely matches the format of the data you want to detect. For example, look for something like "Nine digits with optional hyphens or spaces" rather than simply "digits", or "A combination of 32 characters consisting of letters and digits" rather than simply "text strings".
  
  Using SITs that closely match the format of the data you're trying to detect is another way to improve the accuracy of your results and shorten the time it takes for the matching to complete.
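
The sample-file requirements and the uniqueness guidance above can be checked locally before you upload anything. The following is a minimal Python sketch, not part of the Purview tooling. The function names and the sample data are hypothetical, and it assumes that "alphanumeric" means ASCII letters and digits. It flags field names that break the naming rules and lists the columns whose values never repeat, which are the likeliest candidates for primary elements.

```python
import csv
import io
import re

# Naming rule from the requirements above: alphanumeric only, starts with
# a letter, at least three characters long (ASCII is an assumption here).
FIELD_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]{2,}")


def invalid_field_names(headers):
    """Return the headers that break the naming requirements."""
    return [h for h in headers if not FIELD_NAME_PATTERN.fullmatch(h)]


def unique_columns(rows, headers):
    """Return the headers whose values never repeat across rows."""
    result = []
    for header in headers:
        values = [row[header] for row in rows]
        if len(set(values)) == len(values):
            result.append(header)
    return result


# Hypothetical sample data, invented for this example only.
SAMPLE_CSV = """FirstName,DateOfBirth,SocialSecurityNumber
Ana,1990-01-01,123-45-6789
Ana,1985-06-15,987-65-4321
Luis,1990-01-01,555-12-3456
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
headers = reader.fieldnames

print("Invalid field names:", invalid_field_names(headers))
print("Candidate primary elements:", unique_columns(rows, headers))
```

In this made-up table, FirstName and DateOfBirth both contain repeated values, so only SocialSecurityNumber is reported as a candidate primary element. A small sample can make a column look unique when the full table is not, so treat the result as a hint rather than a guarantee.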

Note

All of your data is saved as you navigate forward (Next) and backward (Back) through the tool while making your selections. Backward navigation only supports moving from top-level page to top-level page and from sub-page to sub-page. You can't navigate backward from a top-level page to the preceding sub-page or from a sub-page to a preceding top-level page.

Create your EDM schema and SIT

The following procedure provides step-by-step guidance for creating your EDM schema and SITs using the new experience. For a conceptual overview and diagram of the process as a whole, see Overview of the EDM workflow (New experience).

Instructions

In the compliance portal for your tenant, go to Data classification > EDM classifiers.
Make sure the New EDM experience toggle is set to On.
Choose + Create EDM classifier.
Name the SIT and add a description. The system uses this name, appended with the word schema, for the associated schema it generates.
Choose Next.
Select the method you want to use for your schema: either Upload a file containing sample data, or Manually define your data structure. (Best practice is to upload a sample data file. The rest of this procedure assumes this option.)

In either case, you need the information discussed in Create an EDM SIT sample file (New experience) for your sample file.
Choose Next.
Select your sample file and then select Upload file. (If errors display during the upload, address them and then try again.)
Choose Next.
On the Select primary elements page:
1. In the Primary element column, select your primary element. Each primary element must be mapped to a SIT. Best practice is to select fields that show Full match under the Match Validation column.
2. In the Match mode column for each field, designate which of the following matching options to apply:
  - Option 1: Do nothing to accept the system-suggested SIT.
  - Option 2: Expand the dropdown menu. Under Sensitive Info type (SIT), choose the pencil (Edit) icon and then select another existing SIT.
  - Option 3: Under Match mode select Single token.
  - Option 4: Under Match mode select Multi-token.
Choose Next.
Configure settings for data in selected columns.
- The toggle Use the same settings for all columns is set to On by default. If you want to use separate settings for each data field, set the toggle to Off.
- The Data in columns are case-insensitive option is selected by default. To enforce case-sensitive detection, uncheck this box.
- If needed, select the option to Ignore delimiters and punctuation for data in all columns. You can then either select the delimiters and punctuation marks to ignore from a list or enter custom delimiters and punctuation marks to ignore.

Important

If you select the Ignore Delimiters option for the primary element column in your schema, make sure that the SIT you map it to is designed to match data both with and without the selected delimiters.

Choose Submit.
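
The Important note about the Ignore delimiters option can be illustrated with a short Python sketch. Both regular expressions below are assumptions written for this example. They are not the definitions of any built-in SIT. The first follows the "nine digits with optional hyphens or spaces" style of format, and the second is a stricter hypothetical pattern that requires the hyphens.

```python
import re

# Assumed pattern: nine digits with optional hyphens or spaces.
# Illustration only, not the definition of any built-in SIT.
NINE_DIGITS_OPTIONAL_DELIMS = re.compile(r"\b\d{3}[- ]?\d{2}[- ]?\d{4}\b")

# Stricter hypothetical pattern: the hyphens are mandatory.
NINE_DIGITS_HYPHENS_ONLY = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Made-up values showing the same number with and without delimiters.
samples = ["123-45-6789", "123 45 6789", "123456789"]

for text in samples:
    flexible = bool(NINE_DIGITS_OPTIONAL_DELIMS.search(text))
    strict = bool(NINE_DIGITS_HYPHENS_ONLY.search(text))
    print(f"{text!r}: flexible={flexible}, strict={strict}")
```

A pattern like the stricter one would miss values written without hyphens, which is the situation the note warns about: if delimiters are ignored for a primary element, the mapped SIT should recognize the value in both forms.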

Once you're finished, EDM automatically generates one detection rule for each of the primary elements you identified. It also creates a high-confidence rule and a medium-confidence rule. High-confidence rules have more matching requirements than medium-confidence rules, which, in turn, have more requirements than low-confidence rules. (Low-confidence rules must be created manually.) You can review and edit these rules on the Configure detection rules for primary elements page.

Tip

Elements that aren't selected as primary can still be used as corroborative (supporting) evidence. The more supporting elements found within a defined proximity of a primary element, the higher the confidence that the match is a true positive.

Recommendations

Wait at least one hour after creating or editing a schema before downloading and using it for the EDM data upload. This helps ensure that the schema has synced with the system. If a schema is downloaded too soon, an error message might display when attempting to download the schema via the command line.
Do not use the EDM Upload Agent to download, manually edit, and then re-upload a schema. Doing so results in an error because using the EDM Upload Agent to download a schema adds tags to the schema that don't pass schema creation checks.
To help ensure that all corroborative evidence is detected, take one of the following actions:
- Trim multi-token corroborative evidence fields to the maximum number of tokens supported by the multi-token feature (currently five tokens).
- Map the multi-token field to a SIT that can fully detect the multi-token data.
After creating or editing your EDM SIT, test it using the following PowerShell cmdlet and then wait 24 hours before testing it in a data loss prevention (DLP) policy solution.

Test-DataClassification -ClassificationNames "[Your EDM sensitive info type]" -TextToClassify "[your own text to scan for matches]"

Next step

Hash and upload the sensitive information source table for EDM SITs