Learn about sensitive information types

Identifying and classifying sensitive items that are under your organization's control is the first step in the Information Protection discipline. Microsoft 365 provides three ways of identifying items so that they can be classified:

  • manually by users
  • automated pattern recognition, like sensitive information types
  • machine learning

Sensitive information types are pattern-based classifiers. They detect sensitive information like social security, credit card, or bank account numbers to identify sensitive items. For details, see Sensitive information types entity definitions.

Sensitive information types are used in

Fundamental parts of a sensitive information type

Every sensitive information type entity is defined by these fields:

  • name: how the sensitive information type is referred to
  • description: describes what the sensitive information type is looking for
  • pattern: a pattern defines what a sensitive information type detects. It consists of the following components:
    • Primary element – the main element that the sensitive information type is looking for. It can be a regular expression with or without a checksum validation, a keyword list, a keyword dictionary, or a function.
    • Supporting element – elements that act as supporting evidence and help increase the confidence of the match. For example, the keyword "SSN" in proximity to an SSN number. It can be a regular expression with or without a checksum validation, a keyword list, or a keyword dictionary.
    • Confidence level – confidence levels (high, medium, low) reflect how much supporting evidence was detected along with the primary element. The more supporting evidence an item contains, the higher the confidence that a matched item contains the sensitive info you're looking for.
    • Proximity – the number of characters between the primary and supporting element
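The fields above can be pictured as a small data structure. The following Python sketch is illustrative only: the class names `Pattern` and `SensitiveInfoType` are hypothetical and are not part of any Microsoft 365 API. The sample values are taken from the Argentina DNI example in this article.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Pattern:
    """Hypothetical model of one pattern: a primary element plus supporting elements."""
    primary: str                                          # name of the primary element
    supporting: List[str] = field(default_factory=list)  # names of supporting elements
    confidence: str = "medium"                            # "low", "medium", or "high"
    proximity: int = 300                                  # characters between primary and supporting element


@dataclass
class SensitiveInfoType:
    """Hypothetical container mirroring the name / description / pattern fields."""
    name: str
    description: str
    patterns: List[Pattern] = field(default_factory=list)


# Values taken from the Argentina DNI example in this article.
dni = SensitiveInfoType(
    name="Argentina national identity (DNI) number",
    description="Eight digits separated by periods",
    patterns=[
        Pattern(
            primary="Regex_argentina_national_id",
            supporting=["Keyword_argentina_national_id"],
            confidence="medium",
            proximity=300,
        )
    ],
)
```

A real sensitive information type is defined in an XML rule package rather than in code; the sketch only shows how the fields relate to each other.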

Diagram of corroborative evidence and proximity window

Learn more about confidence levels in this video

Example sensitive information type

Argentina national identity (DNI) number

Format

Eight digits separated by periods

Pattern

Eight digits:

  • two digits
  • a period
  • three digits
  • a period
  • three digits

Checksum

No

Definition

A DLP policy has medium confidence that it has detected this type of sensitive information if, within a proximity of 300 characters:

  • The regular expression Regex_argentina_national_id finds content that matches the pattern.
  • A keyword from Keyword_argentina_national_id is found.
<!-- Argentina National Identity (DNI) Number -->
<Entity id="eefbb00e-8282-433c-8620-8f1da3bffdb2" recommendedConfidence="75" patternsProximity="300">
   <Pattern confidenceLevel="75">
      <IdMatch idRef="Regex_argentina_national_id"/>
      <Match idRef="Keyword_argentina_national_id"/>
   </Pattern>
</Entity>

Keywords

Keyword_argentina_national_id

  • Argentina National Identity number
  • Identity
  • Identification National Identity Card
  • DNI
  • NIC National Registry of Persons
  • Documento Nacional de Identidad
  • Registro Nacional de las Personas
  • Identidad
  • Identificación
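The definition above can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions, not the service's implementation. The actual content of Regex_argentina_national_id is not shown in this article, so `DNI_REGEX` is an assumed regular expression built from the described format. The function name `find_dni_matches` is hypothetical. The proximity window is assumed to extend 300 characters on both sides of the match, and keywords are assumed to match case-insensitively on word boundaries.

```python
import re
from typing import List

# Assumed regex: two digits, a period, three digits, a period, three digits.
DNI_REGEX = re.compile(r"\b\d{2}\.\d{3}\.\d{3}\b")

# Keywords from Keyword_argentina_national_id, as listed above.
KEYWORDS = [
    "Argentina National Identity number",
    "Identity",
    "Identification National Identity Card",
    "DNI",
    "NIC National Registry of Persons",
    "Documento Nacional de Identidad",
    "Registro Nacional de las Personas",
    "Identidad",
    "Identificación",
]

# Assumption: keywords match case-insensitively on word boundaries.
KEYWORD_REGEX = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)

PROXIMITY = 300  # characters, from patternsProximity="300"


def find_dni_matches(text: str) -> List[str]:
    """Return DNI-shaped strings that have a keyword within PROXIMITY characters."""
    results = []
    for match in DNI_REGEX.finditer(text):
        # Assumption: the window extends PROXIMITY characters on both sides.
        start = max(0, match.start() - PROXIMITY)
        end = min(len(text), match.end() + PROXIMITY)
        if KEYWORD_REGEX.search(text, start, end):
            results.append(match.group())
    return results
```

For example, `find_dni_matches("Documento Nacional de Identidad: 12.345.678")` returns the number because a keyword sits next to it, while the same number in a sentence with no keyword returns nothing.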

More on confidence levels

In a sensitive information type entity definition, confidence level reflects how much supporting evidence is detected in addition to the primary element. The more supporting evidence an item contains, the higher the confidence that a matched item contains the sensitive info you're looking for. For example, matches with a high confidence level contain more supporting evidence in close proximity to the primary element, whereas matches with a low confidence level contain little to no supporting evidence in close proximity.

A high confidence level returns the fewest false positives but might result in more false negatives. Low or medium confidence levels return more false positives but few to zero false negatives.

  • low confidence: value of 65. Matched items will contain the fewest false negatives but the most false positives. Low confidence returns all low, medium, and high confidence matches.
  • medium confidence: value of 75. Matched items will contain an average number of false positives and false negatives. Medium confidence returns all medium and high confidence matches.
  • high confidence: value of 85. Matched items will contain the fewest false positives but the most false negatives. High confidence returns only high confidence matches.
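The way each level includes the levels above it can be sketched as a simple threshold filter. This is an illustration only: the function `filter_matches` and the `sample` data are hypothetical, and only the values 65, 75, and 85 come from the list above.

```python
# Numeric values for the three levels, as given in the list above.
CONFIDENCE_VALUES = {"low": 65, "medium": 75, "high": 85}


def filter_matches(matches, minimum_level):
    """Keep matches whose confidence is at or above the chosen level."""
    threshold = CONFIDENCE_VALUES[minimum_level]
    return [m for m in matches if m["confidence"] >= threshold]


# Hypothetical matches, for illustration only.
sample = [
    {"value": "12.345.678", "confidence": 65},
    {"value": "23.456.789", "confidence": 75},
    {"value": "34.567.890", "confidence": 85},
]

low_and_up = filter_matches(sample, "low")        # all three matches
medium_and_up = filter_matches(sample, "medium")  # the 75 and 85 matches
high_only = filter_matches(sample, "high")        # only the 85 match
```

Choosing a lower level widens the result set, which is why low confidence yields the fewest false negatives and the most false positives.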

You should use high confidence level patterns with low counts, say five to ten, and low confidence patterns with higher counts, say 20 or more.

Note

If you have existing policies or custom sensitive information types (SITs) defined using number-based confidence levels (also known as accuracy), they will automatically be mapped to the three discrete confidence levels (low confidence, medium confidence, and high confidence) across the Security & Compliance Center UI.

  • All policies with minimum accuracy, or custom SIT patterns with confidence levels, between 76 and 100 will be mapped to high confidence.
  • All policies with minimum accuracy, or custom SIT patterns with confidence levels, between 66 and 75 will be mapped to medium confidence.
  • All policies with minimum accuracy, or custom SIT patterns with confidence levels, less than or equal to 65 will be mapped to low confidence.
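The mapping rules above amount to a small lookup. The sketch below is illustrative: the function name `map_confidence` is hypothetical, and the lower bound of 0 is an assumption, since the note only states that values of 65 or less map to low confidence.

```python
def map_confidence(value: int) -> str:
    """Map a number-based confidence level to low, medium, or high."""
    # Assumption: valid number-based confidence levels lie between 0 and 100.
    if not 0 <= value <= 100:
        raise ValueError("confidence must be between 0 and 100")
    if value >= 76:
        return "high"    # 76 to 100
    if value >= 66:
        return "medium"  # 66 to 75
    return "low"         # 65 or less
```

For example, a custom SIT pattern defined with a confidence level of 80 would appear as high confidence, and one defined with 70 would appear as medium confidence.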

Creating custom sensitive information types

To create custom sensitive information types in the Security & Compliance Center, you can choose from several options:

Note

Improved confidence levels are available for immediate use within Data Loss Prevention for Microsoft 365 services, Microsoft Information Protection for Microsoft 365 services, Communication Compliance, Information Governance, and Records Management.

Microsoft 365 Information Protection now supports, in preview, double byte character set languages for:

  • Chinese (simplified)
  • Chinese (traditional)
  • Korean
  • Japanese

This support is available for sensitive information types. For more information, see Information protection support for double byte character sets release notes (preview).

For further information

To learn how to use sensitive information types to comply with data privacy regulations, see Deploy information protection for data privacy regulations with Microsoft 365 (aka.ms/m365dataprivacy).