Learn about trainable classifiers

Classifying and labeling content so that it can be protected and handled properly is the starting place for the information protection discipline. Microsoft 365 has three ways to classify content.

Manually

This method requires human judgment and action. An admin can either use the pre-existing labels and sensitive information types or create their own and then publish them. Users and admins apply them to content as they encounter it. You can then protect the content and manage its disposition.

Automated pattern matching

This category of classification mechanisms includes finding content by:

Sensitivity and retention labels can then be automatically applied to make the content available for use in data loss prevention (DLP) and auto-apply policies for retention labels.
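As a rough illustration of what pattern matching means here, the sketch below finds content by the presence of a fixed pattern rather than by what the item is. The regular expression, the checksum step, and the function names are all assumptions made for this example; they are not the definitions Microsoft 365 uses for its sensitive information types.

```python
import re

# Hypothetical pattern for illustration only: 16 digits in groups of four,
# optionally separated by spaces or hyphens. Not a Microsoft 365 definition.
CARD_LIKE = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_like(text: str) -> list[str]:
    """Return card-like digit runs in text that also pass the Luhn check."""
    hits = []
    for match in CARD_LIKE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_ok(digits):
            hits.append(digits)
    return hits
```

A match depends only on elements inside the item, which is the property that distinguishes this approach from the classifiers described next.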

Classifiers

This classification method is particularly well suited to content that isn't easily identified by either the manual or the automated pattern matching methods. It is more about training a classifier to identify an item based on what the item is, not on elements that are in the item (pattern matching). A classifier learns how to identify a type of content by looking at hundreds of examples of the content you're interested in classifying. You start by feeding it examples that are definitely in the category. Once it processes those, you test it by giving it a mix of matching and non-matching examples. The classifier then predicts whether any given item falls into the category you're building. You then confirm its results, sorting out the true positives, true negatives, false positives, and false negatives to help increase the accuracy of its predictions.
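The review step can be pictured with a small sketch that sorts predictions into the four outcomes named above and derives two common accuracy measures from them. This is only an illustration under stated assumptions: the function names and the boolean-list input format are hypothetical, and the sketch does not reflect how Microsoft 365 computes or reports accuracy internally.

```python
from collections import Counter


def confusion_counts(predicted, actual):
    """Count true/false positives and negatives.

    predicted: the classifier's guess for each item (True = in the category).
    actual: the reviewer's confirmed answer for the same items.
    """
    counts = Counter()
    for p, a in zip(predicted, actual):
        if p and a:
            counts["TP"] += 1  # true positive: correctly matched
        elif p and not a:
            counts["FP"] += 1  # false positive: matched but should not have
        elif not p and a:
            counts["FN"] += 1  # false negative: missed a real match
        else:
            counts["TN"] += 1  # true negative: correctly ignored
    return counts


def precision_recall(counts):
    """Return (precision, recall); 0.0 when a denominator is zero."""
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical review of five test items.
predicted = [True, True, False, False, True]
actual = [True, False, False, True, True]
counts = confusion_counts(predicted, actual)
precision, recall = precision_recall(counts)
```

In this made-up review the classifier has two true positives, one false positive, one false negative, and one true negative, so both precision and recall come out at two thirds.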

When you publish the classifier, it sorts through items in locations like SharePoint Online, Exchange, and OneDrive, and classifies the content. After you publish the classifier, you can continue to train it using a feedback process that is similar to the initial training process.

Where you can use trainable classifiers

Both built-in classifiers and trainable classifiers are available as a condition for Office autolabeling with sensitivity labels, for auto-apply retention label policies based on a condition, and in communication compliance.

Sensitivity labels can use classifiers as conditions; see Apply a sensitivity label to content automatically.

Important

Classifiers only work with items that are not encrypted and are in English.

Types of classifiers

  • Pre-trained classifiers - Microsoft has created and pre-trained a number of classifiers that you can start using without training them. These classifiers appear with the status of Ready to use.
  • Custom classifiers - If you have classification needs that extend beyond what the pre-trained classifiers cover, you can create and train your own classifiers.

Pre-trained classifiers

Microsoft 365 comes with five pre-trained classifiers:

Caution

We are deprecating the Offensive Language pre-trained classifier because it has been producing a high number of false positives. Don't use it, and if you are currently using it, move your business processes off it. We recommend using the Threat, Profanity, and Harassment pre-trained classifiers instead.

  • Resumes: detects items that are textual accounts of an applicant's personal, educational, and professional qualifications, work experience, and other personally identifying information
  • Source Code: detects items that contain a set of instructions and statements written in the top 25 most-used computer programming languages on GitHub
    • ActionScript
    • C
    • C#
    • C++
    • Clojure
    • CoffeeScript
    • Go
    • Haskell
    • Java
    • JavaScript
    • Lua
    • MATLAB
    • Objective-C
    • Perl
    • PHP
    • Python
    • R
    • Ruby
    • Scala
    • Shell
    • Swift
    • TeX
    • Vim Script

Note

Source Code is trained to detect when the bulk of the text is source code. It does not detect source code text that is interspersed with plain text.

  • Harassment: detects a specific category of offensive language text items related to offensive conduct targeting one or more individuals based on the following traits: race, ethnicity, religion, national origin, gender, sexual orientation, age, disability
  • Profanity: detects a specific category of offensive language text items that contain expressions that embarrass most people
  • Threat: detects a specific category of offensive language text items related to threats to commit violence or do physical harm or damage to a person or property

These appear in the Microsoft 365 compliance center > Data classification > Trainable classifiers view with the status of Ready to use.


Important

Please note that the offensive language, harassment, profanity, and threat classifiers only work with searchable text and are not exhaustive or complete. Further, language and cultural standards continually change, and in light of these realities, Microsoft reserves the right to update these classifiers at its discretion. While the classifiers may assist your organization in monitoring offensive and other language used, the classifiers do not address the consequences of such language and are not intended to provide your organization's sole means of monitoring or responding to the use of such language. Your organization, and not Microsoft or its subsidiaries, remains responsible for all decisions related to monitoring, enforcement, blocking, removal, and retention of any content identified by a pre-trained classifier.

Custom classifiers

When the pre-trained classifiers don't meet your needs, you can create and train your own classifiers. Creating your own involves significantly more work, but they'll be much better tailored to your organization's needs.

For example, you could create trainable classifiers for:

  • Legal documents - such as attorney-client privilege, closing sets, statements of work
  • Strategic business documents - such as press releases, mergers and acquisitions, deals, business or marketing plans, intellectual property, patents, design docs
  • Pricing information - such as invoices, price quotes, work orders, bidding documents
  • Financial information - such as organizational investments, quarterly or annual results

Process flow for creating custom classifiers

Creating and publishing a classifier for use in compliance solutions, such as retention policies and communication supervision, follows this flow. For more detail on creating a custom trainable classifier, see Creating a custom classifier.

Figure: process flow for a custom classifier

Retraining classifiers

You can help improve the accuracy of all custom classifiers and some pre-trained classifiers by giving them feedback on the accuracy of the classifications they perform. This is called retraining, and it follows this workflow.

Figure: classifier retraining workflow

See also