
Introduction to AI in Azure Cognitive Search

AI enrichment is a capability of Azure Cognitive Search indexing used to extract text from images, blobs, and other unstructured data sources, enriching the content to make it more searchable in an index or knowledge store. Extraction and enrichment are implemented through cognitive skills attached to an indexing pipeline. Cognitive skills built into the service fall into these categories:

  • Natural language processing skills include entity recognition, language detection, key phrase extraction, text manipulation, and sentiment detection. With these skills, unstructured text can assume new forms, mapped as searchable and filterable fields in an index.

  • Image processing skills include Optical Character Recognition (OCR) and identification of visual features, such as facial detection, image interpretation, image recognition (famous people and landmarks), or attributes like colors or image orientation. You can create text representations of image content, searchable using all the query capabilities of Azure Cognitive Search.

Enrichment pipeline diagram

Cognitive skills in Azure Cognitive Search are based on pre-trained machine learning models in Cognitive Services APIs: Computer Vision and Text Analytics.

Natural language and image processing are applied during the data ingestion phase, with results becoming part of a document's composition in a searchable index in Azure Cognitive Search. Data is sourced as an Azure data set and then pushed through an indexing pipeline using whichever built-in skills you need. The architecture is extensible, so if the built-in skills are not sufficient, you can create and attach custom skills to integrate custom processing. Examples might be a custom entity module or document classifier targeting a specific domain such as finance, scientific publications, or medicine.

Note

As you expand scope by increasing the frequency of processing, adding more documents, or adding more AI algorithms, you will need to attach a billable Cognitive Services resource. Charges accrue when calling APIs in Cognitive Services, and for image extraction as part of the document-cracking stage in Azure Cognitive Search. There are no charges for text extraction from documents.

Execution of built-in skills is charged at the existing Cognitive Services pay-as-you-go price. Image extraction pricing is described on the Azure Cognitive Search pricing page.

When to use cognitive skills

You should consider using built-in cognitive skills if your raw content is unstructured text, image content, or content that needs language detection and translation. Applying AI through the built-in cognitive skills can unlock this content, increasing its value and utility in your search and data science apps.

Additionally, you might consider adding a custom skill if you have open-source, third-party, or first-party code that you'd like to integrate into the pipeline. Classification models that identify salient characteristics of various document types fall into this category, but any package that adds value to your content could also be used.

More about built-in skills

A skillset that's assembled using built-in skills is well suited for the following application scenarios:

  • Scanned documents (JPEG) that you want to make full-text searchable. You can attach an optical character recognition (OCR) skill to identify, extract, and ingest text from JPEG files.

  • PDFs with combined image and text. Text in PDFs can be extracted during indexing without the use of enrichment steps, but the addition of image and natural language processing can often produce a better outcome than standard indexing provides.

  • Multi-lingual content against which you want to apply language detection and possibly text translation.

  • Unstructured or semi-structured documents containing content that has inherent meaning or context that is hidden in the larger document.

    Blobs in particular often contain a large body of content that is packed into a single "field". By attaching image and natural language processing skills to an indexer, you can create new information that is extant in the raw content, but not otherwise surfaced as distinct fields. Some ready-to-use built-in cognitive skills that can help: key phrase extraction, sentiment analysis, and entity recognition (people, organizations, and locations).

    Additionally, built-in skills can be used to restructure content through text split, merge, and shape operations.
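As a rough illustration of the scanned-document and restructuring scenarios above, the following Python sketch builds two built-in skill definitions as plain dictionaries: an OCR skill and a text split skill. The `@odata.type` values and the input/output names reflect the built-in skill reference as best understood here. The target names (`ocrText`, `pages`), the page length, and the helper function names are assumptions chosen for illustration only.

```python
import json

def ocr_skill():
    """OCR skill definition: reads text from images extracted during document cracking."""
    return {
        "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
        "context": "/document/normalized_images/*",
        "inputs": [{"name": "image", "source": "/document/normalized_images/*"}],
        # "ocrText" is an illustrative target name, not a required value.
        "outputs": [{"name": "text", "targetName": "ocrText"}],
    }

def split_skill(max_page_length=4000):
    """Text split skill definition: breaks large text into smaller pages."""
    return {
        "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
        "context": "/document",
        "textSplitMode": "pages",
        "maximumPageLength": max_page_length,  # assumed value for the sketch
        "inputs": [{"name": "text", "source": "/document/content"}],
        "outputs": [{"name": "textItems", "targetName": "pages"}],
    }

skills = [ocr_skill(), split_skill()]
print(json.dumps(skills, indent=2))
```

These dictionaries would sit inside the `skills` array of a skillset definition; nothing is sent to a service in this sketch.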

More about custom skills

Custom skills can support more complex scenarios, such as recognizing forms, or custom entity detection using a model that you provide and wrap in the custom skill web interface. Examples of custom skills include Form Recognizer, integration of the Bing Entity Search API, and custom entity recognition.
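To make the custom skill web interface more concrete, here is a minimal Python sketch of the request-handling logic a custom skill endpoint might contain. It assumes the commonly documented payload shape, a `values` array of records that each carry a `recordId` and a `data` object, and it returns one result per record. The `classify` function is a hypothetical stand-in for your own model, and the `text` and `category` field names are assumptions.

```python
import json

def classify(text):
    """Hypothetical stand-in for a domain model: flags text that mentions invoices."""
    return "finance" if "invoice" in text.lower() else "other"

def handle_request(body):
    """Build a custom-skill style response, echoing each recordId from the request."""
    results = []
    for record in body.get("values", []):
        record_id = record.get("recordId")
        text = record.get("data", {}).get("text")
        if text is None:
            # Report the problem for this record only; other records still succeed.
            results.append({
                "recordId": record_id,
                "data": {},
                "errors": [{"message": "Missing 'text' input."}],
                "warnings": [],
            })
            continue
        results.append({
            "recordId": record_id,
            "data": {"category": classify(text)},
            "errors": [],
            "warnings": [],
        })
    return {"values": results}

request = {"values": [
    {"recordId": "1", "data": {"text": "Invoice 2041 is overdue."}},
    {"recordId": "2", "data": {}},
]}
response = handle_request(request)
print(json.dumps(response, indent=2))
```

In practice this function would sit behind an HTTP endpoint (for example, a serverless function) that the skillset calls; the hosting layer is left out here.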

Components of an enrichment pipeline

An enrichment pipeline is based on indexers that crawl data sources and provide end-to-end index processing. Skills are now attached to indexers, intercepting and enriching documents according to the skillset you define. Once content is indexed, you can access it via search requests using any query type supported by Azure Cognitive Search. If you are new to indexers, this section walks you through the steps.

Step 1: Connection and document cracking phase

At the start of the pipeline, you have unstructured text or non-text content (such as image and scanned document JPEG files). Data must exist in an Azure data storage service that can be accessed by an indexer. Indexers can "crack" source documents to extract text from source data.

Document cracking phase

Supported sources include Azure Blob storage, Azure Table storage, Azure SQL Database, and Azure Cosmos DB. Text-based content can be extracted from the following file types: PDF, Word, PowerPoint, and CSV files. For the full list, see Supported formats.
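The connection step can be sketched as a data source definition. The Python below builds the JSON body for a blob data source; the property names (`type`, `credentials.connectionString`, `container.name`) and the type strings in `SOURCE_TYPES` are given as best recalled from the REST reference and should be checked against it. The resource name, container name, and connection string placeholder are hypothetical.

```python
import json

# Type identifiers for the supported sources listed above (assumed values; verify before use).
SOURCE_TYPES = {
    "Azure Blob storage": "azureblob",
    "Azure Table storage": "azuretable",
    "Azure SQL Database": "azuresql",
    "Azure Cosmos DB": "cosmosdb",
}

def data_source(name, source, connection_string, container):
    """Return a data source definition that an indexer can reference by name."""
    return {
        "name": name,
        "type": SOURCE_TYPES[source],
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }

# All argument values below are placeholders for illustration.
blob_source = data_source(
    "demo-blob-ds", "Azure Blob storage", "<storage-connection-string>", "demo-container"
)
print(json.dumps(blob_source, indent=2))
```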

Step 2: Cognitive skills and enrichment phase

Enrichment is performed by cognitive skills, each carrying out an atomic operation. For example, once you have text content from a PDF, you can apply entity recognition, language detection, or key phrase extraction to produce new fields in your index that are not available natively in the source. Altogether, the collection of skills used in your pipeline is called a skillset.

Enrichment phase

A skillset is based on built-in cognitive skills or custom skills you provide and connect to the skillset. A skillset can be minimal or highly complex, and determines not only the type of processing, but also the order of operations. A skillset plus the field mappings defined as part of an indexer fully specifies the enrichment pipeline. For more information about pulling all of these pieces together, see Define a skillset.
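The way one skill's output feeds another can be sketched in Python. The example below assembles a small skillset in which language detection produces `languageCode`, which key phrase extraction then consumes. The `@odata.type` values and input/output names follow the built-in skill reference as best understood here; the `skill` and `unresolved_inputs` helpers are hypothetical, and the dependency check walks the skills in list order purely as a simplification for this sketch.

```python
def skill(odata_type, inputs, outputs, **extra):
    """Build one skill definition from {input name: source path} and {output name: target name}."""
    body = {
        "@odata.type": odata_type,
        "context": "/document",
        "inputs": [{"name": n, "source": s} for n, s in inputs.items()],
        "outputs": [{"name": n, "targetName": t} for n, t in outputs.items()],
    }
    body.update(extra)
    return body

skillset = {
    "description": "Detect language, then extract key phrases and organizations",
    "skills": [
        skill("#Microsoft.Skills.Text.LanguageDetectionSkill",
              {"text": "/document/content"}, {"languageCode": "languageCode"}),
        skill("#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
              {"text": "/document/content", "languageCode": "/document/languageCode"},
              {"keyPhrases": "keyPhrases"}),
        skill("#Microsoft.Skills.Text.EntityRecognitionSkill",
              {"text": "/document/content"}, {"organizations": "organizations"},
              categories=["Organization"]),
    ],
}

def unresolved_inputs(skillset, source_fields):
    """List input paths that no source field or earlier skill in the list provides."""
    available = {f"/document/{field}" for field in source_fields}
    missing = []
    for s in skillset["skills"]:
        missing += [i["source"] for i in s["inputs"] if i["source"] not in available]
        available |= {f"{s['context']}/{o['targetName']}" for o in s["outputs"]}
    return missing

print(unresolved_inputs(skillset, ["content"]))
```

A check like this is only a local sanity test of how inputs and outputs line up; the service itself validates the skillset when it is created.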

Internally, the pipeline generates a collection of enriched documents. You can decide which parts of the enriched documents should be mapped to indexable fields in your search index. For example, if you applied the key phrase extraction and entity recognition skills, those new fields would become part of the enriched document, and they can be mapped to fields in your index. See Annotations to learn more about input/output formations.
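The mapping from enriched document to index fields can be illustrated with a small local simulation. In the Python sketch below, `output_field_mappings` mirrors the shape of an indexer's `outputFieldMappings` (pairs of `sourceFieldName` and `targetFieldName`), while the `enriched` dictionary and the `project` helper are hypothetical constructs used only to show which parts of an enriched document would reach the index.

```python
# A toy enriched document, keyed by enrichment path (illustrative data only).
enriched = {
    "/document/content": "Contoso signed an agreement with Fabrikam.",
    "/document/organizations": ["Contoso", "Fabrikam"],
    "/document/keyPhrases": ["agreement"],
    "/document/languageCode": "en",
}

# Same shape as the outputFieldMappings section of an indexer definition.
output_field_mappings = [
    {"sourceFieldName": "/document/organizations", "targetFieldName": "organizations"},
    {"sourceFieldName": "/document/keyPhrases", "targetFieldName": "keyPhrases"},
]

def project(enriched_doc, mappings):
    """Copy only the mapped enrichment outputs into an index-shaped document."""
    return {
        m["targetFieldName"]: enriched_doc[m["sourceFieldName"]]
        for m in mappings
        if m["sourceFieldName"] in enriched_doc
    }

index_document = project(enriched, output_field_mappings)
print(index_document)
```

Here `languageCode` exists in the enriched document but is not mapped, so it does not appear in the projected result.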

Add a knowledgeStore element to save enrichments

Search REST api-version=2019-05-06-Preview extends skillsets with a knowledgeStore definition that provides an Azure storage connection and projections that describe how the enrichments are stored.

Adding a knowledge store to a skillset gives you the ability to project a representation of your enrichments for scenarios other than full text search. For more information, see Knowledge store (preview).
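As a hedged sketch of what a knowledgeStore element might look like inside a skillset, the Python below builds one with a single table projection. The property names (`storageConnectionString`, `projections`, `tables`, `tableName`, `generatedKeyName`, `source`) are given as best recalled from the preview API and may differ between preview versions; the table name, key name, source path, and skillset name are hypothetical.

```python
import json

knowledge_store = {
    "storageConnectionString": "<storage-connection-string>",  # placeholder
    "projections": [
        {
            "tables": [
                {
                    "tableName": "Documents",            # hypothetical table name
                    "generatedKeyName": "DocumentId",    # hypothetical key column
                    "source": "/document/tableprojection",  # assumed shaped node
                }
            ],
            "objects": [],
        }
    ],
}

# The knowledgeStore element sits alongside the skills in the skillset body.
skillset_with_store = {
    "name": "demo-skillset",
    "skills": [],  # skill definitions omitted in this sketch
    "knowledgeStore": knowledge_store,
}
print(json.dumps(skillset_with_store, indent=2))
```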

Step 3: Search index and query-based access

When processing is finished, you have a search index consisting of enriched documents, fully text-searchable in Azure Cognitive Search. Querying the index is how developers and users access the enriched content generated by the pipeline.

Index with search icon

The index is like any other you might create for Azure Cognitive Search: you can supplement with custom analyzers, invoke fuzzy search queries, add filtered search, or experiment with scoring profiles to reshape the search results.
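For instance, fuzzy and filtered queries against an enriched index could be expressed as query URLs. The Python sketch below only builds the URLs and sends nothing. It assumes the documented `search`, `queryType`, and `$filter` query parameters; the service name, index name, and the `languageCode` field are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def search_url(service, index, **params):
    """Build a GET query URL for the documents collection of an index."""
    query = {"api-version": "2019-05-06", **params}
    return f"https://{service}.search.windows.net/indexes/{index}/docs?{urlencode(query)}"

# Fuzzy search uses the full Lucene syntax: a trailing ~ on the term.
fuzzy = search_url("demo-service", "demo-index", search="contso~", queryType="full")

# Filtered search on a field produced during enrichment (hypothetical field name).
filtered = search_url("demo-service", "demo-index",
                      search="*", **{"$filter": "languageCode eq 'en'"})

for url in (fuzzy, filtered):
    # Decode the query string again to show the parameters as the service would read them.
    print(parse_qs(urlsplit(url).query))
```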

Indexes are generated from an index schema that defines the fields, attributes, and other constructs attached to a specific index, such as scoring profiles and synonym maps. Once an index is defined and populated, you can index incrementally to pick up new and updated source documents. Certain modifications require a full rebuild. You should use a small data set until the schema design is stable. For more information, see How to rebuild an index.
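A minimal index schema that leaves room for enrichment output might look like the following Python sketch. The field attributes (`key`, `searchable`, `filterable`) and the `Edm.String` and `Collection(Edm.String)` types are standard schema elements; the index name, field names, and the `key_fields` helper are assumptions for illustration.

```python
index_schema = {
    "name": "demo-index",
    "fields": [
        # Exactly one field acts as the document key.
        {"name": "id", "type": "Edm.String", "key": True, "searchable": False},
        # A field populated directly from source data.
        {"name": "content", "type": "Edm.String", "searchable": True},
        # Fields reserved for values generated during enrichment.
        {"name": "languageCode", "type": "Edm.String", "filterable": True},
        {"name": "keyPhrases", "type": "Collection(Edm.String)", "searchable": True},
        {"name": "organizations", "type": "Collection(Edm.String)",
         "searchable": True, "filterable": True},
    ],
}

def key_fields(schema):
    """Return the names of fields marked as the document key."""
    return [f["name"] for f in schema["fields"] if f.get("key")]

print(key_fields(index_schema))
```

Checking locally that a schema has a single key field is a cheap way to catch a design mistake before a rebuild is needed.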

Key features and concepts

| Concept | Description | Links |
|---|---|---|
| Skillset | A top-level named resource containing a collection of skills. A skillset is the enrichment pipeline. It is invoked during indexing by an indexer. | See Define a skillset |
| Cognitive skill | An atomic transformation in an enrichment pipeline. Often, it is a component that extracts or infers structure, and therefore augments your understanding of the input data. Almost always, the output is text-based and the processing is natural language processing or image processing that extracts or generates text from image inputs. Output from a skill can be mapped to a field in an index, or used as an input for a downstream enrichment. A skill is either predefined and provided by Microsoft, or custom: created and deployed by you. | Built-in cognitive skills |
| Data extraction | Covers a broad range of processing, but pertaining to AI enrichment, the entity recognition skill is most typically used to extract data (an entity) from a source that doesn't provide that information natively. | See Entity Recognition Skill and Document Extraction Skill (preview) |
| Image processing | Infers text from an image, such as the ability to recognize a landmark, or extracts text from an image. Common examples include OCR for lifting characters from a scanned document (JPEG) file, or recognizing a street name in a photograph containing a street sign. | See Image Analysis Skill or OCR Skill |
| Natural language processing | Text processing for insights and information about text inputs. Language detection, sentiment analysis, and key phrase extraction are skills that fall under natural language processing. | See Key Phrase Extraction Skill, Language Detection Skill, Text Translation Skill (preview), Sentiment Analysis Skill |
| Document cracking | The process of extracting or creating text content from non-text sources during indexing. Optical character recognition (OCR) is an example, but generally it refers to core indexer functionality as the indexer extracts content from application files. The data source providing source file location, and the indexer definition providing field mappings, are both key factors in document cracking. | See Indexers overview |
| Shaping | Consolidate text fragments into a larger structure, or conversely break down larger text chunks into a manageable size for further downstream processing. | See Shaper Skill, Text Merger Skill, Text Split Skill |
| Enriched documents | A transitory internal structure, generated during processing, with final output reflected in a search index. A skillset determines which enrichments are performed. Field mappings determine which data elements are added to the index. Optionally, you can create a knowledge store to persist and explore enriched documents using tools like Storage Explorer, Power BI, or any other tool that connects to Azure Blob storage. | See Knowledge store (preview) |
| Indexer | A crawler that extracts searchable data and metadata from an external data source and populates an index based on field-to-field mappings between the index and your data source for document cracking. For AI enrichments, the indexer invokes a skillset, and contains the field mappings associating enrichment output to target fields in the index. The indexer definition contains all of the instructions and references for pipeline operations, and the pipeline is invoked when you run the indexer. With additional configuration, you can reuse existing processed content and execute only those steps and skills that have changed. | See Indexers and Incremental enrichment (preview) |
| Data Source | An object used by an indexer to connect to an external data source of supported types on Azure. | See Indexers overview |
| Index | A persisted search index in Azure Cognitive Search, built from an index schema that defines field structure and usage. | See Create a basic index |
| Knowledge store | A storage account where the enriched documents can be shaped and projected in addition to the search index. | See Introduction to knowledge store |
| Cache | A storage account that contains cached output created by an enrichment pipeline. Enabling the cache preserves existing output that is unaffected by changes to a skillset or other components of the enrichment pipeline. | See Incremental enrichment |

Where do I start?

Step 1: Create an Azure Cognitive Search resource

Step 2: Try some quickstarts and examples for hands-on experience

We recommend the Free service for learning purposes; however, the number of free transactions is limited to 20 documents per day. To run both the quickstart and tutorial in one day, use a smaller file set (10 documents) so that you can fit in both exercises, or delete the indexer you used in the quickstart or tutorial to reset the counter to zero.

Step 3: Review the API

You can use REST api-version=2019-05-06 on requests, or the .NET SDK. If you are exploring knowledge store, use the preview REST API instead (api-version=2019-05-06-Preview).

This step uses the REST APIs to build an AI enrichment solution. Only two APIs are added or extended for AI enrichment. Other APIs have the same syntax as the generally available versions.

| REST API | Description |
|---|---|
| Create Data Source | A resource identifying an external data source providing source data used to create enriched documents. |
| Create Skillset (api-version=2019-05-06) | This API is specific to AI enrichment. It is a resource coordinating the use of built-in skills and custom cognitive skills used in an enrichment pipeline during indexing. |
| Create Index | A schema expressing an Azure Cognitive Search index. Fields in the index map to fields in source data or to fields manufactured during the enrichment phase (for example, a field for organization names created by entity recognition). |
| Create Indexer (api-version=2019-05-06) | A resource defining components used during indexing, including a data source, a skillset, field associations from source and intermediary data structures to target index, and the index itself. Running the indexer is the trigger for data ingestion and enrichment. The output is a search index based on the index schema, populated with source data, enriched through skillsets. This existing API is extended for cognitive search scenarios with the inclusion of a skillset property. |
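The four REST calls in the table can be sketched as request descriptions. The Python below assembles, but does not send, one PUT request per resource, assuming the usual `https://[service].search.windows.net/[collection]/[name]?api-version=...` URL pattern and an `api-key` header. The service name, resource names, key placeholder, and the `build_request` helper are hypothetical, and the request bodies are reduced to stubs.

```python
import json

API_VERSION = "2019-05-06"

def build_request(service, collection, name, body, api_key):
    """Describe a create-or-update request for one resource without sending it."""
    return {
        "method": "PUT",
        "url": (f"https://{service}.search.windows.net/"
                f"{collection}/{name}?api-version={API_VERSION}"),
        "headers": {"Content-Type": "application/json", "api-key": api_key},
        "body": json.dumps(body),
    }

# Resource names and bodies are placeholders; real bodies carry full definitions.
resources = [
    ("datasources", "demo-ds", {"name": "demo-ds"}),
    ("skillsets", "demo-skillset", {"name": "demo-skillset"}),
    ("indexes", "demo-index", {"name": "demo-index"}),
    ("indexers", "demo-indexer", {
        "name": "demo-indexer",
        "dataSourceName": "demo-ds",
        "skillsetName": "demo-skillset",
        "targetIndexName": "demo-index",
    }),
]
requests_in_order = [
    build_request("demo-service", collection, name, body, "<admin-api-key>")
    for collection, name, body in resources
]
for r in requests_in_order:
    print(r["method"], r["url"])
```

The indexer comes last because its definition refers to the other three resources by name.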

Checklist: A typical workflow

  1. Subset your Azure source data into a representative sample. Indexing takes time, so start with a small, representative data set and then build it up incrementally as your solution matures.

  2. Create a data source object in Azure Cognitive Search to provide a connection string for data retrieval.

  3. Create a skillset with enrichment steps.

  4. Define the index schema. The Fields collection includes fields from source data. You should also stub out additional fields to hold generated values for content created during enrichment.

  5. Define the indexer referencing the data source, skillset, and index.

  6. Within the indexer, add outputFieldMappings. This section maps output from the skillset (in step 3) to the input fields in the index schema (in step 4).

  7. Send the Create Indexer request you just defined (a POST request with an indexer definition in the request body) to create the indexer in Azure Cognitive Search. This step is how you run the indexer, invoking the pipeline.

  8. Run queries to evaluate results and modify code to update skillsets, schema, or indexer configuration.

  9. Reset the indexer before rebuilding the pipeline.
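The final steps of the checklist, resetting and rerunning an indexer while iterating, can be sketched as a short sequence of request descriptions. The Python below assumes the Reset Indexer, Run Indexer, and Get Indexer Status operations at `/indexers/[name]/reset`, `/indexers/[name]/run`, and `/indexers/[name]/status`; the service and indexer names and the `indexer_operation` helper are hypothetical, and no request is actually sent.

```python
API_VERSION = "2019-05-06"

# HTTP method for each indexer operation used while iterating on a pipeline.
OPERATIONS = {"reset": "POST", "run": "POST", "status": "GET"}

def indexer_operation(service, indexer, operation):
    """Return (method, url) for one indexer operation."""
    method = OPERATIONS[operation]
    url = (f"https://{service}.search.windows.net/indexers/"
           f"{indexer}/{operation}?api-version={API_VERSION}")
    return method, url

# Reset first so the next run reprocesses all documents, then run, then check progress.
rebuild_sequence = [
    indexer_operation("demo-service", "demo-indexer", op)
    for op in ("reset", "run", "status")
]
for method, url in rebuild_sequence:
    print(method, url)
```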

For more information about specific questions or problems, see Troubleshooting tips.

Next steps