Add full text search to Azure blob data using Azure Cognitive Search

Searching across the variety of content types stored in Azure Blob storage can be a difficult problem to solve. However, you can index and search the content of your blobs in just a few clicks by using Azure Cognitive Search. Azure Cognitive Search has built-in integration for indexing content from Blob storage through a Blob indexer that adds data-source-aware capabilities to indexing.

What it means to add full text search to blob data

Azure Cognitive Search is a cloud search service that provides indexing and query engines that operate over user-defined indexes hosted on your search service. Co-locating your searchable content with the query engine in the cloud is necessary for performance, so that results are returned at the speed users have come to expect from search queries.

Azure Cognitive Search integrates with Azure Blob storage at the indexing layer, importing your blob content as search documents that are indexed into inverted indexes and other query structures that support free-form text queries and filter expressions. Because your blob content is indexed into a search index, access to blob content can leverage the full range of query features in Azure Cognitive Search.

Once the index is created and populated, it exists independently of your blob container, but you can rerun indexing operations to refresh your index with changes to the underlying container. Timestamp information on individual blobs is used for change detection. You can opt for either scheduled execution or on-demand indexing as the refresh mechanism.
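The timestamp-based refresh described above can be pictured with a small sketch. It only illustrates the idea and is not the service's actual implementation: the dictionary layout, the function names, and the "high-water mark" bookkeeping are assumptions made for this example. The schedule fragment at the end follows the shape used by the indexer REST API, where the interval is an ISO 8601 duration.

```python
from datetime import datetime, timezone

def blobs_to_reindex(blobs, high_water_mark):
    """Return the blobs modified after the last successful indexing run.

    `blobs` is a list of dicts with a timezone-aware `last_modified` value
    (a hypothetical shape chosen for this sketch).
    """
    return [b for b in blobs if b["last_modified"] > high_water_mark]

def next_high_water_mark(blobs, high_water_mark):
    """Advance the mark to the newest timestamp seen, never moving it backwards."""
    return max([b["last_modified"] for b in blobs] + [high_water_mark])

# Scheduled execution: a fragment of an indexer definition asking for a run
# every two hours. Omitting it leaves the indexer to be run on demand.
indexer_schedule = {"schedule": {"interval": "PT2H"}}

if __name__ == "__main__":
    last_run = datetime(2020, 1, 1, tzinfo=timezone.utc)
    blobs = [
        {"name": "old.pdf", "last_modified": datetime(2019, 12, 31, tzinfo=timezone.utc)},
        {"name": "new.docx", "last_modified": datetime(2020, 1, 2, tzinfo=timezone.utc)},
    ]
    changed = blobs_to_reindex(blobs, last_run)
    print([b["name"] for b in changed])
    print(next_high_water_mark(blobs, last_run).isoformat())
```

Only blobs whose timestamp is newer than the stored mark are picked up, which is why a rerun over an unchanged container has little to do.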

Inputs are your blobs, in a single container, in Azure Blob storage. Blobs can be almost any kind of text data. If your blobs contain images, you can add AI enrichment to blob indexing to create and extract text from images.

Output is always an Azure Cognitive Search index, used for fast text search, retrieval, and exploration in client applications. In between is the indexing pipeline architecture itself. The pipeline is based on the indexer feature, discussed later in this article.

Start with services

You need Azure Cognitive Search and Azure Blob storage. Within Blob storage, you need a container that provides source content.

You can start directly in your Storage account portal page. In the left navigation pane, under Blob service, click Add Azure Cognitive Search to create a new service or select an existing one.

Once you add Azure Cognitive Search to your storage account, you can follow the standard process to index blob data. We recommend the Import data wizard in Azure Cognitive Search for an easy initial introduction, or you can call the REST APIs using a tool like Postman. This tutorial walks you through the steps of calling the REST API in Postman: Index and search semi-structured data (JSON blobs) in Azure Cognitive Search.

Use a Blob indexer

An indexer is a data-source-aware subservice equipped with internal logic for sampling data, reading metadata, retrieving data, and serializing data from native formats into JSON documents for subsequent import.

Blobs in Azure Storage are indexed using the Azure Cognitive Search Blob storage indexer. You can invoke this indexer by using the Import data wizard, a REST API, or the .NET SDK. In code, you use this indexer by setting the type, and by providing connection information that includes an Azure Storage account along with a blob container. You can subset your blobs by creating a virtual directory, which you can then pass as a parameter, or by filtering on a file type extension.
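As a rough sketch of what "setting the type and providing connection information" can look like, the helpers below build the JSON bodies for a data source and an indexer as they would be sent to the REST API. The helper names, resource names, and the placeholder connection string are made up for illustration. The property names (`type: "azureblob"`, `container.query` for a virtual directory, `indexedFileNameExtensions`) are believed to match the REST API but should be checked against the reference for the API version you use.

```python
import json

def blob_data_source(name, connection_string, container, virtual_directory=None):
    """Build a data source definition that points at one blob container."""
    container_def = {"name": container}
    if virtual_directory:
        # Restrict indexing to blobs under this virtual directory.
        container_def["query"] = virtual_directory
    return {
        "name": name,
        "type": "azureblob",  # selects the Blob storage indexer logic
        "credentials": {"connectionString": connection_string},
        "container": container_def,
    }

def blob_indexer(name, data_source_name, index_name, extensions=None):
    """Build an indexer definition, optionally limited to some file extensions."""
    body = {
        "name": name,
        "dataSourceName": data_source_name,
        "targetIndexName": index_name,
    }
    if extensions:
        body["parameters"] = {
            "configuration": {"indexedFileNameExtensions": ",".join(extensions)}
        }
    return body

if __name__ == "__main__":
    # All names below are hypothetical placeholders.
    ds = blob_data_source("docs-ds", "<storage-connection-string>", "docs", "reports/2020")
    ix = blob_indexer("docs-indexer", ds["name"], "docs-index", [".pdf", ".docx"])
    print(json.dumps(ds, indent=2))
    print(json.dumps(ix, indent=2))
```

Building the bodies as plain dictionaries keeps the sketch independent of any particular SDK; the same JSON can be pasted into Postman or sent with any HTTP client.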

An indexer performs "document cracking", opening a blob to inspect its content. After connecting to the data source, it's the first step in the pipeline. For blob data, this is where PDF, Office documents, and other content types are detected. Document cracking with text extraction incurs no charge. If your blobs contain image content, images are ignored unless you add AI enrichment. Standard indexing applies to text content only.

The Blob indexer comes with configuration parameters and supports change tracking if the underlying data provides sufficient information. You can learn more about the core functionality in Azure Cognitive Search Blob storage indexer.

Supported content types

By running a Blob indexer over a container, you can extract text and metadata from the following content types with a single query:

  • PDF
  • Microsoft Office formats: DOCX/DOC/DOCM, XLSX/XLS/XLSM, PPTX/PPT/PPTM, MSG (Outlook emails), XML (both 2003 and 2006 Word XML)
  • Open Document formats: ODT, ODS, ODP
  • HTML
  • XML
  • ZIP
  • GZ
  • EPUB
  • EML
  • RTF
  • Plain text files (see also Indexing plain text)
  • JSON (see Indexing JSON blobs)
  • CSV (see Indexing CSV blobs)

Indexing blob metadata

A common scenario that makes it easy to sort through blobs of any content type is to index both custom metadata and system properties for each blob. In this way, information for all blobs is indexed regardless of document type and stored in an index in your search service. Using your new index, you can then sort, filter, and facet across all Blob storage content.
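A minimal sketch of an index definition for this scenario follows. It assumes the `metadata_storage_*` field names that the Blob indexer exposes for system properties, and it uses a hypothetical custom metadata property named `department`. The helper names, the `id` key field, and the choice of attributes are illustrative only, and the mapping of a blob property to the key field is deliberately left out.

```python
import json

def field(name, edm_type, **attrs):
    """Build one field definition; keyword arguments become field attributes."""
    return {"name": name, "type": edm_type, **attrs}

def blob_metadata_index(index_name, custom_fields=()):
    """Build an index definition that holds system properties and custom metadata."""
    fields = [
        field("id", "Edm.String", key=True),
        field("metadata_storage_name", "Edm.String", searchable=True, sortable=True),
        field("metadata_storage_content_type", "Edm.String", filterable=True, facetable=True),
        field("metadata_storage_size", "Edm.Int64", filterable=True, sortable=True),
        field("metadata_storage_last_modified", "Edm.DateTimeOffset", filterable=True, sortable=True),
    ]
    # Custom metadata is treated as plain strings that can be filtered and faceted.
    for name in custom_fields:
        fields.append(field(name, "Edm.String", filterable=True, facetable=True))
    return {"name": index_name, "fields": fields}

if __name__ == "__main__":
    index = blob_metadata_index("blob-metadata-index", custom_fields=["department"])
    print(json.dumps(index, indent=2))
```

Marking the content type as facetable and the size and last-modified fields as sortable is what makes the sort, filter, and facet operations mentioned above possible at query time.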

Indexing JSON blobs

Indexers can be configured to extract structured content found in blobs that contain JSON. An indexer can read JSON blobs and parse the structured content into the appropriate fields of a search document. Indexers can also take blobs that contain an array of JSON objects and map each element to a separate search document. You can set a parsing mode to affect the type of JSON object created by the indexer.
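To make the effect of the parsing mode concrete, here is a small local imitation of the two behaviors described above: one blob becoming one search document, and one blob holding an array becoming one document per element. The function is a sketch written for this article, not the indexer's code. The mode names `json` and `jsonArray` and the `parsingMode` configuration key are believed to match the indexer's REST configuration.

```python
import json

def documents_from_blob(blob_text, parsing_mode="json"):
    """Imitate how a JSON blob turns into one or more search documents."""
    parsed = json.loads(blob_text)
    if parsing_mode == "json":
        # The whole blob is one document.
        return [parsed]
    if parsing_mode == "jsonArray":
        # Each array element is a separate document.
        if not isinstance(parsed, list):
            raise ValueError("jsonArray mode expects a top-level JSON array")
        return list(parsed)
    raise ValueError(f"unsupported parsing mode: {parsing_mode}")

# Fragment of an indexer definition that selects the array behavior.
indexer_parameters = {"parameters": {"configuration": {"parsingMode": "jsonArray"}}}

if __name__ == "__main__":
    blob = '[{"id": "1", "title": "First"}, {"id": "2", "title": "Second"}]'
    for doc in documents_from_blob(blob, "jsonArray"):
        print(doc)
```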

Search blob content in a search index

The output of indexing is a search index, used for interactive exploration using free text and filtered queries in a client app. For initial exploration and verification of content, we recommend starting with Search explorer in the portal to examine document structure. You can use simple query syntax, full query syntax, and filter expression syntax in Search explorer.
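The three syntaxes can be compared side by side by looking at the request body a client would post to the index's search endpoint. The sketch below only builds those bodies and sends nothing. The helper name and the example query strings are assumptions for illustration, and the `metadata_storage_size` field is assumed to exist in the index and to be filterable.

```python
import json

def search_request(search_text, query_type="simple", filter_expr=None, top=10):
    """Build the JSON body for a POST to /indexes/<index>/docs/search."""
    if query_type not in ("simple", "full"):
        raise ValueError("query_type must be 'simple' or 'full'")
    body = {"search": search_text, "queryType": query_type, "top": top, "count": True}
    if filter_expr:
        # OData filter expression, evaluated against filterable fields.
        body["filter"] = filter_expr
    return body

if __name__ == "__main__":
    # Simple syntax: terms and basic operators.
    print(json.dumps(search_request("blob +indexer")))
    # Full (Lucene) syntax: here a fuzzy match on a misspelled term.
    print(json.dumps(search_request("indexr~1", query_type="full")))
    # Filter expression combined with a match-all search.
    print(json.dumps(search_request("*", filter_expr="metadata_storage_size gt 1024")))
```

The same `search` and filter values can be tried first in Search explorer before being wired into client code.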

A more permanent solution is to gather query inputs and present the response as search results in a client application. The following C# tutorial explains how to build a search application: Create your first application in Azure Cognitive Search.

Next steps