Tutorial: Search unstructured data in cloud storage

In this tutorial, you learn how to use Azure Search to search unstructured data stored in Azure Blob storage. Unstructured data is data that either is not organized in a predefined manner or does not have a data model. An example is a .txt file.

This tutorial requires an Azure subscription. If you don't have one, create a free account before you begin.

In this tutorial, you learn how to:

  • Create a resource group
  • Create a storage account
  • Create a container
  • Upload data to your container
  • Create a search service through the portal
  • Connect a search service to a storage account
  • Create a data source
  • Configure the index
  • Create an indexer
  • Use the search service to search your container


Every storage account must belong to an Azure resource group. A resource group is a logical container for grouping your Azure services. When you create a storage account, you can either create a new resource group or use an existing one. This tutorial creates a new resource group.

Sign in to the Azure portal.

To create a general-purpose v2 storage account in the Azure portal, follow these steps:

  1. In the Azure portal, select All services. In the list of resources, type Storage Accounts. As you begin typing, the list filters based on your input. Select Storage Accounts.

  2. On the Storage Accounts window that appears, choose Add.

  3. Select the subscription in which to create the storage account.

  4. Under the Resource group field, select Create new. Enter a name for your new resource group.

  5. Next, enter a name for your storage account. The name you choose must be unique across Azure. The name must also be between 3 and 24 characters long, and can include numbers and lowercase letters only.

  6. Select a location for your storage account, or use the default location.

  7. Leave these fields set to their default values:

    Field              Value
    Deployment model   Resource Manager
    Performance        Standard
    Account kind       StorageV2 (general-purpose v2)
    Replication        Read-access geo-redundant storage (RA-GRS)
    Access tier        Hot

  8. Select Review + Create to review your storage account settings and create the account.

  9. Select Create.
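
The naming rule from step 5 (3 to 24 characters, numbers and lowercase letters only) can be checked locally before you open the portal. The following is a minimal sketch; the function name is hypothetical and not part of any Azure SDK, and it cannot check whether a name is already taken, because uniqueness across Azure is only verified by the service.

```python
import re

# Rule stated in step 5: 3 to 24 characters, digits and lowercase letters only.
# Global uniqueness cannot be verified offline and is not covered here.
_NAME_PATTERN = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name: str) -> bool:
    """Return True if the name satisfies the length and character rules."""
    return _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    # Hypothetical candidate names, used only for illustration.
    for candidate in ["clinicaltrials01", "My-Storage", "ab"]:
        print(candidate, is_valid_storage_account_name(candidate))
```

A name that passes this check can still be rejected by the portal if another Azure customer already uses it.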

For more information about types of storage accounts and other storage account settings, see Azure storage account overview. For more information on resource groups, see Azure Resource Manager overview.

A sample data set has been prepared for this tutorial. Download clinical-trials.zip and unzip it to its own folder.

The sample consists of text files obtained from clinicaltrials.gov. This tutorial uses them as example text files to search with Azure Search.

Create a container

Containers are similar to folders and are used to store blobs.

For this tutorial, you use a single container to store the text files obtained from clinicaltrials.gov.

  1. Go to your storage account in the Azure portal.

  2. Select Browse blobs under Blob Service.

  3. Add a new container.

  4. Name the container data and select Container for the public access level.

  5. Select OK to create the container.


Upload the example data

Now that you have a container, you can upload your example data to it.

  1. Select your container and select Upload.

  2. Select the blue folder icon next to the Files field and browse to the local folder where you extracted the sample data.

  3. Select all of the extracted files and select Open.

  4. Select Upload to begin the upload process.

The upload process might take a moment.

After it finishes, go back into your data container to confirm that the text files were uploaded.


Create a search service

Azure Search is a search-as-a-service cloud solution that gives developers APIs and tools for adding a search experience over their data.

For this tutorial, you use a search service to search the text files obtained from clinicaltrials.gov.

  1. Go to your storage account in the Azure portal.

  2. Scroll down and select Add Azure Search under BLOB SERVICE.

  3. Under Import Data, select Pick your service.

  4. Select Click here to create a new search service.

  5. Inside New Search Service, enter a unique name for your search service in the URL field.

  6. Under Resource group, select Use existing and choose the resource group that you created earlier.

  7. For Pricing tier, select the Free tier and click Select.

  8. Select Create to create the search service.


Connect your search service to your container

Now that you have a search service, you can attach it to your blob storage. This section walks you through choosing a data source, creating an index, and creating an indexer.

  1. Go to your storage account.

  2. Select Add Azure Search under BLOB SERVICE.

  3. Select Search Service inside Import Data, and then click the search service that you created in the preceding section. This opens New data source.

Create a data source

A data source specifies which data to index and how to access the data. A data source can be used multiple times by the same search service.

  1. Enter a name for the data source. Under Data to extract, select Content and Metadata. The data source specifies which parts of the blob are indexed.

  2. Because the blobs you're using are text files, set Parsing Mode to Text.

  3. Select Storage Container to list the available storage accounts.

  4. Select your storage account, and then select the container that you created previously.

  5. Click Select to return to New data source, and select OK to continue.
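
The portal hides the definition it creates, but it can help to see a data source written out. The sketch below builds a dictionary that is assumed to resemble the JSON body used when a blob data source is defined through the Azure Search REST API instead of the portal. The function name, the data source name, and the placeholder connection string are hypothetical; the tutorial itself only uses the portal.

```python
import json


def build_data_source(name: str, connection_string: str, container: str) -> dict:
    """Describe which data to index (the container) and how to access it."""
    return {
        "name": name,
        # "azureblob" is assumed to be the type identifier for Blob storage.
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }


if __name__ == "__main__":
    # Hypothetical values; "data" is the container created earlier in this tutorial.
    definition = build_data_source(
        name="clinical-trials-ds",
        connection_string="<storage-connection-string>",
        container="data",
    )
    print(json.dumps(definition, indent=2))
```

Because the same search service can reuse a data source, a definition like this would only need to be created once per container.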

Configure the index

An index is a collection of fields from your data source that can be searched. You set and configure parameters on these fields so that your search service knows how your data should be searched.

  1. In Import data, select Customize target index.

  2. Enter a name for your index in the Index name field.

  3. Select the Retrievable attribute's check box under metadata_storage_name.

  4. Select OK, which brings up Create an Indexer.

The parameters (fields) of your index and the attributes you assign to them determine how you can query your data. The parameters specify what data to store, and the attributes specify how to store that data.

The FIELD NAME column contains the parameters. The following table lists the available attributes and their descriptions.

Field attributes

Attribute     Description
Key           A string that provides the unique ID of each document, used for document lookup. Every index must have one key. Only one field can be the key, and its type must be set to Edm.String.
Retrievable   Specifies whether a field can be returned in a search result.
Filterable    Allows the field to be used in filter queries.
Sortable      Allows a query to sort search results using this field.
Facetable     Allows a field to be used in a faceted navigation structure for user self-directed filtering. Typically, fields containing repetitive values that you can use to group documents together (for example, multiple documents that fall under a single brand or service category) work best as facets.
Searchable    Marks the field as full-text searchable.
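
To make the key rule in the table concrete, the following sketch represents index fields as plain dictionaries and checks that exactly one field is the key and that its type is Edm.String. The helper name is hypothetical, and the choice of metadata_storage_path as the key and content as the searchable field is an assumption for illustration; only metadata_storage_name being retrievable comes from the steps above.

```python
from typing import Dict, List


def validate_index_fields(fields: List[Dict]) -> None:
    """Raise ValueError unless exactly one field is an Edm.String key."""
    keys = [field for field in fields if field.get("key")]
    if len(keys) != 1:
        raise ValueError(f"expected exactly one key field, found {len(keys)}")
    if keys[0].get("type") != "Edm.String":
        raise ValueError("the key field must have type Edm.String")


# Hypothetical field list. Attribute names follow the table above.
EXAMPLE_FIELDS = [
    {"name": "metadata_storage_path", "type": "Edm.String", "key": True,
     "retrievable": True},
    {"name": "metadata_storage_name", "type": "Edm.String", "key": False,
     "retrievable": True, "sortable": True},
    {"name": "content", "type": "Edm.String", "key": False,
     "searchable": True, "retrievable": True},
]

if __name__ == "__main__":
    validate_index_fields(EXAMPLE_FIELDS)  # passes without raising
    print("field list is valid")
```

A field list with no key, two keys, or a non-string key is rejected, which mirrors the constraint described in the Key row.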

Create an indexer

An indexer connects a data source with a search index, and provides a schedule to reindex your data.

  1. Enter a name in the Name field and select OK.

  2. You are brought back to Import Data. Select OK to complete the connection process.

You've now successfully connected your blob to your search service. It takes a few minutes for the portal to show that the index is populated. However, the search service begins indexing immediately, so you can begin searching right away.
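
An indexer ties the earlier pieces together: it names a data source, a target index, and optionally a schedule. The sketch below builds a dictionary that is assumed to resemble an indexer definition in the Azure Search REST API, including the Text parsing mode and the Content and Metadata extraction chosen in the portal. All names and the two-hour interval are hypothetical, and the tutorial itself creates the indexer through the portal.

```python
import json
from typing import Optional


def build_indexer(name: str, data_source: str, index: str,
                  interval: Optional[str] = None) -> dict:
    """Connect a data source to a target index, with an optional schedule."""
    definition = {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "parameters": {
            "configuration": {
                # Assumed to correspond to the portal's Parsing Mode and
                # Data to extract settings chosen for the data source.
                "parsingMode": "text",
                "dataToExtract": "contentAndMetadata",
            }
        },
    }
    if interval is not None:
        # ISO 8601 duration, for example "PT2H" for every two hours.
        definition["schedule"] = {"interval": interval}
    return definition


if __name__ == "__main__":
    # Hypothetical names for illustration only.
    print(json.dumps(
        build_indexer("clinical-trials-indexer", "clinical-trials-ds",
                      "clinical-trials-index", interval="PT2H"),
        indent=2,
    ))
```

Omitting the interval describes an indexer that runs once when it is created, which matches the behavior described above, where indexing begins immediately.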

Search your text files

To search your files, open the search explorer inside the index of your newly created search service.

The following steps show you where to find the search explorer and provide some example queries:

  1. Go to all resources and find your newly created search service.

  2. Select your index to open it.

  3. Select Search Explorer to open the search explorer, where you can make live queries on your data.

  4. Select Search while the query string field is empty. An empty query returns all the data from your blobs.

  5. Enter Myopia in the Query string field, and select Search. This step searches the files' contents and returns the subset of files that contain the word "Myopia."

In addition to a full-text search, you can use the $select parameter to limit the results to specific fields, such as system properties.

Enter $select=metadata_storage_name into the query string and press Enter. Only that field is returned for each result.

The query string is appended directly to the request URL, so spaces are not permitted. To return multiple fields, separate them with commas, such as: $select=metadata_storage_name,metadata_storage_path

The $select parameter can only be used with fields that were marked Retrievable when you defined the index.
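
Because spaces are not permitted in the query string, it can be convenient to assemble it programmatically. The following is a minimal sketch using only the Python standard library. The function name is hypothetical, and combining a search= term with $select in one string is an assumption about what the search explorer accepts rather than something shown in the steps above.

```python
from urllib.parse import quote


def build_query_string(search_text: str = "", select_fields=()) -> str:
    """Build a query string without spaces for the search explorer."""
    for field in select_fields:
        if any(ch.isspace() for ch in field):
            raise ValueError(f"field name must not contain spaces: {field!r}")
    parts = []
    if search_text:
        # Percent-encode the term so that spaces never reach the URL.
        parts.append("search=" + quote(search_text))
    if select_fields:
        # Multiple fields are separated by commas, with no spaces.
        parts.append("$select=" + ",".join(select_fields))
    return "&".join(parts)


if __name__ == "__main__":
    print(build_query_string(
        select_fields=["metadata_storage_name", "metadata_storage_path"]))
    print(build_query_string("Myopia", ["metadata_storage_name"]))
```

Only fields marked Retrievable in the index can be listed in select_fields; the sketch does not check this, because it has no access to the index definition.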


You have now completed this tutorial and have a searchable set of unstructured data.

Clean up resources

The easiest way to remove the resources you've created is to delete the resource group. Removing the resource group also deletes all resources included within the group. In this tutorial, that means the storage account, the search service, and the resource group itself.

  1. In the Azure portal, go to the list of resource groups in your subscription.
  2. Select the resource group that you want to delete.
  3. Select the Delete resource group button and enter the name of the resource group in the deletion field.
  4. Select Delete.

