
How to index documents in Azure Blob Storage with Azure Cognitive Search

This article shows how to use Azure Cognitive Search to index documents (such as PDFs, Microsoft Office documents, and several other common formats) stored in Azure Blob storage. First, it explains the basics of setting up and configuring a blob indexer. Then, it offers a deeper exploration of behaviors and scenarios you are likely to encounter.

Supported document formats

The blob indexer can extract text from the following document formats:

  • PDF
  • Microsoft Office formats: DOCX/DOC/DOCM, XLSX/XLS/XLSM, PPTX/PPT/PPTM, MSG (Outlook emails), XML (both 2003 and 2006 Word XML)
  • Open Document formats: ODT, ODS, ODP
  • HTML
  • XML
  • ZIP
  • GZ
  • EPUB
  • EML
  • RTF
  • Plain text files (see also Indexing plain text)
  • JSON (see Indexing JSON blobs)
  • CSV (see Indexing CSV blobs)

Setting up blob indexing

You can set up an Azure Blob Storage indexer using the Azure portal, the REST API, or the .NET SDK.

Note

Some features (for example, field mappings) are not yet available in the portal, and have to be used programmatically.

Here, we demonstrate the flow using the REST API.

Step 1: Create a data source

A data source specifies which data to index, the credentials needed to access the data, and policies to efficiently identify changes in the data (new, modified, or deleted rows). A data source can be used by multiple indexers in the same search service.

For blob indexing, the data source must have the following required properties:

  • name is the unique name of the data source within your search service.
  • type must be azureblob.
  • credentials provides the storage account connection string as the credentials.connectionString parameter. See How to specify credentials below for details.
  • container specifies a container in your storage account. By default, all blobs within the container are retrievable. If you only want to index blobs in a particular virtual directory, you can specify that directory using the optional query parameter.

To create a data source:

POST https://[service name].search.windows.net/datasources?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key]

{
    "name" : "blob-datasource",
    "type" : "azureblob",
    "credentials" : { "connectionString" : "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>;" },
    "container" : { "name" : "my-container", "query" : "<optional-virtual-directory-name>" }
}   

For more on the Create Datasource API, see Create Datasource.
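If you script this call, it helps to build the request body separately so it can be validated before it is sent. Here's a minimal Python sketch that assembles the data source payload shown above (the connection string and names are placeholders you'd substitute):

```python
import json

def build_blob_datasource(name, connection_string, container, folder=None):
    """Assemble the JSON body for the Create Datasource request."""
    datasource = {
        "name": name,
        "type": "azureblob",  # required value for blob indexing
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }
    if folder:
        # Optional: restrict indexing to one virtual directory.
        datasource["container"]["query"] = folder
    return datasource

body = json.dumps(build_blob_datasource(
    "blob-datasource",
    "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>;",
    "my-container"))
```

POST the resulting body to https://[service name].search.windows.net/datasources?api-version=2019-05-06 with your admin api-key, as shown above.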

How to specify credentials

You can provide the credentials for the blob container in one of these ways:

  • Full access storage account connection string: DefaultEndpointsProtocol=https;AccountName=<your storage account>;AccountKey=<your account key>. You can get the connection string from the Azure portal by navigating to the storage account blade > Settings > Keys (for Classic storage accounts) or Settings > Access keys (for Azure Resource Manager storage accounts).
  • Storage account shared access signature (SAS) connection string: BlobEndpoint=https://<your account>.blob.core.windows.net/;SharedAccessSignature=?sv=2016-05-31&sig=<the signature>&spr=https&se=<the validity end time>&srt=co&ss=b&sp=rl. The SAS should have the list and read permissions on containers and objects (blobs in this case).
  • Container shared access signature: ContainerSharedAccessUri=https://<your storage account>.blob.core.windows.net/<container name>?sv=2016-05-31&sr=c&sig=<the signature>&se=<the validity end time>&sp=rl. The SAS should have the list and read permissions on the container.

For more info on storage shared access signatures, see Using Shared Access Signatures.

Note

If you use SAS credentials, you will need to update the data source credentials periodically with renewed signatures to prevent their expiration. If SAS credentials expire, the indexer fails with an error message similar to Credentials provided in the connection string are invalid or have expired.

Step 2: Create an index

The index specifies the fields in a document, attributes, and other constructs that shape the search experience.

Here's how to create an index with a searchable content field to store the text extracted from blobs:

POST https://[service name].search.windows.net/indexes?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key]

{
      "name" : "my-target-index",
      "fields": [
        { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
        { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false }
      ]
}

For more on creating indexes, see Create Index.

Step 3: Create an indexer

An indexer connects a data source with a target search index, and provides a schedule to automate the data refresh.

Once the index and data source have been created, you're ready to create the indexer:

POST https://[service name].search.windows.net/indexers?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key]

{
  "name" : "blob-indexer",
  "dataSourceName" : "blob-datasource",
  "targetIndexName" : "my-target-index",
  "schedule" : { "interval" : "PT2H" }
}

This indexer will run every two hours (the schedule interval is set to "PT2H"). To run an indexer every 30 minutes, set the interval to "PT30M". The shortest supported interval is 5 minutes. The schedule is optional - if omitted, an indexer runs only once when it's created. However, you can run an indexer on-demand at any time.
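Intervals like "PT2H" and "PT30M" are ISO 8601 durations. A small helper can sanity-check them before you submit an indexer definition (a sketch that assumes only hour and minute components appear, as in the examples above):

```python
import re

def interval_minutes(interval: str) -> int:
    """Convert an ISO 8601 duration such as "PT2H" or "PT30M" to minutes.

    Assumption: only hour/minute components are used, as in this article's
    examples; other duration components are rejected.
    """
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", interval)
    if not match:
        raise ValueError(f"unsupported interval: {interval}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    total = hours * 60 + minutes
    if total < 5:
        raise ValueError("the shortest supported interval is 5 minutes")
    return total
```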

For more details on the Create Indexer API, see Create Indexer.

For more information about defining indexer schedules, see How to schedule indexers for Azure Cognitive Search.

How Azure Cognitive Search indexes blobs

Depending on the indexer configuration, the blob indexer can index storage metadata only (useful when you only care about the metadata and don't need to index the content of blobs), storage and content metadata, or both metadata and textual content. By default, the indexer extracts both metadata and content.

Note

By default, blobs with structured content such as JSON or CSV are indexed as a single chunk of text. If you want to index JSON and CSV blobs in a structured way, see Indexing JSON blobs and Indexing CSV blobs for more information.

A compound or embedded document (such as a ZIP archive, or a Word document with an embedded Outlook email containing attachments) is also indexed as a single document.

  • The textual content of the document is extracted into a string field named content.

Note

Azure Cognitive Search limits how much text it extracts depending on the pricing tier: 32,000 characters for the Free tier, 64,000 for Basic, and 4 million for the Standard, Standard S2, and Standard S3 tiers. A warning is included in the indexer status response for truncated documents.
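As an illustration, those per-tier limits behave like a simple character cutoff (the tier labels below are informal, and the figures are the ones quoted above - verify against current service limits before relying on them):

```python
# Characters of extracted text kept per document, by pricing tier
# (figures quoted in this article; verify against current service limits).
TEXT_EXTRACTION_LIMITS = {
    "free": 32_000,
    "basic": 64_000,
    "standard": 4_000_000,
    "standard_s2": 4_000_000,
    "standard_s3": 4_000_000,
}

def truncate_extracted_text(text: str, tier: str) -> str:
    """Simulate the truncation the service applies to extracted text."""
    return text[:TEXT_EXTRACTION_LIMITS[tier]]
```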

  • User-specified metadata properties present on the blob, if any, are extracted verbatim.

  • Standard blob metadata properties are extracted into the following fields:

    • metadata_storage_name (Edm.String) - the file name of the blob. For example, if you have a blob /my-container/my-folder/subfolder/resume.pdf, the value of this field is resume.pdf.
    • metadata_storage_path (Edm.String) - the full URI of the blob, including the storage account. For example, https://myaccount.blob.core.windows.net/my-container/my-folder/subfolder/resume.pdf.
    • metadata_storage_content_type (Edm.String) - the content type as specified by the code you used to upload the blob. For example, application/octet-stream.
    • metadata_storage_last_modified (Edm.DateTimeOffset) - the last modified timestamp for the blob. Azure Cognitive Search uses this timestamp to identify changed blobs, to avoid reindexing everything after the initial indexing.
    • metadata_storage_size (Edm.Int64) - the blob size in bytes.
    • metadata_storage_content_md5 (Edm.String) - the MD5 hash of the blob content, if available.
    • metadata_storage_sas_token (Edm.String) - a temporary SAS token that can be used by custom skills to get access to the blob. This token should not be stored for later use as it might expire.
  • Metadata properties specific to each document format are extracted into the fields listed in the Content type-specific metadata properties section below.

You don't need to define fields for all of the above properties in your search index - just capture the properties you need for your application.

Note

Often, the field names in your existing index will be different from the field names generated during document extraction. You can use field mappings to map the property names provided by Azure Cognitive Search to the field names in your search index. You will see an example of field mappings use below.

Defining document keys and field mappings

In Azure Cognitive Search, the document key uniquely identifies a document. Every search index must have exactly one key field of type Edm.String. The key field is required for each document that is being added to the index (it is actually the only required field).

You should carefully consider which extracted field should map to the key field for your index. The candidates are:

  • metadata_storage_name - this might be a convenient candidate, but note that 1) the names might not be unique, as you may have blobs with the same name in different folders, and 2) the name may contain characters that are invalid in document keys, such as dashes. You can deal with invalid characters by using the base64Encode field mapping function - if you do this, remember to encode document keys when passing them in API calls such as Lookup. (For example, in .NET you can use the UrlTokenEncode method for that purpose.)
  • metadata_storage_path - using the full path ensures uniqueness, but the path definitely contains / characters that are invalid in a document key. As above, you have the option of encoding the keys using the base64Encode function.
  • If none of the options above work for you, you can add a custom metadata property to the blobs. This option does, however, require your blob upload process to add that metadata property to all blobs. Since the key is a required property, all blobs that don't have that property will fail to be indexed.
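That encoding can be sketched in Python as URL-safe base64 with padding stripped, which is in the same spirit as the base64Encode mapping function (the exact padding convention is an assumption here - compare against keys your service actually produces before relying on it):

```python
import base64

def encode_document_key(value: str) -> str:
    """URL-safe base64 with padding stripped, so characters such as '/'
    become safe for use in a document key.

    Assumption: this mirrors the base64Encode field mapping function;
    verify against keys emitted by your own service.
    """
    encoded = base64.urlsafe_b64encode(value.encode("utf-8"))
    return encoded.decode("ascii").rstrip("=")

key = encode_document_key(
    "https://myaccount.blob.core.windows.net/my-container/my-folder/resume.pdf")
```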

Important

If there is no explicit mapping for the key field in the index, Azure Cognitive Search automatically uses metadata_storage_path as the key and base64-encodes key values (the second option above).

For this example, let's pick the metadata_storage_name field as the document key. Let's also assume your index has a key field named key and a field fileSize for storing the document size. To wire things up as desired, specify the following field mappings when creating or updating your indexer:

"fieldMappings" : [
  { "sourceFieldName" : "metadata_storage_name", "targetFieldName" : "key", "mappingFunction" : { "name" : "base64Encode" } },
  { "sourceFieldName" : "metadata_storage_size", "targetFieldName" : "fileSize" }
]

To bring this all together, here's how you can add field mappings and enable base64 encoding of keys for an existing indexer:

PUT https://[service name].search.windows.net/indexers/blob-indexer?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key]

{
  "dataSourceName" : "blob-datasource",
  "targetIndexName" : "my-target-index",
  "schedule" : { "interval" : "PT2H" },
  "fieldMappings" : [
    { "sourceFieldName" : "metadata_storage_name", "targetFieldName" : "key", "mappingFunction" : { "name" : "base64Encode" } },
    { "sourceFieldName" : "metadata_storage_size", "targetFieldName" : "fileSize" }
  ]
}

Note

To learn more about field mappings, see this article.

Controlling which blobs are indexed

You can control which blobs are indexed, and which are skipped.

Index only the blobs with specific file extensions

You can index only the blobs with the file name extensions you specify by using the indexedFileNameExtensions indexer configuration parameter. The value is a string containing a comma-separated list of file extensions (with a leading dot). For example, to index only the .PDF and .DOCX blobs, do this:

PUT https://[service name].search.windows.net/indexers/[indexer name]?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key]

{
  ... other parts of indexer definition
  "parameters" : { "configuration" : { "indexedFileNameExtensions" : ".pdf,.docx" } }
}

Exclude blobs with specific file extensions

You can exclude blobs with specific file name extensions from indexing by using the excludedFileNameExtensions configuration parameter. The value is a string containing a comma-separated list of file extensions (with a leading dot). For example, to index all blobs except those with the .PNG and .JPEG extensions, do this:

PUT https://[service name].search.windows.net/indexers/[indexer name]?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key]

{
  ... other parts of indexer definition
  "parameters" : { "configuration" : { "excludedFileNameExtensions" : ".png,.jpeg" } }
}

If both the indexedFileNameExtensions and excludedFileNameExtensions parameters are present, Azure Cognitive Search first looks at indexedFileNameExtensions, then at excludedFileNameExtensions. This means that if the same file extension is present in both lists, it will be excluded from indexing.
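That precedence can be modeled as a tiny filter - a sketch of the documented behavior, not the service's actual implementation:

```python
import os

def blob_is_indexed(blob_name, indexed_exts=None, excluded_exts=None):
    """Apply indexedFileNameExtensions first, then
    excludedFileNameExtensions, as described above."""
    ext = os.path.splitext(blob_name)[1].lower()
    if indexed_exts is not None and ext not in indexed_exts:
        return False  # not on the inclusion list
    if excluded_exts is not None and ext in excluded_exts:
        return False  # exclusion wins for extensions on both lists
    return True
```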

Controlling which parts of the blob are indexed

You can control which parts of the blobs are indexed using the dataToExtract configuration parameter. It can take the following values: storageMetadata (index only the standard blob properties and user-specified metadata), allMetadata (additionally extract metadata provided by the content type), and contentAndMetadata (the default - extract all metadata and textual content).

For example, to index only the storage metadata, use:

PUT https://[service name].search.windows.net/indexers/[indexer name]?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key]

{
  ... other parts of indexer definition
  "parameters" : { "configuration" : { "dataToExtract" : "storageMetadata" } }
}

Using blob metadata to control how blobs are indexed

The configuration parameters described above apply to all blobs. Sometimes, you may want to control how individual blobs are indexed. You can do this by adding the following blob metadata properties and values:

  • AzureSearch_Skip with the value "true" - instructs the blob indexer to skip the blob entirely. Neither metadata nor content extraction is attempted. This is useful when a particular blob fails repeatedly and interrupts the indexing process.
  • AzureSearch_SkipContent with the value "true" - equivalent to the "dataToExtract" : "allMetadata" setting described above, scoped to the individual blob.

Dealing with errors

By default, the blob indexer stops as soon as it encounters a blob with an unsupported content type (for example, an image). You can of course use the excludedFileNameExtensions parameter to skip certain content types. However, you may need to index blobs without knowing all the possible content types in advance. To continue indexing when an unsupported content type is encountered, set the failOnUnsupportedContentType configuration parameter to false:

PUT https://[service name].search.windows.net/indexers/[indexer name]?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key]

{
  ... other parts of indexer definition
  "parameters" : { "configuration" : { "failOnUnsupportedContentType" : false } }
}

For some blobs, Azure Cognitive Search is unable to determine the content type, or unable to process a document of an otherwise supported content type. To ignore this failure mode, set the failOnUnprocessableDocument configuration parameter to false:

  "parameters" : { "configuration" : { "failOnUnprocessableDocument" : false } }

Azure Cognitive Search limits the size of blobs that are indexed. These limits are documented in Service Limits in Azure Cognitive Search. Oversized blobs are treated as errors by default. However, you can still index the storage metadata of oversized blobs if you set the indexStorageMetadataOnlyForOversizedDocuments configuration parameter to true:

"parameters" : { "configuration" : { "indexStorageMetadataOnlyForOversizedDocuments" : true } }

You can also continue indexing if errors happen at any point of processing, either while parsing blobs or while adding documents to an index. To ignore a specific number of errors, set the maxFailedItems and maxFailedItemsPerBatch configuration parameters to the desired values. For example:

{
  ... other parts of indexer definition
  "parameters" : { "maxFailedItems" : 10, "maxFailedItemsPerBatch" : 10 }
}

Incremental indexing and deletion detection

When you set up a blob indexer to run on a schedule, it reindexes only the changed blobs, as determined by the blob's LastModified timestamp.

Note

You don't have to specify a change detection policy - incremental indexing is enabled for you automatically.

To support deleting documents, use a "soft delete" approach. If you delete the blobs outright, corresponding documents will not be removed from the search index. Instead, use the following steps:

  1. Add a custom metadata property to the blob to indicate to Azure Cognitive Search that it is logically deleted
  2. Configure a soft-delete detection policy on the data source
  3. Once the indexer has processed the blob (as shown by the indexer status API), you can physically delete the blob

For example, the following policy considers a blob to be deleted if it has a metadata property IsDeleted with the value true:

PUT https://[service name].search.windows.net/datasources/blob-datasource?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key]

{
    "name" : "blob-datasource",
    "type" : "azureblob",
    "credentials" : { "connectionString" : "<your storage connection string>" },
    "container" : { "name" : "my-container", "query" : "my-folder" },
    "dataDeletionDetectionPolicy" : {
        "@odata.type" :"#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",     
        "softDeleteColumnName" : "IsDeleted",
        "softDeleteMarkerValue" : "true"
    }
}   
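On the client side, you can apply the same rule the policy describes - for example, to confirm a blob is marked before physically deleting it. A sketch, assuming the blob's metadata is available as a plain dict:

```python
def is_soft_deleted(blob_metadata: dict,
                    column: str = "IsDeleted",
                    marker: str = "true") -> bool:
    """Mirror the SoftDeleteColumnDeletionDetectionPolicy above: a blob
    counts as deleted when the named metadata property equals the marker."""
    return blob_metadata.get(column) == marker
```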

Indexing large datasets

Indexing blobs can be a time-consuming process. In cases where you have millions of blobs to index, you can speed up indexing by partitioning your data and using multiple indexers to process the data in parallel. Here's how you can set this up:

  • Partition your data into multiple blob containers or virtual folders

  • Set up several Azure Cognitive Search data sources, one per container or folder. To point to a blob folder, use the query parameter:

    {
        "name" : "blob-datasource",
        "type" : "azureblob",
        "credentials" : { "connectionString" : "<your storage connection string>" },
        "container" : { "name" : "my-container", "query" : "my-folder" }
    }
    
  • Create a corresponding indexer for each data source. All the indexers can point to the same target search index.

  • One search unit in your service can run one indexer at any given time. Creating multiple indexers as described above is only useful if they actually run in parallel. To run multiple indexers in parallel, scale out your search service by creating an appropriate number of partitions and replicas. For example, if your search service has 6 search units (for example, 2 partitions x 3 replicas), then 6 indexers can run simultaneously, resulting in a six-fold increase in indexing throughput. To learn more about scaling and capacity planning, see Scale resource levels for query and indexing workloads in Azure Cognitive Search.
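The capacity math in the steps above can be sketched as a small planning helper (the indexer names are illustrative):

```python
def parallel_indexing_plan(folders, partitions, replicas):
    """One data source + indexer per folder; concurrency is capped by
    the number of search units (partitions x replicas)."""
    search_units = partitions * replicas
    indexers = [f"blob-indexer-{folder}" for folder in folders]
    return {
        "search_units": search_units,
        "indexers": indexers,
        "running_in_parallel": min(len(indexers), search_units),
    }

plan = parallel_indexing_plan(["2019-q1", "2019-q2", "2019-q3"],
                              partitions=2, replicas=3)
```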

You may want to "assemble" documents from multiple sources in your index. For example, you may want to merge text from blobs with other metadata stored in Cosmos DB. You can even use the push indexing API together with various indexers to build up search documents from multiple parts.

For this to work, all indexers and other components need to agree on the document key. For additional details on this topic, refer to Index multiple Azure data sources. For a detailed walk-through, see this external article: Combine documents with other data in Azure Cognitive Search.

Indexing plain text

If all your blobs contain plain text in the same encoding, you can significantly improve indexing performance by using text parsing mode. To use text parsing mode, set the parsingMode configuration property to text:

PUT https://[service name].search.windows.net/indexers/[indexer name]?api-version=2019-05-06
Content-Type: application/json
api-key: [admin key]

{
  ... other parts of indexer definition
  "parameters" : { "configuration" : { "parsingMode" : "text" } }
}

By default, UTF-8 encoding is assumed. To specify a different encoding, use the encoding configuration property:

{
  ... other parts of indexer definition
  "parameters" : { "configuration" : { "parsingMode" : "text", "encoding" : "windows-1252" } }
}
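To see why the encoding property matters, here's a quick illustration: bytes written as windows-1252 are not, in general, valid UTF-8, so the wrong assumption makes decoding fail outright rather than merely look odd:

```python
# Curly quotes occupy 0x93/0x94 in windows-1252 - not valid UTF-8 bytes.
raw = "\u201cquoted\u201d".encode("windows-1252")

try:
    raw.decode("utf-8")          # what the default assumption would do
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False              # decoding fails on the 0x93 byte

text = raw.decode("windows-1252")  # what "encoding": "windows-1252" asks for
```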

Content type-specific metadata properties

The following table summarizes the processing done for each document format, and describes the metadata properties extracted by Azure Cognitive Search.

HTML (text/html)
  Metadata properties: metadata_content_encoding, metadata_content_type, metadata_language, metadata_description, metadata_keywords, metadata_title
  Processing: Strip HTML markup and extract text

PDF (application/pdf)
  Metadata properties: metadata_content_type, metadata_language, metadata_author, metadata_title
  Processing: Extract text, including embedded documents (excluding images)

DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document)
  Metadata properties: metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count
  Processing: Extract text, including embedded documents

DOC (application/msword)
  Metadata properties: metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count
  Processing: Extract text, including embedded documents

DOCM (application/vnd.ms-word.document.macroenabled.12)
  Metadata properties: metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count
  Processing: Extract text, including embedded documents

WORD XML (application/vnd.ms-word2006ml)
  Metadata properties: metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count
  Processing: Strip XML markup and extract text

WORD 2003 XML (application/vnd.ms-wordml)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date
  Processing: Strip XML markup and extract text

XLSX (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified
  Processing: Extract text, including embedded documents

XLS (application/vnd.ms-excel)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified
  Processing: Extract text, including embedded documents

XLSM (application/vnd.ms-excel.sheet.macroenabled.12)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified
  Processing: Extract text, including embedded documents

PPTX (application/vnd.openxmlformats-officedocument.presentationml.presentation)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_slide_count, metadata_title
  Processing: Extract text, including embedded documents

PPT (application/vnd.ms-powerpoint)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_slide_count, metadata_title
  Processing: Extract text, including embedded documents

PPTM (application/vnd.ms-powerpoint.presentation.macroenabled.12)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, metadata_slide_count, metadata_title
  Processing: Extract text, including embedded documents

MSG (application/vnd.ms-outlook)
  Metadata properties: metadata_content_type, metadata_message_from, metadata_message_from_email, metadata_message_to, metadata_message_to_email, metadata_message_cc, metadata_message_cc_email, metadata_message_bcc, metadata_message_bcc_email, metadata_creation_date, metadata_last_modified, metadata_subject
  Processing: Extract text, including attachments

ODT (application/vnd.oasis.opendocument.text)
  Metadata properties: metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_last_modified, metadata_page_count, metadata_word_count
  Processing: Extract text, including embedded documents

ODS (application/vnd.oasis.opendocument.spreadsheet)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified
  Processing: Extract text, including embedded documents

ODP (application/vnd.oasis.opendocument.presentation)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date, metadata_last_modified, title
  Processing: Extract text, including embedded documents

ZIP (application/zip)
  Metadata properties: metadata_content_type
  Processing: Extract text from all documents in the archive

GZ (application/gzip)
  Metadata properties: metadata_content_type
  Processing: Extract text from all documents in the archive

EPUB (application/epub+zip)
  Metadata properties: metadata_content_type, metadata_author, metadata_creation_date, metadata_title, metadata_description, metadata_language, metadata_keywords, metadata_identifier, metadata_publisher
  Processing: Extract text from all documents in the archive

XML (application/xml)
  Metadata properties: metadata_content_type, metadata_content_encoding
  Processing: Strip XML markup and extract text

JSON (application/json)
  Metadata properties: metadata_content_type, metadata_content_encoding
  Processing: Extract text. NOTE: If you need to extract multiple document fields from a JSON blob, see Indexing JSON blobs for details.

EML (message/rfc822)
  Metadata properties: metadata_content_type, metadata_message_from, metadata_message_to, metadata_message_cc, metadata_creation_date, metadata_subject
  Processing: Extract text, including attachments

RTF (application/rtf)
  Metadata properties: metadata_content_type, metadata_author, metadata_character_count, metadata_creation_date, metadata_page_count, metadata_word_count
  Processing: Extract text

Plain text (text/plain)
  Metadata properties: metadata_content_type, metadata_content_encoding
  Processing: Extract text

Help us make Azure Cognitive Search better

If you have feature requests or ideas for improvements, let us know on our UserVoice site.