Create a basic index in Azure Cognitive Search

In Azure Cognitive Search, an index is a persistent store of documents and other constructs used for filtered and full-text search on an Azure Cognitive Search service. Conceptually, a document is a single unit of searchable data in your index. For example, an e-commerce retailer might have a document for each item it sells, and a news organization might have a document for each article. Mapping these concepts to more familiar database equivalents: an index is conceptually similar to a table, and documents are roughly equivalent to rows in a table.

When you add or upload an index, Azure Cognitive Search creates physical structures based on the schema you provide. For example, if a field in your index is marked as searchable, an inverted index is created for that field. Later, when you add or upload documents, or submit search queries to Azure Cognitive Search, you are sending requests to a specific index in your search service. Loading fields with document values is called indexing or data ingestion.
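
To make the idea of an inverted index concrete, the following is a conceptual sketch only. It is not how Azure Cognitive Search builds its internal structures; the function name, the whitespace tokenizer, and the sample documents are assumptions chosen for illustration.

```python
from collections import defaultdict


def build_inverted_index(documents, field):
    """Map each lowercase term found in `field` to the sorted keys of the documents containing it."""
    index = defaultdict(set)
    for doc in documents:
        # Naive tokenization: lowercase and split on whitespace.
        for term in doc[field].lower().split():
            index[term].add(doc["id"])
    return {term: sorted(keys) for term, keys in index.items()}


docs = [
    {"id": "1", "description": "Quiet hotel near the beach"},
    {"id": "2", "description": "Budget hotel downtown"},
]
inverted = build_inverted_index(docs, "description")
print(inverted["hotel"])  # ['1', '2']
print(inverted["beach"])  # ['1']
```

A term lookup returns the matching document keys directly, which is why a searchable field can answer full-text queries without scanning every document.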

You can create an index in the portal, with the REST API, or with the .NET SDK.

Arriving at the right index design typically takes multiple iterations. Using a combination of tools and APIs can help you finalize your design quickly.

  1. Determine whether you can use an indexer. If your external data is one of the supported data sources, you can prototype and load an index using the Import data wizard.

  2. If you can't use Import data, you can still create an initial index in the portal, adding fields, data types, and assigning attributes using controls on the Add Index page. The portal shows you which attributes are available for different data types. If you're new to index design, this is helpful.

    Add index page showing attributes by data type

    When you click Create, all of the physical structures supporting your index are created in your search service.

  3. Download the index schema using the Get Index REST API and a web testing tool like Postman. You now have a JSON representation of the index you created in the portal.

    You are switching to a code-based approach at this point. The portal is not well suited for iteration because you cannot edit an index that is already created. But you can use Postman and REST for the remaining tasks.

  4. Load your index with data. Azure Cognitive Search accepts JSON documents. To load your data programmatically, you can use Postman with JSON documents in the request payload. If your data is not easily expressed as JSON, this step will be the most labor intensive.

  5. Query your index, examine results, and further iterate on the index schema until you begin to see the results you expect. You can use Search explorer or Postman to query your index.

  6. Continue using code to iterate over your design.
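
Steps 3 through 5 can also be scripted instead of run from Postman. The sketch below only builds the three REST requests and prints them; nothing is sent. The service name, index name, API key placeholder, sample document, and the api-version value are assumptions to replace with your own.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumed placeholders: substitute your own service, index, key, and API version.
SERVICE = "my-search-service"
INDEX = "hotels"
API_VERSION = "2020-06-30"
HEADERS = {"Content-Type": "application/json", "api-key": "<your-admin-api-key>"}


def index_url(suffix="", **params):
    """Build a URL under /indexes/<INDEX>, always including the api-version parameter."""
    query = urlencode({"api-version": API_VERSION, **params})
    return f"https://{SERVICE}.search.windows.net/indexes/{INDEX}{suffix}?{query}"


# Step 3: download the index schema (Get Index).
get_schema = Request(index_url(), headers=HEADERS, method="GET")

# Step 4: upload JSON documents in the request payload.
batch = {"value": [{"@search.action": "upload", "id": "1", "description": "Quiet hotel near the beach"}]}
load_docs = Request(
    index_url("/docs/index"),
    data=json.dumps(batch).encode("utf-8"),
    headers=HEADERS,
    method="POST",
)

# Step 5: query the index.
query_docs = Request(index_url("/docs", search="beach"), headers=HEADERS, method="GET")

for req in (get_schema, load_docs, query_docs):
    print(req.get_method(), req.full_url)
```

Passing any of these objects to urllib.request.urlopen would send it, which requires a real service and a valid key.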

Because physical structures are created in the service, you must drop and recreate an index whenever you make material changes to an existing field definition. This means that during development, you should plan on frequent rebuilds. Consider working with a subset of your data to make rebuilds go faster.

Code, rather than the portal, is recommended for iterative design. If you rely on the portal for index definition, you will have to fill out the index definition on each rebuild. As an alternative, tools like Postman and the REST API are helpful for proof-of-concept testing when development projects are still in early phases. You can make incremental changes to an index definition in a request body, and then send the request to your service to recreate the index using the updated schema.
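
As a rough sketch of that rebuild loop, the code below changes an attribute on an existing field and prepares the delete and create requests for the updated definition. The requests are built but not sent. The helper name rebuild_requests, the service name, the key placeholder, the field names, and the api-version value are all assumptions for illustration.

```python
import copy
import json
from urllib.request import Request

API_VERSION = "2020-06-30"  # assumed; use the version your service supports


def rebuild_requests(service, api_key, definition):
    """Return (delete, create) requests that drop and recreate the index described by `definition`."""
    url = f"https://{service}.search.windows.net/indexes/{definition['name']}?api-version={API_VERSION}"
    headers = {"Content-Type": "application/json", "api-key": api_key}
    delete = Request(url, headers=headers, method="DELETE")
    create = Request(url, data=json.dumps(definition).encode("utf-8"), headers=headers, method="PUT")
    return delete, create


v1 = {
    "name": "hotels",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "description", "type": "Edm.String", "searchable": True, "filterable": False},
    ],
}

# Incremental change to an existing field definition: make description filterable.
v2 = copy.deepcopy(v1)
v2["fields"][1]["filterable"] = True

delete_req, create_req = rebuild_requests("my-search-service", "<your-admin-api-key>", v2)
print(delete_req.get_method(), create_req.get_method())  # DELETE PUT
```

Keeping each schema version as a separate object makes it easy to compare definitions between rebuilds.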

Components of an index

Schematically, an Azure Cognitive Search index is composed of the following elements.

The fields collection is typically the largest part of an index, where each field is named, typed, and attributed with allowable behaviors that determine how it is used. Other elements include suggesters, scoring profiles, analyzers with component parts to support customization, and CORS and encryption key options.

{
  "name": (optional on PUT; required on POST) "name_of_index",
  "fields": [
    {
      "name": "name_of_field",
      "type": "Edm.String | Collection(Edm.String) | Edm.Int32 | Edm.Int64 | Edm.Double | Edm.Boolean | Edm.DateTimeOffset | Edm.GeographyPoint",
      "searchable": true (default where applicable) | false (only Edm.String and Collection(Edm.String) fields can be searchable),
      "filterable": true (default) | false,
      "sortable": true (default where applicable) | false (Collection(Edm.String) fields cannot be sortable),
      "facetable": true (default where applicable) | false (Edm.GeographyPoint fields cannot be facetable),
      "key": true | false (default, only Edm.String fields can be keys),
      "retrievable": true (default) | false,
      "analyzer": "name_of_analyzer_for_search_and_indexing", (only if 'searchAnalyzer' and 'indexAnalyzer' are not set)
      "searchAnalyzer": "name_of_search_analyzer", (only if 'indexAnalyzer' is set and 'analyzer' is not set)
      "indexAnalyzer": "name_of_indexing_analyzer", (only if 'searchAnalyzer' is set and 'analyzer' is not set)
      "synonymMaps": [ "name_of_synonym_map" ] (optional, only one synonym map per field is currently supported)
    }
  ],
  "suggesters": [
    {
      "name": "name of suggester",
      "searchMode": "analyzingInfixMatching",
      "sourceFields": ["field1", "field2", ...]
    }
  ],
  "scoringProfiles": [
    {
      "name": "name of scoring profile",
      "text": (optional, only applies to searchable fields) {
        "weights": {
          "searchable_field_name": relative_weight_value (positive #'s),
          ...
        }
      },
      "functions": (optional) [
        {
          "type": "magnitude | freshness | distance | tag",
          "boost": # (positive number used as multiplier for raw score != 1),
          "fieldName": "...",
          "interpolation": "constant | linear (default) | quadratic | logarithmic",
          "magnitude": {
            "boostingRangeStart": #,
            "boostingRangeEnd": #,
            "constantBoostBeyondRange": true | false (default)
          },
          "freshness": {
            "boostingDuration": "..." (value representing timespan leading to now over which boosting occurs)
          },
          "distance": {
            "referencePointParameter": "...", (parameter to be passed in queries to use as reference location)
            "boostingDistance": # (the distance in kilometers from the reference location where the boosting range ends)
          },
          "tag": {
            "tagsParameter": "..." (parameter to be passed in queries to specify a list of tags to compare against target fields)
          }
        }
      ],
      "functionAggregation": (optional, applies only when functions are specified) 
        "sum (default) | average | minimum | maximum | firstMatching"
    }
  ],
  "analyzers":(optional)[ ... ],
  "charFilters":(optional)[ ... ],
  "tokenizers":(optional)[ ... ],
  "tokenFilters":(optional)[ ... ],
  "defaultScoringProfile": (optional) "...",
  "corsOptions": (optional) {
    "allowedOrigins": ["*"] | ["origin_1", "origin_2", ...],
    "maxAgeInSeconds": (optional) max_age_in_seconds (non-negative integer)
  },
  "encryptionKey":(optional){
    "keyVaultUri": "azure_key_vault_uri",
    "keyVaultKeyName": "name_of_azure_key_vault_key",
    "keyVaultKeyVersion": "version_of_azure_key_vault_key",
    "accessCredentials":(optional){
      "applicationId": "azure_active_directory_application_id",
      "applicationSecret": "azure_active_directory_application_authentication_key"
    }
  }
}
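
The template above lists every option. For comparison, here is a small filled-in definition expressed as a Python dictionary and serialized to JSON. The index name, field names, and suggester name are hypothetical; only the element and attribute names come from the template.

```python
import json

# Hypothetical hotels index using a subset of the elements shown in the template.
index_definition = {
    "name": "hotels",
    "fields": [
        {"name": "hotelId", "type": "Edm.String", "key": True, "searchable": False},
        {"name": "hotelName", "type": "Edm.String", "searchable": True, "sortable": True},
        {"name": "tags", "type": "Collection(Edm.String)", "searchable": True, "facetable": True},
        {"name": "rating", "type": "Edm.Double", "filterable": True, "sortable": True},
        {"name": "location", "type": "Edm.GeographyPoint", "filterable": True},
    ],
    "suggesters": [
        {"name": "sg", "searchMode": "analyzingInfixMatching", "sourceFields": ["hotelName"]}
    ],
    "corsOptions": {"allowedOrigins": ["*"], "maxAgeInSeconds": 300},
}

# The serialized string is what would go in the body of a create-index request.
body = json.dumps(index_definition, indent=2)
print(body)
```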

Fields collection and field attributes

As you define your schema, you must specify the name, type, and attributes of each field in your index. The field type classifies the data that is stored in that field. Attributes are set on individual fields to specify how the field is used. The following tables enumerate the types and attributes you can specify.

Data types

Type  Description
Edm.String  Text that can optionally be tokenized for full-text search (word-breaking, stemming, and so forth).
Collection(Edm.String)  A list of strings that can optionally be tokenized for full-text search. There is no theoretical upper limit on the number of items in a collection, but the 16 MB upper limit on payload size applies to collections.
Edm.Boolean  Contains true/false values.
Edm.Int32  32-bit integer values.
Edm.Int64  64-bit integer values.
Edm.Double  Double-precision numeric data.
Edm.DateTimeOffset  Date time values represented in the OData V4 format (for example, yyyy-MM-ddTHH:mm:ss.fffZ or yyyy-MM-ddTHH:mm:ss.fff[+/-]HH:mm).
Edm.GeographyPoint  A point representing a geographic location on the globe.

For more detailed information, see the reference on data types supported by Azure Cognitive Search.
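
Date values are a common source of formatting mistakes when preparing documents. The helper below is a minimal sketch that renders a timezone-aware Python datetime in the two Edm.DateTimeOffset forms listed in the table. The function name is an assumption, and it handles only whole-minute UTC offsets.

```python
from datetime import datetime, timedelta, timezone


def to_edm_datetimeoffset(value):
    """Format an aware datetime as yyyy-MM-ddTHH:mm:ss.fff followed by Z or [+/-]HH:mm."""
    if value.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    text = value.isoformat(timespec="milliseconds")
    # isoformat writes UTC as +00:00; Z is the shorter equivalent suffix.
    return text[:-6] + "Z" if text.endswith("+00:00") else text


utc_value = datetime(2020, 1, 15, 8, 30, 0, tzinfo=timezone.utc)
pacific_value = datetime(2020, 1, 15, 8, 30, 0, tzinfo=timezone(timedelta(hours=-8)))
print(to_edm_datetimeoffset(utc_value))      # 2020-01-15T08:30:00.000Z
print(to_edm_datetimeoffset(pacific_value))  # 2020-01-15T08:30:00.000-08:00
```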

Index attributes

Exactly one field in your index must be designated as the key field that uniquely identifies each document.

Other attributes determine how a field is used in an application. For example, the searchable attribute is assigned to every field that should be included in a full-text search.

The APIs you use to build an index have varying default behaviors. For the REST APIs, most attributes are enabled by default (for example, searchable and retrievable are true for string fields) and you often only need to set them if you want to turn them off. For the .NET SDK, the opposite is true: on any property you do not explicitly set, the corresponding search behavior is disabled by default unless you specifically enable it.

Attribute  Description
key  A string that provides the unique ID of each document, used for document lookup. Every index must have one key. Only one field can be the key, and its type must be set to Edm.String.
retrievable  Specifies whether a field can be returned in a search result.
filterable  Allows the field to be used in filter queries.
sortable  Allows a query to sort search results using this field.
facetable  Allows a field to be used in a faceted navigation structure for user self-directed filtering. Typically, fields containing repetitive values that you can use to group multiple documents together (for example, multiple documents that fall under a single brand or service category) work best as facets.
searchable  Marks the field as full-text searchable.
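
The attribute restrictions noted here and in the schema template can be checked locally before sending a definition to the service. The sketch below is a hypothetical helper, not part of any SDK. It inspects only attributes that are set explicitly, so it does not model the REST defaults that apply when an attribute is omitted.

```python
def validate_fields(fields):
    """Return a list of rule violations for a list of field definitions (dicts)."""
    errors = []
    keys = [f["name"] for f in fields if f.get("key")]
    if len(keys) != 1:
        errors.append(f"expected exactly one key field, found {len(keys)}")
    for f in fields:
        name, ftype = f["name"], f["type"]
        if f.get("key") and ftype != "Edm.String":
            errors.append(f"{name}: key fields must be Edm.String")
        if f.get("searchable") and ftype not in ("Edm.String", "Collection(Edm.String)"):
            errors.append(f"{name}: only string fields can be searchable")
        if f.get("sortable") and ftype == "Collection(Edm.String)":
            errors.append(f"{name}: Collection(Edm.String) fields cannot be sortable")
        if f.get("facetable") and ftype == "Edm.GeographyPoint":
            errors.append(f"{name}: Edm.GeographyPoint fields cannot be facetable")
    return errors


good_fields = [
    {"name": "id", "type": "Edm.String", "key": True},
    {"name": "description", "type": "Edm.String", "searchable": True},
]
bad_fields = [
    {"name": "id", "type": "Edm.Int32", "key": True},
    {"name": "tags", "type": "Collection(Edm.String)", "sortable": True},
]
print(validate_fields(good_fields))  # []
for problem in validate_fields(bad_fields):
    print(problem)
```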

Index size

The size of an index is determined by the size of the documents you upload, plus index configuration, such as whether you include suggesters, and how you set attributes on individual fields. The following screenshot illustrates index storage patterns resulting from various combinations of attributes.

The index is based on the built-in real estate sample data source, which you can index and query in the portal. Although the index schemas are not shown, you can infer the attributes based on the index name. For example, the realestate-searchable index has the searchable attribute selected and nothing else, the realestate-retrievable index has the retrievable attribute selected and nothing else, and so forth.

Index size based on attribute selection

Although these index variants are artificial, we can refer to them for broad comparisons of how attributes affect storage. Does setting retrievable increase index size? No. Does adding fields to a suggester increase index size? Yes.

Indexes that support filter and sort are proportionally larger than those supporting just full-text search. Filter and sort operations scan for exact matches, requiring the presence of intact documents. In contrast, searchable fields supporting full-text and fuzzy search use inverted indexes, which are populated with tokenized terms that consume less space than whole documents.

Note

Storage architecture is considered an implementation detail of Azure Cognitive Search and could change without notice. There is no guarantee that current behavior will persist in the future.

Suggesters

A suggester is a section of the schema that defines which fields in an index are used to support auto-complete or type-ahead queries in searches. Typically, partial search strings are sent to the Suggestions (REST API) while the user is typing a search query, and the API returns a set of suggested documents or phrases.

Fields added to a suggester are used to build type-ahead search terms. All of the search terms are created during indexing and stored separately. For more information about creating a suggester structure, see Add suggesters.
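
The following sketch shows one way to assemble a suggester entry and the body of a matching Suggestions request for a partially typed term. The helper name make_suggester, the field names, and the suggester name are hypothetical, and the check that source fields exist in the index is a local convenience rather than documented service behavior.

```python
import json


def make_suggester(name, source_fields, index_fields):
    """Build a suggester entry, checking that every source field is defined in the index."""
    known = {f["name"] for f in index_fields}
    missing = [s for s in source_fields if s not in known]
    if missing:
        raise ValueError(f"unknown source fields: {missing}")
    return {"name": name, "searchMode": "analyzingInfixMatching", "sourceFields": list(source_fields)}


fields = [
    {"name": "hotelId", "type": "Edm.String", "key": True},
    {"name": "hotelName", "type": "Edm.String", "searchable": True},
]
suggester = make_suggester("sg", ["hotelName"], fields)

# Body for a Suggestions request sent while the user has typed "sea".
suggest_request = {"search": "sea", "suggesterName": suggester["name"], "top": 5}
print(json.dumps(suggester))
print(json.dumps(suggest_request))
```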

Scoring profiles

A scoring profile is a section of the schema that defines custom scoring behaviors that let you influence which items appear higher in the search results. Scoring profiles are made up of field weights and functions. To use them, you specify a profile by name on the query string.

A default scoring profile operates behind the scenes to compute a search score for every item in a result set. You can use the internal, unnamed scoring profile. Alternatively, set defaultScoringProfile to use a custom profile as the default, invoked whenever a custom profile is not specified on the query string.
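
As an illustration, the sketch below defines a profile with field weights and one magnitude function, then builds a query string that selects the profile by name. The profile name, field names, weights, boost values, and api-version are assumptions; the structure follows the scoringProfiles section of the schema template.

```python
from urllib.parse import urlencode

# Hypothetical profile: favor matches in hotelName and boost higher ratings.
scoring_profile = {
    "name": "boostByRating",
    "text": {"weights": {"hotelName": 3, "description": 1}},
    "functions": [
        {
            "type": "magnitude",
            "fieldName": "rating",
            "boost": 2,
            "interpolation": "linear",
            "magnitude": {"boostingRangeStart": 1, "boostingRangeEnd": 5},
        }
    ],
    "functionAggregation": "sum",
}


def query_string(search, profile=None, api_version="2020-06-30"):
    """Build a query string, naming a scoring profile only when one is requested."""
    params = {"api-version": api_version, "search": search}
    if profile:
        params["scoringProfile"] = profile
    return urlencode(params)


print(query_string("beach", scoring_profile["name"]))
# api-version=2020-06-30&search=beach&scoringProfile=boostByRating
print(query_string("beach"))
# api-version=2020-06-30&search=beach
```

When the profile name is omitted, the service falls back to defaultScoringProfile if one is set, or to the internal unnamed profile otherwise.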

Analyzers

The analyzers element sets the name of the language analyzer to use for the field. For more information about the range of analyzers available to you, see Adding analyzers to an Azure Cognitive Search index. Analyzers can only be used with searchable fields. Once an analyzer is assigned to a field, it cannot be changed unless you rebuild the index.

CORS

Client-side JavaScript cannot call any APIs by default because the browser prevents all cross-origin requests. To allow cross-origin queries to your index, enable CORS (Cross-Origin Resource Sharing) by setting the corsOptions attribute. For security reasons, only query APIs support CORS.

The following options can be set for CORS:

  • allowedOrigins (required): This is a list of origins that will be granted access to your index. This means that any JavaScript code served from those origins will be allowed to query your index (assuming it provides the correct api-key). Each origin is typically of the form protocol://<fully-qualified-domain-name>:<port>, although <port> is often omitted. See Cross-origin resource sharing (Wikipedia) for more details.

    If you want to allow access to all origins, include * as a single item in the allowedOrigins array. This is not recommended practice for production search services, but it is often useful for development and debugging.

  • maxAgeInSeconds (optional): Browsers use this value to determine the duration (in seconds) to cache CORS preflight responses. This must be a non-negative integer. The larger this value is, the better performance will be, but the longer it will take for CORS policy changes to take effect. If it is not set, a default duration of 5 minutes will be used.
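
The two options can be assembled and sanity-checked locally before they go into an index definition. The helper below is a hypothetical sketch: the function name is an assumption, and rejecting an empty origin list is an interpretation of "required" rather than documented service behavior.

```python
def make_cors_options(allowed_origins, max_age_in_seconds=None):
    """Build a corsOptions dictionary after checking the constraints described above."""
    origins = list(allowed_origins)
    if not origins:
        raise ValueError("allowedOrigins is required and must not be empty")
    if "*" in origins and len(origins) > 1:
        raise ValueError('"*" must be the only item in allowedOrigins')
    options = {"allowedOrigins": origins}
    if max_age_in_seconds is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(max_age_in_seconds, int) and not isinstance(max_age_in_seconds, bool)
        if not is_int or max_age_in_seconds < 0:
            raise ValueError("maxAgeInSeconds must be a non-negative integer")
        options["maxAgeInSeconds"] = max_age_in_seconds
    return options


# Development: allow every origin and rely on the default preflight cache duration.
print(make_cors_options(["*"]))
# Production-style: a single named origin with a 10-minute preflight cache.
print(make_cors_options(["https://www.example.com"], 600))
```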

Encryption key

While all Azure Cognitive Search indexes are encrypted by default using Microsoft-managed keys, indexes can be configured to be encrypted with customer-managed keys in Key Vault. To learn more, see Manage encryption keys in Azure Cognitive Search.

Next steps

With an understanding of index composition, you can continue in the portal to create your first index.