How full text search works in Azure Cognitive Search

This article is for developers who need a deeper understanding of how Lucene full text search works in Azure Cognitive Search. For text queries, Azure Cognitive Search delivers the expected results in most scenarios, but occasionally you might get a result that seems "off". In these situations, a background in the four stages of Lucene query execution (query parsing, lexical analysis, document retrieval, and scoring) can help you identify the specific changes to query parameters or index configuration that will deliver the desired outcome.

Note

Azure Cognitive Search uses Lucene for full text search, but the Lucene integration is not exhaustive. We selectively expose and extend Lucene functionality to enable the scenarios important to Azure Cognitive Search.

Architecture overview and diagram

Processing a full text search query starts with parsing the query text to extract search terms. The search engine uses an index to retrieve documents with matching terms. Individual query terms are sometimes broken down and reconstituted into new forms to cast a broader net over what could be considered a potential match. The result set is then sorted by a relevance score assigned to each matching document. Those at the top of the ranked list are returned to the calling application.

Restated, query execution has four stages:

  1. Query parsing
  2. Lexical analysis
  3. Document retrieval
  4. Scoring

The diagram below illustrates the components used to process a search request.

[Figure: Lucene query architecture diagram in Azure Cognitive Search]

Key components | Functional description
Query parsers | Separate query terms from query operators and create the query structure (a query tree) to be sent to the search engine.
Analyzers | Perform lexical analysis on query terms. This process can involve transforming, removing, or expanding query terms.
Index | An efficient data structure used to store and organize searchable terms extracted from indexed documents.
Search engine | Retrieves and scores matching documents based on the contents of the inverted index.

Anatomy of a search request

A search request is a complete specification of what should be returned in a result set. In its simplest form, it is an empty query with no criteria of any kind. A more realistic example includes parameters, several query terms, perhaps scoped to certain fields, with possibly a filter expression and ordering rules.

The following example is a search request you might send to Azure Cognitive Search using the REST API.

POST /indexes/hotels/docs/search?api-version=2019-05-06
{
    "search": "Spacious, air-condition* +\"Ocean view\"",
    "searchFields": "description, title",
    "searchMode": "any",
    "filter": "price ge 60 and price lt 300",
    "orderby": "geo.distance(location, geography'POINT(-159.476235 22.227659)')", 
    "queryType": "full" 
}

For this request, the search engine does the following:

  1. Filters out documents whose price falls outside the range of at least $60 and less than $300.
  2. Executes the query. In this example, the search query consists of phrases and terms: "Spacious, air-condition* +\"Ocean view\"" (users typically don't enter punctuation, but including it in the example lets us explain how analyzers handle it). For this query, the search engine scans the description and title fields specified in searchFields for documents that contain "Ocean view", and additionally for the term "spacious" or for terms that start with the prefix "air-condition". The searchMode parameter determines whether to match on any term (the default) or on all of them, in cases where a term is not explicitly required (+).
  3. Orders the resulting set of hotels by proximity to a given geographic location, and then returns it to the calling application.

The majority of this article is about processing the search query: "Spacious, air-condition* +\"Ocean view\"". Filtering and ordering are out of scope. For more information, see the Search API reference documentation.
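As a rough illustration, the example request could be assembled in Python along the following lines. This is only a sketch: the service name `my-service` and the key value are placeholders, the helper `build_search_request` is hypothetical, and the request is built but never sent. The path, API version, and body come from the example request.

```python
import json
import urllib.request

# Placeholder values (assumptions): substitute your own service name and query key.
SERVICE = "my-service"
API_KEY = "<your-query-key>"

def build_search_request(index_name: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the search endpoint of an index."""
    url = (
        f"https://{SERVICE}.search.windows.net"
        f"/indexes/{index_name}/docs/search?api-version=2019-05-06"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": API_KEY},
        method="POST",
    )

body = {
    "search": 'Spacious, air-condition* +"Ocean view"',
    "searchFields": "description, title",
    "searchMode": "any",
    "filter": "price ge 60 and price lt 300",
    "orderby": "geo.distance(location, geography'POINT(-159.476235 22.227659)')",
    "queryType": "full",
}

request = build_search_request("hotels", body)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is omitted here because it needs a real service and key.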

Stage 1: Query parsing

As noted, the query string is the first line of the request:

 "search": "Spacious, air-condition* +\"Ocean view\"", 

The query parser separates operators (such as * and + in the example) from search terms, and deconstructs the search query into subqueries of a supported type:

  • term query for standalone terms (like spacious)
  • phrase query for quoted terms (like ocean view)
  • prefix query for terms followed by a prefix operator * (like air-condition)

For a full list of supported query types, see Lucene query syntax.

Operators associated with a subquery determine whether the query "must be" or "should be" satisfied in order for a document to be considered a match. For example, +"Ocean view" is "must" due to the + operator.

The query parser restructures the subqueries into a query tree (an internal structure representing the query) that it passes on to the search engine. In the first stage of query parsing, the query tree looks like this.

[Figure: Boolean query with searchMode set to any]
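The parsing step can be approximated with a toy parser. The sketch below is not the Lucene query parser; it only recognizes the three constructs used in the example (a standalone term, a trailing * prefix operator, and a quoted phrase with an optional leading +). The `Subquery` class and `parse` function are hypothetical names.

```python
import re
from dataclasses import dataclass

@dataclass
class Subquery:
    kind: str       # "term", "prefix", or "phrase"
    text: str       # search text with operators removed
    required: bool  # True when the subquery carries an explicit "+"

# One token: an optional "+", then either a quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'(\+?)(?:"([^"]*)"|(\S+))')

def parse(query: str) -> list[Subquery]:
    """Split a query string into term, prefix, and phrase subqueries."""
    subqueries = []
    for plus, phrase, word in TOKEN.findall(query):
        required = plus == "+"
        if phrase:
            subqueries.append(Subquery("phrase", phrase, required))
        elif word.endswith("*"):
            subqueries.append(Subquery("prefix", word[:-1], required))
        else:
            subqueries.append(Subquery("term", word, required))
    return subqueries

for subquery in parse('Spacious, air-condition* +"Ocean view"'):
    print(subquery)
```

Note that the comma stays attached to "Spacious," at this point, because the parser does not treat it as an operator.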

Supported parsers: Simple and Full Lucene

Azure Cognitive Search exposes two different query languages, simple (default) and full. By setting the queryType parameter in your search request, you tell the query parser which query language you chose so that it knows how to interpret the operators and syntax. The Simple query language is intuitive and robust, and is often suitable for interpreting user input as-is without client-side processing. It supports query operators familiar from web search engines. The Full Lucene query language, which you get by setting queryType=full, extends the default Simple query language by adding support for more operators and query types such as wildcard, fuzzy, regex, and field-scoped queries. For example, a regular expression sent in Simple query syntax would be interpreted as a query string and not as an expression. The example request in this article uses the Full Lucene query language.

Impact of searchMode on the parser

Another search request parameter that affects parsing is the searchMode parameter. It controls the default operator for Boolean queries: any (default) or all.

When searchMode=any, which is the default, the space delimiter between spacious and air-condition is OR (||), making the sample query text equivalent to:

Spacious,||air-condition*+"Ocean view" 

Explicit operators, such as + in +"Ocean view", are unambiguous in Boolean query construction (the term must match). Less obvious is how to interpret the remaining terms: spacious and air-condition. Should the search engine find matches on ocean view and spacious and air-condition? Or should it find ocean view plus either one of the remaining terms?

By default (searchMode=any), the search engine assumes the broader interpretation. Either term should be matched, reflecting "or" semantics. The initial query tree illustrated previously, with the two "should" operations, shows the default.

Suppose that we now set searchMode=all. In this case, the space is interpreted as an "and" operation. Each of the remaining terms must be present in the document for it to qualify as a match. The resulting sample query would be interpreted as follows:

+Spacious,+air-condition*+"Ocean view"

A modified query tree for this query would be as follows, where a matching document is the intersection of all three subqueries:

[Figure: Boolean query with searchMode set to all]
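The effect of searchMode on the query tree can be sketched as a labeling step: explicitly required subqueries are always "must", and the remaining ones become "should" or "must" depending on the mode. The function name `occurrences` and the tuple representation are assumptions made for illustration.

```python
def occurrences(subqueries, search_mode="any"):
    """Label each (text, explicitly_required) pair as "must" or "should"."""
    labelled = []
    for text, explicitly_required in subqueries:
        if explicitly_required or search_mode == "all":
            labelled.append((text, "must"))
        else:
            labelled.append((text, "should"))
    return labelled

query = [("Spacious,", False), ("air-condition*", False), ('"Ocean view"', True)]
print(occurrences(query, "any"))  # two "should" clauses and one "must"
print(occurrences(query, "all"))  # three "must" clauses
```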

Note

Choosing searchMode=any over searchMode=all is a decision best arrived at by running representative queries. Users who are likely to include operators (common when searching document stores) might find results more intuitive if searchMode=all informs Boolean query constructs. For more about the interplay between searchMode and operators, see Simple query syntax.

Stage 2: Lexical analysis

Lexical analyzers process term queries and phrase queries after the query tree is structured. An analyzer accepts the text inputs given to it by the parser, processes the text, and then sends back tokenized terms to be incorporated into the query tree.

The most common form of lexical analysis is linguistic analysis, which transforms query terms based on rules specific to a given language:

  • Reducing a query term to the root form of a word
  • Removing non-essential words (stopwords, such as "the" or "and" in English)
  • Breaking a composite word into component parts
  • Lowercasing an uppercase word

All of these operations tend to erase differences between the text input provided by the user and the terms stored in the index. Such operations go beyond text processing and require in-depth knowledge of the language itself. To add this layer of linguistic awareness, Azure Cognitive Search supports a long list of language analyzers from both Lucene and Microsoft.

Note

Analysis requirements can range from minimal to elaborate depending on your scenario. You can control the complexity of lexical analysis by selecting one of the predefined analyzers or by creating your own custom analyzer. Analyzers are scoped to searchable fields and are specified as part of a field definition. This allows you to vary lexical analysis on a per-field basis. If no analyzer is specified, the standard Lucene analyzer is used.

In our example, prior to analysis, the initial query tree has the term "Spacious," with an uppercase "S" and a comma that the query parser interprets as part of the query term (a comma is not considered a query language operator).

When the default analyzer processes the terms, it lowercases "ocean view" and "spacious", and removes the comma character. The modified query tree looks as follows:

[Figure: Boolean query with analyzed terms]

Testing analyzer behaviors

The behavior of an analyzer can be tested using the Analyze API. Provide the text you want to analyze to see what terms a given analyzer will generate. For example, to see how the standard analyzer would process the text "air-condition", you can issue the following request:

{
    "text": "air-condition",
    "analyzer": "standard"
}

The standard analyzer breaks the input text into the following two tokens, annotating them with attributes like start and end offsets (used for hit highlighting) as well as their position (used for phrase matching):

{
  "tokens": [
    {
      "token": "air",
      "startOffset": 0,
      "endOffset": 3,
      "position": 0
    },
    {
      "token": "condition",
      "startOffset": 4,
      "endOffset": 13,
      "position": 1
    }
  ]
}
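To make the token attributes concrete, the following sketch mimics this output with a very rough stand-in for the standard analyzer: it splits on non-word characters and lowercases each token. The real analyzer applies more elaborate rules, so treat `analyze` as a hypothetical approximation that happens to agree on simple inputs like this one.

```python
import re

def analyze(text: str) -> list[dict]:
    """Split text on non-word characters, lowercase it, and record offsets and positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),
            "startOffset": match.start(),  # index of the first character
            "endOffset": match.end(),      # index just past the last character
            "position": position,          # ordinal of the token in the text
        })
    return tokens

print(analyze("air-condition"))
print([t["token"] for t in analyze("Spacious,")])
```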

Exceptions to lexical analysis

Lexical analysis applies only to query types that require complete terms: either a term query or a phrase query. It doesn't apply to query types with incomplete terms (prefix query, wildcard query, regex query) or to a fuzzy query. Those query types, including the prefix query with the term air-condition* in our example, are added directly to the query tree, bypassing the analysis stage. The only transformation performed on query terms of those types is lowercasing.
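The distinction can be sketched as a single branch on the query type. In this illustration, `prepare` is a hypothetical helper and the "analysis" is the same crude split-and-lowercase approximation of the standard analyzer, not the real one.

```python
import re

# Query types whose text goes through full lexical analysis.
ANALYZED_KINDS = {"term", "phrase"}

def prepare(kind: str, text: str) -> list[str]:
    """Return the terms placed in the query tree for one subquery."""
    if kind in ANALYZED_KINDS:
        # Approximate analysis: split on non-word characters and lowercase.
        return [t.lower() for t in re.findall(r"\w+", text)]
    # Prefix, wildcard, regex, and fuzzy queries bypass analysis; only lowercasing applies.
    return [text.lower()]

print(prepare("term", "Spacious,"))
print(prepare("phrase", "Ocean view"))
print(prepare("prefix", "Air-Condition"))
```

The prefix keeps its hyphen, which is why it later fails to match indexed terms that were split at the hyphen during indexing.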

Stage 3: Document retrieval

Document retrieval refers to finding documents with matching terms in the index. This stage is best understood through an example. Let's start with a hotels index that has the following simple schema:

{
    "name": "hotels",
    "fields": [
        { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
        { "name": "title", "type": "Edm.String", "searchable": true },
        { "name": "description", "type": "Edm.String", "searchable": true }
    ] 
} 

Further assume that this index contains the following four documents:

{
    "value": [
        {
            "id": "1",
            "title": "Hotel Atman",
            "description": "Spacious rooms, ocean view, walking distance to the beach."
        },
        {
            "id": "2",
            "title": "Beach Resort",
            "description": "Located on the north shore of the island of Kauaʻi. Ocean view."
        },
        {
            "id": "3",
            "title": "Playa Hotel",
            "description": "Comfortable, air-conditioned rooms with ocean view."
        },
        {
            "id": "4",
            "title": "Ocean Retreat",
            "description": "Quiet and secluded"
        }
    ]
}

How terms are indexed

To understand retrieval, it helps to know a few basics about indexing. The unit of storage is an inverted index, one for each searchable field. Within an inverted index is a sorted list of all terms from all documents. Each term maps to the list of documents in which it occurs, as evident in the example below.

To produce the terms in an inverted index, the search engine performs lexical analysis over the content of documents, similar to what happens during query processing:

  1. Text inputs are passed to an analyzer, lowercased, stripped of punctuation, and so forth, depending on the analyzer configuration.
  2. Tokens are the output of text analysis.
  3. Terms are added to the index.

It's common, but not required, to use the same analyzers for search and indexing operations so that query terms look more like terms inside the index.

Note

Azure Cognitive Search lets you specify different analyzers for indexing and search via the additional indexAnalyzer and searchAnalyzer field parameters. If they are unspecified, the analyzer set with the analyzer property is used for both indexing and searching.
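The fallback rule in this note can be written down as a small lookup. This sketch makes several assumptions: the helper `effective_analyzers` is hypothetical, the analyzer names in the sample fields are illustrative, and the default is taken to be named `standard.lucene`.

```python
# Assumed name of the default (standard Lucene) analyzer.
DEFAULT_ANALYZER = "standard.lucene"

def effective_analyzers(field: dict) -> tuple[str, str]:
    """Return (index-time analyzer, query-time analyzer) for a field definition."""
    fallback = field.get("analyzer", DEFAULT_ANALYZER)
    return (
        field.get("indexAnalyzer", fallback),
        field.get("searchAnalyzer", fallback),
    )

fields = [
    {"name": "title", "type": "Edm.String", "searchable": True},
    {"name": "description", "type": "Edm.String", "searchable": True, "analyzer": "en.lucene"},
    {"name": "tags", "type": "Edm.String", "searchable": True,
     "indexAnalyzer": "keyword", "searchAnalyzer": "standard.lucene"},
]

for field in fields:
    print(field["name"], effective_analyzers(field))
```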

Inverted index for example documents

Returning to our example, for the title field, the inverted index looks like this:

Term | Document list
atman | 1
beach | 2
hotel | 1, 3
ocean | 4
playa | 3
resort | 2
retreat | 4

In the title field, only hotel shows up in two documents: 1 and 3.

For the description field, the index is as follows:

Term | Document list
air | 3
and | 4
beach | 1
comfortable | 3
conditioned | 3
distance | 1
island | 2
kauaʻi | 2
located | 2
north | 2
ocean | 1, 2, 3
of | 2
on | 2
quiet | 4
rooms | 1, 3
secluded | 4
shore | 2
spacious | 1
the | 1, 2
to | 1
view | 1, 2, 3
walking | 1
with | 3
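The two tables above can be reproduced with a few lines of Python. This is a minimal sketch that assumes a simplistic analyzer (lowercase, split on non-word characters); `build_inverted_index` is a hypothetical helper, and a real index also stores positions and frequencies.

```python
import re
from collections import defaultdict

DOCS = {
    1: {"title": "Hotel Atman",
        "description": "Spacious rooms, ocean view, walking distance to the beach."},
    2: {"title": "Beach Resort",
        "description": "Located on the north shore of the island of Kauaʻi. Ocean view."},
    3: {"title": "Playa Hotel",
        "description": "Comfortable, air-conditioned rooms with ocean view."},
    4: {"title": "Ocean Retreat",
        "description": "Quiet and secluded"},
}

def build_inverted_index(docs: dict, field: str) -> dict:
    """Map each analyzed term of one field to the sorted list of document ids containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set(): a term is listed once per document, however often it occurs there.
        for term in set(re.findall(r"\w+", docs[doc_id][field].lower())):
            index[term].append(doc_id)
    return dict(sorted(index.items()))

title_index = build_inverted_index(DOCS, "title")
for term, doc_ids in title_index.items():
    print(term, doc_ids)
```

Calling `build_inverted_index(DOCS, "description")` produces the second table in the same way.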

Matching query terms against indexed terms

Given the inverted indexes above, let's return to the sample query and see how matching documents are found. Recall that the final query tree looks like this:

[Figure: Boolean query with analyzed terms]

During query execution, individual queries are executed against the searchable fields independently.

  • The TermQuery, "spacious", matches document 1 (Hotel Atman).

  • The PrefixQuery, "air-condition*", doesn't match any documents.

    This behavior sometimes confuses developers. Although the term air-conditioned exists in a document, the default analyzer splits it into two terms. Recall that prefix queries, which contain partial terms, are not analyzed. Therefore, terms with the prefix "air-condition" are looked up in the inverted index and not found.

  • The PhraseQuery, "ocean view", looks up the terms "ocean" and "view" and checks the proximity of the terms in the original document. Documents 1, 2, and 3 match this query in the description field. Notice that document 4 has the term ocean in the title but isn't considered a match, because we're looking for the phrase "ocean view" rather than individual words.

Note

A search query is executed independently against all searchable fields in the Azure Cognitive Search index unless you limit the set of fields with the searchFields parameter, as illustrated in the example search request. Documents that match in any of the selected fields are returned.

On the whole, for the query in question, the documents that match are 1, 2, and 3.
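The three lookups can be replayed with a toy matcher. For brevity, this sketch scans the analyzed field text directly instead of consulting a real inverted index, and it assumes the usual Boolean semantics in which optional ("should") clauses do not restrict the result once a required ("must") clause is present. All function names are hypothetical.

```python
import re

DOCS = {
    1: {"title": "Hotel Atman",
        "description": "Spacious rooms, ocean view, walking distance to the beach."},
    2: {"title": "Beach Resort",
        "description": "Located on the north shore of the island of Kauaʻi. Ocean view."},
    3: {"title": "Playa Hotel",
        "description": "Comfortable, air-conditioned rooms with ocean view."},
    4: {"title": "Ocean Retreat",
        "description": "Quiet and secluded"},
}
FIELDS = ("description", "title")  # the searchFields of the example request

def terms(text: str) -> list[str]:
    """Crude stand-in for the analyzer: lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())

def term_query(term: str) -> set[int]:
    return {d for d, doc in DOCS.items() for f in FIELDS if term in terms(doc[f])}

def prefix_query(prefix: str) -> set[int]:
    # The prefix is compared with whole indexed terms, never with the original text.
    return {d for d, doc in DOCS.items() for f in FIELDS
            if any(t.startswith(prefix) for t in terms(doc[f]))}

def phrase_query(phrase: str) -> set[int]:
    words = terms(phrase)
    hits = set()
    for d, doc in DOCS.items():
        for f in FIELDS:
            ts = terms(doc[f])
            # The phrase matches when its words appear consecutively within one field.
            if any(ts[i:i + len(words)] == words for i in range(len(ts))):
                hits.add(d)
    return hits

should = term_query("spacious") | prefix_query("air-condition")
must = phrase_query("ocean view")
matches = must  # "should" clauses influence scoring here, not membership
print(sorted(should), sorted(must), sorted(matches))
```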

Stage 4: Scoring

Every document in a search result set is assigned a relevance score. The function of the relevance score is to rank higher those documents that best answer a user question as expressed by the search query. The score is computed based on statistical properties of the terms that matched. At the core of the scoring formula is TF/IDF (term frequency-inverse document frequency). In queries containing rare and common terms, TF/IDF promotes results containing the rare term. For example, in a hypothetical index with all Wikipedia articles, among documents that matched the query "the president", documents matching on president are considered more relevant than documents matching on the.
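A small calculation shows why rare terms weigh more. The sketch assumes the classic Lucene TF-IDF definitions (tf as the square root of the term frequency, idf as 1 + ln(N / (df + 1))) and the document frequencies of the description field in the example index. It does not reproduce the scores that Azure Cognitive Search returns, which include further factors.

```python
import math

def tf(freq: int) -> float:
    """Term frequency component: grows with occurrences, but sublinearly."""
    return math.sqrt(freq)

def idf(doc_freq: int, num_docs: int) -> float:
    """Inverse document frequency: larger for terms found in fewer documents."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

NUM_DOCS = 4
# Document frequencies in the description field: "spacious" is rare, "ocean" is common.
for term, doc_freq in {"spacious": 1, "ocean": 3}.items():
    weight = tf(1) * idf(doc_freq, NUM_DOCS)
    print(f"{term}: df={doc_freq}, weight={weight:.3f}")
```

With these definitions, "spacious" receives a higher weight than "ocean", which matches the intuition that a match on the rarer term says more about a document.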

Scoring example

Recall the three documents that matched our example query:

search=Spacious, air-condition* +"Ocean view"  
{
  "value": [
    {
      "@search.score": 0.25610128,
      "id": "1",
      "title": "Hotel Atman",
      "description": "Spacious rooms, ocean view, walking distance to the beach."
    },
    {
      "@search.score": 0.08951007,
      "id": "3",
      "title": "Playa Hotel",
      "description": "Comfortable, air-conditioned rooms with ocean view."
    },
    {
      "@search.score": 0.05967338,
      "id": "2",
      "title": "Beach Resort",
      "description": "Located on the north shore of the island of Kauaʻi. Ocean view."
    }
  ]
}

Document 1 matched the query best because both the term spacious and the required phrase ocean view occur in the description field. The next two documents match only the phrase ocean view. It might be surprising that the relevance scores for documents 2 and 3 differ even though they matched the query in the same way. That's because the scoring formula has more components than just TF/IDF. In this case, document 3 was assigned a slightly higher score because its description is shorter. Learn about Lucene's Practical Scoring Formula to understand how field length and other factors can influence the relevance score.

Some query types (wildcard, prefix, regex) always contribute a constant score to the overall document score. This allows matches found through query expansion to be included in the results, but without affecting the ranking.

An example illustrates why this matters. Wildcard searches, including prefix searches, are ambiguous by definition because the input is a partial string with potential matches on a very large number of disparate terms (consider an input of "tour*", with matches found on "tours", "tourettes", and "tourmaline"). Given the nature of these results, there is no way to reasonably infer which terms are more valuable than others. For this reason, we ignore term frequencies when scoring results in queries of types wildcard, prefix, and regex. In a multi-part search request that includes partial and complete terms, results from the partial input are incorporated with a constant score to avoid bias towards potentially unexpected matches.

Score tuning

There are two ways to tune relevance scores in Azure Cognitive Search:

  1. Scoring profiles promote documents in the ranked list of results based on a set of rules. In our example, we could consider documents that matched in the title field more relevant than documents that matched in the description field. Additionally, if our index had a price field for each hotel, we could promote documents with a lower price. Learn more about how to add Scoring Profiles to a search index.
  2. Term boosting (available only in the Full Lucene query syntax) provides a boosting operator ^ that can be applied to any part of the query tree. In our example, instead of searching on the prefix air-condition*, one could search for either the exact term air-condition or the prefix, and rank documents that match on the exact term higher by applying a boost to the term query: air-condition^2||air-condition*. Learn more about term boosting.

Scoring in a distributed index

All indexes in Azure Cognitive Search are automatically split into multiple shards, allowing us to quickly distribute the index among multiple nodes during service scale up or scale down. When a search request is issued, it's issued against each shard independently. The results from each shard are then merged and ordered by score (if no other ordering is defined). It is important to know that the scoring function weights query term frequency against its inverse document frequency in all documents within the shard, not across all shards.

This means a relevance score could be different for identical documents if they reside on different shards. Fortunately, such differences tend to disappear as the number of documents in the index grows, because term distribution becomes more even. It's not possible to predict on which shard any given document will be placed. However, assuming a document key doesn't change, the document will always be assigned to the same shard.
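The per-shard effect is easy to see with made-up numbers. The shard names and statistics below are hypothetical, and the idf definition is the classic Lucene one (an assumption, as the exact formula used by the service is not given here).

```python
import math

def idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene-style inverse document frequency (assumed definition)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical statistics: documents per shard, and how many of them contain "ocean".
shards = {
    "shard-0": {"num_docs": 1000, "doc_freq": 10},
    "shard-1": {"num_docs": 1000, "doc_freq": 200},
}

# The same term gets a different weight depending on the shard that scores it.
weights = {name: idf(s["doc_freq"], s["num_docs"]) for name, s in shards.items()}
for name, weight in weights.items():
    print(name, round(weight, 3))
```

An identical document would therefore score higher on the shard where "ocean" is rarer, even though its own content is unchanged.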

In general, document score is not the best attribute for ordering documents if order stability is important. For example, given two documents with an identical score, there is no guarantee which one appears first in subsequent runs of the same query. Document score should only give a general sense of document relevance relative to other documents in the result set.
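If a stable order matters, one possible client-side approach is to break score ties on a value that never changes, such as the document key. This is only an illustrative sketch with made-up results; it is not a feature described in this article.

```python
# Made-up results: two documents share the same score.
results = [
    {"id": "7", "@search.score": 0.5},
    {"id": "3", "@search.score": 0.5},
    {"id": "9", "@search.score": 0.9},
]

# Sort by score descending, then by key ascending, so ties always resolve the same way.
ordered = sorted(results, key=lambda r: (-r["@search.score"], r["id"]))
print([r["id"] for r in ordered])
```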

Conclusion

The success of internet search engines has raised expectations for full text search over private data. For almost any kind of search experience, we now expect the engine to understand our intent, even when terms are misspelled or incomplete. We might even expect matches based on near-equivalent terms or synonyms that we never actually specified.

From a technical standpoint, full text search is highly complex, requiring sophisticated linguistic analysis and a systematic approach to processing in ways that distill, expand, and transform query terms to deliver a relevant result. Given the inherent complexities, there are many factors that can affect the outcome of a query. For this reason, investing the time to understand the mechanics of full text search offers tangible benefits when trying to work through unexpected results.

This article explored full text search in the context of Azure Cognitive Search. We hope it gives you sufficient background to recognize potential causes and resolutions for common query problems.

Next steps

See also

Search Documents REST API

Simple query syntax

Full Lucene query syntax

Handle search results