
Get started with the Knowledge Exploration Service

In this walkthrough, you use the Knowledge Exploration Service (KES) to create the engine for an interactive search experience over academic publications. You can install the command-line tool, kes.exe, and all example files from the Knowledge Exploration Service SDK.

The academic publications example contains a sample of 1000 academic papers published by researchers at Microsoft. Each paper is associated with a title, publication year, authors, and keywords. Each author is represented by an ID, name, and affiliation at the time of publication. Each keyword can be associated with a set of synonyms (for example, the keyword "support vector machine" can be associated with the synonym "svm").

Define the schema

The schema describes the attribute structure of the objects in the domain. It specifies the name and data type of each attribute in a JSON file. The following example is the content of the file Academic.schema.

{
  "attributes":[
    {"name":"Title", "type":"String"},
    {"name":"Year", "type":"Int32"},
    {"name":"Author", "type":"Composite"},
    {"name":"Author.Id", "type":"Int64", "operations":["equals"]},
    {"name":"Author.Name", "type":"String"},
    {"name":"Author.Affiliation", "type":"String"},
    {"name":"Keyword", "type":"String", "synonyms":"Keyword.syn"}
  ]
}

Here, you define Title, Year, and Keyword as a string, integer, and string attribute, respectively. Because authors are represented by ID, name, and affiliation, you define Author as a composite attribute with three sub-attributes: Author.Id, Author.Name, and Author.Affiliation.
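The relationship between a composite attribute and its sub-attributes can be checked with a few lines of Python. This is a minimal sketch, not part of the SDK: it assumes that a dotted name such as Author.Id always belongs to a parent declared as Composite, and the helper name group_attributes is hypothetical.

```python
import json

# Copy of the Academic.schema content shown above.
SCHEMA_TEXT = """
{
  "attributes":[
    {"name":"Title", "type":"String"},
    {"name":"Year", "type":"Int32"},
    {"name":"Author", "type":"Composite"},
    {"name":"Author.Id", "type":"Int64", "operations":["equals"]},
    {"name":"Author.Name", "type":"String"},
    {"name":"Author.Affiliation", "type":"String"},
    {"name":"Keyword", "type":"String", "synonyms":"Keyword.syn"}
  ]
}
"""


def group_attributes(schema_text):
    """Return {top-level attribute name: [sub-attribute names]}."""
    attributes = json.loads(schema_text)["attributes"]
    types = {attr["name"]: attr["type"] for attr in attributes}
    groups = {}
    for attr in attributes:
        name = attr["name"]
        if "." in name:
            parent = name.split(".", 1)[0]
            # Assumption: a dotted name must sit under a Composite parent.
            if types.get(parent) != "Composite":
                raise ValueError(f"{name}: parent {parent!r} is not Composite")
            groups.setdefault(parent, []).append(name)
        else:
            groups.setdefault(name, [])
    return groups


groups = group_attributes(SCHEMA_TEXT)
print(groups["Author"])  # ['Author.Id', 'Author.Name', 'Author.Affiliation']
```

Running a check like this before building an index can catch a misspelled parent name early, although kes.exe remains the authority on what a valid schema is.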

By default, attributes support all operations available for their data type, including equals, starts_with, and is_between. Because the author ID is only used internally as an identifier, override the default and specify equals as the only indexed operation.

For the Keyword attribute, allow synonyms to match the canonical keyword values by specifying the synonym file Keyword.syn in the attribute definition. This file contains a list of canonical and synonym value pairs:

...
["support vector machine","support vector machines"]
["support vector machine","support vector networks"]
["support vector machine","support vector regression"]
["support vector machine","support vector"]
["support vector machine","svm machine learning"]
["support vector machine","svm"]
["support vector machine","svms"]
["support vector machine","vector machine"]
...
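
To make the pair layout concrete, the following Python sketch reads lines in the format shown above and builds a synonym-to-canonical lookup. It only illustrates the file layout; it is not how KES performs synonym matching internally, and the helper names load_synonyms and canonicalize are hypothetical.

```python
import json

# A few lines in the Keyword.syn layout: ["canonical value", "synonym"].
SYNONYM_LINES = """\
["support vector machine","support vector machines"]
["support vector machine","svm"]
["support vector machine","svms"]
"""


def load_synonyms(text):
    """Map each synonym to its canonical keyword value."""
    lookup = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        canonical, synonym = json.loads(line)  # each line is a JSON array
        lookup[synonym] = canonical
    return lookup


lookup = load_synonyms(SYNONYM_LINES)


def canonicalize(keyword):
    """Return the canonical value for a keyword, or the keyword itself."""
    return lookup.get(keyword, keyword)


print(canonicalize("svm"))  # support vector machine
```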

For additional information about the schema definition, see Schema Format.

Generate data

The data file describes the list of publications to index, with each line specifying the attribute values of one paper in JSON format. The following example is a single line from the data file Academic.data, formatted for readability:

...
{
    "logprob": -16.714,
    "Title": "the world wide telescope",
    "Year": 2001,
    "Author": [
        {
            "Id": 717694024,
            "Name": "alexander s szalay",
            "Affiliation": "microsoft"
        },
        {
            "Id": 2131537204,
            "Name": "jim gray",
            "Affiliation": "microsoft"
        }
    ]
}
...

In this snippet, you specify the Title and Year attributes of the paper as a JSON string and number, respectively. Multiple values are represented by using JSON arrays. Because Author is a composite attribute, each value is represented by a JSON object consisting of its sub-attributes. Attributes with missing values, such as Keyword in this case, can be excluded from the JSON representation.
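
As an illustration of the one-object-per-line layout, here is a minimal Python sketch that serializes a paper as a single JSON line and leaves out attributes that have no value. The helper name to_data_line is hypothetical, and treating None, an empty list, and an empty string as "missing" is an assumption made for this example.

```python
import json


def to_data_line(paper):
    """Serialize one paper as a single JSON line, dropping empty attributes."""
    record = {
        name: value
        for name, value in paper.items()
        if value is not None and value != [] and value != ""
    }
    return json.dumps(record)  # json.dumps emits no line breaks by default


paper = {
    "logprob": -16.714,
    "Title": "the world wide telescope",
    "Year": 2001,
    "Author": [
        {"Id": 717694024, "Name": "alexander s szalay", "Affiliation": "microsoft"},
        {"Id": 2131537204, "Name": "jim gray", "Affiliation": "microsoft"},
    ],
    "Keyword": [],  # no keywords for this paper, so the attribute is omitted
}

line = to_data_line(paper)
print(line)
```

Writing one such line per paper to Academic.data produces a file in the shape described above.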

To differentiate the likelihood of different papers, specify the relative log probability by using the built-in logprob attribute. Given a probability p between 0 and 1, you compute the log probability as log(p), where log() is the natural log function.
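
The conversion is a one-line computation. The sketch below assumes you already have a probability for each paper; how you derive that probability (for example, from citation counts) is outside what this walkthrough specifies. Zero is rejected because log(0) is undefined.

```python
import math


def logprob(p):
    """Natural log of a probability p, with 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be greater than 0 and at most 1")
    return math.log(p)


print(logprob(1.0))   # 0.0: a certain outcome has log probability zero
print(logprob(0.5))   # a negative number; smaller p gives a more negative value

# Going the other way, the sample value -16.714 corresponds to a
# probability of roughly 5.5e-8.
print(math.exp(-16.714))
```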

For more information, see Data Format.

Build a compressed binary index

After you have a schema file and a data file, you can build a compressed binary index of the data objects by using kes.exe build_index. In this example, you build the index file Academic.index from the input schema file Academic.schema and the data file Academic.data. Use the following command:

kes.exe build_index Academic.schema Academic.data Academic.index

For rapid prototyping outside of Azure, kes.exe build_index can build small indices locally from data files containing up to 10,000 objects. For larger data files, you can either run the command from within a Windows VM in Azure, or perform a remote build in Azure. For details, see Scale up to host larger indices.
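
Because the data file holds one object per line, a quick line count tells you whether a local build outside Azure is possible. The following Python sketch assumes every non-blank line is one object; the helper names are hypothetical, and the command list simply mirrors the kes.exe build_index invocation shown above.

```python
LOCAL_OBJECT_LIMIT = 10_000  # limit for kes.exe build_index outside Azure


def count_objects(lines):
    """Count non-blank lines; each one is assumed to describe one object."""
    return sum(1 for line in lines if line.strip())


def can_build_outside_azure(object_count):
    """True if the data file fits within the local prototyping limit."""
    return object_count <= LOCAL_OBJECT_LIMIT


def build_index_command(schema, data, index):
    """Argument list for the local build command shown above."""
    return ["kes.exe", "build_index", schema, data, index]


# Usage with a real file (not executed here):
#   with open("Academic.data", encoding="utf-8") as handle:
#       total = count_objects(handle)
#   if can_build_outside_azure(total):
#       subprocess.run(build_index_command(
#           "Academic.schema", "Academic.data", "Academic.index"), check=True)
```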

Use an XML grammar specification

The grammar specifies the set of natural language queries that the service can interpret, as well as how these natural language queries are translated into semantic query expressions. In this example, you use the grammar specified in Academic.xml:

<grammar root="GetPapers">

  <!-- Import academic data schema-->
  <import schema="Academic.schema" name="academic"/>

  <!-- Define root rule-->
  <rule id="GetPapers">
    <example>papers about machine learning by michael jordan</example>

    papers
    <tag>
      yearOnce = false;
      isBeyondEndOfQuery = false;
      query = All();
    </tag>

    <item repeat="1-" repeat-logprob="-10">
      <!-- Do not complete additional attributes beyond end of query -->
      <tag>AssertEquals(isBeyondEndOfQuery, false);</tag>

      <one-of>
        <!-- about <keyword> -->
        <item logprob="-0.5">
          about <attrref uri="academic#Keyword" name="keyword"/>
          <tag>query = And(query, keyword);</tag>
        </item>

        <!-- by <authorName> [while at <authorAffiliation>] -->
        <item logprob="-1">
          by <attrref uri="academic#Author.Name" name="authorName"/>
          <tag>authorQuery = authorName;</tag>
          <item repeat="0-1" repeat-logprob="-1.5">
            while at <attrref uri="academic#Author.Affiliation" name="authorAffiliation"/>
            <tag>authorQuery = And(authorQuery, authorAffiliation);</tag>
          </item>
          <tag>
            authorQuery = Composite(authorQuery);
            query = And(query, authorQuery);
          </tag>
        </item>

        <!-- written (in|before|after) <year> -->
        <item logprob="-1.5">
          <!-- Allow this grammar path to be traversed only once -->
          <tag>
            AssertEquals(yearOnce, false);
            yearOnce = true;
          </tag>
          <ruleref uri="#GetPaperYear" name="year"/>
          <tag>query = And(query, year);</tag>
        </item>
      </one-of>

      <!-- Determine if current parse position is beyond end of query -->
      <tag>isBeyondEndOfQuery = GetVariable("IsBeyondEndOfQuery", "system");</tag>
    </item>
    <tag>out = query;</tag>
  </rule>

  <rule id="GetPaperYear">
    <tag>year = All();</tag>
    written
    <one-of>
      <item>
        in <attrref uri="academic#Year" name="year"/>
      </item>
      <item>
        before
        <one-of>
          <item>[year]</item>
          <item>
            <attrref uri="academic#Year" op="lt" name="year"/>
          </item>
        </one-of>
      </item>
      <item>
        after
        <one-of>
          <item>[year]</item>
          <item>
            <attrref uri="academic#Year" op="gt" name="year"/>
          </item>
        </one-of>
      </item>
    </one-of>
    <tag>out = year;</tag>
  </rule>
</grammar>

For more information about the grammar specification syntax, see Grammar Format.
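
Because the grammar is plain XML, you can inspect it with a standard XML parser before compiling it. The following Python sketch works on a trimmed excerpt of the grammar above (with GetPaperYear promoted to the root rule so that the excerpt stands alone) and lists the rules it defines and the schema attributes it references. The helper name summarize_grammar is hypothetical, and the check that the root rule exists is an assumption about what is worth validating, not a statement about what kes.exe enforces.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the grammar above, not the full Academic.xml.
GRAMMAR_EXCERPT = """
<grammar root="GetPaperYear">
  <import schema="Academic.schema" name="academic"/>
  <rule id="GetPaperYear">
    <tag>year = All();</tag>
    written
    <one-of>
      <item>
        in <attrref uri="academic#Year" name="year"/>
      </item>
      <item>
        before <attrref uri="academic#Year" op="lt" name="year"/>
      </item>
    </one-of>
    <tag>out = year;</tag>
  </rule>
</grammar>
"""


def summarize_grammar(xml_text):
    """Return the root rule, the rule IDs, and the referenced attributes."""
    root = ET.fromstring(xml_text)
    rules = [rule.get("id") for rule in root.iter("rule")]
    if root.get("root") not in rules:
        raise ValueError("the root rule is not defined in this grammar")
    attributes = sorted(
        {ref.get("uri").split("#", 1)[1] for ref in root.iter("attrref")}
    )
    return {"root": root.get("root"), "rules": rules, "attributes": attributes}


summary = summarize_grammar(GRAMMAR_EXCERPT)
print(summary)
```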

Compile the grammar

After you have an XML grammar specification, you can compile it into a binary grammar by using kes.exe build_grammar. If the grammar imports a schema, the schema file must be located in the same path as the grammar XML. In this example, you build the binary grammar file Academic.grammar from the input XML grammar file Academic.xml. Use the following command:

kes.exe build_grammar Academic.xml Academic.grammar
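
Since an imported schema has to sit in the same path as the grammar XML, it can help to verify that before compiling. This is a minimal Python sketch under that reading of the requirement: it resolves each import's schema attribute against the grammar file's folder and reports the ones that are missing. The helper names are hypothetical.

```python
import os
import xml.etree.ElementTree as ET


def imported_schema_paths(grammar_path, grammar_xml):
    """Resolve each imported schema file relative to the grammar's folder."""
    folder = os.path.dirname(os.path.abspath(grammar_path))
    root = ET.fromstring(grammar_xml)
    return [
        os.path.join(folder, node.get("schema"))
        for node in root.iter("import")
        if node.get("schema")
    ]


def missing_schemas(grammar_path, grammar_xml):
    """Return the imported schema paths that do not exist on disk."""
    return [
        path
        for path in imported_schema_paths(grammar_path, grammar_xml)
        if not os.path.isfile(path)
    ]


# Usage with a real file (not executed here):
#   with open("Academic.xml", encoding="utf-8") as handle:
#       problems = missing_schemas("Academic.xml", handle.read())
#   if problems:
#       print("Missing schema files:", problems)
```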

Host the grammar and index in a web service

For rapid prototyping, you can host the grammar and index in a web service on the local machine by using kes.exe host_service. You can then access the service through web APIs to validate the correctness of the data and the grammar design. In this example, you host the grammar file Academic.grammar and the index file Academic.index at http://localhost:8000/. Use the following command:

kes.exe host_service Academic.grammar Academic.index --port 8000

This starts a local instance of the web service. You can test the service interactively by visiting http://localhost:<port>/ from a browser. For more information, see Test the service.

You can also directly call various web APIs to test natural language interpretation, query completion, structured query evaluation, and histogram computation. To stop the service, enter "quit" into the kes.exe host_service command prompt, or press Ctrl+C.
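
Calls to the web APIs are ordinary HTTP GET requests, so request URLs can be assembled with the standard library. The sketch below builds URLs for the interpret and evaluate methods against the local service. The method names come from this walkthrough; the parameter names (query, complete, expr, attributes, count) and the expression syntax are assumptions for illustration and should be checked against the KES API reference. The helper name api_url is hypothetical.

```python
from urllib.parse import quote, urlencode


def api_url(base_url, method, **params):
    """Build a request URL, percent-encoding every parameter value."""
    query = urlencode(params, quote_via=quote)
    return f"{base_url.rstrip('/')}/{method}?{query}"


BASE_URL = "http://localhost:8000/"

# Natural language interpretation with query completion (assumed parameters).
interpret_url = api_url(BASE_URL, "interpret", query="papers by jim gray", complete=1)

# Structured query evaluation (assumed parameters and expression syntax).
evaluate_url = api_url(
    BASE_URL,
    "evaluate",
    expr="Composite(Author.Name=='jim gray')",
    attributes="Title,Year",
    count=2,
)

print(interpret_url)
# http://localhost:8000/interpret?query=papers%20by%20jim%20gray&complete=1
print(evaluate_url)
```

With the local service running, you can paste either URL into a browser or fetch it with urllib.request.urlopen.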

Outside of Azure, kes.exe host_service is limited to indices of up to 10,000 objects. Other limits include an API rate of 10 requests per second, and a total of 1000 requests before the process automatically terminates. To bypass these restrictions, run the command from within a Windows VM in Azure, or deploy to an Azure cloud service by using the kes.exe deploy_service command. For details, see Deploy the service.
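
When you script tests against a locally hosted service, a small client-side throttle keeps you inside these limits. The following Python sketch is one way to do that, not something the SDK provides: the class name RequestBudget is hypothetical, and it counts only the requests that your own script sends.

```python
import time


class RequestBudget:
    """Client-side guard for a per-second rate and a total request cap."""

    def __init__(self, max_per_second=10, max_total=1000,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_per_second = max_per_second
        self.max_total = max_total
        self.total = 0
        self._recent = []  # send times within the last second
        self._clock = clock
        self._sleep = sleep

    def acquire(self):
        """Block until a request may be sent; raise once the cap is reached."""
        if self.total >= self.max_total:
            raise RuntimeError("total request budget exhausted")
        while True:
            now = self._clock()
            self._recent = [t for t in self._recent if now - t < 1.0]
            if len(self._recent) < self.max_per_second:
                break
            # Wait until the oldest request in the window is a second old.
            self._sleep(1.0 - (now - self._recent[0]))
        self._recent.append(now)
        self.total += 1


# Usage: call budget.acquire() immediately before each HTTP request.
budget = RequestBudget()
budget.acquire()
print(budget.total)  # 1
```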

Scale up to host larger indices

When you run kes.exe outside of Azure, the index is limited to 10,000 objects. You can build and host larger indices by using Azure. To get started, sign up for a free trial. Alternatively, if you subscribe to Visual Studio or MSDN, you can activate your subscriber benefits, which include some Azure credits each month.

To allow kes.exe access to an Azure account, download the Azure Publish Settings file from the Azure portal. If prompted, sign in to the desired Azure account. Save the file as AzurePublishSettings.xml in the working directory from which kes.exe runs.

There are two ways to build and host large indices. The first is to prepare the schema and data files in a Windows VM in Azure. Then run kes.exe build_index to build the index locally on the VM, without any size restrictions. The resulting index can be hosted locally on the VM by using kes.exe host_service for rapid prototyping, again without any restrictions. For detailed steps, see the Azure VM tutorial.

The second method is to perform a remote Azure build by using kes.exe build_index with the --remote parameter, which specifies an Azure VM size. When the --remote parameter is specified, the command creates a temporary Azure VM of that size. It then builds the index on the VM, uploads the index to the target blob storage, and deletes the VM upon completion. Your Azure subscription is charged for the cost of the VM while the index is being built.

This remote Azure build capability lets you run kes.exe build_index from any environment. When you perform a remote build, the input schema and data arguments can be local file paths or Azure blob storage URLs. The output index argument must be a blob storage URL. To create an Azure storage account, see About Azure storage accounts. To copy files efficiently to and from blob storage, use the AzCopy utility.

In this example, assume that the following blob storage container has already been created: http://<account>.blob.core.windows.net/<container>/. It contains the schema Academic.schema, the referenced synonym file Keyword.syn, and the full-scale data file Academic.full.data. You can build the full index remotely by using the following command:

kes.exe build_index http://<account>.blob.core.windows.net/<container>/Academic.schema http://<account>.blob.core.windows.net/<container>/Academic.full.data http://<account>.blob.core.windows.net/<container>/Academic.full.index --remote <vm_size>
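
If you script remote builds, it is worth enforcing the rule that the output index must be a blob storage URL before the command runs. The following Python sketch assembles the same argument list as the command above and rejects a local output path. The helper names are hypothetical, the URL check is a simple prefix test, and the <account>, <container>, and <vm_size> placeholders are left for you to fill in.

```python
def is_blob_url(value):
    """Rough check that a value is a URL rather than a local file path."""
    return value.startswith(("http://", "https://"))


def remote_build_command(schema, data, index_url, vm_size):
    """Argument list for a remote build; inputs may be local paths or URLs."""
    if not is_blob_url(index_url):
        raise ValueError("a remote build must write the index to a blob storage URL")
    return ["kes.exe", "build_index", schema, data, index_url, "--remote", vm_size]


container = "http://<account>.blob.core.windows.net/<container>"
command = remote_build_command(
    f"{container}/Academic.schema",
    f"{container}/Academic.full.data",
    f"{container}/Academic.full.index",
    "<vm_size>",
)
print(" ".join(command))
```

To run the build from Python, pass the list to subprocess.run(command, check=True) once the placeholders are replaced.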

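Note that it might take 5-10 minutes to provision a temporary VM to build the index. For rapid prototyping, you can avoid this delay by developing with a smaller data set locally, or by running kes.exe from within a Windows VM in Azure, as described earlier in this section.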

Paging slows down the build process. To avoid paging when building an index, use a VM with at least three times as much RAM as the size of the input data file. For hosting, use a VM with at least 1 GB more RAM than the index size. For a list of available VM sizes, see Sizes for virtual machines.
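
These two sizing rules are simple arithmetic, sketched here in Python. The function names are hypothetical, and treating 1 GB as 1024^3 bytes is an assumption.

```python
GIB = 1024 ** 3  # assumption: 1 GB is taken as 1024^3 bytes


def min_build_ram_bytes(data_file_bytes):
    """RAM needed to build without paging: three times the data file size."""
    return 3 * data_file_bytes


def min_host_ram_bytes(index_bytes):
    """RAM needed to host: the index size plus 1 GB."""
    return index_bytes + GIB


# A 2 GB data file calls for 6 GB of RAM to build;
# a 4 GB index calls for 5 GB of RAM to host.
print(min_build_ram_bytes(2 * GIB) / GIB)  # 6.0
print(min_host_ram_bytes(4 * GIB) / GIB)   # 5.0
```

In practice you would read the sizes with os.path.getsize and pick the smallest VM size whose RAM meets both numbers.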

Deploy the service

After you have a grammar and an index, you are ready to deploy the service to an Azure cloud service. To create a new Azure cloud service, see How to create and deploy a cloud service. Do not specify a deployment package at this point.

After you create the cloud service, you can use kes.exe deploy_service to deploy the service. An Azure cloud service has two deployment slots: production and staging. For a service that receives live user traffic, you should initially deploy to the staging slot. Wait for the service to start up and initialize itself. Then send a few requests to validate the deployment and verify that it passes basic tests.

Swap the contents of the staging slot with the production slot, so that live traffic is directed to the newly deployed service. You can repeat this process when you deploy an updated version of the service with new data. As with all other Azure cloud services, you can optionally use the Azure portal to configure auto-scaling.

In this example, you deploy the Academic index to the staging slot of an existing cloud service with <vm_size> VMs. Use the following command:

kes.exe deploy_service http://<account>.blob.core.windows.net/<container>/Academic.grammar http://<account>.blob.core.windows.net/<container>/Academic.index <serviceName> <vm_size> --slot Staging

For a list of available VM sizes, see Sizes for virtual machines.

After you deploy the service, you can call the various web APIs to test natural language interpretation, query completion, structured query evaluation, and histogram computation.

Test the service

To debug a live service, browse to the host machine from a web browser. For a local service started via host_service, visit http://localhost:<port>/. For an Azure cloud service deployed via deploy_service, visit http://<serviceName>.cloudapp.net/.

This page contains a link to information about basic API call statistics, as well as the grammar and index hosted at this service. It also contains an interactive search interface that demonstrates the use of the web APIs. Enter queries into the search box to see the results of the interpret, evaluate, and calchistogram API calls. The underlying HTML source of this page also serves as an example of how to integrate the web APIs into an app to create a rich, interactive search experience.