Introduction to ML Services and open-source R capabilities on HDInsight

Note

In September 2017, Microsoft R Server was released under the new name Microsoft Machine Learning Server, or ML Server. Consequently, an R Server cluster on HDInsight is now called a Machine Learning Services, or ML Services, cluster on HDInsight. For more information on the R Server name change, see Microsoft R Server is now Microsoft Machine Learning Server.

Microsoft Machine Learning Server is available as a deployment option when you create HDInsight clusters in Azure. The cluster type that provides this option is called ML Services. This capability gives data scientists, statisticians, and R programmers on-demand access to scalable, distributed methods of analytics on HDInsight.

ML Services on HDInsight provides the latest capabilities for R-based analytics on datasets of virtually any size, loaded to either Azure Blob or Data Lake storage. Because an ML Services cluster is built on open-source R, the R-based applications you build can use any of the 8000+ open-source R packages. The routines in ScaleR, Microsoft's big data analytics package, are also available.

The edge node of a cluster provides a convenient place to connect to the cluster and to run your R scripts. With an edge node, you have the option of running the parallelized distributed functions of ScaleR across the cores of the edge node server. You can also run them across the nodes of the cluster by using ScaleR's Hadoop Map Reduce or Apache Spark compute contexts.

The models or predictions that result from analysis can be downloaded for on-premises use. They can also be operationalized elsewhere in Azure, in particular through an Azure Machine Learning Studio web service.

Get started with ML Services on HDInsight

To create an ML Services cluster in Azure HDInsight, select the ML Services cluster type when you create an HDInsight cluster in the Azure portal. The ML Services cluster type includes ML Server on the data nodes of the cluster and on an edge node, which serves as a landing zone for ML Services-based analytics. See Getting Started with ML Services on HDInsight for a walkthrough on how to create the cluster.

Why choose ML Services in HDInsight?

ML Services in HDInsight provides the following benefits:

AI innovation from Microsoft and open-source

ML Services includes a highly scalable, distributed set of algorithms, such as RevoScaleR, revoscalepy, and MicrosoftML, that can work on data larger than physical memory and run on a wide variety of platforms in a distributed manner. Learn more about the collection of Microsoft's custom R packages and Python packages included with the product.

ML Services bridges these Microsoft innovations and contributions from the open-source community (R, Python, and AI toolkits) on top of a single enterprise-grade platform. Any R or Python open-source machine learning package can work side by side with any proprietary innovation from Microsoft.

Simple, secure, and high-scale operationalization and administration

Enterprises that rely on traditional paradigms and environments invest much time and effort in operationalization. This results in inflated costs and delays, including the translation time for models, iterations to keep them valid and current, regulatory approval, and managing permissions through operationalization.

ML Services offers enterprise-grade operationalization: after a machine learning model is completed, it takes just a few clicks to generate web service APIs. These web services are hosted on a server grid in the cloud and can be integrated with line-of-business applications. The ability to deploy to an elastic grid lets you scale seamlessly with the needs of your business, both for batch and real-time scoring. For instructions, see Operationalize ML Services on HDInsight.

Key features of ML Services on HDInsight

The following features are included in ML Services on HDInsight.

Feature category | Description
R-enabled | R packages for solutions written in R, with an open-source distribution of R and run-time infrastructure for script execution.
Python-enabled | Python modules for solutions written in Python, with an open-source distribution of Python and run-time infrastructure for script execution.
Pre-trained models | For visual analysis and text sentiment analysis, ready to score data you provide.
Deploy and consume | Operationalize your server and deploy solutions as a web service.
Remote execution | Start remote sessions on an ML Services cluster on your network from your client workstation.

Data storage options for ML Services on HDInsight

Default storage for the HDFS file system of HDInsight clusters can be associated with either an Azure Storage account or Azure Data Lake Storage. This association ensures that whatever data is uploaded to the cluster storage during analysis is made persistent, and that the data is available even after the cluster is deleted. There are various tools for handling the data transfer to the storage option that you select, including the portal-based upload facility of the storage account and the AzCopy utility.

You have the option of enabling access to additional Blob and Data Lake stores during the cluster provisioning process, regardless of the primary storage option in use. See Getting started with ML Services on HDInsight for information on adding access to additional accounts. See the Azure Storage options for ML Services on HDInsight article to learn more about using multiple storage accounts.

You can also use Azure Files as a storage option for use on the edge node. Azure Files enables you to mount a file share that was created in Azure Storage to the Linux file system. For more information about these data storage options for an ML Services cluster on HDInsight, see Azure Storage options for ML Services on HDInsight.

Access ML Services edge node

You can connect to Microsoft ML Server on the edge node using a browser. It is installed by default during cluster creation. For more information, see Get started with ML Services on HDInsight. You can also connect to the cluster edge node from the command line by using SSH/PuTTY to access the R console.

Develop and run R scripts

The R scripts you create and run can use any of the 8000+ open-source R packages in addition to the parallelized and distributed routines available in the ScaleR library. In general, a script that is run with ML Services on the edge node runs within the R interpreter on that node. The exceptions are those steps that call a ScaleR function with a compute context that is set to Hadoop Map Reduce (RxHadoopMR) or Spark (RxSpark). In this case, the function runs in a distributed fashion across those data (task) nodes of the cluster that are associated with the data referenced. For more information about the different compute context options, see Compute context options for ML Services on HDInsight.
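As an illustration of the compute contexts described above, here is a minimal sketch of switching between the local and Spark contexts from the edge node. It assumes the RevoScaleR package that ships with ML Services is installed and that RxSpark() can be created with its default arguments on your cluster; the guard and the variable names are illustrative only.

```r
# Sketch only: switch ScaleR between the local and Spark compute contexts.
# Assumes RevoScaleR is installed, as it is on an ML Services edge node.
revoAvailable <- requireNamespace("RevoScaleR", quietly = TRUE)

if (revoAvailable) {
  library(RevoScaleR)

  # Local compute context: ScaleR functions run in the R interpreter
  # on the edge node and use its cores.
  rxSetComputeContext("local")

  # Spark compute context: subsequent ScaleR calls are distributed
  # across the nodes of the cluster.
  sparkContext <- RxSpark()
  rxSetComputeContext(sparkContext)

  # ScaleR calls placed here (for example, on a data source in cluster
  # storage) would run on the cluster rather than on the edge node.

  # Return to the local compute context when the distributed work is done.
  rxSetComputeContext("local")
} else {
  message("RevoScaleR is not installed; run this on an ML Services edge node.")
}
```

Only the steps that run between the two rxSetComputeContext() calls are distributed; everything else in the script still runs in the edge node's R interpreter.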

Operationalize a model

When your data modeling is complete, you can operationalize the model to make predictions for new data either from Azure or on-premises. This process is known as scoring. Scoring can be done in HDInsight, Azure Machine Learning, or on-premises.

Score in HDInsight

To score in HDInsight, write an R function that calls your model to make predictions for a new data file that you've loaded to your storage account. Then, save the predictions back to the storage account. You can run this routine on demand on the edge node of your cluster or by using a scheduled job.
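The scoring routine described above might look like the following sketch. It uses only base R: the function name score_new_data is hypothetical, an open-source lm model stands in for your own model, and temporary local files stand in for files in your storage account.

```r
# Hypothetical scoring routine: read new data, predict, save predictions.
score_new_data <- function(model, input_path, output_path) {
  new_data <- read.csv(input_path)
  new_data$prediction <- predict(model, newdata = new_data)
  write.csv(new_data, output_path, row.names = FALSE)
  invisible(new_data)
}

# Stand-in model trained on the built-in 'cars' dataset.
model <- lm(dist ~ speed, data = cars)

# Temporary files stand in for paths in the storage account.
input_path <- tempfile(fileext = ".csv")
output_path <- tempfile(fileext = ".csv")
write.csv(data.frame(speed = c(10, 20)), input_path, row.names = FALSE)

scored <- score_new_data(model, input_path, output_path)
```

Keeping the read, predict, and write steps in one function makes the routine easy to call on demand or from a scheduled job.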

Score in Azure Machine Learning (AML)

To score using Azure Machine Learning, use the open-source Azure Machine Learning R package known as AzureML to publish your model as an Azure web service. For convenience, this package is pre-installed on the edge node. Next, use the facilities in Azure Machine Learning to create a user interface for the web service, and then call the web service as needed for scoring.

If you choose this option, you must convert any ScaleR model objects to equivalent open-source model objects for use with the web service. Use ScaleR coercion functions, such as as.randomForest() for ensemble-based models, for this conversion.

Score on-premises

To score on-premises after creating your model, you can serialize the model in R, download it, de-serialize it, and then use it for scoring new data. You can score new data by using the approach described earlier in Score in HDInsight or by using web services.
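The serialize, download, and de-serialize steps above can be sketched with base R's saveRDS() and readRDS(). This is a sketch under assumptions: an open-source lm model stands in for your own model, and a file in the temporary directory stands in for the file you would download from the cluster.

```r
# Stand-in model trained on the built-in 'cars' dataset.
model <- lm(dist ~ speed, data = cars)

# On the cluster: serialize the model to a file.
model_path <- file.path(tempdir(), "model.rds")
saveRDS(model, model_path)

# (Download the file to the on-premises machine at this point.)

# On-premises: de-serialize the model and score new data with it.
restored <- readRDS(model_path)
new_data <- data.frame(speed = c(10, 20))
predictions <- predict(restored, newdata = new_data)
```

The on-premises R session needs whichever packages define the model's class, so that predict() can dispatch on the restored object.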

Maintain the cluster

Install and maintain R packages

Most of the R packages that you use are required on the edge node, since most steps of your R scripts run there. To install additional R packages on the edge node, you can use the install.packages() function in R.

If you are just using routines from the ScaleR library across the cluster, you do not usually need to install additional R packages on the data nodes. However, you might need additional packages to support the use of rxExec or rxDataStep execution on the data nodes.

In such cases, the additional packages can be installed with a script action after you create the cluster. For more information, see Manage ML Services in HDInsight cluster.

Change Apache Hadoop MapReduce memory settings

A cluster can be modified to change the amount of memory that is available to ML Services when it is running a MapReduce job. To modify a cluster, use the Apache Ambari UI that's available through the Azure portal blade for your cluster. For instructions about how to access the Ambari UI for your cluster, see Manage HDInsight clusters using the Ambari Web UI.

It is also possible to change the amount of memory that is available to ML Services by using Hadoop switches in the call to RxHadoopMR, as follows:

hadoopSwitches = "-libjars /etc/hadoop/conf -Dmapred.job.map.memory.mb=6656"  
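For context, the switch above is passed as the hadoopSwitches argument when the compute context is created. The following is a minimal sketch, assuming the RevoScaleR package is installed and that the remaining RxHadoopMR arguments can keep their defaults on your cluster; the variable names are illustrative.

```r
# Memory setting from the text above, passed through to Hadoop as-is.
memorySwitches <- "-libjars /etc/hadoop/conf -Dmapred.job.map.memory.mb=6656"

# Assumes RevoScaleR is installed, as it is on an ML Services edge node.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)

  # Create a MapReduce compute context that carries the switches,
  # then make it the active compute context.
  mrContext <- RxHadoopMR(hadoopSwitches = memorySwitches)
  rxSetComputeContext(mrContext)
} else {
  message("RevoScaleR is not installed; run this on an ML Services edge node.")
}
```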

Scale your cluster

An existing ML Services cluster on HDInsight can be scaled up or down through the portal. By scaling up, you can gain the additional capacity that you might need for larger processing tasks, or you can scale back a cluster when it is idle. For instructions about how to scale a cluster, see Manage HDInsight clusters.

Maintain the system

Maintenance to apply OS patches and other updates is performed on the underlying Linux VMs in an HDInsight cluster during off-hours. Typically, maintenance is done at 3:30 AM (based on the local time for the VM) every Monday and Thursday. Updates are performed in such a way that they don't impact more than a quarter of the cluster at a time.

Because the head nodes are redundant and not all data nodes are impacted at once, any jobs that are running during this time might slow down, but they should still run to completion. Any custom software or local data that you have is preserved across these maintenance events unless a catastrophic failure occurs that requires a cluster rebuild.

IDE options for ML Services on HDInsight

The Linux edge node of an HDInsight cluster is the landing zone for R-based analysis. Recent versions of HDInsight provide a default installation of RStudio Server on the edge node as a browser-based IDE. Using RStudio Server as an IDE for the development and execution of R scripts can be considerably more productive than just using the R console.

Additionally, you can install a desktop IDE and use it to access the cluster through a remote MapReduce or Spark compute context. Options include Microsoft's R Tools for Visual Studio (RTVS), RStudio, and Walware's Eclipse-based StatET.

You can also access the R console on the edge node by typing R at the Linux command prompt after connecting via SSH or PuTTY. When using the console interface, it is convenient to run a text editor for R script development in another window, and cut and paste sections of your script into the R console as needed.

Pricing

The prices that are associated with an ML Services HDInsight cluster are structured similarly to the prices for other HDInsight cluster types. They are based on the sizing of the underlying VMs across the name, data, and edge nodes, with the addition of a core-hour uplift. For more information, see HDInsight pricing.

Next steps

To learn more about how to use ML Services on HDInsight clusters, see the following topics: