
Compute context options for ML Services on HDInsight

ML Services on Azure HDInsight lets you control how calls are executed by setting the compute context. This article outlines the options that are available to specify whether and how execution is parallelized across cores of the edge node or HDInsight cluster.

The edge node of a cluster provides a convenient place to connect to the cluster and to run your R scripts. With an edge node, you have the option of running the parallelized distributed functions of RevoScaleR across the cores of the edge node server. You can also run them across the nodes of the cluster by using RevoScaleR's Hadoop Map Reduce or Apache Spark compute contexts.

ML Services on Azure HDInsight

ML Services on Azure HDInsight provides the latest capabilities for R-based analytics. It can use data that is stored in an Apache Hadoop HDFS container in your Azure Blob storage account, a Data Lake Store, or the local Linux file system. Since ML Services is built on open-source R, the R-based applications you build can apply any of the 8,000+ open-source R packages. They can also use the routines in RevoScaleR, Microsoft's big data analytics package that is included with ML Services.

Compute contexts for an edge node

In general, an R script that runs in an ML Services cluster on the edge node runs within the R interpreter on that node. The exceptions are those steps that call a RevoScaleR function. The RevoScaleR calls run in a compute environment that is determined by how you set the RevoScaleR compute context. When you run your R script from an edge node, the possible values of the compute context are:

  • Local sequential (local)
  • Local parallel (localpar)
  • Map Reduce
  • Spark

The local and localpar options differ only in how rxExec calls are executed. They both execute other rx-function calls in parallel across all available cores, unless specified otherwise through the RevoScaleR numCoresToUse option, for example rxOptions(numCoresToUse=6). Parallel execution options offer optimal performance.
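The following is a minimal sketch of that distinction (the worker function is purely illustrative):

```r
# Cap rx-functions at 6 cores instead of all available cores (optional).
rxOptions(numCoresToUse = 6)

# Under 'local', rx-function calls are still parallelized across cores,
# but rxExec calls run serially.
rxSetComputeContext("local")
rxExec(function(i) i^2, rxElemArg(1:4))   # tasks run one after another

# Under 'localpar', rxExec calls are also parallelized across the cores.
rxSetComputeContext("localpar")
rxExec(function(i) i^2, rxElemArg(1:4))   # tasks run in parallel
```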

The following table summarizes the compute context options for setting how calls are executed:

| Compute context | How to set | Execution context |
| --- | --- | --- |
| Local sequential | rxSetComputeContext('local') | Parallelized execution across the cores of the edge node server, except for rxExec calls, which are executed serially |
| Local parallel | rxSetComputeContext('localpar') | Parallelized execution across the cores of the edge node server |
| Spark | RxSpark() | Parallelized distributed execution via Spark across the nodes of the HDI cluster |
| Map Reduce | RxHadoopMR() | Parallelized distributed execution via Map Reduce across the nodes of the HDI cluster |
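In code, each context is set as follows; this is a minimal sketch that assumes the default constructor arguments of RxSpark() and RxHadoopMR() are sufficient, as they typically are when run from an HDInsight edge node:

```r
# Switching compute contexts from the edge node. Constructor arguments for
# RxSpark() and RxHadoopMR() are left at their defaults here (an assumption;
# your cluster may need explicit settings such as share directories).
rxSetComputeContext("local")       # local sequential
rxSetComputeContext("localpar")    # local parallel

sparkCC <- RxSpark()               # distributed execution via Spark
rxSetComputeContext(sparkCC)

mrCC <- RxHadoopMR()               # distributed execution via Map Reduce
rxSetComputeContext(mrCC)
```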

Guidelines for deciding on a compute context

Which of the three options that provide parallelized execution you choose depends on the nature of your analytics work, and on the size and location of your data. There's no simple formula that tells you which compute context to use. There are, however, some guiding principles that can help you make the right choice, or at least narrow down your choices before you run a benchmark. These guiding principles include:

  • The local Linux file system is faster than HDFS.
  • Repeated analyses are faster if the data is local, and if it's in XDF.
  • It's preferable to stream small amounts of data from a text data source. If the amount of data is larger, convert it to XDF before analysis.
  • The overhead of copying or streaming the data to the edge node for analysis becomes unmanageable for very large amounts of data.
  • Apache Spark is faster than Map Reduce for analysis in Hadoop.

Given these principles, the following sections offer some general rules of thumb for selecting a compute context.

Local

  • If the amount of data to analyze is small and doesn't require repeated analysis, then stream it directly into the analysis routine using local or localpar.
  • If the amount of data to analyze is small or medium-sized and requires repeated analysis, then copy it to the local file system, import it to XDF, and analyze it via local or localpar, as in the sketch after this list.
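The following is a hedged sketch of that second pattern, using the AirlineDemoSmall sample that ships with RevoScaleR; the file paths are assumptions:

```r
# Copy locally, import to XDF once, then analyze repeatedly.
rxSetComputeContext("localpar")

csvFile <- RxTextData("/tmp/AirlineDemoSmall.csv")   # data copied to the edge node
xdfFile <- RxXdfData("/tmp/AirlineDemoSmall.xdf")

# One-time import: repeated analyses then read the efficient XDF format
# instead of re-parsing the text file each time.
rxImport(inData = csvFile, outFile = xdfFile,
         stringsAsFactors = TRUE, overwrite = TRUE)

rxSummary(~ ArrDelay + DayOfWeek, data = xdfFile)
rxLinMod(ArrDelay ~ DayOfWeek, data = xdfFile)
```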

Apache Spark

  • If the amount of data to analyze is large, then import it to a Spark DataFrame using RxHiveData or RxParquetData, or to XDF in HDFS (unless storage is an issue), and analyze it using the Spark compute context, as shown in the sketch below.
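A hedged sketch of both variants follows; the Hive table, HDFS paths, and column names are assumptions made for illustration:

```r
# Analyze large data in the Spark compute context.
rxSetComputeContext(RxSpark())

# Option 1: query a Hive table directly into a Spark DataFrame.
hiveData <- RxHiveData(query = "SELECT ArrDelay, DayOfWeek FROM airline")
rxLinMod(ArrDelay ~ DayOfWeek, data = hiveData)

# Option 2: import text data in HDFS to XDF in HDFS for repeated analyses.
hdfs <- RxHdfsFileSystem()
csv  <- RxTextData("/example/data/airline.csv", fileSystem = hdfs)
xdf  <- RxXdfData("/example/data/airlineXdf", fileSystem = hdfs)
rxImport(inData = csv, outFile = xdf,
         stringsAsFactors = TRUE, overwrite = TRUE)
rxSummary(~ ArrDelay, data = xdf)
```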

Apache Hadoop Map Reduce

  • Use the Map Reduce compute context only if you come across an insurmountable problem with the Spark compute context, since Map Reduce is generally slower.
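A minimal sketch; the HDFS path mirrors the Spark sketch above, and the point is that the same RevoScaleR code runs unchanged once the compute context is swapped:

```r
# Fall back to Map Reduce only when Spark isn't an option.
rxSetComputeContext(RxHadoopMR())

hdfs <- RxHdfsFileSystem()
xdf  <- RxXdfData("/example/data/airlineXdf", fileSystem = hdfs)
rxSummary(~ ArrDelay, data = xdf)
```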

Inline help on rxSetComputeContext

For more information and examples of RevoScaleR compute contexts, see the inline help in R on the rxSetComputeContext method, for example:

> ?rxSetComputeContext

You can also refer to the Distributed computing overview in the Machine Learning Server documentation.

Next steps

In this article, you learned about the options that are available to specify whether and how execution is parallelized across cores of the edge node or HDInsight cluster. To learn more about how to use ML Services with HDInsight clusters, see the following topics: