
Quickstart: Execute an R script on an ML Services cluster in Azure HDInsight using RStudio Server

ML Services on Azure HDInsight allows R scripts to use Apache Spark and Apache Hadoop MapReduce to run distributed computations. ML Services controls how calls are executed by setting the compute context. The edge node of a cluster provides a convenient place to connect to the cluster and to run your R scripts. With an edge node, you have the option of running the parallelized distributed functions of RevoScaleR across the cores of the edge node server. You can also run them across the nodes of the cluster by using RevoScaleR's Hadoop MapReduce or Apache Spark compute contexts.

In this quickstart, you learn how to run an R script with RStudio Server that demonstrates using Spark for distributed R computations. You will define a compute context to perform computations locally on an edge node, and then again distributed across the nodes of the HDInsight cluster.
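
The same RevoScaleR code runs in either mode; only the compute context changes. The following minimal sketch shows the pattern this quickstart follows (rxSetComputeContext and RxSpark are RevoScaleR functions, preinstalled on ML Services clusters):

    # Run rx* functions on the edge node only
    rxSetComputeContext("local")

    # Distribute rx* functions across the cluster's worker nodes via Spark
    rxSetComputeContext(RxSpark())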

Prerequisite

An ML Services cluster on HDInsight. See Create Apache Hadoop clusters using the Azure portal and select ML Services for Cluster type.

Connect to RStudio Server

RStudio Server runs on the cluster's edge node. Go to the following URL, where CLUSTERNAME is the name of the ML Services cluster you created:

https://CLUSTERNAME.azurehdinsight.net/rstudio/

The first time you sign in, you need to authenticate twice. For the first authentication prompt, provide the cluster admin login and password (the default login is admin). For the second authentication prompt, provide the SSH login and password (the default login is sshuser). Subsequent sign-ins require only the SSH credentials.

Once you are connected, your screen should resemble the following screenshot:

[Screenshot: overview of the RStudio web console]

Use a compute context

  1. From RStudio Server, use the following code to load example data into the default storage for HDInsight:

    # Set the HDFS (WASB) location of example data
    bigDataDirRoot <- "/example/data"

    # Create a local folder for storing data temporarily
    source <- "/tmp/AirOnTimeCSV2012"
    dir.create(source)

    # Download the twelve monthly CSV files to the tmp folder
    remoteDir <- "https://packages.revolutionanalytics.com/datasets/AirOnTimeCSV2012"
    for (month in 1:12) {
        fileName <- sprintf("airOT2012%02d.csv", month)
        download.file(file.path(remoteDir, fileName), file.path(source, fileName))
    }

    # Set the directory in bigDataDirRoot to load the data into
    inputDir <- file.path(bigDataDirRoot, "AirOnTimeCSV2012")

    # Make the directory
    rxHadoopMakeDir(inputDir)

    # Copy the data from source to input
    rxHadoopCopyFromLocal(source, bigDataDirRoot)
    

    This step may take around 8 minutes to complete.
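
    If you want to verify that the files arrived, you can list the HDFS target directory. This optional check is not part of the original walkthrough; rxHadoopListFiles is a RevoScaleR helper for HDFS:

    # List the files now present in the HDFS input directory
    rxHadoopListFiles(inputDir)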

  2. Create some data info and define two data sources. Enter the following code in RStudio:

    # Define the HDFS (WASB) file system
    hdfsFS <- RxHdfsFileSystem()

    # Create a column info list for the airline data
    airlineColInfo <- list(
        DAY_OF_WEEK = list(type = "factor"),
        ORIGIN = list(type = "factor"),
        DEST = list(type = "factor"),
        DEP_TIME = list(type = "integer"),
        ARR_DEL15 = list(type = "logical"))

    # Get all the column names
    varNames <- names(airlineColInfo)

    # Define the text data source in HDFS
    airOnTimeData <- RxTextData(inputDir, colInfo = airlineColInfo, varsToKeep = varNames, fileSystem = hdfsFS)

    # Define the text data source on the local file system
    airOnTimeDataLocal <- RxTextData(source, colInfo = airlineColInfo, varsToKeep = varNames)

    # Formula to use
    formula <- "ARR_DEL15 ~ ORIGIN + DAY_OF_WEEK + DEP_TIME + DEST"
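
    Before fitting a model, you can optionally inspect the local data source to confirm the column definitions took effect. This check is an addition to the original steps; rxGetInfo is a RevoScaleR helper:

    # Show variable metadata and the first few rows (optional)
    rxGetInfo(airOnTimeDataLocal, getVarInfo = TRUE, numRows = 3)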
    
  3. Run a logistic regression over the data using the local compute context. Enter the following code in RStudio:

    # Set a local compute context
    rxSetComputeContext("local")

    # Run a logistic regression
    system.time(
        modelLocal <- rxLogit(formula, data = airOnTimeDataLocal)
    )

    # Display a summary
    summary(modelLocal)
    

    The computations should complete in about 7 minutes. You should see output that ends with lines similar to the following snippet:

    Data: airOnTimeDataLocal (RxTextData Data Source)
    File name: /tmp/AirOnTimeCSV2012
    Dependent variable(s): ARR_DEL15
    Total independent variables: 634 (Including number dropped: 3)
    Number of valid observations: 6005381
    Number of missing observations: 91381
    -2*LogLikelihood: 5143814.1504 (Residual deviance on 6004750 degrees of freedom)

    Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
     (Intercept)   -3.370e+00  1.051e+00  -3.208  0.00134 **
     ORIGIN=JFK     4.549e-01  7.915e-01   0.575  0.56548
     ORIGIN=LAX     5.265e-01  7.915e-01   0.665  0.50590
     ......
     DEST=SHD       5.975e-01  9.371e-01   0.638  0.52377
     DEST=TTN       4.563e-01  9.520e-01   0.479  0.63172
     DEST=LAR      -1.270e+00  7.575e-01  -1.676  0.09364 .
     DEST=BPT         Dropped    Dropped Dropped  Dropped

     ---

     Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

     Condition number of final variance-covariance matrix: 11904202
     Number of iterations: 7
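
    The printed summary is plain text; the fitted coefficients are also available on the model object itself if you want to work with them programmatically (an optional aside, not part of the original steps):

    # Look at the first few fitted coefficients as a named vector
    head(modelLocal$coefficients)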
    
  4. Run the same logistic regression using the Spark compute context. The Spark context distributes the processing over all the worker nodes in the HDInsight cluster. Enter the following code in RStudio:

    # Define the Spark compute context
    mySparkCluster <- RxSpark()

    # Set the compute context
    rxSetComputeContext(mySparkCluster)

    # Run a logistic regression
    system.time(
        modelSpark <- rxLogit(formula, data = airOnTimeData)
    )

    # Display a summary
    summary(modelSpark)
    

    The computations should complete in about 5 minutes.
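
    Both runs fit the same model over the same twelve months of data, so the two sets of coefficients should agree closely. A quick optional comparison (a sketch; na.rm skips coefficients reported as Dropped, assuming they come back as NA):

    # Largest absolute difference between the local and Spark coefficients
    max(abs(modelLocal$coefficients - modelSpark$coefficients), na.rm = TRUE)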

Clean up resources

After you complete the quickstart, you may want to delete the cluster. With HDInsight, your data is stored in Azure Storage, so you can safely delete a cluster when it is not in use. You are also charged for an HDInsight cluster, even when it is not in use. Since the charges for the cluster are many times more than the charges for storage, it makes economic sense to delete clusters when they are not in use.

To delete a cluster, see Delete an HDInsight cluster using your browser, PowerShell, or the Azure CLI.

Next steps

In this quickstart, you learned how to run an R script with RStudio Server that demonstrated using Spark for distributed R computations. Advance to the next article to learn the options that are available to specify whether and how execution is parallelized across the cores of the edge node or the HDInsight cluster.

Note

This page describes features of RStudio software. Microsoft Azure HDInsight is not affiliated with RStudio, Inc.