Tutorial: Run a parallel R simulation with Azure Batch

Run your parallel R workloads at scale using doAzureParallel, a lightweight R package that lets you use Azure Batch directly from your R session. The doAzureParallel package is built on top of the popular foreach R package. doAzureParallel takes each iteration of the foreach loop and submits it as an Azure Batch task.

This tutorial shows you how to deploy a Batch pool and run a parallel R job in Azure Batch directly within RStudio. You learn how to:

  • Install doAzureParallel and configure it to access your Batch and storage accounts
  • Create a Batch pool as a parallel backend for your R session
  • Run a sample parallel simulation on the pool

Prerequisites

Sign in to Azure

Sign in to the Azure portal at https://portal.azure.com.

Get account credentials

For this example, you need to provide credentials for your Batch and Storage accounts. A straightforward way to get the necessary credentials is in the Azure portal. (You can also get these credentials using the Azure APIs or command-line tools.)

  1. Click All services > Batch accounts, and then click the name of your Batch account.

  2. To see the Batch credentials, click Keys. Copy the values of Batch account, URL, and Primary access key to a text editor.

  3. To see the Storage account name and keys, click Storage account. Copy the values of Storage account name and Key1 to a text editor.

Install doAzureParallel

In the RStudio console, install the doAzureParallel GitHub package. The following commands download and install the package and its dependencies in your current R session:

# Install the devtools package  
install.packages("devtools") 

# Install rAzureBatch package
devtools::install_github("Azure/rAzureBatch") 

# Install the doAzureParallel package 
devtools::install_github("Azure/doAzureParallel") 
 
# Load the doAzureParallel library 
library(doAzureParallel) 

Installation can take several minutes.

To configure doAzureParallel with the account credentials you obtained previously, generate a configuration file called credentials.json in your working directory:

generateCredentialsConfig("credentials.json") 

Populate this file with your Batch and storage account names and keys. Leave the githubAuthenticationToken setting unchanged.

When complete, the credentials file looks similar to the following:

{
  "batchAccount": {
    "name": "mybatchaccount",
    "key": "xxxxxxxxxxxxxxxxE+yXrRvJAqT9BlXwwo1CwF+SwAYOxxxxxxxxxxxxxxxx43pXi/gdiATkvbpLRl3x14pcEQ==",
    "url": "https://mybatchaccount.mybatchregion.batch.azure.com"
  },
  "storageAccount": {
    "name": "mystorageaccount",
    "key": "xxxxxxxxxxxxxxxxy4/xxxxxxxxxxxxxxxxfwpbIC5aAWA8wDu+AFXZB827Mt9lybZB1nUcQbQiUrkPtilK5BQ=="
  },
  "githubAuthenticationToken": ""
}

Save the file. Then, run the following command to set the credentials for your current R session:

setCredentials("credentials.json") 
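If setting the credentials fails, an empty field in the file is a likely cause. The following is a minimal sketch of a check for that, assuming the field layout shown above. The helper name check_credentials_file is hypothetical and is not part of doAzureParallel; it relies only on the jsonlite package.

```r
# Sketch: list the credential fields that are missing or empty.
# check_credentials_file() is a hypothetical helper, not a doAzureParallel function.
library(jsonlite)

check_credentials_file <- function(path) {
  cfg <- fromJSON(path)
  required <- list(
    c("batchAccount", "name"),
    c("batchAccount", "key"),
    c("batchAccount", "url"),
    c("storageAccount", "name"),
    c("storageAccount", "key")
  )
  missing <- character(0)
  for (field in required) {
    value <- cfg[[field[1]]][[field[2]]]
    if (is.null(value) || !nzchar(value)) {
      missing <- c(missing, paste(field, collapse = "."))
    }
  }
  missing
}

# Only runs when the file from the previous step is in the working directory
if (file.exists("credentials.json")) {
  empty_fields <- check_credentials_file("credentials.json")
  if (length(empty_fields) > 0) {
    message("Fill in these fields: ", paste(empty_fields, collapse = ", "))
  }
}
```

The check deliberately ignores githubAuthenticationToken, which the tutorial leaves empty.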

Create a Batch pool

doAzureParallel includes a function to generate an Azure Batch pool (cluster) to run parallel R jobs. The nodes run an Ubuntu-based Azure Data Science Virtual Machine. Microsoft R Open and popular R packages are pre-installed on this image. You can view or customize certain cluster settings, such as the number and size of the nodes.

To generate a cluster configuration JSON file in your working directory:

generateClusterConfig("cluster.json")

Open the file to view the default configuration, which includes 3 dedicated nodes and 3 low-priority nodes. These settings are just examples that you can experiment with or modify. Dedicated nodes are reserved for your pool. Low-priority nodes are offered at a reduced price from surplus VM capacity in Azure, and they become unavailable if Azure does not have enough capacity.

For this tutorial, change the configuration as follows:

  • Increase maxTasksPerNode to 2, to take advantage of both cores on each node.
  • Set dedicatedNodes to 0, so you can try the low-priority VMs available for Batch. Set the min of lowPriorityNodes to 5 and the max to 10, or choose smaller numbers if desired.

Leave defaults for the remaining settings, and save the file. It should look similar to the following:

{
  "name": "myPoolName",
  "vmSize": "Standard_D2_v2",
  "maxTasksPerNode": 2,
  "poolSize": {
    "dedicatedNodes": {
      "min": 0,
      "max": 0
    },
    "lowPriorityNodes": {
      "min": 5,
      "max": 10
    },
    "autoscaleFormula": "QUEUE"
  },
  "containerImage": "rocker/tidyverse:latest",
  "rPackages": {
    "cran": [],
    "github": [],
    "bioconductor": []
  },
  "commandLine": []
}
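The same changes can also be made from R instead of a text editor. The following is a rough sketch, assuming the generated file has the structure shown above. The helper name apply_tutorial_settings is hypothetical, and the read-modify-write approach with jsonlite is one possible way to do this, not something doAzureParallel provides.

```r
library(jsonlite)

# Hypothetical helper: apply this tutorial's settings to a generated cluster file.
apply_tutorial_settings <- function(path) {
  # simplifyVector = FALSE keeps empty arrays such as "cran": [] as empty lists
  cfg <- fromJSON(path, simplifyVector = FALSE)
  cfg$maxTasksPerNode <- 2
  cfg$poolSize$dedicatedNodes <- list(min = 0, max = 0)
  cfg$poolSize$lowPriorityNodes <- list(min = 5, max = 10)
  # auto_unbox = TRUE writes length-one values as scalars rather than arrays;
  # null = "null" keeps any null values as null
  write_json(cfg, path, pretty = TRUE, auto_unbox = TRUE, null = "null")
  invisible(cfg)
}

# Only runs when cluster.json exists in the working directory
if (file.exists("cluster.json")) {
  apply_tutorial_settings("cluster.json")
}
```

Open the file afterwards to confirm that the remaining settings were preserved as expected.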

Now create the cluster. Batch creates the pool immediately, but it takes a few minutes to allocate and start the compute nodes. After the cluster is available, register it as the parallel backend for your R session.

# Create your cluster if it does not exist; this takes a few minutes
cluster <- makeCluster("cluster.json") 
  
# Register your parallel backend 
registerDoAzureParallel(cluster) 
  
# Check that the nodes are running 
getDoParWorkers() 

The output shows the number of "execution workers" for doAzureParallel. This number is the number of nodes multiplied by the value of maxTasksPerNode. If you modified the cluster configuration as described previously, the number is 10 (5 low-priority nodes × 2 tasks per node).
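The worker count follows directly from the cluster configuration. The short sketch below reproduces the arithmetic with the values from this tutorial's cluster.json. The variable names are illustrative only, and the second figure assumes the pool has scaled up to its configured maximum.

```r
# Values taken from the cluster.json used in this tutorial
max_tasks_per_node <- 2
dedicated_nodes <- 0
low_priority_min <- 5
low_priority_max <- 10

# Workers = number of nodes x maxTasksPerNode
workers_at_min <- (dedicated_nodes + low_priority_min) * max_tasks_per_node
workers_at_max <- (dedicated_nodes + low_priority_max) * max_tasks_per_node

workers_at_min  # 10
workers_at_max  # 20
```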

Run a parallel simulation

Now that your cluster is created, you are ready to run your foreach loop with your registered parallel backend (Azure Batch pool). As an example, run a Monte Carlo financial simulation, first locally using a standard foreach loop, and then running foreach with Batch. This example is a simplified version of predicting a stock price by simulating a large number of different outcomes after 5 years.

Suppose that each day the price of Contoso Corporation stock is, on average, 1.001 times the previous day's price, with a volatility (standard deviation) of 0.01. Given a starting price of $100, use a Monte Carlo pricing simulation to figure out Contoso's stock price after 5 years.

Parameters for the Monte Carlo simulation:

mean_change = 1.001 
volatility = 0.01 
opening_price = 100 

若要模拟收盘价,请定义以下函数:To simulate closing prices, define the following function:

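getClosingPrice <- function() { 
  days <- 1825 # ~ 5 years 
  # Draw one multiplicative price change per day
  movement <- rnorm(days, mean=mean_change, sd=volatility) 
  # path[1] is the opening price; path[k + 1] is the price after k days
  path <- cumprod(c(opening_price, movement)) 
  closingPrice <- path[days + 1] 
  return(closingPrice) 
} 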

First, run 10,000 simulations locally using a standard foreach loop with the %do% operator:

start_s <- Sys.time() 
# Run 10,000 simulations in series 
closingPrices_s <- foreach(i = 1:10, .combine='c') %do% { 
  replicate(1000, getClosingPrice()) 
} 
end_s <- Sys.time() 

Plot the closing prices in a histogram to show the distribution of outcomes:

hist(closingPrices_s)

Output is similar to the following:

Figure: Distribution of closing prices
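One way to sanity-check the histogram is to compare the simulated mean with the analytic expectation. Because the daily changes are independent, the expected price after 1825 days is the opening price times 1.001 raised to the power 1825, which is roughly 620. The sketch below is self-contained and uses base R only. The seed, the sample size of 2000, and the helper simulate_final_price are arbitrary choices for illustration, not part of the tutorial.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

mean_change <- 1.001
volatility <- 0.01
opening_price <- 100
days <- 1825

# Independent daily changes: expected price = opening price x (mean change)^days
expected_price <- opening_price * mean_change^days

# Small simulation: final price = opening price x product of the daily changes
simulate_final_price <- function() {
  opening_price * prod(rnorm(days, mean = mean_change, sd = volatility))
}
simulated <- replicate(2000, simulate_final_price())

expected_price     # analytic expectation
mean(simulated)    # should land near the analytic value
median(simulated)  # typically below the mean, since the distribution is right-skewed
```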

A local simulation completes in a few seconds or less:

difftime(end_s, start_s) 

Estimated runtime for 10 million outcomes locally, using a linear approximation (1000 times the 10,000-simulation run), is around 30 minutes:

1000 * difftime(end_s, start_s, units = "mins") 

Now run the code using foreach with the %dopar% operator to compare how long it takes to run 10 million simulations in Azure. To parallelize the simulation with Batch, run 100 iterations of 100,000 simulations each:

# Optimize runtime. Chunking allows running multiple iterations on a single R instance.
opt <- list(chunkSize = 10) 
start_p <- Sys.time()  
closingPrices_p <- foreach(i = 1:100, .combine='c', .options.azure = opt) %dopar% { 
  replicate(100000, getClosingPrice()) 
} 
end_p <- Sys.time() 
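To see what chunking means for this loop, the sketch below groups the 100 iterations into chunks of 10 in plain R. It assumes that consecutive iterations are grouped together and that each chunk runs as one Batch task; the package's actual grouping may differ in detail.

```r
iterations <- 1:100
chunk_size <- 10

# Group consecutive iterations into chunks of chunk_size
chunks <- split(iterations, ceiling(seq_along(iterations) / chunk_size))

length(chunks)  # 10 chunks
chunks[[1]]     # iterations 1 through 10
```

Under that assumption, a chunkSize of 10 turns 100 iterations into 10 tasks, which matches the 10 execution workers of the pool at its minimum size.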

The simulation distributes tasks to the nodes in the Batch pool. You can see the activity in the heat map for the pool in the Azure portal. Go to Batch accounts > myBatchAccount. Click Pools > myPoolName.

Figure: Heat map of the pool running parallel R tasks

After a few minutes, the simulation finishes. The package automatically merges the results and pulls them down from the nodes. Then, you are ready to use the results in your R session.

hist(closingPrices_p) 

Output is similar to the following:

Figure: Distribution of closing prices

How long did the parallel simulation take?

difftime(end_p, start_p, units = "mins")  
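To put a number on the comparison, the measured parallel time can be divided into the estimated local time. The sketch below wraps that in a hypothetical helper, speedup_estimate, which is not part of doAzureParallel. The demo timestamps are placeholders for illustration only, not measured results.

```r
# Hypothetical helper: ratio of the estimated local runtime to the parallel runtime.
# scale = 1000 because 10 million outcomes is 1000 times the 10,000 run locally.
speedup_estimate <- function(start_s, end_s, start_p, end_p, scale = 1000) {
  local_min <- scale * as.numeric(difftime(end_s, start_s, units = "mins"))
  parallel_min <- as.numeric(difftime(end_p, start_p, units = "mins"))
  local_min / parallel_min
}

# Placeholder timestamps (NOT measurements): a 2-second local run
# and a 5-minute parallel run
demo_start <- as.POSIXct("2024-01-01 10:00:00", tz = "UTC")
demo_speedup <- speedup_estimate(demo_start, demo_start + 2,
                                 demo_start, demo_start + 5 * 60)
```

In your own session, call speedup_estimate(start_s, end_s, start_p, end_p) with the timestamps recorded earlier.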

You should see that running the simulation on the Batch pool gives you a significant increase in performance over the expected time to run the simulation locally.

Clean up resources

The job is deleted automatically after it completes. When the cluster is no longer needed, call the stopCluster function in the doAzureParallel package to delete it:

stopCluster(cluster)

Next steps

In this tutorial, you learned how to:

  • Install doAzureParallel and configure it to access your Batch and storage accounts
  • Create a Batch pool as a parallel backend for your R session
  • Run a sample parallel simulation on the pool

For more information about doAzureParallel, see the documentation and samples on GitHub.