How to parallelize R code with spark.lapply

Parallelizing R code is difficult, because R code runs on the driver and R data.frames are not distributed. Often there is existing R code that runs locally and that needs to be converted to run on Apache Spark. In other cases, some SparkR functions used for advanced statistical analysis and machine learning techniques may not support distributed computing. In such cases, the SparkR UDF API can be used to distribute the desired workload across a cluster.

Example use case: You want to train multiple machine learning models on the same data, for example for hyperparameter tuning. If the data set fits on each worker, it can be more efficient to use the SparkR UDF API to train several versions of the model at once.

The spark.lapply function lets you perform the same task on multiple workers by running a function over a list of elements; a minimal sketch follows the list below. For each element in the list, spark.lapply will:

  1. Send the function to a worker.
  2. Execute the function.
  3. Return the results of all workers as a list to the driver.
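
As a minimal sketch of this pattern (hypothetical values; it assumes an active SparkR session, which some platforms pre-configure for you), the following runs a simple function over a list of numbers and collects the results on the driver:

library(SparkR)

# Each element of the input is processed in a task on a worker;
# the results come back to the driver as an ordinary R list.
squares <- spark.lapply(1:4, function(x) {
  x^2
})

# squares is list(1, 4, 9, 16)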

In the following example, a support vector machine model is fit on the iris dataset with 3-fold cross-validation while the cost is varied from 0.5 to 1 in increments of 0.1. The output is a list with the summaries of the models for the various cost parameters.

library(SparkR)

spark.lapply(seq(0.5, 1, by = 0.1), function(x) {
  # e1071 must be installed on every worker (see the note below)
  library(e1071)
  # Fit an SVM on the iris data with 3-fold cross validation at this cost
  model <- svm(Species ~ ., iris, cost = x, cross = 3)
  summary(model)
})
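
Because the result is an ordinary R list on the driver, you can post-process it like any other list. As one possible sketch for comparing the runs (this assumes e1071's svm object exposes its overall cross-validation accuracy as tot.accuracy when cross > 0):

costs <- seq(0.5, 1, by = 0.1)

accuracies <- spark.lapply(costs, function(x) {
  library(e1071)
  model <- svm(Species ~ ., iris, cost = x, cross = 3)
  # tot.accuracy is the overall cross-validation accuracy for this cost
  model$tot.accuracy
})

# Label each result with the cost value that produced it
names(accuracies) <- paste0("cost_", costs)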

Note

You must install packages on all workers.
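
If your platform does not offer a cluster-wide library installation mechanism, one possible workaround (a sketch, not the only approach; the CRAN mirror URL is an assumption, so use a repository your cluster can reach) is to install the package from inside the distributed function, so each worker obtains it on first use:

results <- spark.lapply(seq(0.5, 1, by = 0.1), function(x) {
  # Install e1071 on this worker if it is missing
  if (!requireNamespace("e1071", quietly = TRUE)) {
    install.packages("e1071", repos = "https://cloud.r-project.org")
  }
  library(e1071)
  model <- svm(Species ~ ., iris, cost = x, cross = 3)
  summary(model)
})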