SparkR 概述SparkR overview

SparkR 是一个 R 包,它提供用于从 R Apache Spark 的轻型前端。 SparkR 还支持使用 MLlib 的分布式机器学习。SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR also supports distributed machine learning using MLlib.

笔记本中的 SparkRSparkR in notebooks

  • 对于 Spark 2.0 及更高版本,无需将对象显式传递 sqlContext 给每个函数调用。For Spark 2.0 and above, you do not need to explicitly pass a sqlContext object to every function call. 本文使用新的语法。This article uses the new syntax. 有关旧语法示例,请参阅 SparkR 1.6 概述For old syntax examples, see SparkR 1.6 overview.
  • 对于 Spark 2.2 及更高版本,默认情况下,笔记本不再导入 SparkR,因为 SparkR 函数与其他常用包中名称类似的函数冲突。For Spark 2.2 and above, notebooks no longer import SparkR by default because SparkR functions were conflicting with similarly named functions from other popular packages. 若要使用 SparkR,你可以 library(SparkR) 在笔记本中调用。To use SparkR you can call library(SparkR) in your notebooks. 已配置 SparkR 会话,并且所有 SparkR 函数都将使用现有会话与连接的群集通信。The SparkR session is already configured, and all SparkR functions will talk to your attached cluster using the existing session.

Spark 中的 SparkR-提交作业SparkR in spark-submit jobs

你可以在 Azure Databricks 上运行使用 SparkR 的脚本,将其作为 spark 提交作业,并进行少量代码修改。You can run scripts that use SparkR on Azure Databricks as spark-submit jobs, with minor code modifications. 有关示例,请参阅 创建并运行 R 脚本的 spark 提交作业For an example, refer to Create and run a spark-submit job for R scripts.

创建 SparkR DataFramesCreate SparkR DataFrames

可以从本地 R data.frame 、数据源或使用 SPARK SQL 查询创建数据帧。You can create a DataFrame from a local R data.frame, from a data source, or using a Spark SQL query.

从本地 R data.frameFrom a local R data.frame

创建数据帧的最简单方法是将本地 R 转换为 data.frame SparkDataFrameThe simplest way to create a DataFrame is to convert a local R data.frame into a SparkDataFrame. 具体来说,我们可以使用 createDataFrame 并在本地 R 中传递 data.frame 来创建 SparkDataFrameSpecifically we can use createDataFrame and pass in the local R data.frame to create a SparkDataFrame. 与大多数其他 SparkR 函数一样, createDataFrame Spark 2.0 中的语法发生了更改。Like most other SparkR functions, createDataFrame syntax changed in Spark 2.0. 可以在代码段下面中查看此示例。You can see examples of this in the code snippet bellow. 有关更多示例,请参阅 createDataFrameRefer to createDataFrame for more examples.

df <- createDataFrame(faithful)

# Displays the content of the DataFrame to stdout

使用数据源 APIUsing the data source API

从数据源创建数据帧的常规方法为 read.dfThe general method for creating a DataFrame from a data source is read.df. 此方法获取要加载的文件的路径和数据源的类型。This method takes the path for the file to load and the type of data source. SparkR 支持本机读取 CSV、JSON、text 和 Parquet 文件。SparkR supports reading CSV, JSON, text, and Parquet files natively.

diamondsDF <- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", source = "csv", header="true", inferSchema = "true")

SparkR 会自动从 CSV 文件推断架构。SparkR automatically infers the schema from the CSV file.

使用 Spark 包添加数据源连接器Adding a data source connector with Spark Packages

通过 Spark 包,可以找到常用文件格式(如 Avro)的数据源连接器。Through Spark Packages you can find data source connectors for popular file formats such as Avro. 例如,使用 spark-avro 包 加载 avro 文件。As an example, use the spark-avro package to load an Avro file. Spark-avro 包的可用性取决于群集的 映像版本The availability of the spark-avro package depends on your cluster’s image version. 请参阅 Avro fileSee Avro file.

首先获取现有 data.frame ,转换为 Spark 数据帧,并将其另存为 Avro 文件。First take an existing data.frame, convert to a Spark DataFrame, and save it as an Avro file.

irisDF <- createDataFrame(iris)
write.df(irisDF, source = "com.databricks.spark.avro", path = "dbfs:/tmp/iris.avro", mode = "overwrite")

验证 Avro 文件是否已保存:To verify that an Avro file was saved:

%fs ls /tmp/iris

现在再次使用 spark-avro 包读取数据。Now use the spark-avro package again to read back the data.

irisDF2 <- read.df(path = "/tmp/iris.avro", source = "com.databricks.spark.avro")

数据源 API 还可用于将 DataFrames 保存到多种文件格式。The data source API can also be used to save DataFrames into multiple file formats. 例如,可以使用将上一示例中的数据帧保存到 Parquet 文件中 write.dfFor example, you can save the DataFrame from the previous example to a Parquet file using write.df.

write.df(irisDF2, path="dbfs:/tmp/iris.parquet", source="parquet", mode="overwrite")
%fs ls dbfs:/tmp/people.parquet

从 Spark SQL 查询From a Spark SQL query

还可以使用 Spark SQL 查询创建 SparkR DataFrames。You can also create SparkR DataFrames using Spark SQL queries.

# Register earlier df as temp view
createOrReplaceTempView(people, "peopleTemp")
# Create a df consisting of only the 'age' column using a Spark SQL query
age <- sql("SELECT age FROM peopleTemp")

age 为 SparkDataFrame。age is a SparkDataFrame.

数据帧操作DataFrame operations

Spark DataFrames 支持多种函数来处理结构化数据。Spark DataFrames support a number of functions to do structured data processing. 下面是一些基本示例。Here are some basic examples. 可在 API 文档中找到完整列表。A complete list can be found in the API docs.

选择行和列Select rows and columns

# Import SparkR package if this is a new notebook

# Create DataFrame
df <- createDataFrame(faithful)
# Select only the "eruptions" column
head(select(df, df$eruptions))
# You can also pass in column name as strings
head(select(df, "eruptions"))
# Filter the DataFrame to only retain rows with wait times shorter than 50 mins
head(filter(df, df$waiting < 50))

分组和聚合Grouping and aggregation

SparkDataFrames 支持多种常用函数,以便在分组后聚合数据。SparkDataFrames support a number of commonly used functions to aggregate data after grouping. 例如,可以计算每个等待时间在忠实数据集中出现的次数。For example you can count the number of times each waiting time appears in the faithful dataset.

head(count(groupBy(df, df$waiting)))
# You can also sort the output from the aggregation to get the most common waiting times
waiting_counts <- count(groupBy(df, df$waiting))
head(arrange(waiting_counts, desc(waiting_counts$count)))

列操作Column operations

SparkR 提供了许多可直接应用于数据处理和聚合的列的函数。SparkR provides a number of functions that can be directly applied to columns for data processing and aggregation. 下面的示例演示如何使用基本算术函数。The following example shows the use of basic arithmetic functions.

# Convert waiting time from hours to seconds.
# You can assign this to a new column in the same DataFrame
df$waiting_secs <- df$waiting * 60

机器学习 Machine learning

SparkR 公开大多数 MLLib 算法。SparkR exposes most of MLLib algorithms. 在这种情况下,SparkR 使用 MLlib 来定型模型。Under the hood, SparkR uses MLlib to train the model.

下面的示例演示如何使用 SparkR 生成高斯 GLM 模型。The following example shows how to build a gaussian GLM model using SparkR. 若要运行线性回归,请将 "族" 设置为 "gaussian"To run linear regression, set family to "gaussian". 若要运行逻辑回归,请将 "系列" 设置为 "binomial"To run logistic regression, set family to "binomial". 当使用 SparkML GLM SparkR 时,会自动对分类特征进行一次热编码,以便不需要手动执行此操作。When using SparkML GLM SparkR automatically performs one-hot encoding of categorical features so that it does not need to be done manually. 除了 String 和 Double 类型功能以外,还可以适应 MLlib 矢量功能,以便与其他 MLlib 组件兼容。Beyond String and Double type features, it is also possible to fit over MLlib Vector features, for compatibility with other MLlib components.

# Create the DataFrame
df <- createDataFrame(iris)

# Fit a linear model over the dataset.
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")

# Model coefficients are returned in a similar format to R's native glm().

有关教程,请参阅 SPARKR ML 教程For tutorials, see SparkR ML tutorials.