SparkR 1.6 overview

Note

For information about the latest SparkR library, see Azure Databricks for R developers.

SparkR is an R package that provides a lightweight frontend to use Apache Spark from R. Starting with Spark 1.5.1, SparkR provides a distributed DataFrame implementation that supports operations like selection, filtering, and aggregation (similar to R data frames and dplyr) but on large datasets. SparkR also supports distributed machine learning using MLlib.

Creating SparkR DataFrames

Applications can create DataFrames from a local R data frame, from data sources, or using Spark SQL queries.

The simplest way to create a DataFrame is to convert a local R data frame into a SparkR DataFrame. Specifically, we can use createDataFrame and pass in the local R data frame to create a SparkR DataFrame. As an example, the following cell creates a DataFrame using the faithful dataset from R.

df <- createDataFrame(sqlContext, faithful)

# Displays the content of the DataFrame to stdout
head(df)

From data sources using Spark SQL

The general method for creating DataFrames from data sources is read.df. This method takes in the SQLContext, the path of the file to load, and the type of data source. SparkR supports reading JSON and Parquet files natively, and through Spark Packages you can find data source connectors for popular file formats such as CSV and Avro.

%fs rm dbfs:/tmp/people.json
%fs put dbfs:/tmp/people.json
'{"age": 10, "name": "John"}
{"age": 20, "name": "Jane"}
{"age": 30, "name": "Andy"}'
people <- read.df(sqlContext, "dbfs:/tmp/people.json", source="json")

SparkR automatically infers the schema from the JSON file.

printSchema(people)
display(people)

Using data source connectors with Spark Packages

As an example, we will use the Spark CSV package to load a CSV file. You can find a list of Spark Packages by Databricks here.

diamonds <- read.df(sqlContext, "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
                    source = "com.databricks.spark.csv", header="true", inferSchema = "true")
head(diamonds)

The data sources API can also be used to save DataFrames out into multiple file formats. For example, we can save the DataFrame from the previous example to a Parquet file using write.df.

%fs rm -r dbfs:/tmp/people.parquet
write.df(people, path="dbfs:/tmp/people.parquet", source="parquet", mode="overwrite")
%fs ls dbfs:/tmp/people.parquet
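Since SparkR reads Parquet natively, the file written above can be loaded back with read.df. A minimal sketch, assuming the dbfs:/tmp/people.parquet path from the previous cell:

# Read the Parquet output back into a SparkR DataFrame
peopleParquet <- read.df(sqlContext, "dbfs:/tmp/people.parquet", source = "parquet")
head(peopleParquet)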

From Spark SQL queries

You can also create SparkR DataFrames using Spark SQL queries.

# Register the earlier df as a temp table (SparkR 1.6 API)
registerTempTable(people, "peopleTemp")
# Create a df consisting of only the 'age' column using a Spark SQL query
age <- sql(sqlContext, "SELECT age FROM peopleTemp")
head(age)
# Resulting df is a SparkR df
str(age)
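Any SQL query over the registered table works the same way. As a small sketch, the query below adds a WHERE clause to filter rows (the peopleTemp table registered above is assumed):

# Filter rows with a SQL WHERE clause; the result is again a SparkR DataFrame
adults <- sql(sqlContext, "SELECT name, age FROM peopleTemp WHERE age >= 18")
head(adults)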

DataFrame operations

SparkR DataFrames support a number of functions for structured data processing. Here we include some basic examples; a complete list can be found in the API docs.

Selecting rows and columns

# Create DataFrame
df <- createDataFrame(sqlContext, faithful)
df
# Select only the "eruptions" column
head(select(df, df$eruptions))
# You can also pass in column names as strings
head(select(df, "eruptions"))
# Filter the DataFrame to only retain rows with wait times shorter than 50 mins
head(filter(df, df$waiting < 50))

Grouping and aggregation

SparkR DataFrames support a number of commonly used functions to aggregate data after grouping. For example, we can count the number of times each waiting time appears in the faithful dataset.

head(count(groupBy(df, df$waiting)))
# We can also sort the output from the aggregation to get the most common waiting times
waiting_counts <- count(groupBy(df, df$waiting))
head(arrange(waiting_counts, desc(waiting_counts$count)))
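Other aggregations can be expressed with agg and column functions such as avg. The following sketch computes the average eruption duration per waiting time; the named-argument form of agg used here is an assumption:

# Average eruption duration for each waiting time (named aggregate column via agg)
avg_eruptions <- agg(groupBy(df, df$waiting), avg_eruptions = avg(df$eruptions))
head(avg_eruptions)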

Column operations

SparkR provides a number of functions that can be directly applied to columns for data processing and aggregation. The following example shows the use of basic arithmetic functions.

# Convert waiting time from minutes to seconds.
# You can assign this to a new column in the same DataFrame
df$waiting_secs <- df$waiting * 60
head(df)
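Beyond arithmetic, built-in column functions can be applied in the same way. A small sketch using floor (assumed to be available as a SparkR column function) to truncate eruption durations to whole minutes:

# Truncate eruption durations to whole minutes with the floor column function
df$eruptions_whole <- floor(df$eruptions)
head(select(df, "eruptions", "eruptions_whole"))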

Machine learning

As of Spark 1.5, SparkR allows the fitting of generalized linear models over SparkR DataFrames using the glm() function. Under the hood, SparkR uses MLlib to train a model of the specified family. We support a subset of the available R formula operators for model fitting, including '~', '.', '+', and '-'.

Under the hood, SparkR automatically performs one-hot encoding of categorical features so that it does not need to be done manually. Beyond String and Double type features, it is also possible to fit over MLlib Vector features, for compatibility with other MLlib components.

The following example shows how to build a Gaussian GLM model using SparkR. To run linear regression, set family to "gaussian". To run logistic regression, set family to "binomial".

# Create the DataFrame
df <- createDataFrame(sqlContext, iris)

# Fit a linear model over the dataset.
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")

# Model coefficients are returned in a similar format to R's native glm().
summary(model)
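The fitted model can also be used to score a SparkR DataFrame with predict. A minimal sketch that predicts over the training data itself; the name of the output prediction column is an assumption:

# Score the iris DataFrame with the fitted model; predictions land in a "prediction" column
predictions <- predict(model, newData = df)
head(select(predictions, "Sepal_Length", "prediction"))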

Converting local R data frames to SparkR DataFrames

You can use createDataFrame to convert local R data frames to SparkR DataFrames.

# Define an example local R data frame, then convert it to a SparkR DataFrame
localDF <- data.frame(name = c("John", "Smith", "Sarah"), age = c(19, 23, 18))
convertedSparkDF <- createDataFrame(sqlContext, localDF)
str(convertedSparkDF)
# Another example: Create SparkR DataFrame with a local R data frame
anotherSparkDF <- createDataFrame(sqlContext, data.frame(surname = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"),
                                                         nationality = c("US", "Australia", "US", "UK", "Australia"),
                                                         deceased = c("yes", rep("no", 4))))
count(anotherSparkDF)