Get started with Apache Spark

This tutorial module helps you to get started quickly with using Apache Spark. We discuss key concepts briefly, so you can get right down to writing your first Apache Spark application. In the other tutorial modules in this guide, you will have the opportunity to go deeper into the topic of your choice.

In this tutorial module, you will learn about:

  • Key Apache Spark interfaces
  • How to write your first Apache Spark application
  • Azure Databricks datasets you can use for learning and testing

We also provide sample notebooks that you can import to access and run all of the code examples included in the module.

Requirements

Complete Quickstart: Run a Spark job on Azure Databricks using the Azure portal.

Spark interfaces

There are three key Apache Spark interfaces that you should know about: Resilient Distributed Dataset, DataFrame, and Dataset.

  • Resilient Distributed Dataset: The first Apache Spark abstraction was the Resilient Distributed Dataset (RDD). It is an interface to a sequence of data objects, consisting of one or more types, that are located across a collection of machines (a cluster). RDDs can be created in a variety of ways and are the “lowest level” API available. While this is the original data structure for Apache Spark, you should focus on the DataFrame API, which is a superset of the RDD functionality (both are compared in the sketch after this list). The RDD API is available in the Java, Python, and Scala languages.
  • DataFrame: These are similar in concept to the DataFrame you may be familiar with in the pandas Python library and the R language. The DataFrame API is available in the Java, Python, R, and Scala languages.
  • Dataset: A combination of DataFrame and RDD. It provides the typed interface that is available in RDDs while providing the convenience of the DataFrame. The Dataset API is available in the Java and Scala languages.
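
The following is a minimal Python sketch contrasting the RDD and DataFrame interfaces (the Dataset API is Java/Scala only, so it is not shown). It assumes an Azure Databricks notebook, where spark (a SparkSession) is predefined; the sample data and column names are illustrative.

# RDD: the "lowest level" API -- an untyped sequence of objects distributed across the cluster.
rdd = spark.sparkContext.parallelize([("spark", 1), ("rdd", 2), ("dataframe", 3)])
print(rdd.map(lambda pair: pair[1]).sum())  # 6

# DataFrame: named columns plus built-in query optimization; prefer this API.
df = spark.createDataFrame(rdd, schema=["word", "count"])
df.filter(df["count"] > 1).show()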

In many scenarios, especially with the performance optimizations embedded in DataFrames and Datasets, it will not be necessary to work with RDDs. But it is important to understand the RDD abstraction because:

  • The RDD is the underlying infrastructure that allows Spark to run so fast and provide data lineage.
  • If you are diving into more advanced components of Spark, it may be necessary to use RDDs.
  • The visualizations within the Spark UI reference RDDs.

When you develop Spark applications, you typically work with DataFrames and Datasets; see the DataFrames tutorial and the Datasets tutorial.

Write your first Apache Spark application

To write your first Apache Spark application, you add code to the cells of an Azure Databricks notebook. This example uses Python. For more information, you can also reference the Apache Spark Quick Start Guide.

This first command lists the contents of a folder in the Databricks File System:

# Take a look at the file system
display(dbutils.fs.ls("/databricks-datasets/samples/docs/"))

Folder contents

The next command uses spark, the SparkSession available in every notebook, to read the README.md text file and create a DataFrame named textFile:

textFile = spark.read.text("/databricks-datasets/samples/docs/README.md")
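
If you want to verify what was read, one way is to peek at the first few rows with show. Note that show is itself an action (a concept covered below), so it triggers an actual read of the file:

# Peek at the first three lines of the DataFrame without truncating them.
textFile.show(3, truncate=False)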

To count the lines of the text file, apply the count action to the DataFrame:

textFile.count()

Text file line count

One thing you may notice is that the second command, reading the text file, does not generate any output, while the third command, performing the count, does. The reason for this is that reading the file is a transformation while counting is an action. Transformations are lazy and run only when an action is run. This allows Spark to optimize for performance (for example, run a filter prior to a join), instead of running commands serially. For a complete list of transformations and actions, refer to the Apache Spark Programming Guide: Transformations and Actions.
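
To see this distinction in the current notebook, you can chain a filter transformation ahead of the count action. The predicate below (lines containing the word "Spark") is illustrative; spark.read.text always produces a single column named value:

from pyspark.sql.functions import col

# `filter` is a transformation: this builds a query plan but reads nothing yet.
spark_lines = textFile.filter(col("value").contains("Spark"))

# `count` is an action: only now does Spark read the file and apply the filter.
spark_lines.count()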

Azure Databricks datasets

Azure Databricks includes a variety of datasets within the workspace that you can use to learn Spark or test out algorithms. You’ll see these throughout the getting started guide. The datasets are available in the /databricks-datasets folder.
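
As a quick check, you can browse the top-level dataset folders the same way as in the first command above; the exact contents vary by workspace:

# List the available sample datasets in this workspace.
display(dbutils.fs.ls("/databricks-datasets"))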

Notebooks

To access these code examples and more, import one of the following notebooks.

Apache Spark Quick Start Python notebook

Get notebook

Apache Spark Quick Start Scala notebook

Get notebook