Get started with Apache Spark

This tutorial module helps you get started quickly with Apache Spark. We discuss key concepts briefly, so you can get right down to writing your first Apache Spark application. In the other tutorial modules in this guide, you will have the opportunity to go deeper into the topic of your choice.

In this tutorial module, you will learn:

  • the key Apache Spark interfaces
  • how to write your first Apache Spark application
  • which datasets are included in the Azure Databricks workspace

We also provide sample notebooks that you can import to access and run all of the code examples included in the module.

Requirements

Complete Quickstart: Run a Spark job on Azure Databricks using the Azure portal.

Spark interfaces

There are three key Apache Spark interfaces that you should know about: Resilient Distributed Dataset, DataFrame, and Dataset.

  • Resilient Distributed Dataset: The first Apache Spark abstraction was the Resilient Distributed Dataset (RDD). It is an interface to a sequence of data objects that consist of one or more types and are located across a collection of machines (a cluster). RDDs can be created in a variety of ways and are the "lowest level" API available. While this is the original data structure for Apache Spark, you should focus on the DataFrame API, which is a superset of the RDD functionality. The RDD API is available in the Java, Python, and Scala languages.
  • DataFrame: These are similar in concept to the DataFrame you may be familiar with in the pandas Python library and the R language. The DataFrame API is available in the Java, Python, R, and Scala languages.
  • Dataset: A combination of DataFrame and RDD. It provides the typed interface that is available in RDDs while providing the convenience of the DataFrame. The Dataset API is available in the Java and Scala languages. A brief Python sketch contrasting the RDD and DataFrame APIs follows this list.
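
The following is a minimal, illustrative Python sketch (not part of the original walkthrough) that contrasts the two APIs. It assumes a SparkSession named spark, as is available in every Azure Databricks notebook, and uses made-up sample records:

# Hypothetical example: the same records as a low-level RDD and as a DataFrame
rdd = spark.sparkContext.parallelize([("Alice", 34), ("Bob", 29)])  # RDD of tuples, no schema
df = spark.createDataFrame(rdd, ["name", "age"])                    # DataFrame with named columns
df.show()  # the DataFrame carries a schema, so Spark can optimize queries and render the result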

In many scenarios, especially with the performance optimizations embedded in DataFrames and Datasets, it will not be necessary to work with RDDs. But it is important to understand the RDD abstraction because:

  • The RDD is the underlying infrastructure that allows Spark to run so fast and provides data lineage.
  • If you are diving into more advanced components of Spark, it may be necessary to use RDDs.
  • The visualizations within the Spark UI reference RDDs.

When you develop Spark applications, you typically use DataFrames (see the DataFrames tutorial) and Datasets (see the Datasets tutorial).

Write your first Apache Spark application

To write your first Apache Spark application, add code to the cells of an Azure Databricks notebook. This example uses Python. For more information, you can also reference the Apache Spark Quick Start Guide.

This first command lists the contents of a folder in the Databricks File System:

# Take a look at the file system
display(dbutils.fs.ls("/databricks-datasets/samples/docs/"))

Folder contents

The next command uses spark, the SparkSession available in every notebook, to read the README.md text file and create a DataFrame named textFile:

textFile = spark.read.text("/databricks-datasets/samples/docs/README.md")

To count the lines of the text file, apply the count action to the DataFrame:

textFile.count()

Text file line count

One thing you may notice is that the second command, reading the text file, does not generate any output, while the third command, performing the count, does. The reason is that reading the file is a transformation, while counting is an action. Transformations are lazy and run only when an action is run. This allows Spark to optimize for performance (for example, run a filter before a join) instead of running commands serially. For a complete list of transformations and actions, refer to the Apache Spark Programming Guide: Transformations and Actions.
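
To see this behavior for yourself, you can chain a transformation and an action. The sketch below is an illustrative addition that reuses the textFile DataFrame from above to count only the lines that mention Spark:

# filter is a transformation: nothing executes when this line runs
linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
# count is an action: it triggers the filter and returns the number of matching lines
linesWithSpark.count()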

Azure Databricks datasets

Azure Databricks includes a variety of datasets within the workspace that you can use to learn Spark or test out algorithms. You'll see these throughout the getting started guide. The datasets are available in the /databricks-datasets folder.
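
As a quick illustration (not one of the original steps), you can browse that folder from a notebook cell the same way as in the first command above:

# List the datasets bundled with the workspace
display(dbutils.fs.ls("/databricks-datasets"))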

Notebooks

To access these code examples and more, import one of the following notebooks.

Apache Spark Quick Start Python notebook

Get notebook

Apache Spark Quick Start Scala notebook

Get notebook