
Quickstart: Create a serverless Apache Spark pool in Azure Synapse Analytics using web tools

In this quickstart, you learn how to create a serverless Apache Spark pool in Azure Synapse using web tools. You then learn to connect to the Apache Spark pool and run Spark SQL queries against files and tables. Apache Spark enables fast data analytics and cluster computing using in-memory processing. For information on Spark in Azure Synapse, see Overview: Apache Spark on Azure Synapse.

Important

Billing for Spark instances is prorated per minute, whether you are using them or not. Be sure to shut down your Spark instance after you have finished using it, or set a short timeout. For more information, see the Clean up resources section of this article.

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites

Sign in to the Azure portal

Sign in to the Azure portal.


Create a notebook

A notebook is an interactive environment that supports various programming languages. The notebook allows you to interact with your data, combine code with Markdown and text, and perform simple visualizations.

  1. From the Azure portal view for the Azure Synapse workspace you want to use, select Launch Synapse Studio.

  2. Once Synapse Studio has launched, select Develop. Then, select the "+" icon to add a new resource.

  3. From there, select Notebook. A new notebook is created and opened with an automatically generated name.

    New notebook

  4. In the Properties window, provide a name for the notebook.

  5. On the toolbar, click Publish.

  6. If there is only one Apache Spark pool in your workspace, then it's selected by default. If none is selected, use the drop-down to select the correct Apache Spark pool.

  7. Click Add code. The default language is PySpark. You are going to use a mix of PySpark and Spark SQL, so the default choice is fine. Other supported languages are Scala and .NET for Spark.

  8. Next, you create a simple Spark DataFrame object to manipulate. In this case, you create it from code. There are three rows and three columns:

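    # Three rows of sample data: (state, age, salary).
    new_rows = [('CA', 22, 45000), ('WA', 35, 65000), ('WA', 50, 85000)]
    # Build a DataFrame with named columns using the notebook's preset spark session.
    demo_df = spark.createDataFrame(new_rows, ['state', 'age', 'salary'])
    demo_df.show()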
    
  9. Now run the cell using one of the following methods:

    • Press SHIFT + ENTER.

    • Select the blue play icon to the left of the cell.

    • Select the Run all button on the toolbar.

      Creating the DataFrame object

  10. If the Apache Spark pool instance isn't already running, it is automatically started. You can see the Apache Spark pool instance status below the cell you are running and also on the status panel at the bottom of the notebook. Depending on the size of the pool, starting should take 2-5 minutes. Once the code has finished running, information below the cell shows how long the code took to run and details of its execution. In the output cell, you see the output.

    Output from executing a cell

  11. The data now exists in a DataFrame. From there, you can use the data in many different ways. You are going to need it in different formats for the rest of this quickstart.

  12. Enter the code below in another cell and run it. This creates a Spark table (registered as a temporary view), a CSV file, and a Parquet file, all with copies of the data:

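    # Register the DataFrame as a temporary view so Spark SQL can query it by name.
    demo_df.createOrReplaceTempView('demo_df')
    # Relative path: written to the workspace's default file system.
    demo_df.write.csv('demo_df', mode='overwrite')
    # Explicit abfss URI: written to the file system and storage account you name.
    demo_df.write.parquet('abfss://<<TheNameOfAStorageAccountFileSystem>>@<<TheNameOfAStorageAccount>>.dfs.core.windows.net/demodata/demo_df', mode='overwrite')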
    

    If you use the storage explorer, you can see the effect of the two different ways of writing a file used above. When no file system is specified, the default is used, in this case default>user>trusted-service-user>demo_df. When a file system is specified, the data is saved to that location.

    Notice that for both the "csv" and "parquet" formats, the write operation creates a directory containing many partitioned files.

    Storage explorer view of the output

    Screenshot that highlights default > demodata > demo_df path.
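The Parquet write above targets an explicit `abfss://` URI, while the CSV write uses a relative path that resolves against the default file system. As a minimal sketch of how the explicit URI is put together, the helper below assembles it from its parts. The function name `abfss_uri` and the sample values `myfilesystem` and `mystorageaccount` are hypothetical placeholders for illustration, not names from this quickstart.

```python
def abfss_uri(file_system: str, storage_account: str, path: str) -> str:
    """Build a URI in the same shape as the Parquet write target above.

    All three arguments are supplied by the caller; nothing here contacts Azure.
    """
    # Strip any leading slash so the result has exactly one separator.
    return (
        f"abfss://{file_system}@{storage_account}"
        f".dfs.core.windows.net/{path.lstrip('/')}"
    )


# Hypothetical names for illustration only; substitute your own.
target = abfss_uri("myfilesystem", "mystorageaccount", "demodata/demo_df")
print(target)
```

In a notebook, the resulting string could be passed as the first argument to `demo_df.write.parquet(target, mode='overwrite')`, the same call used in step 12.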

Run Spark SQL statements

Structured Query Language (SQL) is the most common and widely used language for querying and defining data. Spark SQL functions as an extension to Apache Spark for processing structured data, using the familiar SQL syntax.

  1. Paste the following code in an empty cell, and then run the code. The command lists the tables on the pool.

    %%sql
    SHOW TABLES
    

    When you use a notebook with your Azure Synapse Apache Spark pool, you get a preset sqlContext that you can use to run queries using Spark SQL. %%sql tells the notebook to use the preset sqlContext to run the query. The SHOW TABLES query returns the tables and temporary views in the current database, which should include the demo_df view created earlier.

  2. Run another query to see the data in demo_df.

    %%sql
    SELECT * FROM demo_df
    

    The code produces two output cells: one contains the data results, and the other shows the job view.

    By default, the results view shows a grid. However, a view switcher underneath the grid lets you switch between grid and graph views.

    Query output in Azure Synapse Spark

  3. In the View switcher, select Chart.

  4. Select the View options icon from the far right-hand side.

  5. In the Chart type field, select "bar chart".

  6. In the X-axis column field, select "state".

  7. In the Y-axis column field, select "salary".

  8. In the Aggregation field, select "AVG".

  9. Select Apply.

    Chart output in Azure Synapse Spark

  10. You can get the same experience of running SQL without having to switch languages. To do this, replace the SQL cell above with the following PySpark cell. The output experience is the same because the display command is used:

    display(spark.sql('SELECT * FROM demo_df'))
    
  11. Each of the cells that you executed previously offers the option to go to History Server and Monitoring. Clicking the links takes you to different parts of the user experience.
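The chart configured in steps 5 through 8 groups the rows by state and averages the salary within each group. The plain-Python sketch below reproduces that calculation on the same three rows so the bar heights can be checked by hand. It does not use Spark, and the function name `average_salary_by_state` is a hypothetical helper introduced only for this illustration.

```python
from collections import defaultdict

# Same three rows as demo_df: (state, age, salary).
rows = [("CA", 22, 45000), ("WA", 35, 65000), ("WA", 50, 85000)]


def average_salary_by_state(rows):
    """Group rows by state and return the mean salary for each state."""
    totals = defaultdict(lambda: [0, 0])  # state -> [salary sum, row count]
    for state, _age, salary in rows:
        totals[state][0] += salary
        totals[state][1] += 1
    return {state: total / count for state, (total, count) in totals.items()}


averages = average_salary_by_state(rows)
print(averages)  # {'CA': 45000.0, 'WA': 75000.0}
```

In Spark SQL, a comparable result could be requested with `SELECT state, AVG(salary) FROM demo_df GROUP BY state`.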

Note

Some of the Apache Spark official documentation relies on using the Spark console, which is not available on Synapse Spark. Use the notebook or IntelliJ experiences instead.

Clean up resources

Azure Synapse saves your data in Azure Data Lake Storage. You can safely let a Spark instance shut down when it is not in use. You are charged for a serverless Apache Spark pool as long as it is running, even when it is not in use.

Since the charges for the pool are many times more than the charges for storage, it makes economic sense to let Spark instances shut down when they are not in use.

To ensure the Spark instance is shut down, end any connected sessions (notebooks). The pool shuts down when the idle time specified in the Apache Spark pool is reached. You can also select End session from the status bar at the bottom of the notebook.
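To make the effect of the idle timeout concrete, the sketch below estimates the charge for one session under per-minute proration. The hourly rate, the working time, and the two timeout values are hypothetical numbers chosen for easy arithmetic, and the linear per-minute formula is an assumption; consult the Azure pricing pages for actual rates.

```python
def prorated_cost(minutes_running: int, hourly_rate: float) -> float:
    """Charge for an instance billed linearly per minute (assumed model)."""
    return minutes_running * hourly_rate / 60


HYPOTHETICAL_HOURLY_RATE = 6.0  # currency units per hour, not a real price
WORK_MINUTES = 20               # time spent actually running cells

# The instance keeps running, and billing, until the idle timeout elapses.
short_timeout_cost = prorated_cost(WORK_MINUTES + 15, HYPOTHETICAL_HOURLY_RATE)
long_timeout_cost = prorated_cost(WORK_MINUTES + 120, HYPOTHETICAL_HOURLY_RATE)

print(short_timeout_cost)  # 3.5
print(long_timeout_cost)   # 14.0
```

Under these assumed numbers, the same 20 minutes of work costs four times as much with a 120-minute timeout as with a 15-minute one, which is why ending the session or setting a short timeout matters.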

Next steps

In this quickstart, you learned how to create a serverless Apache Spark pool and run a basic Spark SQL query.