Submit Spark jobs on SQL Server big data cluster in Visual Studio Code

Learn how to use Spark & Hive Tools for Visual Studio Code to create and submit PySpark scripts for Apache Spark. First, we'll describe how to install Spark & Hive Tools in Visual Studio Code, and then we'll walk through how to submit jobs to Spark.

Spark & Hive Tools can be installed on platforms that are supported by Visual Studio Code, including Windows, Linux, and macOS. Below you'll find the prerequisites for different platforms.

Prerequisites

The following items are required for completing the steps in this article:

Install Spark & Hive Tools

After you have completed the prerequisites, you can install Spark & Hive Tools for Visual Studio Code. Complete the following steps to install Spark & Hive Tools:

  1. Open Visual Studio Code.

  2. From the menu bar, navigate to View > Extensions.

  3. In the search box, enter Spark & Hive.

  4. Select Spark & Hive Tools from the search results, and then select Install.

    Install extension

  5. Reload when needed.

Open work folder

Complete the following steps to open a work folder, and create a file in Visual Studio Code:

  1. From the menu bar, navigate to File > Open Folder... > C:\SQLBDC\SQLBDCexample, and then select the Select Folder button. The folder appears in the Explorer view on the left.

  2. In the Explorer view, select the folder SQLBDCexample, and then select the New File icon next to the work folder.

    New file

  3. Name the new file with the .py (Spark script) file extension. This example uses HelloWorld.py.

  4. Copy and paste the following code into the script file:

     from operator import add
     from pyspark.sql import SparkSession, Row

     # Create (or reuse) a Spark session for this application.
     spark = SparkSession\
         .builder\
         .appName("PythonWordCount")\
         .getOrCreate()

     # Build a small DataFrame of sample text, then take the first column as an RDD of lines.
     data = [Row(col1='pyspark and spark', col2=1), Row(col1='pyspark', col2=2), Row(col1='spark vs hadoop', col2=2), Row(col1='spark', col2=2), Row(col1='hadoop', col2=2)]
     df = spark.createDataFrame(data)
     lines = df.rdd.map(lambda r: r[0])

     # Classic word count: split each line into words, map each word to 1, and sum the counts per word.
     counters = lines.flatMap(lambda x: x.split(' ')) \
         .map(lambda x: (x, 1)) \
         .reduceByKey(add)

     # Collect the (word, count) pairs to the driver and print them in descending order of count.
     output = counters.collect()
     sortedCollection = sorted(output, key=lambda r: r[1], reverse=True)

     for (word, count) in sortedCollection:
         print("%s: %i" % (word, count))
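
     When you run this script on the cluster, it prints each word with its count in descending order. With the sample data above, the output should look similar to the following (the order of words with equal counts may vary):

      spark: 3
      pyspark: 2
      hadoop: 2
      and: 1
      vs: 1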
    

Link a cluster

Before you can submit scripts to your clusters from Visual Studio Code, you need to link a SQL Server big data cluster.

  1. From the menu bar, navigate to View > Command Palette..., and enter Spark / Hive: Link a Cluster.

    Link a cluster command

  2. Select the linked cluster type SQL Server Big Data.

  3. Enter the SQL Server big data cluster endpoint.

  4. Enter the SQL Server big data cluster user name.

  5. Enter the password for the admin user.

  6. Set the display name of the cluster (optional).

  7. List the clusters, and review the OUTPUT view for verification.

List clusters

  1. From the menu bar, navigate to View > Command Palette..., and enter Spark / Hive: List Cluster.

  2. Review the OUTPUT view. The view will show your linked cluster(s).

    Set a default cluster configuration

Set default cluster

  1. Reopen the folder SQLBDCexample created earlier if it is closed.

  2. Select the file HelloWorld.py created earlier; it opens in the script editor.

  3. Link a cluster if you haven't yet done so.

  4. Right-click the script editor, and then select Spark / Hive: Set Default Cluster.

  5. Select a cluster as the default cluster for the current script file. The tools automatically update the configuration file .VSCode\settings.json.

    Set a default cluster configuration

Submit interactive PySpark queries

You can submit interactive PySpark queries by following the steps below:

  1. Reopen the folder SQLBDCexample created earlier if it is closed.

  2. Select the file HelloWorld.py created earlier; it opens in the script editor.

  3. Link a cluster if you haven't yet done so.

  4. Select all the code, right-click the script editor, and select Spark: PySpark Interactive to submit the query, or use the shortcut Ctrl + Alt + I.

    PySpark Interactive context menu

  5. Select the cluster if you haven't specified a default cluster. After a few moments, the Python Interactive results appear in a new tab. The tools also allow you to submit a block of code instead of the whole script file by using the context menu (see the example after these steps).

    PySpark Interactive Python Interactive window

  6. Enter %%info, and then press Shift + Enter to view the job information (optional).

    View job information

    Note

    When Python Extension Enabled is cleared in the settings (it is selected by default), the submitted pyspark interaction results use the old window.

    PySpark Interactive Python extension disabled
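
As a minimal sketch of submitting only a selection, you might highlight and run a short follow-up snippet such as the one below. It assumes the lines of HelloWorld.py that define counters have already been submitted in the same interactive session; takeOrdered is a standard RDD method.

     # Assumes `counters` (the RDD of (word, count) pairs from HelloWorld.py) already exists
     # in the current PySpark interactive session.
     top_words = counters.takeOrdered(3, key=lambda kv: -kv[1])  # three most frequent words
     print(top_words)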

Submit PySpark batch job

  1. Reopen the folder SQLBDCexample created earlier if it is closed.

  2. Select the file HelloWorld.py created earlier; it opens in the script editor.

  3. Link a cluster if you haven't yet done so.

  4. Right-click the script editor, and then select Spark: PySpark Batch, or use the shortcut Ctrl + Alt + H.

  5. Select the cluster if you haven't specified a default cluster. After you submit a Python job, submission logs appear in the OUTPUT window in Visual Studio Code. The Spark UI URL and Yarn UI URL are shown as well. You can open the URLs in a web browser to track the job status.

    Submit Python job result

Apache Livy configuration

Apache Livy configuration is supported; it can be set in .VSCode\settings.json in the workspace folder. Currently, Livy configuration only supports Python scripts. For more details, see the Livy README.

How to trigger Livy configuration

Method 1

  1. From the menu bar, navigate to File > Preferences > Settings.
  2. In the Search settings text box, enter HDInsight Job Submission: Livy Conf.
  3. Select Edit in settings.json for the relevant search result.

Method 2

Submit a file, and notice that the .vscode folder is added automatically to the work folder. You can find the Livy configuration by selecting .vscode\settings.json.

  • The project settings:

    Livy configuration

Note

For the settings driverMemory and executorMemory, set the value with a unit, for example 1g or 1024m.
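
As a rough, hypothetical sketch of what such a configuration fragment might contain, the snippet below prints a JSON object whose fields mirror the Livy request-body settings listed in the next section. The exact key under which the extension stores this fragment in .vscode\settings.json can vary by version, so treat the names and values here as placeholders.

     import json

     # Hypothetical Livy configuration fragment; field names follow the Livy request body,
     # and memory values include a unit as described in the note above.
     livy_conf = {
         "driverMemory": "1g",
         "executorMemory": "1024m",
         "numExecutors": 2,
         "conf": {"spark.sql.shuffle.partitions": "8"}
     }
     print(json.dumps(livy_conf, indent=2))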

Supported Livy configurations

POST /batches

Request body

name           | description                                    | type
---------------|------------------------------------------------|----------------
file           | File containing the application to execute     | path (required)
proxyUser      | User to impersonate when running the job       | string
className      | Application Java/Spark main class              | string
args           | Command-line arguments for the application     | list of strings
jars           | Jars to be used in this session                | list of strings
pyFiles        | Python files to be used in this session        | list of strings
files          | Files to be used in this session               | list of strings
driverMemory   | Amount of memory to use for the driver process | string
driverCores    | Number of cores to use for the driver process  | int
executorMemory | Amount of memory to use per executor process   | string
executorCores  | Number of cores to use for each executor       | int
numExecutors   | Number of executors to launch for this session | int
archives       | Archives to be used in this session            | list of strings
queue          | The name of the YARN queue to which submitted  | string
name           | The name of this session                       | string
conf           | Spark configuration properties                 | Map of key=val

Response body

The created batch object.

name    | description                        | type
--------|------------------------------------|----------------
id      | The session ID                     | int
appId   | The application ID of this session | string
appInfo | The detailed application info      | Map of key=val
log     | The log lines                      | list of strings
state   | The batch state                    | string
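
To make the request and response shapes concrete, here is a hedged sketch of submitting a batch directly to a Livy endpoint with Python's requests library and reading the created batch object. The endpoint URL, script path, and credentials are placeholders you would replace with your own values; in a SQL Server big data cluster, Livy is typically reached through the cluster gateway with basic authentication.

     import requests

     # Placeholder values -- substitute your own gateway endpoint, credentials, and script path.
     livy_url = "https://<your-gateway-endpoint>/gateway/default/livy/v1/batches"
     payload = {
         "file": "/user/<user>/HelloWorld.py",  # path to the script to execute (required)
         "driverMemory": "1g",
         "executorMemory": "2g",
         "numExecutors": 2,
         "name": "python-word-count"
     }

     resp = requests.post(livy_url, json=payload, auth=("<user>", "<password>"), verify=False)
     batch = resp.json()
     # Fields of the created batch object described in the table above.
     print(batch["id"], batch.get("appId"), batch["state"])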

Note

The assigned Livy configuration is displayed in the output pane when you submit a script.

Additional features

Spark & Hive for Visual Studio Code supports the following features:

  • IntelliSense autocomplete. Suggestions pop up for keywords, methods, variables, and so on. Different icons represent different types of objects.

    Spark & Hive Tools for Visual Studio Code IntelliSense object types

  • IntelliSense error marker. The language service underlines editing errors in the Hive script.

  • Syntax highlights. The language service uses different colors to differentiate variables, keywords, data types, functions, and so on.

    Spark & Hive Tools for Visual Studio Code syntax highlights

Unlink cluster

  1. From the menu bar, navigate to View > Command Palette..., and then enter Spark / Hive: Unlink a Cluster.

  2. Select a cluster to unlink.

  3. Review the OUTPUT view for verification.

Next steps

For more information on SQL Server big data clusters and related scenarios, see SQL Server Big Data Clusters.