Submit Spark jobs on SQL Server big data cluster in Visual Studio Code
Learn how to use Spark & Hive Tools for Visual Studio Code to create and submit PySpark scripts for Apache Spark. First, we'll describe how to install Spark & Hive Tools in Visual Studio Code, and then we'll walk through how to submit jobs to Spark.
Spark & Hive Tools can be installed on platforms that are supported by Visual Studio Code, including Windows, Linux, and macOS. Below you'll find the prerequisites for different platforms.
Prerequisites
The following items are required for completing the steps in this article:
- A SQL Server big data cluster. See SQL Server Big Data Clusters.
- Visual Studio Code.
- Mono. Mono is only required for Linux and macOS.
- Set up the PySpark interactive environment for Visual Studio Code.
- A local directory named SQLBDCexample. This article uses C:\SQLBDC\SQLBDCexample.
Install Spark & Hive Tools
After you have completed the prerequisites, you can install Spark & Hive Tools for Visual Studio Code. Complete the following steps to install Spark & Hive Tools:
Open Visual Studio Code.
From the menu bar, navigate to View > Extensions.
In the search box, enter Spark & Hive.
Select Spark & Hive Tools from the search results, and then select Install.
Reload when needed.
Open work folder
Complete the following steps to open a work folder and create a file in Visual Studio Code:
From the menu bar, navigate to File > Open Folder... > C:\SQLBDC\SQLBDCexample, and then select the Select Folder button. The folder appears in the Explorer view on the left.
From the Explorer view, select the SQLBDCexample folder, and then select the New File icon next to the work folder.
Name the new file with the .py (Spark script) file extension. This example uses HelloWorld.py. Copy and paste the following code into the script file:
```python
import sys
from operator import add
from pyspark.sql import SparkSession, Row

spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

# Sample data: each row holds a short phrase in col1.
data = [Row(col1='pyspark and spark', col2=1),
        Row(col1='pyspark', col2=2),
        Row(col1='spark vs hadoop', col2=2),
        Row(col1='spark', col2=2),
        Row(col1='hadoop', col2=2)]
df = spark.createDataFrame(data)
lines = df.rdd.map(lambda r: r[0])

# Split each phrase into words and count how often each word occurs.
counters = lines.flatMap(lambda x: x.split(' ')) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(add)

output = counters.collect()
sortedCollection = sorted(output, key=lambda r: r[1], reverse=True)

# Print the words in descending order of frequency.
for (word, count) in sortedCollection:
    print("%s: %i" % (word, count))
```
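When the script is submitted to the cluster, it should print each word with its count in descending order. For this sample data that works out to `spark: 3`, `pyspark: 2`, `hadoop: 2`, `and: 1`, `vs: 1` (the relative order of words with equal counts may vary).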
Link a SQL Server big data cluster
Before you can submit scripts to your clusters from Visual Studio Code, you need to link a SQL Server big data cluster.
From the menu bar, navigate to View > Command Palette..., and enter Spark / Hive: Link a Cluster.
Select the linked cluster type SQL Server Big Data.
Enter the SQL Server big data cluster endpoint.
Enter the SQL Server big data cluster user name.
Enter the password for the admin user.
Set the display name of the cluster (optional).
List clusters, and review the OUTPUT view for verification.
List clusters
From the menu bar, navigate to View > Command Palette..., and enter Spark / Hive: List Cluster.
Review the OUTPUT view. The view shows your linked cluster(s).
Set default cluster
Reopen the SQLBDCexample folder created earlier, if it's closed.
Select the HelloWorld.py file created earlier; it opens in the script editor.
Link a cluster if you haven't yet done so.
Right-click the script editor, and then select Spark / Hive: Set Default Cluster.
Select a cluster as the default cluster for the current script file. The tools automatically update the configuration file .vscode\settings.json.
Submit interactive PySpark queries
You can submit interactive PySpark queries by following these steps:
Reopen the SQLBDCexample folder created earlier, if it's closed.
Select the HelloWorld.py file created earlier; it opens in the script editor.
Link a cluster if you haven't yet done so.
Select all the code, right-click the script editor, and select Spark: PySpark Interactive to submit the query, or use the Ctrl+Alt+I shortcut.
Select the cluster if you haven't specified a default cluster. After a few moments, the Python Interactive results appear in a new tab. The tools also allow you to submit a block of code, instead of the whole script file, by using the context menu.
Enter %%info, and then press Shift+Enter to view job information. (Optional)
Note
If Python Extension Enabled is unchecked in the settings (it's checked by default), the submitted PySpark interactive results will use the old window.
提交 PySpark 批处理作业Submit PySpark batch job
如果已关闭,请重新打开之前创建的文件夹“SQLBDCexample” 。Reopen the folder SQLBDCexample created earlier if closed.
选择之前创建的文件“HelloWorld.py”,它将在脚本编辑器中打开 。Select the file HelloWorld.py created earlier and it will open in the script editor.
如果尚未链接群集,请将其链接。Link a cluster if you haven't yet done so.
右键单击脚本编辑器,然后选择“Spark: PySpark Batch”,或使用快捷方式 Ctrl+Alt+H 。Right-click the script editor, and then select Spark: PySpark Batch, or use shortcut Ctrl + Alt + H.
如果尚未指定默认群集,请选择群集。Select the cluster if you haven't specified a default cluster. 提交 Python 作业后,提交日志将显示在 Visual Studio Code 的“输出”窗口中 。After you submit a Python job, submission logs appear in the OUTPUT window in Visual Studio Code. 还会显示“Spark UI URL”和“Yarn UI URL” 。The Spark UI URL and Yarn UI URL are shown as well. 你可以在 Web 浏览器中打开 URL 以跟踪作业状态。You can open the URL in a web browser to track the job status.
Apache Livy configuration
Apache Livy configuration is supported, and it can be set in .vscode\settings.json in the workspace folder. Currently, Livy configuration only supports Python scripts. For more details, see the Livy README.
How to trigger Livy configuration
Method 1
- From the menu bar, navigate to File > Preferences > Settings.
- In the Search settings text box, enter HDInsight Job Submission: Livy Conf.
- Select Edit in settings.json for the relevant search result.
Method 2
Submit a file, and notice that the .vscode folder is automatically added to the work folder. You can find the Livy configuration by selecting .vscode\settings.json.
The project settings:
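As an illustration, here's a minimal sketch of what the Livy section of .vscode\settings.json might look like. The setting key shown is an assumption, not taken from this article; use whatever key the Edit in settings.json action inserts for Livy Conf, and adjust the values to your cluster. (VS Code settings files accept `//` comments, so the inline notes are only for illustration.)

```json
{
    // Hypothetical key -- use the key that "Edit in settings.json" inserts for "Livy Conf".
    "hdinsightJobSubmission.livyConf": {
        "driverMemory": "2g",       // memory for the driver process, with a unit (for example 1g or 1024m)
        "driverCores": 2,
        "executorMemory": "1g",     // memory per executor process, with a unit
        "executorCores": 2,
        "numExecutors": 3,
        "conf": {
            "spark.sql.shuffle.partitions": "200"
        }
    }
}
```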
Note
For the driverMemory and executorMemory settings, set the value with a unit, for example 1g or 1024m.
Supported Livy configurations
POST /batches
Request body

name | description | type |
---|---|---|
file | File containing the application to execute | path (required) |
proxyUser | User to impersonate when running the job | string |
className | Application Java/Spark main class | string |
args | Command line arguments for the application | list of strings |
jars | Jars to be used in this session | list of strings |
pyFiles | Python files to be used in this session | list of strings |
files | Files to be used in this session | list of strings |
driverMemory | Amount of memory to use for the driver process | string |
driverCores | Number of cores to use for the driver process | int |
executorMemory | Amount of memory to use per executor process | string |
executorCores | Number of cores to use for each executor | int |
numExecutors | Number of executors to launch for this session | int |
archives | Archives to be used in this session | list of strings |
queue | The name of the YARN queue to which the job is submitted | string |
name | The name of this session | string |
conf | Spark configuration properties | Map of key=val |
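For illustration only, a hypothetical POST /batches request body composed from the parameters above; the file path, name, and resource values are placeholders, not values the tools generate:

```json
{
    "file": "/path/to/HelloWorld.py",
    "name": "PythonWordCount",
    "driverMemory": "2g",
    "executorMemory": "1g",
    "executorCores": 2,
    "numExecutors": 3,
    "conf": {
        "spark.sql.shuffle.partitions": "200"
    }
}
```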
Response body
The created batch object.

name | description | type |
---|---|---|
id | The session id | int |
appId | The application id of this session | string |
appInfo | The detailed application info | Map of key=val |
log | The log lines | list of strings |
state | The batch state | string |
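A hypothetical response for a newly created batch, again with placeholder values; the fields populated inside appInfo and log depend on the cluster, and the state changes as the job progresses:

```json
{
    "id": 1,
    "appId": null,
    "appInfo": {},
    "log": [],
    "state": "starting"
}
```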
Note
The assigned Livy configuration is displayed in the output pane when you submit the script.
Additional features
Spark & Hive for Visual Studio Code supports the following features:
IntelliSense autocomplete. Suggestions pop up for keywords, methods, variables, and so on. Different icons represent different types of objects.
IntelliSense error marker. The language service underlines editing errors in the Hive script.
Syntax highlights. The language service uses different colors to differentiate variables, keywords, data types, functions, and so on.
Unlink cluster
From the menu bar, navigate to View > Command Palette..., and then enter Spark / Hive: Unlink a Cluster.
Select the cluster to unlink.
Review the OUTPUT view for verification.
Next steps
For more information on SQL Server big data clusters and related scenarios, see SQL Server Big Data Clusters.