
Tutorial: Extract, transform, and load data by using Azure Databricks

In this tutorial, you perform an ETL (extract, transform, and load data) operation by using Azure Databricks. You extract data from Azure Data Lake Storage Gen2 into Azure Databricks, run transformations on the data in Azure Databricks, and load the transformed data into Azure SQL Data Warehouse.

The steps in this tutorial use the SQL Data Warehouse connector for Azure Databricks to transfer data to Azure Databricks. This connector, in turn, uses Azure Blob storage as temporary storage for the data being transferred between an Azure Databricks cluster and Azure SQL Data Warehouse.

The following illustration shows the application flow:

Azure Databricks with Data Lake Store and SQL Data Warehouse

This tutorial covers the following tasks:

  • Create an Azure Databricks service.
  • Create a Spark cluster in Azure Databricks.
  • Create a file system in the Data Lake Storage Gen2 account.
  • Upload sample data to the Azure Data Lake Storage Gen2 account.
  • Create a service principal.
  • Extract data from the Azure Data Lake Storage Gen2 account.
  • Transform data in Azure Databricks.
  • Load data into Azure SQL Data Warehouse.

If you don't have an Azure subscription, create a free account before you begin.

Note

This tutorial cannot be carried out using an Azure Free Trial subscription. If you have a free account, go to your profile and change your subscription to pay-as-you-go. For more information, see Azure free account. Then, remove the spending limit, and request a quota increase for vCPUs in your region. When you create your Azure Databricks workspace, you can select the Trial (Premium - 14-Days Free DBUs) pricing tier to give the workspace access to free Premium Azure Databricks DBUs for 14 days.

Prerequisites

Complete these tasks before you begin this tutorial:

Gather the information that you need

Make sure that you complete the prerequisites of this tutorial.

Before you begin, you should have these items of information:

✔️ The database name, database server name, user name, and password of your Azure SQL Data Warehouse.

✔️ The access key of your blob storage account.

✔️ The name of your Data Lake Storage Gen2 storage account.

✔️ The tenant ID of your subscription.

✔️ The application ID of the app that you registered with Azure Active Directory (Azure AD).

✔️ The authentication key for the app that you registered with Azure AD.

Create an Azure Databricks service

In this section, you create an Azure Databricks service by using the Azure portal.

  1. From the Azure portal menu, select Create a resource.

    Create a resource in the Azure portal

    Then, select Analytics > Azure Databricks.

    Create Azure Databricks in the Azure portal

  2. Under Azure Databricks Service, provide the following values to create a Databricks service:

    Workspace name: Provide a name for your Databricks workspace.
    Subscription: From the drop-down, select your Azure subscription.
    Resource group: Specify whether you want to create a new resource group or use an existing one. A resource group is a container that holds related resources for an Azure solution. For more information, see Azure Resource Group overview.
    Location: Select West US 2. For other available regions, see Azure services available by region.
    Pricing Tier: Select Standard.
  3. Select Pin to dashboard and then select Create.

  4. The account creation takes a few minutes. To monitor the operation status, view the progress bar at the top.

Create a Spark cluster in Azure Databricks

  1. In the Azure portal, go to the Databricks service that you created, and select Launch Workspace.

  2. You're redirected to the Azure Databricks portal. From the portal, select Cluster.

    Databricks on Azure

  3. In the New cluster page, provide the values to create a cluster.

    Create a Databricks Spark cluster on Azure

  4. Fill in values for the following fields, and accept the default values for the other fields:

    • Enter a name for the cluster.

    • Make sure you select the Terminate after __ minutes of inactivity check box. If the cluster isn't being used, provide a duration (in minutes) after which the cluster is terminated.

    • Select Create cluster. After the cluster is running, you can attach notebooks to the cluster and run Spark jobs.

Create a file system in the Azure Data Lake Storage Gen2 account

In this section, you create a notebook in the Azure Databricks workspace and then run code snippets to configure the storage account.

  1. In the Azure portal, go to the Azure Databricks service that you created, and select Launch Workspace.

  2. On the left, select Workspace. From the Workspace drop-down, select Create > Notebook.

    Create a notebook in Databricks

  3. In the Create Notebook dialog box, enter a name for the notebook. Select Scala as the language, and then select the Spark cluster that you created earlier.

    Provide details for a notebook in Databricks

  4. Select Create.

  5. The following code block sets default service principal credentials for any ADLS Gen 2 account accessed in the Spark session. The second code block appends the account name to the setting to specify credentials for a specific ADLS Gen 2 account. Copy and paste either code block into the first cell of your Azure Databricks notebook.

    Session configuration

    val appID = "<appID>"
    val password = "<password>"
    val tenantID = "<tenant-id>"
    
    spark.conf.set("fs.azure.account.auth.type", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.client.id", "<appID>")
    spark.conf.set("fs.azure.account.oauth2.client.secret", "<password>")
    spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
    spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
    

    Account configuration

    val storageAccountName = "<storage-account-name>"
    val appID = "<app-id>"
    val password = "<password>"
    val fileSystemName = "<file-system-name>"
    val tenantID = "<tenant-id>"
    
    spark.conf.set("fs.azure.account.auth.type." + storageAccountName + ".dfs.core.windows.net", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type." + storageAccountName + ".dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.client.id." + storageAccountName + ".dfs.core.windows.net", "" + appID + "")
    spark.conf.set("fs.azure.account.oauth2.client.secret." + storageAccountName + ".dfs.core.windows.net", "" + password + "")
    spark.conf.set("fs.azure.account.oauth2.client.endpoint." + storageAccountName + ".dfs.core.windows.net", "https://login.microsoftonline.com/" + tenantID + "/oauth2/token")
    spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
    dbutils.fs.ls("abfss://" + fileSystemName  + "@" + storageAccountName + ".dfs.core.windows.net/")
    spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
    
  6. In this code block, replace the <app-id>, <password>, <tenant-id>, and <storage-account-name> placeholder values with the values that you collected while completing the prerequisites of this tutorial. Replace the <file-system-name> placeholder value with whatever name you want to give the file system.

    • The <app-id> and <password> are from the app that you registered with Active Directory as part of creating a service principal.

    • The <tenant-id> is from your subscription.

    • The <storage-account-name> is the name of your Azure Data Lake Storage Gen2 storage account.

  7. Press the SHIFT + ENTER keys to run the code in this block.

Ingest sample data into the Azure Data Lake Storage Gen2 account

Before you begin with this section, you must complete the following prerequisites:

Enter the following code into a notebook cell:

%sh wget -P /tmp https://raw.githubusercontent.com/Azure/usql/master/Examples/Samples/Data/json/radiowebsite/small_radio_json.json

In the cell, press SHIFT + ENTER to run the code.

Now, in a new cell below this one, enter the following code, and replace the values that appear in brackets with the same values you used earlier:

dbutils.fs.cp("file:///tmp/small_radio_json.json", "abfss://" + fileSystemName + "@" + storageAccountName + ".dfs.core.windows.net/")

In the cell, press SHIFT + ENTER to run the code.
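
To confirm that the file landed in the file system, you can list the root of the file system from the notebook. This is a minimal sketch; it assumes the fileSystemName and storageAccountName variables from the earlier configuration cell are still defined in the session:

// List the root of the Data Lake Storage Gen2 file system and check for small_radio_json.json.
dbutils.fs.ls("abfss://" + fileSystemName + "@" + storageAccountName + ".dfs.core.windows.net/").foreach(f => println(f.path))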

Extract data from the Azure Data Lake Storage Gen2 account

  1. You can now load the sample json file as a data frame in Azure Databricks. Paste the following code in a new cell. Replace the placeholders shown in brackets with your values.

    val df = spark.read.json("abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/small_radio_json.json")
    
  2. Press the SHIFT + ENTER keys to run the code in this block.

  3. Run the following code to see the contents of the data frame:

    df.show()
    

    You see an output similar to the following snippet:

    +---------------------+---------+---------+------+-------------+----------+---------+-------+--------------------+------+--------+-------------+---------+--------------------+------+-------------+------+
    |               artist|     auth|firstName|gender|itemInSession|  lastName|   length|  level|            location|method|    page| registration|sessionId|                song|status|           ts|userId|
    +---------------------+---------+---------+------+-------------+----------+---------+-------+--------------------+------+--------+-------------+---------+--------------------+------+-------------+------+
    | El Arrebato         |Logged In| Annalyse|     F|            2|Montgomery|234.57914| free  |  Killeen-Temple, TX|   PUT|NextSong|1384448062332|     1879|Quiero Quererte Q...|   200|1409318650332|   309|
    | Creedence Clearwa...|Logged In|   Dylann|     M|            9|    Thomas|340.87138| paid  |       Anchorage, AK|   PUT|NextSong|1400723739332|       10|        Born To Move|   200|1409318653332|    11|
    | Gorillaz            |Logged In|     Liam|     M|           11|     Watts|246.17751| paid  |New York-Newark-J...|   PUT|NextSong|1406279422332|     2047|                DARE|   200|1409318685332|   201|
    ...
    ...
    

    You have now extracted the data from Azure Data Lake Storage Gen2 into Azure Databricks.
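
    If you want a quick look at the structure Spark inferred before transforming the data, a check like the following can help (a minimal sketch; df is the data frame created above):

    // Print the inferred schema and count the records in the data frame.
    df.printSchema()
    println("Record count: " + df.count())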

Transform data in Azure Databricks

The raw sample data file, small_radio_json.json, captures the audience for a radio station and has a variety of columns. In this section, you transform the data to retrieve only specific columns from the dataset.

  1. First, retrieve only the columns firstName, lastName, gender, location, and level from the dataframe that you created.

    val specificColumnsDf = df.select("firstname", "lastname", "gender", "location", "level")
    specificColumnsDf.show()
    

    You receive output as shown in the following snippet:

    +---------+----------+------+--------------------+-----+
    |firstname|  lastname|gender|            location|level|
    +---------+----------+------+--------------------+-----+
    | Annalyse|Montgomery|     F|  Killeen-Temple, TX| free|
    |   Dylann|    Thomas|     M|       Anchorage, AK| paid|
    |     Liam|     Watts|     M|New York-Newark-J...| paid|
    |     Tess|  Townsend|     F|Nashville-Davidso...| free|
    |  Margaux|     Smith|     F|Atlanta-Sandy Spr...| free|
    |     Alan|     Morse|     M|Chicago-Napervill...| paid|
    |Gabriella|   Shelton|     F|San Jose-Sunnyval...| free|
    |   Elijah|  Williams|     M|Detroit-Warren-De...| paid|
    |  Margaux|     Smith|     F|Atlanta-Sandy Spr...| free|
    |     Tess|  Townsend|     F|Nashville-Davidso...| free|
    |     Alan|     Morse|     M|Chicago-Napervill...| paid|
    |     Liam|     Watts|     M|New York-Newark-J...| paid|
    |     Liam|     Watts|     M|New York-Newark-J...| paid|
    |   Dylann|    Thomas|     M|       Anchorage, AK| paid|
    |     Alan|     Morse|     M|Chicago-Napervill...| paid|
    |   Elijah|  Williams|     M|Detroit-Warren-De...| paid|
    |  Margaux|     Smith|     F|Atlanta-Sandy Spr...| free|
    |     Alan|     Morse|     M|Chicago-Napervill...| paid|
    |   Dylann|    Thomas|     M|       Anchorage, AK| paid|
    |  Margaux|     Smith|     F|Atlanta-Sandy Spr...| free|
    +---------+----------+------+--------------------+-----+
    
  2. You can further transform this data to rename the column level to subscription_type.

    val renamedColumnsDF = specificColumnsDf.withColumnRenamed("level", "subscription_type")
    renamedColumnsDF.show()
    

    You receive output as shown in the following snippet.

    +---------+----------+------+--------------------+-----------------+
    |firstname|  lastname|gender|            location|subscription_type|
    +---------+----------+------+--------------------+-----------------+
    | Annalyse|Montgomery|     F|  Killeen-Temple, TX|             free|
    |   Dylann|    Thomas|     M|       Anchorage, AK|             paid|
    |     Liam|     Watts|     M|New York-Newark-J...|             paid|
    |     Tess|  Townsend|     F|Nashville-Davidso...|             free|
    |  Margaux|     Smith|     F|Atlanta-Sandy Spr...|             free|
    |     Alan|     Morse|     M|Chicago-Napervill...|             paid|
    |Gabriella|   Shelton|     F|San Jose-Sunnyval...|             free|
    |   Elijah|  Williams|     M|Detroit-Warren-De...|             paid|
    |  Margaux|     Smith|     F|Atlanta-Sandy Spr...|             free|
    |     Tess|  Townsend|     F|Nashville-Davidso...|             free|
    |     Alan|     Morse|     M|Chicago-Napervill...|             paid|
    |     Liam|     Watts|     M|New York-Newark-J...|             paid|
    |     Liam|     Watts|     M|New York-Newark-J...|             paid|
    |   Dylann|    Thomas|     M|       Anchorage, AK|             paid|
    |     Alan|     Morse|     M|Chicago-Napervill...|             paid|
    |   Elijah|  Williams|     M|Detroit-Warren-De...|             paid|
    |  Margaux|     Smith|     F|Atlanta-Sandy Spr...|             free|
    |     Alan|     Morse|     M|Chicago-Napervill...|             paid|
    |   Dylann|    Thomas|     M|       Anchorage, AK|             paid|
    |  Margaux|     Smith|     F|Atlanta-Sandy Spr...|             free|
    +---------+----------+------+--------------------+-----------------+
    

Load data into Azure SQL Data Warehouse

In this section, you upload the transformed data into Azure SQL Data Warehouse. You use the Azure SQL Data Warehouse connector for Azure Databricks to directly upload a dataframe as a table in a SQL data warehouse.

As mentioned earlier, the SQL Data Warehouse connector uses Azure Blob storage as temporary storage to upload data between Azure Databricks and Azure SQL Data Warehouse. So, you start by providing the configuration to connect to the storage account. You must already have created the account as part of the prerequisites for this article.

  1. Provide the configuration to access the Azure Storage account from Azure Databricks.

    val blobStorage = "<blob-storage-account-name>.blob.core.windows.net"
    val blobContainer = "<blob-container-name>"
    val blobAccessKey =  "<access-key>"
    
  2. Specify a temporary folder to use while moving data between Azure Databricks and Azure SQL Data Warehouse.

    val tempDir = "wasbs://" + blobContainer + "@" + blobStorage +"/tempDirs"
    
  3. Run the following snippet to store Azure Blob storage access keys in the configuration. This action ensures that you don't have to keep the access key in the notebook in plain text.

    val acntInfo = "fs.azure.account.key."+ blobStorage
    sc.hadoopConfiguration.set(acntInfo, blobAccessKey)
    
  4. Provide the values to connect to the Azure SQL Data Warehouse instance. You must have created a SQL data warehouse as a prerequisite. Use the fully qualified server name for dwServer. For example, <servername>.database.windows.net.

    //SQL Data Warehouse related settings
    val dwDatabase = "<database-name>"
    val dwServer = "<database-server-name>"
    val dwUser = "<user-name>"
    val dwPass = "<password>"
    val dwJdbcPort =  "1433"
    val dwJdbcExtraOptions = "encrypt=true;trustServerCertificate=true;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
    val sqlDwUrl = "jdbc:sqlserver://" + dwServer + ":" + dwJdbcPort + ";database=" + dwDatabase + ";user=" + dwUser+";password=" + dwPass + ";$dwJdbcExtraOptions"
    val sqlDwUrlSmall = "jdbc:sqlserver://" + dwServer + ":" + dwJdbcPort + ";database=" + dwDatabase + ";user=" + dwUser+";password=" + dwPass
    
  5. Run the following snippet to load the transformed dataframe, renamedColumnsDF, as a table in a SQL data warehouse. This snippet creates a table called SampleTable in the SQL database.

    spark.conf.set(
        "spark.sql.parquet.writeLegacyFormat",
        "true")
    
    renamedColumnsDF.write
        .format("com.databricks.spark.sqldw")
        .option("url", sqlDwUrlSmall)
        .option("dbtable", "SampleTable")
        .option("forward_spark_azure_storage_credentials", "True")
        .option("tempdir", tempDir)
        .mode("overwrite")
        .save()
    

    Note

    This sample uses the forward_spark_azure_storage_credentials flag, which causes SQL Data Warehouse to access data from blob storage by using an access key. This is the only supported method of authentication.

    If your Azure Blob storage account is restricted to select virtual networks, SQL Data Warehouse requires Managed Service Identity instead of access keys; using an access key in that case causes the error "This request is not authorized to perform this operation."

  6. Connect to the SQL database and verify that you see a table named SampleTable.

    Verify the sample table

  7. Run a select query to verify the contents of the table. The table should have the same data as the renamedColumnsDF dataframe.

    Verify the sample table content
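
    Alternatively, you can check the load without leaving the notebook by reading the table back through the same connector. This is a minimal sketch; it reuses the sqlDwUrlSmall and tempDir values defined earlier:

    // Read SampleTable back from SQL Data Warehouse into a data frame and display a few rows.
    val loadedDf = spark.read
        .format("com.databricks.spark.sqldw")
        .option("url", sqlDwUrlSmall)
        .option("tempdir", tempDir)
        .option("forward_spark_azure_storage_credentials", "true")
        .option("dbtable", "SampleTable")
        .load()

    loadedDf.show(10)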

Clean up resources

After you finish the tutorial, you can terminate the cluster. From the Azure Databricks workspace, select Clusters on the left. For the cluster to terminate, under Actions, point to the ellipsis (...) and select the Terminate icon.

Stop a Databricks cluster

If you don't manually terminate the cluster, it stops automatically, provided you selected the Terminate after __ minutes of inactivity check box when you created the cluster. In that case, the cluster automatically stops if it has been inactive for the specified time.

Next steps

In this tutorial, you learned how to:

  • Create an Azure Databricks service
  • Create a Spark cluster in Azure Databricks
  • Create a notebook in Azure Databricks
  • Extract data from a Data Lake Storage Gen2 account
  • Transform data in Azure Databricks
  • Load data into Azure SQL Data Warehouse

Advance to the next tutorial to learn about streaming real-time data into Azure Databricks by using Azure Event Hubs.