Quickstart: Run a Spark job on Azure Databricks Workspace using the Azure portal

In this quickstart, you use the Azure portal to create an Azure Databricks workspace with an Apache Spark cluster. You run a job on the cluster and use custom charts to produce real-time reports from Boston safety data.

Prerequisites

  • Azure subscription - create one for free. This tutorial cannot be completed with an Azure Free Trial subscription. If you have a free account, go to your profile and change your subscription to pay-as-you-go. For more information, see Azure free account. Then, remove the spending limit and request a quota increase for vCPUs in your region. When you create your Azure Databricks workspace, you can select the Trial (Premium - 14-Days Free DBUs) pricing tier to give the workspace access to free Premium Azure Databricks DBUs for 14 days.

  • Sign in to the Azure portal.

Note

If you want to create an Azure Databricks workspace in the Azure Commercial Cloud that holds US Government compliance certifications like FedRAMP High, please reach out to your Microsoft or Databricks representative to gain access to this experience.

Create an Azure Databricks workspace

In this section, you create an Azure Databricks workspace using the Azure portal.

  1. In the Azure portal, select Create a resource > Analytics > Azure Databricks.

    Databricks on Azure portal

  2. Under Azure Databricks Service, provide the values to create a Databricks workspace.

    Create an Azure Databricks workspace

    Provide the following values:

    Workspace name: Provide a name for your Databricks workspace.
    Subscription: From the drop-down, select your Azure subscription.
    Resource group: Specify whether you want to create a new resource group or use an existing one. A resource group is a container that holds related resources for an Azure solution. For more information, see Azure Resource Group overview.
    Location: Select West US 2. For other available regions, see Azure services available by region.
    Pricing Tier: Choose between Standard, Premium, or Trial. For more information on these tiers, see the Databricks pricing page.
  3. Select Review + Create, and then Create. The workspace creation takes a few minutes. During workspace creation, you can view the deployment status in Notifications. Once this process is finished, your user account is automatically added as an admin user in the workspace.

    Databricks deployment tile

    When a workspace deployment fails, the workspace is still created in a failed state. Delete the failed workspace and create a new workspace that resolves the deployment errors. When you delete the failed workspace, the managed resource group and any successfully deployed resources are also deleted.
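
If you prefer to script workspace creation rather than click through the portal, the Azure SDK for Python includes a Databricks management client. The following is only a heavily hedged sketch: the azure-mgmt-databricks package, the begin_create_or_update call, and the managed resource group requirement should be verified against the current SDK reference, and every placeholder value below is hypothetical.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.databricks import AzureDatabricksManagementClient

    # Hypothetical placeholders: your subscription, resource group, and names
    subscription_id = "<subscription-id>"
    client = AzureDatabricksManagementClient(DefaultAzureCredential(), subscription_id)

    # Databricks keeps its own resources in a separate managed resource group
    poller = client.workspaces.begin_create_or_update(
        "<resource-group>",
        "<workspace-name>",
        {
            "location": "westus2",
            "sku": {"name": "standard"},
            "managed_resource_group_id": "/subscriptions/<subscription-id>/resourceGroups/<managed-rg>",
        },
    )
    workspace = poller.result()  # blocks until the deployment completes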

Create a Spark cluster in Databricks

Note

To use a free account to create the Azure Databricks cluster, before creating the cluster, go to your profile and change your subscription to pay-as-you-go. For more information, see Azure free account.

  1. In the Azure portal, go to the Databricks workspace that you created, and then click Launch Workspace.

  2. You are redirected to the Azure Databricks portal. From the portal, click New Cluster.

    Databricks on Azure

  3. On the New Cluster page, provide the values to create a cluster.

    Create Databricks Spark cluster on Azure

    Accept all default values other than the following:

    • Enter a name for the cluster.

    • For this article, create a cluster with a 5.X, 6.X, or 7.X runtime.

    • Make sure you select the Terminate after __ minutes of inactivity checkbox. Provide a duration (in minutes) after which the cluster is terminated if it is not being used.

      Select Create cluster. Once the cluster is running, you can attach notebooks to the cluster and run Spark jobs.

For more information on creating clusters, see Create a Spark cluster in Azure Databricks.

Run a Spark SQL job

Perform the following tasks to create a notebook in Databricks, configure the notebook to read data from Azure Open Datasets, and then run a Spark SQL job on the data.

  1. In the left pane, select Azure Databricks. From Common Tasks, select New Notebook.

    Create a new notebook

  2. In the Create Notebook dialog box, enter a name, select Python as the language, and select the Spark cluster that you created earlier.

    Enter notebook details

    Select Create.
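
    Optionally, once the notebook opens, run a quick sanity check in a cell to confirm that the notebook is attached to the cluster and to see which Spark version the selected runtime provides. This one-liner is not part of the original quickstart; spark.version is a standard property of the SparkSession object that Databricks notebooks expose as spark.

    # Optional sanity check: print the Spark version of the attached cluster
    print(spark.version)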

  3. In this step, create a Spark DataFrame with Boston Safety Data from Azure Open Datasets, and use SQL to query the data.

    The following command sets the Azure storage access information. Paste this PySpark code into the first cell and use Shift+Enter to run the code.

    # Public Azure Open Datasets storage account and container
    blob_account_name = "azureopendatastorage"
    blob_container_name = "citydatacontainer"
    # Folder inside the container that holds the Boston safety data
    blob_relative_path = "Safety/Release/city=Boston"
    # Read-only shared access signature (SAS) token for the public container
    blob_sas_token = r"?st=2019-02-26T02%3A34%3A32Z&se=2119-02-27T02%3A34%3A00Z&sp=rl&sv=2018-03-28&sr=c&sig=XlJVWA7fMXCSxCKqJm8psMOh0W4h7cSYO28coRqF2fs%3D"
    

    The following command allows Spark to read from Blob storage remotely. Paste this PySpark code into the next cell and use Shift+Enter to run the code.

    # Build the wasbs:// path to the data and register the SAS token with Spark
    wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
    spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name), blob_sas_token)
    print('Remote blob path: ' + wasbs_path)
    

    The following command creates a DataFrame. Paste this PySpark code into the next cell and use Shift+Enter to run the code.

    # Read the Parquet files and register the result as a temporary view named 'source'
    df = spark.read.parquet(wasbs_path)
    print('Register the DataFrame as a SQL temporary view: source')
    df.createOrReplaceTempView('source')
    
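    Before querying the data, you can optionally inspect the columns and types that the Parquet files expose. This check is not part of the original quickstart; printSchema is a standard DataFrame method.

    # Optional: show the column names and types of the Boston safety data
    df.printSchema()
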
  4. Run a SQL statement to return the top 10 rows of data from the temporary view called source. Paste this PySpark code into the next cell and use Shift+Enter to run the code.

    print('Displaying top 10 rows: ')
    display(spark.sql('SELECT * FROM source LIMIT 10'))
    
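    The same rows can be fetched without SQL by using the DataFrame API directly; this equivalent cell is only an alternative sketch, not part of the original steps. display is the same Databricks notebook helper used above.

    # Equivalent to the SQL query above, using the DataFrame API
    display(df.limit(10))
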
  5. You see a tabular output like the one shown in the following screenshot (only some columns are shown):

    Sample data

  6. You now create a visual representation of this data to show how many safety events are reported using the Citizens Connect App and City Worker App instead of other sources. From the bottom of the tabular output, select the Bar chart icon, and then click Plot Options.

    Create bar chart

  7. In Customize Plot, drag and drop values as shown in the screenshot. A cell that reproduces the same counts in SQL is sketched after these steps.

    Customize pie chart

    • Set Keys to source.

    • Set Values to <\id>.

    • Set Aggregation to COUNT.

    • Set Display type to Pie chart.

      Click Apply.
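
    The pie chart groups events by the source column and counts them. As a sketch (not part of the original steps, and assuming the source column shown in the tabular output), the following cell computes the same counts directly with Spark SQL:

    # Reproduce the chart's underlying data: count safety events per reporting source
    display(spark.sql('SELECT source, COUNT(*) AS events FROM source GROUP BY source ORDER BY events DESC'))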

Clean up resources

After you have finished the article, you can terminate the cluster. To do so, in the left pane of the Azure Databricks workspace, select Clusters. For the cluster you want to terminate, move the cursor over the ellipsis in the Actions column, and select the Terminate icon.

Stop a Databricks cluster

If you do not manually terminate the cluster, it stops automatically once it has been inactive for the specified time, provided you selected the Terminate after __ minutes of inactivity checkbox when you created the cluster.
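
If you prefer to script the cleanup instead of using the UI, the Databricks REST API provides a clusters/delete endpoint that terminates a cluster without permanently removing it. This sketch assumes you have created a personal access token and know the cluster ID; the host, token, and cluster ID below are hypothetical placeholders.

    import requests

    # Hypothetical placeholders: substitute your workspace URL, token, and cluster ID
    DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"
    TOKEN = "<personal-access-token>"
    CLUSTER_ID = "<cluster-id>"

    # POST /api/2.0/clusters/delete terminates the cluster; it can be restarted later
    resp = requests.post(
        DATABRICKS_HOST + "/api/2.0/clusters/delete",
        headers={"Authorization": "Bearer " + TOKEN},
        json={"cluster_id": CLUSTER_ID},
    )
    resp.raise_for_status()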

Next steps

In this article, you created a Spark cluster in Azure Databricks and ran a Spark job using data from Azure Open Datasets. You can also look at Spark data sources to learn how to import data from other data sources into Azure Databricks. Advance to the next article to learn how to perform an ETL operation (extract, transform, and load data) using Azure Databricks.