Use Apache Spark REST API to submit remote jobs to an HDInsight Spark cluster

Learn how to use Apache Livy, the Apache Spark REST API, which is used to submit remote jobs to an Azure HDInsight Spark cluster. For detailed documentation, see Apache Livy.

You can use Livy to run interactive Spark shells or submit batch jobs to be run on Spark. This article discusses how to use Livy to submit batch jobs. The snippets in this article use cURL to make REST API calls to the Livy Spark endpoint.

Prerequisites

An Apache Spark cluster on HDInsight. For instructions, see Create Apache Spark clusters in Azure HDInsight.

Submit an Apache Livy Spark batch job

Before you submit a batch job, you must upload the application jar to the cluster storage associated with the cluster. You can use AzCopy, a command-line utility, to do so. There are various other clients you can use to upload data as well. You can find more about them at Upload data for Apache Hadoop jobs in HDInsight.
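
For example, a minimal AzCopy upload sketch is shown below. This is an assumption rather than a prescribed step: it uses AzCopy v10 syntax, and the local path, storage account, container, and SAS token are all hypothetical values to replace with your own.

azcopy copy "C:\workspace\SparkSimpleTest.jar" "https://mystorageaccount.blob.core.windows.net/mycontainer/data/SparkSimpleTest.jar?<SAS token>"

Once the jar is on cluster storage, submit the batch job with the following syntax: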

curl -k --user "admin:password" -v -H "Content-Type: application/json" -X POST -d '{ "file":"<path to application jar>", "className":"<classname in jar>" }' 'https://<spark_cluster_name>.azurehdinsight.net/livy/batches' -H "X-Requested-By: admin"

Examples

  • If the jar file is on the cluster storage (WASBS)

    curl -k --user "admin:mypassword1!" -v -H "Content-Type: application/json" -X POST -d '{ "file":"wasbs://mycontainer@mystorageaccount.blob.core.windows.net/data/SparkSimpleTest.jar", "className":"com.microsoft.spark.test.SimpleFile" }' "https://mysparkcluster.azurehdinsight.net/livy/batches" -H "X-Requested-By: admin"
    
  • If you want to pass the jar filename and the classname as part of an input file (in this example, input.txt)

    curl -k --user "admin:mypassword1!" -v -H "Content-Type: application/json" -X POST --data @C:\Temp\input.txt "https://mysparkcluster.azurehdinsight.net/livy/batches" -H "X-Requested-By: admin"
    

Get information on Livy Spark batches running on the cluster

Syntax:

curl -k --user "admin:password" -v -X GET "https://<spark_cluster_name>.azurehdinsight.net/livy/batches"

Examples

  • If you want to retrieve all the Livy Spark batches running on the cluster:

    curl -k --user "admin:mypassword1!" -v -X GET "https://mysparkcluster.azurehdinsight.net/livy/batches"
    
  • If you want to retrieve a specific batch with a given batch ID:

    curl -k --user "admin:mypassword1!" -v -X GET "https://mysparkcluster.azurehdinsight.net/livy/batches/{batchId}"
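
  • If you want to retrieve the driver log of a given batch. This uses the /batches/{batchId}/log endpoint of the standard Apache Livy REST API; the from and size query parameters (how many log lines to skip and to return) are shown with assumed values:

    curl -k --user "admin:mypassword1!" -v -X GET "https://mysparkcluster.azurehdinsight.net/livy/batches/{batchId}/log?from=0&size=100"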
    

Delete a Livy Spark batch job

curl -k --user "admin:password" -v -X DELETE "https://<spark_cluster_name>.azurehdinsight.net/livy/batches/{batchId}"

Example

Deleting a batch job with batch ID 5.

curl -k --user "admin:mypassword1!" -v -X DELETE "https://mysparkcluster.azurehdinsight.net/livy/batches/5"
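
Note: when Apache Livy's CSRF protection is enabled, POST and DELETE requests must carry the X-Requested-By header, as the submit examples earlier in this article do. If your delete call is rejected for a missing header, a hedged variant looks like this:

curl -k --user "admin:mypassword1!" -v -X DELETE "https://mysparkcluster.azurehdinsight.net/livy/batches/5" -H "X-Requested-By: admin"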

Livy Spark and high availability

Livy provides high availability for Spark jobs running on the cluster. Here are a couple of examples.

  • If the Livy service goes down after you've submitted a job remotely to a Spark cluster, the job continues to run in the background. When Livy is back up, it restores the status of the job and reports it back.
  • Jupyter Notebooks for HDInsight are powered by Livy in the backend. If a notebook is running a Spark job and the Livy service gets restarted, the notebook continues to run the code cells.

Show me an example

In this section, we look at examples of using Livy Spark to submit a batch job, monitor the progress of the job, and then delete it. The application we use in this example is the one developed in the article Create a standalone Scala application to run on an HDInsight Spark cluster. The steps here assume:

  • You've already copied over the application jar to the storage account associated with the cluster.
  • You have cURL installed on the computer where you're trying these steps.

Perform the following steps:

  1. For ease of use, set environment variables. This example is based on a Windows environment; revise the variables as needed for your environment. Replace CLUSTERNAME and PASSWORD with the appropriate values.

    set clustername=CLUSTERNAME
    set password=PASSWORD
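
    On a Linux or macOS shell, a rough equivalent is shown below. This is an assumption about your environment; the rest of this walkthrough uses the Windows %variable% syntax, so substitute $CLUSTERNAME and $PASSWORD in the commands that follow.

    export CLUSTERNAME=CLUSTERNAME
    export PASSWORD=PASSWORD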
    
  2. Verify that Livy Spark is running on the cluster. We can do so by getting a list of running batches. If you're running a job using Livy for the first time, the output should return zero.

    curl -k --user "admin:%password%" -v -X GET "https://%clustername%.azurehdinsight.net/livy/batches"
    

    You should get an output similar to the following snippet:

    < HTTP/1.1 200 OK
    < Content-Type: application/json; charset=UTF-8
    < Server: Microsoft-IIS/8.5
    < X-Powered-By: ARR/2.5
    < X-Powered-By: ASP.NET
    < Date: Fri, 20 Nov 2015 23:47:53 GMT
    < Content-Length: 34
    <
    {"from":0,"total":0,"sessions":[]}* Connection #0 to host mysparkcluster.azurehdinsight.net left intact
    

    Notice how the last line in the output says total:0, which suggests no running batches.

  3. Let us now submit a batch job. The following snippet uses an input file (input.txt) to pass the jar name and the class name as parameters. If you're running these steps from a Windows computer, using an input file is the recommended approach.

    curl -k --user "admin:%password%" -v -H "Content-Type: application/json" -X POST --data @C:\Temp\input.txt "https://%clustername%.azurehdinsight.net/livy/batches" -H "X-Requested-By: admin"
    

    The parameters in the file input.txt are defined as follows:

    { "file":"wasbs:///example/jars/SparkSimpleApp.jar", "className":"com.microsoft.spark.example.WasbIOTest" }
    

    You should see an output similar to the following snippet:

    < HTTP/1.1 201 Created
    < Content-Type: application/json; charset=UTF-8
    < Location: /0
    < Server: Microsoft-IIS/8.5
    < X-Powered-By: ARR/2.5
    < X-Powered-By: ASP.NET
    < Date: Fri, 20 Nov 2015 23:51:30 GMT
    < Content-Length: 36
    <
    {"id":0,"state":"starting","log":[]}* Connection #0 to host mysparkcluster.azurehdinsight.net left intact
    

    Notice how the last line of the output says state:starting. It also says id:0. Here, 0 is the batch ID.

  4. You can now retrieve the status of this specific batch using the batch ID.

    curl -k --user "admin:%password%" -v -X GET "https://%clustername%.azurehdinsight.net/livy/batches/0"
    

    You should see an output similar to the following snippet:

    < HTTP/1.1 200 OK
    < Content-Type: application/json; charset=UTF-8
    < Server: Microsoft-IIS/8.5
    < X-Powered-By: ARR/2.5
    < X-Powered-By: ASP.NET
    < Date: Fri, 20 Nov 2015 23:54:42 GMT
    < Content-Length: 509
    <
    {"id":0,"state":"success","log":["\t diagnostics: N/A","\t ApplicationMaster host: 10.0.0.4","\t ApplicationMaster RPC port: 0","\t queue: default","\t start time: 1448063505350","\t final status: SUCCEEDED","\t tracking URL: http://myspar.lpel.jx.internal.cloudapp.net:8088/proxy/application_1447984474852_0002/","\t user: root","15/11/20 23:52:47 INFO Utils: Shutdown hook called","15/11/20 23:52:47 INFO Utils: Deleting directory /tmp/spark-b72cd2bf-280b-4c57-8ceb-9e3e69ac7d0c"]}* Connection #0 to host mysparkcluster.azurehdinsight.net left intact
    

    The output now shows state:success, which suggests that the job was successfully completed.
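
    If you want to script this check instead of polling by hand, the following is a minimal sketch, assuming a bash shell with the jq utility installed, the $CLUSTERNAME and $PASSWORD variables from step 1, and batch ID 0:

    # Poll the batch every 10 seconds until it leaves the transient states.
    while true; do
      state=$(curl -s -k --user "admin:$PASSWORD" "https://$CLUSTERNAME.azurehdinsight.net/livy/batches/0" | jq -r '.state')
      echo "Batch state: $state"
      case "$state" in
        starting|running) sleep 10 ;;
        *) break ;;
      esac
    done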

  5. If you want, you can now delete the batch.

    curl -k --user "admin:%password%" -v -X DELETE "https://%clustername%.azurehdinsight.net/livy/batches/0"
    

    You should see an output similar to the following snippet:

    < HTTP/1.1 200 OK
    < Content-Type: application/json; charset=UTF-8
    < Server: Microsoft-IIS/8.5
    < X-Powered-By: ARR/2.5
    < X-Powered-By: ASP.NET
    < Date: Sat, 21 Nov 2015 18:51:54 GMT
    < Content-Length: 17
    <
    {"msg":"deleted"}* Connection #0 to host mysparkcluster.azurehdinsight.net left intact
    

    The last line of the output shows that the batch was successfully deleted. Deleting a job while it's running also kills it. If you delete a job that has completed, successfully or otherwise, it deletes the job information completely.

Updates to Livy configuration starting with HDInsight version 3.5

HDInsight 3.5 clusters and above, by default, disable the use of local file paths to access sample data files or jars. We encourage you to use the wasbs:// path instead to access jars or sample data files from the cluster.
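
If you do need local file access, Livy exposes a whitelist setting for it. A hedged sketch of the relevant property, set through the cluster's Livy configuration (for example, via Ambari); the path shown is an assumption:

livy.file.local-dir-whitelist = /allowed/local/path/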

Submitting Livy jobs for a cluster within an Azure virtual network

If you connect to an HDInsight Spark cluster from within an Azure Virtual Network, you can connect directly to Livy on the cluster. In such a case, the URL for the Livy endpoint is http://<IP address of the headnode>:8998/batches. Here, 8998 is the port on which Livy runs on the cluster headnode. For more information on accessing services on non-public ports, see Ports used by Apache Hadoop services on HDInsight.
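
As a sketch, a submission against the in-network endpoint might look like the following. The headnode IP of 10.0.0.18 is hypothetical, the input.txt is the same one used earlier, and whether gateway credentials are still required depends on your cluster configuration.

curl -v -H "Content-Type: application/json" -X POST --data @C:\Temp\input.txt "http://10.0.0.18:8998/batches" -H "X-Requested-By: admin"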

Next steps