Use Azure Toolkit for IntelliJ to debug Spark applications remotely in HDInsight through VPN

We recommend debugging spark applications remotely through SSH. For instructions, see Remotely debug Spark applications on an HDInsight cluster with Azure Toolkit for IntelliJ through SSH.

This article provides step-by-step guidance on how to use the HDInsight Tools in Azure Toolkit for IntelliJ to submit a Spark job on an HDInsight Spark cluster, and then debug it remotely from your desktop computer. To complete these tasks, you must perform the following high-level steps:

  1. Create a site-to-site or point-to-site Azure virtual network. The steps in this document assume that you use a site-to-site network.
  2. Create a Spark cluster in HDInsight that is part of the site-to-site virtual network.
  3. Verify the connectivity between the cluster head node and your desktop.
  4. Create a Scala application in IntelliJ IDEA, and then configure it for remote debugging.
  5. Run and debug the application.

Prerequisites

Step 1: Create an Azure virtual network

Follow the instructions from the following links to create an Azure virtual network, and then verify the connectivity between your desktop computer and the virtual network:

Step 2: Create an HDInsight Spark cluster

We recommend that you also create an Apache Spark cluster in Azure HDInsight that is part of the Azure virtual network that you created. Use the information available in Create Linux-based clusters in HDInsight. As part of optional configuration, select the Azure virtual network that you created in the previous step.

Step 3: Verify the connectivity between the cluster head node and your desktop

  1. Get the IP address of the head node. Open Ambari UI for the cluster. From the cluster blade, select Dashboard.

    Select Dashboard in Ambari

  2. From the Ambari UI, select Hosts.

    Select Hosts in Ambari

  3. You see a list of head nodes, worker nodes, and zookeeper nodes. The head nodes have an hn* prefix. Select the first head node.

    Find the head node in Ambari

  4. From the Summary pane at the bottom of the page that opens, copy the IP Address of the head node and the Hostname.

    Find the IP address in Ambari

  5. Add the IP address and the hostname of the head node to the hosts file on the computer where you want to run and remotely debug the Spark job. This enables you to communicate with the head node by using the IP address, as well as the hostname.

    a. Open a Notepad file with elevated permissions. From the File menu, select Open, and then find the location of the hosts file. On a Windows computer, the location is C:\Windows\System32\Drivers\etc\hosts.

    b. Add the following information to the hosts file:

        # For headnode0
        192.xxx.xx.xx hn0-nitinp
        192.xxx.xx.xx hn0-nitinp.lhwwghjkpqejawpqbwcdyp3.gx.internal.cloudapp.net
    
        # For headnode1
        192.xxx.xx.xx hn1-nitinp
        192.xxx.xx.xx hn1-nitinp.lhwwghjkpqejawpqbwcdyp3.gx.internal.cloudapp.net
    
  6. From the computer that you connected to the Azure virtual network that is used by the HDInsight cluster, verify that you can ping the head nodes by using the IP address, as well as the hostname.
  7. Use SSH to connect to the cluster head node by following the instructions in Connect to an HDInsight cluster using SSH. From the cluster head node, ping the IP address of the desktop computer. Test the connectivity to both IP addresses assigned to the computer:

    • One for the network connection
    • One for the Azure virtual network
  8. Repeat the steps for the other head node.

Step 4: Create a Spark Scala application by using HDInsight Tools in Azure Toolkit for IntelliJ and configure it for remote debugging

  1. Open IntelliJ IDEA and create a new project. In the New Project dialog box, do the following:

    Select the new project template in IntelliJ IDEA

    a. Select HDInsight > Spark on HDInsight (Scala).

    b. Select Next.

  2. In the next New Project dialog box, do the following, and then select Finish:

    • Enter a project name and location.

    • In the Project SDK drop-down list, select Java 1.8 for the Spark 2.x cluster, or select Java 1.7 for the Spark 1.x cluster.

    • In the Spark version drop-down list, the Scala project creation wizard integrates the proper version for the Spark SDK and the Scala SDK. If the Spark cluster version is earlier than 2.0, select Spark 1.x. Otherwise, select Spark2.x. This example uses Spark 2.0.2 (Scala 2.11.8).

    Select the project SDK and Spark version

  3. The Spark project automatically creates an artifact for you. To view the artifact, do the following:

    a. From the File menu, select Project Structure.

    b. In the Project Structure dialog box, select Artifacts to view the default artifact that is created. You can also create your own artifact by selecting the plus sign (+).

    Create JAR

  4. Add libraries to your project. To add a library, do the following:

    a. Right-click the project name in the project tree, and then select Open Module Settings.

    b. In the Project Structure dialog box, select Libraries, select the (+) symbol, and then select From Maven.

    Add a library

    c. In the Download Library from Maven Repository dialog box, search for and add the following libraries:

    • org.scalatest:scalatest_2.10:2.2.1
    • org.apache.hadoop:hadoop-azure:2.7.1
  5. Copy yarn-site.xml and core-site.xml from the cluster head node and add them to the project. Use the following commands to copy the files. You can use Cygwin to run the following scp commands to copy the files from the cluster head nodes:

     scp <ssh user name>@<headnode IP address or host name>://etc/hadoop/conf/core-site.xml .
    

    Because we already added the cluster head node IP address and hostnames for the hosts file on the desktop, we can use the scp commands in the following manner:

     scp sshuser@hn0-nitinp:/etc/hadoop/conf/core-site.xml .
     scp sshuser@hn0-nitinp:/etc/hadoop/conf/yarn-site.xml .
    

    To add these files to your project, copy them under the /src folder in your project tree, for example <your project directory>\src.

  6. Update the core-site.xml file to make the following changes:

    a. Replace the encrypted key. The core-site.xml file includes the encrypted key to the storage account associated with the cluster. In the core-site.xml file that you added to the project, replace the encrypted key with the actual storage key associated with the default storage account. For more information, see Manage your storage access keys.

        <property>
              <name>fs.azure.account.key.hdistoragecentral.blob.core.windows.net</name>
              <value>access-key-associated-with-the-account</value>
        </property>
    

    b. Remove the following entries from core-site.xml:

        <property>
              <name>fs.azure.account.keyprovider.hdistoragecentral.blob.core.windows.net</name>
              <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
        </property>
    
        <property>
              <name>fs.azure.shellkeyprovider.script</name>
              <value>/usr/lib/python2.7/dist-packages/hdinsight_common/decrypt.sh</value>
        </property>
    
        <property>
              <name>net.topology.script.file.name</name>
              <value>/etc/hadoop/conf/topology_script.py</value>
        </property>
    

    c. Save the file.

  7. Add the main class for your application. From the Project Explorer, right-click src, point to New, and then select Scala class.

    Select the main class

  8. In the Create New Scala Class dialog box, provide a name, select Object in the Kind box, and then select OK.

    Create new Scala class

  9. In the MyClusterAppMain.scala file, paste the following code. This code creates the Spark context and opens an executeJob method from the SparkSample object.

     import org.apache.spark.{SparkConf, SparkContext}
    
     object SparkSampleMain {
       def main (arg: Array[String]): Unit = {
         val conf = new SparkConf().setAppName("SparkSample")
                                   .set("spark.hadoop.validateOutputSpecs", "false")
         val sc = new SparkContext(conf)
    
         SparkSample.executeJob(sc,
                                "wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv",
                                "wasb:///HVACOut")
       }
     }
    
  10. Repeat steps 8 and 9 to add a new Scala object called *SparkSample. Add the following code to this class. This code reads the data from the HVAC.csv (available in all HDInsight Spark clusters). It retrieves the rows that only have one digit in the seventh column in the CSV file, and then writes the output to /HVACOut under the default storage container for the cluster.

    import org.apache.spark.SparkContext
    
    object SparkSample {
     def executeJob (sc: SparkContext, input: String, output: String): Unit = {
       val rdd = sc.textFile(input)
    
       //find the rows which have only one digit in the 7th column in the CSV
       val rdd1 =  rdd.filter(s => s.split(",")(6).length() == 1)
    
       val s = sc.parallelize(rdd.take(5)).cartesian(rdd).count()
       println(s)
    
       rdd1.saveAsTextFile(output)
       //rdd1.collect().foreach(println)
     }
    }
    
  11. Repeat steps 8 and 9 to add a new class called RemoteClusterDebugging. This class implements the Spark test framework that is used to debug the applications. Add the following code to the RemoteClusterDebugging class:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.FunSuite
    
    class RemoteClusterDebugging extends FunSuite {
    
     test("Remote run") {
       val conf = new SparkConf().setAppName("SparkSample")
                                 .setMaster("yarn-client")
                                 .set("spark.yarn.am.extraJavaOptions", "-Dhdp.version=2.4")
                                 .set("spark.yarn.jar", "wasb:///hdp/apps/2.4.2.0-258/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar")
                                 .setJars(Seq("""C:\workspace\IdeaProjects\MyClusterApp\out\artifacts\MyClusterApp_DefaultArtifact\default_artifact.jar"""))
                                 .set("spark.hadoop.validateOutputSpecs", "false")
       val sc = new SparkContext(conf)
    
       SparkSample.executeJob(sc,
         "wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv",
         "wasb:///HVACOut")
     }
    }
    

    There are a couple of important things to note:

    • For .set("spark.yarn.jar", "wasb:///hdp/apps/2.4.2.0-258/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar"), make sure the Spark assembly JAR is available on the cluster storage at the specified path.
    • For setJars, specify the location where the artifact JAR is created. Typically, it is <Your IntelliJ project directory>\out\<project name>_DefaultArtifact\default_artifact.jar.
  12. In the*RemoteClusterDebugging class, right-click the test keyword, and then select Create RemoteClusterDebugging Configuration.

    Create a remote configuration

  13. In the Create RemoteClusterDebugging Configuration dialog box, provide a name for the configuration, and then select Test kind as the Test name. Leave all the other values as the default settings. Select Apply, and then select OK.

    Add the configuration details

  14. You should now see a Remote run configuration drop-down list in the menu bar.

    The Remote run drop-down list

Step 5: Run the application in debug mode

  1. In your IntelliJ IDEA project, open SparkSample.scala and create a breakpoint next to val rdd1. In the Create Breakpoint for pop-up menu, select line in function executeJob.

    Add a breakpoint

  2. To run the application, select the Debug Run button next to the Remote Run configuration drop-down list.

    Select the Debug Run button

  3. When the program execution reaches the breakpoint, you see a Debugger tab in the bottom pane.

    View the Debugger tab

  4. To add a watch, select the (+) icon.

    Select the + icon

    In this example, the application broke before the variable rdd1 was created. By using this watch, we can see the first five rows in the variable rdd. Select Enter.

    Run the program in debug mode

    What you see in the previous image is that at runtime, you might query terabytes of data and debug how your application progresses. For example, in the output shown in the previous image, you can see that the first row of the output is a header. Based on this output, you can modify your application code to skip the header row, if necessary.

  5. You can now select the Resume Program icon to proceed with your application run.

    Select Resume Program

  6. If the application finishes successfully, you should see output like the following:

    Console output

Next steps

Demo

Scenarios

Create and run applications

Tools and extensions

Manage resources