Create a Scala Maven application to run on Apache Spark cluster on HDInsight

Learn how to create a Spark application written in Scala using Maven with IntelliJ IDEA. The article uses Apache Maven as the build system and starts with an existing Maven archetype for Scala provided by IntelliJ IDEA. Creating a Scala application in IntelliJ IDEA involves the following steps:

  • Use Maven as the build system.
  • Update the Project Object Model (POM) file to resolve Spark module dependencies.
  • Write your application in Scala.
  • Generate a jar file that can be submitted to HDInsight Spark clusters.
  • Run the application on Spark cluster using Livy.

Note

HDInsight also provides an IntelliJ IDEA plugin tool to ease the process of creating and submitting applications to an HDInsight Spark cluster on Linux. For more information, see Use HDInsight Tools Plugin for IntelliJ IDEA to create and submit Spark applications.

Prerequisites

Install Scala plugin for IntelliJ IDEA

If the IntelliJ IDEA installation did not prompt you to enable the Scala plugin, launch IntelliJ IDEA and go through the following steps to install the plugin:

  1. Start IntelliJ IDEA, and from the welcome screen, click Configure and then click Plugins.

    Enable scala plugin

  2. In the next screen, click Install JetBrains plugin from the lower left corner. In the Browse JetBrains Plugins dialog box that opens, search for Scala and then click Install.

    Install scala plugin

  3. After the plugin installs successfully, click the Restart IntelliJ IDEA button to restart the IDE.

Create a standalone Scala project

  1. Launch IntelliJ IDEA and create a new project. In the new project dialog box, make the following choices, and then click Next.

    Create Maven project

    • Select Maven as the project type.
    • Specify a Project SDK. Click New and navigate to the Java installation directory, typically C:\Program Files\Java\jdk1.8.0_66.
    • Select the Create from archetype option.
    • From the list of archetypes, select org.scala-tools.archetypes:scala-archetype-simple. This creates the right directory structure and downloads the required default dependencies to write a Scala program.
  2. Provide relevant values for GroupId, ArtifactId, and Version. Click Next.
  3. In the next dialog box, where you specify Maven home directory and other user settings, accept the defaults and click Next.
  4. In the last dialog box, specify a project name and location and then click Finish.
  5. Delete the MySpec.scala file at src\test\scala\com\microsoft\spark\example. You do not need this file for the application.
  6. If required, rename the default source and test files. From the left pane in IntelliJ IDEA, navigate to src\main\scala\com.microsoft.spark.example. Right-click App.scala, click Refactor, click Rename file, and in the dialog box, provide the new name for the application, and then click Refactor.

    Rename files

  7. In the subsequent steps, you will update the pom.xml to define the dependencies for the Spark Scala application. For those dependencies to be downloaded and resolved automatically, you must configure Maven accordingly.

    Configure Maven for automatic downloads

    1. From the File menu, click Settings.
    2. In the Settings dialog box, navigate to Build, Execution, Deployment > Build Tools > Maven > Importing.
    3. Select the option to Import Maven projects automatically.
    4. Click Apply, and then click OK.
  8. Update the Scala source file to include your application code. Open the renamed Scala source file, replace the existing sample code with the following code, and save the changes. This code reads the data from HVAC.csv (available on all HDInsight Spark clusters), retrieves the rows that have only one digit in the seventh column, and writes the output to /HVACOut under the default storage container for the cluster.

     package com.microsoft.spark.example
    
     import org.apache.spark.SparkConf
     import org.apache.spark.SparkContext
    
     /**
       * Test IO to wasb
       */
     object WasbIOTest {
       def main (arg: Array[String]): Unit = {
         val conf = new SparkConf().setAppName("WASBIOTest")
         val sc = new SparkContext(conf)
    
         val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
    
         //find the rows which have only one digit in the 7th column in the CSV
         val rdd1 = rdd.filter(s => s.split(",")(6).length() == 1)
    
         rdd1.saveAsTextFile("wasb:///HVACout")
       }
     }
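
    To sanity-check the filter condition before you build the jar, you can apply the same logic to a single sample row in the Scala REPL (or an IntelliJ Scala worksheet), as shown in the following lines. The row values here are illustrative only; they are not taken from HVAC.csv.

     // Illustrative only: a made-up row with seven comma-separated fields.
     val sampleRow = "6/1/13,0:00:01,66,58,13,20,4"

     // The seventh field (index 6) is "4", a single digit, so this row passes the filter.
     val keep = sampleRow.split(",")(6).length() == 1
     println(keep)   // true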
    
  9. Update the pom.xml.

    1. Within <project>\<properties> add the following:

      <scala.version>2.10.4</scala.version>
      <scala.compat.version>2.10.4</scala.compat.version>
      <scala.binary.version>2.10</scala.binary.version>
      
    2. Within <project>\<dependencies> add the following:

       <dependency>
         <groupId>org.apache.spark</groupId>
         <artifactId>spark-core_${scala.binary.version}</artifactId>
         <version>1.4.1</version>
       </dependency>
      

      Save changes to pom.xml.

  10. Create the .jar file. IntelliJ IDEA enables creation of a JAR file as an artifact of a project. Perform the following steps.

    1. From the File menu, click Project Structure.
    2. In the Project Structure dialog box, click Artifacts and then click the plus symbol. From the pop-up dialog box, click JAR, and then click From modules with dependencies.

      Create JAR

    3. In the Create JAR from Modules dialog box, click the ellipsis button next to Main Class.
    4. In the Select Main Class dialog box, select the class that appears by default and then click OK.

      Create JAR

    5. In the Create JAR from Modules dialog box, make sure that the option to extract to the target JAR is selected, and then click OK. This creates a single JAR with all dependencies.

      Create JAR

    6. The output layout tab lists all the jars that are included as part of the Maven project. You can select and delete the ones on which the Scala application has no direct dependency. For the application we are creating here, you can remove all but the last one (SparkSimpleApp compile output). Select the jars to delete and then click the Delete icon.

      Create JAR

      Make sure the Build on make option is selected, which ensures that the jar is created every time the project is built or updated. Click Apply, and then click OK.

    7. From the menu bar, click Build, and then click Make Project. You can also click Build Artifacts to create the jar. The output jar is created under \out\artifacts.

      Create JAR

Run the application on the Spark cluster

To run the application on the cluster, you must copy the application jar to the default storage container associated with the cluster, and then submit the jar as a batch job to the Spark cluster by using Livy. The next article in this series walks through both steps.
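
As a preview of that workflow, the following sketch shows one way to post the jar as a Livy batch job from Scala. This is a minimal illustration rather than the procedure from this article: the cluster name, the cluster admin credentials, and the wasb:// path to the uploaded jar are placeholders that you must replace, and it assumes the cluster's Livy endpoint is available at https://<clustername>.azurehdinsight.net/livy/batches.

    // Minimal sketch: submit the application jar as a Livy batch job over HTTPS.
    // CLUSTERNAME, ADMINUSER, ADMINPASSWORD, and the jar path are placeholders.
    import java.net.{HttpURLConnection, URL}
    import java.nio.charset.StandardCharsets
    import java.util.Base64

    object SubmitWithLivy {
      def main(args: Array[String]): Unit = {
        val livyUrl = "https://CLUSTERNAME.azurehdinsight.net/livy/batches"
        val payload =
          """{"file":"wasb:///example/jars/SparkSimpleApp.jar","className":"com.microsoft.spark.example.WasbIOTest"}"""
        val credentials = Base64.getEncoder.encodeToString(
          "ADMINUSER:ADMINPASSWORD".getBytes(StandardCharsets.UTF_8))

        val connection = new URL(livyUrl).openConnection().asInstanceOf[HttpURLConnection]
        connection.setRequestMethod("POST")
        connection.setRequestProperty("Content-Type", "application/json")
        connection.setRequestProperty("Authorization", s"Basic $credentials")
        connection.setDoOutput(true)

        val out = connection.getOutputStream
        out.write(payload.getBytes(StandardCharsets.UTF_8))
        out.close()

        // A 201 (Created) response means Livy accepted the batch job.
        println(s"Livy responded with HTTP ${connection.getResponseCode}")
      }
    }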

Next step

In this article, you learned how to create a Spark Scala application. Advance to the next article to learn how to run this application on an HDInsight Spark cluster by using Livy.