Learn how to build your .NET for Apache Spark application on Ubuntu

This article teaches you how to build your .NET for Apache Spark applications on Ubuntu.

Warning

.NET for Apache Spark targets an out-of-support version of .NET (.NET Core 3.1). For more information, see the .NET Support Policy.

Prerequisites

If you already have all of the following prerequisites, skip to the build steps.

  1. Download and install the .NET Core 3.1 SDK - installing the SDK adds the dotnet toolchain to your PATH. .NET Core 2.1, 2.2, and 3.1 are supported.
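
    • Verify you are able to run dotnet from your command line. The following command prints the installed SDK version:

      dotnet --version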

  2. Install OpenJDK 8.

    • You can use the following command:

      sudo apt install openjdk-8-jdk
    
    • Verify you are able to run java from your command line.

      Sample java -version output:

      openjdk version "1.8.0_191"
      OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12)
      OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
      
    • If you already have multiple OpenJDK versions installed and want to select OpenJDK 8, use the following command:

      sudo update-alternatives --config java
      
  3. Install Apache Maven 3.6.0+.

    • Run the following command:

      mkdir -p ~/bin/maven
      cd ~/bin/maven
      wget https://archive.apache.org/dist/maven/maven-3/3.6.0/binaries/apache-maven-3.6.0-bin.tar.gz
      tar -xvzf apache-maven-3.6.0-bin.tar.gz
      ln -s apache-maven-3.6.0 current
      export M2_HOME=~/bin/maven/current
      export PATH=${M2_HOME}/bin:${PATH}
      source ~/.bashrc
      

      Note that these environment variables will be lost when you close your terminal. If you want the changes to be permanent, add the export lines to your ~/.bashrc file (see the example after this list).

    • Verify you are able to run mvn from your command line.

      Sample mvn -version output:

      Apache Maven 3.6.0 (97c98ec64a1fdfee7767ce5ffb20918da4f719f3; 2018-10-24T18:41:47Z)
      Maven home: ~/bin/maven/apache-maven-3.6.0
      Java version: 1.8.0_191, vendor: Oracle Corporation, runtime: /usr/lib/jvm/java-8-openjdk-amd64/jre
      Default locale: en, platform encoding: UTF-8
      OS name: "linux", version: "4.4.0-17763-microsoft", arch: "amd64", family: "unix"
      
  4. Install Apache Spark 2.3+. Download Apache Spark and extract it into a local folder (e.g., ~/bin/spark-3.0.1-bin-hadoop2.7). (The supported Spark versions are 2.3.*, 2.4.0, 2.4.1, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 3.0.0, and 3.0.1.)

    mkdir -p ~/bin
    tar -xvzf /path/to/spark-3.0.1-bin-hadoop2.7.tgz -C ~/bin
    
    • Add the necessary environment variables SPARK_HOME (e.g., ~/bin/spark-3.0.1-bin-hadoop2.7) and PATH (e.g., $SPARK_HOME/bin:$PATH):

      export SPARK_HOME=~/bin/spark-3.0.1-bin-hadoop2.7
      export PATH="$SPARK_HOME/bin:$PATH"
      source ~/.bashrc
      

      Note that these environment variables will be lost when you close your terminal. If you want the changes to be permanent, add the export lines to your ~/.bashrc file (see the example after this list).

    • Verify you are able to run spark-shell from your command line.

      Sample console output:

      Welcome to
             ____              __
            / __/__  ___ _____/ /__
           _\ \/ _ \/ _ `/ __/  '_/
          /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
             /_/
      
      Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
      Type in expressions to have them evaluated.
      Type :help for more information.
      
      scala> sc
      res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@6eaa6b0c
      
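To make the Maven and Spark environment variables permanent, you can append the export lines to your ~/.bashrc, as the notes above mention. One way to do that, assuming the install locations used in this guide:

# Append the exports to ~/.bashrc so new terminals pick them up
echo 'export M2_HOME=~/bin/maven/current' >> ~/.bashrc
echo 'export PATH=${M2_HOME}/bin:${PATH}' >> ~/.bashrc
echo 'export SPARK_HOME=~/bin/spark-3.0.1-bin-hadoop2.7' >> ~/.bashrc
echo 'export PATH="$SPARK_HOME/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc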

Make sure you are able to run dotnet, java, mvn, and spark-shell from your command line before you move on to the next section. Feel there is a better way? Please open an issue and feel free to contribute.

Build

For the remainder of this guide, you will need to have cloned the .NET for Apache Spark repository onto your machine (e.g., into ~/dotnet.spark/):

git clone https://github.com/dotnet/spark.git ~/dotnet.spark

Build .NET for Spark Scala extensions layer

When you submit a .NET application, .NET for Apache Spark has the necessary logic written in Scala that informs Apache Spark how to handle your requests (e.g., a request to create a new Spark session, or a request to transfer data from the .NET side to the JVM side). This logic can be found in the .NET for Apache Spark Scala Source Code.

The next step is to build the .NET for Apache Spark Scala extension layer:

cd ~/dotnet.spark/src/scala
mvn clean package

You should see JARs created for the supported Spark versions:

  • microsoft-spark-2-3/target/microsoft-spark-2-3_2.11-<spark-dotnet-version>.jar
  • microsoft-spark-2-4/target/microsoft-spark-2-4_2.11-<spark-dotnet-version>.jar
  • microsoft-spark-3-0/target/microsoft-spark-3-0_2.12-<spark-dotnet-version>.jar
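
A quick way to confirm the build produced the JARs (the ls pattern assumes the module layout above and a clone at ~/dotnet.spark):

ls ~/dotnet.spark/src/scala/microsoft-spark-*/target/microsoft-spark-*.jar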

Build .NET sample applications using .NET CLI

This section explains how to build the sample applications for .NET for Apache Spark. These steps will help you understand the overall build process for any .NET for Spark application.

  1. Build the worker:

    cd ~/dotnet.spark/src/csharp/Microsoft.Spark.Worker/
    dotnet publish -f netcoreapp3.1 -r ubuntu.18.04-x64
    

    Sample console output:

    user@machine:/home/user/dotnet.spark/src/csharp/Microsoft.Spark.Worker$ dotnet publish -f netcoreapp3.1 -r ubuntu.18.04-x64
    Microsoft (R) Build Engine version 16.6.0+5ff7b0c9e for .NET Core
    Copyright (C) Microsoft Corporation. All rights reserved.
    
       Restore completed in 36.03 ms for /home/user/dotnet.spark/src/csharp/Microsoft.Spark.Worker/Microsoft.Spark.Worker.csproj.
       Restore completed in 35.94 ms for /home/user/dotnet.spark/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj.
       Microsoft.Spark -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark/Debug/netstandard2.1/Microsoft.Spark.dll
       Microsoft.Spark.Worker -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/ubuntu.18.04-x64/publish/Microsoft.Spark.Worker.dll
       Microsoft.Spark.Worker -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/ubuntu.18.04-x64/publish/
    
  2. Build the samples:

    cd ~/dotnet.spark/examples/Microsoft.Spark.CSharp.Examples/
    dotnet publish -f netcoreapp3.1 -r ubuntu.18.04-x64
    

    Sample console output:

    user@machine:/home/user/dotnet.spark/examples/Microsoft.Spark.CSharp.Examples$ dotnet publish -f netcoreapp3.1 -r ubuntu.18.04-x64
    Microsoft (R) Build Engine version 16.6.0+5ff7b0c9e for .NET Core
    Copyright (C) Microsoft Corporation. All rights reserved.
    
       Restore completed in 37.11 ms for /home/user/dotnet.spark/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj.
       Restore completed in 281.63 ms for /home/user/dotnet.spark/examples/Microsoft.Spark.CSharp.Examples/Microsoft.Spark.CSharp.Examples.csproj.
       Microsoft.Spark -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark/Debug/netstandard2.1/Microsoft.Spark.dll
       Microsoft.Spark.CSharp.Examples -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/ubuntu.18.04-x64/publish/Microsoft.Spark.CSharp.Examples.dll
       Microsoft.Spark.CSharp.Examples -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/ubuntu.18.04-x64/publish/
    
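If both publish steps succeed, the worker and example binaries land in the publish folders shown in the output above; a quick listing confirms it (paths assume the default Debug configuration):

ls ~/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/ubuntu.18.04-x64/publish
ls ~/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/ubuntu.18.04-x64/publish
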

Run the .NET for Spark sample applications

Once you've built the samples, you can use spark-submit to submit your .NET Core apps. Make sure you have followed the prerequisites section and installed Apache Spark.

  1. Set the DOTNET_WORKER_DIR or PATH environment variable to include the path where the Microsoft.Spark.Worker binary has been generated (e.g., ~/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/ubuntu.18.04-x64/publish).

    export DOTNET_WORKER_DIR=~/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/ubuntu.18.04-x64/publish
    
  2. Open a terminal and go to the directory where your app binary has been generated (e.g., ~/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/ubuntu.18.04-x64/publish).

    cd ~/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/ubuntu.18.04-x64/publish
    
  3. Running your app follows this basic structure:

    spark-submit \
      [--jars <any-jars-your-app-is-dependent-on>] \
      --class org.apache.spark.deploy.dotnet.DotnetRunner \
      --master local \
      <path-to-microsoft-spark-jar> \
      <path-to-your-app-binary> <argument(s)-to-your-app>
    

    Here are some examples you can run:

    • Microsoft.Spark.Examples.Sql.Batch.Basic

      spark-submit \
      --class org.apache.spark.deploy.dotnet.DotnetRunner \
      --master local \
      ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<spark_majorversion-spark_minorversion>_<scala_majorversion.scala_minorversion>-<spark_dotnet_version>.jar \
      Microsoft.Spark.CSharp.Examples Sql.Batch.Basic $SPARK_HOME/examples/src/main/resources/people.json
      
    • Microsoft.Spark.Examples.Sql.Streaming.StructuredNetworkWordCount (requires a running socket data server; see the netcat note after this list)

      spark-submit \
      --class org.apache.spark.deploy.dotnet.DotnetRunner \
      --master local \
      ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<spark_majorversion-spark_minorversion>_<scala_majorversion.scala_minorversion>-<spark_dotnet_version>.jar \
      Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredNetworkWordCount localhost 9999
      
    • Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (Maven accessible)

      spark-submit \
      --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2 \
      --class org.apache.spark.deploy.dotnet.DotnetRunner \
      --master local \
      ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<spark_majorversion-spark_minorversion>_<scala_majorversion.scala_minorversion>-<spark_dotnet_version>.jar \
      Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
      
    • Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (jars provided)

      spark-submit \
      --jars path/to/net.jpountz.lz4/lz4-1.3.0.jar,path/to/org.apache.kafka/kafka-clients-0.10.0.1.jar,path/to/org.apache.spark/spark-sql-kafka-0-10_2.11-2.3.2.jar,path/to/org.slf4j/slf4j-api-1.7.6.jar,path/to/org.spark-project.spark/unused-1.0.0.jar,path/to/org.xerial.snappy/snappy-java-1.1.2.6.jar \
      --class org.apache.spark.deploy.dotnet.DotnetRunner \
      --master local \
      ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<spark_majorversion-spark_minorversion>_<scala_majorversion.scala_minorversion>-<spark_dotnet_version>.jar \
      Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
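
The StructuredNetworkWordCount example reads lines from a TCP socket, so a data server must be listening on the host and port passed to the app. As in the Spark structured streaming guide, you can use netcat; start it in a separate terminal before running spark-submit:

# Start a simple TCP data server on port 9999 (matches the localhost 9999 arguments above)
nc -lk 9999

Lines you type into the netcat terminal are then counted and printed by the example.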