Learn how to build your .NET for Apache Spark application on Ubuntu
This article teaches you how to build your .NET for Apache Spark applications on Ubuntu.
Warning
.NET for Apache Spark targets an out-of-support version of .NET (.NET Core 3.1). For more information, see the .NET Support Policy.
Prerequisites
If you already have all of the following prerequisites, skip to the build steps.
Download and install .NET Core 3.1 SDK - installing the SDK adds the `dotnet` toolchain to your path. .NET Core 2.1, 2.2, and 3.1 are supported.
Install OpenJDK 8.
- You can use the following command:
sudo apt install openjdk-8-jdk
Verify you are able to run `java` from your command line. Sample `java -version` output:
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
If you already have multiple OpenJDK versions installed and want to select OpenJDK 8, use the following command:
sudo update-alternatives --config java
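If you want to double-check which JDK is selected, or pin `JAVA_HOME` for Maven, a minimal sketch follows. The `/usr/lib/jvm/java-8-openjdk-amd64` path is the default location for the Ubuntu OpenJDK 8 package (it also appears in the sample `mvn -version` output below), but verify it on your machine:

```bash
# Show the JDK binary that "java" currently resolves to.
readlink -f "$(which java)"

# Optionally pin JAVA_HOME to the OpenJDK 8 installation
# (default package path on Ubuntu; adjust if yours differs).
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```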
Install Apache Maven 3.6.0+.
Run the following command:
mkdir -p ~/bin/maven
cd ~/bin/maven
wget https://archive.apache.org/dist/maven/maven-3/3.6.0/binaries/apache-maven-3.6.0-bin.tar.gz
tar -xvzf apache-maven-3.6.0-bin.tar.gz
ln -s apache-maven-3.6.0 current
export M2_HOME=~/bin/maven/current
export PATH=${M2_HOME}/bin:${PATH}
source ~/.bashrc
Note that these environment variables will be lost when you close your terminal. If you want the changes to be permanent, add the `export` lines to your `~/.bashrc` file (see the example after the `mvn -version` output below).
Verify you are able to run `mvn` from your command line. Sample `mvn -version` output:
Apache Maven 3.6.0 (97c98ec64a1fdfee7767ce5ffb20918da4f719f3; 2018-10-24T18:41:47Z)
Maven home: ~/bin/apache-maven-3.6.0
Java version: 1.8.0_191, vendor: Oracle Corporation, runtime: /usr/lib/jvm/java-8-openjdk-amd64/jre
Default locale: en, platform encoding: UTF-8
OS name: "linux", version: "4.4.0-17763-microsoft", arch: "amd64", family: "unix"
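One way to persist the Maven environment variables is to append the `export` lines to `~/.bashrc`; this is a minimal sketch, and the same pattern applies to the Spark variables set later in this guide:

```bash
# Append the export lines to ~/.bashrc so new shells pick them up.
cat >> ~/.bashrc <<'EOF'
export M2_HOME=~/bin/maven/current
export PATH=${M2_HOME}/bin:${PATH}
EOF

# Reload the file in the current shell.
source ~/.bashrc
```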
Install Apache Spark 2.3+. Download Apache Spark 2.3+ and extract it into a local folder (e.g., `~/bin/spark-3.0.1-bin-hadoop2.7`). (The supported Spark versions are 2.3.*, 2.4.0, 2.4.1, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 3.0.0, and 3.0.1.)
tar -xvzf /path/to/spark-3.0.1-bin-hadoop2.7.tgz -C ~/bin
Add the necessary environment variables `SPARK_HOME` (e.g., `~/bin/spark-3.0.1-bin-hadoop2.7/`) and `PATH` (e.g., `$SPARK_HOME/bin:$PATH`):
export SPARK_HOME=~/bin/spark-3.0.1-bin-hadoop2.7
export PATH="$SPARK_HOME/bin:$PATH"
source ~/.bashrc
Note that these environment variables will be lost when you close your terminal. If you want the changes to be permanent, add the `export` lines to your `~/.bashrc` file.
Verify you are able to run `spark-shell` from your command line. Sample console output:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@6eaa6b0c
Make sure you are able to run `dotnet`, `java`, `mvn`, and `spark-shell` from your command line before you move to the next section (for example, with the quick check below). Feel there is a better way? Please open an issue and feel free to contribute.
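A minimal sketch for that check, assuming all four tools should already be on your `PATH`:

```bash
# Report any required tool that is missing from PATH.
for cmd in dotnet java mvn spark-shell; do
    command -v "$cmd" > /dev/null || echo "Missing: $cmd"
done
```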
Build
For the remainder of this guide, you will need to have cloned the .NET for Apache Spark repository onto your machine, e.g., into `~/dotnet.spark/`.
git clone https://github.com/dotnet/spark.git ~/dotnet.spark
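If you prefer to build from a released version rather than the latest development branch, you can list the repository's tags and check one out; the specific tag name is up to you (the `<tag-name>` placeholder below is not a real tag):

```bash
cd ~/dotnet.spark

# List available release tags, then check out the one you want to build.
git tag --list
git checkout <tag-name>
```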
Build .NET for Spark Scala extensions layer
When you submit a .NET application, .NET for Apache Spark has the necessary logic written in Scala that informs Apache Spark how to handle your requests (e.g., a request to create a new Spark session, a request to transfer data from the .NET side to the JVM side, etc.). This logic can be found in the .NET for Apache Spark Scala Source Code.
The next step is to build the .NET for Apache Spark Scala extension layer:
cd ~/dotnet.spark/src/scala
mvn clean package
You should see JARs created for the supported Spark versions:
microsoft-spark-2-3/target/microsoft-spark-2-3_2.11-<spark-dotnet-version>.jar
microsoft-spark-2-4/target/microsoft-spark-2-4_2.11-<spark-dotnet-version>.jar
microsoft-spark-3-0/target/microsoft-spark-3-0_2.12-<spark-dotnet-version>.jar
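A quick way to confirm the JARs were produced, assuming the repository was cloned to `~/dotnet.spark` as above:

```bash
# List every microsoft-spark JAR produced by the Maven build.
find ~/dotnet.spark/src/scala -path "*/target/*" -name "microsoft-spark-*.jar"
```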
Build .NET sample applications using .NET CLI
This section explains how to build the sample applications for .NET for Apache Spark. These steps will help you understand the overall build process for any .NET for Spark application.
Build the worker:
cd ~/dotnet.spark/src/csharp/Microsoft.Spark.Worker/
dotnet publish -f netcoreapp3.1 -r ubuntu.18.04-x64
Sample console output:
user@machine:/home/user/dotnet.spark/src/csharp/Microsoft.Spark.Worker$ dotnet publish -f netcoreapp3.1 -r ubuntu.18.04-x64
Microsoft (R) Build Engine version 16.6.0+5ff7b0c9e for .NET Core
Copyright (C) Microsoft Corporation. All rights reserved.

  Restore completed in 36.03 ms for /home/user/dotnet.spark/src/csharp/Microsoft.Spark.Worker/Microsoft.Spark.Worker.csproj.
  Restore completed in 35.94 ms for /home/user/dotnet.spark/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj.
  Microsoft.Spark -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark/Debug/netstandard2.1/Microsoft.Spark.dll
  Microsoft.Spark.Worker -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/ubuntu.18.04-x64/publish/Microsoft.Spark.Worker.dll
  Microsoft.Spark.Worker -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/ubuntu.18.04-x64/publish/
Build the samples:
cd ~/dotnet.spark/examples/Microsoft.Spark.CSharp.Examples/
dotnet publish -f netcoreapp3.1 -r ubuntu.18.04-x64
Sample console output:
user@machine:/home/user/dotnet.spark/examples/Microsoft.Spark.CSharp.Examples$ dotnet publish -f netcoreapp3.1 -r ubuntu.18.04-x64
Microsoft (R) Build Engine version 16.6.0+5ff7b0c9e for .NET Core
Copyright (C) Microsoft Corporation. All rights reserved.

  Restore completed in 37.11 ms for /home/user/dotnet.spark/src/csharp/Microsoft.Spark/Microsoft.Spark.csproj.
  Restore completed in 281.63 ms for /home/user/dotnet.spark/examples/Microsoft.Spark.CSharp.Examples/Microsoft.Spark.CSharp.Examples.csproj.
  Microsoft.Spark -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark/Debug/netstandard2.1/Microsoft.Spark.dll
  Microsoft.Spark.CSharp.Examples -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/ubuntu.18.04-x64/publish/Microsoft.Spark.CSharp.Examples.dll
  Microsoft.Spark.CSharp.Examples -> /home/user/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/ubuntu.18.04-x64/publish/
Run the .NET for Spark sample applications
Once you build the samples, you can use `spark-submit` to submit your .NET Core apps. Make sure you have followed the prerequisites section and installed Apache Spark.
Set the `DOTNET_WORKER_DIR` or `PATH` environment variable to include the path where the `Microsoft.Spark.Worker` binary has been generated (e.g., `~/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/ubuntu.18.04-x64/publish`).
export DOTNET_WORKER_DIR=~/dotnet.spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/ubuntu.18.04-x64/publish
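As a quick sanity check (a sketch, assuming the default Debug publish path shown above), confirm the variable is set and that the worker binary is actually there:

```bash
# Confirm the variable is set and the worker binary was published to it.
echo "$DOTNET_WORKER_DIR"
ls "$DOTNET_WORKER_DIR"/Microsoft.Spark.Worker*
```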
Open a terminal and go to the directory where your app binary has been generated (e.g., `~/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/ubuntu.18.04-x64/publish`).
cd ~/dotnet.spark/artifacts/bin/Microsoft.Spark.CSharp.Examples/Debug/netcoreapp3.1/ubuntu.18.04-x64/publish
Running your app follows the basic structure:
spark-submit \
  [--jars <any-jars-your-app-is-dependent-on>] \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master local \
  <path-to-microsoft-spark-jar> \
  <path-to-your-app-binary> <argument(s)-to-your-app>
Here are some examples you can run:
Microsoft.Spark.Examples.Sql.Batch.Basic
spark-submit \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master local \
  ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<spark_majorversion-spark_minorversion>_<scala_majorversion.scala_minorversion>-<spark_dotnet_version>.jar \
  Microsoft.Spark.CSharp.Examples Sql.Batch.Basic $SPARK_HOME/examples/src/main/resources/people.json
Microsoft.Spark.Examples.Sql.Streaming.StructuredNetworkWordCount
spark-submit \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master local \
  ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<spark_majorversion-spark_minorversion>_<scala_majorversion.scala_minorversion>-<spark_dotnet_version>.jar \
  Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredNetworkWordCount localhost 9999
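The word-count example reads from a TCP socket, so before submitting it you need something listening on localhost:9999. The usual approach from the Spark structured streaming documentation is netcat:

```bash
# In a separate terminal, start a local server on port 9999 and type
# lines of text into it; the example counts the words it receives.
nc -lk 9999
```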
Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (maven accessible)
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2 \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master local \
  ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<spark_majorversion-spark_minorversion>_<scala_majorversion.scala_minorversion>-<spark_dotnet_version>.jar \
  Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (jars provided)
spark-submit \
  --jars path/to/net.jpountz.lz4/lz4-1.3.0.jar,path/to/org.apache.kafka/kafka-clients-0.10.0.1.jar,path/to/org.apache.spark/spark-sql-kafka-0-10_2.11-2.3.2.jar,path/to/org.slf4j/slf4j-api-1.7.6.jar,path/to/org.spark-project.spark/unused-1.0.0.jar,path/to/org.xerial.snappy/snappy-java-1.1.2.6.jar \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master local \
  ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<spark_majorversion-spark_minorversion>_<scala_majorversion.scala_minorversion>-<spark_dotnet_version>.jar \
  Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
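Both Kafka examples assume a broker on localhost:9092 with a topic named test. If you need to set that up locally, a sketch using the standard Kafka CLI tools looks like the following; it assumes a recent Kafka distribution (2.5+) where both tools accept the --bootstrap-server flag:

```bash
# Create the "test" topic on the local broker.
kafka-topics.sh --create --bootstrap-server localhost:9092 \
  --replication-factor 1 --partitions 1 --topic test

# Produce some messages for the example to count (type lines, Ctrl+C to stop).
kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test
```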