Learn how to build your .NET for Apache Spark application on Windows
This article teaches you how to build your .NET for Apache Spark applications on Windows.
Warning
.NET for Apache Spark targets an out-of-support version of .NET (.NET Core 3.1). For more information, see the .NET Support Policy.
Prerequisites
If you already have all of the following prerequisites, skip to the build steps.
- Download and install the .NET Core SDK. Installing the SDK adds the `dotnet` toolchain to your path.
- Install Visual Studio 2019 (version 16.3 or later). The Community version is free. When configuring your installation, include these components:
  - .NET desktop development
    - All Required Components
    - .NET Framework 4.6.1 Development Tools
  - .NET Core cross-platform development
    - All Required Components
- Install Java 1.8. Select the appropriate version for your operating system. For example, jdk-8u201-windows-x64.exe for a Windows x64 machine. Install using the installer and verify that you're able to run `java` from your command line.
- Install Apache Maven 3.6.0+.
  - Download Apache Maven and extract it to a local directory. For example, *C:\bin\apache-maven*.
  - Add Apache Maven to your PATH environment variable. For example, C:\bin\apache-maven\bin.
  - Verify that you're able to run `mvn` from your command line.
- Install Apache Spark 2.3+. Download Apache Spark 2.3+ and extract it into a local folder (for example, *C:\bin\spark-3.0.1-bin-hadoop2.7*) using 7-Zip. (The supported Spark versions are 2.3.*, 2.4.0, 2.4.1, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 3.0.0, and 3.0.1.)
  - Add a new environment variable `SPARK_HOME`. For example, *C:\bin\spark-3.0.1-bin-hadoop2.7*:

    ```
    set SPARK_HOME=C:\bin\spark-3.0.1-bin-hadoop2.7\
    ```

  - Add Apache Spark to your PATH environment variable. For example, C:\bin\spark-3.0.1-bin-hadoop2.7\bin:

    ```
    set PATH=%SPARK_HOME%\bin;%PATH%
    ```
  - Verify you are able to run `spark-shell` from your command line. Sample console output:

    ```
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
          /_/

    Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
    Type in expressions to have them evaluated.
    Type :help for more information.

    scala> sc
    res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@6eaa6b0c
    ```
- Install WinUtils.
  - Download the `winutils.exe` binary from the WinUtils repository. Select the version of Hadoop that the Spark distribution was compiled with. For example, use hadoop-2.7.1 for Spark 3.0.1.
  - Save the `winutils.exe` binary to a directory of your choice. For example, C:\hadoop\bin.
  - Set `HADOOP_HOME` to reflect the directory with winutils.exe (without bin). For instance, using the command line:

    ```
    set HADOOP_HOME=C:\hadoop
    ```

  - Set the PATH environment variable to include `%HADOOP_HOME%\bin`. For instance, using the command line:

    ```
    set PATH=%HADOOP_HOME%\bin;%PATH%
    ```
Make sure you're able to run `dotnet`, `java`, `mvn`, and `spark-shell` from your command line before you move to the next section.
Note
A new instance of the command line may be required if you updated any environment variables.
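As a quick sanity check, you can print each tool's version from a fresh command prompt. This is only a sketch; the exact version output will vary with your installation.

```shell
:: Each command should print a version rather than a "not recognized" error.
:: Run this from a NEW command prompt so environment-variable changes apply.
dotnet --version
java -version
mvn -version
spark-submit --version

:: Confirm the variables set during the prerequisites are visible.
echo %SPARK_HOME%
echo %HADOOP_HOME%
```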
Build
For the remainder of this guide, you will need to have cloned the .NET for Apache Spark repository onto your machine. You can choose any location for the cloned repository. For example, C:\github\dotnet-spark\.
git clone https://github.com/dotnet/spark.git C:\github\dotnet-spark
Build .NET for Apache Spark Scala extensions layer
When you submit a .NET application, .NET for Apache Spark has the necessary logic written in Scala that informs Apache Spark how to handle your requests (for example, a request to create a new Spark session, or a request to transfer data from the .NET side to the JVM side). This logic can be found in the .NET for Spark Scala source code.
Regardless of whether you are using .NET Framework or .NET Core, you will need to build the .NET for Apache Spark Scala extension layer:
```
cd src\scala
mvn clean package
```
You should see JARs created for the supported Spark versions:

```
microsoft-spark-2-3\target\microsoft-spark-2-3_2.11-<spark-dotnet-version>.jar
microsoft-spark-2-4\target\microsoft-spark-2-4_2.11-<spark-dotnet-version>.jar
microsoft-spark-3-0\target\microsoft-spark-3-0_2.12-<spark-dotnet-version>.jar
```
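As a quick check, you can list the JARs that Maven produced under each profile's target directory. This is a sketch; adjust the path if you cloned the repository somewhere other than C:\github\dotnet-spark.

```shell
:: List every microsoft-spark JAR produced by the Maven build.
cd C:\github\dotnet-spark\src\scala
dir /s /b microsoft-spark-*\target\*.jar
```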
Build the .NET for Spark sample applications
This section explains how to build the sample applications for .NET for Apache Spark. These steps will help in understanding the overall building process for any .NET for Spark application.
Use Visual Studio for .NET Framework
Open `src\csharp\Microsoft.Spark.sln` in Visual Studio and build the `Microsoft.Spark.CSharp.Examples` project under the `examples` folder (this will in turn build the .NET bindings project as well). If you want, you can write your own code in the `Microsoft.Spark.Examples` project (the *input_file.json* in this example is a JSON file with the data you want to create the DataFrame with):

```csharp
// Instantiate a session
var spark = SparkSession
    .Builder()
    .AppName("Hello Spark!")
    .GetOrCreate();

// Create initial DataFrame from a JSON file
DataFrame df = spark.Read().Json("input_file.json");

// Print schema
df.PrintSchema();

// Apply a filter and show results
df.Filter(df["age"] > 21).Show();
```
Once the build is successful, you will see the appropriate binaries produced in the output directory. Sample console output:
```
Directory: C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\net461

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----         3/6/2019  12:18 AM         125440 Apache.Arrow.dll
-a----        3/16/2019  12:00 AM          13824 Microsoft.Spark.CSharp.Examples.exe
-a----        3/16/2019  12:00 AM          19423 Microsoft.Spark.CSharp.Examples.exe.config
-a----        3/16/2019  12:00 AM           2720 Microsoft.Spark.CSharp.Examples.pdb
-a----        3/16/2019  12:00 AM         143360 Microsoft.Spark.dll
-a----        3/16/2019  12:00 AM          63388 Microsoft.Spark.pdb
-a----        3/16/2019  12:00 AM          34304 Microsoft.Spark.Worker.exe
-a----        3/16/2019  12:00 AM          19423 Microsoft.Spark.Worker.exe.config
-a----        3/16/2019  12:00 AM          11900 Microsoft.Spark.Worker.pdb
-a----        3/16/2019  12:00 AM          23552 Microsoft.Spark.Worker.xml
-a----        3/16/2019  12:00 AM         332363 Microsoft.Spark.xml
------------------------------------------- More framework files -------------------------------------
```
Use .NET Core CLI for .NET Core
Build the worker:
```
cd C:\github\dotnet-spark\src\csharp\Microsoft.Spark.Worker\
dotnet publish -f netcoreapp3.1 -r win-x64
```
Sample console output:
```
PS C:\github\dotnet-spark\src\csharp\Microsoft.Spark.Worker> dotnet publish -f netcoreapp3.1 -r win-x64
Microsoft (R) Build Engine version 16.6.0+5ff7b0c9e for .NET Core
Copyright (C) Microsoft Corporation. All rights reserved.

  Restore completed in 299.95 ms for C:\github\dotnet-spark\src\csharp\Microsoft.Spark\Microsoft.Spark.csproj.
  Restore completed in 306.62 ms for C:\github\dotnet-spark\src\csharp\Microsoft.Spark.Worker\Microsoft.Spark.Worker.csproj.
  Microsoft.Spark -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark\Debug\netstandard2.1\Microsoft.Spark.dll
  Microsoft.Spark.Worker -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\x64\Debug\netcoreapp3.1\win-x64\Microsoft.Spark.Worker.dll
  Microsoft.Spark.Worker -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\x64\Debug\netcoreapp3.1\win-x64\publish\
```
Build the samples:
```
cd C:\github\dotnet-spark\examples\Microsoft.Spark.CSharp.Examples\
dotnet publish -f netcoreapp3.1 -r win-x64
```
Sample console output:
```
PS C:\github\dotnet-spark\examples\Microsoft.Spark.CSharp.Examples> dotnet publish -f netcoreapp3.1 -r win-x64
Microsoft (R) Build Engine version 16.6.0+5ff7b0c9e for .NET Core
Copyright (C) Microsoft Corporation. All rights reserved.

  Restore completed in 44.22 ms for C:\github\dotnet-spark\src\csharp\Microsoft.Spark\Microsoft.Spark.csproj.
  Restore completed in 336.94 ms for C:\github\dotnet-spark\examples\Microsoft.Spark.CSharp.Examples\Microsoft.Spark.CSharp.Examples.csproj.
  Microsoft.Spark -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark\Debug\netstandard2.1\Microsoft.Spark.dll
  Microsoft.Spark.CSharp.Examples -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\x64\Debug\netcoreapp3.1\win-x64\Microsoft.Spark.CSharp.Examples.dll
  Microsoft.Spark.CSharp.Examples -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\x64\Debug\netcoreapp3.1\win-x64\publish\
```
Run the .NET for Spark sample applications
Once you build the samples, you run them through `spark-submit` regardless of whether you're targeting .NET Framework or .NET Core. Make sure you have followed the prerequisites section and installed Apache Spark.
Set the `DOTNET_WORKER_DIR` or `PATH` environment variable to include the path where the `Microsoft.Spark.Worker` binary has been generated (for example, C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\Debug\net461 for .NET Framework, or C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\x64\Debug\netcoreapp3.1\win-x64\publish for .NET Core):

```
set DOTNET_WORKER_DIR=C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\x64\Debug\netcoreapp3.1\win-x64\publish
```
Open PowerShell and go to the directory where your app binary has been generated (for example, C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\net461 for .NET Framework, or C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\x64\Debug\netcoreapp3.1\win-x64\publish for .NET Core):
cd C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\x64\Debug\netcoreapp3.1\win-x64\publish
Running your app follows the basic structure:
```
spark-submit.cmd `
    [--jars <any-jars-your-app-is-dependent-on>] `
    --class org.apache.spark.deploy.dotnet.DotnetRunner `
    --master local `
    <path-to-microsoft-spark-jar> `
    <path-to-your-app-exe> <argument(s)-to-your-app>
```
Here are some examples you can run:
Microsoft.Spark.Examples.Sql.Batch.Basic
```
spark-submit.cmd `
    --class org.apache.spark.deploy.dotnet.DotnetRunner `
    --master local `
    C:\github\dotnet-spark\src\scala\microsoft-spark-<version>\target\microsoft-spark-<spark_majorversion-spark_minorversion>_<scala_majorversion.scala_minorversion>-<spark_dotnet_version>.jar `
    Microsoft.Spark.CSharp.Examples.exe Sql.Batch.Basic %SPARK_HOME%\examples\src\main\resources\people.json
```
Microsoft.Spark.Examples.Sql.Streaming.StructuredNetworkWordCount
```
spark-submit.cmd `
    --class org.apache.spark.deploy.dotnet.DotnetRunner `
    --master local `
    C:\github\dotnet-spark\src\scala\microsoft-spark-<version>\target\microsoft-spark-<spark_majorversion-spark_minorversion>_<scala_majorversion.scala_minorversion>-<spark_dotnet_version>.jar `
    Microsoft.Spark.CSharp.Examples.exe Sql.Streaming.StructuredNetworkWordCount localhost 9999
```
Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (maven accessible)
```
spark-submit.cmd `
    --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2 `
    --class org.apache.spark.deploy.dotnet.DotnetRunner `
    --master local `
    C:\github\dotnet-spark\src\scala\microsoft-spark-<version>\target\microsoft-spark-<spark_majorversion-spark_minorversion>_<scala_majorversion.scala_minorversion>-<spark_dotnet_version>.jar `
    Microsoft.Spark.CSharp.Examples.exe Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
```
Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (jars provided)
```
spark-submit.cmd `
    --jars path\to\net.jpountz.lz4\lz4-1.3.0.jar,path\to\org.apache.kafka\kafka-clients-0.10.0.1.jar,path\to\org.apache.spark\spark-sql-kafka-0-10_2.11-2.3.2.jar,`
path\to\org.slf4j\slf4j-api-1.7.6.jar,path\to\org.spark-project.spark\unused-1.0.0.jar,path\to\org.xerial.snappy\snappy-java-1.1.2.6.jar `
    --class org.apache.spark.deploy.dotnet.DotnetRunner `
    --master local `
    C:\github\dotnet-spark\src\scala\microsoft-spark-<version>\target\microsoft-spark-<spark_majorversion-spark_minorversion>_<scala_majorversion.scala_minorversion>-<spark_dotnet_version>.jar `
    Microsoft.Spark.CSharp.Examples.exe Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
```