Learn how to build your .NET for Apache Spark application on Windows

This article teaches you how to build your .NET for Apache Spark applications on Windows.

Prerequisites

If you already have all of the following prerequisites, skip to the build steps.

  1. Download and install the .NET Core SDK. Installing the SDK adds the dotnet toolchain to your PATH. .NET Core 2.1, 2.2, and 3.1 are supported.

  2. Install Visual Studio 2019 (Version 16.3 or later). The Community version is completely free. When configuring your installation, include these components at minimum:

    • .NET desktop development
      • All Required Components
        • .NET Framework 4.6.1 Development Tools
    • .NET Core cross-platform development
      • All Required Components
  3. Install Java 1.8.

    • Select the appropriate version for your operating system, for example, jdk-8u201-windows-x64.exe for a 64-bit Windows machine.
    • Install using the installer and verify you are able to run java from your command-line.
  4. Install Apache Maven 3.6.0+.

    • Download Apache Maven 3.6.0.
    • Extract to a local directory e.g., C:\bin\apache-maven-3.6.0\.
    • Add Apache Maven to your PATH environment variable e.g., C:\bin\apache-maven-3.6.0\bin.
    • Verify you are able to run mvn from your command-line.
  5. Install Apache Spark 2.3+.

    • Download Apache Spark 2.3+ and extract it into a local folder (e.g., C:\bin\spark-2.3.2-bin-hadoop2.7\) using 7-Zip. The supported Spark versions are 2.3.*, 2.4.0, 2.4.1, 2.4.3, and 2.4.4.

    • Add a new environment variable SPARK_HOME e.g., C:\bin\spark-2.3.2-bin-hadoop2.7\.

      set SPARK_HOME=C:\bin\spark-2.3.2-bin-hadoop2.7\       
      
    • Add Apache Spark to your PATH environment variable e.g., C:\bin\spark-2.3.2-bin-hadoop2.7\bin.

      set PATH=%SPARK_HOME%\bin;%PATH%
      
    • Verify you are able to run spark-shell from your command-line.
      Sample console output:

      Welcome to
             ____              __
            / __/__  ___ _____/ /__
           _\ \/ _ \/ _ `/ __/  '_/
          /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
             /_/
      
      Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
      Type in expressions to have them evaluated.
      Type :help for more information.
      
      scala> sc
      res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@6eaa6b0c
      
  6. Install WinUtils.

    • Download the winutils.exe binary from the WinUtils repository. Select the version of Hadoop that your Spark distribution was compiled with; for example, use hadoop-2.7.1 for Spark 2.3.2.

    • Save the winutils.exe binary to a directory of your choice, e.g., C:\hadoop\bin.

    • Set HADOOP_HOME to reflect the directory with winutils.exe (without bin). For instance, using command-line:

      set HADOOP_HOME=C:\hadoop
      
    • Set PATH environment variable to include %HADOOP_HOME%\bin. For instance, using command-line:

      set PATH=%HADOOP_HOME%\bin;%PATH%
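
The set commands shown in these steps affect only the command-line session in which they are run. As a minimal sketch, assuming you work in PowerShell and use the example install locations from this article, the two home variables could also be persisted for the current user. The variable names $sparkHome and $hadoopHome are illustrative, and the persisted PATH is deliberately left untouched:

```powershell
# Example install locations used in this article; adjust them to match your machine.
$sparkHome  = 'C:\bin\spark-2.3.2-bin-hadoop2.7\'
$hadoopHome = 'C:\hadoop'

# Persist both variables for the current user so that new command-line windows see them.
[Environment]::SetEnvironmentVariable('SPARK_HOME', $sparkHome, 'User')
[Environment]::SetEnvironmentVariable('HADOOP_HOME', $hadoopHome, 'User')

# Also set them in the current session, which does not pick up persisted values automatically.
$env:SPARK_HOME  = $sparkHome
$env:HADOOP_HOME = $hadoopHome

# Prepend both bin directories to PATH for this session only.
$env:PATH = (Join-Path $sparkHome 'bin') + ';' + (Join-Path $hadoopHome 'bin') + ';' + $env:PATH
```

Persisting PATH itself is left out on purpose, because overwriting the stored PATH by mistake can break other tools; editing it through the system settings dialog is the more cautious route.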
      

Make sure you are able to run dotnet, java, mvn, and spark-shell from your command-line before you move to the next section. Feel there is a better way? Please open an issue and feel free to contribute.

Note

A new instance of the command-line may be required if any environment variables were updated.
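
The check described above can be sketched in PowerShell as follows. This is only an illustration, not part of the official tooling: Test-CommandAvailable is a hypothetical helper name, and the script reports what the current session can resolve rather than validating versions.

```powershell
# Hypothetical helper: returns $true when the named command can be resolved in this session.
function Test-CommandAvailable {
    param([string] $Name)
    return [bool](Get-Command $Name -ErrorAction SilentlyContinue)
}

# Check each tool named in the prerequisites.
foreach ($tool in 'dotnet', 'java', 'mvn', 'spark-shell') {
    if (Test-CommandAvailable $tool) {
        Write-Host "OK       $tool"
    } else {
        Write-Host "MISSING  $tool"
    }
}

# Check the environment variables set in the Spark and WinUtils steps.
foreach ($name in 'SPARK_HOME', 'HADOOP_HOME') {
    $value = [Environment]::GetEnvironmentVariable($name)
    if ($value) {
        Write-Host "OK       $name = $value"
    } else {
        Write-Host "MISSING  $name"
    }
}
```

If a tool is reported as missing right after you installed it, open a new command-line window and run the check again.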

Build

For the remainder of this guide, you need a clone of the .NET for Apache Spark repository on your machine. You can choose any location for the cloned repository, e.g., C:\github\dotnet-spark\.

git clone https://github.com/dotnet/spark.git C:\github\dotnet-spark

Build .NET for Apache Spark Scala extensions layer

When you submit a .NET application, .NET for Apache Spark relies on logic written in Scala that tells Apache Spark how to handle your requests (e.g., a request to create a new Spark session or a request to transfer data from the .NET side to the JVM side). This logic can be found in the .NET for Spark Scala Source Code.

Regardless of whether you are using .NET Framework or .NET Core, you will need to build the .NET for Apache Spark Scala extension layer:

cd src\scala
mvn clean package 

You should see JARs created for the supported Spark versions:

  • microsoft-spark-2.3.x\target\microsoft-spark-2.3.x-<version>.jar
  • microsoft-spark-2.4.x\target\microsoft-spark-2.4.x-<version>.jar
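
To confirm that the JARs were produced, you could list them from PowerShell. The following is a minimal sketch: Get-SparkJar is a hypothetical helper name, and the filter simply assumes the file naming pattern shown in the list above.

```powershell
# Hypothetical helper: lists microsoft-spark JARs located in "target" folders under the given directory.
function Get-SparkJar {
    param([Parameter(Mandatory)] [string] $ScalaDir)

    Get-ChildItem -Path $ScalaDir -Recurse -Filter 'microsoft-spark-*.jar' |
        Where-Object { $_.Directory.Name -eq 'target' } |
        Select-Object -ExpandProperty FullName
}
```

With the clone location used in this article, calling Get-SparkJar -ScalaDir C:\github\dotnet-spark\src\scala prints the full path of each JAR it finds; no output means the Maven build did not produce a matching file.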

Build the .NET for Spark sample applications

This section explains how to build the sample applications for .NET for Apache Spark. These steps will help in understanding the overall building process for any .NET for Spark application.

Using Visual Studio for .NET Framework

  1. Open src\csharp\Microsoft.Spark.sln in Visual Studio and build the Microsoft.Spark.CSharp.Examples project under the examples folder (this in turn builds the .NET bindings project as well). If you want, you can write your own code in the Microsoft.Spark.CSharp.Examples project. In the following example, input_file.json is a JSON file containing the data you want to create the DataFrame from:

      // Instantiate a session
      var spark = SparkSession
          .Builder()
          .AppName("Hello Spark!")
          .GetOrCreate();
    
      // Create initial DataFrame
      DataFrame df = spark.Read().Json("input_file.json");
    
      // Print schema
      df.PrintSchema();
    
      // Apply a filter and show results
      df.Filter(df["age"] > 21).Show();
    

    Once the build is successful, you will see the appropriate binaries produced in the output directory.
    Sample console output:

          Directory: C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\net461
    
    
      Mode                LastWriteTime         Length Name
      ----                -------------         ------ ----
      -a----         3/6/2019  12:18 AM         125440 Apache.Arrow.dll
      -a----        3/16/2019  12:00 AM          13824 Microsoft.Spark.CSharp.Examples.exe
      -a----        3/16/2019  12:00 AM          19423 Microsoft.Spark.CSharp.Examples.exe.config
      -a----        3/16/2019  12:00 AM           2720 Microsoft.Spark.CSharp.Examples.pdb
      -a----        3/16/2019  12:00 AM         143360 Microsoft.Spark.dll
      -a----        3/16/2019  12:00 AM          63388 Microsoft.Spark.pdb
      -a----        3/16/2019  12:00 AM          34304 Microsoft.Spark.Worker.exe
      -a----        3/16/2019  12:00 AM          19423 Microsoft.Spark.Worker.exe.config
      -a----        3/16/2019  12:00 AM          11900 Microsoft.Spark.Worker.pdb
      -a----        3/16/2019  12:00 AM          23552 Microsoft.Spark.Worker.xml
      -a----        3/16/2019  12:00 AM         332363 Microsoft.Spark.xml
      ------------------------------------------- More framework files -------------------------------------
    

Using .NET Core CLI for .NET Core

Note

We are currently working on automating .NET Core builds for .NET for Apache Spark. Until then, we appreciate your patience in performing some of the steps manually.

  1. Build the worker:

    cd C:\github\dotnet-spark\src\csharp\Microsoft.Spark.Worker\
    dotnet publish -f netcoreapp2.1 -r win10-x64
    

    Sample console output:

    PS C:\github\dotnet-spark\src\csharp\Microsoft.Spark.Worker> dotnet publish -f netcoreapp2.1 -r win10-x64
    Microsoft (R) Build Engine version 16.0.462+g62fb89029d for .NET Core
    Copyright (C) Microsoft Corporation. All rights reserved.
    
      Restore completed in 299.95 ms for C:\github\dotnet-spark\src\csharp\Microsoft.Spark\Microsoft.Spark.csproj.
      Restore completed in 306.62 ms for C:\github\dotnet-spark\src\csharp\Microsoft.Spark.Worker\Microsoft.Spark.Worker.csproj.
      Microsoft.Spark -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark\Debug\netstandard2.0\Microsoft.Spark.dll
      Microsoft.Spark.Worker -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\Debug\netcoreapp2.1\win10-x64\Microsoft.Spark.Worker.dll
      Microsoft.Spark.Worker -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\Debug\netcoreapp2.1\win10-x64\publish\
    
  2. Build the samples:

    cd C:\github\dotnet-spark\examples\Microsoft.Spark.CSharp.Examples\
    dotnet publish -f netcoreapp2.1 -r win10-x64
    

    Sample console output:

    PS C:\github\dotnet-spark\examples\Microsoft.Spark.CSharp.Examples> dotnet publish -f netcoreapp2.1 -r win10-x64
    Microsoft (R) Build Engine version 16.0.462+g62fb89029d for .NET Core
    Copyright (C) Microsoft Corporation. All rights reserved.
    
      Restore completed in 44.22 ms for C:\github\dotnet-spark\src\csharp\Microsoft.Spark\Microsoft.Spark.csproj.
      Restore completed in 336.94 ms for C:\github\dotnet-spark\examples\Microsoft.Spark.CSharp.Examples\Microsoft.Spark.CSharp.Examples.csproj.
      Microsoft.Spark -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark\Debug\netstandard2.0\Microsoft.Spark.dll
      Microsoft.Spark.CSharp.Examples -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\netcoreapp2.1\win10-x64\Microsoft.Spark.CSharp.Examples.dll
      Microsoft.Spark.CSharp.Examples -> C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\netcoreapp2.1\win10-x64\publish\
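
Until the build is automated, the two publish steps above can be wrapped in a small PowerShell function. This is a sketch under stated assumptions, not an official build script: Publish-SparkProject is a hypothetical name, and the project folders, target framework, and runtime identifier are the ones used in the steps above.

```powershell
# Hypothetical helper: runs "dotnet publish" in the worker and samples project folders.
function Publish-SparkProject {
    param(
        [Parameter(Mandatory)] [string] $RepoRoot,   # e.g. C:\github\dotnet-spark
        [string] $Framework = 'netcoreapp2.1',
        [string] $Runtime   = 'win10-x64'
    )

    $projects = 'src\csharp\Microsoft.Spark.Worker', 'examples\Microsoft.Spark.CSharp.Examples'

    foreach ($project in $projects) {
        # Stop immediately if the project folder does not exist.
        Push-Location (Join-Path $RepoRoot $project) -ErrorAction Stop
        try {
            dotnet publish -f $Framework -r $Runtime
            if ($LASTEXITCODE -ne 0) { throw "dotnet publish failed in $project" }
        } finally {
            # Always return to the directory the function was called from.
            Pop-Location
        }
    }
}
```

Calling Publish-SparkProject -RepoRoot C:\github\dotnet-spark then builds the worker first and the samples second, and stops at the first failure.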
    

Run the .NET for Spark sample applications

Once you build the samples, you run them through spark-submit regardless of whether you are targeting .NET Framework or .NET Core. Make sure you have followed the prerequisites section and installed Apache Spark.

  1. Set the DOTNET_WORKER_DIR or PATH environment variable to include the path where the Microsoft.Spark.Worker binary has been generated (e.g., C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\Debug\net461 for .NET Framework, C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\Debug\netcoreapp2.1\win10-x64\publish for .NET Core):

    set DOTNET_WORKER_DIR=C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\Debug\netcoreapp2.1\win10-x64\publish
    
  2. Open PowerShell and go to the directory where your app binary has been generated (e.g., C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\net461 for .NET Framework, C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\netcoreapp2.1\win10-x64\publish for .NET Core):

    cd C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.CSharp.Examples\Debug\netcoreapp2.1\win10-x64\publish
    
  3. Running your app follows the basic structure:

    spark-submit.cmd `
      [--jars <any-jars-your-app-is-dependent-on>] `
      --class org.apache.spark.deploy.dotnet.DotnetRunner `
      --master local `
      <path-to-microsoft-spark-jar> `
      <path-to-your-app-exe> <argument(s)-to-your-app>
    

    Here are some examples you can run:

    • Microsoft.Spark.Examples.Sql.Batch.Basic

      spark-submit.cmd `
      --class org.apache.spark.deploy.dotnet.DotnetRunner `
      --master local `
      C:\github\dotnet-spark\src\scala\microsoft-spark-<version>\target\microsoft-spark-<version>.jar `
      Microsoft.Spark.CSharp.Examples.exe Sql.Batch.Basic %SPARK_HOME%\examples\src\main\resources\people.json
      
    • Microsoft.Spark.Examples.Sql.Streaming.StructuredNetworkWordCount

      spark-submit.cmd `
      --class org.apache.spark.deploy.dotnet.DotnetRunner `
      --master local `
      C:\github\dotnet-spark\src\scala\microsoft-spark-<version>\target\microsoft-spark-<version>.jar `
      Microsoft.Spark.CSharp.Examples.exe Sql.Streaming.StructuredNetworkWordCount localhost 9999
      
    • Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (Maven accessible)

      spark-submit.cmd `
      --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2 `
      --class org.apache.spark.deploy.dotnet.DotnetRunner `
      --master local `
      C:\github\dotnet-spark\src\scala\microsoft-spark-<version>\target\microsoft-spark-<version>.jar `
      Microsoft.Spark.CSharp.Examples.exe Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
      
    • Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (jars provided)

      spark-submit.cmd `
      --jars path\to\net.jpountz.lz4\lz4-1.3.0.jar,path\to\org.apache.kafka\kafka-clients-0.10.0.1.jar,path\to\org.apache.spark\spark-sql-kafka-0-10_2.11-2.3.2.jar,path\to\org.slf4j\slf4j-api-1.7.6.jar,path\to\org.spark-project.spark\unused-1.0.0.jar,path\to\org.xerial.snappy\snappy-java-1.1.2.6.jar `
      --class org.apache.spark.deploy.dotnet.DotnetRunner `
      --master local `
      C:\github\dotnet-spark\src\scala\microsoft-spark-<version>\target\microsoft-spark-<version>.jar `
      Microsoft.Spark.CSharp.Examples.exe Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
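
The steps above can also be sketched entirely in PowerShell. Note that set NAME=value is command-prompt syntax; in a PowerShell session, environment variables are assigned through $env: instead. The helper name Get-SparkSubmitArguments is hypothetical, the worker path is the .NET Core example from step 1, and the JAR path keeps the <version> placeholder from this article, which you need to replace with the file your build produced.

```powershell
# Step 1 in PowerShell form: set the worker directory for this session.
$env:DOTNET_WORKER_DIR = 'C:\github\dotnet-spark\artifacts\bin\Microsoft.Spark.Worker\Debug\netcoreapp2.1\win10-x64\publish'

# Hypothetical helper: builds the spark-submit argument list described in step 3.
function Get-SparkSubmitArguments {
    param(
        [Parameter(Mandatory)] [string] $JarPath,   # path to the microsoft-spark JAR
        [Parameter(Mandatory)] [string] $AppPath,   # path to your app executable
        [string[]] $AppArguments = @(),             # arguments passed on to your app
        [string]   $Master = 'local'
    )

    $submitArgs = @(
        '--class', 'org.apache.spark.deploy.dotnet.DotnetRunner',
        '--master', $Master,
        $JarPath,
        $AppPath
    )
    return $submitArgs + $AppArguments
}

# Arguments for the Sql.Batch.Basic sample; replace <version> before running.
$submitArgs = Get-SparkSubmitArguments `
    -JarPath 'C:\github\dotnet-spark\src\scala\microsoft-spark-<version>\target\microsoft-spark-<version>.jar' `
    -AppPath 'Microsoft.Spark.CSharp.Examples.exe' `
    -AppArguments 'Sql.Batch.Basic', "$env:SPARK_HOME\examples\src\main\resources\people.json"
```

From the directory that contains the app binary (step 2), running spark-submit.cmd @submitArgs then passes each element as a separate argument. Building the list first makes it easy to print and inspect the command before submitting it.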