Call a Java UDF from your .NET for Apache Spark application

In this article, you learn how to call a Java User-Defined Function (UDF) from your .NET for Apache Spark application in two steps:

  1. Define your Java UDFs and compile them into a jar. Skip this step if you already have a UDF defined in a jar file; in that case, all you need is the fully qualified name of the UDF class, including the package.
  2. Register and call your Java UDF in your .NET for Apache Spark application.

Warning

.NET for Apache Spark targets an out-of-support version of .NET (.NET Core 3.1). For more information, see the .NET Support Policy.

Define and compile your Java UDFs

  1. Create a Maven or SBT project and add the following dependencies (written as groupId:artifactId:version) to the project configuration file:

    1. org.apache.spark:spark-core_2.11:<version>
    2. org.apache.spark:spark-sql_2.11:<version>
  2. Define your Java UDF by implementing the interface that matches your UDF's signature and importing the relevant package, as shown in this simple example:

    package com.ScalaUdf.app; // Name of package where UDF is defined
    import org.apache.spark.sql.api.java.UDF1; // UDF interface to implement
    
    public class JavaUdf implements UDF1<Integer, Integer> { // Name of the Java UDF
        private static final long serialVersionUID = 1L; // Must be a long for Java serialization to use it
        @Override
        public Integer call(Integer num) throws Exception { // Define logic of UDF
            return (num + 5);
        }
    }
    
  3. Compile and package your project to create a jar, for example UdfApp-0.0.1.jar.
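
Before packaging, it can help to exercise the UDF logic as plain Java. The following is a minimal sketch, not part of the original sample: the nested `UDF1` and `UDF2` interfaces are local stand-ins that mirror the shape of Spark's `org.apache.spark.sql.api.java.UDF1` and `UDF2`, so the file compiles without Spark on the classpath. The class names `UdfSketch`, `AddFive`, and `AddColumns` are hypothetical, and the null checks assume that Spark may pass `null` to a Java UDF for SQL NULL values.

```java
import java.io.Serializable;

public class UdfSketch {
    // Local stand-ins mirroring the shape of Spark's UDF1 and UDF2 interfaces.
    // In a real project, remove these and import
    // org.apache.spark.sql.api.java.UDF1 and UDF2 instead.
    public interface UDF1<T1, R> extends Serializable {
        R call(T1 t1) throws Exception;
    }

    public interface UDF2<T1, T2, R> extends Serializable {
        R call(T1 t1, T2 t2) throws Exception;
    }

    // One input column, so the UDF implements UDF1.
    // Returns null for a null input instead of throwing a NullPointerException.
    public static class AddFive implements UDF1<Integer, Integer> {
        private static final long serialVersionUID = 1L;

        @Override
        public Integer call(Integer num) {
            return num == null ? null : num + 5;
        }
    }

    // Two input columns, so the UDF implements UDF2 (hypothetical example).
    public static class AddColumns implements UDF2<Integer, Integer, Integer> {
        private static final long serialVersionUID = 1L;

        @Override
        public Integer call(Integer a, Integer b) {
            return (a == null || b == null) ? null : a + b;
        }
    }

    public static void main(String[] args) {
        System.out.println(new AddFive().call(2));       // 7
        System.out.println(new AddFive().call(5));       // 10
        System.out.println(new AddFive().call(null));    // null
        System.out.println(new AddColumns().call(2, 5)); // 7
    }
}
```

The interface you implement follows the number of input columns: one argument maps to `UDF1`, two to `UDF2`, and so on. The last type parameter is always the return type.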

Register and call Java UDFs in .NET for Apache Spark

  1. Use the RegisterJava API to register your Java UDF with Spark SQL.

  2. Register the DataFrame on which you want to call your UDF as an SQL Table using the CreateOrReplaceTempView function.

  3. Use SparkSession.Sql to call the UDF on the table view using Spark SQL. The following basic example illustrates these steps:

    using Microsoft.Spark.Sql; // Provides SparkSession and DataFrame

    class Program
    {
        static void Main()
        {
            SparkSession spark = SparkSession
                .Builder()
                .AppName("Scala/Java UDFs from .NET for Apache Spark")
                .GetOrCreate();
            spark.Udf().RegisterJava<int>("udfAdd5", "com.ScalaUdf.app.JavaUdf"); // Register your Java UDF as 'udfAdd5'
            DataFrame df = spark.CreateDataFrame(new int[] { 2, 5 });
            df.CreateOrReplaceTempView("numbersData"); // Create an SQL table from the DataFrame `df`
            DataFrame dfUdf = spark.Sql("SELECT udfAdd5(_1) As Result FROM numbersData"); // Call the registered UDF on the table
            dfUdf.Show();
            spark.Stop();
        }
    }
    
  4. Submit this application using spark-submit by passing the previously compiled Java UDF jar through the --jars option:

    spark-submit --master local --jars UdfApp-0.0.1.jar --class org.apache.spark.deploy.dotnet.DotnetRunner microsoft-spark-2-4_2.11-1.0.0.jar InterRuntimeUDFs.exe
    

    The resulting dfUdf DataFrame has 5 added to each row of the input column, as defined by JavaUdf:

    +-------+
    | Result|
    +-------+
    |      7|
    |     10|
    +-------+
    

Call .NET UDF from Scala or Python in Apache Spark

You can also register and invoke a C# UDF from an Apache Spark application written in Scala or Python using the sparkdotnetudf open source tool.