December 2015

Volume 30 Number 13

Test Run - Introduction to Spark for .NET Developers

By James McCaffrey | December 2015

James McCaffreySpark is an open source computing framework for Big Data, and it’s becoming increasingly popular, especially in machine learning scenarios. In this article I’ll describe how to install Spark on a machine running a Windows OS, and explain basic Spark functionality from a .NET developer’s point of view.

The best way to see where this article is headed is to take a look at the demo interactive session shown in Figure 1. From a Windows command shell running in administrative mode, I started a Spark environment by issuing a spark-shell command.

Spark in Action
Figure 1 Spark in Action

The spark-shell command generates a Scala interpreter that runs in the shell and in turn issues a Scala prompt (scala>). Scala is a scripting language that’s based on Java. There are other ways to interact with Spark, but using a Scala interpreter is the most common approach, in part because the Spark framework is written mostly in Scala. You can also interact with Spark by using Python language commands or by writing Java programs.

Notice the multiple warning messages in Figure 1. These messages are very common when running Spark because Spark has many optional components that, if not found, generate warnings. In general, warning messages can be ignored for simple scenarios.

The first command entered in the demo session is:

scala> val f = sc.textFile("README.md")

This can be loosely interpreted to mean, “Store into an immutable RDD object named f the contents of file README.md.” Scala objects can be declared as val or var. Objects declared as val are immu­table and can’t change.

The Scala interpreter has a built-in Spark context object named sc, which is used to access Spark functionality. The textFile function loads the contents of a text file into a Spark data structure called a resilient distributed dataset (RDD). RDDs are the primary programming abstraction used in Spark. You can think of an RDD as somewhat similar to a .NET collection stored in RAM across several machines.

Text file README.md (the .md extension stands for markdown document) is located in the Spark root directory C:\spark_1_4_1. If your target file is located somewhere else, you can provide a full path such as C:\\Data\\ReadMeToo.txt.

The second command in the demo session is:

scala> val ff = f.filter(line => line.contains("Spark"))

This means, “Store into an immutable RDD object named ff only those lines from object f that have the word ‘Spark’ in them.” The filter function accepts what’s called a closure. You can think of a closure as something like an anonymous function. Here, the closure accepts a dummy string input parameter named line and returns true if line contains “Spark,” false otherwise.

Because “line” is just a parameter name, I could’ve used any other name in the closure, for example:

ln => ln.contains("Spark")

Spark is case-sensitive, so the following would generate an error:

ln => ln.Contains("Spark")

Scala has some functional programming language characteristics, and it’s possible to compose multiple commands. For example, the first two commands could be combined into one as:

val ff = sc.textFile("README.md").filter(line => lne.contains("Spark"))

The final three commands in the demo session are:

scala> val ct = ff.count()
scala> println(ct)
scala> :q

The count function returns the number of items in an RDD, which in this case is the number of lines in file README.md that contain the word Spark. There are 19 such lines. To quit a Spark Scala session, you can type the :q command.

Installing Spark on a Windows Machine

There are four main steps for installing Spark on a Windows machine. First, you install a Java Development Kit (JDK) and the Java Runtime Environment (JRE). Second, you install the Scala language. Third, you install the Spark framework. And fourth, you configure the host machine system variables.

The Spark distribution comes in a compressed .tar format, so you’ll need a utility to extract the Spark files. I recommend installing the open source 7-Zip program before you begin.

Although Spark and its components aren’t formally supported on a wide range of Windows OS versions, I’ve successfully installed Spark on machines running Windows 7, 8, 10, and Server 2008 and 2012. The demo shown in Figure 1 is running on a Windows 8.1 machine.

You install the JDK by running a self-extracting executable, which you can find by doing an Internet search. I used version jdk-8u60-windows-x64.exe.

When installing the 64-bit version of the JDK, the default installation directory is C:\Program Files\Java\jdkx.x.x_xx\, as shown in Figure 2. I recommend that you don’t change the default location.

The Default JDK Location
Figure 2 The Default JDK Location

Installing the JDK also installs an associated JRE. After the installation finishes, the default Java parent directory will contain both a JDK directory and an associated JRE directory, as shown in Figure 3.

![Java JDK and JRE Installed to C:\Program Files\Java\](images/mt595756.McCaffrey_Figure3-JavaInstallDirectories_hires(en-us,MSDN.10).png "Java JDK and JRE Installed to C:\Program Files\Java\")
Figure 3 Java JDK and JRE Installed to C:\Program Files\Java\

Note that your machine will likely also have a Java directory with one or more 32-bit JRE directories at C:\Program Files (x86). It’s OK to have both 32-bit and 64-bit versions of JREs on your machine, but I recommend using only the 64-bit version of the Java JDK.

Installing Scala

The next step is to install the Scala language, but before you do so, you must go to the Spark download site (described in the next section of this article) and determine which version of Scala to install. The Scala version must be compatible with the Spark version you’ll install in the following step.

Unfortunately, information about Scala-Spark version compatibility is scanty. When I was installing the Spark components (quite some time ago by the time you read this), the current version of Spark was 1.5.0, but I couldn’t find any information about which version of Scala was compatible with that version of Spark. Therefore, I looked at the previous version of Spark, which was 1.4.1, and found some information on developer discussion Web sites that suggested that version 2.10.4 of Scala was likely compatible with version 1.4.1 of Spark.

Installing Scala is easy. The installation process simply involves running an .msi installer file.

The Scala installation wizard guides you through the process. Interestingly, the default installation directory for Scala is in 32-bit directory C:\Program Files (x86)\ rather than in 64-bit directory C:\Program Files\ (see Figure 4).

![Scala Installs to C:\Program Files (x86)\scala\](images/mt595756.McCaffrey_Figure5-ScalaInstallDirectories_hires(en-us,MSDN.10).png "Scala Installs to C:\Program Files (x86)\scala\")
Figure 4 Scala Installs to C:\Program Files (x86)\scala\

If you intend to interact with Spark by writing Java programs rather than by using Scala commands, you’ll need to install an additional tool called the Scala Simple Build Tool (SBT). Interacting with Spark through compiled Java programs is much, much more difficult than using the interactive Scala.

Installing Spark

The next step is to install the Spark framework. First, though, be sure you have a utility program such as 7-Zip that can extract .tar format files. The Spark installation process is manual—you download a compressed folder to your local machine, extract the compressed files, and then copy the files to a root directory. This means that if you wish to uninstall Spark you simply delete the Spark files.

You can find the Spark site at spark.apache.org. The download page allows you to select a version and a package type. Spark is a computing framework and requires a distributed file system (DFS). By far the most common DFS used with the Spark framework is the Hadoop distributed file system (HDFS). For testing and experimentation purposes, such as the demo session in Figure 1, you can install Spark on a system that doesn’t have a DFS. In this scenario, Spark will use the local file system.

If you haven’t extracted .tar files before, you might find the process a bit confusing because you typically have to extract twice. First, download the .tar file (mine was named spark-1.4.1-bin-hadoop2.6.tar) to any temporary directory (I used C:\Temp). Next, right-click on the .tar file and select “Extract files” from the context menu and extract to a new directory inside the temporary directory.

The first extraction process creates a new compressed file without any file extension (in my case spark-1.4.1-bin-hadoop2.6). Next, right-click on that new file and select “Extract files” again from the context menu and extract to a different directory. This second extraction will produce the Spark framework files.

Create a directory for the Spark framework files. A common convention is to create a directory named C:\spark_x_x_x, where the x values indicate the version. Using this convention, I created a C:\spark_1_4_1 directory and copied the extracted files into that directory, as shown in Figure 5.

![Manually Copy Extracted Spark Files to C:\spark_x_x_x\](images/mt595756.McCaffrey_Figure6-SparkFilesInstalled_hires(en-us,MSDN.10).png "Manually Copy Extracted Spark Files to C:\spark_x_x_x\")
Figure 5 Manually Copy Extracted Spark Files to C:\spark_x_x_x\

Configuring Your Machine

After installing Java, Scala and Spark, the last step is to configure the host machine. This involves downloading a special utility file needed for Windows, setting three user-defined system environment variables, setting the system Path variable, and optionally modifying a Spark configuration file.

Running Spark on Windows requires that a special utility file named winutils.exe be in a local directory named C:\hadoop. You can find the file in several places by doing an Internet search. I created directory C:\hadoop and then found a copy of winutils.exe at http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and downloaded the file into the directory.

Next, create and set three user-defined system environment variables and modify the system Path variable. Go to Control Panel | System | Advanced System Settings | Advanced | Environment Variables. In the User Variables section, create three new variables with these names and values:

JAVA_HOME     C:\Program Files\Java\jdk1.8.0_60
SCALA_HOME    C:\Program Files (x86)\scala
HADOOP_HOME   C:\hadoop

Then, in the System Variables, edit the Path variable by adding the location of the Spark binaries, C:\spark_1_4_1\bin. Be careful; you really don’t want to lose any values in the Path variable. Note that the Scala installation process will have already added the location of the Scala binaries for you (see Figure 6).

Configuring Your System
Figure 6 Configuring Your System

After you set up your system variables, I recommend modifying the Spark configuration file. Go to the root directory C:\spark_1_4_1\config and make a copy of file log4j.properties.template. Rename that copy by removing the .template extension. Edit the first configuration entry from log4j.rootCategory=INFO to log4j.rootCategory=WARN.

The idea is that by default Spark spews all kinds of informational messages. Changing the logging level from INFO to WARN greatly reduces the number of messages and makes interacting with Spark less messy.

The Hello World of Spark

The Hello World example of distributed computing is to calculate the number of different words in a data source. Figure 7 shows the word-count example using Spark.

Word Count Example Using Spark
Figure 7 Word Count Example Using Spark

The Scala shell is sometimes called a read, evaluate, print loop (REPL) shell. You can clear the Scala REPL by typing CTRL+L. The first command in Figure 7 loads the contents of file README.md into an RDD named f, as explained previously. In a realistic scenario, your data source could be a huge file spread across hundreds of machines, or could be in a distributed database such as Cassandra.

The next command is:

scala> val fm = f.flatMap(line => line.split(" "))

The flatMap function call splits each line in the f RDD object on blank space characters, so resulting RDD object fm will hold a collection of all the words in the file. From a developer’s point of view, you can think of fm as something like a .NET List<string> collection.

The next command is:

scala> val m = fm.map(word => (word, 1))

The map function creates an RDD object that holds pairs of items, where each pair consists of a word and the integer value 1. You can see this more clearly if you issue an m.take(5) command. You’ll see the first five words in file README.md, and a 1 value next to each word. From a developer’s point of view, m is roughly a List<Pair> collection in which each Pair object consists of a string and an integer. The string (a word in README.md) is a key and the integer is a value, but unlike many key-value pairs in the Microsoft .NET Framework, duplicate key values are allowed in Spark. RDD objects that hold key-value pairs are sometimes called pair RDDs to distinguish them from ordinary RDDs.

The next command is:

scala> val cts = m.reduceByKey((a,b) => a + b)

The reduceByKey function combines the items in object m by adding the integer values associated with equal key values. If you did a cts.take(10) you’d see 10 of the words in README.md followed by the number of times each word occurs in the file. You might also notice that the words in object cts aren’t necessarily in any particular order.

The reduceByKey function accepts a closure. You can use an alternate Scala shortcut notation:

scala> val cts = m.reduceByKey(_ + _)

The underscore is a parameter wild card, so the syntax can be interpreted as “add whatever two values are received.”

Notice that this word-count example uses the map function followed by the reduceByKey function. This is an example of the MapReduce paradigm.

The next command is:

scala> val sorted =
     cts.sortBy(item => item._2, false)

This command sorts the item in the cts RDD, based on the second value (the integer count) of the items. The false argument means to sort in descending order, in other words from highest count to lowest. The Scala shortcut syntax form of the sort command would be:

scala> val sorted = cts.sortBy(_._2, false)

Because Scala has many functional language characteristics and uses a lot of symbols instead of keywords, it’s possible to write Scala code that’s very unintuitive.

The final command in the Hello World example is to display the results:

scala> sorted.take(5).foreach(println)

This means, “Fetch the first five objects in the RDD object named sorted, iterate over that collection, applying the println function to each item.” The results are:

(,66)
(the,21)
(Spark,14)
(to,14)
(for,11)

This means there are 66 occurrences of the empty/null word in README.md, 21 occurrences of the word “the,”, 14 occurrences of “Spark” and so on.

Wrapping Up

The information presented in this article should get you up and running if you want to experiment with Spark on a Windows machine. Spark is a relatively new technology (created at UC Berkeley in 2009), but interest in Spark has increased dramatically over the past few months, at least among my colleagues.

In a 2014 competition among Big Data processing frameworks, Spark set a new performance record, easily beating the previous record set by a Hadoop system the year before. Because of its exceptional performance characteristics, Spark is particularly well-suited for use with machine learning systems. Spark supports an open source library of machine learning algorithms named MLib.


Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Internet Explorer and Bing. Dr. McCaffrey can be reached at jammc@microsoft.com.

Thanks to the following Microsoft technical experts for reviewing this article: Gaz Iqbal and Umesh Madan