September 2013

Volume 28 Number 9

Azure Insider - Hadoop and HDInsight: Big Data in Windows Azure

By Bruno Terkaly, Ricardo Villalobos | September 2013

Let’s begin with a bold assertion: “If you, your startup, or the enterprise you work for aren’t saving massive quantities of data to disk for current and future analysis, you are compromising your effectiveness as a technical leader.” Isn’t it foolish to base important business decisions on gut instinct alone, rather than real quantitative data?

There are many reasons why big data is so pervasive. First, it’s amazingly cheap to collect and store data in any form, structured or unstructured, especially with the help of products such as Windows Azure Storage services. Second, it’s economical to leverage the cloud to provide the needed compute power—running on commodity hardware—to analyze this data. Finally, big data done well provides a major competitive advantage to businesses because it’s possible to extract undiscovered information from vast quantities of unstructured data. The purpose of this month’s article is to show how you can leverage the Windows Azure platform—in particular the Windows Azure HDInsight Service—to solve big data challenges.

Barely a day goes by without some hyped-up story in the IT press—and sometimes even in the mainstream media—about big data. Big data refers simply to data sets so large and complex that they’re difficult to process using traditional techniques, such as data cubes, de-normalized relational tables and batch-based extract, transform and load (ETL) engines, to name a few. Advocates talk about extracting business and scientific intelligence from petabytes of unstructured data that might originate from a variety of sources: sensors, Web logs, mobile devices and the Internet of Things, or IoT (technologies based on radio-frequency identification [RFID], such as near-field communication, barcodes, Quick Response [QR] codes and digital watermarking). IoT changes the definition of big—we’re now talking about exabytes of data a day!

Does big data live up to all the hype? Microsoft definitely believes it does and has bet big on big data. First, big data leads to better marketing strategies, replacing gut-based decision making with analytics based on real consumer behavior. Second, business leaders can improve strategic decisions, such as adding a new feature to an application or Web site, because they can study the telemetry and usage data of applications running on a multitude of devices. Third, it helps financial services detect fraud and assess risk. Finally, though you may not realize it, it’s big data technologies that are typically used to build recommendation engines (think Netflix). Recommendations are often offered as a service on the Web or within big companies to expedite business decisions. The really smart businesses are collecting data today without even knowing what type of questions they’re going to ask of the data tomorrow.

Big data really means data analytics, which has been around for a long time. While there have always been huge data stores being mined for intelligence, what makes today’s world different is the sheer variety of mostly unstructured data. Fortunately, products like Windows Azure bring great economics, allowing just about anyone to scale up his compute power and apply it to vast quantities of storage, all in the same datacenter. Data scientists describe the new data phenomenon as the three Vs—velocity, volume and variety. Never has data been created with such speed, at such volume and with so little defined structure.

The world of big data contains a large and vibrant ecosystem, but one open source project reigns above them all, and that’s Hadoop. Hadoop is the de facto standard for distributed data crunching. You’ll find a great introduction to Hadoop at bit.ly/PPGvDP: “Hadoop provides a MapReduce framework for writing applications that process large amounts of structured and semi-structured data in parallel across large clusters of machines in a very reliable and fault-tolerant manner.” In addition, as you learn more about this space, you’ll likely come to agree with the perspective of Matt Winkler (principal PM on HDInsight) that Hadoop is “an ecosystem of related projects on top of the core distributed storage and MapReduce framework.” Paco Nathan, director of data science at Concurrent and a committer on the Cascading open source project (cascading.org), says further, “the abstraction layers enable people to leverage Hadoop at scale without knowing the underpinnings.”

The MapReduce Model

MapReduce is the programming model used to process huge data sets; essentially it’s the “assembly language” for Hadoop, so understanding what it does is crucial to understanding Hadoop. MapReduce algorithms are written in Java and split the input data set into independent chunks that are processed by the map tasks in a completely parallel manner. The framework sorts the output of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks.
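To see the model in miniature before looking at real Hadoop code, here is a self-contained Java sketch (our own illustration, with no Hadoop dependencies; all names are ours) that mimics the three phases on a couple of comma-separated lines: a map phase emitting (key, 1) pairs, the framework’s shuffle/sort step that groups pairs by key, and a reduce phase that sums each group.

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceSketch {
    // "Map" phase: emit a (token, 1) pair for each comma-separated token.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.split(","))
                     .map(token -> Map.entry(token.trim(), 1));
    }

    // "Reduce" phase: sum all values grouped under one key.
    static int reduce(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("THEFT,VANDALISM", "THEFT,THEFT");

        // The framework's shuffle/sort step: group emitted pairs by key.
        Map<String, List<Integer>> shuffled = input.stream()
            .flatMap(MapReduceSketch::map)
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                     Collectors.mapping(Map.Entry::getValue,
                                        Collectors.toList())));

        // Run reduce once per key, just as the framework does.
        shuffled.forEach((k, v) -> System.out.println(k + "\t" + reduce(v)));
    }
}
```

In a real cluster, the map tasks run in parallel across machines and the shuffle moves data over the network; the shape of the computation, however, is exactly this.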

Ultimately, most developers won’t author low-level Java code for MapReduce. Instead, they’ll use advanced tooling that abstracts the complexities of MapReduce, such as Hive or Pig. To gain an appreciation of this abstraction, we’ll take a look at low-level Java MapReduce and at how the high-level Hive query engine, which HDInsight supports, makes the job much easier.

Why HDInsight?

HDInsight is an Apache Hadoop implementation that runs in globally distributed Microsoft datacenters. It’s a service that allows you to easily build a Hadoop cluster in minutes when you need it, and tear it down after you run your MapReduce jobs. As Windows Azure Insiders, we believe there are a couple of key value propositions of HDInsight. The first is that it’s 100 percent Apache-based, not a special Microsoft version, meaning that as Hadoop evolves, Microsoft will embrace the newer versions. Moreover, Microsoft is a major contributor to the Hadoop/Apache project and has provided a great deal of its query optimization know-how to the query tooling, Hive.

The second aspect of HDInsight that’s compelling is that it works seamlessly with Windows Azure Blobs, mechanisms for storing large amounts of unstructured data that can be accessed from anywhere in the world via HTTP or HTTPS. HDInsight also makes it possible to persist the meta-data of table definitions in SQL Server so that when the cluster is shut down, you don’t have to re-create your data models from scratch.

Figure 1 depicts the breadth and depth of Hadoop support in the Windows Azure platform.

Hadoop Ecosystem in Windows Azure
Figure 1 Hadoop Ecosystem in Windows Azure

On top is the Windows Azure Storage system, which provides secure and reliable storage and includes built-in geo-replication for redundancy of your data across regions. Windows Azure Storage includes a variety of powerful and flexible storage mechanisms, such as Tables (a NoSQL, key-value store), SQL database, Blobs and more. It supports a RESTful API that allows any client to perform create, read, update, delete (CRUD) operations on text or binary data, such as video, audio and images. This means that any HTTP-capable client can interact with the storage system. Hadoop directly interacts with Blobs, but that doesn’t limit your ability to leverage other storage mechanisms within your own code.
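Because the storage API is plain HTTP, every blob lives at a predictable URL. The following minimal Java sketch (our own; the account and container names are hypothetical) builds such an address, which any HTTP client can then GET for a public blob, or sign for authenticated CRUD operations:

```java
public class BlobAddress {
    // Build the HTTPS endpoint for a blob in Windows Azure Blob storage.
    // Account ("mystorageacct") and container ("crimedata") are hypothetical.
    static String blobUrl(String account, String container, String blob) {
        return "https://" + account + ".blob.core.windows.net/"
               + container + "/" + blob;
    }

    public static void main(String[] args) {
        System.out.println(
            blobUrl("mystorageacct", "crimedata", "SFPD_Incidents.csv"));
    }
}
```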

The second key area is Windows Azure support for virtual machines (VMs) running Linux. Hadoop runs on top of Linux and leverages Java, which makes it possible to set up your own single-node or multi-node Hadoop cluster. This can be a tremendous money saver and productivity booster, because a single VM in Windows Azure is very economical. You can actually build your own multi-node cluster by hand, but it isn’t trivial and isn’t needed when you’re just trying to validate some basic algorithms.

Setting up your own Hadoop cluster makes it easy to start learning and developing Hadoop applications. Moreover, performing the setup yourself provides valuable insight into the inner workings of a Hadoop job. If you want to know how to do it, see the blog post, “How to Install Hadoop on a Linux-Based Windows Azure Virtual Machine,” at bit.ly/1am85mU.

Of course, once you need a larger cluster, you’ll want to take advantage of HDInsight, which is available today in Preview mode. To begin, go into the Windows Azure portal (bit.ly/12jt5KW) and sign in. Next, select Data Services | HDInsight | Quick Create. You’ll be asked for a cluster name, the number of compute nodes (currently four to 32 nodes) and the storage account to which to bind. The location of your storage account determines the location of your cluster. Finally, click CREATE HDINSIGHT CLUSTER. It will take 10 to 15 minutes to provision your cluster. The time it takes to provision is not related to the size of the cluster.

Note that you can also create and manage an HDInsight cluster programmatically using Windows PowerShell, as well as through cross-platform tooling on Linux- and Mac-based systems. Much of the functionality in the command-line interface (CLI) is also available in an easy-to-use management portal, which allows you to manage the cluster, including the execution and management of jobs on the cluster. You can download Windows Azure PowerShell as well as the CLIs for Mac and Linux at bit.ly/ZueX9Z. Alternatively, as described earlier, you can set up your own VM running CentOS (a version of Linux), along with the Java SDK and Hadoop.

Exploring Hadoop

To experiment with Hadoop and learn about its power, we decided to leverage publicly available data from data.sfgov.org. Specifically, we downloaded a file containing San Francisco crime data for the previous three months and used it as is. The file includes more than 33,000 records (relatively small by big data standards) derived from the SFPD Crime Incident Reporting system. Our goal was to perform some simple analytics, such as calculating the number and type of crime incidents. Figure 2 shows part of the output from the Hadoop job that summarized the crime data.

Figure 2 Crime Data Summarized by Type

Grand Theft from Locked Auto 2617
Malicious Mischief 1623
Drivers License 1230
Aided Case 1195
Lost Property 1083

The code in Figure 3 summarizes the three months of crimes. The input file contained more than 33,000 rows of data, while the output contained only 1,000 records. The top five of those 1,000 records are shown in Figure 2.

Figure 3 The MapReduce Java Code That Summarizes the Crime Data

// CrimeCount.java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
// This code is based on the standard word count examples
// you can find almost everywhere.
// We modify the map function to be able to aggregate based on
// crime type.
// The reduce function as well as main is unchanged,
// except for the name of the job.
public class CrimeCount {
  public static class Map extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private String mytoken = null;
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      // Read one line from the input file.
      String line = value.toString();
      // Parse the line into separate columns.
      String[] myarray = line.split(",");
      // Verify that there are at least three columns.
      if (myarray.length >= 3) {
        // Grab the third column and increment that
        // crime (i.e. LOST PROPERTY found, so add 1).
        mytoken = myarray[2];
        word.set(mytoken);
        // Add the key/value pair to the output.
        output.collect(word, one);
      }
    }
  }
  // A fairly generic implementation of reduce.
  public static class Reduce extends MapReduceBase implements Reducer<Text,
    IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      // Loop through and aggregate the key/value pairs.
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
  // Kick off the MapReduce job, specifying the map and reduce
  // classes, as well as input and output parameters.
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CrimeCount.class);
    conf.setJobName("crimecount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

Once you’ve saved the code in Figure 3 as CrimeCount.java, you need to compile, package and submit the Hadoop job. Figure 4 contains instructions for copying the input crime data file into the Hadoop Distributed File System (HDFS); compiling CrimeCount.java; creating the crimecount.jar file; running the Hadoop job (using crimecount.jar); and viewing the results—that is, the output data. To download the entire source code, go to sdrv.ms/16kKJKh and right-click the CrimeCount folder.
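One caveat about the map function in Figure 3: line.split(",") treats every comma as a field boundary, so a quoted field containing a comma (addresses often do) would be split incorrectly and could miscount categories. A minimal quote-aware splitter, our own sketch rather than part of the article’s code, looks like this:

```java
import java.util.*;

public class CsvSplit {
    // Split one CSV line, honoring double-quoted fields so that a comma
    // inside quotes (e.g. "100 MAIN ST, APT 2") stays in a single field.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;        // toggle quoted state
            } else if (c == ',' && !inQuotes) {
                fields.add(cur.toString());  // field boundary
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());          // last field
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("1,THEFT,\"100 MAIN ST, APT 2\""));
    }
}
```

In production you would typically reach for a full CSV parsing library, but this shows why the naive split can be a source of subtle counting errors.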

Figure 4 Compiling, Packaging and Running the Hadoop Job

# Make a folder for the input file.
hadoop fs -mkdir /tmp/hadoopjob/crimecount/input
# Copy the data file into the folder.
hadoop fs -put SFPD_Incidents.csv /tmp/hadoopjob/crimecount/input
# Create a folder for the Java output classes.
mkdir crimecount_classes
# Compile the Java source code.
javac -classpath /usr/lib/hadoop/hadoop-common-2.0.0-cdh4.3.0.jar:/usr/lib/hadoop-0.20-mapreduce/hadoop-core-2.0.0-mr1-cdh4.3.0.jar -d crimecount_classes CrimeCount.java
# Create a jar file from the compiled Java code.
jar -cvf crimecount.jar -C crimecount_classes/ .
# Submit the jar file as a Hadoop job, passing in class path as well as
# the input folder and output folder.
# *NOTE* HDInsight users can use "asv:///SFPD_Incidents.csv" instead of
# "/tmp/hadoopjob/crimecount/input" if they uploaded the input file
# (SFPD_Incidents.csv) to Windows Azure Storage.
hadoop jar crimecount.jar CrimeCount /tmp/hadoopjob/crimecount/input /tmp/hadoopjob/crimecount/output
# Display the output (the results) from the output folder.
hadoop fs -cat /tmp/hadoopjob/crimecount/output/part-00000

Now you have an idea of the pieces that make up a minimal Hadoop environment, as well as what MapReduce Java code looks like and how it ends up being submitted as a Hadoop job at the command line. Chances are, at some point you’ll want to spin up a cluster, run some big jobs using higher-level tooling such as Hive or Pig, and then tear the cluster down. This is exactly what HDInsight makes easy, with built-in support for both Pig and Hive.

Once your cluster is created, you can work at the Hadoop command prompt or you can use the portal to issue Hive and Pig queries. The advantage of these queries is that you never have to delve into Java and modify MapReduce functions, perform the compilation and packaging, or kick off the Hadoop job with the .jar file. Although you can remote in to the head node of the Hadoop cluster and perform these tasks (writing Java code, compiling the Java code, packaging it up as a .jar file, and using the .jar file to execute it as a Hadoop job), this isn’t the optimal approach for most Hadoop users—it’s too low-level.

The most productive way to run MapReduce jobs is to leverage the Windows Azure portal in HDInsight and issue Hive queries, assuming Hive is a better technical fit for the problem than Pig. You can think of Hive as higher-level tooling that abstracts away the complexity of writing MapReduce functions in Java. It’s really nothing more than a SQL-like scripting language. Queries written in Hive get compiled into Java MapReduce functions. Moreover, because Microsoft has contributed significant portions of optimization code for Hive in the Apache Hadoop project, chances are that queries written in Hive will be better optimized and will run more efficiently than handcrafted Java code. You can find an excellent tutorial at bit.ly/Wzlfbf.

All of the Java and script code we presented previously can be replaced with the tiny amount of code in Figure 5. It’s remarkable how a handful of Hive statements can efficiently achieve the same or better results than that previous code.

Figure 5 Hive Query Code to Perform the MapReduce

-- Hive does a remarkable job of representing native Hadoop data stores
-- as relational tables so you can issue SQL-like statements.
-- Create a pseudo-table for data loading and querying.
CREATE TABLE sfpdcrime(
  IncidntNum string,
  Category string,
  Descript string,
  DayOfWeek string,
  CrimeDate string,
  CrimeTime string,
  PdDistrict string,
  Resolution string,
  Address string,
  X string,
  Y string,
  CrimeLocation string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- Load the data into the table.
LOAD DATA INPATH 'asv://sanfrancrime@brunoterkaly.blob.core.windows.net/SFPD_Incidents.csv' OVERWRITE INTO TABLE sfpdcrime;
SELECT COUNT(*) FROM sfpdcrime;
-- Ad hoc query to aggregate and summarize crime types.
SELECT Descript, COUNT(*) AS cnt FROM sfpdcrime
GROUP BY Descript
ORDER BY cnt DESC;

There are some important points to note about Figure 5. First, notice that these commands look like familiar SQL statements, allowing you to create table structures into which you can load data. What’s particularly interesting is the loading of data from Windows Azure Storage services. Note the asv prefix in the load statement in Figure 5. ASV stands for Azure Storage Vault, which you can use as a storage mechanism to provide input data to Hadoop jobs. As you may recall, during the process of provisioning an HDInsight cluster, you specified one or more Windows Azure Storage service accounts. The ability to leverage Windows Azure Storage services in HDInsight dramatically improves the usability and efficiency of managing and executing Hadoop jobs.
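Conceptually, the GROUP BY/ORDER BY query in Figure 5 compiles down to the same map-group-reduce shape we wrote by hand earlier. As a plain-Java illustration (the sample data and names are ours, not from the SFPD file), the aggregation the query performs looks like this:

```java
import java.util.*;
import java.util.stream.*;

public class CrimeSummary {
    // Group identical Descript values, count them, and sort by count
    // descending -- the same shape as the Hive GROUP BY / ORDER BY.
    static LinkedHashMap<String, Long> summarize(List<String> descripts) {
        return descripts.stream()
            .collect(Collectors.groupingBy(d -> d, Collectors.counting()))
            .entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                     (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        // Stand-ins for the Descript column values.
        List<String> sample = List.of(
            "LOST PROPERTY", "AIDED CASE", "LOST PROPERTY", "LOST PROPERTY");
        summarize(sample).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

The difference, of course, is that Hive generates the distributed version of this computation for you and runs it across the cluster.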

We’ve only scratched the surface in this article. There’s a significant amount of sophisticated tooling that supports and extends HDInsight, and a variety of other open source projects you can learn about at the Apache Hadoop portal (hadoop.apache.org). Your next steps should include watching the Channel 9 video “Make Your Apps Smarter with Azure HDInsight” at bit.ly/19OVzfr. If your goal is to remain competitive by making decisions based on real data and analytics, HDInsight is there to help.

The Hadoop Ecosystem

Once you leave the low-level world of writing MapReduce jobs in Java, you’ll discover an incredible, highly evolved ecosystem of tooling that greatly extends Hadoop’s capabilities. For example, Cloudera and Hortonworks are successful companies with business models based on Hadoop products, education and consulting services. Many open source projects provide additional capabilities, such as machine learning (ML); SQL-like query engines that support data summarization and ad hoc querying (Hive); data-flow language support (Pig); and much more. Here are just some of the projects that are worth a look: Sqoop, Pig, Apache Mahout, Cascading and Oozie. Microsoft offers a variety of tools as well, such as Excel with PowerPivot, Power View, and ODBC drivers that make it possible for Windows applications to issue queries against Hive data. Visit bit.ly/WIeBeq to see a fascinating visual of the Hadoop ecosystem.


Bruno Terkaly is a developer evangelist for Microsoft. His depth of knowledge comes from years of experience in the field, writing code using a multitude of platforms, languages, frameworks, SDKs, libraries and APIs. He spends time writing code, blogging and giving live presentations on building cloud-based applications, specifically using the Windows Azure platform. You can read his blog at blogs.msdn.com/b/brunoterkaly.

Ricardo Villalobos is a seasoned software architect with more than 15 years of experience designing and creating applications for companies in the supply chain management industry. Holding different technical certifications, as well as a master’s degree in business administration from the University of Dallas, he works as a cloud architect in the Windows Azure CSV incubation group for Microsoft. You can read his blog at blog.ricardovillalobos.com.

Terkaly and Villalobos jointly present at large industry conferences. They encourage readers of Windows Azure Insider to contact them for availability. Terkaly can be reached at bterkaly@microsoft.com and Villalobos can be reached at Ricardo.Villalobos@microsoft.com.

Thanks to the following technical experts for reviewing this article: Paco Nathan (Concurrent Inc.) and Matt Winkler (Microsoft)