July 2012

Volume 27 Number 07

Microsoft Azure - Hadoop on Microsoft Azure

By Lynn Langit | July 2012

There’s been a lot of buzz about Hadoop lately, and interest in using it to process extremely large data sets seems to grow day by day. With that in mind, I’m going to show you how to set up a Hadoop cluster on Azure. This article assumes basic familiarity with Hadoop technologies. If you’re new to Hadoop, see “What Is Hadoop?” As of this writing, Hadoop on Azure is in private beta. To get an invitation, visit hadooponazure.com. The beta is compatible with Apache Hadoop (snapshot 0.20.203+).

What Is Hadoop?

Hadoop is an open source library designed to batch-process massive data sets in parallel. It’s based on the Hadoop distributed file system (HDFS), and consists of utilities and libraries for working with data stored in clusters. These batch processes run using a number of different technologies, such as MapReduce jobs, and may be written in Java or in higher-level languages such as Pig. There are also languages for querying data stored in a Hadoop cluster; the most common is HiveQL (HQL), via Hive. For more information, visit hadoop.apache.org.

Setting Up Your Cluster

Once you’re invited to participate in the beta, you can set up your Hadoop cluster. Go to hadooponazure.com and log in with your authorized Windows Live ID. Next, fill out the dialog boxes on the Web portal using the following values:

  1. Cluster (DNS) name: Enter a name in the form “<your unique string>.cloudapp.net”.
  2. Cluster size: Choose the number of nodes, from 4 to 32, and their associated storage allocations, from 2TB to 16TB per cluster.
  3. Administrator username and password: Enter a username and password; password complexity restrictions are listed on the page. Once this is set you can connect via remote desktop or via Excel.
  4. Configuration information for a SQL Azure instance: This is an option for storing the Hive Metastore. If it’s selected, you’ll need to supply the URL to your SQL Azure server instance, as well as the name of the target database and login credentials. The login you specify must be a member of the following roles on the target database: db_ddladmin, db_datawriter, db_datareader.

After you’ve filled in this information, click Request cluster. You’ll see a series of status updates in the Web portal as your cluster (called Isotope in the beta) is being allocated, created and started. For each cluster you allocate, you’ll see many worker nodes and one head node, which is also known as the NameNode.

After some period of time (five to 30 minutes in my experience), the portal will update to show that your cluster is allocated and ready for use. You can then simply explore the Metro-style interface (by clicking the large buttons) to see what types of data processing and management tasks you can perform (see Figure 1). In addition to using the Web portal to interact with your cluster, you might want to open the available ports (closed by default) for FTP or ODBC Server access. I’ll discuss some alternative methods of connecting in a bit.

The Azure Hadoop Portal
Figure 1 The Azure Hadoop Portal

In the Your Cluster section of the portal, you can perform basic administrative tasks such as configuring access to your cluster, importing data and managing the cluster via the interactive console. The interactive console supports JavaScript or Hive. As Figure 1 shows, you can also access the Your Tasks section. Here you can run a MapReduce job (via a .jar file) and see the status of any MapReduce jobs that are running as well as those that have recently completed.

The portal buttons display information about three recently completed MapReduce jobs: C# Streaming Example, Word Count Example and 10GB Terasort Example. Each button shows the status of both the Map and the Reduce portion of each job. There are several other options for viewing the status of running (or completed) MapReduce jobs directly from the portal and via other means of connecting to your cluster, such as Remote Desktop Protocol (RDP).

Connecting to Your Data

You can make data available to your Hadoop on Azure cluster in a number of ways, including directly uploading to your cluster and accessing data stored in other locations.

Although FTP theoretically allows data files of any size, best practice is to upload files in the lower gigabyte range. If you want to run batch jobs on data stored outside of Hadoop, you’ll need to perform a couple of configuration steps first. To set up outside connections, click the Manage Cluster button on the main portal and then configure the storage locations you wish to use, such as an Azure Blob storage location, an Azure Data Market query result or an Amazon Web Services (AWS) S3 storage location:

  1. To configure a connection to an AWS S3 bucket, enter your security keys (both public and private) so you can access data stored on S3 in your Hadoop cluster.
  2. To work with data from the Azure Data Market, fill in values for username (WLID), passkey (for the data source you wish to query and import), source (extract) query and (destination) Hive table name. Be sure to remove the parameter for default query limit (100 rows) from the query generated by the tools in the Data Market before you enter the query into the text box on your cluster.
  3. To access data stored in Azure Blob storage, you’ll need to enter the storage account name (URL) to the Blob storage locations and your passkey (private key) value.
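
Stripping the default query limit mentioned in step 2 amounts to removing one parameter from the generated query URL before you paste it into the portal. The following JavaScript sketch assumes the limit appears as an OData-style $top parameter; the exact parameter name in the query your Data Market tooling generates is worth verifying:

```javascript
// Sketch: remove the default row-limit parameter from a Data Market
// query URL. The "$top" name is an assumption based on the OData
// conventions Data Market uses; check the query your tools generate.
function stripRowLimit(queryUrl) {
  return queryUrl
    .replace(/[?&]\$top=\d+/, function (match) {
      // Keep the "?" if $top led the parameter list and others follow.
      return match.charAt(0) === "?" ? "?" : "";
    })
    .replace(/\?&/, "?")   // tidy up "?&" left when $top came first
    .replace(/[?&]$/, ""); // drop a trailing separator
}

var url = "https://api.datamarket.azure.com/data/v1/Query?$top=100&format=atom";
console.log(stripRowLimit(url)); // the same URL without $top=100
```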

Running a MapReduce Job

After setting up and verifying your Hadoop cluster, and making your data available, you’ll probably want to start crunching this data by running one or more MapReduce jobs. The question is, how best to start? If you’re new to Hadoop, there are some samples you can run to get a feel of what’s possible. You can view and run any of these by clicking the Samples button on the Web portal.

If you’re experienced with Hadoop techniques and want to run your own MapReduce job, there are several methods. The method you select will depend on your familiarity with the Hadoop tools (such as the Hadoop command prompt) and your preferred language. You can use Java, Pig, JavaScript or C# to write an executable MapReduce job for Hadoop on Azure.

I’ll use the Word Count sample to demonstrate how to run a MapReduce job from the portal using a .jar file. As you might expect, this job counts words for some input—in this example a large text file (the contents of an entire published book)—and outputs the result. Click Samples, then WordCount to open the job configuration page on the portal, as shown in Figure 2.

Setting Up the WordCount Sample
Figure 2 Setting Up the WordCount Sample

You’ll see two configurable parameters for this job, one for the function (word count) and the other for the source data (the text file). The source data (Parameter 1) includes not only the name of the input file, but also the path to its location. The path can be local, meaning the file is stored on the Hadoop on Azure cluster itself. Alternatively, the source data can be retrieved from AWS S3 (via the s3n:// or s3:// protocol), from Azure Blob storage (via the asv:// protocol) or from the Azure Data Market (by first importing the desired data using a query), or be retrieved directly from the HDFS store. After you enter the path to a remote location, you can click the verification icon (a triangle); you should get an OK message if the cluster can connect using the string provided.
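
Because the source location is distinguished purely by its URI scheme, you can think of the choice as a simple classifier. This JavaScript sketch uses the scheme names mentioned above; it’s for illustration only and isn’t part of the Hadoop on Azure tooling:

```javascript
// Classify a source-data path by its URI scheme. The scheme names come
// from the protocols described in the text (s3n/s3, asv, hdfs); a path
// without a scheme is treated as local to the cluster. Illustration only.
function classifySource(path) {
  var match = /^([a-z0-9]+):\/\//i.exec(path);
  if (!match) return "local";
  switch (match[1].toLowerCase()) {
    case "s3":
    case "s3n": return "aws-s3";
    case "asv": return "azure-blob";
    case "hdfs": return "hdfs";
    default: return "unknown";
  }
}

console.log(classifySource("s3n://HadoopAzureTest/Books")); // "aws-s3"
console.log(classifySource("/example/data/davinci.txt"));   // "local"
```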

After you configure the parameters, click Execute job. You’ll find a number of ways to monitor both job status as the job is executing and job results after the job completes. For example, on the main portal page, the Your Tasks section displays a button with the status of the most recent jobs during execution and after completion. A new button is added for each job, showing the job name, percentage complete for both the Map and the Reduce portions during execution, and the status (OK, failed and so forth) after job completion.

The Job History page, which you can get to from the Manage Your Account section of the main page, provides more detail about the job, including the text (script) used to run the job and the status, with date and time information. You can click the link for each job to get even more information about job execution.

If you decide to run a sample, be sure to read the detailed instructions for that particular sample. Some samples can be run from the Web portal (Your Tasks | Create Job); others require an RDP connection to your cluster.

Using JavaScript to Run Jobs

Click on the Interactive Console button to open the JavaScript console. Here you can run MapReduce jobs by executing .jar files (Java), by running a Pig command from the prompt, or by writing and executing MapReduce jobs directly in JavaScript.

You can also directly upload source data from the js> prompt using the fs.put command. This command opens a dialog box where you can choose a file to upload to your cluster. IIS limits the size of the file you can upload via the JavaScript console to 4GB.

You can also use source data from other remote stores (such as Azure Blobs) or from other cloud vendors. To work with source data from AWS S3, you use a request in the format s3n://<bucket name>/<folder name>.

Using the JavaScript console, you can verify connectivity to your AWS S3 bucket by using the #ls command with the bucket address, like so:

js> #ls s3n://HadoopAzureTest/Books
Found 2 items
-rwxrwxrwx   1          0 2012-03-30 00:20 /Books
-rwxrwxrwx   1    1395667 2012-03-30 00:22 /Books/davinci.txt


When you do, you should get a list of the contents (both folders and files) of your bucket as in this example.

If you’d like to review the contents of the source file before you run your job, you can do so from the console with the #cat command:

js> #cat s3n://HadoopAzureTest/Books/davinci.txt


After you verify that you can connect to your source data, you’ll want to run your MapReduce job. The following is the JavaScript syntax for the Word Count sample MapReduce job (using a .jar file):

var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};
var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next());
  }
  context.write(key, sum);
};
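
You can get a feel for what these two functions do before submitting the job by exercising them with a small local harness. The context object and values iterator below are stand-ins for what the Hadoop JavaScript runtime supplies; this is a sketch for experimentation, not the actual runtime:

```javascript
// The map and reduce functions from the sample above:
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};
var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next());
  }
  context.write(key, sum);
};

// Run map over one sample line, collecting the emitted pairs.
var mapped = [];
map(null, "The quick brown fox. The FOX!", {
  write: function (k, v) { mapped.push([k, v]); }
});

// Group the pairs by key, then run reduce over each group's values.
var groups = {};
mapped.forEach(function (pair) {
  (groups[pair[0]] = groups[pair[0]] || []).push(pair[1]);
});
var results = {};
Object.keys(groups).forEach(function (word) {
  var vals = groups[word], i = 0;
  reduce(word, {
    hasNext: function () { return i < vals.length; },
    next: function () { return vals[i++]; }
  }, {
    write: function (k, v) { results[k] = v; }
  });
});
console.log(results); // { the: 2, quick: 1, brown: 1, fox: 2 }
```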

In the map portion, the script splits the source text into individual words; in the reduce portion, identical words are grouped and then counted. Finally, an output (summary) file with the top words by count (and the count of those words) is produced. To run this WordCount job directly from the interactive JavaScript console, start with the pig keyword to indicate you want to run a Pig job. Next, call the from method, which is where you pass in the location of the source data. In this case, I’ll perform the operation on data stored remotely—in AWS S3.

Now you call the mapReduce method on the Pig job, passing in the name of the file with the JavaScript code for this job, including the required parameters. The parameters for this job are the method of breaking the text—on each word—and the value and data type of the reduce aggregation. In this case, the latter is a count (sum) of data type long.

You then specify the output order using the orderBy method, again passing in the parameters; here the count of each group of words will be output in descending order. In the next step, the take method specifies how many aggregated values should be returned—in this case the 10 most commonly occurring words. Finally, you call the to method, passing in the name of the output file you want to generate. Here’s the complete syntax to run this job:

pig.from("s3n://HadoopAzureTest/Books").mapReduce("WordCount.js", "word, count:long").orderBy("count DESC").take(10).to("DaVinciTop10Words.txt")
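
To make the chain concrete, here’s what orderBy("count DESC").take(10) computes, expressed in plain JavaScript over made-up sample counts (a sketch of the semantics, not the Pig runtime):

```javascript
// What orderBy("count DESC").take(10) computes, in plain JavaScript.
// The sample counts are invented for illustration; the real values come
// from the reduce output of the WordCount job.
var counts = [
  { word: "the", count: 1203 },
  { word: "of", count: 870 },
  { word: "and", count: 750 },
  { word: "leonardo", count: 44 }
];

var top = counts
  .slice() // don't mutate the original relation
  .sort(function (a, b) { return b.count - a.count; }) // count DESC
  .slice(0, 10); // take(10)

console.log(top[0].word); // "the"
```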


As the job is running, you’ll see status updates in the browser—the percent complete, first for the map portion and then for the reduce portion. You can also click a link to open another browser window, where you’ll see more detailed logging about the job’s progress. Within a couple of minutes, you should see a message indicating the job completed successfully. To further validate the job output, you can then run a series of commands in the JavaScript console.

The first command, fs.read, displays the output file, showing the top 10 words and the total count of each in descending order. The next command, parse, shows the same information and will populate the data variable with the list. The last command, graph.bar, displays a bar graph of the results. Here’s what these commands look like:

js> file = fs.read("DaVinciTop10Words.txt")
js> data = parse(file.data, "word, count:long")
js> graph.bar(data)
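
The parse command splits each tab-separated line of the output file according to the schema string. A rough local approximation in JavaScript (the console’s actual helper may behave differently) looks like this:

```javascript
// Rough approximation of parsing "word, count:long" over tab-separated
// file data, as the console's parse helper appears to do. The real
// helper's behavior may differ; this is a sketch for illustration.
function parseSchema(fileData, schema) {
  var fields = schema.split(",").map(function (f) {
    var parts = f.trim().split(":");
    return { name: parts[0], type: parts[1] || "chararray" };
  });
  return fileData.split("\n").filter(function (line) {
    return line !== "";
  }).map(function (line) {
    var cols = line.split("\t"), row = {};
    fields.forEach(function (field, i) {
      row[field.name] = field.type === "long" ? parseInt(cols[i], 10) : cols[i];
    });
    return row;
  });
}

var rows = parseSchema("the\t1203\nof\t870", "word, count:long");
console.log(rows[0]); // { word: 'the', count: 1203 }
```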


An interesting aspect of using JavaScript to execute MapReduce jobs is the terseness of the JavaScript code in comparison to the Java. The MapReduce WordCount sample Java job contains more than 50 lines of code, but the JavaScript sample contains only 10 lines. The functionality of both jobs is similar.

Using C# with Hadoop Streaming

Another way you can run MapReduce jobs in Hadoop on Azure is via C# Streaming. You’ll find an example showing how to do this on the portal. As with the previous example, to try out this sample you need to upload the needed files (davinci.txt, cat.exe and wc.exe) to a storage location such as HDFS, ASV or S3. You also need the IP address of your Hadoop head node. To get the value using the interactive console, run this command:

js> #cat apps/dist/conf/core-site.xml


Fill in the values on the job runner page; your final command will look something like this:

hadoop jar hadoop-examples-0.20.203.1-SNAPSHOT.jar
-files "hdfs:///example/apps/wc.exe,hdfs:///example/apps/cat.exe"
-input "/example/data/davinci.txt"
-output "/example/data/StreamingOutput/wc.txt"
-mapper "cat.exe"
-reducer "wc.exe"

In the sample, the mapper and the reducer are executable files that read input from stdin, line by line, and emit output to stdout. Together, these files make up a MapReduce job, which is submitted to the cluster for execution. Both the mapper file, cat.exe, and the reducer file, wc.exe, are shown in Figure 3.

The Mapper and Reducer Files
Figure 3 The Mapper and Reducer Files

Here’s how the job works. First, the mapper file launches as a process on mapper task initialization. If there are multiple mappers, each launches as a separate process on initialization. In this case, there’s only a single mapper file—cat.exe. On execution, the mapper task converts its input into lines and feeds those lines to the stdin of the process. Meanwhile, the mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair. The default behavior (which can be changed) is that the key is created from the line prefix up to the first tab character and the value is created from the remainder of the line. If there’s no tab in the line, the whole line becomes the key and the value will be null.
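
That default key/value rule can be expressed in a few lines of JavaScript; this is a sketch of the rule itself, not Hadoop’s implementation:

```javascript
// The default Hadoop streaming rule for turning an output line into a
// key/value pair: the key is the prefix up to the first tab, the value
// is the remainder; with no tab, the whole line is the key and the
// value is null. A sketch of the rule, not Hadoop's implementation.
function splitLine(line) {
  var tab = line.indexOf("\t");
  if (tab === -1) {
    return { key: line, value: null };
  }
  return { key: line.substring(0, tab), value: line.substring(tab + 1) };
}

console.log(splitLine("davinci\t1395667")); // { key: 'davinci', value: '1395667' }
console.log(splitLine("no tab here"));      // { key: 'no tab here', value: null }
```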

After the mapper tasks are complete, each reducer file launches as a separate process on reducer task initialization. On execution, the reducer task converts its key/value input pairs into lines and feeds those lines to the stdin of the process. Meanwhile, the reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the reducer.

Using HiveQL to Query a Hive Table

Using the interactive Web console, you can execute a Hive query against Hive tables you’ve defined in your Hadoop cluster. To learn more about Hive, see hive.apache.org.

To use Hive, you first create (and load) a Hive table. Using our WordCount MapReduce sample output file (DaVinciTop10Words.txt), you can execute the following commands to create and then verify your new Hive table:

hive> CREATE TABLE wordcounttable (word STRING, count BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> LOAD DATA INPATH
'hdfs://lynnlangit.cloudapp.net:9000/user/lynnlangit/DaVinciTop10Words.txt'
OVERWRITE INTO TABLE wordcounttable;
hive> show tables;
hive> describe wordcounttable;
hive> select * from wordcounttable;

Hive syntax is similar to SQL syntax, and HiveQL provides similar query functionality. Keep in mind that all data is case-sensitive by default in Hadoop.

Other Ways to Connect to Your Cluster

Using RDP In addition to working with your cluster via the Web portal, you can also establish a remote desktop connection to the cluster’s NameNode server. To connect via RDP, you click the Remote Desktop button on the portal, then click on the downloaded RDP connection file and, when prompted, enter your administrator username and password. If prompted, open firewall ports on your client machine. After the connection is established, you can work directly with your cluster’s NameNode using the Windows Explorer shell or other tools that are included with the Hadoop installation, much as you would with the default Hadoop experience.

My NameNode server is running Windows Server 2008 R2 Enterprise SP1 on a server with two processors and 14GB of RAM, with Apache Hadoop release 0.20.203.1 snapshot installed. Note that the cluster resources consist of the name node and the associated worker nodes, so the total number of processors for my sample cluster is eight.

The installation includes standard Hadoop management tools, such as the Hadoop Command Shell or command-line interface (CLI), the Hadoop MapReduce job tracker (found at http://[namenode]:50030) and the Hadoop NameNode HDFS (found at http://[namenode]:50070). Using the Hadoop Command Shell you can run MapReduce jobs or other administrative tasks (such as managing your DFS cluster state) via your RDP session.

At this time, you can connect via RDP using only a Windows client machine. Currently, the RDP connection uses a cookie to enable port forwarding. The Remote Desktop Connection for Mac client doesn’t have the ability to use that cookie, so it can’t connect to the virtual machine.

Using the Sqoop Connector Microsoft shipped several connectors for Hadoop to SQL Server in late 2011 (for SQL Server 2008 R2 or later or for SQL Server Parallel Data Warehouse). The Sqoop-based SQL Server connector is designed to let you import or export data between Hadoop on Linux and SQL Server. You can download the connector from bit.ly/JgFmm3. This connector requires that the JDBC driver for SQL Server be installed on the same node as Sqoop. Download the driver at bit.ly/LAIU4F.

You’ll find an example showing how to use Sqoop to import or export data between SQL Azure and HDFS in the samples section of the portal.

Using FTP To use FTP, you first have to open a port, which you can do by clicking the Configure Ports button on the portal and then dragging the slider to open the default port for FTPS (port 2226). To communicate with the FTP server, you’ll need an MD5 hash of the password for your account. Connect via RDP, open the users.conf file, copy the MD5 hash of the password for the account that will be used to transfer files over FTPS and then use this value to connect. Note that the FTPS connection uses a self-signed certificate on the Hadoop server, which might not be fully trusted.

You can also open a port for ODBC connections (such as to Excel) in this section of the portal. The default port number for the ODBC Server connections is 10000. For more complex port configurations, though, use an RDP connection to your cluster.

Using the ODBC Driver for Hadoop (to Connect to Excel and PowerPivot) You can download an ODBC driver for Hadoop from the portal Downloads page. This driver, which includes an add-in for Excel, lets you connect to Hadoop from either Excel or PowerPivot. Figure 4 shows the Hive Pane button that’s added to Excel after you install the add-in. The button exposes a Hive Query pane where you can establish a connection to either a locally hosted Hadoop server or a remote instance. After doing so, you can write and execute a Hive query (via HiveQL) against that cluster and then work with the results that are returned to Excel.


Figure 4 The Hive Query Pane in Excel

You can also connect to Hadoop data using PowerPivot for Excel. To connect PowerPivot to Hadoop, first create an OLE DB for ODBC connection using the Hive provider. On the Hive Query pane, connect to the Hadoop cluster using the connection you configured previously, then select the Hive tables (or write a HiveQL query) and return the selected data to PowerPivot.

Be sure to download the correct version of the ODBC driver for your machine hardware and Excel. The driver is available in both 32-bit and 64-bit editions.

Easy and Flexible—but with Some Unknowns

The Hadoop on Azure beta shows several interesting strengths, including:

  • Setup is easy using the intuitive Metro-style Web portal.
  • You get flexible language choices for running MapReduce jobs and data queries. You can run MapReduce jobs using Java, C#, Pig or JavaScript, and queries can be executed using Hive (HiveQL).
  • You can use your existing skills if you’re familiar with Hadoop technologies. This implementation is compliant with Apache Hadoop snapshot 0.20.203+.
  • There are a variety of connectivity options, including an ODBC driver (SQL Server/Excel), RDP and other clients, as well as connectivity to other cloud data stores from Microsoft (Azure Blobs, the Azure Data Market) and others (Amazon Web Services S3 buckets).

There are, however, many unknowns in the version of Hadoop on Azure that will be publicly released:

  • The current release is a private beta only; there is little information on a roadmap and planned release features.
  • Pricing hasn’t been announced.
  • During the beta, there’s a limit to the size of files you can upload, and Microsoft included a disclaimer that “the beta is for testing features, not for testing production-level data loads.” So it’s unclear what the release-version performance will be like.

To see video demos (screencasts) of the beta functionality of Hadoop on Azure, see my BigData Playlist on YouTube at bit.ly/LyX7Sj.


Lynn Langit (LynnLangit.com) runs her own technical training and consulting company. She designs and builds data solutions that include both RDBMS and NoSQL systems. She recently returned to private practice after working as a developer evangelist for Microsoft for four years. She is the author of three books on SQL Server Business Intelligence, most recently “Smart Business Intelligence Solutions with SQL Server 2008” (Microsoft Press, 2009). She is also the cofounder of the non-profit TKP (TeachingKidsProgramming.org).

Thanks to the following technical expert for reviewing this article: Denny Lee