
Azure SDK 2.5 - Hadoop Made Easier for Microsoft Developers

By Omid Afnan | January 2015

This year Microsoft announced the availability of HDInsight, its Big Data cloud offering. With this service you’re able to create Hadoop, HBase and now Storm clusters within Microsoft Azure. By using the Azure Management Portal, you can follow some simple steps to create a Hadoop cluster, associate Azure storage with it and choose the number of cores you need. With the click of a button, within minutes you have a cluster provisioned with the necessary computation resources and a 100 percent Apache Hadoop distribution up and running in the cloud. That’s it: No hardware procurement, no sorting through Hadoop stacks, no need to learn how to install Hadoop. You’re ready to run Big Data queries!

You may be saying: That’s great … now what? There are a number of common cases people are trying to address with Big Data. You or your organization may have some specific cases in mind, like Internet of Things (IoT) telemetry crunching, social sentiment analysis or user preference analysis. A case that’s interesting to many of our customers is that of analyzing Web services or application logs. In this article I’m going to show you how this scenario can be bootstrapped using HDInsight and how you can use the newly released Microsoft Azure HDInsight Tools for Visual Studio to simplify your development experience.

The Log Analysis Scenario

Most apps and services produce a large trail of data intended for tracking usage, understanding failures or managing the execution environment. Data streams such as error logs, software quality management (SQM) output, click-stream instrumentation and Web server logs can easily accumulate at the rate of gigabytes or terabytes a week. With the exception of debugging application failures or service outages, these logs are often left untouched due to their sheer size and the semi-structured nature of their contents.

However, many organizations are beginning to tackle this problem in order to discover quality issues more quickly, understand customer usage better and, ultimately, drive up customer engagement and satisfaction. A common base case is the analysis of Web server logs to derive two kinds of information:

  • Frequency of particular errors happening on the Web site
  • Traffic patterns into or out of the Web site

Parsing and aggregating this log data is an activity that allows a high degree of parallelism. The ideal would be to have subsets of records farmed out to individual servers that can parse, transform and summarize those records. Then you’d merge these partial results to create the final aggregated data. This is, of course, exactly what the MapReduce engine in Hadoop does and the Hive Query Language (HiveQL) allows you to express your target aggregations in a SQL-like query syntax. As your data grows larger with increased usage, this approach scales effectively to keep overall processing time under control.
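
For instance, a minimal HiveQL sketch along these lines (using a hypothetical table of parsed log records called weblogs, similar to the one defined later in this article) counts requests per HTTP status code:

-- Count requests per HTTP status code; Hadoop executes this as a MapReduce job.
SELECT sc_status, count(*) AS hits
FROM weblogs
GROUP BY sc_status;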

Programming against Hadoop MapReduce is made significantly easier with high-level languages like HiveQL. The HDInsight tools in Visual Studio bring the management of this kind of code into a standard framework familiar to all developers. The tool also exposes information related to the Hadoop cluster and the data artifacts in ways that are easy to understand. In the rest of this article I’ll walk through a simple log analysis case and highlight how to use the language and the tool to get to a working solution quickly.

Getting Set Up

I’m going to start by creating an HDInsight cluster for running my queries. The cluster is needed only when running queries and can be deleted when not in use, or it can be left running (in which case you continue to incur charges). I’ll put my persistent data on Azure storage, which is created separately and can be used with many different clusters over time. Figure 1 shows where this is done in the Azure Management Portal.

Figure 1 Creating a Hadoop Cluster in the Microsoft Azure Management Portal

After the cluster is created it will show up in the Management Portal, where you can view or delete it as needed. Note that there’s also a Query Console, which you can open from the Management Portal to submit simple queries directly from the browser.

At this point you should have already installed the Azure SDK 2.5, which includes the HDInsight Tools for Visual Studio. When you navigate to the File | New | Project screen, you’ll find a new category for HDInsight projects. Like other Visual Studio projects, the Hive Application project type enables the easy creation of the related files needed for specifying a working application in a given language—in this case HiveQL. You’ll also notice a Hive sample project that contains the code example presented in this article. Using the Hive project lets you easily group, save, modify and share related queries.

Getting the Data

Now, in Visual Studio, open Server Explorer to browse Azure services and objects in your subscription. Use this to navigate to the Azure | HDInsight node where you can find your recently created Hadoop cluster (called WeblogAnalytics in this example). Expand this node to find two children: Hadoop databases and the default blob storage you associated with the account at creation time. The blob storage now has a default container with the same name as your cluster, and Hadoop applications can read and write to this container and its subfolders using the URIs provided.

You can now copy or load your data into this container and then reference it simply by supplying the correct URI. If your Web site is hosted outside of Azure, use AzCopy to move the desired logs into this storage container as needed. If your data already lives in Azure, you can access it without any further copying, which is especially convenient if you’re using Azure Websites to host your application. In that case, at cluster creation time, choose the storage account containing your Web site data as the associated storage account for your cluster. For even more flexibility, you can attach multiple storage containers using the HDInsight custom create option. In these cases you can easily access the required files with a reference such as:

wasb://container@storage.blob.core.windows.net/example/data/filename.ext

For this article I’m going to use some log data that’s available as part of the sample code in HDInsight. You can navigate to the HDISamples folder in the default container where you’ll find a set of folders containing samples, including the file:

HDISamples/webloganalyticsstorage/webloganalytics/HdiSamples/WebsiteLogSampleData/SampleLog/909f2b.log

Once the raw data is available, you can turn it into a Hive table that other Hive queries can reference as a rowset from which to select rows. To do this you create it as an EXTERNAL table, which is a metadata operation in Hadoop and doesn’t change the data on disk. The only step you have to take is to define the column types and name them so you can refer to the data in the familiar SQL-style fashion. This is an important aspect of the Big Data model: the binding between your data and its schema happens late, at query time. There’s no need to create schemas up front, or to worry about cleansing your data at load time. In fact, nothing checks that the rows in your data adhere to the schema you define. Instead, it’s your job to build your queries so that different versions of the data, or missing data, are handled as your business logic requires.
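
For instance, with the default text format, fields that can’t be parsed as their declared type simply come back as NULL, so a query can do its own defensive filtering. Here’s a minimal sketch against the weblogs table defined below:

-- Skip rows whose status code didn't parse as an int (Hive returns NULL for such fields).
SELECT c_ip, sc_status
FROM weblogs
WHERE sc_status IS NOT NULL;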

Again, you can use the Visual Studio IDE to simplify this task. Figure 2 shows the kind of table creation tool you may have seen in SQL Server tools. You can simply define table columns by adding entries to the Table Columns area, providing a name, data type and additional comments. Beyond a name, nothing else is needed to create a basic, empty table definition. However, in this case, I want to apply this definition to an existing file, so I use the External table property to specify the URI mentioned earlier.

Figure 2 Creating a Hive Table Definition

Metadata operations are achieved by running DDL statements in the Hadoop cluster. The create table dialog shows the DDL generated, which in this case works out to:

CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (s_date date, s_time string,
  s_sitename string, cs_method string, cs_uristem string,
  cs_uriquery string, s_port int, cs_username string, c_ip string,
  cs_useragent string, cs_cookie string, cs_referer string, cs_host string,
  sc_status int, sc_substatus int, sc_win32status int,
  sc_bytes int, cs_bytes int, s_timetaken int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 'wasb://webloganalytics@webloganalyticsstorage.blob.core.windows.net/HdiSamples/WebsiteLogSampleData/SampleLog/'
TBLPROPERTIES ('skip.header.line.count'='2');

The DDL includes a couple of extra items. In this example, the sample file is using spaces to separate column values in each line, so I specify the field delimiter to be a space using the ROW FORMAT clause. Hive also allows you to set some specific metadata values related to a table. In the last line of the DDL I’m indicating that the text file contains two rows of header information that can be skipped. This DDL can be cut and pasted into another HiveQL script, or you can push the submit button to kick off the table creation. This results in a task being sent to the Hadoop cluster that will create the table. In Visual Studio you’ll see a job status window, which you can refresh to track the task’s completion status. Once the task is done, you can use Server Explorer to see the new table, browse its definition or view the first 100 rows (from the context menu of the database).
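
If you’d rather verify the new table from a HiveQL script than from the context menu, a quick sanity check (just a sketch) gives you the same kind of preview:

-- Preview the first 100 rows of the newly defined external table.
SELECT * FROM weblogs LIMIT 100;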

Cooking Data into Information

One of the key things you’ll want to know is how well your Web site is working. A starting point might be to look at how often users are getting client errors on their HTTP requests. You can take a look at the different kinds of errors received by different source referrers, group them by error type and list them by most impacted (highest error count). You’ll want to keep these results, so let’s put them in a new Hive table, which you can do using a CREATE EXTERNAL TABLE statement. This time you’ll populate the table by inserting calculated values using a select statement. Figure 3 shows what the code looks like.

Figure 3 Inserting Calculated Values

DROP TABLE IF EXISTS ClientErrors;               
--create table ClientErrors for storing errors users experienced and their frequencies
CREATE EXTERNAL TABLE ClientErrors(sc_status int, cs_referer string, cs_page string, cnt int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
--populate table ClientErrors with data from table weblogs
INSERT OVERWRITE TABLE ClientErrors
SELECT sc_status,
  cs_referer,
  concat(cs_uristem,'?', regexp_replace(cs_uriquery,
  'X-ARR-LOG-ID=[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}','')) cs_page,
  count(distinct c_ip) as cnt
FROM weblogs
WHERE sc_status >=400 and sc_status < 500
GROUP BY sc_status,
  cs_referer,
  concat(cs_uristem,'?', regexp_replace(cs_uriquery,
  'X-ARR-LOG-ID=[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}',''))
ORDER BY cnt desc;

You’ll notice I’m doing some cleanup in the data. For the cs-uri-query field, I want to see any query string that was passed in with the HTTP request, but I don’t care about the X-ARR-LOG-ID information stored in the logs, so I replace that portion of the string with a blank. Because this is a declarative query statement and not a procedure, there’s no way to factor out the transformation applied to the cs-uri-query field and reuse it in both the SELECT and GROUP BY clauses. However, I can count on the query execution runtime to handle the repeated expression efficiently.
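
If you want to see what this cleanup does before running the full aggregation, you can try the transformation on its own. The following is just a sketch against the weblogs table, showing the raw and cleaned query strings side by side:

-- Compare the raw cs_uriquery value with the cleaned version used for cs_page.
SELECT cs_uriquery,
  regexp_replace(cs_uriquery,
    'X-ARR-LOG-ID=[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}','') AS cleaned_query
FROM weblogs
LIMIT 10;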

I’d further like to look at the top referrers for my Web site. To do this I can simply select the cs_referer field from the IIS logs and count the number of referrals by day and referrer. Once again, I want to keep this data for further use, so I create another Hive table called RefersPerDay and store the query results there. Figure 4 shows how to do the table creation and data insertion.

Figure 4 Creating the Table and Inserting Data

DROP TABLE IF EXISTS RefersPerDay;
--create table RefersPerDay for storing references from external Web sites
CREATE EXTERNAL TABLE IF NOT EXISTS RefersPerDay(year int, month int, day int,
  cs_referer string, cnt int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
--populate table RefersPerDay with data from the weblogs table
INSERT OVERWRITE TABLE RefersPerDay
SELECT year(s_date), month(s_date), day(s_date),
  cs_referer, count(distinct c_ip) as cnt
FROM weblogs
WHERE sc_status >=200 and sc_status <300
GROUP BY s_date, cs_referer
ORDER BY cnt desc;

All of the code in Figure 3 and Figure 4 can be put into a single HiveQL file. You can now create a new Hive application project, enter this code and run the queries. The Hive project allows you to save your query for later use. These examples are quite simple, but they’re based on the actual format of IIS logs. Therefore, they should work against your actual server logs and serve as a basis for creating more complex analyses.

Of course, running the queries means submitting them to your Hadoop cluster in Azure. To do this, you have to choose the cluster where you want to submit the job. The HDInsight toolbar allows you to select your cluster from a list of clusters available across your subscriptions. As you can see in Figure 5, I selected the WeblogAnalytics cluster. Choosing Submit will send the query to that cluster. Here I’m showing an advanced submit, where you can add a friendly name for the resulting job that will appear in the cluster. This makes it easier to track running queries. You can also add further arguments for the Hive processor.

Figure 5 Submitting a Query

As soon as the query is submitted, you’ll see a job summary window like the one in Figure 6, indicating the current state of your query as it’s initialized, starts to run, and eventually completes or fails. Your Hive queries are turned into MapReduce tasks that can be fanned out across a large number of nodes for execution. This is where the inherent power of the Hadoop processing model kicks in to allow massive scaling. In the job summary view, you can watch the job move through the percentages of the map and reduce phases until all the queries that were submitted together have completed.

Figure 6 The Hive Job Summary Window

When the job finishes, you can review the associated query and look at job output and exceptions or errors through the links under Job Details. Of course, the queries put summarized data into new tables, and you can go back to Server Explorer to look at those tables and their content. You can also run new queries that use the data in these new tables for further summarizing or aggregation.
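
For example, a follow-up query over the RefersPerDay table might roll the daily counts up to rank referrers overall; this is just a sketch of that kind of second-stage aggregation:

-- Aggregate the daily summary table to rank referrers across all days.
SELECT cs_referer, sum(cnt) AS total_referrals
FROM RefersPerDay
GROUP BY cs_referer
ORDER BY total_referrals DESC
LIMIT 10;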

Wrapping Up

With HDInsight you can quickly and easily access large amounts of data you’ve generated or transferred to the cloud. You can spin up Hadoop clusters to process the data, accessing the data from anywhere in Azure. Hive allows you to specify the structure of your data and to evolve with the changes that occur in your data format over time. HDInsight Tools for Visual Studio further allows you to explore your Hadoop clusters and data in intuitive ways. More important, it allows you to create Hive query projects and manage them like any other code, helping you move from simple, working queries to complex information processing over time. Aggregate data created using HDInsight is best visualized through tools such as Power BI, which offer direct connection to HDInsight sources. Other capabilities, like machine learning, can also be used in conjunction with these tools to extend the analytic possibilities. Watch this space for more on solving Big Data problems using Azure services.


Omid Afnan is a principal program manager in the Azure Big Data team working on implementations of distributed computation systems and related developer tool-chains. He lives and works in China. Reach him at omafnan@microsoft.com.

Thanks to the following technical experts for reviewing this article: Duc Le and Tony Murphy
Duc Le (Microsoft). Duc Le is a software developer in the Microsoft HDInsight Big Data team. Before joining Microsoft, he was a graduate student doing research on programming languages and software testing. He likes reading and especially enjoys Sir Arthur Conan Doyle.

Tony Murphy (Microsoft). Tony Murphy has been with Microsoft since 2007, working first on the SQL Server engine before moving to the SQL Azure team. He is a founding member of the HDInsight team at Microsoft.