An End-to-End HDInsight Demo

I’ve got one last post in me before vacation starts…

By now, the big data buzzword fatigue has set in and you’ve probably completed 1,000 demos that highlight the different parts of the Hadoop ecosystem. So why spend time on yet another HDInsight demo? It’s pretty simple, my customers repeatedly ask for an end-to-end demo to help them understand how it fits in with their existing data warehouse/BI investment.

So this blog series will be different. The focus will not be on training or explaining concepts per se, rather we will zero in on a functional demo that you can play with and share. For full disclosure, this demo is loosely based on an existing Azure Tutorial. I’ve attempted to beef up the tutorial to make it a little bit more “real world”.

You can get the zipped up demo (PowerShell and HQL) files HERE.

About this Demo

So what does this demo actually do. The first iteration of this demo is pretty basic as we will build onto the script provide in later posts. Version 1 of this demo performs the following steps:

  1. Downloads and prepares three-years (configurable) of demo data before storing it in a dedicated Azure Blob Storage account. The blob storage account and the associated container will be created if it does not exist.
  2. Creates a storage account and container for the HDInsight Cluster (if they do not exist).
  3. Creates a SQL database server instance and SQL database (if they do not exist) for the Hive Metastore. This step will also create the necessary SQL Database Server firewall rules so that you can access the server directly.
  4. The HDInsight cluster is provisioned and configured for access to the sample data.
  5. A partitioned Hive table is created over top of the sample data

Environment Set-Up

Much of this tutorial will use the Microsoft Azure PowerShell libraries. You will need to set-up, install and configure the libraries in order to use the provided PowerShell scripts. The demo also assumes that you have imported your Azure publish settings as discussed in the link above.

Downloading the Demo Data

For this demo, we will use the Airline On-Time Performance data published by Bureau of Transportation Statistics. The full data sets are available for download in zip compression format which each monthly file being in the 20-25Mb size range. I will not go into detail on the steps required to download and prepare the demo data in this specific post, I have another post that talks all about it. Feel free to review that post HERE.

The script as provided will download, prepare and then store three years of airline performance data to the configured storage account. You can tweak these values to either use more or less data. The process to download this data only needs to be run once and took around 30 minutes to complete.

Configuration

The script parameters are all located in the demo.ps1 file. In this file you will need to at the minimum provide:

  • HDInsight Storage Account Name
  • HDInsight Container Name
  • Azure Location
  • HDInsight Cluster Name
  • SQL Database Server Name (set to an empty-string “”, to create a new SQL Database Server)
  • Your External IP Address
  • Sample Data Storage Account Name
  • Sample Data Storage Container Name
  • The local file system path where you extract the demo files

You do not need to pre-create these storage accounts and containers or anything else in your Azure account. The script will handle the creation process if the items do not exists.

If you use the empty-string to create the database server, copy the server name out of the output window and into the appropriate variable so that the script can re-use the database server on subsequent runs. Also note, there are a number of other setting including usernames, passwords, etc which can be tweaked if you so desire.

Exploring the Results & Next Steps

After the script completes, you have a fully-functional HDInsight environment to play with. You can query the sample data directly through the Hive table using the following query to see the total count of demo records by year:

 SELECT COUNT(*), Year FROM RawAirlinePerformanceData GROUP BY Year

image

image

After you are finished, simply delete your HDInsight cluster to prevent it from running up a bill or eating through your MSDN credit since you pay for compute time when its running. You can re-run this script as many times as necessary. When running this script on subsequent iterations, comment out the Load Sample Data and Hive Table Configuration sections in the demo.ps1 file.

In the next iteration of this series, we will refine our demo by incorporating more pieces of the Hadoop puzzle. These include

  • Extending Hive through views
  • Adding a Pig job to process our sample data
  • Using Sqoop to move summarized or aggregated data to a SQL Server database
  • Building an Oozie workflow to demonstrate orchestration within Hadoop

Feel free to share this or leverage this demo as needed but note that this script contains no error checking/handling and is provided as-is. I would also love to hear from you on what you would like to see in a demo like this as I hope it is ultimately useful in helping you explore the Microsoft Azure HDInsight Hadoop implementation.

Till next time!

Chris