Using Hadooop (HDInsight) with Microsoft - Two (OK, Three) Options
Microsoft has many tools for “Big Data”. In fact, you need many tools – there’s no product called “Big Data Solution” in a shrink-wrapped box – if you find one, you probably shouldn’t buy it. It’s tempting to want a single tool that handles everything in a problem domain, but with large, complex data, that isn’t a reality. You’ll mix and match several systems, open and closed source, to solve a given problem.
But there are tools that help with handling data at large, complex scales. Normally the best way to do this is to break up the data into parts, and then put the calculation engines for that chunk of data right on the node where the data is stored. These systems are in a family called “Distributed File and Compute”. Microsoft has a couple of these, including the High Performance Computing edition of Windows Server. Recently we partnered with Hortonworks to bring the Apache Foundation’s release of Hadoop to Windows. And as it turns out, there are actually two (technically three) ways you can use it.
(There’s a more detailed set of information here: http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data.aspx, I’ll cover the options at a general level below)
First Option: Windows Azure HDInsight Service
Your first option is that you can simply log on to a Hadoop control node and begin to run Pig or Hive statements against data that you have stored in Windows Azure. There’s nothing to set up (although you can configure things where needed), and you can send the commands, get the output of the job(s), and stop using the service when you are done – and repeat the process later if you wish.
(There are also connectors to run jobs from Microsoft Excel, but that’s another post)
This option is useful when you have a periodic burst of work for a Hadoop workload, or the data collection has been happening into Windows Azure storage anyway. That might be from a web application, the logs from a web application, telemetrics (remote sensor input), and other modes of constant collection.
You can read more about this option here: http://blogs.msdn.com/b/windowsazure/archive/2012/10/24/getting-started-with-windows-azure-hdinsight-service.aspx
Second Option: Microsoft HDInsight Server
Your second option is to use the Hadoop Distribution for on-premises Windows called Microsoft HDInsight Server. You set up the Name Node(s), Job Tracker(s), and Data Node(s), among other components, and you have control over the entire ecostructure.
This option is useful if you want to have complete control over the system, leave it running all the time, or you have a huge quantity of data that you have to bulk-load constantly – something that isn’t going to be practical with a network transfer or disk-mailing scheme.
You can read more about this option here: http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data.aspx
Third Option (unsupported): Installation on Windows Azure Virtual Machines
Although unsupported, you could simply use a Windows Azure Virtual Machine (we support both Windows and Linux servers) and install Hadoop yourself – it’s open-source, so there’s nothing preventing you from doing that.
Aside from being unsupported, there are other issues you’ll run into with this approach – primarily involving performance and the amount of configuration you’ll need to do to access the data nodes properly. But for a single-node installation (where all components run on one system) such as learning, demos, training and the like, this isn’t a bad option.
Did I mention that’s unsupported? :)
You can learn more about Windows Azure Virtual Machines here: http://www.windowsazure.com/en-us/home/scenarios/virtual-machines/
And more about Hadoop and the installation/configuration (on Linux) here: http://en.wikipedia.org/wiki/Apache_Hadoop
And more about the HDInsight installation here: http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT-PREVIEW
Choosing the right option
Since you have two or three routes you can go, the best thing to do is evaluate the need you have, and place the workload where it makes the most sense. My suggestion is to install the HDInsight Server locally on a test system, and play around with it. Read up on the best ways to use Hadoop for a given workload, understand the parts, write a little Pig and Hive, and get your feet wet. Then sign up for a test account on HDInsight Service, and see how that leverages what you know. If you're a true tinkerer, go ahead and try the VM route as well.
Oh - there’s another great reference on the Windows Azure HDInsight that just came out, here: http://blogs.msdn.com/b/brunoterkaly/archive/2012/11/16/hadoop-on-azure-introduction.aspx