Microsoft and Hadoop for Big Data

Despite common misconceptions, Microsoft now has extensive interoperability with open source technologies. For example, you can run a PHP application on Azure, get support from us to run Red Hat, SUSE or CentOS on Hyper-V, and manage your applications from System Center. So extending this approach to the world of big data with Hadoop is a logical step, given the pervasiveness of Hadoop in this space.

Hopefully you're reading this because you have some idea of what big data is. If not, it is basically data that is an order of magnitude bigger than you can store, that changes very quickly, and that is typically made up of different kinds of data you can’t handle with the technologies you already have: for example web logs, tweets, photos, and sounds. Traditionally we have discarded this information as having little or no value compared with the investment needed to process it, especially as it is often not clear what value is contained in it. For this reason big data has been filed in the too-difficult drawer, unless you are a megacorp or a government.

However, after some research by Google, an approach to attacking this problem called MapReduce was born. Map is where the structure for the data is declared, for example pulling out the actual tweet from a Twitter message, along with the hashtags and other useful fields such as whether this is a retweet. The reduce phase then pulls out meaning from these structures, such as digrams (the key two-word phrases), sentiment, and so on.
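
To make the two phases a bit more concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the sample tweets, their field names, and the function names are all made up for illustration. The map step pulls the hashtags out of each tweet, and the reduce step totals them up.

```python
from collections import defaultdict

# Hypothetical sample records; the field names are assumptions for illustration.
TWEETS = [
    {"text": "Loving #hadoop on #azure", "retweet": False},
    {"text": "RT big data with #hadoop", "retweet": True},
    {"text": "No tags here", "retweet": False},
]

def map_tweet(tweet):
    """Map: impose structure on one raw record, emitting (hashtag, 1) pairs."""
    for word in tweet["text"].split():
        if word.startswith("#") and len(word) > 1:
            yield word.lower(), 1

def shuffle(pairs):
    """Group the emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce: collapse all the values for one key into a single result."""
    return key, sum(values)

def run_job(tweets):
    """Run map, shuffle and reduce in sequence over a list of tweets."""
    pairs = (pair for tweet in tweets for pair in map_tweet(tweet))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    print(run_job(TWEETS))
```

Swapping the body of `map_tweet` to emit two-word phrases instead of hashtags would give a rough digram count with the same reduce step.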

Hadoop uses MapReduce, but the key to its power is that it applies these MapReduce concepts on large clusters of servers by getting each node to run the functions locally, thus taking the code to the data to minimise IO and network traffic, using its own file system, HDFS. There are lots of tools in the Hadoop armoury built on top of this, notably Hive, which presents HDFS as a data warehouse that you can run SQL-like queries against, and the Pig (Latin) language, where you load data and work with your functions.
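
The idea of taking the code to the data can also be sketched in plain Python. This is a simulation under stated assumptions, not the Hadoop API: the `NODES` dictionary stands in for blocks of a web log held on different servers, and `count_locally`, `merge` and `run_cluster` are hypothetical names. Each node counts requests in its own block, so only the small per-node totals would need to cross the network.

```python
from collections import Counter

# Hypothetical stand-in for blocks of a web log stored on three servers.
NODES = {
    "node1": ["GET /index.html", "GET /about.html"],
    "node2": ["POST /login", "GET /index.html"],
    "node3": ["GET /index.html"],
}

def count_locally(lines):
    """Runs on the node that holds the data: count requests per URL path."""
    return Counter(line.split()[1] for line in lines)

def merge(partials):
    """Runs once, centrally: add up the small per-node summaries."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def run_cluster(nodes):
    """Simulate the job: local counts per node, then one central merge."""
    # Only each node's Counter is passed onwards, never the raw log lines.
    partials = [count_locally(lines) for lines in nodes.values()]
    return merge(partials)

if __name__ == "__main__":
    print(run_cluster(NODES).most_common())
```

In a real cluster the raw blocks would be far bigger than the summaries, which is where the saving in IO and network traffic comes from.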

What Microsoft are developing, in conjunction with Hortonworks, a leading Hadoop developer, is integration with Hadoop to make it more enterprise friendly:

  • an ODBC driver to connect to Hive
  • an add-in for Excel to query Hive
  • the ability to run Hadoop as a service on Windows Server
  • the ability to run Hadoop on Azure, and thus create clusters as and when you need them, and use Azure’s massive connectivity to the internet to pull data in there rather than choke the bandwidth to your own data centre.
  • F# programming for Hadoop. F# is a functional programming language that data scientists understand, in the same way as I learned Fortran in my distant past as an engineering student.

At the time of writing these tools are still in development and there is only “by invitation” admission to Hadoop on Azure. However, I wanted to write this up now after a talk I gave a couple of weeks ago at the Cloud World Forum. Looking at the deck in isolation doesn’t really help, as I don’t tend to use PowerPoint to repeat what I am saying!

23 March 2013: This post has been superseded by my post on HDInsight, as that is the new name of the tools that have now been released to public beta.