Despite common misconceptions Microsoft now has extensive interoperability with open source technologies for example you can run a php application on Azure, get support from us to run RedHat, SUSE or CentOs on Hyper-V and manage your applications from System Center. , So extending this approach to the world of big data with Hadoop is a logical step given the pervasiveness of Hadoop in this space.
Hopefully your reading this because you have some idea of what big data is. If not it is basically an order of magnitude bigger than you can store, it changes very quickly and is typically made up of different kinds of data that you can’t handle with the technologies you already have. For example web logs, tweets, photos, and sounds. Traditionally we have discarded this information as having little or no value compared with the investment needed to process it, especially as it often not clear what value is contained in this information. For this reason big data has been filed in the too difficult drawer, unless you are megacorp or a government.
However after some research by Google, an approach to attacking this problem called map reduce was born. Map is where the structure for the data is declared for example pulling out the actual tweet from a twitter massage, the hashtags and other useful fields such as whether this is a retweet. The Reduce phase then pulls out meaning from these structures such as digrams ( the key 2 word phrases) sentiment, and so on.
Hadoop uses map reduce but the key to its power is that it applies the map reduce concept on large clusters of servers by getting each node to run the functions locally, thus taking the code to the data to minimise IO and network traffic using its own file system – HDFS. There are lots of tools in the Hadoop armoury built on top of this, notably Hive which presents HDFS as a data warehouse that you can run SQL against and the PIG (latin) language where you load data and work with your functions.
Here a Map function defines what a word is in a string of character and the reduce function then counts the words. Obviously this a bit sledgehammer/nut, but hopefully you get the idea. Also the clever bit is that each node has part of the data and the algorithm to process and then reports back when it’s done with the answers to a controlling node a bit like High Performance Computing and the SQL Server Parallel Data warehouse.
So where does Microsoft fit into this?
The answer is HDInsight which is now in public beta. This is a toolkit developed in conjunction with Horton Works to add integration to Hadoop to make it more enterprise friendly:
- Azure HDInsight is the ability run Hadoop on Azure so you can create clusters and when you need them and use Azure’s massive connectivity to the internet to pull data in there rather than choke bandwidth to your own data centre. There is actually a lot of free and paid for data already in Azure you might want to use as well via the Windows Azure Marketplace including weather data from the Met. Office, UK Crime stats and stats for Greater London.
- You can also run HDInsight as a service on Windows Server 2012 via the Web Platform Installer. Like any web service running on server this can of course be managed from System Center 2012; so you can monitor health and performance and if needs be spin up/shut down HDInsight nodes, just like any other web service under your control. I would see this being used to prototype solutions or work on complex data that you have in house and don’t or can’t want to put up in the cloud.
- An odbc driver to connect to Hive. With this HDInsight just be caome another data source; for example you can pull the data into SQL Server via it’s built in integration services, and squirt it out if needs be too. The dirver means you can directly build analysis services cubes (another part of SQL Server) or use PowerPivot in Excel to explore the data that way.
- The Data Explorer Preview add-in in Excel 2013 to query the HDInsight as well as load of other relational sources
- F# programming for Hadoop. F# is a functional programming language that data scientists understand in the same way as I learned Fortran in my distant past as an engineering student. Note these add-ins are free but are NOT Microsoft tools.
Big Data is definitely happening, for example there was even a special meeting at the last G8 meeting on this as it is such a significant technology. However it cannot be solved in one formulaic way by one technology; rather it’s an approach and in the case of Microsoft a set of rich tools to consolidate, store, analyse and consume: The point being to integrate Big Data into your business intelligence project using familiar tools, the only rocket science being the map reduce bit, and that is the specialism of a data scientist. Some of their work is published by academics so you might find the algorithm you need is already out there - for example the map function to interpret a tweet and pull out the bits you need is on twitter.
However research is going all the time to crack such problems as earthquake prediction, emotion recognition from photographs, ,edical research and so on. If you are interested in that sort of thing world then you might want to go along to the Big Data Hackathon 13/14th April in Haymarket, London, and see what other like minded individuals can do with this stuff.