Bring your data to life
Charlotte asked me to write up something on bringing your ideas to life, and that got me thinking: what is life anyway? Many things don’t look very ‘alive’, because we barely notice them moving or changing over time - coral or lichen for instance. Other life forms move so fast we can barely see them or are so well camouflaged we aren’t aware of them.
Trends in a business can be hidden like this. While we can go on blissfully unaware of what is really happening, and while we are always interested in what isn’t working, we also need to understand what is working. The only way to do this is to dig into the data; it is important to understand the chain of events affecting an outcome and build on that. For example, when I worked for a marketing department, we tried to work out which campaign most influenced a customer’s decision to buy a new car. In a fashion house, it was to work out when to discount a line: too early and you run out of stock when you could have made more profit; too late and you have a warehouse full of last year’s stock, which is in many ways worse.
We could plot all of this information, and now that Power BI has been released, that can work for quite a few scenarios where the KPI or metric we are investigating depends on only a few variables. For example, shoe sales can depend on size and the age of the customer. In more complicated scenarios, however, it can be difficult for three-dimensional humans to visualise on a flat screen the 20 dimensions that may affect an outcome, even if we are given a HoloLens. This is where techniques like data mining can be useful: they can show which of a large number of factors are statistically significant to an outcome, and all we have to do is feed the data into the engine to get the answer.
So where does one get one’s hands on a data mining engine and the compute resources to run something like this, bearing in mind that it may not be clear whether this kind of analysis will even give a credible answer? My suggestion would be to adapt Azure Machine Learning for this; it’s quick to set up, it’s Azure so it’s pay per use, and although there are differences between data mining and Machine Learning, Azure ML has the tools needed for this sort of question.
The difference between Machine Learning and data mining
Before I get to how this might be possible, allow me to explain the difference between Machine Learning and data mining, as this can be a source of confusion. Data mining and Machine Learning both use the same algorithms; what makes them different is that data mining analyses the data as an end in itself, whereas Machine Learning asks a specific question of that data, such as predicting an outcome.
Azure Machine Learning has a wide variety of modules, some of which are in fact data mining tools. While these might not form part of an actual published Machine Learning model (experiments, as they are called), they can do a great job of helping us understand what’s going on in the data. Good examples of this are the statistical and filter based feature selection modules. The feature selection module allows us to set the number of features (dimensions, factors, columns) that we want to use to predict a label (the outcome), as in the screenshot below:
I am looking for the 7 most significant features that will predict whether a flight is delayed by more than 15 minutes (the arrdel15 column). Notice that this module has two outputs: the left one is the data flowing through the module with those seven selected columns, and the right one shows the correlations between those features and the label, which I can visualise.
This shows the column names relevant to flights being late, with their scores in order. I can adapt this module to find out all the scores by simply changing the number of desired features from 7 to 0. I can also select whichever column I want to match against, and which algorithm I want to use for that comparison, from a long list that includes Chi Squared, Pearson, Kendall, Spearman and Fisher (these are standard techniques which can be researched on Wikipedia and in the help on MSDN).
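To give a feel for what a filter based feature selection step does under the covers, here is a minimal sketch in plain Python. It is not the Azure ML module itself: the function names, the feature column names and the eight rows of flight data are all made up for illustration, and it only implements the Pearson option, scoring each column by the absolute value of its correlation with the label.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column tells us nothing about the label
    return cov / sqrt(var_x * var_y)


def select_features(columns, label, k):
    """Rank feature columns by |r| against the label and keep the top k.

    In this sketch, k == 0 returns the scores for every column.
    """
    scores = {name: abs(pearson(values, label)) for name, values in columns.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked if k == 0 else ranked[:k]


# Made-up illustrative data: three hypothetical feature columns and the label.
flights = {
    "dep_delay": [0, 5, 40, 2, 60, 35, 1, 50],
    "distance": [300, 1200, 450, 800, 300, 1000, 650, 900],
    "day_of_week": [1, 2, 3, 4, 5, 6, 7, 1],
}
arr_del15 = [0, 0, 1, 0, 1, 1, 0, 1]  # 1 = arrived more than 15 minutes late

for name, score in select_features(flights, arr_del15, k=0):
    print(f"{name}: {score:.3f}")
```

Swapping `pearson` for another scoring function (Spearman, Chi Squared and so on) changes the ranking method without touching the selection logic, which is roughly the choice the module's algorithm drop-down offers.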
If I use the descriptive statistics module I can get a variety of information about the data as well, and there’s nothing to configure: just run and visualise.
For the elementary statistics, however, I have to choose what sort of statistics I want over a set of columns e.g. Min, Max, Mode, and some which I would say are not so elementary!
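The kind of summary these modules produce can be approximated with Python's standard `statistics` module. This is only a sketch of the idea, not what Azure ML runs: the `summarise` helper and the list of delays are invented for the example.

```python
import statistics


def summarise(values):
    """Return a handful of elementary statistics for one numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Made-up arrival delays in minutes, purely for illustration.
delays = [0, 5, 5, 12, 20, 45, 5, 90]
summary = summarise(delays)

for name, value in summary.items():
    print(f"{name}: {value}")
```

Even on a toy column like this, the gap between the mean and the median hints at a skewed distribution, which is the sort of thing worth knowing before choosing a feature selection method.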
So while Machine Learning can help with data mining, it does mean we need to have some understanding of the maths and statistics behind these disciplines. That effort is worthwhile, though: if the problems and opportunities we are trying to uncover were obvious, we would already have spotted them with simpler tools and techniques. Furthermore, Machine Learning and data mining are not perfect; a statistically significant correlation can sometimes appear without there being any underlying real relationship between two variables, because it is just coincidence. As such, not only do we need some skill in maths, we also need to understand our business.
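The coincidence problem is easy to demonstrate. In the sketch below, every column is pure random noise with no relationship to the outcome at all, yet when there are enough columns and only a few rows, the best of them still shows a correlation that looks interesting. The row and column counts and the seed are arbitrary choices for the illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(42)      # fixed seed so the run is repeatable
n_rows, n_features = 30, 200  # few rows, many candidate factors

# The outcome and every feature are independent random noise.
outcome = [rng.gauss(0, 1) for _ in range(n_rows)]
noise_features = [
    [rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_features)
]

scores = [abs(pearson(column, outcome)) for column in noise_features]
strongest = max(scores)
print(f"strongest |r| among {n_features} random columns: {strongest:.2f}")
```

A feature selection module would happily rank that strongest noise column first, which is why a high score needs a sanity check against what we know about the business before we act on it.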
We need these data professionals
While cloud services and advanced analytics are reducing the need for server huggers to tinker with and maintain performance and reliability in other areas, data professionals with these rare skills are very much in demand. So it might be time for us to reacquaint ourselves with our maths skills and start spending more time understanding the business and data of the organisations we work for.
Applying all of that to myself means I’ll be getting up to speed with SQL Server 2016, learning R as it applies to statistics, and trying to remember all the maths I have half forgotten. To do that I’ll be hanging out on the data professional portal of MVA, doing some free Coursera courses and spending time with our MVP experts at their events. I’ll also be working with our marketing teams and some of our more interesting customers to better understand how to bring data to life in a variety of businesses, all the while sharing my findings on my blog as I go on this journey.
If you haven’t tried Azure Machine Learning yet then we have a pretty good course to get you started. All you need is a web browser, Word to open the lab guide and some curiosity!