Data Mining not dead, but possibly forgotten
With the launch of SQL Server 2008 R2 and SQL Server Denali ctp3 there has been ever more BI functionality built in but nothing has been mentioned about the strange world of data mining. Data mining has been in SQL Server since analysis services was launched but I have rarely seen it being used - it doesn’t seem to have entered mainstream consciousness in the way that some of the cool reporting front end tools have possibly because it smacks a little of black magic, or is seen as unnecessary or hard.
So here’s a quick post to get you thinking and hopefully encourage you to take it for a spin.
What exactly is Data Mining?
The simple answer is the ability to derive insights from your data that you didn’t know were there. Classic examples of use are customers who bought this also bought that (what is called basket analysis in the retail world) and trying to understand what factors influence a purchase.
How does it work?
There are a number of approaches, possibly the most famous being neural networking. In fact there are six different techniques built into analysis services each designed to do a particular job:
- Association is used for that basket analysis I mention and recommending additional items customers may want to purchase.
- Clustering groups such things as customers by detecting what is not obvious form casual observation. You may have heard a story a few years ago of the banks segregating customers into cherries (good customers) through to lemons (bad customers) and this might been done better using clustering rather than a trivial exercise around what revenue they bring in.
- Decision Trees try and predict the outcome of an attribute by understanding what factors determine the value of that attribute. The way the prediction works depends on what is being predicted and whether this is discrete (values such as yes or no, red green or blue and so on) or continuous (e.g. a number like value or price). So this could be used to work out what factors make a up a decision to purchase e.g. age gender, income, homeowner and so on.
- Linear Regression. You may remember this one form matches at school, the process of trying to draw a line through a series of dots to best represent the trend of those dots on a graph. This is used by the decision trees algorithm where the predicted column is continuous.
- Naive Bayes is the method for determine which customers would most likely respond to a marketing campaign, the naive term means that no relationship is assumed between any of the factors affecting the outcome.
- Logistic Regression. Every week you’ll hear something like “women drivers over 40 are 50% less likely to have a car accident” and this would be the algorithm that would be used to discover that relationship. It can also be used to discover which are the most important factors in determining an outcome.
- Sequence Clustering is similar to clustering but the data contains a sequence of events and this enables you to explore such things as a users navigation of a web site (known as click stream analysis).
- Time Series is used for forecasting a continuous value such as sales or profit.
- Neural Network is the non of the above option for really complicated input data. The mechanism for determining the outcome mimics the way neurons work in the brain and doesn’t actually show you how its results are derived at. Text mining is one example that uses this technique.
In all of these techniques you identify a sample set of data which has the values of the object you are trying to predict, from which you create a model. You then apply this model to live data to do your prediction and analysis.
How do I do this?
There are three ways of creating models:
1. You can create mining models in BI development studio as part of an analysis services project in SQL Server 2008 R2 or in the beta of the new version Project Denali and there are examples mining models in the sample adventure works analysis service project from Codeplex:
2. You can code a model directly in Data Mining Expressions (DMX) which might be appropriate for embedding models into an application such as an ecommerce site.
3. There’s an Excel add-in for data mining :
This works by taking the data in a spreadsheet and then posting it to an instance of analysis services for processing and then returning the results back to Excel. Originally designed for Excel 2007 it does work in Excel 2010 but only for 32bit installations, and while it can’t use data in the PowerPivot window in PowerPivot for Excel it can consume data in Excel pivot tables derived from PowerPivot data. Note that when I mention analysis services here, I mean classic analysis services in olap mode not the new shiny tabular based analysis services that also exists in SQL Server Denali.
First of all you are probably licensed for all of this now, you’ll have SQL Server standard or higher around somewhere you can use this as the mining engine and a copy of Excel for the front end (the excel add-in for data mining isn’t a separate license). However this might all seem a bit abstract and you might have a hard time convincing yourself never mind your users that this is worth the effort. One idea I have is what if you took the data out of some of the data you have access to see how this stuff works, for example data from System Center, or whatever data centre management tools you have (dare I mention VSphere?) your helpdesk system or asset management tools to get insight on what is really going on. This might be a useful project in it’s own right but as the data is not sensitive you can share it as an example of what’s possible and possibly impress your manager as well as the business owners.