How can we mine? Let me count the ways...
Recently I received some customer feedback that SQL Server Data Mining "doesn't have enough algorithms." More specifically, the comment was that we have the same capabilities as other Data Mining providers; we just "hide" many facilities as algorithm parameters rather than separating each one out as a named algorithm. So let's count the Microsoft algorithms a few different ways to work this out.
First - let's go by the box. This is the list of algorithms as specified in Books Online:
- Microsoft Decision Trees
- Microsoft Clustering
- Microsoft Naive Bayes
- Microsoft Association Rules
- Microsoft Neural Networks
- Microsoft Time Series
- Microsoft Sequence Clustering
- Microsoft Linear Regression
- Microsoft Logistic Regression
So that's nine - count 'em, nine - algorithms. But that's just one way to count. If you look at my book, Data Mining with SQL Server 2005, written with Zhaohui Tang, there are only seven algorithms! What, you say? How can that be?
Let me explain. During the development of SQL Server 2005, we realized a couple of tricks: 1) linear regression was the same as our tree algorithm, just forced not to split; and 2) logistic regression was the same as our Neural Nets, just with zero hidden layers. However, we got similar feedback - people wanted more algorithms, and specifically these two - so we set up two "new algorithms" by forcibly setting parameters on the Decision Tree and Neural Network algorithms and voila! we shipped with nine named algorithms. It would have been hard to fill up two entire chapters explaining that last sentence, so Zhaohui and I decided just to stick to the seven core algorithms.
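The second trick is easy to check in a few lines of Python. This is only a minimal sketch, not the product's implementation: the sigmoid activation, the `forward` and `logistic_regression` helpers, and all of the numbers are assumptions made up for illustration. It shows that a feed-forward network with no hidden layers and a single output node computes exactly what a logistic regression computes.

```python
import math


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, layers):
    """Feed x through a list of (weights, biases) layers.

    Each layer is a weight matrix (one row per node) plus one bias per node,
    followed by a sigmoid activation (an assumption for this sketch).
    """
    activations = x
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


def logistic_regression(x, coefs, intercept):
    """Plain logistic regression: sigmoid of a weighted sum plus an intercept."""
    return sigmoid(sum(c * v for c, v in zip(coefs, x)) + intercept)


# Hypothetical inputs and coefficients, chosen only for the demonstration.
x = [0.5, -1.2, 3.0]
coefs = [0.8, 0.1, -0.4]
intercept = 0.25

# "Zero hidden layers" means the only layer is the output layer itself.
network_output = forward(x, [([coefs], [intercept])])[0]
regression_output = logistic_regression(x, coefs, intercept)

print(network_output, regression_output)
```

The two printed values are the same number, which is the whole trick: remove the hidden layer and the network collapses into the regression.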
Anyway, this posting isn't really about how to count fewer algorithms; I really wanted to show you how to count more. When we set about designing SQL Server Data Mining, we really and truly tried to make data mining operations simpler. We thought at the time, rightly or wrongly, that the more options end users have, the more complicated and difficult the product would be to use. Therefore, we tried to determine the best behavior in a class, and make more advanced options available through parameters.
For example, take our clustering algorithm. We assumed that if people wanted clustering, they most likely didn't care about the details of the algorithm; they just wanted to get the job done, and those people who wanted more would look for it (the design principle - make the simple things simple, and the complex things possible). So we bundled up different flavors of clustering into a single package that many vendors would have broken apart. So let's start counting with clustering.
Our default clustering behavior is EM (Expectation Maximization) clustering using the Bradley-Fayyad scalable framework.
Setting a parameter changes that to a K-Means clustering implementation using the same framework.
Setting the same parameter another way provides non-scalable versions of the two clustering varieties. (I know it's hard to swallow that the non-scalable versions count as separate algorithms, but if you started with the vanilla versions and added scalability, then of course you would consider those versions as new algorithms - I'm just working backwards here.)
Let's move to our Decision Tree algorithm and we will consider our classification tree as one algorithm.
But our Decision Tree also predicts continuous values and so counts as a regression tree, so we will count that as another algorithm.
Oops! Our Decision Tree also creates full linear regressions at each of the leaf nodes. To get the typical regression tree behavior you need to make sure that none of the continuous inputs have the REGRESSOR flag and you get yet another algorithm.
Oh yeah, our trees allow for multiple targets in each model, allowing the discovery and display of interrelated patterns through our dependency net. I've seen other vendors advertise such functionality as an "algorithm" so there's our #8.
How about collaborative filtering with Trees? Just slap a PREDICT flag on a nested table, and you have a complete recommendation system. Let's call it Associative Trees.
If we're going to count Associative Trees, we also have "Associative Bayes". I guess the multiple target interrelated pattern thing counts here as well.
Association Rules, Apriori style.
It seems odd to count association rules twice since we can do predictions with it, but nobody else does it (or didn't before - correct me if I'm wrong), so Predictive Association Rules makes the cut.
Well, if we're going to go and call predictive association an algorithm, we had better do the same for our clustering algorithm. Granted, clustering doesn't make a great classifier or estimator, but the great Highlight Exceptions functionality of the Data Mining add-ins comes from this ability. Yes, we can do nested table prediction as well with clustering, but I wouldn't recommend it to my mom, so I won't take another four for that.
Neural Networks, Sequence Clustering, Time Series, Linear Regression and Logistic Regression. Yeah, yeah, I could get into varieties here, but I think you get the point.
So by that count, and not being too creative (trust me, I can do more) we're looking at 23 algorithms in SQL Server 2005 Data Mining. There are a few more options coming up in SQL Server 2008 that are worth discussing as well.
The time series of SQL Server 2005 uses the ARTXP algorithm - "Auto Regression Trees with Cross Predict". In 2008, we're adding ARIMA as well, for algorithm #24.
And yet again with Time Series, the default mode of operation is to blend ARTXP and ARIMA results in an intelligent way to maximize accuracy and stability for #25.
So, counting somewhat arbitrarily, there are 23 algorithms in SQL 2005 and 25 in SQL 2008, with the option of teasing out even more varieties depending on how you apply parameters and flags to the base nine (or seven - depending on how you count!). Next time someone quips that SQL Server has only "nine" algorithms, tell them that's just the packaging - each of those nine boxes holds a wealth of value.
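To make the "packaging" point concrete, here is a rough sketch of how the four clustering flavors counted above come out of a single named algorithm. It assumes the CLUSTERING_METHOD values as I recall them from Books Online (1 = scalable EM, 2 = non-scalable EM, 3 = scalable K-Means, 4 = non-scalable K-Means), so check them against the documentation before relying on them. The model names, the columns, and the `clustering_model_dmx` helper are hypothetical and exist only for this example.

```python
# CLUSTERING_METHOD values as recalled from Books Online; verify before use.
CLUSTERING_METHODS = {
    "ScalableEM": 1,
    "NonScalableEM": 2,
    "ScalableKMeans": 3,
    "NonScalableKMeans": 4,
}


def clustering_model_dmx(model_name, flavor):
    """Build a DMX CREATE MINING MODEL statement for one clustering flavor.

    The columns are hypothetical; only the algorithm parameter changes
    from one flavor to the next.
    """
    method = CLUSTERING_METHODS[flavor]
    return (
        f"CREATE MINING MODEL [{model_name}] (\n"
        "    [CustomerKey] LONG KEY,\n"
        "    [Age] LONG CONTINUOUS,\n"
        "    [Income] DOUBLE CONTINUOUS\n"
        ")\n"
        f"USING Microsoft_Clustering (CLUSTERING_METHOD = {method})"
    )


# One statement per flavor: same "box", four different algorithms.
statements = {
    flavor: clustering_model_dmx(f"Customers_{flavor}", flavor)
    for flavor in CLUSTERING_METHODS
}

print(statements["ScalableKMeans"])
```

Every statement names the same Microsoft_Clustering algorithm; the only thing that differs is one integer in the parameter list, which is exactly why the variants are easy to overlook when you count by the box.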