The danger of predictive analytics

As a technical professional working in the data field, I implement small and large-scale data systems every day. Some of those systems record transactional data for immediate processing; others store the data in large collections for later historical analysis. And many of the systems I now design have the goal of predictive analytics - the ability to forecast or predict a future event. (I should mention that there is a statistical difference between forecasting and prediction, but for my purposes here both terms suffice.)

These systems are not necessarily new - they are based on well-grounded statistical and numerical methods that have been in use for some time. Even so, the tools I use to implement them (HDInsight, MDX, R, Python and others such as the Power* add-ins for Microsoft Excel) seem almost magical in their ability to find patterns, perform regressions, and present those trends in a compelling way that shows a company where the next data point will occur. It's almost as magical as this:

There is, however, a danger associated with these systems. Nate Silver, a noted statistician who wrote some great books (highly recommended) and predicted with uncanny accuracy the last few American elections, recently made a rather huge error in predicting a soccer game (futbol if you're not American). One of the critiques I read of the failed prediction spoke about the dangers of working with too many variables and too little data. Another example was the 2013 explosion of a meteor over Chelyabinsk, Russia, where a noted JPL researcher estimated the energy of the explosion at around 20-30 kilotons; the actual figure turned out to be several times that amount. To be fair, he told JPL that his initial estimate was likely to be incorrect, because he lacked both enough base data and the time to calculate it.
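A minimal Python sketch can make the "too many variables, too little data" problem concrete. It is only an illustration under assumptions, not the analysis from the critique mentioned above: the function names, the sample sizes (10 versus 1000 rows), and the 200 candidate variables are all hypothetical choices. Every value is random noise, so any correlation the code finds is, by construction, a pattern that isn't really there.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(n_rows, n_variables, seed=0):
    """Largest |correlation| between a pure-noise target and pure-noise variables.

    Hypothetical helper for illustration: nothing here is related to
    anything else, so a large result is purely an accident of sampling.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_variables):
        candidate = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(candidate, target)))
    return best


if __name__ == "__main__":
    few_rows = strongest_spurious_correlation(n_rows=10, n_variables=200)
    many_rows = strongest_spurious_correlation(n_rows=1000, n_variables=200)
    print(f"10 rows, 200 variables:   strongest correlation = {few_rows:.2f}")
    print(f"1000 rows, 200 variables: strongest correlation = {many_rows:.2f}")
```

With only 10 rows, at least one of the 200 noise variables will usually look strongly related to the target; with 1000 rows the strongest accidental correlation shrinks toward zero. A tool that searches many variables over a small data set will happily report the first kind of "pattern".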

So are "big data", "predictive analytics", "deep learning" and all the rest just new fads, not to be trusted? No, they are not fads. They are actually a continuation of a process organizations have used for hundreds of years. But what *is* true is that, as always, the reliability of the data, the applicability of the data, and the amount of data are as important as, *if not more important than*, the algorithms used to make predictions from that data. As we move forward with our smartphone-level glossy tools, let's not forget the basics. Humans are programmed to recognize patterns (even when they aren't there) and to take the shortest route to a decision, so the old adage "a computer is a device that allows you to make mistakes far faster than with paper and pencil" is something we need to keep in mind.

And so the upshot is this: the new tools and processes at our disposal for analyzing data should be used to inform our decisions - not prescribe them.