The Data Analysis Maturity Model – Level One: Data Collection Hygiene

Data Science and Advanced Analytics are umbrella terms that usually deal with predictive or prescriptive analytics. They often involve Reporting, Business Intelligence, Data Mining, Machine Learning, Deep Learning, and Artificial Intelligence techniques. Most of the time these technologies rely heavily on linear algebra and statistics for their predictions and pattern analysis.

In any foundational mathematics, and especially in statistics, base-data trustworthiness is essential. The result of a calculation is only as good as the information it uses. Depending on the algorithm in use, even small errors in the data can be magnified until the result is untrustworthy. It follows that you should verify the processes and technologies used to collect, store and manipulate the data your downstream predictions rely on before you implement any kind of advanced analytics. However, many companies take that trustworthiness for granted, or ignore it, and rush into creating solutions without verifying their upstream sources. I’ve found that describing a series of “Maturity Models” helps show that there is a progression you should follow to get to effective Advanced Analytics.
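To make the point about magnified errors concrete, here is a minimal sketch in Python. The system of equations and the helper name solve_2x2 are made up for illustration and are not taken from any real workload. It solves the same nearly redundant pair of linear equations twice, once with clean input and once with a single input value off by 0.0001.

```python
def solve_2x2(a11, a12, a21, a22, b1, b2):
    """Solve a11*x + a12*y = b1 and a21*x + a22*y = b2 with Cramer's rule."""
    det = a11 * a22 - a12 * a21
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y

# Hypothetical system: x + y = 2 and x + 1.0001*y = b2.
# The two equations are almost identical, so the system is ill-conditioned.
clean = solve_2x2(1.0, 1.0, 1.0, 1.0001, 2.0, 2.0001)

# Same system, but the collected value b2 is off by 0.0001.
noisy = solve_2x2(1.0, 1.0, 1.0, 1.0001, 2.0, 2.0002)

print("clean input:", clean)  # roughly x = 1, y = 1
print("noisy input:", noisy)  # roughly x = 0, y = 2
```

A change of about 0.005 percent in one collected value moves the answer from roughly (1, 1) to roughly (0, 2). Real models are rarely this extreme, but the mechanism is the same: the algorithm cannot tell a collection error from a real signal.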

In the first level of data analysis maturity, the organization has good data collection “hygiene”. Starting at the collection point, the base data must be consistent and trustworthy.

Although it might not seem important to Data Science and predictive analytics that a web form have proper field validation, being able to trust a prediction algorithm begins here. As data professionals, we often think about Declarative Referential Integrity (DRI), proper data types, and other data hygiene controls at the storage and processing level. That’s very important, of course, but it’s vital to keep a chain-of-custody mindset from the way we collect data, and even the way we verify who we collect it from, through to the end prediction. Often the data comes not only from a Relational Database Management System (RDBMS) but also from “unstructured” (although there’s really no such thing) sources such as text or binary files. Mechanisms starting at the collection point must therefore include Programmatic Referential Integrity (PRI), where you write code to enforce the links between data elements in a file, in addition to relying on the DRI in a data engine. Hygiene at this level also means validating the structure of XML, JSON and other files.
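As a rough illustration of PRI and structural validation at the collection point, the Python sketch below checks a batch of JSON order records against a list of known customers. Everything in it is hypothetical: the sample data, the field names, and the helpers structure_errors and check_orders are assumptions made for the example, not part of any particular product or schema standard.

```python
import json

# Hypothetical sample data standing in for two collected files.
CUSTOMERS_JSON = '[{"customer_id": 1, "name": "Ada"}, {"customer_id": 2, "name": "Grace"}]'
ORDERS_JSON = (
    '[{"order_id": 10, "customer_id": 1, "amount": 25.0},'
    ' {"order_id": 11, "customer_id": 3, "amount": 12.5},'
    ' {"order_id": 12, "customer_id": 2, "amount": "n/a"}]'
)

# Assumed structure of an order record: field name -> accepted type(s).
ORDER_FIELDS = {"order_id": int, "customer_id": int, "amount": (int, float)}


def structure_errors(record, required_fields):
    """Return a list of missing-field and wrong-type problems for one record."""
    errors = []
    for field, expected in required_fields.items():
        if field not in record:
            errors.append(f"missing field {field!r}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field!r} has type {type(record[field]).__name__}")
    return errors


def check_orders(customers_text, orders_text):
    """Map order_id -> problems found; json.loads raises on malformed input."""
    customers = json.loads(customers_text)
    orders = json.loads(orders_text)
    known_ids = {c["customer_id"] for c in customers}

    problems = {}
    for order in orders:
        errors = structure_errors(order, ORDER_FIELDS)
        # The PRI check: every order must point at a customer we collected.
        if order.get("customer_id") not in known_ids:
            errors.append("customer_id has no matching customer")
        if errors:
            problems[order.get("order_id")] = errors
    return problems


problems = check_orders(CUSTOMERS_JSON, ORDERS_JSON)
for order_id, errors in problems.items():
    print(order_id, errors)
```

With this sample data, order 11 is flagged because it refers to a customer that was never collected, and order 12 is flagged because its amount is text instead of a number. A relational engine would enforce the first rule declaratively with a foreign key; for files, the code has to do it.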

At this maturity level, the organization also documents the data flow, referencing the programs that collect the data. For analytics such as medical or financial predictions, the Data Scientist should be able to trace the data back to the way it was collected. In the next article, I’ll explain the second maturity level: reliable data storage and query systems.