The Data Analysis Maturity Model – Level Two: Reliable Data Storage and Query Systems
I'm in a series defining a Data Analysis Maturity Model. In the first level, I described the importance of creating, validating, testing and tracking your base data collection methods. With the source data clearly defined, tracked, documented and verified, the next level of data analysis maturity is to store and process the data correctly.
Data can be stored in text files, binary representations on a file system, in a Relational Database Management System, (RDBMS), a Document Database, Graph Database, or any other system – the point is to match the storage, processing and retrieval not to what the developer wants to use, but to the requirements and constraints the data application requires. If you need the highest level of transaction input/output compliance for Online Transaction Processing (OTLP), you might use an RDBMS to store and process the data. If the data needs to be read concurrently on a massive scale, you might select a scale-out system. Other systems are used for Online Analytic Processing (OLAP) or perhaps biased towards highly related data, which might require a Graph Database. And even an in-memory system such as a cache might be used for the fastest retrieval across a distributed solution. In any case, the storage and processing system should be guided by the requirements for the data, not simply because you have a given technology or your team’s background is familiarity with a given system. And of course you might use more than one technology in a given solution.
Even if you need to use multiple technologies, it can be useful to limit the storage types or vendors so that the team can specialize to a greater degree, ensuring best practices – but this isn’t a requirement. As organizations grow larger, it’s normal to see more data processing systems in use. I often find that in larger organizations, a mature environment has internal “user groups”, defined processes and practices, and an oversight process to ensure that data is stored and processed correctly and to spec. Whatever system is used, it should follow best industry practices, and have a stringent protocol and documentation throughout the solution space.
After the data is stored and processed, a Level Two maturity involves a highly developed query expertise. This should be a skill found not only within the technology teams, but across the organization – for whoever will interact with data. We don’t often think of executives or line-workers needing to understand set-theory and tuple extractions, but in fact they often do want to learn the way to select data using SQL or some other query language. Training should be voluntary, on-line, and easy to understand. I’ve conducted “Lunch and Learns” for multiple roles within the organizations to show where the data is stored, what users are allowed to access, and the proper way to access it. This might mean teaching folks to use the SQL language, Excel, Graph queries or other technologies. A solid understanding of query techniques – across all users – can solve many problems that “Advanced Analytics” is often brought in to alleviate.
For the IT side of this maturity, it’s important to think about “Views not SQL” - even if the data is not in an RDBMS. This phrase means that the technology team should create abstractions over base data using whatever technology that processing engine supports, to ensure security, consistency, and reliability. The team should then communicate those abstractions using a Data Dictionary that is central, easy to find, and kept up to date. This step is vital, but often overlooked, and not having at least a basic Data Dictionary is the source of most incorrect, conflicting, and erroneous results from analysis. I had once instance in a company where there were no less than five independent systems tracking what was called the “complete inventory”. Needless to say, analyzing that data was impossible until we combined them into a single unit – which was a massive effort. With a properly maintained and advertised Data Dictionary, creating those independent systems might not have happened at all.
In the next Maturity Level, I’ll explain the importance of reporting across the enterprise.