Data Wrangling – ELT not ETL

In Data Science, we most often use Extract, Load and Transform (ELT) as opposed to Extract, Transform and Load (ETL) that you see most often in Business Intelligence (BI). There are a couple of reasons for this.

First, in many BI solutions you have a few data sources that you are integrating into a historical data exploration tool. You need to combine these sources into database structures like cubes. You’re setting up pre-computed set of data for users to explore and get answers to questions, often using visualizations.

To create these combined database structures from disparate data sources, you need to ensure that the data is related through a key field of some sort, and that the data are fairly homogenous – numbers are numbers (not text), text is text (not binary) and that keys fields are the same to link everything up. So you Extract the data from the sources, Transform the data into the right types and shapes, and then Load it into the analysis structures. ETL.

But in Data Science, you’re not browsing data, but looking for an answer. You’ll still extract the data – usually with far more variety and from many more sources – and you’ll probably end up processing it with several tools, from R to Azure ML, or perhaps with Python. Or more likely you’ll use many technologies (yes, SQL as well) to get different parts of the answer, or to get multiple kinds of answers. Since each technology has different aspects (R has about 9 datatypes, SQL Server over 20 for instance) you want to wait a late as possible in each step to change the source data in any way. Every time you change the data you lose fidelity from the source. And base data is the lifeblood of statistical analysis – the base numbers are VERY important. They have to be representative,

So in Data Science, you Extract, then Load, and then only at the last moment do you Transform. The overall ELT process is often called “Data Wrangling”, and has interesting direct and side effects. In the posts that follow, I’ll cover this process in more detail. For now, focus on locating, documenting and understanding the sources of your data.