The Hardest Thing In Data Science
When I started down the path of learning Data Science, I was nervous. I have to work hard at math – it’s a skill I love but one that does not come naturally to me. I was nervous because I thought the most daunting task I would face in Data Science was learning all the algebra, statistics, and other maths I would need to do the job.
But I was wrong.
Math isn’t the hardest thing in Data Science. Actually, since it’s so mature, and documented, and well-known, it’s quite possibly the easiest thing to conquer in the skillset. No, the hardest thing about Data Science is asking the right question.
Wait – what? Surely that’s an easy thing to do – you have something you want to know, and you just ask that, right? Well, no. Many an aspiring Data Scientist is dashed on the rocks of the following process:
- Listen to question
- Select technology to answer question
- Find data
- Use technology over data
But that’s wrong, too. As a Data Scientist, you need to spend time – real time – on that first item (I’ll cover the proper process for Data Science in another post). So what is so hard about asking a question?
Nothing is knowable
There is no certainty in any data. In fact, there’s not even any certainty in reality, but that’s another matter. But most people don’t realize that. Sure, you can count the number of customers, or days, or widgets (maybe) to a degree of accuracy, but making predictions or classifications from those numbers is simply precision guesswork.
The farther out you try to predict, or the less data you have, the worse the prediction or classification. Your audience won’t believe that. They are bathed in news channels with simple graphics, three-line statements from politicians, and deceptive, trigger-based marketing. They want something that is exact, sure, and confident.
What you’ll need to work on here involves two things – one for your methodology and one for your audience. For your methodology, your focus is on reducing the margin of error. To do that you need good quality data, data you fully understand, and lots of it.
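To make the "lots of it" point concrete, here is a minimal sketch of how the margin of error around an average shrinks as the sample grows. It assumes a simple normal approximation (z = 1.96 for roughly 95% confidence), and the widget counts are made-up numbers generated for illustration, not real data.

```python
import math
import random
import statistics


def margin_of_error(sample, z=1.96):
    """Approximate 95% margin of error for the mean of `sample`.

    Uses the normal approximation z * s / sqrt(n); z=1.96 is an assumption
    that suits reasonably large samples.
    """
    n = len(sample)
    s = statistics.stdev(sample)  # sample standard deviation
    return z * s / math.sqrt(n)


# Hypothetical data: daily widget counts drawn from a made-up distribution.
rng = random.Random(0)  # fixed seed so the run is repeatable
population = [rng.gauss(100, 10) for _ in range(10_000)]

# Same data source, progressively larger samples.
for n in (10, 100, 1_000, 10_000):
    sample = population[:n]
    mean = statistics.fmean(sample)
    print(f"n={n:>6}  mean={mean:7.2f}  +/- {margin_of_error(sample):.2f}")
```

The margin shrinks roughly with the square root of the sample size, so you need about a hundred times the data to get an answer that is ten times tighter. That is worth saying out loud to the person asking for more certainty.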
To help with the audience problem, use analogies and stories. Explain the possibility you’re wrong more than the possibility you’re right – something that goes against what you might want to tell the person paying your bills.
You don’t have all the data
“More data beats better algorithms” is kind of true, to a point. For instance, if I had every possible data point for an object, I could simply describe it directly, rather than having to extrapolate with a numerical analysis. But you will never have all the data, because of limits on time and on your ability to gather it.
But you do need more data, and you need better quality data. In Machine Learning, the “features” are the columns of data that predict the “label”, which is the answer you are looking for. Feature selection and data grooming are the parts of the process you should spend the most time on. Once you define the right features, you want a lot of them. More is better.
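If “features” and “label” are new terms for you, a tiny sketch may help. The customer records below are hypothetical (the column names and values are invented for illustration); the only point is to show which columns are the inputs and which one is the answer you want to predict.

```python
# Hypothetical customer records; every column name and value here is made up.
rows = [
    {"visits": 12, "avg_order": 40.0, "years_active": 3, "renewed": 1},
    {"visits": 2, "avg_order": 15.5, "years_active": 1, "renewed": 0},
    {"visits": 7, "avg_order": 22.0, "years_active": 2, "renewed": 1},
]

feature_names = ["visits", "avg_order", "years_active"]  # inputs used to predict
label_name = "renewed"  # the answer we want to predict


def split_features_label(rows, feature_names, label_name):
    """Return (X, y): one feature list per row, and the matching labels."""
    X = [[row[name] for name in feature_names] for row in rows]
    y = [row[label_name] for row in rows]
    return X, y


X, y = split_features_label(rows, feature_names, label_name)
print(X[0], "->", y[0])
```

Choosing what goes into `feature_names` is the feature selection step, and cleaning up the values in `rows` is the data grooming step. Both happen before any algorithm gets involved.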
Your question is way too broad
“Why is my system slow?” or “What is our customer base like?” are questions that are just far too open. Data Science is more accurate when you start by saying to your audience, “When you say that, tell me what you really want to do with the answer.” A better question would be “Among our best customers, what are the social causes they care about the most, so that we can advertise to them at those locations?” Or “When our systems slow down, is that due to human or systemic shortcomings?” and so on. Then don’t stop. Ask why they want to know that. “Because we have a limited set of funds for advertising” is a really good thing to know – the question might then change to “Where should we spend our limited resources for advertising for the highest return?” – or even better, “Are we spending enough on advertising?” See how the question changes when you push back? Push back.
Your audience is impatient
A Data Scientist spends an inordinate amount of time setting and re-setting expectations. Sure, systems like Azure ML and HDInsight make getting answers from data faster than ever – but that’s just the processing part. Question definition, data sourcing, data grooming, testing and experimentation, and interaction development (reports or Cortana) all take time, and in today’s smart-phone app world, people just won’t wait.
But some things are complicated because they’re, well, complicated. They take time. But your audience won’t wait…so what do you do?
Break the problem down. Get as many smaller answers as you can so that you buy time to develop more complete answers. Show results quickly, and qualify that there are better answers coming.
So there you have it. Yes, you need to learn the math. You need to know R, and Python, and Azure ML, and the Data Catalog, and more. But the part that is the hardest has little to do with technology. It’s knowing how to ask a good question.