Examine real-world data

Data presented in educational material is often remarkably clean, chosen to show students clear relationships between variables. Real-world data is rarely that simple.

Because of the complexity of real-world data, raw data has to be inspected for issues before you can reliably put it to use.

A best practice is to process your raw data before use to reduce errors and other issues, ordinarily by removing erroneous data points or modifying the data into a more useful form.

Real-world data issues

Real-world data can contain many different issues that affect the utility of the data and, consequently, your interpretation of the results.

It's important to realize that most real-world data can be influenced by factors that weren't recorded when the data was first collected. For example, we might have a table of race-car track times alongside engine sizes, but various other factors that weren't noted, such as the weather, probably also played a role. When such factors vary more or less randomly from one observation to the next, their influence can often be reduced by increasing the size of the dataset, because the random variation tends to average out.
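The effect of dataset size on unrecorded influences can be illustrated with a small simulation. The sketch below is a toy under stated assumptions, not real race data: it supposes a true average lap time of 92 seconds and models unrecorded factors such as weather as random noise with a standard deviation of 2 seconds. Both numbers, and the function names `simulate_mean_lap_time` and `spread_of_estimates`, are invented for illustration.

```python
import random
import statistics

TRUE_MEAN_S = 92.0   # assumed "true" average lap time, in seconds
NOISE_SD_S = 2.0     # assumed size of unrecorded effects (e.g. weather)

def simulate_mean_lap_time(n_laps, rng):
    """Average of n_laps simulated lap times, each with random noise."""
    laps = [rng.gauss(TRUE_MEAN_S, NOISE_SD_S) for _ in range(n_laps)]
    return statistics.fmean(laps)

def spread_of_estimates(n_laps, n_repeats=200, seed=0):
    """Standard deviation of the estimated mean across repeated datasets."""
    rng = random.Random(seed)
    estimates = [simulate_mean_lap_time(n_laps, rng) for _ in range(n_repeats)]
    return statistics.stdev(estimates)

small = spread_of_estimates(n_laps=10)
large = spread_of_estimates(n_laps=1000)
print(f"spread of the estimate with 10 laps:   {small:.3f} s")
print(f"spread of the estimate with 1000 laps: {large:.3f} s")
```

The estimate from the larger dataset should vary much less from one simulated dataset to the next. Note that this only applies to influences that behave like random noise; a factor that pushes every measurement in the same direction does not average out, however much data is collected.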

In other situations, data points that are clearly outside of what is expected, also known as outliers, can sometimes be safely removed from analyses, though care must be taken not to remove data points that provide real insights.
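One common way to flag candidate outliers is the interquartile-range (IQR) rule, sketched below using only Python's standard library. This is one convention among several, not the only definition of an outlier; the lap times, the multiplier `k=1.5`, and the function name `flag_outliers_iqr` are assumptions made for illustration.

```python
import statistics

def flag_outliers_iqr(values, k=1.5):
    """Return the lower fence, upper fence, and values outside the fences."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Hypothetical lap times in seconds; one lap is far slower than the rest.
lap_times = [92.1, 91.8, 93.0, 92.5, 91.9, 92.7, 140.2, 92.3]

lower, upper, outliers = flag_outliers_iqr(lap_times)
print(f"fences: {lower:.2f} to {upper:.2f}")
print("flagged:", outliers)  # flagged: [140.2]
```

The function only flags values; it doesn't delete them. Whether the 140.2-second lap is a recording error or a genuine event, such as a pit stop, is a judgment that needs knowledge of how the data was collected.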

Another common issue in real-world data is bias. Bias refers to a tendency to select certain types of values more frequently than others, in a way that misrepresents the underlying, real-world population. You can sometimes identify and prevent bias by exploring the data while keeping in mind basic knowledge about where it came from.
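A simple way to look for this kind of bias is to compare how often each category appears in the dataset with how often you'd expect it to appear in the real-world population. The sketch below assumes you already have trustworthy population shares from outside knowledge; the weather labels, the 60/40 split, and the function name `share_gaps` are all hypothetical.

```python
from collections import Counter

def share_gaps(sample_labels, population_shares):
    """Sample share minus expected population share, per category."""
    counts = Counter(sample_labels)
    total = len(sample_labels)
    return {
        label: counts.get(label, 0) / total - expected
        for label, expected in population_shares.items()
    }

# Hypothetical: 20 recorded races, almost all of them run in dry weather.
recorded_weather = ["dry"] * 18 + ["wet"] * 2

# Assumed real-world shares, known from some outside source.
expected_shares = {"dry": 0.6, "wet": 0.4}

for label, gap in share_gaps(recorded_weather, expected_shares).items():
    print(f"{label}: {gap:+.2f}")
```

A large positive gap means a category is over-represented in the data, and a large negative gap means it's under-represented. Here, wet-weather races are recorded far less often than the assumed population share suggests, so conclusions drawn from this dataset would mostly describe dry conditions.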

Real-world data will always have issues, but they're often surmountable if you remember to:

  • Check for missing values and badly recorded data.
  • Consider removal of obvious outliers.
  • Consider what real-world factors might affect your analysis, and whether your dataset is large enough to handle them.
  • Check for biased raw data and, if you find it, consider your options for fixing it.
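The first item in the checklist, checking for missing values and badly recorded data, can be sketched as a small audit over a list of records. The example assumes that missing values appear as `None` or `NaN` and that a valid engine size or lap time must be positive; the field names, the records, and the helper names `is_missing` and `audit` are invented for illustration.

```python
import math

def is_missing(value):
    """True for None or a float NaN, the two 'missing' markers assumed here."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def audit(records, fields):
    """Count missing and non-positive values per field; collect clean rows."""
    missing = {field: 0 for field in fields}
    invalid = {field: 0 for field in fields}
    clean = []
    for row in records:
        row_ok = True
        for field in fields:
            value = row.get(field)
            if is_missing(value):
                missing[field] += 1
                row_ok = False
            elif value <= 0:  # assumed rule: these quantities must be positive
                invalid[field] += 1
                row_ok = False
        if row_ok:
            clean.append(row)
    return missing, invalid, clean

# Hypothetical race records with a few typical problems.
records = [
    {"engine_size_l": 2.0, "lap_time_s": 92.1},
    {"engine_size_l": None, "lap_time_s": 91.8},          # missing engine size
    {"engine_size_l": 3.5, "lap_time_s": -5.0},           # impossible lap time
    {"engine_size_l": 2.4, "lap_time_s": float("nan")},   # missing lap time
    {"engine_size_l": 3.0, "lap_time_s": 90.4},
]

missing, invalid, clean = audit(records, ["engine_size_l", "lap_time_s"])
print("missing:", missing)
print("invalid:", invalid)
print("clean rows:", len(clean))  # clean rows: 2
```

Counting problems before dropping rows is a deliberate choice: if a large share of the data turns out to be missing or invalid, removing it may itself distort the dataset, and it's worth finding out why the values are absent first.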