Session

Show me the errors you didn't look for

with Claus Ekstrøm

useR!2017: Show me the errors you didn't look for

Keywords: Data cleaning, Quality control, Reproducible research, Data validation
Webpages: https://CRAN.R-project.org/package=dataMaid, https://github.com/ekstroem/dataMaid
The inability to replicate scientific studies has affected many scientific fields in recent years, with potentially grave consequences. We need to give this problem its due diligence: extreme care is needed when considering the representativeness of the data and when conveying reproducible research information. We should document not just the statistical analyses and the data but also the exact steps of the data cleaning process, so that we know which potential errors we are unlikely to have identified in the data.
Data cleaning and validation are the first steps in any data analysis, since the validity of the conclusions from the statistical analysis hinges on the quality of the input data. Mistakes in the data arise for any number of reasons, including erroneous codings, malfunctioning measurement equipment, and inconsistent data generation manuals. Ideally, a human investigator should go through each variable in the dataset and look for potential errors, both in input values and in codings, but that process can be very time-consuming, expensive and error-prone in itself.
We present the R package dataMaid, which implements an extensive and customizable suite of quality assessment tools to identify and document potential problems in the variables of a dataset. The results can be presented in an auto-generated, non-technical, stand-alone overview document intended to be perused by an investigator who understands the variables in the dataset but does not necessarily know R. In this way, dataMaid aids the dialogue between data analysts and field experts, while also providing easy documentation of reproducible data cleaning steps and data quality control. dataMaid also provides a suite of more typical R tools for interactive data quality assessment and cleaning.
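As a rough illustration of the two modes of use described above, the following minimal sketch runs interactive checks and then generates an overview document. The `patients` data frame and its deliberately planted problems are hypothetical and not taken from the talk. The function names `check()` and `makeDataReport()` are those of recent CRAN releases of dataMaid and may differ in other versions. Rendering the report is assumed to require a working rmarkdown/pandoc installation.

```r
library(dataMaid)

# Hypothetical example data with deliberately planted problems:
# inconsistent codings of sex and implausible ages.
patients <- data.frame(
  id  = 1:6,
  sex = c("male", "Male", "female", "female ", "F", "male"),
  age = c(34, 41, 999, 29, 38, -1),
  stringsAsFactors = FALSE
)

# Interactive use: run the default checks on a single variable ...
check(patients$sex)

# ... or on every variable in the data frame.
check(patients)

# Generate the stand-alone overview document for a field expert.
# replace = TRUE overwrites an existing report with the same name;
# openResult = FALSE avoids opening the file automatically.
makeDataReport(patients, output = "html", replace = TRUE, openResult = FALSE)
```

The generated report can be handed to a collaborator who knows the variables but not R, and regenerating it after each cleaning step keeps a record of which checks were run.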