Have you ever had to rerun an analysis because you discovered something askew in your data?
Does everyone want your data before you think they are ready?
Is the demand for your data greater and more urgent than the time you have to prepare it?
Does data cleaning play second fiddle to data analysis in your shop?
If you are a social or health scientist, chances are you said yes to at least one of these questions. Given the pressure for real-time data, we conducted a study to test how full data cleaning affects analytic results. Our findings have significant implications for anyone who relies on raw, uncleaned data to make decisions.
We conducted a secondary analysis of participant-level survey data for two human services programs to determine how data cleaning affects participant demographics, program outcomes, predictors of program outcomes, and predictors of program retention. We found that data cleaning significantly affected the results and the conclusions drawn from the analysis, and that this impact increased with the complexity of the analysis.
These findings demonstrate that data quality is critical for data-driven decisions.
Analysis of dirty or partially cleaned data may lead to ill-informed conclusions, yet little has been published on data cleaning processes and their influence on analytic findings.