Renee N. Saris-Baglama, Ph.D. and P. Allison Minugh, Ph.D.
Fourteen-year-old widows? Women aged 15 to 19 with 12 children? These are a couple of the strange statistics found in early U.S. Census data (Kruskal, 1981). In our experience cleaning longitudinal data, we even found that time could go backwards! Or, more accurately, survey dates could be out of order. A calculation that relies on survey administration dates to compute the number of days between two data points will then produce inaccurate reporting and data loss. Simple errors, such as a mistake in a person's age, can set additional errors into motion, creating cascading effects that seriously damage data quality whenever other calculations depend on that variable.
For example, what if you wanted to know a person’s age when they first . . .
- Smoked cigarettes?
- Drank alcohol?
- Engaged in sexual behavior?
- Had contact with the police?
If the ages reported for any of these questions are higher than the person's reported current age, all of those data would be suspect.
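A check like this is straightforward to automate. The sketch below flags any age-at-first-use value that exceeds the respondent's current age; the column names are hypothetical and would need to match your own survey's codebook.

```python
def flag_age_inconsistencies(record):
    """Return the age-at-first-use fields that exceed current age.

    Field names here (age_first_cig, etc.) are illustrative only.
    """
    first_use_fields = [
        "age_first_cig",
        "age_first_alcohol",
        "age_first_sex",
        "age_first_police_contact",
    ]
    current_age = record.get("age")
    flags = []
    if current_age is None:
        return flags  # cannot check without a current age
    for field in first_use_fields:
        value = record.get(field)
        if value is not None and value > current_age:
            flags.append(field)
    return flags

# A 16-year-old reporting first alcohol use at 21 gets flagged.
record = {"age": 16, "age_first_cig": 12, "age_first_alcohol": 21}
print(flag_age_inconsistencies(record))  # ['age_first_alcohol']
```

Running a check like this across every wave of a longitudinal study surfaces these contradictions before they propagate into downstream calculations.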
Some common data errors to check for include:
- Out of Range Values: Values that fall outside the range of possible response options (e.g., a response of 6 on a scale from 1 to 5).
- Implausible Values: Values that have a ceiling beyond which data are impossible or highly unlikely (e.g., age is reported as 125 years old).
- Inconsistencies/Impossible Combinations of Values: The combination of two or more values is logically impossible (e.g., someone reports they never drank alcohol in their lifetime and then reports past-30-day alcohol use).
- Missing Data: No data are entered where data are required (e.g., empty cells for key administrative variables, resulting in data loss).
- Formatting Errors: Data that do not adhere to format requirements (e.g., non-standard variable names and labels, data formats, unique identifiers).
- Duplicate Data: Identical records submitted more than once (e.g., the same participant's survey entered twice).
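Most of the checks above can be expressed as a simple validation pass over the data. The sketch below is a minimal illustration, not a complete screening tool: the field names, the 1-to-5 scale, and the age ceiling are all assumed values that a real check would take from the study's codebook.

```python
def validate(records):
    """Screen a list of record dicts for common data errors.

    Returns a dict mapping each error type to the offending
    record IDs (or records, when the ID itself is missing).
    """
    errors = {
        "out_of_range": [],
        "implausible": [],
        "inconsistent": [],
        "missing": [],
        "duplicate": [],
    }
    seen_ids = set()
    for rec in records:
        rid = rec.get("id")
        if rid is None:  # missing required identifier
            errors["missing"].append(rec)
            continue
        if rid in seen_ids:  # identical record ID already submitted
            errors["duplicate"].append(rid)
        seen_ids.add(rid)

        # Out-of-range: assumed 1-to-5 response scale.
        satisfaction = rec.get("satisfaction")
        if satisfaction is not None and not 1 <= satisfaction <= 5:
            errors["out_of_range"].append(rid)

        # Implausible: assumed ceiling of 120 years for age.
        age = rec.get("age")
        if age is not None and age > 120:
            errors["implausible"].append(rid)

        # Inconsistent: lifetime "never drank" contradicts
        # any reported past-30-day alcohol use.
        if rec.get("ever_alcohol") == 0 and rec.get("days_alcohol_30") not in (None, 0):
            errors["inconsistent"].append(rid)
    return errors
```

Formatting errors (non-standard variable names, labels, and data formats) are usually better caught by comparing incoming files against a data dictionary rather than by per-record rules like these.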
As data managers, we understand people need the “right” data, right away, and it is far better for errors to be prevented in the first place. Indeed, with good planning, data entry systems can be designed to prevent most errors—especially the common ones—from making it into a dataset (e.g., rejecting duplicate entries and out-of-range values). For more complicated errors (e.g., inconsistent responses between two survey questions, inconsistencies across time points), someone with a good understanding of the content and expertise in data management can play a critical role in ensuring your data assets are protected.
Datacorp has technical experts that can assist you. Remember, when it comes to your data, an ounce of prevention is worth a pound of cure!
Contact us at dataqualityassessments@mjdatacorp.com to learn how our technical experts can help you prevent data errors and improve your data.
References
Kruskal, W. (1981). Statistics in society: Problems unsolved and unformulated. Journal of the American Statistical Association, 76, 505-515.