
Cleaning your Dissertation’s Data

In our next webinar on Thursday 5th October @ 7 pm (GMT + 2:00), Professor Linda Bloomberg and I will be discussing how to analyse your qualitative and quantitative data. In this post, I discuss the step that comes before the analysis of quantitative data: inspecting your data and cleaning any errors before you start the statistical analysis.

Just a couple of notes before I begin. First, I’m referring here to quantitative data with inaccuracies and errors, also known as dirty data. I’m not considering missing data. Second, each set of data is unique and should be considered on its merits and in context. What follows may not necessarily apply to your data.

Why would my data be dirty?

Data can be dirty for several reasons. For example, you may have duplicate observations or data-capture errors. Perhaps people outside your target population have been included in your dataset inadvertently. Some respondents may have given random answers to your questions just to get the survey done quickly. Others may have agreed with every item, or chosen the neutral option for every item, irrespective of the question, a phenomenon known as flatlining. These are all sources of dirty data.

Should I clean my data if it is dirty?

Cleaning dirty data is not just the right thing to do. It is essential.

First, if you have more than a few duplicate observations, the results of your analyses may be wrong. Many statistical analyses assume independent observations; duplicate observations violate this assumption and, quite apart from the other problems they cause, can distort your results.

Second, capturing errors may introduce outliers into your data: an extra zero in an entry, for example, turns 50 into 500. Even a few outliers or extreme values can produce inaccurate or plainly wrong results; correlations, regression coefficients, t-tests and so on can simply be wrong because of errors in your data.
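To see how much damage a single capturing error can do, here is a minimal Python sketch with entirely made-up, simulated data, comparing a correlation before and after one value is keyed in with an extra zero:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: hours studied and exam scores for 100 students
hours = rng.normal(10, 2, 100)
score = 50 + 2 * hours + rng.normal(0, 4, 100)

clean_r = np.corrcoef(hours, score)[0, 1]

# Simulate one capturing error: a single score gains an extra zero
dirty_score = score.copy()
dirty_score[0] = score[0] * 10

dirty_r = np.corrcoef(hours, dirty_score)[0, 1]

print(f"Correlation, clean data:        {clean_r:.2f}")
print(f"Correlation, one miskeyed case: {dirty_r:.2f}")
```

A single miskeyed case out of a hundred can change the correlation dramatically, which is exactly why such values need to be found and corrected before the analysis.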

Third, respondents who should not have been included in your data may have responded differently from your target population, distorting your results and undermining their generalisability.

Fourth, flatlining and random responses introduce random error variance into your data, which can be catastrophic for your analyses. Random variance is unexplainable and undermines the ability of your predictors to explain variance.

This is all bad news indeed. Dirty data can lead to wrong statistical results, and wrong results can lead to wrong recommendations and wrong decisions.

How do I clean my data?

If you are trying to clean big data, you will probably need a specialised program or dedicated algorithm.

But if you are a student with a few hundred cases, or fewer, there are several measures you can take to clean your data. For example, if you examine your data in Excel, you can quickly identify outliers by applying conditional formatting with a colour scale that shades your highest values in green and your lowest in red (or vice versa). Just make sure that these extremes could not be valid responses before you delete or correct them.

To identify flatlining, calculate the variability within each respondent's answers and look for cases with little or no variability. You may also consider how long each respondent took to complete the survey: online survey software records the start and end times for each respondent, and an unrealistically short completion time may indicate random responding. Also look for contradictory answers to items that should be consistent or opposite.
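If your responses are already in a CSV file, checks like these can also be scripted. The following is a minimal Python (pandas) sketch, assuming a file called survey_responses.csv with one row per respondent, Likert items named q1 to q10 and a duration_seconds column; the names and thresholds are placeholders, not taken from any particular survey package:

```python
import pandas as pd

df = pd.read_csv("survey_responses.csv")        # hypothetical file name

likert_items = [f"q{i}" for i in range(1, 11)]  # placeholder item names

# Flatlining: little or no variability within a respondent's answers
df["response_sd"] = df[likert_items].std(axis=1)
flatliners = df[df["response_sd"] < 0.5]        # threshold is a judgement call

# Unrealistically short completion times may indicate random responding
speeders = df[df["duration_seconds"] < 120]     # e.g. under two minutes

print(f"{len(flatliners)} possible flatliners")
print(f"{len(speeders)} possible speeders")
```

Flag such cases for inspection rather than deleting them automatically; a genuine respondent can occasionally answer quickly or give very consistent answers.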

If you are using a stats package, you can run an analysis to identify duplicate observations, run descriptive statistics to find the minimum and maximum values of each variable, and draw box-and-whisker plots to show up outliers and extreme values.
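If you are working in Python rather than a dedicated stats package, the same three checks might look like the sketch below (again, the file name is hypothetical and only the numeric variables are plotted):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("survey_responses.csv")    # hypothetical file name

# Duplicate observations (identical rows; adapt to an ID column if you have one)
duplicates = df[df.duplicated(keep=False)]
print(f"{len(duplicates)} duplicated rows")

# Minimum and maximum of every numeric variable, to spot impossible values
print(df.describe().loc[["min", "max"]])

# Box-and-whisker plots to show up outliers and extreme values
df.select_dtypes("number").boxplot(rot=45)
plt.tight_layout()
plt.show()
```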

Should I keep a record of how I cleaned my data?

Yes, it is very important to document exactly how you cleaned your data. Keep a record of every change you make, from the initial dirty data to the cleaned data, as your supervisor or examiner may want to check what you have done. You should also summarise your steps in your methodology chapter or in an appendix, describing exactly what you did and which cases you cleaned or deleted at each step.

If you need help cleaning and analysing your data, contact me at [email protected].