In preparation for our upcoming webinar on Avoiding the Traps of Questionnaire Design and Analysis in Your Dissertation at 6 pm (GMT+2:00) on 19th September, I’m writing about an issue of reliability that is often problematic for students. An old bathroom scale is a handy way to explain the concept of the reliability of the measurement instrument in your dissertation. Here goes…
If you use an old non-digital bathroom scale to weigh yourself, you may be like me. If I don’t like what I see, I step off and on it again to see if it changes its mind and gives me a better reading a second later. Sometimes I even try a third time… (it could, of course, be worse ☹).
Our old scale teaches us about reliability. Its measurements of your weight (more correctly, your mass, but I’ll stay with weight) vary randomly within seconds. But they shouldn’t.
Unfortunately, old bathroom scales produce weights that are not very reliable or consistent. The weights or scores they display vary because they contain error variance, which is random. One moment, your weight is a bit up, and a moment later, it’s a bit down.
More formally, the old bathroom scale shows you weights or scores that are not reliable. Reliability refers to the consistency of scores.
Similarly, many dissertations and theses involve using questionnaires or tests that measure attitudes, perceptions, or some construct. If you are using a measurement instrument in your dissertation, you will need to describe the reliability of the scores it produces, among its other properties.
How do you measure the reliability of the scores produced by a measurement instrument?
Let’s say that we use the old bathroom scale to weigh many individuals. The weights of these individuals obviously vary between individuals, as some people are bigger than others, but there is also error variance in the scores or weights produced by the scale.
If 20% of the variability we observe in the weights of these individuals is error or random due to the inconsistencies in the scores produced by our old scale, then 80% of the variability would be free from such random error. Here, the reliability coefficient of the scores produced by our scale would equal .80. Similarly, if 30% of the observed variability in the scores is random or error variance, then the reliability coefficient of the scores produced by our scale would equal .70.
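If you would like to see that arithmetic in action, here is a minimal sketch in Python. Every number in it (the average weight, the size of the scale’s error, the sample size) is an invented assumption chosen purely to mirror the 20%/80% example above, not real data:

```python
# A rough numerical illustration (made-up numbers) of the idea that
# reliability = true-score variance / total observed variance.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

true_weight = rng.normal(80, 10, n)   # people's actual weights in kg (variance = 100)
error_sd = 5                          # random error of the old scale (variance = 25)

# Two readings of the same people, each contaminated by fresh random error
reading_1 = true_weight + rng.normal(0, error_sd, n)
reading_2 = true_weight + rng.normal(0, error_sd, n)

# 25 / (100 + 25) = 20% of the observed variance is error, so reliability ≈ .80
reliability = true_weight.var() / reading_1.var()

# The correlation between two repeated readings estimates the same quantity
test_retest_r = np.corrcoef(reading_1, reading_2)[0, 1]

print(f"reliability ≈ {reliability:.2f}, test-retest r ≈ {test_retest_r:.2f}")
```

Run it and both numbers come out close to .80, matching the example in which 20% of the observed variability is random error.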
The bottom line here is that scores of a measurement instrument that measures consistently are reliable, while scores of a measurement instrument that measures inconsistently are unreliable.
Why is it important to have reliable scores?
Consider, for example, a situation in which we need to assess how well a new diet pill works. We measure the weights of a random sample of overweight individuals, and then they all take the diet pill for a month. Then we re-measure their weights using the same scale we used before they started taking the pill.
If we use our old unreliable measurement scale for our pre- and post-measurements, some of the variability in the weights would be random. In the extreme situation, our very unreliable scores produced by our old inconsistent scale would contain so much random error that we would be unable to evaluate the changes in the individuals’ weights from before to after taking the diet pill. So, we wouldn’t know if the diet pill was working.
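A small simulation makes this concrete. Everything in the sketch below (the 2 kg loss, the sample size, the two error sizes) is an invented assumption for illustration, not a claim about any real pill or scale:

```python
# A hypothetical sketch: a real 2 kg average weight loss measured once with a
# precise scale and once with a very noisy old scale.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 30
true_before = rng.normal(95, 8, n)
true_after = true_before - 2                 # every person really loses 2 kg

for error_sd in (0.5, 6.0):                  # precise scale vs. noisy old scale
    pre = true_before + rng.normal(0, error_sd, n)
    post = true_after + rng.normal(0, error_sd, n)
    res = stats.ttest_rel(pre, post)         # paired t-test on the observed weights
    print(f"error SD = {error_sd} kg: observed mean loss = {(pre - post).mean():.1f} kg, "
          f"p = {res.pvalue:.3f}")
```

With the precise scale, the 2 kg loss is unmistakable; with the noisy scale, the very same real loss can easily fail to reach significance, because the measurement error swamps it.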
In summary, the first reason reliable scores are important is that poor score reliability reduces the potential for evaluating change. Any instrument that produces unreliable scores measures inconsistently, so changes in inconsistent scores cannot be taken to reflect real change. Not cool.
The second reason is that the error variance in the scores of an unreliable measurement instrument dilutes what is actually being measured, because the scores are full of random error. This means that the validity of the scores – whether the instrument measures what it is supposed to measure – is compromised. So, measurement instruments that produce scores with poor reliability will also have compromised validity. Not cool.
The third, fourth and fifth reasons all stem from the random error, or noise, in unreliable scores. Recall that power is the probability of finding a significant difference if one really exists. Noise inflates the variability of our scores, so a real difference has to be larger before it reaches significance. Unreliability therefore reduces the power of statistical tests and, in turn, shrinks the associated effect sizes. It also attenuates the observed correlations between our scores and other measures, so that our instrument is a poorer predictor of outcomes than it should be. It’s not cool at all.
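To illustrate the attenuation point, here is a short sketch based on the classical attenuation formula (observed r ≈ true r × √(reliability of x × reliability of y)). The true correlation of .60 and the reliabilities are assumed values chosen only for the example:

```python
# Measurement error in a predictor shrinks ("attenuates") its observed
# correlation with an outcome. The numbers below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
true_x = rng.normal(0, 1, n)
true_y = 0.6 * true_x + rng.normal(0, 0.8, n)        # true correlation ≈ .60

for rel in (1.0, 0.8, 0.5):                          # reliability of the x scores
    err_var = (1 - rel) / rel                        # error variance giving that reliability
    noisy_x = true_x + rng.normal(0, np.sqrt(err_var), n)
    r = np.corrcoef(noisy_x, true_y)[0, 1]
    print(f"reliability of x = {rel:.1f}: observed r ≈ {r:.2f}")
```

With perfectly reliable scores the correlation comes out near .60; with a reliability of .50 it drops to roughly .42, even though the underlying relationship has not changed at all.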
In short, it is very important that the measurement instrument(s) that you use in your dissertation measure reliably. I hope I have convinced you of this.
What we have been describing with our bathroom scale is essentially test-retest reliability. However, there are many other forms of score reliability, all involving consistency. These include parallel-forms reliability, measured by the consistency of scores on different versions of the same test, and inter-rater reliability, measured by the consistency of the scores that different raters assign to the same individual.
However, the most commonly reported reliability index is coefficient alpha, or Cronbach’s alpha.
Students almost always use Cronbach’s alpha to report the reliability of their measurement scale scores. However, alpha is one of the most frequently misunderstood and misinterpreted statistics in theses and dissertations involving measurement.
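Since alpha will come up again, here is a bare-bones sketch of how it is calculated from item scores, so the later discussion has something concrete to refer to. The five-item data are simulated purely for illustration; in a real dissertation you would normally let SPSS, R (for example the psych package) or Python’s pingouin compute alpha for you:

```python
# Cronbach's alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total score)
# The item data below are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(7)
n_people, k = 200, 5
trait = rng.normal(0, 1, n_people)                   # the construct being measured

# Each item = the common trait plus item-specific random noise
items = np.column_stack([trait + rng.normal(0, 1, n_people) for _ in range(k)])

item_variances = items.var(axis=0, ddof=1)
total_score_variance = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_score_variance)

print(f"Cronbach's alpha ≈ {alpha:.2f}")             # roughly .8 for these simulated items
```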
I’ll discuss the misuse of poor Cronbach’s alpha in a future post (and in our webinar).
Do contact me, Merle Werbeloff, PhD (Wits), at [email protected] if you need help with your statistical analysis or any other aspect of your dissertation.