Reliability of measurement refers to the consistency of measurement. Synonyms for reliability include repeatability, reproducibility, precision, dependability, fidelity, and generalizability. Suppose you wanted to test the reliability (or, if you will, the consistency or reproducibility) of your car's odometer. You drive from your house to the post office and measure the distance on your car's odometer to be 5.1 miles. Then you drive back home again, and the distance measured on your odometer is 5.4 miles. Is your odometer a reliable measure? Well, not if you need it to be consistent to within a tenth of a mile. But it is reliable if you only need consistency to the nearest whole mile. This leads to an important reliability principle: reliability is relative. How reliable we need our measure to be depends on what we plan to use the measure for. If you are just going on a casual date, you do not need a measure of interpersonal compatibility to be very reliable. If we are hiring the president of a major company, we would want a very reliable measure of, say, leadership potential.
We are usually interested in the reliability of a set of scores. For example, suppose we give one version of the ACT to a group of 100 students and then give another version of the ACT (specifically, a parallel form, that is, a form that measures the same thing to the same degree) to the same group of 100 students. Ideally, we would like to see the same score for each person on both forms. We will probably never see exactly the same scores for all 100 people on both forms, but we would like to see a similar rank-ordering of the 100 students on both forms.
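To make that concrete, here is a minimal sketch of how we might check whether two parallel forms produce a similar rank-ordering. The scores are made up (they are not real ACT data), and the example assumes scipy is available; a rank-order (Spearman) correlation is one natural way to quantify agreement in rank-ordering.

```python
# Sketch: parallel-forms reliability as agreement in rank-ordering.
# Scores below are hypothetical, not real ACT data.
from scipy.stats import spearmanr

form_a = [21, 28, 33, 18, 25, 30, 22, 27, 19, 35]  # scores on Form A
form_b = [22, 27, 34, 17, 26, 29, 23, 28, 20, 34]  # same students on Form B

rho, _ = spearmanr(form_a, form_b)  # rank-order correlation between the forms
print(f"Rank-order agreement between the two forms: {rho:.2f}")
```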
There are many different types of reliability:
- Inter-rater
- Test-retest
- Internal consistency and coefficient alpha
For example, if we want to generalize how reliable a measure is over time, we might want to assess test-retest reliability. By way of illustration, if we give a measure of need for achievement to 200 graduate students, give the same measure 1 week later to the same 200 graduate students, and then correlate the two sets of scores, we would be measuring test-retest reliability. We assume that any difference in the rank-ordering of scores is because of unreliability. It is important to point out that there is no one single test-retest reliability for any measure. To illustrate, we might estimate test-retest reliability over a period of 1 week, or 1 day, or 6 months, or 10 years. There would surely be a different test-retest reliability coefficient for each of these time intervals.
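A minimal sketch of that calculation follows. The need-for-achievement scores are purely hypothetical, and the sketch simply correlates the two administrations (here with a Pearson correlation from scipy):

```python
# Sketch: test-retest reliability as the correlation between two
# administrations of the same measure. Scores below are hypothetical.
from scipy.stats import pearsonr

week_1 = [14, 22, 18, 30, 25, 17, 28, 21, 19, 26]  # need for achievement, time 1
week_2 = [15, 20, 19, 29, 27, 16, 27, 22, 18, 25]  # same students, 1 week later

r, _ = pearsonr(week_1, week_2)
print(f"Test-retest reliability (1-week interval): {r:.2f}")
```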
I have been watching the Olympics while I prepare these notes. More specifically, I have been watching the diving competition. At the end of each dive, the different judges give scores. Sometimes the judges are not consistent in their scores. Here we are dealing with inter-rater reliability. If we assessed the correspondence of scores between judges for a group of divers, we would be assessing inter-rater reliability. We might be similarly interested in inter-rater reliability when we look at the consistency of judges at the apple pie contest at the county fair, the rulings of Supreme Court Justices, or the health ratings given to restaurants by state health and safety inspectors. We want reliability to be high in all these cases, because we want to have confidence in the scores produced. I know I would not want to eat in a restaurant where the health rating was not reliable, or ride in an airplane where the safety inspectors' ratings were not reliable. It is important to point out that, as in the case of test-retest reliability, there is no one single inter-rater reliability for any measure. Different inter-rater reliability coefficients would emerge depending on which raters (or judges or observers, etc.) we chose to study and how many raters we chose to study.
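There are several ways to quantify inter-rater agreement; one simple approach, sketched below with made-up diving scores (not actual Olympic data), is to average the pairwise correlations among the judges' ratings. More sophisticated indices exist, but this sketch conveys the basic idea of checking how well the raters' scores line up.

```python
# Sketch: inter-rater reliability as the average pairwise correlation
# among judges' scores (hypothetical diving scores, one row per judge).
import numpy as np
from itertools import combinations

ratings = np.array([
    [8.5, 7.0, 9.0, 6.5, 8.0, 7.5],  # Judge 1's scores for six dives
    [8.0, 7.5, 9.5, 6.0, 8.5, 7.0],  # Judge 2's scores for the same dives
    [9.0, 7.0, 9.0, 7.0, 8.0, 8.0],  # Judge 3's scores for the same dives
])

pairwise = [np.corrcoef(ratings[i], ratings[j])[0, 1]
            for i, j in combinations(range(len(ratings)), 2)]
print(f"Average inter-rater correlation: {np.mean(pairwise):.2f}")
```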
When we develop a measure of extroversion or aggression or intelligence or any other construct, we would like to know that the items are all measuring the same thing. We want the items in a measure to be relatively homogeneous (just like you want your milk to be homogeneous and not have crud in it) and for the measure to demonstrate internal consistency reliability, which is achieved by having items that measure the same construct and are correlated with each other. Imagine that you have four items measuring attitudes toward iguanas. Item one is "I like iguanas a lot." Item two is "I would be willing to have an iguana as a pet." Item three is "I would like to spend a lot of time with iguanas." Item four is "I like Mozart." If we used this scale to measure attitudes toward iguanas, I can tell you right now that the internal consistency reliability of the scale would be higher if we just used the first three items, because the fourth item is measuring something different. It is measuring attitudes toward Mozart, or maybe attitudes toward classical music. As in the case of the other kinds of reliability, there is no one single internal consistency reliability for any measure. Different estimates would arise as different items and different numbers of items are studied.
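A rough way to see this is to look at the correlations among the items. The responses below are entirely made up (a 1-to-5 agreement scale from eight imaginary respondents), constructed so that the three iguana items hang together and the Mozart item does not:

```python
# Sketch: inter-item correlations for the four iguana items.
# Hypothetical 1-5 agreement ratings; rows are respondents, columns are items.
import numpy as np

responses = np.array([
    # item1, item2, item3, item4 ("I like Mozart.")
    [5, 4, 5, 4],
    [4, 4, 3, 2],
    [2, 1, 2, 3],
    [5, 5, 4, 4],
    [3, 3, 3, 2],
    [1, 2, 1, 4],
    [4, 5, 4, 3],
    [2, 2, 3, 2],
])

corr = np.corrcoef(responses, rowvar=False)  # 4 x 4 matrix of inter-item correlations
print(np.round(corr, 2))
# Items 1-3 correlate strongly with one another; item 4 (the Mozart item) does not.
```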
One of the most common methods used to estimate internal consistency reliability is coefficient alpha, which was developed by that famous psychologist Lee J. Cronbach and is sometimes called Cronbach's alpha. I will not go into the details of how coefficient alpha is computed, but suffice it to say that coefficient alpha typically ranges between 0.0 and 1.0, and we like to see higher rather than lower values. For example, if coefficient alpha for a measure is .80 or higher, we have confidence that the measure is relatively homogeneous and all of the items are measuring a common construct. If the items are measuring the same thing, coefficient alpha increases as the number of items increases. That is one of the reasons measures like the GRE, SAT, ACT, LSAT, and Myers-Briggs have so many items: to increase reliability.
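For the curious, here is a minimal sketch of one common way to compute it, using the standard formula alpha = (k / (k - 1)) * (1 - sum of the item variances / variance of the total scores), where k is the number of items. The data are the same hypothetical iguana responses used above, so the numbers are illustrative only.

```python
# Sketch: computing coefficient (Cronbach's) alpha from a respondents-by-items
# data matrix (hypothetical responses; rows are respondents, columns are items).
import numpy as np

def cronbach_alpha(responses):
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]                          # number of items
    item_vars = responses.var(axis=0, ddof=1)       # variance of each item
    total_var = responses.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# The four hypothetical iguana items from above: alpha for all four items
# versus alpha for the first three items only.
responses = np.array([
    [5, 4, 5, 4], [4, 4, 3, 2], [2, 1, 2, 3], [5, 5, 4, 4],
    [3, 3, 3, 2], [1, 2, 1, 4], [4, 5, 4, 3], [2, 2, 3, 2],
])
print(f"Alpha, all four items: {cronbach_alpha(responses):.2f}")
print(f"Alpha, items 1-3 only: {cronbach_alpha(responses[:, :3]):.2f}")
```

With these made-up responses, alpha for the first three items should come out higher than alpha for all four, which is exactly the pattern described above: dropping the item that measures something different improves internal consistency.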
If you want to increase the chance of getting significant results, use measures with higher reliabilities.