17 January 2012

Multiple raters

This article was first published on 11 April 2005.

So you have something that has been independently rated by several people. How can you tell if the ratings are the same? This article discusses how to use the intraclass correlation coefficient to assess inter-rater reliability.

Problem - are several ratings consistent?

Very often in usability research, an investigator will want to get an idea of how things are rated. Perhaps the investigator has been tasked with assessing the subjective readability of ten pages on a website and gets five people to offer ratings on a 5-point Likert scale. How then can he or she see whether the ratings are the same or not?

The best way to look at this is to use statistical analysis. Good analysis will allow the investigator to say easily whether there is a significant difference or not. However, I have commonly encountered people using correlations, assuming that if all are significantly associated, then the ratings are the same. This is certainly a possible solution, but it’s tricky: using the above example, our investigator will have to perform ten different calculations (one for each pair of the five raters), and if any result is not statistically significant, then the ratings are not the same. In addition, by subjecting the same data to several similar analyses, the investigator might be causing alpha inflation. At the conventional 5% level, every analysis has a probability of one in twenty of giving a significant result purely by chance, and repeating the analysis increases this. With ten different tests, the probability of getting at least one significant result by chance rises to roughly two in five (treating the tests as independent). The investigator would be making a serious error in doing this.
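The arithmetic behind alpha inflation can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, a 5% significance level is assumed, and the tests are treated as independent, which pairwise correlations on the same data are not, so the figure is approximate.

```python
from math import comb


def pairwise_tests(n_raters: int) -> int:
    """Number of rater pairs, i.e. separate correlations to run."""
    return comb(n_raters, 2)


def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n_tests tests.

    Assumes the tests are independent (a simplification).
    """
    return 1.0 - (1.0 - alpha) ** n_tests


n_tests = pairwise_tests(5)            # five raters, as in the example
fwer = familywise_error_rate(n_tests)  # assumed alpha of 0.05 per test
print(n_tests, round(fwer, 3))         # 10 0.401
```

So five raters give ten pairwise tests, and the chance of at least one spurious significant result is about 40% rather than 5%.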

Intraclass correlations

But fear not! There are good tests that can be used to test the consistency of several raters in just one go. These are known as tests of inter-rater reliability.

Commonly, the statistic of interest is alpha, and the best test to use is the intraclass correlation. It’s available in newer versions of SPSS, and its use for rater reliability was set out by Shrout & Fleiss (1979) in the Psychological Bulletin. What this test does is very similar to performing several correlations, but all together in one test. This saves our investigator a lot of time (only one test to perform), the results are simpler, and there is no risk of alpha inflation.

However, not all is simple. There are three ways to perform an intraclass correlation and there are two statistics to use for each: six possibilities!

The choice of statistics depends upon whether the rating to be used will be assessed by one rater, or by more than one. In SPSS, these are referred to as single measures and average measures respectively.

The type of test to be used depends upon how the raters are selected. For the first type, each rater is selected at random from the population and rates only one case. For the second type, the raters judge every case, but the raters used are selected at random from the population. For the third type, raters judge every case, just like the second type. However, these are the only raters available.
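For readers without SPSS, here is a minimal sketch of how the single measures coefficient for each of the three types could be computed by hand, using the mean-square formulas from Shrout & Fleiss (1979). The ratings table, the function names and the "type 1/2/3" labels are assumptions made for this illustration, not SPSS output. Each row is a case and each column is a rater; for the first type the columns are simply the ratings each case received, not the same people throughout.

```python
from statistics import mean


def mean_squares(ratings):
    """Return (BMS, WMS, JMS, EMS) for a cases-by-raters table."""
    n = len(ratings)      # number of cases (rows)
    k = len(ratings[0])   # number of ratings per case (columns)
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(col) for col in zip(*ratings)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_cases = k * sum((m - grand) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_cases - ss_raters

    bms = ss_cases / (n - 1)                     # between cases
    wms = (ss_total - ss_cases) / (n * (k - 1))  # within cases
    jms = ss_raters / (k - 1)                    # between raters
    ems = ss_error / ((n - 1) * (k - 1))         # residual
    return bms, wms, jms, ems


def single_measure_iccs(ratings):
    """Single measures ICC for each of the three types."""
    n, k = len(ratings), len(ratings[0])
    bms, wms, jms, ems = mean_squares(ratings)
    return {
        "type 1": (bms - wms) / (bms + (k - 1) * wms),
        "type 2": (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n),
        "type 3": (bms - ems) / (bms + (k - 1) * ems),
    }


# Illustrative ratings: 6 cases (rows) rated by 4 raters (columns).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

for name, value in single_measure_iccs(ratings).items():
    print(name, round(value, 2))
# type 1 0.17
# type 2 0.29
# type 3 0.71
```

The same data give very different coefficients depending on the type chosen, which is why the choice of type matters.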

For rules of thumb: if you have a set of raters and all of them are used, then use the third type of test (known as a fully-crossed 2-way ANOVA mixed model design). Otherwise, you have the first or second type. When only some of the possible raters are used, the way to tell which one to use is this: if the raters will judge only some cases, then you have a 1-way ANOVA design and the first type should be used. If the raters judge every case, the second type is to be used (a fully-crossed 2-way ANOVA design with random effects).
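The rules of thumb above can be written down as a small decision helper. This is only a sketch of the reasoning in this article: the function name, its two yes/no parameters and the returned labels are invented for this example and do not correspond to any SPSS option names.

```python
def choose_icc_type(all_possible_raters_used: bool,
                    raters_judge_every_case: bool) -> str:
    """Pick the type of intraclass correlation by the rules of thumb.

    Both parameters are hypothetical names for the two questions asked
    in the text; the returned strings are labels, not SPSS settings.
    """
    if all_possible_raters_used:
        # The raters are the only ones available.
        return "type 3: fully-crossed 2-way ANOVA, mixed model"
    if raters_judge_every_case:
        # Raters are a random sample, but each one judges every case.
        return "type 2: fully-crossed 2-way ANOVA, random effects"
    # Raters are a random sample and judge only some cases.
    return "type 1: 1-way ANOVA"


# Five people picked from a larger pool, each rating all ten pages.
print(choose_icc_type(all_possible_raters_used=False,
                      raters_judge_every_case=True))
```

For the website example, five people sampled from a larger pool who each rate all ten pages would fall under the second type.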

If you are using SPSS to perform an intraclass correlation, you will notice that three different statistics are given: the single measures, the average measures, and the alpha. Quite often, the alpha is the same as the average measures, but not always. For any positive coefficient, the single measures value is lower than the average measures value. Sometimes the average measures statistic gives a good result (in terms of answering the research question) even though it is not the appropriate statistic, whereas the appropriate single measures statistic is too low. Don’t be tempted, though, to report the incorrect statistic: good peer review will ferret this out.
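Why the single measures value comes out lower can be seen from the Spearman-Brown step-up relation, which links the two under the Shrout & Fleiss (1979) formulas. The sketch below is illustrative: the function name and the example coefficients are assumptions, and five raters are used to match the example in this article.

```python
def average_from_single(icc_single: float, k: int) -> float:
    """Step a single measures ICC up to the average of k raters
    (Spearman-Brown relation)."""
    return k * icc_single / (1 + (k - 1) * icc_single)


# Hypothetical single measures values, averaged over five raters.
for single in (0.2, 0.5, 0.8):
    print(single, round(average_from_single(single, 5), 3))
# 0.2 0.556
# 0.5 0.833
# 0.8 0.952
```

A single measures value of 0.2 looks poor while the corresponding average measures value of 0.556 looks far more respectable, which is exactly the temptation to resist if the ratings will in practice come from one rater.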

Other measures

Of course, there are other measures that can be used. The kappa statistic can be calculated in many ways, but the intraclass correlation coefficient should be good enough to get you through most eventualities.


Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.