This article was first published on 16 December 2004.

Likert scales and their use

Likert scales are one of the most commonly used scales in testing. Frequently appearing in questionnaires, they are simple categorical scales where the subject has to indicate their preference. Let’s see an example:

Question:

*How often do you use a search engine?*

- Several times a day;
- Once a day;
- Once a week;
- Less frequently than the above.

The subject then simply indicates their preference by ticking a box, and the questionnaire is returned to the analyst, who then reports some measure of central tendency such as the mean, perhaps with the variance or standard deviation to describe the dispersal around it.

However, there are two main problems I would like to address here, and the first is common to all questionnaires.

**Wording the questions**

The above question is not phrased very well. It asks only "how often do you use a search engine?" without expanding in any way. Is the analyst interested in how often, on average, the subject uses a search engine, or in how often they use one when they have a lot to do? This can lead to confusion about what is actually being asked.

The answer isn’t to provide a small essay on what is expected, but simple clarification can often be useful:

*How often do you, on average, use a search engine?*

This provides some clarification. However, because a lot of HCI is concerned with real behaviour, it might be even better to pose the question in these terms:

*It is a typical week for you. You go to work, you come home, you do what you do in your free time. In this typical week, how often would you look for something on the Internet using a search engine?*

This gives the question a context, which may allow a more accurate answer. The earlier questions had no context - they were free-floating - which may not encourage the most accurate answer.

Another problem with the phrasing of the question is not the question itself, but rather the selection of answers. If a subject uses a search engine twice a week, which answer do they select? Is it number 2 (once a day), or 3 (once a week)? Clearly their behaviour falls between both categories, and neither is appropriate for them. Two subjects who search the same number of times may give different responses, so any findings are subject to a confounding effect of the phrasing of the questionnaire.

A more suitable set of answers would be:

- More than once a day;
- Between once a day and once a week;
- Between once a week and once a month;
- Less than once a month.

This covers all eventualities, as every situation can be properly accommodated.

**Analysis**

The second problem with the use of Likert scales is how they are analysed. Even within peer-reviewed articles in respected journals I have encountered descriptive analysis done by reporting the mean value (along with the standard deviation). This is erroneous, because the data here are ordinal (or categorical) rather than continuous, and so call for nonparametric treatment. The mean should only be reported for interval or continuous data.

The reasoning is thus: while even continuous data have discrete intervals (due to the resolution of the measuring instrument), they are considered continuous because (for example) a reaction time of 1 second is exactly half of a reaction time of 2 seconds. With the example here, a response in category 2 is not necessarily twice or half that of category 1; the categories cannot be related by a mathematical function, they are simply labels we apply to make sense of the data. If I were to ask "*How many times a month do you use a search engine?*", this would provide continuous data: searching twice a week is exactly twice as much as searching once a week.

Nunnally (1978) notes that ordinal scales with eleven or more levels lose little information compared to continuous scales, and that it may be possible to use parametric analyses with them. With fewer levels, however, nonparametric analysis must be used.
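A minimal sketch of this point (the responses and both numeric codings below are invented for illustration): because an ordinal coding is arbitrary up to order, the mean shifts when the coding changes, while the median, which depends only on the ordering of the categories, does not.

```python
from statistics import mean, median

# Hypothetical responses on the four-category scale above (invented data).
responses = ["more than once a day", "daily-weekly",
             "daily-weekly", "less than monthly"]

# Two codings that preserve the same order; both are equally "valid"
# for ordinal data, since the categories carry no numeric distances.
coding_a = {"more than once a day": 1, "daily-weekly": 2,
            "weekly-monthly": 3, "less than monthly": 4}
coding_b = {"more than once a day": 1, "daily-weekly": 2,
            "weekly-monthly": 3, "less than monthly": 10}

for coding in (coding_a, coding_b):
    codes = [coding[r] for r in responses]
    # The mean depends on the arbitrary numbers chosen; the median does not.
    print(mean(codes), median(codes))
# → 2.25 2.0
# → 3.75 2.0
```

The "average" response appears to change simply because the last category was relabelled, which is exactly why the mean is not a meaningful summary of ordinal data.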

**Nonparametric analysis**

If you are using a categorical or ordinal scale like the one above and need to report the findings, the proper statistics for the central tendency are the median or the mode (instead of the mean). There are also a range of suitable nonparametric “versions” of the common parametric tests:

- Paired t-test: Wilcoxon signed-rank test
- Unpaired t-test: Wilcoxon rank-sum test (Mann-Whitney U test)
- Between-subjects ANOVA: Kruskal-Wallis test
- Within-subjects ANOVA: Friedman test
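As a sketch of how such an analysis might look (the two groups of responses are invented, and the Mann-Whitney U statistic is computed by hand, with midranks for ties, rather than taken from a statistics library):

```python
from statistics import median, multimode

# Hypothetical Likert responses (category codes 1-4) from two groups.
group_a = [1, 2, 2, 3, 2, 1, 2, 4]
group_b = [3, 4, 3, 2, 4, 3, 4, 3]

def midranks(values):
    """Assign 1-based ranks, averaging ranks across tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):      # tied run spans ranks i+1 .. j+1
            ranks[order[k]] = (i + j + 2) / 2
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Mann-Whitney U statistic (the smaller of U1, U2) for two samples."""
    ranks = midranks(list(x) + list(y))
    r1 = sum(ranks[:len(x)])           # rank sum of the first sample
    u1 = r1 - len(x) * (len(x) + 1) / 2
    return min(u1, len(x) * len(y) - u1)

# Appropriate descriptives for ordinal data: median and mode, not the mean.
print(median(group_a), multimode(group_a))  # → 2.0 [2]
print(median(group_b), multimode(group_b))  # → 3.0 [3]
print(mann_whitney_u(group_a, group_b))     # → 11.5
```

In practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used instead, since it also supplies a p-value.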

These tests are often useful because many of them are resistant to the effect of outliers (they analyse data by rank rather than by actual value), which makes them somewhat more robust. If you do have outliers, however, it is a very good idea to investigate why they are there: they may be due to a transcription error (i.e. a typo), or because something interesting is happening.
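A tiny illustration of that robustness, on invented reaction-time data where the 12.5 might be a mistyped 1.25: the single outlier drags the mean well above the typical value, while the median is unaffected.

```python
from statistics import mean, median

# Hypothetical reaction times in seconds; 12.5 may be a typo for 1.25.
times = [1.0, 1.25, 1.5, 1.25, 12.5]

print(mean(times))    # → 3.5   (pulled far above the typical value)
print(median(times))  # → 1.25  (depends only on rank order, not magnitude)
```

Rank-based tests inherit this behaviour: the outlier simply becomes the largest rank, however extreme its value.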

**Conclusion**

It is common even for experienced analysts to analyse data from Likert scales incorrectly, for example by reporting the mean instead of the median or mode. A range of nonparametric tests is available for researchers who wish to analyse such data, and these are more suitable than parametric tests like the t-test or the analysis of variance.