17 January 2012

Outliers - a conundrum

I originally published this article on 28 October 2004 at Milui Articles

Outliers are points of data that lie outside of what would be expected. They can be due to typographical errors (e.g., typing in an enormously incorrect number), participant error (e.g., forgetting to respond to a visual stimulus, causing a very long reaction time), or an interesting effect that the researcher may not have considered before. However, the way outliers are calculated may itself result in erroneous data being used in the analysis of the research.

Consider these data: 1, 2, 3, 94. Of these four points, three of them are close together (the 1, 2 and 3), but the 94 is way out of line with them. If these data were tested together, the 94 could be considered an outlier because it is so different from the others.

In general terms, outliers are dealt with by deleting them, transforming them, or investigating them further; which of these happens depends upon what type of outlier the data point is and what the researcher decides is the best way to deal with it.

Probably the most common process for determining outliers is to take the mean and a variance term of the data, and use these to examine which data points lie well outside the norm. A common method is to subtract the first quartile from the third (giving the interquartile range), and then calculate boundaries around the median. Anything lying beyond the median plus or minus 1.5 times the interquartile range, but within 3 times, may be considered a mild outlier, whereas anything beyond the median plus or minus 3 times the interquartile range may be considered an extreme outlier.
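By way of illustration, here is a minimal sketch in Python of the rule just described; the function name outlier_fences and the 1.5 and 3 multipliers simply follow this article's description, not any particular statistics package, and it requires Python 3.8 or later for statistics.quantiles.

    import statistics

    def outlier_fences(data, mild=1.5, extreme=3.0):
        # First and third quartiles, and the interquartile range
        q1, _, q3 = statistics.quantiles(data, n=4)
        iqr = q3 - q1
        med = statistics.median(data)
        # Boundaries around the median, as described above
        mild_bounds = (med - mild * iqr, med + mild * iqr)
        extreme_bounds = (med - extreme * iqr, med + extreme * iqr)
        return mild_bounds, extreme_bounds

    print(outlier_fences([1, 2, 3, 94]))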

A problem, however, lies in working out what values should be used for these calculations. Using the method described above (median +/- [1.5 * interquartile range]) is straightforward for univariate data, but what exactly goes into calculating the median and interquartile range? Should outliers be included in the calculation of the median and interquartile range to which they are being compared?

The rationale for excluding them is that an outlier (often) should not be there in the first place (definitely so with typographical errors), and is thus erroneous. Comparing data to a median and interquartile range that include erroneous values will produce an erroneous result, often in favour of the outlier (i.e., failing to identify it). The GIGO (garbage in, garbage out) principle implies that any subsequent analyses are then likely to be erroneous too.

This is not a problem in all cases: if there are indeed no true outliers, then all the data are correct and nothing is lost. If outliers are present, however, then we do have a problem. The best way to solve it is to exclude the outliers from the calculation of the median and interquartile range; that way, the suspect data points are compared to correct values.
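To make the conundrum concrete, the following sketch (again hypothetical Python, with a made-up helper called flags) checks whether 94 falls outside the median plus or minus 1.5 times the interquartile range, first when 94 is included in those calculations and then when it is excluded.

    import statistics

    def flags(sample, value, k=1.5):
        # True if value lies outside median +/- k * interquartile range of sample
        q1, _, q3 = statistics.quantiles(sample, n=4)
        med = statistics.median(sample)
        return abs(value - med) > k * (q3 - q1)

    print(flags([1, 2, 3, 94], 94))  # False: fences computed with 94 included are too wide
    print(flags([1, 2, 3], 94))      # True: fences computed without 94 catch it

With 94 included, the interquartile range is inflated to about 70 (using Python's default quantile method; other quartile conventions give slightly different numbers but the same picture), so the fences stretch far enough to swallow the outlier; with 94 excluded, the interquartile range is 2 and the outlier is obvious.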

The conundrum in the title of this essay refers to this particular situation. If a researcher has a data set with outliers, they should exclude them from the calculation of the median and interquartile range, but if the data set has no outliers, they should use all of the data. The conundrum is that this decision cannot be made until the outliers have been identified.

In short, you need to know what the outliers are before you can discover that they are outliers!

How this could be resolved is discussed in a future article.
