A new view of research validity theory
This article was first published on 24 January 2005.
Terra Fantastica
The problem with much of what is written about validity (and validation) is that it goes into so much detail, often on obscure (or is that opaque!) points. The minutiae are incredible and need to be read to be believed.
But whether a study is valid or not should be fairly simple, shouldn't it? I guess things got confusing when the concept of construct validity was introduced. This type of validity (from what I gather) applies when we cannot define directly what it is we are looking at; rather, it is defined by what we observe. In some ways, the name we assign to a theoretical construct is not important - instead of, say, intelligence, we could say construct 12328A and get the same thing. Intelligence, though, has some meaning.
So how can we really clean this mess up? I think that perhaps the best way is to examine what causes research to be invalid rather than valid (I'm trying to be a good little empiricist here, so I'm going to falsify rather than confirm).
The world of perfection
In order to understand how a piece of research can go wrong (as regards validity), maybe we should compare some real-world research with some perfect research. However, perfect research does not exist, at least not in the absolute sense. To get around this, I will introduce a world called Terra Fantastica. It is a planet in a perfect universe, occupied by researchers who perform perfect research. They know exactly what every measurement is and exactly what it means, and their papers are out of this world (literally and figuratively) - and, of course, fully accurate.
Human imperfection
So if we sent a team of our own, very fallible researchers up there to do some research, what would the Fantasticans tell us was wrong? Clearly our research would differ from theirs, sometimes significantly, other times not so much, but differ it would.
For a start, every measurement we take is an approximation of what really happens. The Fantasticans have a method that allows them to know what is really going on, but we can only approximate it, by converting an observable event into numbers or categories or some other convenient form of shorthand. All measurements are therefore discrete - even something like reaction time or temperature. A doctor of ours might take a person's temperature and say, "Wow! It's 37.6 degrees centigrade - this patient doesn't seem ill (so far)". A Fantastican would look over the doctor's shoulder and say, "Well, actually the temperature is 37.59832745292746583939393839..." and carry on to however many decimal places they need to. The important point, though, is that the difference between the two readings has no effect on the eventual finding: the patient's temperature is within a normal range, even though the reading differed between the human doctor and the Fantastican.
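To make this concrete, here is a minimal sketch in Python (my own illustration - the normal-range bounds are made up for the example, not clinical guidance) showing that the doctor's rounded reading and the Fantastican's exact value support the same conclusion:

```python
# A toy illustration: the Fantastican's "true" value versus our rounded
# reading. The normal-range bounds are assumptions made up for this example.

NORMAL_RANGE = (36.1, 38.0)  # assumed normal body temperature range, deg C

def within_normal_range(temp_c):
    low, high = NORMAL_RANGE
    return low <= temp_c <= high

true_temp = 37.59832745292746      # the Fantastican's (truncated) exact value
our_reading = round(true_temp, 1)  # our thermometer resolves to 0.1 deg C

print(our_reading)                       # 37.6
print(within_normal_range(true_temp))    # True
print(within_normal_range(our_reading))  # True - the same conclusion
```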
However, there may be other occasions when there is a larger difference. For example, a human researcher may create a test that they say measures intelligence. When they proudly show off some results, a Fantastican might glance at them and say, "Well, actually that's not how intelligence is measured. It's done by a test like this", and out comes a different test that would draw different conclusions from the same person (or the Fantastican might say that there is no such concept as intelligence - I cannot say which because I'm not a Fantastican). The human researcher would then (presumably) go quiet and sneak a look at the Fantastican's test to try and derive a newer, more accurate one (or, being human, he or she might start a war).
Because the tests drew different conclusions from what one would assume are the same circumstances, we can conclude that, as the Fantastican test is accurate, the human one is erroneous. That's okay because we are, well, human. However, we also want to strive to excel, and a good place to start might be to understand the sources of the errors present in our test. For example, a human intelligence test might not include anything about spatial manipulation when it should (or it might when it shouldn't!).
Therefore, it seems reasonable to assume that a valid experiment would be free of all these sources of error. Because our researchers are humans measuring things in a human world, they will always be subject to a certain amount of error, but as I showed with the doctor taking a patient's temperature, sometimes the error is so small as to be unimportant - and if all the error is unimportant, we can assume that our study is valid (i.e., it is so close to the method and results of a Fantastican study as to make no real difference).
That's good: even though plenty of error occurs regardless of our best intentions, we can still possibly draw some accurate conclusions. That's heartening, even if the Fantasticans shake their heads sadly when they see our joy.
The problem then remains of being able to quantify the amount of error that our studies have. If we can do that, we can get an understanding of how far our studies fall short of the Fantasticans' ideal, so that they wouldn't shake their heads so vigorously. Short of kidnapping a Fantastican and nicking their tests, we have to roll up our sleeves, get our hands dirty, and begin to criticise our study. We can do this by taking Fantastican versions of our tests and comparing them. They let us do this because they feel so sorry for us, and deep down they are really nice people, even if a bit pedantic about research methods.
Tool implementation reliability
First up is a study looking at human body temperature, for which the Fantasticans use proto-mega-thermodials. These work perfectly. A Fantastican takes the temperature of a patient, and we use a thermometer. However, our thermometer (unknown to us) is broken and reads ten degrees centigrade. Before we sympathetically tell our patient that he or she is dead, the Fantastican tells us that our measurement tool is broken. Thermometers are fine for taking temperatures, just not this one in particular. With glee, we get a new thermometer, take the temperature again, and get a result. The Fantastican is happy because we now have a tool that (to a reasonable degree of accuracy) can reliably take temperatures, whereas the broken one couldn't: the measurements are reliable in that they are consistently close to those of the proto-mega-thermodial.
So one source of unreliability might be the implementation of a tool. Note that this concept of reliability is very close to what some might call validity; however, we are examining the validity of our entire study, so instead we treat the operation of this particular thermometer as a matter of reliability. Note that if we were interested in extremely small variations in temperature, we might find an ordinary thermometer to be unreliable: the consistency between our thermometer and the proto-mega-thermodial might be poor enough that our conclusions are bunkum. The consequence would be that our study is invalid - the tool isn't measuring what it is supposed to, to the correct degree (if you'll pardon the pun - it was unintentional).
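One way to picture this check is a toy comparison of each thermometer's readings against the proto-mega-thermodial. This is only a sketch under my own assumptions - the tolerance of half a degree is invented:

```python
# A toy check: treat the proto-mega-thermodial readings as the gold standard
# and flag a tool whose readings are not consistently close to them.

def implementation_reliable(tool_readings, gold_readings, tolerance=0.5):
    """True if every paired reading is within tolerance of the gold standard."""
    return all(abs(t - g) <= tolerance
               for t, g in zip(tool_readings, gold_readings))

gold_standard = [37.0, 36.8, 37.4]   # Fantastican readings
broken_tool   = [10.0, 10.0, 10.0]   # our broken thermometer
working_tool  = [37.1, 36.7, 37.5]   # the replacement

print(implementation_reliable(broken_tool, gold_standard))   # False
print(implementation_reliable(working_tool, gold_standard))  # True
```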
Tool purpose reliability
Emboldened by our success, we try to measure our patient's level of extraversion. Our resident Fantastican is not happy, though, because we tried to measure it by inserting our thermometer into the patient and using the reading. "No, no, no! That will not tell you how extraverted this person is!" says our Fantastican. For a moment, we are stunned and humbled, but then we realise that we have just been exposed to another source of error: tool purpose error. We confirm this when we insert the thermometer into two people, one of whom is like Robin Williams on speed, while the other just sits in a chair quietly sulking (perhaps it was the indignity of having thermometers repeatedly inserted at regular intervals). Both measurements read exactly the same, yet the two patients seem to us different somehow.
The Fantastican pulls out a Mega-Hyper-Shouterometer, which accurately gauges extraversion, applies it to both patients, and shows us the difference, which bears no resemblance to what we observed. The Fantastican tells us, "No, your tool is not broken this time! However, you used the wrong tool to measure what you wanted to measure!"
That's actually good for us. We now know that whatever purposes our tool could be put to, measuring extraversion is not one of them. To me, this is very much the original concept of validity, which assessed the use of a tool and not the tool itself. However, using a tool for an incorrect purpose could be said to be unreliable. If we used a thermometer to predict people's level of extraversion, we would find the results were inconsistent: loud, outgoing people would record the same temperatures as quiet sulkers. Likewise, using the high score from a space invaders game (which of course we never show the Fantastican in case they get offended) may also be an unreliable measure of extraversion (but it may not - I'm just hypothesising here) because there is no relation between the two scores.
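If we wanted to put a number on "no relation between the two scores", one hypothetical approach is to correlate our tool's output with the gold standard. Everything below is invented for illustration:

```python
# A toy quantification: if a tool really measures the construct, its readings
# should relate to the gold-standard scores. Pearson's r is one possible
# measure; all the numbers here are made up. Requires Python 3.10+.

from statistics import correlation

gold_extraversion = [8, 2, 6, 4, 10]                # Mega-Hyper-Shouterometer
temperatures      = [37.1, 37.1, 36.6, 37.1, 37.1]  # our thermometer

r = correlation(temperatures, gold_extraversion)
print(round(r, 2))  # 0.0 - temperature tells us nothing about extraversion
```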
Rater reliability
While the Fantastican gets rid of the extravert (because they won't stop talking to us and interrupting our work - we leave the sulker alone because he's quiet), we sneak off and photocopy its test for extraversion. Hooray! We can now measure it accurately, so we administer it to a lot of people and get some results. The Fantastican notices this and admonishes us - not for stealing its test, but for administering it ourselves.
"Ah, but you have to be trained to administer this!" says the Fantastican. "See these scales - you have to rate your patients on them and to do that properly requires training". Comparing results, we see that another source of error has crept in: recording error.
Using a group of Fantasticans as a "gold standard", we can see that the results recorded by humans are not consistent: they are unreliable. This issue becomes even more important when we are using qualitative methods (such as interviews or protocol analyses), where the problems only become clear when we compare our findings to our aliens' ideal. Our study was therefore unreliable in that the findings of researchers untrained in the use of the Mega-Hyper-Shouterometer were not consistent with those of our ideal.
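A crude way to see this is to score each rater's agreement with the Fantastican ratings. A sketch, with invented numbers and the simplest possible agreement measure (a real analysis might reach for something like Cohen's kappa):

```python
# A toy rater-reliability check: score each human rater's agreement with the
# Fantastican gold-standard ratings, as a simple proportion of matches.

def agreement(rater, gold):
    """Proportion of items on which the rater matches the gold standard."""
    return sum(r == g for r, g in zip(rater, gold)) / len(gold)

gold_ratings    = [5, 3, 4, 2, 5, 1]  # trained Fantastican raters
untrained_human = [3, 3, 1, 4, 2, 1]  # us, before training
trained_human   = [5, 3, 4, 2, 4, 1]  # us, after training

print(agreement(untrained_human, gold_ratings))  # ~0.33
print(agreement(trained_human, gold_ratings))    # ~0.83
```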
Interpretation reliability
Fed up with this, we go back to taking temperatures. We notice that one group (who were standing in a draughty corridor) have a lower temperature than the others (who were wrapped up warm in bed). However, one researcher concludes that the colder group have a lower temperature because they are more intelligent. The Fantastican disagrees and says, "No, it's because they were standing in a cold corridor before you took their measurement."
"Ah", we say as we realise that we have just been opened to another source of unreliability: interpretation reliability. In this, the interpretation that a researcher applies to the results of analysis is not consistent with what is actually happening.
Of course, the big problem is that our researcher's conclusion was not consistent with the gold standard as represented by the conclusions of the Fantasticans. This could be discussed as a problem of validity (or rather, a lack of it), but the inconsistency with the truthful conclusion makes it primarily a problem of reliability.
Validity?
Having ridden roughshod over the existing concepts of validity, we are left with the question: where does validity come into this? How can we say whether a study is valid?
The simple answer is that if all of the previously mentioned sources of unreliability are controlled adequately (i.e., they are minimised to insignificance), then the study is valid. As I mentioned earlier, validity is simply an emergent property: it is what we have when we cannot find evidence to support the hypothesis that unreliability (of one kind or another) exists in the study.
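In other words, under this view validity is just the conjunction of the reliability checks passing. Here is a toy formalisation of that idea (my own sketch, not a standard method):

```python
# A toy formalisation of the framework: a study is treated as valid only when
# none of the four sources of unreliability is found to be significant.

RELIABILITY_CHECKS = [
    "tool implementation",  # does this particular instrument work?
    "tool purpose",         # does the tool measure the intended construct?
    "rater",                # are the raters consistent with the standard?
    "interpretation",       # do the conclusions follow from the results?
]

def study_valid(significant_unreliability):
    """Valid iff no check has found significant unreliability."""
    return not any(significant_unreliability.get(check, False)
                   for check in RELIABILITY_CHECKS)

print(study_valid({}))               # True - no evidence of unreliability found
print(study_valid({"rater": True}))  # False - rater unreliability is significant
```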
Of course, this doesn't really simplify the debate. Tool purpose reliability is much the same as the original concept of validity (i.e., does the tool measure what it is supposed to measure?). However, by recognising that this question of purpose reduces to one of reliability, one may find that the process of assessing validity becomes easier.
In addition, this method allows the assessment not just of the purpose of the tool, but of the entire study: used as a framework, it can examine the whole of a piece of research.
In real life, though, there is no gold standard. Fantasticans don't exist (at least - and sadly - not on my street), so we have no ideal measurement with which to compare. This is where the real skill and talent of the researcher come into play.
I'll discuss ways to approach these reliability assessments in a later article.