I originally published this article on Thursday 21 October 2004. This was before remote testing became commonplace, so I like to think I was a bit ahead of the curve here.
Or is it correct to test anonymous people over the Internet?
A tricky question indeed, and the cautious amongst us may say “no way!”.
Why is that? Is there something about Internet users that automatically makes them worthless as experimental participants? Not inherently, but the difficulty is that it is impossible to verify that people are doing the test correctly.
If, for example, I wanted to test somebody’s ability to navigate around a small website, how can I be sure that they haven’t already done the test and are redoing it (at a different computer and time) just to show that they can complete it to their satisfaction? Such demand characteristics should be accounted for, because they can confound the experiment.
From this there are three questions:
- Does this really happen?
- If it happens, is it of any statistical significance?
- If so, how can it be controlled?
The first question depends largely upon the people who visit a website. A university site with student-only sections may be able to ensure that it knows exactly who is doing the test, but ‘out in the wild’, things are different. All it takes is a few jokers and the whole set of results becomes worthless. Though I have no evidence for this, I expect my audience to be people who take this stuff more or less seriously: they are here because they are interested in HCI issues. If so, then I think it is safe to assume that the people taking part in an online experiment can be trusted to be decent about it (but then I’m a hopeless optimist when it comes to human nature!).
The second question again depends upon the audience. As mentioned, if the first issue doesn’t arise, then this question (and the last one) are both moot, which is good: go ahead and analyse the data. I reckon the best way forward would be to use L-scores or something similar to screen participants. This has the drawback that valid participants who fall far outside the mean performance will be excluded (some might say that automatically excluding outliers is a good thing, but I would rather examine the raw data first before making this decision).
The third question: how does one control for this? Again, this depends on the first two questions being real issues. If they are not, then there is nothing to worry about; if they are, the invalid participants have to be recognised and dealt with appropriately.
So what does this mean?
Now, onto pragmatics.
Is there a way of testing whether a set of results is good or not? The basic idea within psychometrics is the L score. This was designed to test responses by asking the same question from different points of view: the presence of inconsistencies indicates that the participant isn’t being entirely truthful. Two questions that could be used would be:
- I am the life and soul of parties;
- At busy social occasions, I prefer to stay quietly with the crowd.
Clearly, contradictory answers to these questions imply that the participant might (for example) be trying to answer positively to each question.
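As a toy sketch of how such a consistency check might be scored (the scoring scheme, item pairing, and all names here are my own illustration, not a standard instrument): reverse-score the second item of each pair and measure the gap between the two answers.

```python
# Sketch of a lie-scale-style consistency check (hypothetical items and
# scoring, for illustration only). Each pair holds a 1-5 Likert answer to
# a statement and to its reversed counterpart. After reverse-scoring, an
# honest participant's two answers should be close; a large gap suggests
# the answers contradict each other.

LIKERT_MAX = 5

def reverse_score(answer: int) -> int:
    """Map a 1-5 answer onto its reversed scale (5 -> 1, 4 -> 2, ...)."""
    return LIKERT_MAX + 1 - answer

def inconsistency(pairs: list[tuple[int, int]]) -> float:
    """Mean absolute gap between each item and its reverse-scored twin."""
    gaps = [abs(a - reverse_score(b)) for a, b in pairs]
    return sum(gaps) / len(gaps)

# A participant who strongly agrees with both "life and soul of parties"
# and its reversed counterpart is contradicting themselves.
suspect = [(5, 5), (4, 5)]
honest = [(5, 1), (2, 4)]

print(inconsistency(suspect))  # 3.5 -- large gap, worth flagging
print(inconsistency(honest))   # 0.0 -- answers agree after reversal
```

Where to set the threshold between “inconsistent” and “merely noisy” would be an empirical question for the experimenter.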
However, for a lot of cognitive psychology, these questions are hard to ask or incorporate into the design of an experiment: the designer must be cautious, or else the questions and their nature will stick out like a sore thumb, causing problems with the data. How would one ask the above two questions when one it trying to understand somebodies mental model of a web browser’s navigation system? Such questions must also not interfere with the testing itself: they must not provide cues to the answers of other questions. The above two examples, for instance, could easily apply to a measure of a persons extraversion. Possibly the best way is to ask questions central to the research aim but from different points of view. This may aid knowledge elicitation.
Statistical comparison may also be a viable method: outliers can be tested against the population average and disqualified if they lie outside the bounds of normality. This is often performed for many different analyses and is therefore valid, but the experimenter will need to be sure that the population average isn’t the basis of invalid responses (i.e., if you are testing 20 people and 4 of them are way out of line, can you be sure that the remaining 16 aren’t jokers?).
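The kind of screening described above might look like this in practice (a minimal sketch; the z-score threshold and the data are illustrative, not from the article):

```python
# Flag responses whose z-score against the sample mean is extreme.
# The 2.5 cut-off and the timing data are made-up illustrations.
import statistics

def flag_outliers(values: list[float], z_cut: float = 2.5) -> list[bool]:
    """Flag each observation whose z-score exceeds z_cut in magnitude."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [abs(v - mean) / sd > z_cut for v in values]

# Nine plausible task-completion times (seconds) plus one far-out response.
times = [30, 31, 29, 32, 30, 28, 31, 30, 29, 300]
print(flag_outliers(times))  # only the 300-second response is flagged
```

Note the caveat from the paragraph above applies directly: the rogue responses inflate both the mean and the standard deviation they are tested against, so with enough jokers in the sample this screen stops working.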
Following up a test with participants might also be useful: contacting them at a later date (preferably in real time using chat or ICQ) to qualify their responses can often help the experimenter to estimate a participant’s likely intent (whether good or bad). This may also help the knowledge elicitation process.
A more rigorous solution would be to vet participants: test only those whose veracity can be ascertained beforehand. Of course, doing this can be difficult, and it severely limits the utility of online testing by shrinking the test population. However, this depends upon the design of the experiment and the experimenter’s wishes.
A final solution would be to alter the design of the experiment to increase its power: test more participants. The rationale is that with more people tested, the invalid participants’ responses have less influence on the analysis. Indeed, in my own research, I have found that it is sometimes possible to live with a lot of variance within an empirical framework.
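The dilution argument can be illustrated with some back-of-the-envelope arithmetic (all the numbers here are made up): a fixed handful of wildly-off responses badly distorts a small sample but barely moves a large one.

```python
# Toy illustration: how much do a fixed number of rogue responses
# distort the sample mean as the number of genuine participants grows?
# Genuine responses average 30.0 and rogue ones 300.0 -- both invented.
def contaminated_mean(n_genuine: int, genuine: float = 30.0,
                      n_rogue: int = 4, rogue: float = 300.0) -> float:
    """Sample mean when n_rogue far-out responses sit among the genuine."""
    total = n_genuine * genuine + n_rogue * rogue
    return total / (n_genuine + n_rogue)

# The same four jokers wreck a sample of 16 but barely dent one of 1000.
for n in (16, 100, 1000):
    print(n, round(contaminated_mean(n), 1))
# 16 -> 84.0, 100 -> 40.4, 1000 -> 31.1
```

The trade-off, of course, is that recruiting and analysing the larger sample costs the very time that online testing was supposed to save.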
In short, online testing can allow access to a larger population than could normally be tested. In addition, many people may be tested at once, reducing the workload of the experimenter significantly. However, as with many things, there are drawbacks: the experimenter cannot know that the participant was actually tested properly. There are means of coping with this, but with online testing, the time and effort gained will have to be set against the time and effort spent ensuring that the results are veracious.