This article was first published on 13 June 2005. It is interesting to see that information retrieval has recently become fashionable again with the rise of big data and natural language processing, considering that many thought search was a solved problem back then.
I read about an interesting project called Aquaint the other day. It is partly funded by the CIA and the NSA and concerns information retrieval. The project is in response to the weaknesses of the current methods of searching for relevant information. While it is very possible to use Google to find just about everything, it is not all plain sailing.
Misleading Information
For one of my experiments, I had people create their own concerns (information needs) and write queries to get information about them. Using their information, I found that approximately 50% of searches produced some information of relevance. However, the percentage of completely relevant information was much lower (under 29%). This is still a respectable figure considering the billions of irrelevant documents that a search engine like Google catalogues.
However, when trying to satisfy a complex information need, the majority of the work still lies with the user and not with an automatic system. Current search engines are superb for answering simple fact-based questions (such as “what is the height of Mount Everest?”), but understanding complex issues that are only covered by a multitude of disparate documents still has few solutions other than humans working hard.
Information Overload
I would guess that the NSA and CIA want to be able to trawl through massive databases of information and derive useful knowledge from them (such as a sudden gathering of terrorists in a US city). The amount of information being conveyed across the planet is just too much for humans to cope with, even if systems such as Echelon provide enormous amounts of filtering. Certainly, from my own work, I became acutely aware that relying on automatic (read: simple and relatively stupid) methods of dealing with large databases of information was going to present a problem. Most information needs are not simple but very complex. What is submitted to search engines is often just a small part of an information need, broken down to make the task more manageable. A system capable of addressing these needs would be very valuable indeed.
So how does this system work? From what I can gather, a lot of the work lies in classic (and currently very unfashionable) information retrieval. NB - I say it’s unfashionable because most attention is focused upon interactive search in HCI rather than plain retrieval.
The first problem is having the system understand what it is that the user is searching for. Already, there are good tools to facilitate this task (such as categorisation systems, query expansion), but there was one tool which piqued my interest: contextual interpretation.
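As a rough illustration of what query expansion involves, here is a minimal sketch in Python. The synonym table and the function name `expand_query` are my own assumptions for the example. They are not taken from Aquaint or from any real search engine, which would more likely draw on a thesaurus or on query logs.

```python
# Hypothetical synonym table, invented for illustration only.
SYNONYMS = {
    "review": ["evaluation", "test", "rating"],
    "camera": ["dslr", "compact"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original terms followed by any related terms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        # Keep the user's own term first, then append its related terms.
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("camera review"))
```

Note that expansion only widens the net. It adds terms, but it knows nothing about why the user is searching, which is where context comes in.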
Inferring Context
My own research indicated that understanding the context of what or whoever is being interacted with is of prime importance. Certainly, humans made mistakes when context was misunderstood, even when they generated their own searches in areas that they were familiar with. For example, take Google’s text snippets (the two lines of text with keywords embedded in bold characters). These are a traditional feature of the Google interface, and Google is unlikely to ever change them. However, I found that if the document was not relevant to the concern (and at this stage, only the user can truly know this), the user often inferred the context wrongly and selected what turned out to be a non-relevant document. The presence of keywords in the snippets encouraged the user to assume that the document was more relevant than it actually was.
Without being able to adequately understand context, searches are always going to throw up non-relevant documents. The problem for search engines is trying to understand the user’s context from just a handful of search terms. Even other humans cannot do this with any degree of accuracy, but search engines can be phenomenally incorrect sometimes.
As an example, if I want to read a review of a digital camera, I can enter the word “review” along with the camera manufacturer and the camera’s model. However, the returned documents almost always consist of online stores trying to sell me the camera, with very few reviews, which were the object of my exercise. This is because the word “review” appears on many pages that contain nothing relevant to my task. The search engine doesn’t know this and (currently) cannot, so I am left ploughing through many pages of returns to find an objective review.
in and of itself, they can make an accurate assessment of whether the information is relevant or not. The problem remains, though, that there is simply too much information to be realistically sorted through, even by a large team of trained analysts. A system that could accurately understand the context of the information need would reduce the dependence on human supervision, but the reliability would need to be extremely high indeed.
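To make the snippet problem concrete, here is a small Python sketch of keyword-in-context snippet generation. It is only my guess at the general idea and not how Google actually builds its snippets. The function `make_snippet`, the `width` parameter and the "Acme X100" page are all hypothetical.

```python
import re


def make_snippet(text, keywords, width=60):
    """Cut a window of text around the first keyword hit and mark every keyword in bold."""
    if not keywords:
        return text[:width]
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        # No keyword found: fall back to the start of the document.
        return text[:width]
    start = max(0, match.start() - width // 2)
    window = text[start:start + width]
    # Wrap each keyword occurrence in ** to stand in for bold text.
    return pattern.sub(lambda m: "**" + m.group(0) + "**", window)


# Hypothetical store page: it mentions "review" but contains no review at all.
page = "Buy the Acme X100 camera today. Customer review section coming soon."
print(make_snippet(page, ["acme", "review"]))
```

The snippet looks promising because the search terms are there in bold, yet the page holds nothing the user wanted. The keywords are present, but the context is wrong.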
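The camera example can be sketched with a deliberately naive scorer that ranks pages purely by counting query terms. This is a simplification I am assuming for illustration. Real engines use far richer signals, and both pages below are invented.

```python
def keyword_score(document, query_terms):
    """Count how many times the query terms occur in the document (case-insensitive)."""
    words = document.lower().split()
    return sum(words.count(term.lower()) for term in query_terms)


# Hypothetical pages, invented purely for illustration.
pages = {
    "store": "acme x100 camera buy now write a review read review add review "
             "acme x100 camera in stock",
    "article": "we tested the acme x100 camera for a month and found the lens "
               "sharp but the battery weak",
}

query = ["acme", "x100", "camera", "review"]

# Rank page names from highest to lowest keyword count.
ranked = sorted(pages, key=lambda name: keyword_score(pages[name], query), reverse=True)
print(ranked)
```

The store page wins simply because it repeats the word “review” in its navigation links, while the page that actually is a review never uses the word.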
Concluding Thoughts
Personally, I think this is an enormously ambitious project and I truly wish them the best of luck (indeed, if I could help, I would like to). However, I am only too aware that the resources required to tackle this task would be great indeed (my Ph.D. addressed only a couple of aspects of this project, and in a form so much simpler as to be unrecognisable in operation).
Just as speculation, I guess that if the human operators and the system they use understood each other’s context perfectly, a very effective information retrieval system would be feasible. Indeed, it would also indicate the start of true human-computer interaction.