17 January 2012

Latent Semantic Analysis - an overview

This article was originally published on 27 November 2004

One of the major problems facing machine text processing is how to deal with synonyms and polysemes. A polyseme is a single word that has several meanings (such as “tear”, which can refer to a teardrop or to a rip), and synonyms are several words that share a single meaning (such as “automobile” and “car”). While the context of the text usually makes the intended meaning unambiguous to humans, machines do not have this advantage.

When processing a document, humans will form an overall comprehension of it (a gist) that enables them to deal with any ambiguity by placing it firmly within the context of the document (Kintsch & van Dijk, 1978), though I wonder if priming the reader through the cues presented within the text has something to do with it.

Machines find it hard to do this. When comparing documents, even the best programs are likely to match on literal similarity rather than analogical similarity (for examples, see Gentner’s work with the Karla the Hawk stories). Humans can easily understand the meaning of documents, whereas machines cannot do so effectively or reliably.

Latent semantic analysis (or LSA; Landauer, Foltz & Laham, 1998) is one step towards enabling machine comprehension of a document. It is a statistical technique that analyses a corpus and notes co-occurrences of words across contexts. This is interesting because previously only brute-force methods allowed this, and LSA remains a relatively elegant solution.

Consider somebody faced with two documents. One refers to cars only as “cars”; the other refers to them only as “automobiles”. Despite this difference, the topic is exactly the same (for example, how to change the oil), so the meaning is not different.

Typical machine processing would judge the documents to be different unless it was programmed to recognise that “car” and “automobile” refer to the same thing. As I mentioned, this is a brute-force method: every possible synonym has to be defined explicitly, and novel uses of words cannot be dealt with until the programmers “code in” the synonymy.

When a corpus is analysed by LSA, it will probably note that both documents, while using different words for car, still have many other words in common. With our example of changing the oil, it will show that words like “oil”, “drain”, “sump” and so on co-occur with “car” in the first document and with “automobile” in the second. LSA can therefore derive a semantic connection between the words “car” and “automobile”, even when they never appear in the same document, and with no explicit instructions needed. This can account for novel uses of words.
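The idea can be sketched in a few lines of Python. This is only a minimal illustration, assuming NumPy is available: the six tiny "documents" are invented for the example, the matrix holds raw word counts (real LSA usually applies a weighting step first), and keeping `k = 2` dimensions is an arbitrary choice for such a small corpus.

```python
import numpy as np

# Hypothetical toy corpus: "car" and "automobile" never share a document.
docs = [
    "car oil drain sump",
    "car oil drain engine",
    "automobile oil drain sump",
    "automobile oil sump engine",
    "garden rake soil lawn",
    "garden rake lawn seed",
]

# Build a term-by-document matrix of raw word counts.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[index[w], j] += 1


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


# Singular value decomposition, then keep only the k largest dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumed number of latent dimensions for this toy corpus
term_vectors = U[:, :k] * s[:k]

# Similarity of "car" and "automobile" before and after the reduction.
raw_sim = cosine(A[index["car"]], A[index["automobile"]])
lsa_sim = cosine(term_vectors[index["car"]], term_vectors[index["automobile"]])
unrelated_sim = cosine(term_vectors[index["car"]], term_vectors[index["rake"]])

print(f"raw: {raw_sim:.2f}  reduced: {lsa_sim:.2f}  car vs rake: {unrelated_sim:.2f}")
```

In the raw counts the two words have nothing in common, because they never share a document. After the reduction they end up pointing in nearly the same direction, while “car” and “rake” stay apart.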

Some may say that this allows the meaning of words to be known by the machine. However, this is incorrect: LSA is a blind statistical technique. Consider:

“For sale: garden rake from man with iron teeth”;
“For sale: garden rake from man, with iron teeth”;

These two sentences are subtly different. The first refers to a man with iron teeth selling a garden rake, whereas the second refers to a garden rake with iron teeth being sold by a man. Because LSA works only from which words occur, and ignores word order and punctuation, it would not be able to tell the two sentences apart, even though their meanings differ.
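To make this concrete, here is a small sketch of what a word-counting method sees. The `bag_of_words` helper is a hypothetical name for this example, and the tokenisation (lower-casing and keeping only runs of letters) is an assumption rather than what any particular LSA implementation does.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count words, discarding case, punctuation and word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


a = "For sale: garden rake from man with iron teeth"
b = "For sale: garden rake from man, with iron teeth"

# The strings differ, but their word counts are identical.
same_bag = bag_of_words(a) == bag_of_words(b)
print(a == b, same_bag)
```

Once the comma is thrown away, nothing remains to distinguish the two adverts.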

This problem was illustrated by a wonderful study by Powers et al. (2001) in which a group of skilled writers were asked to fool an automatic essay scoring system (which was based on LSA). First of all, they had to submit essays that produced low marks from human raters and high marks from the automatic system, and then vice versa.

The first task was accomplished easily. Some participants simply wrote a single paragraph and repeated it up to the word limit, while others just wrote a single repeated sentence. Obviously the paragraphs or sentences contained words relevant to the question. Others simply repeated the question.

The automatic system marked these essays highly: comparison to an “ideal” essay showed a high degree of similarity and thus a high mark was awarded. The human raters, on the other hand, could see that there was no content worth a high mark. However good the sentence, repeating it fifty times does not make an essay fifty times as good! Consequently, the human raters awarded low marks.
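One way to see why repetition can go unpunished is sketched below. It uses plain cosine similarity over word counts, which is an assumption made for illustration and not the scoring system's actual method; the “ideal” essay and the repeated sentence are invented.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical "ideal" answer and a single relevant sentence.
ideal = Counter(
    "drain the old oil from the sump then refill the engine with fresh oil".split()
)
sentence = "drain the oil from the sump"

once = Counter(sentence.split())
fifty = Counter(((sentence + " ") * 50).split())

score_once = cosine(ideal, once)
score_fifty = cosine(ideal, fifty)
print(f"{score_once:.3f} {score_fifty:.3f}")
```

Repeating the sentence multiplies every count by fifty, which lengthens the vector without changing its direction, so the similarity to the ideal essay stays the same. A measure of this kind cannot tell a padded essay from a concise one.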

The converse task (high marks from humans, low marks from automatic raters) was a little more difficult. The essay had to communicate meaning that was pertinent to the question, and yet did not contain any words that were characteristic of a good answer.

The solution was for the writers to use simile, analogy and metaphor. By referring to the content indirectly via metaphor, the meaning could still be communicated. As expected, the automatic system gave these essays a significantly lower mark, as the expected content was not present (or rather, was not present in a form that could be used). However, the differences here were smaller than in the first task.

This illustrates that LSA is not a sufficient model of human text comprehension. The meaning of a document may not be tied directly to its syntactic content, but rather to what that syntactic content communicates. Considering the variety of interpretations placed on certain texts (such as religious texts), this process is not a reliable one between humans either, so by expecting a machine to do this perfectly, we might be expecting too much of silicon life.

Applications of LSA

LSA can be tremendously useful. Because it can draw our attention to co-occurrences of terms, it allows a degree of context to be inferred. Current applications include (as mentioned) automated essay scoring (with limitations), text comparison, keyword extraction and others. Another possible use is spam filtering. If the context of an email’s content can be inferred, it should be much more likely that it can be accurately categorised as either spam or genuine.

Some companies are already investigating this use, but it has yet to gain wide acceptance.

Could LSA be improved?

One way of perhaps improving LSA is by improving the materials it has to work with. Current methods of LSA extract only “meaningful” words: function words (such as “and” or “not”) are omitted completely from the analysis, yet these may play an important role in interpreting meaning. As an example, take a sentence and negate its proposition by inserting the word “not” in a suitable place. The meaning is completely reversed, yet the LSA process would consider both versions to be the same.
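A short sketch shows the effect. The stop-word list and the `content_words` helper are made up for this example; real systems use much longer lists, and whether “not” is on the list varies between them.

```python
# Hypothetical stop-word list; real lists are far longer.
STOP_WORDS = {"a", "and", "is", "not", "the", "to"}


def content_words(text):
    """Keep only the words that are not on the stop-word list."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


positive = "the engine is safe to run"
negative = "the engine is not safe to run"

# Both sentences reduce to the same content words.
print(content_words(positive))
print(content_words(negative))
```

With “not” filtered out before the analysis begins, the two sentences are identical as far as any later step can tell.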

But including such words may cause problems in the analysis. While important to the meaning of a text, they may be difficult to incorporate. Certainly the co-occurrence of the word “and” does not give us any insight into the meaning of different texts. Further work is needed to find an effective way of carrying out a good analysis without impoverishing the source materials.

A second suggestion might be to expand the LSA analysis: instead of a single analysis for a universe of documents, it might be possible to break these down into separate analyses. However, this creates the problem of reducing how well LSA covers contexts across a range of documents.


Kintsch, W., and van Dijk, T. (1978) Toward a model of text comprehension and production. Psychological Review, 85 (5), 363-394.

Landauer, T.K., Foltz, P.W., and Laham, D. (1998) Introduction to latent semantic analysis. Discourse Processes, 25, 259-284.

Powers, Burstein, Chodorow, Fowles, and Kukich (2001) Stumping e-rater®: Challenging the validity of automated essay scoring (GRE No. 98-08bP, ETS RR-01-03). Princeton, NJ: Educational Testing Service.
