Automatic word sense discrimination

Automatic word sense discrimination was published in 1998 by Hinrich Schütze and can be seen as a further development of the HAL approach. He calls the underlying semantic vector space Word Space, but it rests on the same basic word-by-word matrix of co-occurrences. His aim is to identify Senses in the vector space, which one could imagine as categories or topics. Furthermore, his approach attempts to attribute individual occurrences of ambiguous words to these Senses.

Schütze introduces Context Vectors, which capture second-order co-occurrences, while Word Vectors capture first-order co-occurrences. Word vectors are created in the form of a HAL matrix. A context vector is the sum of the word vectors of the words close to a single occurrence of the word under investigation. As a result, a word vector is a general representation of a word, while a context vector represents the context of a single occurrence of a word. The latter is more focused but only valid for the word in that particular context.
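
The difference between the two kinds of vectors can be sketched in a few lines of Python. This is a minimal illustration and not Schütze's implementation: the function names, the window size of 2 and the toy sentence are my own assumptions, and the counts are flat and symmetric, whereas HAL weights co-occurrences by distance and direction.

```python
from collections import Counter, defaultdict

def word_vectors(tokens, window=2):
    """First-order vectors: co-occurrence counts within +/- window tokens."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors

def context_vector(tokens, position, vectors, window=2):
    """Second-order vector: sum of the word vectors of the neighbours of one occurrence."""
    total = Counter()
    lo, hi = max(0, position - window), min(len(tokens), position + window + 1)
    for j in range(lo, hi):
        if j != position:
            total.update(vectors[tokens[j]])
    return total

# Toy corpus (assumption): one word, two occurrences in different contexts.
tokens = "the judge dismissed the suit while the tailor pressed the suit".split()
vectors = word_vectors(tokens)

# One word vector for "suit" overall, but one context vector per occurrence.
first = context_vector(tokens, tokens.index("suit"), vectors)
```

The word vector `vectors["suit"]` mixes both occurrences, while `first` only reflects the neighbourhood of the first one.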

Context vectors are then clustered to identify areas of meaning, each of which is a collection of context vectors that lie close together. According to Schütze, the centre of such a cluster is a Sense. He illustrates this with an example: the context vectors of suit might be attributed to different senses. If suit appears in a legal context with words like judge and law, its context vector would be attributed to a sense vector (topic) representing legal meanings. Another time the word might be surrounded by words like tailor and shirt, so that the context vector is attributed to a clothing sense.
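
To make the idea of a Sense as a cluster centre concrete, here is a small sketch. It uses a plain k-means-style loop with cosine similarity as a stand-in, which is not necessarily the clustering procedure used in the paper. The four dimensions (judge, law, tailor, shirt), the toy vectors and all function names are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def nearest(vector, senses):
    """Index of the sense vector most similar to the given context vector."""
    return max(range(len(senses)), key=lambda k: cosine(vector, senses[k]))

def discriminate(contexts, seeds, iterations=10):
    """Alternate between assigning contexts to senses and recomputing the centroids."""
    senses = [list(seed) for seed in seeds]
    for _ in range(iterations):
        groups = [[] for _ in senses]
        for vector in contexts:
            groups[nearest(vector, senses)].append(vector)
        # Keep the old sense vector if a cluster ends up empty.
        senses = [centroid(g) if g else s for g, s in zip(groups, senses)]
    return senses

# Toy context vectors over the hypothetical dimensions (judge, law, tailor, shirt).
contexts = [[3, 2, 0, 0], [2, 3, 0, 1], [0, 0, 3, 2], [0, 1, 2, 3]]
senses = discriminate(contexts, seeds=[contexts[0], contexts[2]])
labels = [nearest(vector, senses) for vector in contexts]
```

A new occurrence is attributed to a sense in the same way: compute its context vector and call `nearest` on it.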

To reduce the dimensionality of his matrix and take advantage of its positive characteristics, Schütze employs Singular Value Decomposition (SVD). He assumes it helps to uncover latent meaning. I would tend to attribute SVD’s positive influence to a combination of amplification and noise filtering of the matrix. If I understood his paper correctly, SVD is only applied to the initial word matrix and not to the context matrix. This would make sense, as the latter should be much less sparse than the former.
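
The dimensionality reduction itself can be sketched with NumPy's `np.linalg.svd`. The 4x4 count matrix and the choice of k = 2 are made up for illustration and say nothing about the sizes used in the paper.

```python
import numpy as np

# Hypothetical word-by-word co-occurrence counts
# (rows and columns: judge, law, tailor, shirt).
counts = np.array([
    [0, 4, 0, 1],
    [4, 0, 1, 0],
    [0, 1, 0, 5],
    [1, 0, 5, 0],
], dtype=float)

# Thin SVD: counts = U @ diag(S) @ Vt, singular values in descending order.
U, S, Vt = np.linalg.svd(counts, full_matrices=False)

k = 2  # number of retained dimensions, chosen arbitrarily for this toy example
reduced = U[:, :k] * S[:k]  # each row is now a k-dimensional word vector

# Rank-k reconstruction: the directions with the smallest singular values are dropped.
approx = reduced @ Vt[:k, :]
```

The rows of `reduced` would then take the place of the original word vectors when context vectors are summed up.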

To test his work, Schütze uses pseudo-words. To construct them he picks two words (or rather their word vectors) which have very little in common, e.g. door and banana, and conflates them into one pseudo-word. Once the text is parsed and the contexts are clustered, he can easily check whether a context of the pseudo-word originally belonged to door or banana and therefore whether it was attributed to the right sense. His results show that these artificially ambiguous words are discriminated more accurately than naturally ambiguous words like suit. Furthermore, abstract senses, as encountered in words like space or in pairs like wide range, are harder to attribute. This is likely because such words appear in more contexts than other words and are therefore more ambiguous.
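
The pseudo-word evaluation is simple to reproduce in principle. The sketch below shows the conflation step and the scoring step. The function names, the underscore-joined token and the toy sentence are my own assumptions, and the predictions are made up to show how scoring works rather than produced by a real clustering run.

```python
def make_pseudoword(tokens, first, second):
    """Replace both words with one artificial token and record the original word per position."""
    pseudo = f"{first}_{second}"
    conflated, gold = [], {}
    for i, token in enumerate(tokens):
        if token in (first, second):
            gold[i] = token  # the hidden "true sense" of this occurrence
            conflated.append(pseudo)
        else:
            conflated.append(token)
    return conflated, gold

def accuracy(predicted, gold):
    """Share of pseudo-word occurrences attributed to the word that originally stood there."""
    correct = sum(predicted[i] == word for i, word in gold.items())
    return correct / len(gold)

tokens = "open the door and eat a banana then close the door".split()
conflated, gold = make_pseudoword(tokens, "door", "banana")

# Hypothetical output of a discrimination step that labels every occurrence as door.
predicted = {2: "door", 6: "door", 10: "door"}
score = accuracy(predicted, gold)
```

Because the original word at each position is known, no manual sense annotation is needed, which is what makes pseudo-words attractive as a test.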