Vector Space Model Using The Information Mapping Project (INFOMAP)

The INFOMAP project is an older but nevertheless interesting introduction to semantic vector space models, and the related software is freely available. It uses a combination of approaches but mostly relies on Schütze's work on automatic word sense discrimination. However, it does not use context vectors and instead concentrates on an SVD-compressed HAL matrix.

In my research I worked with it intensively and found it useful with large corpora. It parses a corpus of one or several documents and generates the word vectors in a HAL matrix, excluding words contained in a stop list. A stop list is a collection of words and single letters that are ambiguous and carry little semantic content, for example I, you, are, a, b, c, d and so on. This allows the use of a simple parser and, despite the lack of stemming, significantly reduces the number of words. Frequency is used to limit the number of columns and rows (both of which can be set by a parameter): only the x most frequent words are used for the columns and the y most frequent ones for the rows. The columns have an additional gap feature: by default the 50 most frequent words are skipped, since, according to INFOMAP, such words are often ambiguous precisely because of their high frequency. The HAL matrix is not a simple count of word co-occurrences; instead, a Term Frequency – Inverse Document Frequency (TF-IDF) measure is used to weight each word, and this value is used when parsing the text and incrementing the matrix. A sketch of this construction follows.
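To make this concrete, here is a minimal Python sketch of how such a weighted co-occurrence matrix could be built. This is not the INFOMAP code; the parameter names, the window size, and the exact form of the TF-IDF weighting are my own illustrative assumptions.

    # A minimal sketch of INFOMAP-style matrix construction (not the original code).
    # ROWS, COLS, GAP and WINDOW are illustrative assumptions, not INFOMAP's names.
    import math
    from collections import Counter

    ROWS, COLS, GAP, WINDOW = 1000, 100, 50, 15    # y rows, x columns, column gap, context size
    STOP = {"i", "you", "are", "a", "b", "c", "d"} # toy stop list

    def build_matrix(documents):
        # Tokenize, drop stop-listed words, and count frequencies over the corpus.
        docs = [[w for w in d.lower().split() if w not in STOP] for d in documents]
        freq = Counter(w for d in docs for w in d)
        ranked = [w for w, _ in freq.most_common()]
        row_words = ranked[:ROWS]                  # the y most frequent words become rows
        col_words = ranked[GAP:GAP + COLS]         # skip the GAP most frequent, then take x columns
        # TF-IDF-style weight: damp words that appear in many documents.
        n_docs = len(docs)
        df = Counter(w for d in docs for w in set(d))
        idf = {w: math.log(n_docs / df[w]) for w in df}
        # For each row word, accumulate weighted counts of nearby column words.
        matrix = {w: dict.fromkeys(col_words, 0.0) for w in row_words}
        for d in docs:
            for i, w in enumerate(d):
                if w not in matrix:
                    continue
                window = d[max(0, i - WINDOW):i] + d[i + 1:i + 1 + WINDOW]
                for c in window:
                    if c in matrix[w]:
                        matrix[w][c] += idf[c]     # increment by the weight, not a raw count
        return row_words, col_words, matrix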

Once the corpus is parsed and the HAL matrix computed, an SVD based on the Lanczos algorithm is performed on it. The resulting left singular matrix U is then truncated to the number of columns pre-set in the parameters (default 100), or fewer if the Lanczos algorithm converges earlier.
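As a stand-in for this step, assuming the matrix built above, one could use SciPy's svds routine, which is likewise an iterative Lanczos-type solver (ARPACK):

    # Sketch of the dimensional reduction, assuming build_matrix() from above.
    import numpy as np
    from scipy.sparse.linalg import svds

    def reduce_dimensions(matrix, row_words, col_words, k=100):
        A = np.array([[matrix[r][c] for c in col_words] for r in row_words])
        k = min(k, min(A.shape) - 1)       # svds requires k < min(rows, cols)
        U, s, Vt = svds(A, k=k)            # truncated SVD; rows of U are the word vectors
        order = np.argsort(s)[::-1]        # svds returns singular values in ascending order
        U, s = U[:, order], s[order]
        return {w: U[i] for i, w in enumerate(row_words)}  # k-dimensional word vectors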

Documents are mapped into the space once the SVD and the dimensional reduction are completed. Each document vector is the sum of the vectors of the words contained in the document. The query engine of the software allows querying for terms or combinations of terms and finds the closest words and documents. A nice extra feature is the implementation of a logical NOT in the query engine. One could, for example, query suit NOT clothes to remove the clothing meaning from the query and focus on alternative meanings such as a lawsuit. This is done by creating the query vector from the first part of the query and making it orthogonal to the second part, the NOT part, of the query. As a result the final query vector is orthogonal (unrelated) to the NOT part but retains all other information of the positive part of the query. This simple but brilliant approach was developed by Dominic Widdows and published in Orthogonal Negation in Vector Spaces for Modelling Word-Meanings and Document Retrieval.
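Both the document mapping and the orthogonal negation fit in a few lines. The sketch below assumes the word vectors computed above; the function names are mine, not INFOMAP's.

    # Document vectors and Widdows-style orthogonal negation (illustrative sketch).
    import numpy as np

    def document_vector(doc_tokens, word_vectors):
        # A document vector is the sum of the vectors of its words.
        vecs = [word_vectors[w] for w in doc_tokens if w in word_vectors]
        return np.sum(vecs, axis=0)

    def negate(positive, negative):
        # Project the NOT direction out of the query vector, so the result
        # is orthogonal to (unrelated to) the negated part of the query.
        n = negative / np.linalg.norm(negative)
        return positive - np.dot(positive, n) * n

    # e.g. for the query "suit NOT clothes":
    # q = negate(word_vectors["suit"], word_vectors["clothes"])
    # then rank words and documents by cosine similarity to q.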

While the results of INFOMAP are good and intuitively right, my research has revealed some shortcomings and subsequently led me to develop my own implementation with several improvements. I will not discuss the details of the problems here, as they are part of my thesis and may be part of publications to come. Nevertheless, INFOMAP is a great starting point for anyone interested in playing around with semantic vector spaces. I certainly gained great insight by using it.