Stop words, or stopwords, are used in Natural Language Processing (NLP) to eliminate words that carry little content or relevant semantics. Search engines use stop words to improve search queries. Google’s FAQ gives a short explanation here. A stop word list consists mostly of basic combinations of letters and numbers as well as pronouns, adverbs, prepositions, some verbs, adjectives and conjunctions.
For example, the sentence “The government did not introduce the tax bill” could be represented by “S government S S introduce S tax bill”, with ‘S’ standing for a stop word. As a result, the amount of data to be processed is reduced by simply matching and then removing or replacing stop words, with little or no impact on the information contained. Several stop word lists are freely available. (Continued)
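The matching and replacing step can be sketched in a few lines of Python. This is only an illustration: `STOP_WORDS` is a hypothetical, deliberately tiny list chosen to cover the example sentence, and the function names are my own; a real list would be far longer.

```python
# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"the", "did", "not", "a", "an", "of", "and"}


def mask_stop_words(sentence, placeholder="S"):
    """Replace every stop word in the sentence with a placeholder token."""
    tokens = sentence.split()
    return " ".join(placeholder if t.lower() in STOP_WORDS else t for t in tokens)


def remove_stop_words(sentence):
    """Drop every stop word from the sentence."""
    return " ".join(t for t in sentence.split() if t.lower() not in STOP_WORDS)


sentence = "The government did not introduce the tax bill"
print(mask_stop_words(sentence))    # S government S S introduce S tax bill
print(remove_stop_words(sentence))  # government introduce tax bill
```

Splitting on whitespace is the simplest possible tokenisation; punctuation attached to words would need extra handling.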
If you are a student, financially constrained or just interested in programming a little, there is a way to program well and for free at the same time! First of all, let’s assume you are a regular guy with regular needs, so you will likely be using Microsoft Windows and looking to program for Windows, possibly in C#. If you prefer Linux or other systems and want to program in something more ‘outlandish’, e.g. your own assembler language, then there is surely help out there, but I am not addressing it here (yet). (Continued)
Thursday, February 28, 2008
![The Voronoy Tessellation of a random set of points in the plane (all points lie within the image). [Source: http://en.wikipedia.org/wiki/Image:Coloured_Voronoi_2D.png, GNU Free Documentation license]](http://www.semantikoz.com/wp-content/uploads/2008/02/coloured_voronoi_2d.thumbnail.png)
The Voronoy (or Voronoi) Tessellation (Voronoy 1908) is a technique that divides a multi-dimensional space into subspaces. It works by defining several vectors as the centres of subspaces. Any other vector in the space can then be attributed to its closest centre vector, which effectively divides the whole space into subspaces. This makes it an excellent choice for dividing semantic vector spaces.
(Continued)
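Attributing each vector to its closest centre can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and made-up 2D points; the function name `nearest_centre` and the sample data are hypothetical.

```python
import math


def nearest_centre(vector, centres):
    """Return the index of the centre closest to the vector (Euclidean distance)."""
    return min(range(len(centres)), key=lambda i: math.dist(vector, centres[i]))


# Hypothetical centre vectors and points in the plane.
centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(1.0, 2.0), (9.0, 1.0), (2.0, 8.0), (6.0, 1.0)]

# Each cell collects the points that fall into the subspace of one centre.
cells = {i: [] for i in range(len(centres))}
for p in points:
    cells[nearest_centre(p, centres)].append(p)

for i, members in cells.items():
    print(centres[i], members)
```

The same function works unchanged for vectors of any dimension, as long as all vectors have the same length. For semantic vector spaces a cosine-based distance is a common alternative to the Euclidean one used here.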
Tuesday, February 26, 2008
The INFOMAP project is an older but nevertheless interesting introduction to semantic vector space models. The related software is freely available. It uses a combination of approaches but mostly relies on Schütze’s Automatic word sense discrimination work. However, it does not use context vectors and concentrates on an SVD-compressed HAL matrix. (Continued)
Tuesday, February 26, 2008
Automatic word sense discrimination was published in 1998 by Hinrich Schütze and can be seen as a further development of the HAL approach. He calls the underlying semantic vector space Word Space, but it relies on the same basic word-by-word matrix of word co-occurrences. His aim is to identify Senses in the vector space, which one could imagine as categories or topics. Furthermore, his approach attempts to attribute occurrences of ambiguous words to Senses. (Continued)
Tuesday, February 26, 2008
Apperceptual comments on an interesting problem in one of his blog posts. He discusses the importance of high-order co-occurrences for word similarity measures in LSA. The part that interested me was the discussion of Singular Value Decomposition (SVD). My gut feeling has always been that SVD’s most useful characteristic is that it amplifies the information content and reduces noise. An interesting question that comes to mind is how to measure such an improvement. A dimensional reduction (or, for that matter, any noise reduction) is only useful when applied appropriately; otherwise it falls short of its potential or, worse, reduces the (useful) information content. To test this, run a semantic vector space with increasingly harsh dimensional reduction. The vectors start focusing, then clumping, until the reduction is too high and they collapse onto a handful of dimensions. (Continued)
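One rough way to watch the effect of increasingly harsh reduction is to truncate the SVD at different ranks and track the reconstruction error. The sketch below assumes NumPy and uses made-up toy data (a rank-3 "signal" plus small random "noise"); the function names are my own. Reconstruction error is only a proxy and says nothing directly about semantic quality.

```python
import numpy as np


def truncate_svd(matrix, k):
    """Rank-k approximation of the matrix, keeping the k largest singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]


def relative_error(matrix, k):
    """Frobenius-norm error of the rank-k approximation, relative to the matrix."""
    approx = truncate_svd(matrix, k)
    return float(np.linalg.norm(matrix - approx) / np.linalg.norm(matrix))


# Hypothetical toy data: a rank-3 "signal" plus a little random "noise".
rng = np.random.default_rng(0)
signal = rng.random((20, 3)) @ rng.random((3, 20))
noisy = signal + 0.01 * rng.standard_normal((20, 20))

# Increasingly harsh reduction: the error can only grow as k shrinks.
for k in (20, 10, 5, 3, 2, 1):
    print(k, relative_error(noisy, k))
```

On data like this one would expect the error to stay small down to the rank of the underlying signal and to rise sharply below it, which is the point where the reduction starts to remove useful information.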
Monday, February 25, 2008
The Hyperspace Analogue to Language (HAL), also known as semantic memory, was developed by Kevin Lund and Curt Burgess of the University of California, Riverside. You can download the corresponding paper, Producing high-dimensional semantic spaces from lexical co-occurrence, in PDF format.
The basic premise of the work is that words with similar meaning repeatedly occur close to each other (also known as co-occurrence). For example, in a large corpus of text one could expect to see the words mountain, valley and river often appear close to each other. The same might be true for mouse, cat and dog. (Continued)
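Counting co-occurrences within a sliding window can be sketched as below. This is a simplified illustration, not the weighting scheme from the paper: it counts unordered pairs with equal weight, and the window size, function name and sample text are all assumptions of mine.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=3):
    """Count unordered word pairs that appear within `window` tokens of each other."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look only forward so that each pair of positions is counted once.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts


# Hypothetical toy text; a real corpus would be many orders of magnitude larger.
text = ("the river runs through the valley below the mountain "
        "and the river feeds the valley")
counts = cooccurrence_counts(text.split(), window=3)

for pair in [("river", "valley"), ("mountain", "valley"), ("mountain", "river")]:
    print(pair, counts[pair])
```

Pairs are stored in sorted order, so look them up alphabetically, e.g. `("mountain", "valley")`. Each row of the resulting word-by-word table can then serve as the vector for that word.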
Sunday, February 24, 2008