
Free Stop Word Lists in 23 Languages

Stop words, or stopwords, are used in Natural Language Processing (NLP) to eliminate words that carry no content or relevant semantics. Search engines use stop words to improve search queries. Google’s FAQ gives a short explanation here. A stop word list consists mostly of basic combinations of letters and numbers as well as pronouns, adverbs, prepositions, some verbs, adjectives and conjunctions.

For example, the sentence “The government did not introduce the tax bill” could be represented by “S government S S introduce S tax bill”, with ‘S’ standing for a stop word. As a result, the amount of data that has to be processed is reduced by simply matching and removing or replacing stop words, with little or no impact on the information contained. There are several lists freely available. (Continued)
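The replacement described above can be sketched in a few lines of Python. This is only an illustration: the tiny `STOP_WORDS` set and the function names are assumptions made for this example and do not come from any of the published lists.

```python
# Hypothetical, deliberately tiny stop word list; real lists are far longer.
STOP_WORDS = {"the", "did", "not", "a", "an", "of", "and"}


def mask_stop_words(sentence, stop_words=STOP_WORDS, placeholder="S"):
    """Replace every stop word with a placeholder token."""
    tokens = sentence.split()
    return " ".join(placeholder if t.lower() in stop_words else t for t in tokens)


def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Drop every stop word and keep the remaining tokens in order."""
    return " ".join(t for t in sentence.split() if t.lower() not in stop_words)


example = "The government did not introduce the tax bill"
masked = mask_stop_words(example)     # "S government S S introduce S tax bill"
reduced = remove_stop_words(example)  # "government introduce tax bill"
print(masked)
print(reduced)
```

Matching is done on lower-cased tokens split on whitespace, which is enough for this sentence; real text would also need punctuation handling.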

Quick, easy, professional – Program for free under Microsoft Windows

If you are a student, financially constrained or just interested in programming a little, there is a way to program for free and well at the same time! First of all, let’s assume you are a common guy with common needs, so you will likely be using Microsoft Windows and looking to program for Windows, possibly using C#. If you prefer Linux or other systems and want to program in something more ‘outlandish’, e.g. your own assembler language, then surely there is help out there, but I am not addressing it here (yet). (Continued)

Voronoi/Voronoy Tessellation

The Voronoy Tessellation of a random set of points in the plane (all points lie within the image). [Source: http://en.wikipedia.org/wiki/Image:Coloured_Voronoi_2D.png, GNU Free Documentation license]

The Voronoy (or Voronoi) Tessellation (Voronoy 1908) is a technique that divides a multi-dimensional space into subspaces. It defines several vectors as centres of subspaces. Any other vector in the space can then be attributed to the closest centre vector, which effectively divides the whole space into subspaces. This makes it an excellent choice for dividing semantic vector spaces.
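As a minimal sketch of the attribution step, the following Python assigns each vector to its closest centre. Euclidean distance, the function names and the sample points are assumptions made for this illustration; a semantic vector space might well use a different distance measure.

```python
import math


def nearest_centre(vector, centres):
    """Return the index of the centre closest to the vector (Euclidean distance)."""
    return min(range(len(centres)), key=lambda i: math.dist(vector, centres[i]))


def tessellate(vectors, centres):
    """Group vectors by the index of their closest centre."""
    cells = {i: [] for i in range(len(centres))}
    for v in vectors:
        cells[nearest_centre(v, centres)].append(v)
    return cells


# Made-up 2D data; the same code works for any number of dimensions.
centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(1.0, 2.0), (9.0, 1.0), (2.0, 8.0), (6.0, 1.0)]
cells = tessellate(points, centres)
for index, members in cells.items():
    print(centres[index], members)
```

Each cell collected this way corresponds to one subspace of the tessellation. A vector that is equally close to two centres goes to the one listed first.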

(Continued)

Information Mapping Project (INFOMAP)

The INFOMAP project is an older but nevertheless interesting introduction to semantic vector space models. The related software is freely available. It uses a combination of approaches but mostly relies on Schütze’s Automatic word sense discrimination work. However, it does not use context vectors and concentrates on an SVD-compressed HAL matrix. (Continued)

Automatic word sense discrimination

Automatic word sense discrimination was published in 1998 by Hinrich Schütze and can be seen as a further development of the HAL approach. He calls the underlying semantic vector space Word Space, but it relates to the same basic word-by-word matrix of co-occurrences. His aim is to identify Senses in the vector space, which one could imagine to be categories or topics. Furthermore, his approach attempts to attribute occurrences of ambiguous words to Senses. (Continued)

The Mystery of Singular Value Decomposition

Apperceptual comments on an interesting problem in one of his blog posts. He discusses the importance of high-order co-occurrences for word similarity measures in LSA. The part that interested me was the discussion of Singular Value Decomposition (SVD). My gut feeling has always been that SVD’s most useful characteristic is that it amplifies the information content and reduces noise. Certainly an interesting question that comes to mind is how to measure such an improvement. A dimensional reduction (or for that matter any noise reduction) is only useful when applied appropriately; otherwise it falls short of its ability or, worse, reduces the (useful) information content. To test this, run a semantic vector space with increasingly harsh dimensional reduction. The vectors start focusing, then clumping, until the reduction is too high and they collapse onto a handful of dimensions. (Continued)
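The experiment can be sketched roughly as follows, assuming NumPy is available. The toy co-occurrence matrix is invented, and using the mean pairwise cosine similarity as a measure of "clumping" is my own assumption for this illustration, not an established measure of the improvement.

```python
import numpy as np


def reduce_rows(matrix, k):
    """Project the row vectors onto the k strongest singular dimensions."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of row vectors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    upper = np.triu_indices(len(vectors), k=1)
    return float(sims[upper].mean())


# Invented word-by-word co-occurrence counts for four words.
matrix = np.array([
    [4.0, 3.0, 1.0, 0.0],
    [3.0, 5.0, 0.0, 1.0],
    [1.0, 0.0, 4.0, 3.0],
    [0.0, 1.0, 3.0, 5.0],
])

# Increasingly harsh reduction: keep 4, 3, 2 and finally 1 dimension.
results = {k: mean_pairwise_cosine(reduce_rows(matrix, k)) for k in (4, 3, 2, 1)}
for k, similarity in results.items():
    print(k, round(similarity, 3))
```

With all four dimensions kept, the similarities are the same as in the original matrix. With a single dimension left, every vector lies on the same line, which is the collapse in its most extreme form.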

Hyperspace Analogue to Language (HAL) Introduction

HAL, also known as semantic memory, was developed by Kevin Lund and Curt Burgess from the University of California, Riverside. You can download the corresponding paper, Producing high-dimensional semantic spaces from lexical co-occurrence, in PDF format.

The basic premise the work relies on is that words with similar meaning repeatedly occur closely (also known as co-occurrence). As an example in a large corpus of text one could expect to see the words mountain, valley and river appear often close to each other. The same might be true for mouse, cat and dog. (Continued)
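To make the idea of co-occurrence concrete, here is a minimal sketch that counts how often two words appear within a small window of each other. The window size, the unweighted and order-insensitive counting, and the toy token list are assumptions made for this illustration; the weighting used in the paper itself is not reproduced here.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look ahead only, so each pair of positions is counted once.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            if word != tokens[j]:
                counts[tuple(sorted((word, tokens[j])))] += 1
    return counts


# Toy "corpus"; a real one would contain millions of tokens.
tokens = "mountain valley river mountain river cat dog mouse cat".split()
counts = cooccurrence_counts(tokens, window=2)
for pair, count in counts.most_common(3):
    print(pair, count)
```

Words from the same group, such as mountain and river, end up with higher counts than pairs that span the two groups, which is the pattern the premise predicts for a large corpus.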

Welcome

Thanks for dropping by and reading this blog. Are you wondering what it is all about and why you should read it and not one of the other millions of blogs? Well that makes two of us. I do not plan to waste your and my time with random ramblings but rather will write about what I work on and what interests me. That might be anything from semantic research to travelling. If you are still reading then you should be interested enough to check out one of my posts before you head off again.
