Free Stop Word Lists in 23 Languages

by Christian on April 2nd, 2008

Stop words (or stopwords) are used in Natural Language Processing (NLP) to filter out very frequent words that carry little or no information for discriminating the text they occur in. Search engines, for example, use stop words to improve search queries. Google’s FAQ gives a short explanation here [link not online anymore]. A stop word list consists mostly of some basic combinations of letters and numbers as well as pronouns, adverbs, prepositions, some verbs, adjectives and conjunctions.

For example, the sentence “The government did not introduce the tax bill” could be represented by “S government S S introduce S tax bill”, with ‘S’ standing for a stop word. As a result, the amount of data to be processed is reduced by simply matching and then removing or replacing stop words, with little or no impact on the information contained. Several lists are freely available.
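
The matching and replacing described above can be sketched in a few lines of Python. This is only a minimal illustration: the three-word `STOP_WORDS` set and the function names `mask_stop_words` and `remove_stop_words` are made up for this example, and a real application would load one of the full lists linked below instead.

```python
import re

# Tiny illustrative stop word set (an assumption for this example only);
# in practice, load a complete list for the language you are processing.
STOP_WORDS = {"the", "did", "not"}


def tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation."""
    return re.findall(r"\w+", sentence)


def mask_stop_words(sentence, stop_words=STOP_WORDS, placeholder="S"):
    """Replace every stop word with a placeholder, ignoring case."""
    return " ".join(
        placeholder if token.lower() in stop_words else token
        for token in tokenize(sentence)
    )


def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Drop every stop word, ignoring case."""
    return " ".join(
        token for token in tokenize(sentence)
        if token.lower() not in stop_words
    )


sentence = "The government did not introduce the tax bill"
print(mask_stop_words(sentence))    # S government S S introduce S tax bill
print(remove_stop_words(sentence))  # government introduce tax bill
```

Keeping the stop words in a set makes each lookup cheap, and lowercasing the token before the lookup lets “The” and “the” match the same entry.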

Find Catalan, Czech, Danish, Dutch, French, English, German, Hungarian, Italian, Norwegian, Polish, Portuguese, Spanish and Turkish stop word lists at Ranks.nl.

Arabic, Bulgarian, Czech, French, English, Finnish, German, Hungarian, Italian, Romanian, Russian, Spanish, Swedish, Polish and Portuguese stop word lists are available from Jacques Savoy’s page.

The Snowball project offers English, French, Spanish, German, Portuguese, Italian, Dutch, Swedish, Norwegian, Danish, Russian, Finnish and Hungarian stop lists. Because it is part of a stemmer project, the lists are not in one place and have to be downloaded from each language’s page.


5 Comments
  1. David Novakovic permalink

    Christian, you don’t happen to be Peter’s other student do you?

  2. Need the list of stopwords for the 24 languages.
    Especially interested in Search Engines behavior on these.
    Thanks

  3. Ralf, the links to the lists are in the post :)
