Free Stop Word Lists in 23 Languages
Stop words (or stopwords) are very frequent words that Natural Language Processing (NLP) systems filter out because they carry little or no information that helps discriminate the text they occur in. Search engines, for example, use stop word lists to improve the handling of search queries. Google’s FAQ gives a short explanation here [link not online anymore]. A stop word list consists mostly of pronouns, adverbs, prepositions, conjunctions, and some verbs and adjectives, along with some basic combinations of letters and numbers.
For example, the sentence “The government did not introduce the tax bill” could be represented by “S government S S introduce S tax bill”, with ‘S’ standing for a stop word. As a result, the amount of data that has to be processed is reduced by simply matching and then removing or replacing stop words, with little or no impact on the information contained. There are several lists freely available.
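The matching and replacing step described above can be sketched in a few lines of Python. This is only an illustration: the tiny STOP_WORDS set and the function names mask_stop_words and remove_stop_words are assumptions made up for this example, not taken from any of the lists mentioned here, and a real application would load one of the full lists instead.

```python
import re

# Tiny illustrative stop word set (an assumption for this sketch,
# not one of the published lists).
STOP_WORDS = {"the", "did", "not", "a", "an", "of", "to"}


def mask_stop_words(text, stop_words=STOP_WORDS, marker="S"):
    """Replace every stop word in text with a marker token."""
    tokens = re.findall(r"\w+", text)  # split into word tokens
    return " ".join(marker if t.lower() in stop_words else t for t in tokens)


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every stop word from text."""
    tokens = re.findall(r"\w+", text)
    return " ".join(t for t in tokens if t.lower() not in stop_words)


sentence = "The government did not introduce the tax bill"
print(mask_stop_words(sentence))    # S government S S introduce S tax bill
print(remove_stop_words(sentence))  # government introduce tax bill
```

Matching is done on lowercased tokens so that “The” and “the” are treated alike; whether that is appropriate depends on the language and the list used.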
Find Catalan, Czech, Danish, Dutch, French, English, German, Hungarian, Italian, Norwegian, Polish, Portuguese, Spanish, and Turkish stop word lists at Ranks.nl.
Arabic, Bulgarian, Czech, French, English, Finnish, German, Hungarian, Italian, Romanian, Russian, Spanish, Swedish, Polish, and Portuguese stop word lists are available from Jacques Savoy’s page.
The Snowball project offers English, French, Spanish, German, Portuguese, Italian, Dutch, Swedish, Norwegian, Danish, Russian, Finnish, and Hungarian stop lists. As it is part of a stemmer project, the lists are not in one place and have to be downloaded separately from each language page.
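Once a list has been downloaded, it has to be turned into a set of words before it can be used for matching. The sketch below assumes a plain-text file with whitespace-separated words and comments introduced by a ‘|’ character, which is how the Snowball stop lists appeared to be laid out; check the file you actually downloaded, since other sources may use a different layout. The function name parse_stop_list and the sample text are hypothetical.

```python
def parse_stop_list(text, comment_char="|"):
    """Parse a stop list with optional trailing comments into a set of words.

    Assumes words are separated by whitespace and that everything after
    comment_char on a line is a comment (an assumption about the file layout).
    """
    words = set()
    for line in text.splitlines():
        entry = line.split(comment_char, 1)[0]  # strip the comment part
        words.update(entry.lower().split())     # keep any words left over
    return words


# Made-up excerpt in the assumed layout, for illustration only.
sample = """
 | An illustrative stop word list
i       | a pronoun
me
my
the
"""

print(sorted(parse_stop_list(sample)))  # ['i', 'me', 'my', 'the']
```

Storing the words in a set keeps each lookup cheap, which matters when every token of a large text collection is checked against the list.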