Shell One Liner: Sum file sizes

Ever wanted to find some files in a directory or by pattern and sum up their file size?
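The post itself is about a shell one-liner, which is not reproduced in this excerpt. As a rough illustration of the same idea, here is a minimal Python sketch. The function name `sum_file_sizes` and its parameters are my own assumptions, not something taken from the post.

```python
from pathlib import Path


def sum_file_sizes(directory, pattern="*"):
    """Return the total size in bytes of regular files below `directory`
    whose names match the glob `pattern` (searched recursively)."""
    root = Path(directory)
    # rglob walks all subdirectories; is_file() skips directories.
    return sum(p.stat().st_size for p in root.rglob(pattern) if p.is_file())
```

For example, `sum_file_sizes("/var/log", "*.log")` would add up the sizes of all `.log` files below `/var/log` that the current user is allowed to read.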

Hint: Hive 0.8.1.1 on AWS – avoid a regression bug

Promiscuous Pairing: Do it often, do it fast and learn from it

This past week at Rangespan we started a new pairing model based on the experience at Silver Platter described in the research paper ‘Promiscuous Pairing and Beginner’s Mind’. Let me first introduce the traditional flow model and compare it to promiscuous pairing before I give some feedback from my experience. The paper provides compelling insight into an alternative mode of ...

Big Data at Mendeley

Big Data at Mendeley is about similarity measures and comparing documents, groups, and users for search, deduplication, recommendations and classification.

Free Stop Word Lists in 23 Languages

Stop words or stopwords are used in Natural Language Processing (NLP) to eliminate very frequent words that contain little or no information to help discriminate the text they occur in. Search engines, for example, use stop words to improve search queries. Google’s FAQ gives a short explanation here [link not online anymore]. A stop ...
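To make the idea concrete, here is a minimal Python sketch of stop word filtering. The tiny `STOP_WORDS` set and the function name `remove_stop_words` are illustrative assumptions only; a real list, such as the per-language ones this post offers, would be much longer.

```python
# Tiny illustrative stop word list (an assumption, not one of the post's lists).
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lower-case the text, split on whitespace and drop the stop words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in stop_words]
```

With this list, `remove_stop_words("The history of the Roman Empire")` keeps only `history`, `roman` and `empire`. Splitting on whitespace is a deliberate simplification; punctuation handling is left out.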

Quick, easy, professional – Program for free under Microsoft Windows

If you are a student, financially constrained or just interested in programming a little, there is a way to program for free and well at the same time! First of all, let’s assume you are a common guy with common needs, so you will likely be using Microsoft Windows, looking to program for Windows possibly ...

Voronoi Tessellation

The Voronoi Tessellation (or Voronoy Tessellation) by Georgy Feodosevich Voronoy/Вороной Георгий Феодосьевич (1908) is a technique that enables the division of multi-dimensional spaces into subspaces. Its application defines geometric areas equivalent to subspaces by defining several vectors as centres of subspaces. Any other vector in space can then be attributed to the closest centre ...
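The assignment step described here, attributing a vector to its closest centre, can be sketched in a few lines of Python. I assume Euclidean distance, and the function name `nearest_centre` is my own; the excerpt does not fix either choice.

```python
import math


def nearest_centre(vector, centres):
    """Return the index of the centre closest to `vector`.

    Euclidean distance is an assumption; ties go to the lowest index.
    """
    return min(range(len(centres)), key=lambda i: math.dist(vector, centres[i]))
```

For the centres `[(0, 0), (10, 10)]`, the vector `(1, 2)` falls into the cell of the first centre and `(9, 7)` into the cell of the second. `math.dist` requires Python 3.8 or later.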

Vector Space Model Using The Information Mapping Project (INFOMAP)

The INFOMAP project is an older but nevertheless interesting introduction to semantic vector space models. The related software is freely available. It uses a combination of approaches but mostly relies on Schütze’s Automatic word sense discrimination work. However, it does not use context vectors and concentrates on an SVD-compressed HAL matrix.

Automatic word sense discrimination

Automatic word sense discrimination was published in 1998 by Hinrich Schütze and can be seen as a further development of the HAL approach. He calls the underlying semantic vector space Word Space, but it relates to the same basic word-by-word matrix of co-occurrences. His aim is to identify Senses ...
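A word-by-word co-occurrence matrix of this kind can be sketched in Python as a sparse counter. This is only a generic illustration: the symmetric window, its size and the function name `cooccurrence_counts` are my assumptions, not the exact setup used by Schütze or HAL.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered word pair occurs within `window`
    positions of each other (symmetric window, an assumption)."""
    counts = Counter()
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word does not co-occur with its own position
                counts[(word, tokens[j])] += 1
    return counts
```

Each key `(w1, w2)` corresponds to one cell of the matrix, so the row for a word is the set of entries whose first element is that word.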

The ‘Mystery’ of Singular Value Decomposition

Apperceptual comments on an interesting problem in one of his blog posts [not online anymore]. He discusses the importance of high-order co-occurrences for word similarity measures in LSA. The part that interested me was the discussion of Singular Value Decomposition (SVD). My interpretation has always been that SVD’s most useful characteristic was to ...