Big Data at Mendeley is about similarity measures: comparing documents, groups, and users for search, deduplication, recommendation, and classification. We work with an ever-increasing document collection, currently on the order of 100,000,000 documents. Besides the documents themselves, we process the associated PDFs, extracted and user-generated metadata, user information, user libraries, and groups. Together, this data set and its applications at Mendeley are large and complex. Scale is an obvious challenge, but closer inspection reveals another one at the core: in almost every feature and product, internal and client-facing, we have to compare data items. This basic operation is challenging even without the scale, because the data set is noisy, with different types of information coming from users, metadata extraction, and partner archives. In short, we have to compare items in a huge set both efficiently and effectively. This is a common theme at the heart of big data, experienced in variations by many companies: like Mendeley, most if not all real-world big data services face some kind of noise in their data and rely extensively on comparisons in their algorithms and products.
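To make the scale concrete, a back-of-the-envelope calculation (illustrative numbers only, not Mendeley measurements) shows why naive all-pairs comparison is off the table at this collection size:

```python
# Hypothetical numbers for illustration; the collection size is the
# order of magnitude mentioned in the text, the throughput is invented.
n = 100_000_000  # documents

# Naive all-pairs comparison: n * (n - 1) / 2 pairs.
pairs = n * (n - 1) // 2
print(f"{pairs:.2e} pairs")  # 5.00e+15 pairs

# Even at an (assumed) 10 million comparisons per second, that is
# roughly 16 years of sequential work.
seconds = pairs / 10_000_000
years = seconds / (3600 * 24 * 365)
print(f"{years:.1f} years")
```

Any workable system therefore has to avoid exhaustive pairwise comparison, e.g. via indexing or approximate sketching.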
Three main classes of comparison come to mind in our context:
- Search – comparing patterns and frequencies within and across items, e.g. a text query against documents.
- Recommendation – comparing items based on their occurrence, e.g. collaborative filtering of co-occurring items.
- Classification/clustering – comparing items and groups of items based on their features, e.g. clustering and merging (near-)duplicate items.
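All three classes can be framed as similarity computations. The sketch below illustrates one textbook measure per class on invented toy data (cosine similarity over term frequencies for search, co-occurrence counts over user libraries for recommendation, Jaccard overlap of feature sets for deduplication); none of this is Mendeley's actual implementation.

```python
from collections import Counter
from math import sqrt

# Invented toy data for illustration only.
doc_a = "similarity measures for document deduplication".split()
doc_b = "measures of document similarity".split()

# Search-style: compare term-frequency patterns via cosine similarity.
def cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm

# Recommendation-style: compare two items by how often they co-occur
# in the same (hypothetical) user library.
libraries = {"u1": {"d1", "d2"}, "u2": {"d1", "d3"}, "u3": {"d1", "d2"}}
def cooccurrence(x, y):
    return sum(1 for lib in libraries.values() if x in lib and y in lib)

# Clustering/dedup-style: Jaccard overlap of feature sets.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(cosine(doc_a, doc_b))       # ≈ 0.67
print(cooccurrence("d1", "d2"))   # 2
print(jaccard(doc_a, doc_b))      # 0.5
```

The point of the toy example is that the three classes differ mainly in *what* is compared (patterns, co-occurrences, features), not in the basic operation of comparison itself.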
Products exist for each of these classes of comparison, e.g. Lucene or Solr. The problem is that such products specialise in a single use case: search, in the case of Lucene and Solr. This specialisation typically focuses on one aspect, or a small subset of aspects, of the available information, e.g. patterns or relationships. Some of the data is utilised only poorly or not at all, and comparison across types is often hard or impossible. Moreover, we have to perform similar comparisons internally in many situations, and deploying a specialised product for each is not always sensible. Where we do use existing technologies and algorithms, we are limited by their abilities and insight (or lack thereof).
We pose these challenges:
- To unify the data comparison classes in one system, extracting value from the full data set (patterns, frequencies, relationships, co-occurrence, …) and exposing it transparently to different services (search, recommendation, de-duplication) according to their needs.
- To scale it for Mendeley (to 10^8 and beyond).
- To be as effective or even better than dedicated products.
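One family of techniques from the research literature that speaks to both the unification and the scaling challenge is sketching: each item is reduced to a small fixed-size signature that supports fast approximate comparison across the whole collection. The MinHash sketch below is a minimal illustration of the idea, not the TEAM project's design; the hashing scheme and parameters are invented for the example.

```python
import hashlib

NUM_HASHES = 64  # signature length; a real system would tune this

def minhash(tokens, num_hashes=NUM_HASHES):
    """MinHash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value over the item's token set."""
    sig = []
    for seed in range(num_hashes):
        key = seed.to_bytes(4, "big")  # seed the hash via blake2b's key
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode(), digest_size=8, key=key).digest(),
                "big")
            for t in tokens))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions is an unbiased
    estimate of the Jaccard similarity of the underlying sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = {"alpha", "beta", "gamma", "delta"}
b = {"alpha", "beta", "gamma", "epsilon"}  # true Jaccard = 3/5
sig_a, sig_b = minhash(a), minhash(b)
print(estimated_jaccard(sig_a, sig_b))
```

Because signatures have constant size regardless of item size, they can be bucketed (e.g. with locality-sensitive hashing) so that only likely-similar items are ever compared, which is what makes 10^8 items tractable.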
We will address these challenges as part of the TEAM project, applying and extending state-of-the-art research. The outcome will (a) extend knowledge in the form of peer-reviewed research publications, and (b) result in a real-world, working system at Mendeley.