GraphChi: How a Mac Mini outperformed a 1,636 node Hadoop cluster
Last year GraphChi, a spin-off of GraphLab (a distributed, graph-based, high-performance computation framework), did something remarkable: using a single Mac Mini, it outperformed a 1,636-node Hadoop cluster at processing a 2010 Twitter graph with 1.5 billion edges. The task was triangle counting, and the Hadoop cluster required over 7 hours while ...

ORC: An Intelligent Big Data file format for Hadoop and Hive
RCFile (Record Columnar File), the previous Hadoop Big Data storage format on Hive, is being challenged by the smart ORC (Optimized Row Columnar) format.

Faster Big Data on Hadoop with Hive and RCFile
SQL on Hadoop with Hive makes Big Data accessible, yet performance can be lacking. RCFile (Record Columnar File) is a great optimisation for Big Data with Hive. The previous two posts in this four-part series explained why to use text on the periphery of an ETL process and how to optimise text. The inside of a Hive ...

Optimising Hadoop and Big Data with Text and Hive
Hadoop’s Hive SQL interface reduces costs and gets results fast with Big Data from text. Simple optimisations improve performance significantly. The previous post, Getting Started with Big Data with Text and Apache Hive, described the case for using text format to import and export data for a Hive ETL and reporting process. These ...

Getting Started with Big Data with Text and Apache Hive
Big Data is stored and exchanged as text more often than expected. Apache Hadoop’s Hive SQL interface helps to reduce costs and to get results fast. Often, things have to get done fast rather than perfectly. However, with big data even a small decision like a file format can have a great impact. What are ...

32x Faster Hadoop and MapReduce With Indexing
The simplicity of Hadoop and MapReduce, and especially their lack of indices, significantly limits their performance. I described how MapReduce 2.0 and alternatives that bypass MapReduce will change how Hadoop is used and speed it up in the next year or two. Another approach is to introduce indices for data stored on the Hadoop Distributed File System (HDFS). At its inception, ...

4 Free DIY Twitter Visualisations: The Shahbag Protest
Earlier this year a mass movement occurred in Bangladesh, which received little global news coverage. It was an immensely important event to Bangladeshis at home and abroad. This prompted me to try to illustrate the event with Twitter data myself, using only some free web services and a few hours’ time. Amazingly the results are ...

Big Data Transforms Online Education
Online education’s recent success and scale create unique big data. Together they change online education and enable personalized, adaptive learning. This development challenges traditional education services. An educational paradigm shift: “The End of Higher Education as We Know It” explains how online education will transform higher education. The fact that (prestigious) universities scramble to be ...

Democratize Big Data With Hadoop and Hive
You have started to process data with cloud computing platforms like Amazon Web Services (AWS) Elastic MapReduce (EMR). Now that you use it regularly, other stakeholders are getting curious. You increasingly find yourself firing up an EMR cluster to quickly answer a question or try something out. It may be time to change the way ...

Get Started: Big Data Crunching in the Cloud
Even if your business case is constantly evolving, you will still want to leverage big data, but being tied to a single infrastructure will limit your capital and your options. Big data is not a starting point but a destination for many startups and teams. It becomes a conversation point as a result of the ...