ORC: An Intelligent Big Data file format for Hadoop and Hive
RCFile (Record Columnar File), the previous Hadoop Big Data storage format on Hive, is being challenged by the smart ORC (Optimized Row Columnar) format. My first post on the topic, Getting Started with Big Data with Text and Apache Hive, presented a common scenario to illustrate why Hive file formats are significant to its performance and ... read more →

Faster Big Data on Hadoop with Hive and RCFile
SQL on Hadoop with Hive makes Big Data accessible. Yet performance can be lacking. RCFile (Record Columnar File) is a great optimisation for Big Data with Hive. The previous two posts in this four-part series explained why to use text on the periphery of an ETL process and how to optimise text. The inside of a Hive ... read more →

Getting Started with Big Data with Text and Apache Hive
Big Data is stored and exchanged as text more often than expected. Apache Hadoop’s Hive SQL interface helps to reduce costs and to get results fast. Often, things have to get done fast rather than perfectly. However, with big data even a small decision like a file format can have a great impact. What are ... read more →

32x Faster Hadoop and MapReduce With Indexing
The simplicity of Hadoop and MapReduce, and especially the lack of indices, significantly limits their performance. I described how MapReduce 2.0 and alternatives bypassing MapReduce will change Hadoop’s application and speed it up in the next year or two. Another approach is the introduction of indices to data stored on the Hadoop Distributed File System (HDFS). At its inception, ... read more →

Online Education Revolution
For hundreds of years, brick-and-mortar universities were at the unchallenged pinnacle of education. In the last decades, remote and online education appeared, with Massive Open Online Courses (MOOCs) being the latest incarnation. At first it was little more than traditional education channelled through alternative media, with limited success and appeal. This is changing fast and ... read more →

Hadoop 2.0: Beyond MapReduce with YARN, Drill, Tez
Hadoop 1.0 is increasingly challenged as slow and limited in its application, now that the hype is dying down. Marketing departments, riding the Big Data wave, wildly exaggerated Hadoop’s ability. Hadoop 2.0, surprisingly, is about to prove them somewhat right with two major developments. read more →

4 Free DIY Twitter Visualisations: The Shahbag Protest
Earlier this year a mass movement occurred in Bangladesh, which received little global news coverage. It was an immensely important event to Bangladeshis at home and abroad. This prompted me to try and illustrate the event with Twitter data myself, merely utilizing some free web services and a few hours’ time. Amazingly, the results are ... read more →

Bangladesh Budgets 2004-2013: A Decade of Government Spending
Every Bangladeshi wants to know if her country is becoming wealthier, if her government spends within the country’s means, and if her money is spent wisely. The 2013 budget for Bangladesh was released on Thursday and it holds some answers to these questions. However, the budget is part accounting and part politics, and is ... read more →

Big Data Transforms Online Education
Online education’s recent success and scale create unique big data. Together they change online education and enable personalized, adaptive learning. This development challenges traditional education services. An educational paradigm shift: The End of Higher Education as We Know It explains how online education will transform higher education. The fact that (prestigious) universities scramble to be ... read more →

Hadoop cluster cost of Amazon EC2 vs EMR
What is the price of a small Elastic MapReduce (EMR) vs an EC2 Hadoop cluster? This article explores the price tag of switching to a small, permanent EC2 Cloudera cluster from AWS EMR. Cloud computing with Hadoop, maybe using AWS EMR or EC2, makes experiments with temporary clusters and big data crunching easy and ... read more →