Mar 15 14

ORC: An Intelligent Big Data file format for Hadoop and Hive

by Christian Prokopp
ORC File Layout

RCFile (Record Columnar File), the previous Big Data storage format on Hive, is being challenged by the smart ORC (Optimized Row Columnar) format.

My first post on the topic, Getting Started with Big Data with Text and Apache Hive, presented a common scenario to illustrate why Hive file formats matter for performance and big data processing. The second post, Optimising Hadoop and Big Data with Text and Hive, discussed the case for and against text format. It highlighted why RCFile is the best choice for intermediate tables, which we covered in Faster Big Data on Hadoop with Hive and RCFile. This is about to change: the next evolution of the format, ORC file, is around the corner. How much better is it? read more…

Mar 8 14

Faster Big Data on Hadoop with Hive and RCFile

by Christian Prokopp
Hive RCFile format

SQL on Hadoop with Hive makes Big Data accessible. Yet performance can be lacking. RCFile (Record Columnar File) is a great optimisation for Big Data with Hive.

The previous two posts in this four-part series explained why to use text on the periphery of an ETL process and how to optimise it. The inside of a Hive data processing pipeline can be optimised further for performance, though. Commonly, derived intermediate tables are queried heavily. Optimising these tables speeds up the whole pipeline greatly. How can these potentially huge tables be optimised, independently of the queries run against them, so that only the relevant fraction of the stored data is read and processed? read more…

Mar 1 14

Optimising Hadoop and Big Data with Text and Hive

by Christian Prokopp
Hive and text can benefit greatly from optimisation

Hadoop’s Hive SQL interface reduces costs and gets results fast with Big Data from text. Simple optimisations improve performance significantly.

The previous post Getting Started with Big Data with Text and Apache Hive described the case for using text format to import and export data for a Hive ETL and reporting process. These formats are a common denominator and are convenient to get products started quickly. read more…

Feb 22 14

Getting Started with Big Data with Text and Apache Hive

by Christian Prokopp
Apache Hive Logo

Big Data more often than expected is stored and exchanged as text. Apache Hadoop’s Hive SQL interface helps to reduce costs and to get results fast.

Often, things have to get done fast rather than perfectly. However, with big data even a small decision like a file format can have a great impact. What are some of the best practices, storage formats, and strategies to balance optimization with getting things done? read more…

Sep 16 13

32x Faster Hadoop and Map Reduce With Indexing

by Christian Prokopp
HAIL, Hadoop with indexing is fast (image source)

The simplicity of Hadoop and map reduce, and especially their lack of indices, significantly limits their performance. I described how map reduce 2.0 and alternatives bypassing map reduce will change Hadoop’s application and speed it up in the next year or two. Another approach is the introduction of indices to data stored on the Hadoop Distributed File System (HDFS). Since Hadoop is not a Database Management System (DBMS), indices were largely ignored until recently. Exciting research is now exploring indexing and its performance benefits for Hadoop.

At its inception, Hadoop was diametrically opposed to the idea of parallel DBMS, with their well-tuned schemas and indices. This perception implied that Hadoop, by design massively parallel, scalable, data agnostic, and hardwired to full row or column scans, is not suitable for indexing. Indexing is key to traditional DBMS and keeps them on the performance high ground over Hadoop for (near) real-time analytical tasks.
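To make the contrast between a full scan and an indexed lookup concrete, here is a minimal, single-machine sketch in Python. It is not HAIL's method and not Hadoop code; the records and the function names (`full_scan`, `build_index`, `index_lookup`) are hypothetical and serve only to illustrate the general idea.

```python
import bisect

# Hypothetical records: (key, payload) pairs in arbitrary storage order.
records = [(k, f"row-{k}") for k in [42, 7, 19, 3, 88, 19, 61]]


def full_scan(records, key):
    """Inspect every record, as a scan without an index has to."""
    return [payload for k, payload in records if k == key]


def build_index(records):
    """Build a sorted list of (key, position) pairs over the records."""
    return sorted((k, pos) for pos, (k, _) in enumerate(records))


def index_lookup(records, index, key):
    """Binary-search the index, then read only the matching records."""
    i = bisect.bisect_left(index, (key, -1))  # first entry with this key
    matches = []
    while i < len(index) and index[i][0] == key:
        matches.append(records[index[i][1]][1])
        i += 1
    return matches


index = build_index(records)
print(full_scan(records, 19))
print(index_lookup(records, index, 19))
```

Both calls return the same rows, but the scan touches every record while the lookup touches only the index entries around the key. The index has to be built and stored first, which is the cost any indexing scheme has to justify.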

This may change, and not only because of Corona, YARN, Tez, Impala, and Drill. Research at the Information Systems Group at Saarland University amounts to nothing less than a paradigm shift. Researchers demonstrated that Hadoop, with little to no changes, can generate indices that provide stunning performance improvements. read more…

Aug 12 13

Online Education Revolution

by Christian Prokopp

For hundreds of years, brick-and-mortar universities were at the unchallenged pinnacle of education. In recent decades, remote and online education appeared. At first it was little more than traditional education channelled through alternative media, with limited success and appeal. This is changing fast and dramatically. read more…

Jun 27 13

Hadoop 2.0: Beyond MapReduce with YARN, Drill, Tez

by Christian Prokopp

Hadoop 1.0 is increasingly challenged as slow and limited in its application, now that the hype is dying down. Marketing departments, riding the Big Data wave, wildly exaggerated Hadoop’s ability. Hadoop 2.0, surprisingly, is about to prove them somewhat right with two major developments. read more…

Jun 22 13

4 Free DIY Twitter Visualisations: The Shahbag Protest

by Christian Prokopp

Earlier this year a mass movement occurred in Bangladesh, which received little global news coverage. It was an immensely important event to Bangladeshis at home and abroad. This prompted me to try to illustrate the event with Twitter data myself, using only some free web services and a few hours of my time. Amazingly, the results are beautiful and informative visualisations that cost no money and very little effort and time. read more…

Jun 8 13

Bangladesh Budgets 2004-2013: A Decade of Government Spending

by Christian Prokopp

Every Bangladeshi wants to know if her country is becoming wealthier, if her government spends within the country’s means, and if her money is spent wisely. The 2013 budget for Bangladesh was released on Thursday and it holds some answers to these questions. However, the budget is part accounting and part politics, and is consequently abstract and inaccessible to most people. The magnitude and quantity of the numbers reported are confusing and hard to relate to. This is an opportunity to take a step back and visualise the big picture.

Bangladesh’s budget has changed significantly in the last decade. It is much easier to understand this year’s budget when it is compared to the last decade’s budgets and economic indicators.

Bangladesh’s GNI (Gross National Income), roughly a measure of how much economic wealth a country creates, can put the budget’s size in perspective. However, the absolute increase does not equal the impact per citizen, since Bangladesh’s population is growing. Normalising GNI and budget per capita corrects for this. Plotting the last decade of GNI and budget expenditure per capita highlights the long-term economic and spending trends.
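The per-capita normalisation described above is a simple division, sketched here in Python. All figures are hypothetical placeholders chosen for easy arithmetic, not real Bangladesh data, and the names `per_capita` and `years` are illustrative only.

```python
def per_capita(total, population):
    """Return the amount per person for a national total."""
    if population <= 0:
        raise ValueError("population must be positive")
    return total / population


# Hypothetical placeholder figures in arbitrary units, NOT real data.
years = {
    "year A": {"gni": 1000.0, "budget": 150.0, "population": 100.0},
    "year B": {"gni": 1500.0, "budget": 240.0, "population": 120.0},
}

# Divide each total by the population of the same year.
normalised = {
    year: {
        "gni_per_capita": per_capita(v["gni"], v["population"]),
        "budget_per_capita": per_capita(v["budget"], v["population"]),
    }
    for year, v in years.items()
}

for year, values in normalised.items():
    print(year, values)
```

With these placeholder numbers, total GNI grows by 50% between the two years while GNI per capita grows by only 25%, because the population grew as well. That gap is the reason for normalising before comparing years.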

GNI vs Budget per capita (Bangladesh Budget 2004-2013)

read more…

Apr 22 13

Big Data Transforms Online Education

by Christian Prokopp

Online education’s recent success and scale create unique big data. Together they change online education and enable personalized, adaptive learning. This development challenges traditional education services. Image by Sean MacEntee

An educational paradigm shift
The End of Higher Education as We Know It explains how online education will transform higher education. The fact that (prestigious) universities scramble to be part of this paradigm shift shows how serious the development is. Today, we can take an increasing array of courses online, often of the highest quality, higher than many are privileged to enjoy in brick-and-mortar universities. The change started off as recordings of courses given in a traditional setting. Increasingly, we see courses tailored to the online format by experts in their field, like Andrew Ng from Stanford with his Machine Learning course on Coursera. read more…