Mar 15 14

ORC: An Intelligent Big Data file format for Hadoop and Hive

by Christian Prokopp
ORC File Layout

RCFile (Record Columnar File), the previous Hadoop Big Data storage format on Hive, is being challenged by the smart ORC (Optimized Row Columnar) format.

My first post on the topic, Getting Started with Big Data with Text and Apache Hive, presented a common scenario to illustrate why Hive file formats matter for performance and big data processing. The second post, Optimising Hadoop and Big Data with Text and Hive, discussed the case for and against text format. It highlighted why RCFile is the best choice for intermediate tables, which we covered in Faster Big Data on Hadoop with Hive and RCFile. This is about to change: the next evolution of the format, ORC file, is around the corner. How much better is it? read more…

Mar 8 14

Faster Big Data on Hadoop with Hive and RCFile

by Christian Prokopp
Hive RCFile format


The previous two posts in this four-part series explained why to use text on the periphery of an ETL process and how to optimise text. The inside of a Hive data processing pipeline can be optimised further for performance, though. Derived, intermediate tables are commonly queried heavily, so optimising these tables speeds up the whole pipeline greatly. How can these potentially huge tables be optimised, independently of any particular query, so that only the relevant fraction of the stored data is read and processed? read more…

Mar 1 14

Optimising Hadoop and Big Data with Text and Hive

by Christian Prokopp
Hive and text can benefit greatly from optimisation


The previous post, Getting Started with Big Data with Text and Apache Hive, described the case for using text format to import and export data for a Hive ETL and reporting process. Text formats are a common denominator and a convenient way to get products started quickly. read more…

Feb 22 14

Getting Started with Big Data with Text and Apache Hive

by Christian Prokopp
Apache Hive Logo

Big Data is stored and exchanged as text more often than expected. Apache Hadoop’s Hive SQL interface helps to reduce costs and to get results fast.

Often, things have to get done fast rather than perfectly. However, with big data even a small decision like the choice of file format can have a great impact. What are some of the best practices, storage formats, and strategies to balance optimization with getting things done? read more…

Sep 16 13

32x Faster Hadoop and Map Reduce With Indexing

by Christian Prokopp
HAIL, Hadoop with indexing is fast

The simplicity of Hadoop and map reduce, and especially their lack of indices, significantly limits their performance. I described how map reduce 2.0 and alternatives bypassing map reduce will change Hadoop’s application and speed it up in the next year or two. Another approach is the introduction of indices to data stored on the Hadoop Distributed File System (HDFS).

At its inception, Hadoop was diametrically opposed to the idea of parallel DBMS, with their well-tuned schemas and indices. This perception implied that Hadoop, by design (massively parallel, scalable, data agnostic, and hardwired to full row or column scans), is not suitable for indexing. Indexing is key to traditional DBMS and keeps them on the performance high ground over Hadoop for (near) real-time analytical tasks.

This may change, and not only because of Corona, YARN, Tez, Impala, and Drill. Research by the Information Systems Group at Saarland University amounts to nothing less than a paradigm shift. Researchers demonstrated that Hadoop, with little to no changes, can generate indices, which provide stunning performance improvements. read more…
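To make the contrast between a full scan and an index lookup concrete, here is a minimal in-memory sketch. It is not HAIL's implementation and has nothing to do with HDFS block layout; the function names and the toy records are hypothetical and only illustrate why an index avoids touching every record.

```python
from collections import defaultdict


def full_scan(records, key, value):
    """Check every record, as a job without an index has to."""
    return [r for r in records if r[key] == value]


def build_index(records, key):
    """Map each value of `key` to the positions of the records holding it."""
    index = defaultdict(list)
    for pos, record in enumerate(records):
        index[record[key]].append(pos)
    return index


def index_lookup(records, index, value):
    """Fetch only the records whose positions the index lists for `value`."""
    return [records[pos] for pos in index.get(value, [])]


# Hypothetical toy data, for illustration only.
records = [
    {"id": 1, "city": "Dhaka"},
    {"id": 2, "city": "London"},
    {"id": 3, "city": "Dhaka"},
]
city_index = build_index(records, "city")
scan_hits = full_scan(records, "city", "Dhaka")
index_hits = index_lookup(records, city_index, "Dhaka")
```

Both approaches return the same records. The scan inspects all of them, while the lookup visits only the positions stored in the index, at the cost of building and storing the index first.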

Aug 12 13

Online Education Revolution

by Christian Prokopp

Traditional Universities ((c)
read more…

Jun 27 13

Hadoop 2.0: Beyond MapReduce with YARN, Drill, Tez

by Christian Prokopp

read more…

Jun 22 13

4 Free DIY Twitter Visualisations: The Shahbag Protest

by Christian Prokopp

read more…

Jun 8 13

Bangladesh Budgets 2004-2013: A Decade of Government Spending

by Christian Prokopp


Bangladesh’s budget has changed significantly in the last decade. It is much easier to understand this year’s budget when it is compared to the last decade’s budgets and economic indicators.

Bangladesh’s GNI, the Gross National Income, is roughly a measure of how much economic wealth a country creates, and it can put the budget’s size in perspective. However, the absolute increase does not equal the impact per citizen, since Bangladesh’s population is growing. Normalising GNI and budget per capita corrects for this. Plotting the last decade of GNI and budget expense per capita highlights the long-term economic and spending trends.
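The per-capita normalisation described above can be sketched in a few lines. The figures below are hypothetical placeholders chosen for easy arithmetic, not actual Bangladesh data, and the helper name `per_capita` is an assumption made for this illustration.

```python
def per_capita(total, population):
    """Divide a national total by the population of the same year."""
    if population <= 0:
        raise ValueError("population must be positive")
    return total / population


# Hypothetical placeholder figures for two years (not real data).
gni_total = {1: 100.0e9, 2: 121.0e9}
population = {1: 100.0e6, 2: 110.0e6}

gni_pc = {year: per_capita(gni_total[year], population[year]) for year in gni_total}

# Growth of the absolute total versus growth per citizen.
absolute_growth = gni_total[2] / gni_total[1] - 1
per_capita_growth = gni_pc[2] / gni_pc[1] - 1
```

With these placeholder numbers the total grows by about 21%, but because the population grows by 10% over the same period, the per-capita figure grows by only about 10%. The same division applies to the budget.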

GNI vs Budget per capita, Bangladesh Budget 2004-2013

read more…

Apr 22 13

Big Data Transforms Online Education

by Christian Prokopp

Image by Sean MacEntee

An educational paradigm shift
The End of Higher Education as We Know It explains how online education will transform higher education. The fact that (prestigious) universities are scrambling to be part of this paradigm shift shows how serious the development is. Today, we can take an increasing array of courses online, often of the highest quality, higher than many are privileged to enjoy in brick-and-mortar universities. The change started off as recordings of courses given in a traditional setting. Increasingly, we see courses tailored for online delivery by experts in their field, like Andrew Ng from Stanford with his Machine Learning course on Coursera.
read more…