
Posts from the ‘Semantikoz’ Category

Mar 15 14

ORC: An Intelligent Big Data file format for Hadoop and Hive

by Christian Prokopp
ORC File Layout

RCFile (Record Columnar File), previously the Big Data storage format of choice on Hive, is being challenged by the smarter ORC (Optimized Row Columnar) format.

My first post on the topic, Getting Started with Big Data with Text and Apache Hive, presented a common scenario to illustrate why Hive file formats matter for performance and big data processing. The second post, Optimising Hadoop and Big Data with Text and Hive, discussed the case for and against the text format. It highlighted why RCFile is the best choice for intermediate tables, which we covered in Faster Big Data on Hadoop with Hive and RCFile. This is about to change: the next evolution of the format, the ORC file, is around the corner. How much better is it? read more…
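To make the comparison concrete, here is a minimal HiveQL sketch of moving an intermediate table from RCFile to ORC. The table and column names (logs_rc, logs_orc, event_time, user_id, url) are hypothetical placeholders rather than details from the posts, and the ORC compression setting shown is just one reasonable choice.

```sql
-- Hypothetical intermediate table stored as RCFile, the format the earlier posts recommend.
CREATE TABLE logs_rc (
  event_time TIMESTAMP,
  user_id    BIGINT,
  url        STRING
)
STORED AS RCFILE;

-- The same table declared with ORC instead; compression is configurable per table.
CREATE TABLE logs_orc (
  event_time TIMESTAMP,
  user_id    BIGINT,
  url        STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");

-- Populate the ORC table from the existing RCFile table.
INSERT OVERWRITE TABLE logs_orc
SELECT event_time, user_id, url FROM logs_rc;
```

Beyond the columnar layout it shares with RCFile, ORC stores lightweight statistics and indexes inside each file, which is where much of the performance gain discussed in the post comes from.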

Mar 1 14

Optimising Hadoop and Big Data with Text and Hive

by Christian Prokopp
Hive and text can benefit greatly from optimisation

Hadoop’s Hive SQL interface reduces costs and gets results fast from Big Data in text format. Simple optimisations improve its performance significantly.

The previous post, Getting Started with Big Data with Text and Apache Hive, described the case for using the text format to import and export data for a Hive ETL and reporting process. These formats are a common denominator and a convenient way to get products started quickly. read more…
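As a rough illustration of that pattern, the sketch below stages raw text files in an external Hive table and then converts them into a compressed RCFile table for the intermediate ETL steps. The paths, table names, delimiter, and codec are assumptions made for the example, not details taken from the post.

```sql
-- External table over raw tab-delimited text files landed on HDFS (hypothetical path).
CREATE EXTERNAL TABLE raw_logs (
  event_time STRING,
  user_id    BIGINT,
  url        STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/incoming/logs';

-- Enable compressed output for the conversion job (Gzip here; pick a codec to suit).
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Intermediate table in RCFile format, populated from the text staging table.
CREATE TABLE logs_rc
STORED AS RCFILE
AS
SELECT event_time, user_id, url FROM raw_logs;
```

Keeping text only at the edges (import and export) and converting to a columnar format for the intermediate tables is the simple optimisation the post argues for.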

Sep 16 13

32x Faster Hadoop and Map Reduce With Indexing

by Christian Prokopp
HAIL, Hadoop with indexing is fast (image source)

The simplicity of Hadoop and MapReduce, and especially the lack of indices, significantly limits their performance. I described how MapReduce 2.0 and alternatives that bypass MapReduce will change Hadoop’s applications and speed it up in the next year or two. Another approach is the introduction of indices on data stored in the Hadoop Distributed File System (HDFS). Since Hadoop is not a Database Management System (DBMS), indices were largely ignored until recently, when exciting research began exploring indexing and its performance benefits for Hadoop.

At its inception, Hadoop was diametrically opposed to the idea of the parallel DBMS, with its well-tuned schemas and indices. This perception implied that Hadoop, by design (massively parallel, scalable, data agnostic, and hardwired to full row or column scans), is not suited to indexing. Indexing is an essential technology for traditional DBMSs and keeps them on the performance high ground over Hadoop for (near) real-time analytical tasks.

This may change, and not only because of Corona, YARN, Tez, Impala, and Drill. Research at the Information Systems Group at Saarland University amounts to nothing less than a paradigm shift. The researchers demonstrated that Hadoop, with little to no change, can generate indices that provide stunning performance improvements. read more…
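HAIL itself is a research prototype, and its interfaces are not reproduced here. As a loosely related, era-appropriate illustration of the same idea, namely building an index over data that already lives on HDFS, the sketch below uses Hive’s built-in COMPACT index on a hypothetical logs_rc table and user_id column. This is not HAIL and will not deliver the speed-ups the researchers report; it only shows what index-assisted access looks like in the same ecosystem.

```sql
-- Hive's built-in compact index on a hypothetical table and column; the index data
-- is itself stored on HDFS and must be rebuilt after new data is loaded.
CREATE INDEX logs_rc_user_idx
ON TABLE logs_rc (user_id)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Materialise the index (runs a MapReduce job over the base table).
ALTER INDEX logs_rc_user_idx ON logs_rc REBUILD;

-- With index-based filtering enabled, selective predicates on user_id can
-- avoid a full scan of logs_rc.
SET hive.optimize.index.filter=true;
SELECT url FROM logs_rc WHERE user_id = 42;
```

The contrast with HAIL is that HAIL builds its indices almost for free while the data is being uploaded and replicated, whereas the Hive index above costs an extra job and explicit maintenance.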

Jan 17 13

Crowd Funding Activism: Put Your Money Where Your ‘Like’ Is

by Christian Prokopp

In the Internet age many people utilise technology to spread information, make themselves heard, organise demonstrations, or simply click ‘like’ to show their support for a cause. The latter is sometimes belittled as the equivalent of a couch potato’s self-gratifying response to a pressing issue: it gives you a warm feeling but has no impact. In recent years, though, there has been a development that allows each of us to help. The problem has been that for the thousands of people who like and (re)tweet an issue, only a few are willing to put themselves forward and do something about it. Unfortunately, a million likes won’t buy them equipment, pay for materials, or feed them when they spend months of their lives changing something which you, I and so many others agree should be changed. That is, until now: enter crowd funding. read more…