32x Faster Hadoop and Map Reduce With Indexing
Hadoop and map reduce’s simplicity, and especially lack of indices, significantly limits its performance. I described how map reduce 2.0 and alternatives bypassing map reduce will change Hadoop’s application and speed it up in the next year or two. Another approach is the introduction of indices to data stored on Hadoop Distributed File System (HDFS). Since Hadoop is not a Database Management System (DBMS), indices have been largely ignored until recently with exciting research exploring indexing and its performance benefits to Hadoop.
At its inception, Hadoop was diametrical to the idea of parallel DBMS, with their well-tuned schema and indices. This perception implied that Hadoop, by design – massively parallel, scalable, data agnostic, and hardwired to full row or column scans – is not suitable for indexing. This essential technology is key to traditional DBMS, keeping the performance high-ground over Hadoop on (near) real-time analytical tasks.
This may change, and not only because of Corona, YARN, Tez, Impala, and Drill. Research at the Information System Group at the Saarland University is nothing less than a paradigm shift. Researchers demonstrated that Hadoop, with little to no changes, can generate indices, which provide stunning performance improvements.
20x speedup with Hadoop++
The researchers’ first approach, Hadoop++, was to generate indices with drop-in replacements for some of the user-defined functions Hadoop uses for intermediated steps in map reduce jobs. They created separate indices for each HDFS block. These data blocks are stored, distributed and redundantly, in HDFS on the Hadoop cluster for reliability and performance.
The indexing achieved a speed-up of up to 20 times by mainly reducing the data required to be read by mappers. This requires anticipation of the workload for optimal index design, and extra time to create the indices. Such an approach lacks flexibility and ease of use, but it was an encouraging starting point for further investigations.
64x speedup with Hadoop Aggressive Indexing Library
In subsequent work, the researchers developed Hadoop Aggressive Indexing Library (HAIL), addressing the indexing problem at the beginning of the Hadoop pipeline when data is uploaded to HDFS. Hadoop, by default, keeps three identical copies of each data block distributed between datanodes in the cluster for redundancy and parallel performance through data locality. HAIL intervenes at upload time and creates an efficient binary Partition Attributes Across (PAX) data block.
HAIL goes further, and creates different indices and sort orders for each of the copies, effectively prematurely optimizing them for different workloads. The redundancy and availability is not compromised since the data in each copy is the equivalent, and merely organized and indexed differently. Each map-reduce job can then select the best fitting index and sorted block to improve performance.
HAIL excels by utilizing spare CPU time on datanodes at upload time, which generates efficient binary PAX blocks that speed up upload and distribution time, despite indexing and sorting the data. HAIL can create and distribute up to six differently-indexed, redundant copies in the time Hadoop usually uploads the same data to three datanodes.
Lazy Indexing and Adaptivity in Hadoop
The researchers from Saarland appear to have cracked this nut with their latest work. They developed Lazy Indexing and Adaptivity in Hadoop (LIAH) to approach indexing as a dynamic runtime problem optimizing recurring workloads. LIAH does not require prior knowledge of the workloads. It can be used as an extension to HAIL or independently on a standard Hadoop distribution.
LIAH creates an index at runtime in parallel to the map task. It uses the data already read for the running map task in the job without additional disk IO. Subsequent map reduce jobs then benefit from the (partial) index. LIAH does not index all blocks in a job to limit the overhead for the first job(s). Depending on a preset or adaptively derived parameter p LIAH indexes only a certain percentage of previously unindexed blocks to limit its resource consumption.
Eventually LIAH converges as the index reaches full coverage of all blocks. At the same time, subsequent jobs speed up as more and more index information becomes available. This lazy approach spreads the overhead of the indexing across several jobs. It prevents the initial job(s) from becoming excessively slow while benefiting from the partial index from the second job.
Hadoop is maturing
Hadoop is still a young technology compared to DBMS. And that is exciting since current development shows that it has many opportunities for performance improvements and new applications. I am not aware of a push of HAIL into the mainstream Hadoop package (if you are, please write a comment). However, the benefits are significant, and I am confident that someone will soon be working on an incubator project to bring these benefits to the mainstream.
I can see great benefits in adaptive indexing, e.g. for ETL jobs that feed off incrementally growing, massive data sinks without the need for data architects to come up with multiple representations of the data in partitions, rows, and columns to fit separate use-cases on the same data source.
- Hadoop 2.0: Beyond MapReduce with YARN, Drill, Tez
- Hadoop cluster cost of Amazon EC2 vs EMR
- Democratize Big Data With Hive
- Get Started: Big Data Crunching in the Cloud
This article was first published as a series of posts on the Big Data Republic.