32x Faster Hadoop and MapReduce With Indexing

HAIL, Hadoop with indexing is fast (image source)

Hadoop and MapReduce’s simplicity, and especially their lack of indices, significantly limits their performance. I described how MapReduce 2.0 and alternatives that bypass MapReduce will change how Hadoop is applied and speed it up in the next year or two. Another approach is the introduction of indices to data stored in the Hadoop Distributed File System (HDFS).

At its inception, Hadoop was diametrically opposed to the idea of a parallel DBMS with its well-tuned schemas and indices. This created the perception that Hadoop, by design – massively parallel, scalable, data agnostic, and hardwired to full row or column scans – is not suited to indexing. This essential technology is key to traditional DBMSs and keeps them on the performance high ground over Hadoop for (near) real-time analytical tasks.

This may change, and not only because of Corona, YARN, Tez, Impala, and Drill. Research at the Information Systems Group at Saarland University is nothing less than a paradigm shift. The researchers demonstrated that Hadoop, with little to no changes, can generate indices that provide stunning performance improvements.

20x speedup with Hadoop++

The researchers’ first approach, Hadoop++, generated indices with drop-in replacements for some of the user-defined functions Hadoop uses for intermediate steps in MapReduce jobs. They created a separate index for each HDFS block. These data blocks are stored redundantly and distributed across the Hadoop cluster by HDFS for reliability and performance.

The indexing achieved a speedup of up to 20 times, mainly by reducing the data the mappers have to read. It requires anticipating the workload to design optimal indices, and extra time to create them. Such an approach lacks flexibility and ease of use, but it was an encouraging starting point for further investigations.
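To make this concrete, here is a minimal, self-contained sketch of the per-block index idea in plain Java. The Record and Block classes and the integer key are my own illustrative assumptions, not Hadoop++ code: each block carries a small clustered index of its own records, so a selective read touches only the matching rows instead of scanning the whole block.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/**
 * Illustrative sketch of the Hadoop++ idea, not the actual Hadoop++
 * code: each HDFS block carries its own small index, so a map task
 * can read just the matching records instead of scanning the block.
 */
public class PerBlockIndexDemo {

    record Record(int key, String payload) {}

    /** One "HDFS block" with a clustered index over its own records. */
    static class Block {
        final List<Record> records = new ArrayList<>();
        // Index built once when the block is written: key -> position.
        final TreeMap<Integer, Integer> index = new TreeMap<>();

        void add(Record r) {
            index.put(r.key(), records.size());
            records.add(r);
        }

        /** Index lookup: touches only records with key in [lo, hi]. */
        List<Record> scanRange(int lo, int hi) {
            List<Record> out = new ArrayList<>();
            for (int pos : index.subMap(lo, true, hi, true).values()) {
                out.add(records.get(pos)); // selective read, no full scan
            }
            return out;
        }
    }

    public static void main(String[] args) {
        Block block = new Block();
        for (int i = 0; i < 1_000_000; i++) {
            block.add(new Record(i, "row-" + i));
        }
        // The mapper pays only for the few matching records, which is
        // where the reported speedups come from.
        System.out.println(block.scanRange(500, 510).size() + " records read");
    }
}
```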

64x speedup with Hadoop Aggressive Indexing Library

In subsequent work, the researchers developed the Hadoop Aggressive Indexing Library (HAIL), which addresses the indexing problem at the beginning of the Hadoop pipeline, when data is uploaded to HDFS. Hadoop, by default, keeps three identical copies of each data block distributed across datanodes in the cluster for redundancy and for parallel performance through data locality. HAIL intervenes at upload time and creates an efficient binary Partition Attributes Across (PAX) data block.

HAIL outperforms Hadoop (no indexing) and Hadoop++ in upload speed

PAX is a combination of columnar and row-based storage. It takes the row-based split of data assigned to an HDFS block and organizes it column-oriented within that block. It outperforms both pure columnar and pure row formats for a range of workloads.
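A minimal sketch of the PAX layout with a made-up two-column schema follows; it illustrates the idea, not HAIL’s actual binary format. Rows stay together in one block, but within the block each column is stored contiguously, so a query touching one column reads only that column’s minipage while row reconstruction never has to leave the block.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative PAX block for a two-column (id, name) schema. The
 * schema and class names are assumptions made for this sketch.
 */
public class PaxBlockDemo {

    static class PaxBlock {
        final List<Integer> idColumn = new ArrayList<>();  // minipage 1
        final List<String> nameColumn = new ArrayList<>(); // minipage 2

        void addRow(int id, String name) { // row-based split of the data ...
            idColumn.add(id);              // ... column-oriented storage
            nameColumn.add(name);
        }

        /** Column scan touching only the id minipage. */
        long sumIds() {
            long sum = 0;
            for (int id : idColumn) sum += id;
            return sum;
        }

        /** Row reconstruction stays local: same block, same position. */
        String rowAt(int i) {
            return idColumn.get(i) + "," + nameColumn.get(i);
        }
    }

    public static void main(String[] args) {
        PaxBlock block = new PaxBlock();
        block.addRow(1, "alice");
        block.addRow(2, "bob");
        System.out.println(block.sumIds()); // 3, reading only one column
        System.out.println(block.rowAt(1)); // "2,bob"
    }
}
```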

HAIL goes further and creates different indices and sort orders for each of the copies, effectively optimizing them in advance for different workloads. Redundancy and availability are not compromised since the data in each copy is equivalent, merely organized and indexed differently. Each MapReduce job can then select the best-fitting index and sorted block to improve performance.
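The replica selection could look roughly like this sketch, where Replica, pickReplica, and the column names are hypothetical stand-ins rather than HAIL’s actual scheduling code: a job picks the replica whose index matches its filter column and falls back to a plain scan of any replica otherwise.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of picking the best-indexed block replica. */
public class ReplicaChoiceDemo {

    record Replica(String datanode, String indexedColumn) {}

    static Replica pickReplica(List<Replica> replicas, String filterColumn) {
        Optional<Replica> match = replicas.stream()
                .filter(r -> r.indexedColumn().equals(filterColumn))
                .findFirst();
        // No fitting index is not an error: availability is unchanged,
        // the job simply falls back to scanning an arbitrary replica.
        return match.orElse(replicas.get(0));
    }

    public static void main(String[] args) {
        List<Replica> replicas = List.of(
                new Replica("node1", "customer_id"),
                new Replica("node2", "order_date"),
                new Replica("node3", "region"));
        System.out.println(pickReplica(replicas, "order_date")); // node2
        System.out.println(pickReplica(replicas, "price"));      // fallback
    }
}
```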

HAIL excels by utilizing spare CPU time on the datanodes at upload time to generate the efficient binary PAX blocks, which reduce upload and distribution time despite the indexing and sorting of the data. HAIL can create and distribute up to six differently indexed, redundant copies in the time Hadoop usually takes to upload the same data to three datanodes.

HAIL performs up to 64x faster on MapReduce jobs than Hadoop without indexing

The speedup these indices achieve over plain Hadoop is dramatic. However, for complex schemas, the number of potential indices and sort orders grows unwieldy. Hadoop needs to know in advance which data columns to index, or the pipeline has to be modified to inject hints for indexing and querying.

Lazy Indexing and Adaptivity in Hadoop

The researchers from Saarland appear to have cracked this nut with their latest work. They developed Lazy Indexing and Adaptivity in Hadoop (LIAH), which approaches indexing as a dynamic runtime problem of optimizing recurring workloads. LIAH does not require prior knowledge of the workloads. It can be used as an extension to HAIL or independently on a standard Hadoop distribution.

LIAH creates an index at runtime, in parallel to the map tasks. It uses the data already read for the running map task, so the job incurs no additional disk IO. Subsequent MapReduce jobs then benefit from the (partial) index. LIAH does not index all blocks in one job, which limits the overhead for the first job(s). Depending on a preset or adaptively derived parameter p, LIAH indexes only a certain percentage of the previously unindexed blocks to limit its resource consumption.
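The following toy simulation illustrates this lazy policy; the fixed offer rate P and the block bookkeeping are my assumptions, not LIAH’s implementation. Each job indexes only a fraction of the still-unindexed blocks, so index coverage, and with it query performance, grows job by job until every block is covered.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy simulation of lazy, incremental indexing across jobs. */
public class LazyIndexingDemo {

    static final double P = 0.25;              // assumed offer rate per job
    static final Set<Integer> indexed = new LinkedHashSet<>();

    /** Runs one "job" over all blocks and indexes a fraction P of the rest. */
    static void runJob(int totalBlocks) {
        List<Integer> unindexed = new ArrayList<>();
        for (int b = 0; b < totalBlocks; b++) {
            if (!indexed.contains(b)) unindexed.add(b);
        }
        // Index only ceil(P * unindexed) blocks this job, spreading the
        // one-time indexing cost over several jobs. In LIAH the blocks
        // are indexed from data the map tasks read anyway.
        int quota = (int) Math.ceil(P * unindexed.size());
        for (int i = 0; i < quota; i++) {
            indexed.add(unindexed.get(i));
        }
        System.out.printf("job done: %d/%d blocks indexed%n",
                indexed.size(), totalBlocks);
    }

    public static void main(String[] args) {
        // Coverage converges to 100% after a few jobs; later jobs run
        // faster because more blocks can be answered from the index.
        while (indexed.size() < 16) runJob(16);
    }
}
```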

Eventually, LIAH converges as the index reaches full coverage of all blocks. At the same time, subsequent jobs speed up as more and more index information becomes available. This lazy approach spreads the overhead of indexing across several jobs. It prevents the initial job(s) from becoming excessively slow, while jobs benefit from the partial index from the second job onward.

LIAH performance for various p and on two cluster setups, compared to HAIL and Hadoop without an index

The performance gains are significant even for a small p and already after the first few jobs. The researchers found final performance gains over HAIL and Hadoop of 24 and 32 times, respectively. They also tested an eager version that adjusts p to a maximum while keeping each job’s runtime approximately at HAIL’s level. As expected, the initial jobs perform similarly to HAIL, and runtimes then drop to full-index performance once all data has been indexed. This would be an interesting feature for SLA-bound workloads: it throttles the indexing to keep runtimes predictable while optimizing jobs as aggressively as possible.
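A toy sketch of such a feedback loop is shown below; the proportional controller, cost model, and numbers are invented for illustration and are not from the paper. After each job, p is adjusted so the runtime stays near a target budget, indexing as aggressively as the budget allows.

```java
/** Invented feedback loop: tune the indexing fraction p to a runtime budget. */
public class EagerThrottleDemo {

    /** Toy cost model: indexing more blocks makes the current job slower. */
    static double simulateJob(double p) {
        return 80.0 + 60.0 * p;               // base cost + indexing overhead
    }

    public static void main(String[] args) {
        double p = 0.1;                       // initial indexing fraction
        double targetRuntime = 100.0;         // per-job budget (made up)

        for (int job = 1; job <= 5; job++) {
            double runtime = simulateJob(p);  // would be measured in practice
            // Proportional adjustment: spend any slack on more indexing,
            // back off when the job exceeded the budget.
            p = Math.max(0.0, Math.min(1.0, p * targetRuntime / runtime));
            System.out.printf("job %d: runtime=%.1f, next p=%.2f%n",
                    job, runtime, p);
        }
    }
}
```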

Hadoop is maturing

Hadoop is still a young technology compared to DBMSs. That is exciting, since current developments show it has many opportunities for performance improvements and new applications. I am not aware of a push to bring HAIL into the mainstream Hadoop package (if you are, please write a comment). However, the benefits are significant, and I am confident that someone will soon be working on an incubator project to bring them to the mainstream.

I can see great benefits in adaptive indexing, e.g. for ETL jobs that feed off incrementally growing, massive data sinks. It removes the need for data architects to come up with multiple representations of the data in partitions, rows, and columns to fit separate use cases on the same data source.

This article was first published as a series of posts on the Big Data Republic.
