32x Faster Hadoop and MapReduce With Indexing

HAIL, Hadoop with indexing is fast (image source)

Hadoop and MapReduce’s simplicity, and especially the lack of indices, significantly limits their performance. I have described how MapReduce 2.0 and alternatives that bypass MapReduce will change Hadoop’s applications and speed it up in the next year or two. Another approach is the introduction of indices for data stored on the Hadoop Distributed File System (HDFS).

At its inception, Hadoop was diametrically opposed to the idea of parallel DBMS with their well-tuned schemas and indices. This perception implied that Hadoop, by design – massively parallel, scalable, data agnostic, and hardwired to full row or column scans – is not suitable for indexing. That essential technology is key to traditional DBMS, which keep the performance high ground over Hadoop on (near) real-time analytical tasks.

This may change, and not only because of Corona, YARN, Tez, Impala, and Drill. Research at the Information Systems Group at Saarland University is nothing less than a paradigm shift. The researchers demonstrated that Hadoop, with little to no changes, can generate indices that provide stunning performance improvements.

20x speedup with Hadoop++

The researchers’ first approach, Hadoop++, generated indices with drop-in replacements for some of the user-defined functions Hadoop uses for intermediate steps in MapReduce jobs. They created separate indices for each HDFS block. These data blocks are stored redundantly and distributed across the Hadoop cluster by HDFS for reliability and performance.

The indexing achieved a speed-up of up to 20 times, mainly by reducing the amount of data mappers have to read. However, it requires anticipating the workload for an optimal index design, plus extra time to create the indices. Such an approach lacks flexibility and ease of use, but it was an encouraging starting point for further investigation.
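The core idea can be illustrated with a toy sketch (my own simplification, not the Hadoop++ code): build a small per-block index – here just the min/max of the indexed key – once at load time, so that a mapper can skip blocks that cannot contain matching records.

```python
# Hypothetical sketch of Hadoop++-style per-block indexing (not the authors' code).
# Each HDFS block gets a tiny index (min/max of the indexed key); a mapper
# consults it and skips blocks that cannot match the query.

def build_block_index(records, key):
    """Build a min/max index over one block's records for column `key`."""
    values = [r[key] for r in records]
    return {"min": min(values), "max": max(values)}

def blocks_to_scan(block_indices, key_value):
    """Return positions of blocks whose index admits `key_value`."""
    return [i for i, idx in enumerate(block_indices)
            if idx["min"] <= key_value <= idx["max"]]

blocks = [
    [{"id": 1}, {"id": 7}],    # block 0 covers ids 1..7
    [{"id": 12}, {"id": 30}],  # block 1 covers ids 12..30
    [{"id": 31}, {"id": 99}],  # block 2 covers ids 31..99
]
indices = [build_block_index(b, "id") for b in blocks]
print(blocks_to_scan(indices, 25))  # → [1]: only block 1 needs a mapper scan
```

With a selective query, two of the three blocks are never read – which is where the bulk of the 20x speedup comes from.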

64x speedup with Hadoop Aggressive Indexing Library

In subsequent work, the researchers developed the Hadoop Aggressive Indexing Library (HAIL), which addresses the indexing problem at the beginning of the Hadoop pipeline, when data is uploaded to HDFS. By default, Hadoop keeps three identical copies of each data block, distributed across the datanodes in the cluster, for redundancy and for parallel performance through data locality. HAIL intervenes at upload time and creates efficient binary Partition Attributes Across (PAX) data blocks.

HAIL outperforms Hadoop (no indexing) and Hadoop++ in upload speed

PAX is a combination of columnar and row-based storage. It takes a row-based split of the data for an HDFS block and then organizes it column-oriented within the block. It outperforms both pure columnar and pure row formats for a range of workloads.
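A minimal sketch of the PAX idea (my own simplification): whole rows stay together in one block, but within the block the values are regrouped column by column, so a scan of one column touches only that column’s contiguous values.

```python
# Simplified illustration of a PAX-style block layout (not HAIL's binary format).
# Rows stay within one block, but the block stores per-column "minipages".

def rows_to_pax(rows):
    """Reorganize a row-based block into per-column value lists."""
    columns = {}
    for row in rows:
        for col, value in row.items():
            columns.setdefault(col, []).append(value)
    return columns

block = [
    {"user": "ann", "clicks": 3},
    {"user": "bob", "clicks": 5},
]
pax = rows_to_pax(block)
print(pax["clicks"])  # → [3, 5]: a column scan reads one contiguous list
```

Because the row split is preserved per block, a tuple can still be reconstructed locally – position i in every column list belongs to the same row.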

HAIL goes further and creates different indices and sort orders for each of the copies, effectively pre-optimizing them for different workloads. Redundancy and availability are not compromised, since the data in each copy is equivalent, merely organized and indexed differently. Each MapReduce job can then select the best-fitting index and sort order to improve performance.
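Conceptually, replica selection looks like the following sketch (the node and column names are made up for illustration): each of the three replicas of a block is clustered and indexed on a different column, and a job routes its reads to the replica that matches its filter column.

```python
# Hypothetical sketch of HAIL-style replica selection. Each replica of a block
# is sorted/indexed on a different column; a job picks the matching one.

replicas = {  # assumed mapping: datanode -> column its copy is indexed on
    "datanode-1": "customer_id",
    "datanode-2": "order_date",
    "datanode-3": "region",
}

def pick_replica(replicas, filter_column):
    """Choose the replica indexed on the job's filter column, if any."""
    for node, indexed_col in replicas.items():
        if indexed_col == filter_column:
            return node
    return next(iter(replicas))  # no match: fall back to any replica (full scan)

print(pick_replica(replicas, "order_date"))  # → datanode-2 serves this job
```

A query filtering on a column no replica is indexed on simply falls back to a full scan, i.e. standard Hadoop behavior – nothing gets slower, some things get much faster.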

HAIL excels by utilizing spare CPU time on datanodes at upload time to generate efficient binary PAX blocks, which speeds up upload and distribution despite the additional indexing and sorting of the data. HAIL can create and distribute up to six differently indexed, redundant copies in the time standard Hadoop uploads the same data to three datanodes.

HAIL performs up to 64x faster on map-reduce jobs than Hadoop without indexing

The speedup these indices achieve over plain Hadoop is dramatic. However, for complex schemas, the number of potential indices and sort orders grows unwieldy. Hadoop either needs to know in advance which data columns to index, or the pipeline has to be modified to inject hints for indexing and querying.

Lazy Indexing and Adaptivity in Hadoop

The researchers from Saarland appear to have cracked this nut with their latest work. They developed Lazy Indexing and Adaptivity in Hadoop (LIAH), which approaches indexing as a dynamic runtime problem that optimizes recurring workloads. LIAH requires no prior knowledge of the workloads and can be used as an extension to HAIL or independently on a standard Hadoop distribution.

LIAH creates an index at runtime, in parallel to the map task, using the data already read for the running map task without additional disk IO. Subsequent MapReduce jobs then benefit from the (partial) index. To limit the overhead for the first job(s), LIAH does not index all blocks in a job: depending on a preset or adaptively derived parameter p, it indexes only a certain percentage of the previously unindexed blocks, bounding its resource consumption.

Eventually, LIAH converges as the index reaches full coverage of all blocks, and subsequent jobs speed up as more and more index information becomes available. This lazy approach spreads the overhead of indexing across several jobs: it prevents the initial job(s) from becoming excessively slow, yet jobs benefit from the partial index from the second job onward.
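The convergence behavior can be sketched in a few lines (assumed parameters, my own simplification of LIAH’s policy): each job indexes a fraction p of the still-unindexed blocks as a side effect of reading them, so index coverage approaches 100% geometrically over successive jobs.

```python
# Sketch of LIAH's lazy indexing schedule: every job indexes a fraction p of
# the blocks that are not yet indexed, so coverage converges over time.
import math

def run_jobs(total_blocks, p, jobs):
    """Return the number of indexed blocks after each of `jobs` jobs."""
    indexed = 0
    coverage = []
    for _ in range(jobs):
        unindexed = total_blocks - indexed
        indexed += math.ceil(unindexed * p)  # index p of the remaining blocks
        indexed = min(indexed, total_blocks)
        coverage.append(indexed)
    return coverage

print(run_jobs(total_blocks=100, p=0.25, jobs=6))
# → [25, 44, 58, 69, 77, 83]: each job pays for only a quarter of the rest
```

A larger p front-loads the indexing cost onto the first jobs; a smaller p spreads it thinner, which is exactly the trade-off the eager variant discussed below tunes automatically.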

LIAH performance for various p and on two cluster setups, compared to HAIL and Hadoop without an index

The performance gains are significant even for a small p and even after the first few jobs. The researchers found final performance gains over HAIL and Hadoop of 24 and 32 times, respectively. They also tested an eager version that adjusts p to the maximum that keeps each job’s runtime approximately at HAIL’s runtime. As expected, the initial jobs perform similarly to HAIL and then, once all data has been indexed, reach full-index performance. This would be an interesting feature for SLA-bound workloads: it bounds runtime while optimizing jobs as aggressively as possible.

Hadoop is maturing

Hadoop is still a young technology compared to DBMS. And that is exciting since current development shows that it has many opportunities for performance improvements and new applications. I am not aware of a push of HAIL into the mainstream Hadoop package (if you are, please write a comment). However, the benefits are significant, and I am confident that someone will soon be working on an incubator project to bring these benefits to the mainstream.

I can see great benefits in adaptive indexing, e.g. for ETL jobs that feed off incrementally growing, massive data sinks, without the need for data architects to come up with multiple representations of the data – in partitions, rows, and columns – to fit separate use cases on the same data source.

This article was first published as a series of posts on the Big Data Republic.

8 thoughts on “32x Faster Hadoop and MapReduce With Indexing”

  1. Reply amir kazemi Dec 18,2014 14:17

Hi Christian,

    Great article. Can Hadoop benefit from SSD or flash technologies, in addition to indexing, to make it run faster?
    There is a WSJ article that points out that when multiple users interact with Hadoop, it gets slow.


    • Reply Christian Prokopp Dec 18,2014 14:37

      Hi Amir,

Absolutely. Indexing is not really used in the wild. HDFS increasingly supports heterogeneous storage (ongoing development), which allows you to define where to keep the replicas of blocks, e.g. keep one copy on SSD, one on disk, one on archive (S3), to balance performance and availability.

The WSJ is behind a paywall, so I can’t read the article. More users will naturally create more load. The trick is to scale the cluster and give everyone guaranteed resources to ensure performance.

  2. Reply Bruce Sep 9,2015 15:33

If the industry is turning Hadoop into a database, then why not just start with a database? In order for the data to be indexed, the column must be structured. Why not just load the data into a system like Vertica? Vertica also supports semi-structured data with its flex tables. Everyone has realized that Hadoop is not free and is very slow, and now they have to find ways to cover up their mistakes by re-inventing Hadoop. It’s also fun to watch as everyone works for free to let someone else make all the money.

    • Reply Christian Prokopp Sep 20,2015 09:23

      Hi Bruce,

      There are some parallels and some aspects are being reinvented, true. However, the underlying premise of horizontal scalability and decoupling the resource management from storage and processing capabilities are new and change how data is utilised. In large organisations, the application of Hadoop goes far beyond anything any database could achieve on scale or functionality. And many people make a living working on open source software, e.g. Hortonworks has employed many of the committers to Hadoop (and its ecosystem) paying them handsomely.


  3. Reply Simon Dec 15,2015 13:22

Hi Christian,
    Great article, and thank you for sharing.

    Recently there has been new work on indexing in Hadoop at VLDB, which uses Generalized Search Trees to maintain different types of indices (http://www.vldb.org/pvldb/vol7/p1797-lu.pdf). What do you think of it?

    I am from China, studying for my master’s degree in CS, and I am going to do some work on this issue. Could you give me some advice?

    Thank you.


    • Reply Christian Prokopp Dec 18,2015 07:33

      Hi Simon,
Interesting paper. You will find that the industry is moving away from MapReduce towards technologies like Spark (and generally streaming) and in-memory caches/data structures. Hope that helps.

  4. Reply Peter Kulek May 27,2016 16:30

Looks like the new NoSQL database technologies are just rehashes of old technologies like navigational, graph, hierarchical, etc., with the same old problems.
    Storing large data is old technology, aka RAID, LVM, etc.
    Indexing large data of any type, e.g. fingerprints, documents, face recognition, geospatial, etc., has been around for decades.
    It is achievable with a relational database, which has nothing to do with SQL or RDBMS and can model graph, network, hierarchical, etc.

    • Reply Christian Prokopp Jun 6,2016 20:14

      Hi Peter,
Naturally, there are existing technologies that reappear in different guises within the new technologies. Brushing big data technologies off as relabeled old tech with no new capabilities is unfair, though.
