Hadoop 1.0 is increasingly challenged as slow and limited in its application, now that the hype is dying down. Marketing departments, riding the Big Data wave, wildly exaggerated Hadoop’s ability. Hadoop 2.0, surprisingly, is about to prove them somewhat right with two major developments.
Leaps in accessibility, but not performance
The first development concerns the ease and speed of accessing data. A few years ago, writing software for Hadoop was slow and cumbersome. You had to write Java software, and you had to understand how to fit it into the MapReduce framework. Luckily, this is the past. Today, we can extract, transform, and load data with a plethora of tools, making development much easier and faster.
On the lower level, Crunch is making Java-based MapReduce programming and testing easier. Pig lets you write data transformation in a high level data flow expression, abstracting away low level programming scaffolds. Hive brings SQL-like abilities to Hadoop, opening the data up to a large number of people. Alternative frameworks, especially around Python, emerged, like Hadoop streaming, mrjob, dumbo, hadoopy, and pydoop to name a few.
Accessibility and ease of programming of MapReduce has leaped forward in the last few years.
Projects like HBase, HCatalog, Zookeeper, Mahout, and others expanded the Hadoop ecosystem, catering to a wide variety of use-cases to fully utilize Hadoop’s HDFS and MapReduce core. That core, however, did not change significantly, and has become the bottleneck for new use-cases around cluster computation and resource management. It’s at the center of the criticisms of Hadoop being slow and limited.
There are some projects around the corner that will change this, and open Hadoop to new applications. Most importantly, Hadoop will become much faster and more flexible.
The change is coming from two directions. First, two data querying services building on top of Hadoop but bypassing MapReduce have emerged: Impala (in beta and built by Cloudera), and Drill (in development and supported by MapR). They draw their inspiration from Google’s Dremel paper and the subsequent Google BigQuery service.
The latter delivers interactive querying of billions of rows of data, often returning results faster than current Hadoop clusters can fire up the first of many MapReduce jobs resulting from an equivalent Hive query.
The speed is achieved by localizing query execution on, or close to, the cluster nodes as much as possible, exchanging as little data as late as possible between nodes, while accessing the data on the nodes directly without incurring MapReduce job overheads. Columnar-oriented storage is introduced to reduce the data loaded for a query. This can often greatly reduce data loading, since full table and row scans are rare in practice.
Localizing the execution takes advantage of local caches and fast memory, reducing the slow and expensive network interactions.
I had a chance to chat with a Google engineer involved with BigQuery at a conference last year. He divulged that BigQuery works not only because of the architectural changes, but also because of specialized hardware — super fast, specialized, tightly-coupled network switches between a very large number of nodes. Consequently, BigQuery performance is unlikely to be matched by small to medium-sized companies for a long time, despite the achievements made with Impala and Drill.
I wouldn’t be surprised to see large IaaS providers like Amazon fill the gap in a year or two with an Impala/Drill service on optimized hardware, following the successful Elastic MapReduce service.
The second development is best summarised by Hortonworks’ announcement of the Stinger Initiativelast week claiming to make Hive 100 times faster utilising and introducing new core Hadoop technologies. It could result in a near-BigQuery-style performance for a wide range of Hadoop users. That would enable ad hoc and interactive data querying for large datasets, something that requires sophisticated, large, traditional data warehouse setups. Hadoop clusters would instantly solve a highly significant use-case in many companies and potentially become dual use.
The core parts of the Stinger roadmap are the adoption of MapReduce 2.0 in the form of Apache YARN, columnar data storage, and Apache Tez. YARN splits the current rigid Jobtracker into a ResourceManager and ApplicationMaster. They can manage resources like CPU, RAM, and network capacity flexibly across a cluster and for many parallel running applications. For example, a CPU-intense and a memory-intense application could run in parallel on the cluster utilising each node’s CPU and memory optimally instead of relying on the current rigid, suboptimal map-and-reduce slot framework.
An interesting development is the release of Corona by Facebook as an alternative to YARN. Corona is proven to work reliably at the upper end of scalability needs. YARN is API-compatible with current MapReduce jobs and is already being delivered with Cloudera’s latest distribution (for testing purposes currently, not production). It has to be to see whether YARN or Corona will take over the market or if they will split the market between them.
Apache Tez highly optimises data processing applications, e.g., output of a Hive query or Pig program, by replacing the current paradigm of modelling every application as a directed acyclic graph of multiple MapReduce jobs. Currently, each job requires IO synchronisation and time and resources to be started and managed. Tez can express these applications as a single job with a directed acyclic graph of map-and-reduce tasks in arbitrary order — for example, a reduce task can feed directly into another reduce task without a (dummy) intermediate map task, IO synchronisation, or overheads for a new job. Lastly, ORCFile, proposed by Stinger as a columnar storage, can dramatically speed up and reduce the data needed to access for an application.
Together these improvements will change the way we perceive and use Hadoop. YARN (or Corona) as a cluster management framework will open Hadoop to many new data processing paradigms. The query acceleration in Hive and Pig makes Hadoop an option for interactive data processing and warehousing, and lowers costs on existing use-cases.
These changes will become part of production systems in the next couple of years, so I am expecting some interesting developments spinning off along the way. At the same time I am not discounting Impala and Drill. They are more focused in their applications and promise even greater performance gains. Hadoop is to here to stay, and there are exciting times ahead.
- Hadoop cluster cost of Amazon EC2 vs EMR
- Democratize Big Data With Hive
- 32x Faster Hadoop and Map Reduce With Indexing
- Get Started: Big Data Crunching in the Cloud