Big data architectures are commonly divided into two supposedly opposing models: traditional batch processing and (near) real-time processing. The most popular technologies representing the two are Hadoop with MapReduce and Storm. However, a hybrid solution, the Lambda Architecture, challenges the idea that these approaches have to exclude each other. It combines a slow lane and a fast lane of data processing to achieve the best of both worlds: fast results and deep, large-scale processing.
Usually one architecture or the other has been implemented because of a business requirement. Eventually, business users or customers commonly reach the point where they want either a more historical view or more real-time insight, whichever the deployed architecture cannot provide. At this point a hybrid becomes the only realistic solution, and one that brings some surprising benefits with it.
Lambda Architecture explained
The Lambda Architecture receives data centrally and does as little processing as possible before copying the data stream and sending one copy each to the real-time layer and the batch layer. The batch layer collects the data in its raw form in a data sink such as HDFS or S3. Hadoop jobs regularly process the data and write the results to a data store.
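The fan-out described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the `Ingest` class, the `batch_job` function, and the page-view events are hypothetical names invented for this example, and plain in-memory lists stand in for the raw data sink and for the stream feeding the real-time layer.

```python
from collections import Counter


class Ingest:
    """Hypothetical entry point: minimal processing, then fan out to both layers."""

    def __init__(self):
        self.raw_log = []      # stands in for the raw data sink (e.g. HDFS or S3)
        self.speed_queue = []  # stands in for the stream fed to the real-time layer

    def receive(self, event):
        # Each layer gets its own copy so they never share mutable state.
        self.raw_log.append(dict(event))
        self.speed_queue.append(dict(event))


def batch_job(raw_log):
    """Recompute the full view from all raw data, as a periodic batch job would."""
    return Counter(record["page"] for record in raw_log)


ingest = Ingest()
for page in ["home", "about", "home"]:
    ingest.receive({"page": page})

batch_view = batch_job(ingest.raw_log)
print(batch_view["home"])  # number of "home" events seen by the batch layer
```

The batch job always starts from the untouched raw log, which is what later makes recomputation possible.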
Since this process is fully batched, the data store can be simplified significantly. It should support random reads, i.e. it needs some kind of index; however, it can do away with random writes, locking, and consistency issues. An example of such a system is ElephantDB.
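To make the idea of a read-only, batch-written store concrete, here is a minimal sketch. It does not reflect how ElephantDB works internally; `BatchViewStore` and its methods are hypothetical names, and a Python dictionary plays the role of the index. Each batch run publishes a complete, immutable view, so readers never see a partial update and no locking is needed in this single-process toy.

```python
from types import MappingProxyType


class BatchViewStore:
    """Hypothetical read-only store: whole views are published, never edited."""

    def __init__(self):
        self._versions = {}   # version label -> immutable view
        self._current = None  # label of the view currently being served

    def publish(self, version, view):
        # Freeze a private copy of the view, then switch to it in one step.
        self._versions[version] = MappingProxyType(dict(view))
        self._current = version

    def get(self, key, default=0):
        # Random read via the dictionary index.
        if self._current is None:
            return default
        return self._versions[self._current].get(key, default)

    def rollback(self, version):
        # Serve an older view again, e.g. after a bad batch run.
        if version not in self._versions:
            raise KeyError(version)
        self._current = version


store = BatchViewStore()
store.publish("run-1", {"home": 2})
store.publish("run-2", {"home": 5})
latest = store.get("home")
store.rollback("run-1")
previous = store.get("home")
```

Keeping older versions around is one simple way to get the versioning and switch-over behaviour that a batch-written store allows.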
The problem with batch processing is the time it takes. The process above may take hours or days. In the meantime data keeps arriving, and downstream processes or services continue to work with information that is hours or days old. The real-time layer solves this by taking its copy of the data, processing it in seconds or minutes, and storing the results in a fast store that supports random reads and writes. This store is more complex since it has to be updated constantly.
The complexity of the real-time layer and its store is manageable since it only has to store and serve a sliding window of data, which needs to be roughly as long as the batch process takes. The results of both layers are merged, and real-time results are replaced by batch-layer data as soon as it becomes available. In many cases this allows the real-time process to work with good approximations, since its results are replaced by highly precise data within a short period.
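The merge step can be illustrated with a small sketch. It assumes that every event carries a timestamp and that each batch view comes with a cutoff timestamp marking the last event it covers; `SpeedLayer`, `query`, and the cutoff convention are assumptions made for this example, not part of any particular framework.

```python
from collections import Counter


class SpeedLayer:
    """Hypothetical real-time store holding only a sliding window of events."""

    def __init__(self):
        self.events = []  # list of (timestamp, key) tuples

    def add(self, timestamp, key):
        self.events.append((timestamp, key))

    def expire(self, cutoff):
        # Drop everything the batch layer has already accounted for.
        self.events = [(ts, k) for ts, k in self.events if ts > cutoff]

    def view(self, cutoff):
        # Count only events newer than the batch cutoff.
        return Counter(k for ts, k in self.events if ts > cutoff)


def query(key, batch_view, batch_cutoff, speed):
    """Merge: batch result up to the cutoff plus real-time result after it."""
    return batch_view.get(key, 0) + speed.view(batch_cutoff)[key]


speed = SpeedLayer()
speed.add(90, "home")   # already covered by the batch view below
speed.add(105, "home")
speed.add(110, "home")

# The batch view covers all events up to timestamp 100.
before = query("home", {"home": 10}, 100, speed)

# A new batch run finishes and now covers everything up to timestamp 110.
speed.expire(110)
after = query("home", {"home": 12}, 110, speed)
```

Both queries return the same total here, because the toy real-time counts are exact. In practice the real-time part may be an approximation that the next batch run corrects.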
Lambda Architecture benefits
Adding another layer to an architecture has major advantages. Firstly, historical data can be processed with high precision and involved algorithms without losing the short-term information, alerts, and insights provided by the real-time layer. Secondly, the cost of the additional layer is offset by a dramatic reduction in random-write storage requirements. The batch-written storage also offers the option to switch data sets at predefined times and to version data.
Lastly and importantly, the data sink of raw data offers the option to recover from human mistakes, e.g. deploying bugs that write erroneous aggregated data, from which other architectures cannot recover. Another option is to retrospectively enhance data extraction or learning algorithms and apply them to the whole historical dataset. This is extremely helpful in agile and startup environments, where minimum viable products (MVPs) postpone work that can be done later.
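A minimal sketch of this kind of recovery, assuming the raw log has been kept untouched: the URL records and both extraction functions are made up for illustration, with `extract_v1` standing in for a deployed bug and `extract_v2` for the fix (or for an improved algorithm).

```python
from collections import Counter

# Raw events exactly as they arrived; the batch layer never modifies them.
raw_log = [
    {"url": "/Home"},
    {"url": "/home"},
    {"url": "/about"},
]


def extract_v1(record):
    # Buggy extraction: treats "/Home" and "/home" as different pages.
    return record["url"]


def extract_v2(record):
    # Corrected extraction: normalises case before counting.
    return record["url"].lower()


def rebuild_view(log, extract):
    """Recompute the whole aggregated view from the raw data."""
    return Counter(extract(record) for record in log)


bad_view = rebuild_view(raw_log, extract_v1)   # what the bug produced
good_view = rebuild_view(raw_log, extract_v2)  # full recomputation after the fix
```

Had only the aggregates from `extract_v1` been stored, the split counts could not be repaired in general. With the raw log available, the fix is just another batch run.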
More information on the Lambda Architecture.