You have started to process data with cloud computing platforms like Amazon Web Services' (AWS) Elastic MapReduce (EMR). Now that you use it regularly, other stakeholders are getting curious. You increasingly find yourself firing up an EMR cluster to quickly answer a question or try something out. It may be time to change the way you work — time for a permanent cluster and easy access for other users.
In Get Started: Big Data Crunching in the Cloud, I described one path to get started with big data processing on EMR. We were processing our daily and periodical big data jobs at Rangespan in this manner until recently. Currently, we are transitioning to a permanent cluster utilizing AWS EC2 for the cluster nodes and Cloudera for the cluster installation.
Why switch to a permanent cluster?
First, our processing demands are growing. Besides daily ETL and periodic machine learning, we are increasingly generating reports and metrics. The latter two were originally implemented to run directly against our SQL and NoSQL stores. That was a good decision at the time: fast and easy to implement. But their number is growing, and reports and metrics often require scanning whole collections or databases, which can cost hours of processing time.
Reports and metrics interfere with fast transactional processes, and are better done in the background (a little more on that another time).
Second, and what I’d like to highlight here, is business reporting and utilizing big data by opening it up to the rest of the company — democratizing it, if you like.
Currently, we have an RDBMS instance mirroring our transactional data for our business team to query and develop ad-hoc reports against. However, it lacks the data from our NoSQL store, which is growing rapidly beyond 100GB — hundreds of millions of documents, and more in the long term. Mapping that data one-to-one into the RDBMS would become increasingly difficult and expensive, and would eliminate the benefits of NoSQL. Furthermore, our business team naturally works with aggregated views of the larger data sets. They need timely, consolidated data and flexibility, but not the full data sets.
Unfortunately, the transient EMR cluster does not lend itself well to developing such a view on the fly when inspiration strikes a business stakeholder or new issues arise:
- There is no user-friendly interface for business users.
- The cluster requires 10 to 20 minutes to start and additional processing time when the query involves temporary data from intermediate job-flow steps.
- The same issues apply to debugging the job-flow to investigate potential errors, which makes agile development difficult.
Hive allows access
We decided to trial a permanent cluster, and the difference it made to these challenges was tremendous. Of course, we do more data processing — more metrics and reports — but that is only one of the advantages. Most importantly, the workflow across the teams has improved dramatically.
The permanent cluster, in combination with Hive (developed at Facebook and now open-source), makes data exploration, reporting, and metrics trivial. Hive translates HiveQL (Hive’s SQL dialect) into map-reduce jobs. It opens big data up to a lot of people through a common skill — SQL.
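To illustrate, here is a sketch of the kind of aggregated view a business user might draft; the table and column names are hypothetical, not our actual schema:

```sql
-- Daily order count and revenue per supplier.
-- Table and column names are hypothetical, for illustration only.
-- Hive compiles this into one or more map-reduce jobs behind the scenes.
SELECT
  supplier_id,
  to_date(created_at) AS order_day,
  COUNT(*)            AS num_orders,
  SUM(total_price)    AS revenue
FROM orders
GROUP BY supplier_id, to_date(created_at);
```

Anyone comfortable with SQL can read and adapt a query like this, even though it executes as map-reduce over HDFS.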
Hive is much more than that, though. Advanced users in your data team can inject code into queries, adding a Python or Java program as a map or reduce step, enabling rapid development of powerful data processing pipelines.
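As a sketch of that pattern (script, table, and column names are hypothetical): HiveQL's `TRANSFORM` clause streams rows through a custom script, e.g. `SELECT TRANSFORM (product_id, title) USING 'python clean_titles.py' AS (product_id, title) FROM products;` after registering the script with `ADD FILE`. Hive passes rows to the script as tab-separated lines on stdin and reads the transformed rows back from stdout:

```python
# clean_titles.py -- hypothetical Hive streaming script, for illustration.
import sys

def transform(line):
    """Normalize one tab-separated row: lower-case the product title.

    Hive streams each row to the script as one tab-separated line.
    """
    product_id, title = line.rstrip("\n").split("\t", 1)
    return "%s\t%s" % (product_id, title.strip().lower())

if __name__ == "__main__":
    for line in sys.stdin:
        print(transform(line))
```

The same script works unchanged under plain Hadoop streaming, which makes these steps easy to test locally with `cat data | python clean_titles.py`.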
We have an excellent, tech-savvy business team, and they draft complex SQL queries before engaging with the data team. Working together, a data team member and a business team member usually need less than an hour to adjust the drafted query, test it, and create a job with Oozie. Oozie then generates the data on a schedule and exports it to our RDBMS, where the business team consumes it in day-to-day analysis and reporting and joins it with transactional data.
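For a rough idea of what such a job looks like, here is a minimal Oozie workflow wrapping a Hive script as an action. The names, script, and schema versions are a sketch, not our production configuration:

```xml
<!-- Sketch of an Oozie workflow running a Hive script; names are hypothetical. -->
<workflow-app name="daily-supplier-report" xmlns="uri:oozie:workflow:0.2">
  <start to="run-report"/>
  <action name="run-report">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>supplier_report.q</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <fail name="fail">
    <message>Report failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </fail>
  <end name="end"/>
</workflow-app>
```

In practice, a second action (Sqoop is a common choice) would handle the export to the RDBMS, and an Oozie coordinator would trigger the workflow on a daily schedule.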
The productivity and turnaround for ideas with this arrangement are stunning. The business team embraced the technology instantly, improving our business intelligence. At the same time, the set-up frees precious development time in the data and development teams. Notably, development and maintenance have also improved: we can access the data, develop queries, and investigate running jobs faster and with more ease, which saves many development days over the year.
One inefficiency remains, though. A business stakeholder still has to go through a data-team member to interact with the data via Hive. This is one reason we chose Cloudera's distribution for the cluster installation. It comes with Hue, which provides a web-based interface to develop Hive queries, check the status of map-reduce jobs, browse HDFS, develop Oozie workflows, and much more. We are now introducing the business team to this interface so they can query and export the data independently.