What is the price of a small Elastic MapReduce (EMR) vs an EC2 Hadoop cluster? This article explores the price tag of switching to a small, permanent EC2 Cloudera cluster from AWS EMR.
Cloud computing with Hadoop, whether on AWS EMR or EC2, makes experiments with temporary clusters and big data crunching easy and affordable. Such experiments usually surface use cases that benefit from a more permanent cluster, e.g., giving business teams access to big data or running long computations. This raises questions around cost and system management: which setup is efficient, avoids upfront capital investment, and is achievable with in-house know-how?
In a recent post, Democratize Big Data With Hive, I described why at Rangespan we moved away from transient Amazon Web Services Elastic MapReduce (EMR) clusters to a permanent cluster. The decision was born out of increasing demand for computing time and the lack of interactivity of a setup that required a long startup time and had no user-friendly interface to work with. Our cluster is comparatively modest, merely four m1.large EC2 instances on EMR, and there is significant uncertainty around how fast and how large it will grow in the future. Will we double our computing needs in the next year or raise them by an order of magnitude? This depends on many variables of our success. [Comment: We have doubled our cluster size within 3 months of writing the original post and we will continue to grow it.]
As a startup, we can invest capital only in proven products and would rather incur operational expenditure where possible, to retain the agility to grow, shrink, abandon, or set up architectures quickly in reaction to customer demand and product development.
The effortless solution would have been to continue with EMR and keep the cluster running 24/7. This was undesirable for two reasons. First, EMR costs $0.06/h per machine on top of the EC2 price, which comes to $2,102.40 per year for our four machines. Second, and more importantly, EMR is simple at the expense of flexibility.
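The EMR figure above is plain hourly-to-annual arithmetic. As a minimal sketch, assuming a 365-day year of 24/7 operation (the helper name annual_cost is hypothetical, chosen for illustration):

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_cost(hourly_rate, machines=1):
    """Yearly cost in dollars of running `machines` instances around the clock."""
    return round(hourly_rate * HOURS_PER_YEAR * machines, 2)


# EMR fee of $0.06/h per machine, four machines (figures from the article)
emr_fee = annual_cost(0.06, machines=4)
print(f"EMR fee per year: ${emr_fee:,.2f}")
```

The same helper works for any per-hour price, so it can be reused to sanity-check other always-on setups.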
Compare it with a distribution like Cloudera. It provides the latest software versions and added flexibility, such as simplified installation of additional services, e.g., Hue, YARN, ZooKeeper, HBase, Flume, and Impala. In particular, we wanted Hue, a browser-based interface to Hadoop and its services like Hive. It proved very beneficial for opening access to our data, supporting our cross-team development process, and improving business intelligence. Lastly, Cloudera comes with the Cloudera Manager, which streamlines cluster management tasks such as installing services or upgrading software clusterwide.
Consequently, we installed Cloudera on four m1.large EC2 instances, using an m1.micro for the manager installation. We mostly use Hue, Hive, Oozie, and Sqoop at the moment, but use cases for Flume and other services are already being discussed. The hassle-free installation of services with the Cloudera Manager is an added bonus when we want to experiment with them.
An EC2-based cluster is not a cheap proposition. An on-demand setup as described costs $9,285.60 (4 x $2,277.60 + $175.20) per year. Alternatively, buying high-utilization reserved instances for the m1.larges and a light-utilization reserved instance for the m1.micro drops the cost to $5,490.68 (4 x $1,340.64 + $128.12) per year, a reduction of more than 40 percent.
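The comparison between on-demand and reserved pricing can be reproduced from the per-instance annual figures quoted above. This is only a sketch; the function name cluster_cost is hypothetical, and the reserved figures are taken as given rather than split into upfront and hourly parts:

```python
def cluster_cost(worker_annual, manager_annual, workers=4):
    """Annual cost in dollars of `workers` worker nodes plus one manager node."""
    return round(workers * worker_annual + manager_annual, 2)


# Per-instance annual prices from the article (m1.large workers, m1.micro manager)
on_demand = cluster_cost(2277.60, 175.20)
reserved = cluster_cost(1340.64, 128.12)

# Relative reduction from switching to reserved instances
reduction = 1 - reserved / on_demand

print(f"On-demand: ${on_demand:,.2f}")
print(f"Reserved:  ${reserved:,.2f}")
print(f"Reduction: {reduction:.1%}")
```

The reduction comes out just under 41 percent, consistent with the "more than 40 percent" stated above.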
An alternative would be a mixed cluster with spot and on-demand instances, or a full spot-instance cluster. This requires that you can deal with losing the cluster (or parts of it) for a period of time: spot instances are terminated without warning when the market price rises above your bid. Such a setup can be made workable by, for example, retaining checkpoint data on S3. In this case you can achieve a cost as low as $2,295.12 (4 x $560.64 + $52.56) per year in the best-case scenario (current floor price of $0.064/h for m1.large in EU-West). That is a potential saving of more than 75 percent over on-demand, non-reserved instances. In the long run, we will discuss whether owning the hardware would be a more cost-effective solution. At the moment, however, we appreciate the flexibility we have with AWS.
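The best-case spot figure follows from the floor price held constant for a full year. A rough sketch, assuming 8,760 hours of uninterrupted operation at the floor price, which real spot markets will not guarantee; the m1.micro annual figure and the on-demand total are the ones quoted in the article:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

SPOT_FLOOR_M1_LARGE = 0.064   # $/h, EU-West floor price from the article
MICRO_SPOT_ANNUAL = 52.56     # $/year for the m1.micro, from the article
ON_DEMAND_ANNUAL = 9285.60    # $/year for the on-demand cluster, from the article

# Four m1.large workers at the floor price all year, plus the manager node
spot_best_case = round(4 * SPOT_FLOOR_M1_LARGE * HOURS_PER_YEAR + MICRO_SPOT_ANNUAL, 2)

# Potential saving relative to the on-demand cluster
saving = 1 - spot_best_case / ON_DEMAND_ANNUAL

print(f"Spot best case: ${spot_best_case:,.2f}")
print(f"Saving vs on-demand: {saving:.1%}")
```

Because the market price can rise above the floor at any time, the result is a lower bound on cost rather than an estimate.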
Lastly, such a setup suits the needs of a startup. Companies or departments trialing new products or changing architectures have the opportunity to pilot them with modest funds before applying for substantial investments. Furthermore, entire online service companies run in the cloud, as Netflix, Amazon’s poster child, demonstrates (as reported in InformationWeek). It operates nearly its whole business on EC2 using Hadoop and Cassandra clusters, growing and shrinking them with demand.