Hadoop cluster cost of Amazon EC2 vs EMR

What is the price of a small Elastic MapReduce (EMR) vs an EC2 Hadoop cluster? This article explores the price tag of switching to a small, permanent EC2 Cloudera cluster from AWS EMR.

Cloud computing with Hadoop, for example on AWS EMR or EC2, makes experiments with temporary clusters and big data crunching easy and affordable. Such experiments often surface use-cases that benefit from a more permanent cluster, e.g., giving business teams access to big data or running long computations. This raises questions around cost and system management. Which setup is efficient, avoids upfront capital investment, and is achievable with in-house know-how?

In a recent post, Democratize Big Data With Hive, I described why at Rangespan we moved away from transient Amazon Web Services Elastic MapReduce (EMR) clusters to a permanent cluster. The decision was born out of increasing demand for computing time and the lack of interactivity of a setup that required a long startup time and had no user-friendly interface to work with. Our cluster is comparatively modest, merely four m1.large EC2 instances on EMR, and there is significant uncertainty around how fast and how large it will grow in the future. Will we double our computing needs in the next year or raise them by an order of magnitude? This depends on many variables of our success. [Comment: We have doubled our cluster size within 3 months of writing the original post and we will continue to grow it.]

As a startup we can invest capital only in proven products and prefer operational expenditure where possible, to retain the agility to grow, shrink, abandon, or build architectures quickly in reaction to customer demand and product development.

The setup
The effortless solution would have been to continue with EMR and keep the cluster running 24/7. This was undesirable for two reasons. First, EMR costs $0.06/h per machine on top of the EC2 instance price, which comes to $2,102.40 per year for our four machines. Second, and more importantly, EMR is simple at the expense of flexibility.
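The EMR figure above follows from simple arithmetic, sketched below in Python. The helper name annual_cost and the 365-day year are assumptions made for illustration; only the $0.06/h rate and the four machines come from the article.

```python
from decimal import Decimal

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming a non-leap year


def annual_cost(hourly_rate: str, machines: int = 1) -> Decimal:
    """Yearly cost in USD of running `machines` instances around the clock."""
    # Decimal avoids binary floating-point rounding on currency values.
    return Decimal(hourly_rate) * HOURS_PER_YEAR * machines


# EMR fee quoted in the article: $0.06/h per machine, four machines.
emr_fee = annual_cost("0.06", machines=4)
print(f"EMR fee for four machines: ${emr_fee:,.2f} per year")
```

The same helper can be reused for any hourly price to see what an always-on cluster costs per year.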

Compare it with a distribution like Cloudera. It provides recent software versions and more flexibility, such as simplified installation of additional services, e.g., Hue, YARN, ZooKeeper, HBase, Flume, and Impala. In particular, we wanted Hue, a browser-based interface to Hadoop and its services like Hive. It proved very beneficial for opening access to our data, for our cross-team development process, and for improving business intelligence. Lastly, Cloudera comes with the Cloudera Manager, which streamlines cluster management tasks such as installing services or upgrading software clusterwide.

Consequently, we installed Cloudera on four m1.large EC2 instances, using a t1.micro for the manager installation. We mostly use Hue, Hive, Oozie, and Sqoop at the moment, but use-cases for Flume and other services are already being discussed. The hassle-free installation of services with the Cloudera Manager is an added bonus when we want to experiment with them.

The cost
An EC2-based cluster is not a cheap proposition. An on-demand setup as described costs $9,285.60 (4 x $2,277.60 + $175.20) per year. Alternatively, buying high-utilization reserved instances for the m1.large machines and a light-utilization reserved instance for the t1.micro drops the cost to $5,490.68 (4 x $1,340.64 + $128.12), a reduction of more than 40 percent.
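The comparison can be reproduced with a few lines of Python. This is a minimal sketch that takes the annual per-instance figures quoted above as given; the dictionary keys and the cluster_cost helper are hypothetical names, and the reserved prices are used as already annualized totals rather than split into upfront and hourly parts.

```python
from decimal import Decimal

# Annual per-instance prices in USD, as quoted in the article.
ON_DEMAND = {"worker": Decimal("2277.60"), "manager": Decimal("175.20")}
RESERVED = {"worker": Decimal("1340.64"), "manager": Decimal("128.12")}


def cluster_cost(prices, workers=4, managers=1):
    """Yearly cost of a cluster of worker nodes plus manager nodes."""
    return prices["worker"] * workers + prices["manager"] * managers


on_demand = cluster_cost(ON_DEMAND)
reserved = cluster_cost(RESERVED)
# Relative saving of reserved over on-demand, in percent.
reduction = (1 - reserved / on_demand) * 100

print(f"On-demand: ${on_demand:,.2f}")
print(f"Reserved:  ${reserved:,.2f}")
print(f"Reduction: {reduction:.1f}%")
```

Changing the workers argument shows how the gap widens in absolute terms as the cluster grows, which matters if the cluster is expected to double in size.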

Amazon Web Service Elastic MapReduce Prices

An alternative would be a mixed cluster with spot and on-demand instances, or a full spot-instance cluster. This requires that you can deal with losing the cluster (or parts of it) for a period of time. Spot instances are terminated without warning when the market price rises above your bid. Such a setup can be implemented, for example, by retaining checkpoint data on S3. In this case you can achieve a cost as low as $2,295.12 (4 x $560.64 + $52.56) per year in the best-case scenario (current floor price of $0.064/h for m1.large in EU-West). That is a potential saving of more than 75 percent over on-demand, non-reserved instances. In the long run, we will discuss whether owning the hardware would be more cost-effective. At the moment, however, we appreciate the flexibility we have with AWS.
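To get a feel for the range between a full on-demand and a full spot cluster, the following Python sketch computes the yearly cost for any mix. It assumes the spot price stays at the quoted floor all year, which is the best case and unlikely to hold in practice. The function mixed_cluster_cost and its parameters are hypothetical, introduced only for this illustration.

```python
from decimal import Decimal

# Annual per-instance prices in USD, as quoted in the article.
ON_DEMAND = {"worker": Decimal("2277.60"), "manager": Decimal("175.20")}
SPOT_FLOOR = {"worker": Decimal("560.64"), "manager": Decimal("52.56")}


def mixed_cluster_cost(spot_workers, total_workers=4, spot_manager=False):
    """Yearly cost when `spot_workers` of the workers run as spot instances."""
    if not 0 <= spot_workers <= total_workers:
        raise ValueError("spot_workers must be between 0 and total_workers")
    on_demand_workers = total_workers - spot_workers
    manager = SPOT_FLOOR["manager"] if spot_manager else ON_DEMAND["manager"]
    return (SPOT_FLOOR["worker"] * spot_workers
            + ON_DEMAND["worker"] * on_demand_workers
            + manager)


all_on_demand = mixed_cluster_cost(0)
all_spot = mixed_cluster_cost(4, spot_manager=True)
half_spot = mixed_cluster_cost(2)  # two spot workers, rest on-demand
saving = (1 - all_spot / all_on_demand) * 100

print(f"All on-demand: ${all_on_demand:,.2f}")
print(f"Half spot:     ${half_spot:,.2f}")
print(f"All spot:      ${all_spot:,.2f} ({saving:.1f}% saving)")
```

Keeping the manager and some workers on-demand is one possible design choice for a mixed cluster: it costs more than the best case but leaves part of the cluster running when spot capacity is withdrawn.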

Lastly, such a setup suits the needs of a startup. Companies or departments trialing new products or changing architectures have the opportunity to pilot them with modest funds before applying for substantial investments. Furthermore, entire online service companies run in the cloud. Netflix, Amazon’s poster child, is an example: as reported in InformationWeek, it operates nearly its whole business on EC2 using Hadoop and Cassandra clusters, growing and shrinking them with demand.

This article was written by Christian Prokopp for and first published by the Big Data Republic.

4 thoughts on “Hadoop cluster cost of Amazon EC2 vs EMR”

  1. Dean Wee, Jun 22, 2013 13:42

    Hi Christian,

    I like your writeup with the cost comparison between EC2 and EMR.

    Is this the best direction to go, versus purchasing your own cloud infrastructure in the long term?

    I know I played with the EC2 a couple years ago and it was nice that I didn’t have to purchase and power/cool any machines in my office. But my short time using it didn’t allow me to see any other advantages.

    Is the cost savings in having system administrators available to support the servers? Or is that the trade-off?

    Thanks,

    Dean

    • Christian Prokopp, Jun 23, 2013 08:55

      Hi Dean,

      It depends on your use-case. If you need a server 24/7 and need it to be easily replaceable or highly available (HA), then cloud computing can make sense if you don’t have enough machines to justify the expense of a sysadmin. If you can live with downtime then your own server plugged into your Internet connection may do. Even in this case a spot-instance server, especially for light loads, may be as cheap as a few USD/month, which is hard to beat. The extreme version of that is spiking load for data mining, where you may want to spawn hundreds or thousands of tiny instances to crawl the web or extract data. If this occurs at specific times or intervals then cloud computing can make your life easy and reduce costs (a good example is EMR).

      If you have a defined (high) continuous load, e.g. a well utilised Hadoop cluster of significant size, then buying your own HW and employing a sysadmin quickly makes economic sense.

      Cheers,
      Christian

  2. Shailesh, Aug 8, 2013 18:00

    Hi Christian,

    I guess in your case data is not dictating the cluster sizing at all, which is not always the case. When you build a dedicated cluster, for example using EC2, your data resides in HDFS, which requires you to maintain 3 copies of the data. This essentially leads to a bigger cluster on EC2 (on top of this, it's advisable to add machines when HDFS usage reaches 60%). In the EMR case the data resides in S3, where you pay for a single copy of the data, so you just need to size your cluster for your processing needs. So a dedicated cluster on EC2 would not always be cheaper than EMR.

    Am I missing something?

    Thanks,
    Shailesh

    • Christian Prokopp, Aug 10, 2013 10:43

      Hi Shailesh,

      You are right and wrong. You could use an EC2 setup to do the same as with an EMR setup, i.e. store the input and output data on S3 and stop the cluster when no processing is needed.

      However, EMR is very comfortable in this scenario, especially if your frequency of data processing is low, e.g. every week or month. In such a case it may not be economical to deploy an EC2 cluster with your own Puppet/Chef/Ansible/Salt configuration, since that incurs development and maintenance costs potentially outweighing the EMR costs.

      Cheers,
      Christian
