Big data is not a starting point but a destination for many startups and teams. It becomes a conversation point as a result of the growing amount of data and the technical challenges in handling it, or perhaps a business requirement to extract value from a large dataset.
For example, at Rangespan, we use both SQL and NoSQL data stores: the former for transactional information and the latter for multifaceted inert facts like product descriptions. Early on, we faced challenges combining the growing data into data products in a timely way and analyzing it for technical and business metrics. We process around 100 GB every day, and the volume is growing, so we require a scalable solution. At the same time, our business cases are evolving, and we want to embrace the uncertainty. Consequently, we require the flexibility to scale up or down, or to change technologies as needed, without betting our capital on one infrastructure.
The cloud offers freedom
Like many other companies, we achieve this at Rangespan with cloud computing. We rent from Amazon Web Services, avoiding any long-term investments. We chose the Elastic MapReduce (EMR) service. It allows us to spin up a Hadoop cluster for our periodic, intermittent computational needs and pay only for the machines, services, and time we use. The EMR job flow model allows us to break a cluster workload into smaller jobs and queue them in a specific order. You can choose from Apache Hadoop, Hive, and Pig processing steps, which gives you a great Swiss army knife to get started.
Amazon has integrated EMR with its EC2 and S3 services, so EMR can read from and write to S3 and utilize EC2 spot instances.
Making the numbers work
To estimate costs, consider an example similar to our own experience. Imagine a daily processing job in the eastern US that takes three hours on five m1.large EC2 instances. Each instance costs $0.26/hour for EC2 plus $0.06/hour for running EMR, or $0.32/hour in total. The total cost is $4.80/day, or $1,752/year. You can save 50-90 percent of the EC2 instance costs by using spot instances, which let you bid for idle capacity.
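The arithmetic behind this estimate can be written down as a short Python sketch. The rates and job size are the ones quoted above. The function name daily_cost and its spot_discount parameter are illustrative, and the sketch assumes that a spot discount applies only to the EC2 portion of the bill, not to the EMR charge.

```python
# Figures from the example: five m1.large instances, three hours a day.
INSTANCES = 5
HOURS_PER_DAY = 3
EC2_RATE = 0.26  # USD per instance-hour for EC2
EMR_RATE = 0.06  # USD per instance-hour for running EMR


def daily_cost(instances, hours, ec2_rate, emr_rate, spot_discount=0.0):
    """Return the daily cost in USD.

    spot_discount is the assumed fraction saved on the EC2 portion
    (0.5 means 50 percent); the EMR charge is left unchanged.
    """
    ec2 = instances * hours * ec2_rate * (1.0 - spot_discount)
    emr = instances * hours * emr_rate
    return ec2 + emr


on_demand = daily_cost(INSTANCES, HOURS_PER_DAY, EC2_RATE, EMR_RATE)
print(f"on-demand: ${on_demand:.2f}/day, ${on_demand * 365:,.2f}/year")

# The 50-90 percent savings range quoted for spot instances.
for discount in (0.5, 0.9):
    spot = daily_cost(INSTANCES, HOURS_PER_DAY, EC2_RATE, EMR_RATE, discount)
    print(f"spot at {discount:.0%} off: ${spot:.2f}/day, ${spot * 365:,.2f}/year")
```

With no discount, the sketch reproduces the $4.80/day and $1,752/year quoted above. The spot figures depend entirely on the discount you actually obtain.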
If you want to cut down on processing time, you can increase the number of machines, which reduces the processing time roughly in proportion. Of course, with the overhead of starting machines, of MapReduce as a framework, and of distributing tasks in Hadoop, your mileage may vary. Lastly, you have to add expenses such as S3 storage and data transfer. The costs should be manageable, though, unless you move petabytes, in which case EMR might not be the best solution for you.
This model may fit you well if your business has a big data use case with an intermittent or periodic need for large-scale number crunching. You can find examples nearly everywhere: banks, logistics, retail, supermarket chains, advertising, and many more. They compute, update, and improve models regularly: for fraud detection, to simulate and optimize warehousing and distribution, for clickthrough and price predictions, to process and transform (web) logs about customer interactions, to create periodic reports, for OLAP generation, etc.
On the other hand, if you need continuous processing like streaming of data and live update of models, then a permanent cluster may be more economical.
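A rough break-even calculation illustrates where that tipping point might lie. The Python sketch below is hypothetical: it assumes a permanent cluster of the same size pays the on-demand EC2 rate around the clock with no EMR charge, and it ignores reserved-instance pricing, operations effort, storage, and data transfer. The rates are the m1.large figures from the cost example, and the function names are illustrative.

```python
HOURS_IN_DAY = 24


def transient_daily_cost(instances, busy_hours, ec2_rate, emr_rate):
    """Cost of a cluster that runs only while jobs are processing."""
    return instances * busy_hours * (ec2_rate + emr_rate)


def permanent_daily_cost(instances, ec2_rate):
    """Cost of an always-on cluster (assumption: EC2 rate only, no EMR fee)."""
    return instances * HOURS_IN_DAY * ec2_rate


def break_even_hours(ec2_rate, emr_rate):
    """Busy hours per day at which both models cost the same."""
    return HOURS_IN_DAY * ec2_rate / (ec2_rate + emr_rate)


hours = break_even_hours(0.26, 0.06)
print(f"break-even at about {hours:.1f} busy hours per day")
print(f"transient, 3 h/day: ${transient_daily_cost(5, 3, 0.26, 0.06):.2f}")
print(f"permanent, 24 h/day: ${permanent_daily_cost(5, 0.26):.2f}")
```

Under these assumptions, the transient model stays cheaper until the cluster is busy for most of the day. Real pricing options for long-running machines would move the break-even point earlier.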
We used this approach at Rangespan for more than a year to compute large machine-learning tasks periodically, as well as to handle daily data transformation and analysis jobs. As our use of the service increased, we found one limitation: a lack of interactivity slowed us down. Raising a cluster takes several minutes. Consequently, developing complex job flows, and particularly debugging them, can be time-consuming, and iterating over new ideas with business stakeholders is limited.
To address this, we are transitioning to a permanent cluster and moving our daily jobs over. We are using Hive to interact with the data quickly, and we have already become much more responsive to emerging reporting requests.