The popularity of Big Data lies in its broad definition: data sets of high volume, velocity, and variety that are difficult to manage and extract value from. Unsurprisingly, most businesses can see themselves facing Big Data challenges and opportunities, now or in the future. The issue is not new, yet it has taken on a new quality as it has intensified in recent years. Cheaper storage, ubiquitous data collection, and the availability of third-party data have outpaced the capabilities of traditional data warehouses and processing solutions. Businesses investigating Big Data regularly recognize that they lack the capacity to process and store it adequately. This shows either as an inability to utilise existing big data sets to the fullest or as an inability to expand the current data strategy with additional data.
Today, as a consequence of the Big Data trend, businesses can turn to Big Data as a Service (BDaaS) solutions to bridge the storage and processing gap. Interestingly, a definition and classification of BDaaS are still missing, and various types of services compete in the space with very different business models and foci. Businesses investigating Big Data and BDaaS would be well served to review the types of services and how they align with their business goals before drilling down and evaluating individual services. What are the different types of BDaaS available?
Three layers of cloud computing as a service
Big Data as a Service is sometimes wrongly equated with Hadoop as a Service and cloud computing. Public and hybrid cloud offerings are developing most rapidly, which is natural given the sizeable market accessible to them and their ability to leverage existing technologies and infrastructures. Hadoop is currently the most prominent distributed storage and compute environment. As a result, the two are the most popular technologies enabling and supporting most BDaaS offerings in the market, yet they don't define it.
Big Data as a Service is in the company of countless as-a-Service offerings. Three of them are the most significant and allow us to classify any other service. Infrastructure as a Service (IaaS), e.g. virtual machines, networks, storage, or servers, is the most basic building block and includes anything (real or virtual) you would expect inside a data center. One level up is Platform as a Service (PaaS), which includes commonly employed software like web and database servers, or Hadoop and its ecosystem. Next up is Software as a Service (SaaS), which covers services that are still generic but more user facing, like web email, content management, or customer relationship management systems. Finally, beyond SaaS there are usually domain or business specific applications.
Hadoop, or an alternative distributed compute and storage technology at the platform level, naturally forms the core of a BDaaS. Consequently, any BDaaS solution includes the PaaS layer and potentially SaaS and/or IaaS. This leaves us with four possible combinations for BDaaS:
- PaaS only – focusing on Hadoop
- IaaS and PaaS – focusing on Hadoop and optimised infrastructure for performance
- PaaS and SaaS – focusing on Hadoop and features for productivity and exchangeable infrastructure
- IaaS and PaaS and SaaS – focusing on complete vertical integration for features and performance
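The count of four follows from simple enumeration: take every non-empty subset of the three layers and keep those that contain PaaS. The sketch below is only an illustration of that reasoning; the function name `bdaas_combinations` and its parameters are made up for this example and are not part of any product or library.

```python
from itertools import combinations

# The three service layers discussed above, ordered from lowest to highest.
LAYERS = ("IaaS", "PaaS", "SaaS")


def bdaas_combinations(layers=LAYERS, required="PaaS"):
    """Return every non-empty layer combination that includes the required layer."""
    result = []
    for size in range(1, len(layers) + 1):
        for combo in combinations(layers, size):
            # A BDaaS always contains the platform layer (Hadoop or similar).
            if required in combo:
                result.append(combo)
    return result


if __name__ == "__main__":
    for combo in bdaas_combinations():
        print(" and ".join(combo))
```

Of the seven non-empty subsets of three layers, the three without PaaS (IaaS alone, SaaS alone, IaaS with SaaS) drop out, which leaves the four combinations listed above.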
Four Big Data as a Service Business models
The core BDaaS implements the minimal platform, e.g. Hadoop with YARN and HDFS and a few popular services like Hive. Amazon Web Services' Elastic MapReduce (EMR) is the most prominent core BDaaS and representative of this model. EMR is one of countless services in Amazon's offering and integrates well with many of the others, like the NoSQL store DynamoDB or S3 storage. Users can combine them to build anything from data pipelines to full company infrastructures around the EMR service. However, the strength of Amazon, the composability of its services, also means that the core BDaaS offering is meant to stay generic so that it can interact with the rest of the services.
One path of vertical integration for BDaaS is downwards, to include an optimised infrastructure. This makes it possible to do away with some of the overhead of virtualisation and to build hardware servers and networks specifically to cater to Hadoop's performance needs.
Altiscale, a startup by Yahoo's former CTO Raymie Stata, has done exactly this. It serves businesses that understand and work with Hadoop, are growing, and are held back by scale and complexity. They can outsource their infrastructure and platform needs, and the management around Hadoop, to Altiscale. Businesses can then focus on putting Hadoop to work and on the stack from SaaS upwards. Altiscale has data centers on the US west and east coasts and dedicated connections to Amazon Web Services for its customers. A package pricing approach based on storage and compute usage aims to remove the common headache of choosing between performance and cost optimisation, and to give predictable, fixed costs.
The other path of integration for BDaaS is upwards, to include features beyond the common Hadoop ecosystem offerings. Qubole, a startup founded by Ashish Thusoo and Joydeep Sen Sarma, who led and built Facebook's data infrastructure team, has taken this approach. The feature driven BDaaS focuses on productivity and abstraction to get users started with Big Data quickly. The offering includes web and programming interfaces as well as database adapters, which push technologies like Hadoop into the background and extend the service into the SaaS layer. In fact, Hadoop clusters are started, scaled, and even stopped transparently as load requires.
Like the core BDaaS, the feature approach uses IaaS to provide computing and storage, though with a significant difference. Independence from any one cloud provider allows a feature BDaaS to view computing and storage as a fully scalable and, more importantly, exchangeable commodity like electricity or water. Qubole, for example, already supports Amazon's and Google's IaaS. Qubole prices its service as pay as you go or as prepaid packages. Interestingly, the compute and storage costs from the IaaS are passed through on a pay-as-you-go basis, which makes the model ideal for very variable, unpredictable, or exploratory workloads.
Lastly, another option is a fully vertically integrated BDaaS that combines the performance and feature benefits of the previous two. I am not aware of any service that does this at this point (drop me a line if you are). In theory, this is an appealing approach since it could result in the perfect BDaaS: one that is productive, supports both business users and experts, and provides maximum performance. At this point it does not look like either Altiscale or Qubole is planning to expand into the remaining service layer, and for good reason. Both the feature and the performance BDaaS are at early stages, and the integrated BDaaS could in practice turn out to be an attempt to square the circle.
As Big Data matures as a topic, business and service models are emerging, and we can see the advantages and differences between the three competing types of Big Data as a Service: core, performance, and feature. The core BDaaS has been around for a few years and is in use by many companies, especially as part of a larger architecture or for irregular workloads. It has settled as a model supporting the provider's wider service architecture.
The feature and performance BDaaS attack the segment with very different value propositions, and there are good reasons for both of them to continue to attract customers. Both will have to address some features of the other in the long run. For example, the feature BDaaS needs to prove that it is competitive on performance, though commoditization and service level abstraction mean that, at the end of the day, the winning model is not the one that squeezes the most performance from comparable hardware but the one that delivers the most on a dollar-for-dollar basis. The performance BDaaS will face business demands from companies that are decreasingly willing to take on the complex challenges of building their own data architecture and related SaaS layer, and that increasingly want to focus on their value adding, domain specific processes. So while neither of the semi-integrated BDaaS approaches wants to square the circle, customer demand may yet push them to try.
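The dollar-for-dollar comparison can be made concrete with a small calculation: divide the work a service completes per hour by what that hour costs. The sketch below is an assumption-laden illustration only. The function `jobs_per_dollar` is hypothetical, and the throughput and price figures are invented for the example; they do not describe Altiscale, Qubole, or any other real service.

```python
def jobs_per_dollar(jobs_per_hour, price_per_hour):
    """Return how many jobs one dollar buys at the given throughput and price."""
    if price_per_hour <= 0:
        raise ValueError("price_per_hour must be positive")
    return jobs_per_hour / price_per_hour


# Hypothetical numbers, chosen only to illustrate the argument:
# a performance-optimised service that is faster but pricier per hour ...
performance_service = jobs_per_dollar(jobs_per_hour=120, price_per_hour=10.0)
# ... and a feature-oriented service on commodity IaaS that is slower but cheaper.
feature_service = jobs_per_dollar(jobs_per_hour=90, price_per_hour=6.0)

if __name__ == "__main__":
    print(f"performance BDaaS: {performance_service:.1f} jobs per dollar")
    print(f"feature BDaaS:     {feature_service:.1f} jobs per dollar")
```

With these made-up inputs the slower service delivers more work per dollar, which is the sense in which raw performance on comparable hardware does not by itself decide the comparison.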