Hadoop, known to be both powerful and challenging to manage, is increasingly available as a service in numerous varieties. Initially, do-it-yourself distributions like Cloudera, MapR, and Hortonworks made up a great part of the market. In recent years, following the success of Amazon Web Services' Elastic MapReduce (EMR), Hadoop and data services like Qubole have become popular. Last year, another entrant quietly joined the field, claiming to have an even better answer to your Hadoop needs: meet Altiscale (Altiscale was acquired by SAP in 2016).
Altiscale was founded by Raymie Stata, who worked for seven years at the birthplace of Hadoop, Yahoo. Raymie was Yahoo's chief architect and CTO and certainly brings significant technical and business experience to Altiscale, something that impressed investors enough to fund the company with $12 million.
You may not have heard about Altiscale yet. They are still in the stage of building up the core business. At the moment they have data centres on the west and east coasts of the USA, with high-speed connections to AWS to ensure that AWS customers get fast transfers to and from their new clusters. The US-only footprint can be a limiting factor for European customers, who may have to process data within the European Union. Altiscale will have to address this and will probably add other data processing locations. This would have the added benefit of reducing their geographic risk exposure.
Hadoop as a Service, on metal
Altiscale sets itself apart from the likes of EMR and Qubole by going a little against the trend: it does not use (hypervisor) virtualization. Cloud service providers use virtualization to commoditize computing resources, i.e. to run somewhat standardized virtual servers on a variety of real hardware. Virtualizing these commodity services, however, incurs a performance penalty because of the additional abstraction layer between the operating system and the hardware.
Altiscale claims that it can offer much better-performing services by using containers instead of virtualization (although some might call containers just another sort of virtualization). Containers effectively share the same operating system on a machine but have separate resources, e.g. two guests would use the same Linux installation but each would have its own filesystem, and so on. This approach promises better performance than hypervisor virtualization.
Additionally, using their own hardware rather than cloud services like EC2 allows Altiscale to optimize the network and servers for the specific needs of Hadoop deployments. According to Altiscale, the first customers (at the moment still running on separate clusters and not on shared servers with containers) have experienced speedups of up to tenfold over their previous cloud deployments.
Hadoop-as-a-Service removes the choice of distribution from customers, which can be limiting. Currently, Altiscale is aiming to support BigTop releases of Hadoop. BigTop is an Apache project focusing on packaging and testing Hadoop releases. Altiscale focuses on Hadoop releases with YARN and aims to provide two versions of Hadoop, a stable one and the latest one, which should cover most customers' requests.
Another way Altiscale attempts to differentiate itself is in how it bills its services. Rather than the usual pay-as-you-go model, it charges something more like a monthly flat rate. The idea might be tempting to businesses. Firstly, it provides a switch from capital to operational expenditure, something other Hadoop cloud services offer too. Beyond that, the flat rate allows predictability, since all services are included and storage, machine hours, services, etc. are not billed separately. The details are not publicly available yet. You should, however, expect different packages with different price bands for different usage types.
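To make the predictability argument concrete, here is a minimal sketch comparing a metered bill with a flat rate over three months. Every number in it (hourly rate, storage rate, flat rate, and usage) is a hypothetical value chosen for illustration; Altiscale's actual pricing is not public, and the function name `pay_as_you_go_cost` is invented for this example.

```python
# All rates and usage figures are hypothetical, for illustration only.
HOURLY_RATE = 0.5      # assumed price per machine hour
STORAGE_RATE = 30.0    # assumed price per TB stored per month
FLAT_RATE = 1500.0     # assumed all-inclusive monthly price


def pay_as_you_go_cost(machine_hours, storage_tb):
    """Metered bill: machine hours and storage are charged separately."""
    return machine_hours * HOURLY_RATE + storage_tb * STORAGE_RATE


# (machine hours, TB stored) for three example months
monthly_usage = [(1000, 10), (2500, 12), (1500, 15)]

metered = [pay_as_you_go_cost(hours, tb) for hours, tb in monthly_usage]
flat = [FLAT_RATE] * len(monthly_usage)

# The spread between the cheapest and the most expensive month is a
# simple measure of how hard the bill is to budget for.
metered_spread = max(metered) - min(metered)
flat_spread = max(flat) - min(flat)

print("metered:", metered, "spread:", metered_spread)
print("flat:   ", flat, "spread:", flat_spread)
```

With these made-up figures the metered bill swings from month to month while the flat rate stays constant. Whether the flat rate is also cheaper overall depends entirely on the real prices and the actual usage pattern.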
Data inertia is a challenge
The entrance of another Hadoop-as-a-Service provider increases the choice for customers. Providing the service outside of AWS offers performance benefits that may attract customers. On the other hand, the inertia of big data could be a tremendous stumbling block for Altiscale. It could be overcome by spanning data pipelines across providers. It is imaginable, for example, to source data from cloud services, e.g. from data mining or primary services, store and archive it on services like S3 and Glacier, and make incremental exports to providers like Altiscale, which process the raw data and return aggregated, enriched views to services in the cloud.
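The incremental export step of such a pipeline could be sketched as follows. This is a minimal illustration and not anything Altiscale or AWS provides: `StoredObject` and `select_incremental_export` are hypothetical names, the archive is an in-memory list standing in for an object store such as S3, and the "watermark" (the timestamp of the last exported object) is an assumed bookkeeping technique.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class StoredObject:
    """Hypothetical stand-in for an object archived in a store such as S3."""
    key: str
    last_modified: datetime


def select_incremental_export(objects, watermark):
    """Return the objects modified after the watermark, oldest first,
    together with the watermark to use for the next export run."""
    batch = sorted(
        (obj for obj in objects if obj.last_modified > watermark),
        key=lambda obj: obj.last_modified,
    )
    new_watermark = batch[-1].last_modified if batch else watermark
    return batch, new_watermark


def utc_day(day):
    """Helper for the example data: midnight UTC on a day in January 2014."""
    return datetime(2014, 1, day, tzinfo=timezone.utc)


archive = [
    StoredObject("logs/2014-01-01.gz", utc_day(1)),
    StoredObject("logs/2014-01-02.gz", utc_day(2)),
    StoredObject("logs/2014-01-03.gz", utc_day(3)),
]

# The first object was already shipped in an earlier run.
last_export = utc_day(1)
batch, last_export = select_incremental_export(archive, last_export)
exported_keys = [obj.key for obj in batch]

# A second run with no new data yields an empty batch.
second_batch, last_export = select_incremental_export(archive, last_export)
```

Only the objects newer than the watermark cross the provider boundary, which keeps the amount of data moved per run small. That is the point of exporting incrementally rather than migrating the whole archive.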