A recent question in the Hortonworks Community mentioned someone running Hadoop in a virtualized environment on EMC’s Isilon NAS (Network Attached Storage). While this may be a valid use case for some, anyone who sees Hadoop as more than a small number-crunching cluster should reflect on this approach. Here are some points (and there are more) to think about before you go down the path of VMs and NAS for Hadoop:
The scale of storage and the associated cost are often a fundamental decision point. We had customers who were happy with Isilon and wanted to use it for their data lake. However, once they evaluated their future storage needs and the cost of Isilon versus DAS (Direct Attached Storage), they quickly changed their minds. Another aspect of cost is support and licensing: many vendors have a node-based pricing model, and running large numbers of (virtual) nodes drives up your cost.
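To make the cost argument concrete, here is a back-of-the-envelope sketch. The prices and overhead factors are purely illustrative assumptions (not vendor quotes): commodity DAS is cheap per raw TB but HDFS triple replication multiplies the raw capacity needed, while NAS typically uses erasure coding with lower overhead but a much higher price per raw TB.

```python
# Hypothetical back-of-the-envelope storage cost comparison.
# All prices and overhead factors are illustrative assumptions.

def cluster_storage_cost(usable_tb, cost_per_raw_tb, overhead_factor):
    """Cost of the raw capacity needed to provide `usable_tb` of usable space."""
    raw_tb = usable_tb * overhead_factor
    return raw_tb * cost_per_raw_tb

usable = 1000  # 1 PB usable, as an example target

# DAS: assumed $30 per raw TB, HDFS default 3x replication
das = cluster_storage_cost(usable, cost_per_raw_tb=30, overhead_factor=3)

# NAS: assumed $250 per raw TB, erasure-coding overhead of ~1.25x
nas = cluster_storage_cost(usable, cost_per_raw_tb=250, overhead_factor=1.25)

print(f"DAS: ${das:,.0f}  NAS: ${nas:,.0f}  ratio: {nas / das:.1f}x")
```

Even with replication tripling the raw capacity, the assumed commodity pricing keeps DAS well ahead; the point is that the gap widens as usable capacity grows, which is exactly what the customers in question discovered.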
Using virtualization and Isilon increases the complexity of your infrastructure. Some argue that these technologies are already in place and therefore add no extra effort. However, when you have to find the cause of a non-obvious performance issue, you now have two more places to look – virtualization and Isilon – and, worse, the interactions between all of these technologies and the Hadoop ecosystem.
Performance: virtualization comes at a cost to your infrastructure. While some software providers try to convince you of performance gains, that is an unlikely scenario, and the benchmarks I have seen are cherry-picked. Furthermore, sharing your infrastructure carries the risk of noisy neighbours. Running multiple virtual Hadoop nodes on one physical host can have dangerous consequences: the loss of that physical node takes several Hadoop nodes down at once, and in the case of DAS with virtualization, replicas of a block may end up on VMs sharing the same physical machine, so you can lose data. Also, memory size per node has been steadily increasing in Hadoop deployments to take advantage of technologies like Spark and to run memory-intensive computations and caches. That can become inefficient when VMs slice up the hosts and an unnecessary number of virtual nodes generates overhead or limits computational capability.
There is a natural tipping point in every organisation where the above challenges make it worthwhile to run Hadoop as a dedicated infrastructure project, even if that breaks with longstanding shared-storage and virtualization practices. Usually that point is reached when the overhead of virtualization and shared storage becomes more expensive than running bare metal and reaping the benefits. So consider where your deployment may be in the next few years: if it will remain a small cluster (or a few), don’t worry. If you are building a large cluster, do consider all of the above aspects.