Virtualizing Hadoop with NAS

A recent question in the Hortonworks Community mentioned someone using Hadoop in a virtualized environment with EMC’s Isilon NAS (Network Attached Storage). While this may be a valid use case for some, anyone who is looking at Hadoop as more than a small number-crunching cluster will have to reflect on this approach. Here are some (and there are more) points to think about before you go down the path of VMs and NAS for Hadoop:

The scale of storage and the associated cost are often a fundamental decision point. We had customers who were happy with Isilon and wanted to use it for their data lake. However, once they evaluated their future storage needs and the cost of Isilon versus DAS, they quickly changed their minds. Another aspect of cost is support and licensing: many vendors have a node-based cost model, and running large numbers of (virtual) nodes drives up your cost. The sketch below shows the kind of back-of-the-envelope comparison that changed those customers' minds.
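As a rough illustration, here is a minimal sketch of such a storage cost comparison. The per-TB prices, the 3x HDFS replication factor for DAS, and the assumption that the NAS is priced per usable terabyte are all hypothetical figures for illustration, not vendor quotes.

```python
# Back-of-the-envelope comparison of storage cost: commodity DAS with HDFS
# 3x replication versus a NAS appliance priced per usable TB.
# All prices below are hypothetical, for illustration only.

def das_cost(usable_tb, price_per_raw_tb=30.0, replication=3):
    """DAS: raw capacity to buy is usable capacity times the HDFS replication factor."""
    return usable_tb * replication * price_per_raw_tb

def nas_cost(usable_tb, price_per_usable_tb=250.0):
    """NAS appliance: assumed to be priced per usable TB (protection handled internally)."""
    return usable_tb * price_per_usable_tb

if __name__ == "__main__":
    for tb in (50, 200, 1_000, 5_000):
        print(f"{tb:>5} TB usable: DAS ~${das_cost(tb):>12,.0f}   NAS ~${nas_cost(tb):>12,.0f}")
```

With these made-up numbers the gap is already visible at tens of terabytes and becomes dramatic towards petabyte scale; plug in your own quotes to see where your deployment lands.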

Using virtualization and Isilon increases complexity in your infrastructure. Some argue that the technologies are already in place and thus require no additional effort. However, when you have to find the cause of an unobvious performance issue, for example, you now have two more places to look (virtualization and Isilon) and, worse, the interactions between all of these technologies and the Hadoop ecosystem.

Performance: virtualization has a cost to your infrastructure. While some software providers try to convince you of performance gains, this is an unlikely scenario, and the benchmarks I have seen are cherry-picked. Furthermore, sharing your infrastructure carries the risk of noisy neighbours. Splitting one physical host into multiple (virtual) Hadoop nodes can have dangerous impacts, e.g. the loss of the physical node takes several Hadoop nodes with it (and in the case of DAS with virtualization you can lose data). Also, memory size per node has been steadily increasing in Hadoop deployments to take advantage of technologies like Spark and to run memory-intensive computations and caches. That becomes inefficient when VMs slice the hosts into unnecessarily many virtual nodes, which generates overhead or limits computational capability, as the sketch below illustrates.
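To make the memory-slicing point concrete, here is a minimal sketch estimating how much worker memory survives when a large host is split into VMs. The host size, per-node overhead, and hypervisor reserve are assumptions chosen for illustration.

```python
# Rough estimate of worker memory lost to slicing one large host into VMs.
# All overhead figures are assumptions, not measurements.

HOST_RAM_GB = 512           # physical memory of one host (assumed)
NODE_OVERHEAD_GB = 8        # OS + Hadoop daemons per (virtual or physical) node (assumed)
HYPERVISOR_RESERVE_GB = 16  # memory the hypervisor keeps for itself (assumed)

def usable_worker_ram_gb(num_vms):
    """Memory left for YARN containers / Spark executors across one host."""
    if num_vms == 0:  # bare metal: a single worker node on the host
        return HOST_RAM_GB - NODE_OVERHEAD_GB
    per_vm = (HOST_RAM_GB - HYPERVISOR_RESERVE_GB) / num_vms
    return num_vms * (per_vm - NODE_OVERHEAD_GB)

for vms in (0, 2, 4, 8):
    label = "bare metal" if vms == 0 else f"{vms} VMs"
    print(f"{label:>10}: ~{usable_worker_ram_gb(vms):.0f} GB usable for workers")
```

The more slices, the more memory is spent on duplicated OS and daemon overhead instead of computation, and the smaller each executor's share becomes.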

There is a natural tipping point in each organisation where the above challenges make it worthwhile to run Hadoop as an infrastructure project that breaks with longstanding shared storage and virtualization practices. Usually, that point is reached when the overhead of virtualization and NAS becomes more expensive than running bare metal and reaping its benefits. So check where your deployment may be in the next few years; if it will remain one or a few small clusters, don't worry. If you are building a large cluster, do consider all the aspects.
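The tipping point itself can be sketched as a toy model, along the lines of the licences-plus-extra-hardware versus manpower trade-off discussed in the comments below. Every figure here is a made-up assumption; the point is the shape of the comparison, not the numbers.

```python
# Toy tipping-point model: annual cost of "VM + NAS" versus bare-metal DAS.
# Every figure is an assumption for illustration; substitute your own quotes.

def virtualized_annual_cost(physical_hosts, vms_per_host=4,
                            license_per_node=4_000,       # node-based licence (assumed)
                            nas_premium_per_host=6_000):  # NAS vs DAS storage delta (assumed)
    # Licences are paid per (virtual) node; existing VM/NAS teams run the platform.
    nodes = physical_hosts * vms_per_host
    return nodes * license_per_node + physical_hosts * nas_premium_per_host

def bare_metal_annual_cost(physical_hosts,
                           license_per_node=4_000,     # one licensed node per host
                           extra_admin_cost=150_000):  # dedicated staff for bare-metal HW (assumed)
    return physical_hosts * license_per_node + extra_admin_cost

for hosts in (5, 10, 25, 50, 100):
    vm, bm = virtualized_annual_cost(hosts), bare_metal_annual_cost(hosts)
    cheaper = "bare metal" if bm < vm else "VM + NAS"
    print(f"{hosts:>3} hosts: VM+NAS ~${vm:,.0f}, bare metal ~${bm:,.0f} -> {cheaper} cheaper")
```

With these assumed figures the crossover sits at fewer than ten hosts; with real quotes it will land elsewhere, but the structure of the comparison stays the same.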

3 thoughts on “Virtualizing Hadoop with NAS”

  1. Andy Max May 9, 2016 17:49

    What kind of size do you have in mind when you say “If you are building a large cluster, do consider all the aspects”?

    • Christian Prokopp May 9, 2016 19:15

      Hi Andy,
      Very good question. I had larger organisations in mind, which think in headcounts and the people they have to hire to administer different environments. There the tipping point is when the cost of licences + extra HW for performance + complexity is greater than the manpower overhead (note that with large enough deployments, where you need to hire more staff anyway, this becomes a no-brainer). It is important to think of that strategically, i.e. if you know you will move all your data and have 100s of nodes in a couple of years, then going down the VM + NAS path is often wasted effort. For smaller companies this may not apply; however, for them the licensing cost may hurt even more, and skilful staff could bridge the gap between the two deployment types. YMMV, so I won't put hard numbers down. Hope this helps.
      Cheers,
      Christian

  2. Manish Kumar Jul 19, 2016 20:42

    Excellent article! Here is the link for Cloud Computing Interview Questions
