What are advantages and disadvantages/pitfalls of using Isilon with a virtualized Hadoop cluster?
I ran Isilon with Apache Hadoop 1.x and 2.x on virtual machines. The beauty is that you can run multiple clusters side by side against the same appliance: JobTrackers and TaskTrackers in 1.x, YARN NodeManagers and ResourceManagers in 2.x. I had four 6-node clusters at a time, all interfaced with the same Isilon appliance, and when I needed to scale up I would clone a VM and bring it up. HDP makes it even easier; I would hope Cloudbreak is going there next. We did have issues reaching the Isilon nodes once in a while: each OneFS node had its own IP, so we had to retry with the next node until we succeeded. It's great for convenience, not so great for performance. Another benefit is migration: since the data is in one place, you just point new compute at it and you're done. Backup and recovery is also taken care of, as Isilon backs up to a mirror, so you can treat the compute nodes as dumb slaves.
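The retry-with-the-next-node workaround can be sketched roughly like this. This is an illustrative sketch only: the IPs and the `connect` callable are made-up stand-ins, not a real Isilon client API, and in practice a OneFS SmartConnect DNS name is the usual way to spread clients across node IPs.

```python
# Hypothetical list of OneFS node front-end IPs (illustrative values).
ONEFS_NODE_IPS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

def connect_with_failover(connect, node_ips):
    """Try each OneFS node IP in turn; return the first live connection.

    `connect` is a placeholder for whatever client call opens a session
    to one node; it is assumed to raise ConnectionError on failure.
    """
    last_error = None
    for ip in node_ips:
        try:
            return connect(ip)
        except ConnectionError as err:
            last_error = err  # this node unreachable, fall through to the next IP
    raise ConnectionError(f"all OneFS nodes unreachable, last error: {last_error}")
```

The point is simply that the client, not the appliance, carries the failover logic, which is the inconvenience described above.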
Beyond the points mentioned, anyone looking into virtualizing Hadoop has to consider several important factors. Most people know the benefits of using virtualization and Isilon, so I won't repeat them. Your mileage may vary, and there may be good reasons for you to go down that path. Still, it is worth looking at the downsides.
The scale of storage and the associated cost are often a fundamental decision point. We had customers who were happy with Isilon and wanted to use it for their data lake. However, once they evaluated their future storage needs and the cost of Isilon versus DAS, they quickly changed their minds. Another aspect of cost is support and licensing: many vendors have a node-based cost model, and running large numbers of (virtual) nodes drives your cost up.
Using virtualization and Isilon increases the complexity of your infrastructure. Some argue that these technologies are already in place and therefore add no extra effort. However, when you have to track down, say, an unobvious performance issue, you now have two more places to look (virtualization and Isilon) and, worse, the interactions between all of these technologies and the Hadoop ecosystem.
Performance: virtualization imposes some cost on your infrastructure. While some software vendors try to convince you of performance gains, that is an unlikely scenario, and the benchmarks I have seen are cherry-picked. Furthermore, sharing your infrastructure carries the risk of noisy neighbours. Running multiple (virtual) Hadoop nodes on one physical host also concentrates failure risk: losing the physical node takes several Hadoop nodes down at once, and in the case of DAS with virtualization you can even lose data. Also, memory per node has been steadily increasing in Hadoop deployments to take advantage of technologies like Spark and to run memory-intensive computations and caches. That can become inefficient when VMs slice up the hosts: an unnecessary number of virtual nodes generates overhead or limits computational capability.
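The memory-slicing point is easy to quantify with back-of-the-envelope numbers. All figures below are illustrative assumptions (host size, per-VM overhead, container size), not measurements from any particular deployment:

```python
def usable_containers(host_gb, n_vms, per_vm_overhead_gb, container_gb):
    """Fixed-size YARN containers that fit on one physical host when it is
    sliced into n_vms VMs, each paying a fixed OS/hypervisor memory overhead.
    All inputs are illustrative assumptions."""
    per_vm_gb = host_gb / n_vms - per_vm_overhead_gb
    return n_vms * int(per_vm_gb // container_gb)

# A 256 GB host, ~8 GB overhead per VM, 16 GB containers (e.g. Spark executors):
print(usable_containers(256, 1, 8, 16))  # 15 containers on one big node
print(usable_containers(256, 8, 8, 16))  # 8 containers across eight small VMs
```

Same physical memory, roughly half the usable container capacity once the host is carved into many small VMs, because each VM pays the fixed overhead and rounds down to whole containers.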
There is a natural tipping point in each organisation where, given the above challenges, it becomes worthwhile to run Hadoop as an infrastructure project that breaks with longstanding shared-storage and virtualization practice. Usually that point is reached when the overhead of virtualization and shared storage becomes more expensive than running bare metal and taking the benefits from it. So check where your deployment may be in the next few years: if it is only a small cluster or two, don't worry; if you are building a large cluster, do consider all of the above.