Support Questions

Find answers, ask questions, and share your expertise

Virtual Machines in Hadoop cluster

avatar
Rising Star

Hello,

We are having an internal argument on whether its a good idea to have the cluster mainly running on VMs or is it better to have it on physical servers. What are the pros and cons of each hardware configuration. Also, is it a good idea to mix both physical and virtual machines in a single cluster, if need be.

1 ACCEPTED SOLUTION

avatar
Guru

There are pros and cons of both approaches.

VM based pros:

1. 'Easier' managing nodes. Some IT infrastructure teams insist on VMs even if you want to map 1 physical node to 1 virtual node because all their other infrastructure is based on VMs.

2. Taking advantage of NUMA and memory locality. There are some articles on this from virtual infrastructure providers that you can take a look at.

VM based disadvantages:

1. Overhead. As an example, if you are running 4VMs per physical node, you are running 4 OS, 4 Datanode services, 4 Nodemanagers, 4 ambari-agents, 4 metrics collectors and 4 of any other worker services instead of one. These services will have overhead compared to running 1 of each.

2. Data Locality and redundancy. Now, there is support to know physical nodes, so no two replicas go into same physical node but that is extra configuration. You might run into virtual disk performance problems if they are not configured properly.

Given a choice, I prefer using Physical servers. However, its not always your choice. In those cases, make sure you try to get following.

1. Explicit virtual disk to physical disk mapping. Say you have 2 VMs per physical node and each physical node has 16 data drives. Make sure to split 8 drives to one VM and 8 more to another VM. This way, physical disks are not shared between VMs.

2. Don't go for more than 2 VMs per physical node. This is so you minimize overhead from the services running.

Regarding your question of mixing physical and virtual machines, try to see that all your worker nodes are of similar hardware. While heterogenous hardware is supported, you can run into issues because nodes have different hardware profiles. However, we had some customers who used VMs for master services and physical nodes for worker nodes. This was one way to getting away from NN SPOF issues in Hadoop1 days.

View solution in original post

6 REPLIES 6

avatar

Hi @bigdata.neophyte,

Hadoop has been designed to run on commodity hardware. There are important concepts such as data locality and horizontal scalability that make a hardware cluster the first choice for Hadoop clusters today. The pro of this choice is performance. The cons is the cost for installing and managing the cluster.

Virtual machines are also used for Hadoop today. VM with a central storage (SAN) is not the best choice for performance since you will loose data locality and you will have concurrent jobs/tasks concurrently accessing to the same storage. Some solutions today support dedicating hard disks to VMs. This way you can have a good hybrid approach. The pros of VM is the flexibility.

VM Hadoop clusters are usually used for development environment because they provide flexibility. It's easy to create and kill clusters. Physical clusters are usually recommended for production where the application has strong SLAs.

The final choice depends on your use cases, your existing infrastructure, and resources available to manage your cluster.

I hope this helps.

avatar
Rising Star

@Abdelkrim Hadjidj great explanation, thanks! With the above understanding, I presume, if need be, there can be a mix of both physical and virtual in the same cluster without any additional overhead / performance impacts apart from the ones mentioned above?

avatar

@bigdata.neophyte I would not recommend having both physical and virtual nodes in the same cluster. I think it's best to identify your KPI and choose the best solution. This being said, having VMs cluster and physical clusters at the same time can be a good choice for implementing several environments (dev, testing, prod, etc)

avatar
Guru

There are pros and cons of both approaches.

VM based pros:

1. 'Easier' managing nodes. Some IT infrastructure teams insist on VMs even if you want to map 1 physical node to 1 virtual node because all their other infrastructure is based on VMs.

2. Taking advantage of NUMA and memory locality. There are some articles on this from virtual infrastructure providers that you can take a look at.

VM based disadvantages:

1. Overhead. As an example, if you are running 4VMs per physical node, you are running 4 OS, 4 Datanode services, 4 Nodemanagers, 4 ambari-agents, 4 metrics collectors and 4 of any other worker services instead of one. These services will have overhead compared to running 1 of each.

2. Data Locality and redundancy. Now, there is support to know physical nodes, so no two replicas go into same physical node but that is extra configuration. You might run into virtual disk performance problems if they are not configured properly.

Given a choice, I prefer using Physical servers. However, its not always your choice. In those cases, make sure you try to get following.

1. Explicit virtual disk to physical disk mapping. Say you have 2 VMs per physical node and each physical node has 16 data drives. Make sure to split 8 drives to one VM and 8 more to another VM. This way, physical disks are not shared between VMs.

2. Don't go for more than 2 VMs per physical node. This is so you minimize overhead from the services running.

Regarding your question of mixing physical and virtual machines, try to see that all your worker nodes are of similar hardware. While heterogenous hardware is supported, you can run into issues because nodes have different hardware profiles. However, we had some customers who used VMs for master services and physical nodes for worker nodes. This was one way to getting away from NN SPOF issues in Hadoop1 days.

avatar
Rising Star

@Ravi Mutual, perfect! Thanks!

avatar

I'd also add that if you decide to go virtual with VMWare make sure you implement BDE this can help with things like data locality. Also, even our largest virtual cluster environments still use DAS. All have tried SAN storage but the performance was terrible.