
Which way of HDP cluster setup is best, having physical nodes or having multiple VMs with few physical nodes


Hi All,

We have six physical machines. Which cluster setup is better: using those physical machines as they are, or creating multiple VMs on top of them to build a bigger cluster?

Those machines are highly available, each with more than 450 GB of RAM.

Please suggest!

1 ACCEPTED SOLUTION

Master Guru

There are pros and cons to both. VMs have a negative impact on performance, so we would normally go for bare metal. MapReduce scales well to lots of disks/processes even on a single DataNode.

However, there are limits on VERY big nodes (there are new Apollo servers with 24 drives): you need to increase the HDFS DataNode heap, and you may have issues with very big block reports being sent around. In that case, logically splitting a node into multiple smaller VMs might solve these issues.

But normally I would say go bare metal.
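On the heap point above: the DataNode heap is normally raised in hadoop-env.sh. A minimal sketch, assuming a dense node; the sizes and GC flag here are illustrative assumptions, not recommendations:

```shell
# hadoop-env.sh -- illustrative sizes only; tune for your hardware.
# A dense node (e.g. 24 drives) holds many more block replicas, so the
# DataNode needs more heap to track block metadata and to build the
# block reports it periodically sends to the NameNode.
export HADOOP_DATANODE_OPTS="-Xms8g -Xmx8g ${HADOOP_DATANODE_OPTS}"
```

If the cluster is Ambari-managed (typical for HDP), the same setting is usually edited through the HDFS configs in the Ambari UI rather than by hand, so it survives restarts and host additions.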


2 REPLIES


Hi @Uday Vakalapudi. Typically you will always be better off with multiple machines (scale out) rather than a smaller number of large machines (scale up).

If you consider the way that Hadoop works, jobs are distributed across the whole cluster and all the resources can be utilised simultaneously. This is the opposite of what virtualisation is typically designed for: consolidating multiple machines with different workloads and different workload profiles (I/O, CPU, memory).

My short suggestion: if you're just looking at a test/dev/pilot system, then multiple VMs are fine. But for production, consider scaling out on bare metal.

Hope that helps.

