Created on 11-13-2019 10:18 PM - last edited on 11-14-2019 03:16 PM by cjervis
My colleagues and I are discussing the pros and cons of running a Cloudera cluster on physical servers versus running it on several virtual machines hosted on hyper-converged servers.
Created 11-13-2019 11:06 PM
Every technology has its pros and cons. The question above is very broad, and the discussion could go on forever.
Do you have any specific questions or issues regarding implementation or architecture? I will try to comment accordingly.
Created 11-13-2019 11:37 PM
Hi @sagarshimpi,
Right now the team is more inclined to do it on virtual machines, since the hyper-converged servers are already set up, rather than buying and setting up new physical servers. At the moment I do not have the specs of the HCI servers that have been set up.
At the moment, the big questions we are asking are:
1.) How different would the setup and configuration be for physical servers compared to VMs? Yes, setting up the VMs would be faster than setting up physical servers, but are there any additional configurations or settings that we would need to look into?
2.) We've read that one possible issue with setting up the cluster on VMs is data locality and redundancy: no two replicas should be on the same physical node, but one physical node may house several VMs. Would there be a way around this issue?
3.) Since the specs of the VMs would be restricted to the specs of the physical node, with its resources split depending on how many VMs it is housing, wouldn't it be better to have separate servers that each house one node of the cluster to get better performance? And would having several VMs on one physical node affect the parallelism of the jobs that will run on the cluster?
I am unfamiliar with hyper-converged infrastructure and how it will affect the functionality and performance of VMs compared to a traditional server architecture.
Also, based on some blogs I've read, VM clusters are good for development since they are more flexible (easy to create and destroy), but for production it would be better to use physical servers.
Thanks.
Created on 11-14-2019 12:37 AM - edited 11-14-2019 12:41 AM
I will try to give my views inline:
1.) How different would the setup and configuration be for physical servers compared to VMs? Yes, setting up the VMs would be faster than setting up physical servers, but are there any additional configurations or settings that we would need to look into?
-- In terms of general configuration, the points below need to be taken into account, since performance depends on them:
a. Disks
b. Network
c. Memory/CPU
d. SLA
2.) We've read that one possible issue with setting up the cluster on VMs is data locality and redundancy: no two replicas should be on the same physical node, but one physical node may house several VMs. Would there be a way around this issue?
-- A VM with external storage (like a SAN) will impact data locality. You can go with dedicated disks for the VMs, which would be a good hybrid approach.
Yes, there are ways around it. For data locality, virtualization vendors such as VMware provide add-on components like BDE (Big Data Extensions), and on the network side there is NSX technology, which can help speed up the system and avoid performance impacts. But you need to take licensing cost into account.
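One vendor-neutral way to look at the replica question is Hadoop's rack awareness, where a topology script (configured through net.topology.script.file.name) maps each node to a rack path. Below is a minimal sketch, assuming you treat each physical host as its own "rack" so that the replicas of a block should span at least two physical hosts. The VM names, host names, and the hand-maintained mapping table are all hypothetical. On a Cloudera Manager cluster, racks are normally assigned through Cloudera Manager rather than a hand-written script, so take this only as an illustration of the idea:

```python
#!/usr/bin/env python3
"""Illustrative topology script: every VM on one hypervisor shares one 'rack'."""
import sys

# Assumption: a hand-maintained map of VM hostname -> physical host.
# All names here are invented for the example.
VM_TO_PHYSICAL_HOST = {
    "worker-vm-01": "hci-host-a",
    "worker-vm-02": "hci-host-a",
    "worker-vm-03": "hci-host-b",
    "worker-vm-04": "hci-host-b",
}

DEFAULT_RACK = "/default-rack"


def resolve_rack(name):
    """Return the rack path for one hostname or IP; unknown names get the default."""
    # Try the name as given first, then the short hostname without the domain.
    physical_host = VM_TO_PHYSICAL_HOST.get(name)
    if physical_host is None:
        physical_host = VM_TO_PHYSICAL_HOST.get(name.split(".")[0])
    if physical_host is None:
        return DEFAULT_RACK
    return "/" + physical_host


def main(args):
    # Hadoop calls the script with one or more hostnames or IPs
    # and reads one rack path per argument from stdout.
    for arg in args:
        print(resolve_rack(arg))


if __name__ == "__main__":
    main(sys.argv[1:])
```

If Hadoop passes IP addresses instead of hostnames, the mapping table would need IP entries as well; in this sketch an unmapped IP simply falls back to the default rack.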
3.) Since the specs of the VMs would be restricted to the specs of the physical node, with its resources split depending on how many VMs it is housing, wouldn't it be better to have separate servers that each house one node of the cluster to get better performance? And would having several VMs on one physical node affect the parallelism of the jobs that will run on the cluster?
-- From actual experience, it is difficult to make this decision up front. It depends purely on your SLA. At the start, when you first run Hadoop applications, you might not know how long your application takes to process or whether it meets the SLA.
This calls for a purely POC-based approach: you need to test and run benchmarks before you go for actual dev/UAT/prod implementations.
Benchmarking results will give you a fair idea of performance and computational stats. It will then be easier to make the decision.
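To make the comparison concrete, here is a small sketch of how repeated benchmark timings from a POC could be checked against an SLA. The function name, the environment labels, and all the numbers are placeholders I made up for illustration, not measurements from any real cluster; substitute the run times of your own benchmark jobs:

```python
from statistics import mean


def summarize(run_seconds, sla_seconds):
    """Summarize repeated runs of one benchmark job against an SLA limit."""
    worst = max(run_seconds)
    return {
        "mean_s": mean(run_seconds),
        "worst_s": worst,
        # Judge against the slowest run, since the SLA must hold every time.
        "meets_sla": worst <= sla_seconds,
    }


if __name__ == "__main__":
    # Placeholder values only: replace with timings from your own POC runs.
    sla_seconds = 900
    poc_runs = {
        "environment_a": [600, 620, 610],
        "environment_b": [700, 950, 720],
    }
    for name, runs in poc_runs.items():
        print(name, summarize(runs, sla_seconds))
```

Judging by the worst run instead of the average is a design choice in this sketch; if your SLA is defined on an average or a percentile, adjust the check accordingly.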
Please check the links below, which might be useful:
https://community.cloudera.com/t5/Support-Questions/Virtual-Machines-in-Hadoop-cluster/td-p/119675
https://www.kdnuggets.com/2015/12/myths-virtualizing-hadoop-vsphere-explained.html
https://pubs.vmware.com/bde-2/index.jsp
Created 11-14-2019 01:17 AM
Hi @sagarshimpi,
Thanks, this will shed some light on our discussion.
If we have follow-up questions, would it be alright if I tag you here in the thread?
Created 11-14-2019 06:03 AM
Not to take away from the conversation above, which was in fact a very detailed and specific comparison, but the major takeaway in your pro/con evaluation needs to be physical disk compared to network or some level of shared disk. Also, in a big HA system there is usually more than one disk (which is not the same as more than one partition). When you go past dev and POC-level benchmarking and deep into performance tuning, physical disks in high-availability arrays on a physical machine will outperform the cloud or VMs for large-volume, large-data processes. To get more specific, you have to compare all the nuts and bolts and evaluate the performance best practices for each platform, service, and component, all the way down to application design.
This is a great debate and one that I have at every customer. That said, I have led prod cluster installs in the cloud: Amazon, Azure, IBM Cloud, Google Cloud, as well as private cloud and VM systems.