Created 03-18-2017 07:42 PM
What is meant by IPTables? In one of the forums I am seeing a suggestion to turn it off.
Turn down Swappiness to 10 or less
Turn off Transparent Huge page compaction
How do we ensure a network time source keeps all hosts in sync (time zones)?
How to make sure forward and reverse DNS work on each host to resolve all other hosts consistently
How to set the vcore count based on the detected host CPUs (Cloudera), and how to set heap sizes in such a way that we are not overcommitting RAM.
Please help me to understand these.
Created 03-18-2017 09:45 PM
What is meant by IPTables? In one of the forums I am seeing a suggestion to turn it off.
IPTables is a firewall utility in Linux that allows or blocks network traffic. There are too many components in Hadoop, each with several ports to be configured. It is recommended to turn off IPTables because:
A) Hadoop clusters usually sit behind an enterprise firewall anyway.
B) If you turn on IPTables, you have too much manual work to manage all the ports and clients (who to allow and who to block).
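If you want to see whether a firewall on a host is getting in the way, a quick connectivity check can help. This is only a rough sketch: the function name port_reachable is made up for illustration, and a failed connection can also simply mean the service is not running.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

You would call it from another node with your own host names and ports, for example port_reachable("worker1.example.com", 8020), where both the host name and the port are placeholders.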
Turn down Swappiness to 10 or less
The swappiness setting (vm.swappiness) tells the kernel how strongly to prefer keeping application memory in RAM over swapping it out to disk. The default of 60 is not a literal "60% memory usage" threshold; it is a relative weight that makes the kernel fairly willing to swap application pages out even while memory is still available. Swapping hurts Hadoop performance, so it is recommended that you set swappiness to a low value (zero, or at most 10) so that the kernel avoids swapping until memory is nearly exhausted.
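To check what a host is currently using, you can read the value from /proc/sys/vm/swappiness. Here is a minimal sketch; the helper names and the default limit of 10 are my own choices taken from the "10 or less" suggestion, not part of any Hadoop tool.

```python
from pathlib import Path

SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")  # standard Linux procfs location


def parse_swappiness(raw: str) -> int:
    """Convert the text found in the swappiness file to an integer."""
    return int(raw.strip())


def swappiness_ok(value: int, limit: int = 10) -> bool:
    """True if the value follows the 'limit or less' recommendation."""
    return 0 <= value <= limit


def check_host(path: Path = SWAPPINESS_PATH, limit: int = 10) -> str:
    """Read the current value and describe whether it needs lowering."""
    value = parse_swappiness(path.read_text())
    verdict = "OK" if swappiness_ok(value, limit) else f"too high, lower it to {limit} or less"
    return f"vm.swappiness = {value} ({verdict})"
```

To change the value on a running host you would normally use "sysctl -w vm.swappiness=10" and add the same setting to /etc/sysctl.conf so it survives a reboot.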
Turn off Transparent Huge page compaction
As for Transparent Huge Pages (THP), please see this link. It has a really good explanation of what THP is. As you'll read, the purpose of THP is to improve performance; however, with Hadoop we have seen issues, hence the recommendation to disable it.
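To see how THP is currently set on a host, you can read the "enabled" and "defrag" files under sysfs, where the active choice is shown in square brackets. A small sketch follows; the directory below is an assumption that holds on mainline kernels, while some RHEL/CentOS 6 kernels use /sys/kernel/mm/redhat_transparent_hugepage instead, and the function names are made up for illustration.

```python
import re
from pathlib import Path

# Assumed sysfs location; adjust for your distribution if needed.
THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")


def active_setting(raw: str) -> str:
    """Return the bracketed choice, e.g. 'never' from 'always madvise [never]'."""
    match = re.search(r"\[([^\]]+)\]", raw)
    if match is None:
        raise ValueError(f"no active setting found in {raw!r}")
    return match.group(1)


def thp_report(thp_dir: Path = THP_DIR) -> dict:
    """Read the 'enabled' and 'defrag' files and return their active settings."""
    return {
        name: active_setting((thp_dir / name).read_text())
        for name in ("enabled", "defrag")
    }
```

A host that follows the recommendation would report "never" for the settings you chose to disable.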
How do we ensure a network time source keeps all hosts in sync (time zones)?
Install NTP on one server that syncs its own time from an upstream time source, and configure all other hosts in the cluster to sync their time from this server. This way all nodes sync their time from the same source.
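If you want a rough idea of how far a host's clock is from your NTP server, a simple SNTP-style query is enough. This is only a sketch: it ignores network delay, so treat the result as approximate, and the function names are my own. On a real cluster you would normally rely on the NTP daemon's own status tools.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 (NTP) and 1970-01-01 (Unix)


def ntp_to_unix(seconds: int, fraction: int) -> float:
    """Convert a 64-bit NTP timestamp (two 32-bit halves) to Unix time."""
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def parse_transmit_time(packet: bytes) -> float:
    """Extract the server's transmit timestamp (bytes 40-47) from an NTP reply."""
    if len(packet) < 48:
        raise ValueError("NTP packet too short")
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return ntp_to_unix(seconds, fraction)


def clock_offset(server: str, timeout: float = 2.0) -> float:
    """Approximate seconds by which the server's clock is ahead of the local clock."""
    request = b"\x1b" + 47 * b"\0"  # LI=0, version 3, mode 3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (server, 123))  # 123 is the standard NTP port
        reply, _ = sock.recvfrom(512)
    return parse_transmit_time(reply) - time.time()
```

Running clock_offset against the same server from every node should give values close to zero if the hosts are in sync.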
How to make sure forward and reverse DNS work on each host to resolve all other hosts consistently
The hostname that Hadoop sees is the output of "hostname --fqdn" on the node. On CentOS, set the hostname in the /etc/sysconfig/network file to your fully qualified domain name. Assuming a local install, use this fully qualified domain name in the /etc/hosts file on all nodes. If you have a DNS manager, use the appropriate settings for your DNS server.
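To verify that forward and reverse lookups agree, you can do a round trip: resolve the name to an IP, resolve the IP back to a name, and compare. Here is a minimal sketch using Python's standard socket module; dns_consistent is a made-up helper name, and the resolver arguments are only there so the check can be exercised without a real DNS server.

```python
import socket
from typing import Callable


def dns_consistent(
    fqdn: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], tuple] = socket.gethostbyaddr,
) -> bool:
    """True if fqdn -> IP -> name round-trips back to the same fqdn."""
    try:
        ip = forward(fqdn)
        name, _aliases, _addresses = reverse(ip)
    except OSError:
        # Lookup failures (socket.gaierror, socket.herror) are OSError subclasses.
        return False
    return name.lower().rstrip(".") == fqdn.lower().rstrip(".")
```

Run it on every host for every other host's name, for example dns_consistent(socket.getfqdn()) for the local node. If you rely on /etc/hosts, keep in mind that the reverse lookup returns the first name listed for the IP, so the fully qualified name should come first on each line.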
How to set the vcore count based on the detected host CPUs (Cloudera), and how to set heap sizes in such a way that we are not overcommitting RAM.
You will use YARN and the Capacity Scheduler to manage and allocate your resources to queues, and assign queues to users/groups. You can always overcommit RAM, but you will get a warning in Ambari if you do. You might do that consciously if you know what you are doing, or decide against it. That should be your decision based on your requirements, your applications, and the SLAs you have promised to your users.
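To get a feel for whether a plan overcommits a node, you can add up the numbers yourself. The sketch below is plain arithmetic, not an Ambari or Cloudera API; the function names, the reserved core count, and all memory figures are hypothetical.

```python
import os


def suggested_vcores(reserved_cores: int = 1) -> int:
    """Detected CPU count minus cores held back for the OS and daemons."""
    detected = os.cpu_count() or 1
    return max(1, detected - reserved_cores)


def memory_overcommit_mb(total_ram_mb: int, os_reserve_mb: int, allocations_mb: dict) -> int:
    """Return how many MB the planned allocations exceed usable RAM by (0 if they fit)."""
    usable = total_ram_mb - os_reserve_mb
    planned = sum(allocations_mb.values())
    return max(0, planned - usable)


# Hypothetical worker node: 64 GB RAM, 8 GB kept for the OS.
plan = {"datanode_heap": 4096, "nodemanager_heap": 2048, "yarn_containers": 49152}
print(memory_overcommit_mb(65536, 8192, plan))  # 0: 55296 MB planned fits in 57344 MB usable
```

Keep in mind that a JVM uses more memory than its heap size alone, so it is worth leaving some headroom on top of the sum.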
Created 03-20-2017 05:31 AM
@mqureshi thanks a ton for explaining in detail. It helped a lot.