Hadoop configuration questions

What is meant by IPTables? In one of the forums I saw a suggestion to turn it off.

Turn down swappiness to 10 or less

Turn off Transparent Huge Page compaction

How do we ensure a network time source keeps all hosts in sync (time zones)?

How do we make sure forward and reverse DNS work on each host to resolve all other hosts consistently?

How do we set the vcore count based on the detected host CPUs (Cloudera), and how do we set heap sizes in such a way that we are not overcommitting RAM?

Please help me to understand these.

@Bala Vignesh N V

What is meant by IPTables? In one of the forums I saw a suggestion to turn it off.

IPTables is a firewall utility in Linux that allows or blocks traffic. There are too many components in Hadoop, each with several ports to be configured. It is recommended to turn off IPTables because

A) Hadoop clusters typically sit behind an enterprise firewall anyway.

B) If you turn on IPTables, you have too much manual work to manage all the ports and clients (whom to allow and whom to block).

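Turn down swappiness to 10 or less

The swappiness setting (vm.swappiness) tells the kernel how readily it should swap memory pages out to disk instead of keeping them in RAM. The default of 60 is not a memory-usage threshold; it is a relative weight, and higher values make the kernel swap more aggressively. Swapping hurts performance, so the recommendation is to set swappiness low. Setting it to zero tells the kernel to avoid swapping until you have nearly exhausted your memory. Note that on some newer kernels a value of 0 can lead to processes being killed for lack of memory rather than swapped, which is why a small non-zero value such as 1 is often used instead.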
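
To check what a host is currently using, here is a minimal sketch in Python. It assumes a Linux host that exposes the value at /proc/sys/vm/swappiness; the function name swappiness_ok and the default limit of 10 (taken from the question above) are just illustrative choices.

```python
from pathlib import Path

# Standard procfs location of the setting on Linux.
SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")


def swappiness_ok(raw: str, limit: int = 10) -> bool:
    """Return True if the raw procfs value is at or below `limit`.

    `limit` defaults to 10, the ceiling mentioned in the question;
    it is an illustrative parameter, not an official threshold.
    """
    return int(raw.strip()) <= limit


if __name__ == "__main__":
    if SWAPPINESS_PATH.exists():
        value = SWAPPINESS_PATH.read_text()
        print(f"vm.swappiness = {value.strip()}, ok = {swappiness_ok(value)}")
    else:
        print("vm.swappiness not found (not a Linux host?)")
```

This only reads the value. Changing it is normally done with sysctl and made persistent in /etc/sysctl.conf.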

Turn off Transparent Huge Page compaction

As for Transparent Huge Pages (THP), please see this link; it has a really good explanation of what THP is. As you'll read, the purpose of THP is to improve performance; however, with Hadoop we have seen issues. Hence the recommendation to disable it.
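
A rough way to see whether THP is still active on a node is to read the kernel's sysfs files, where the active choice is shown in square brackets (for example "always madvise [never]"). The sketch below assumes the common path /sys/kernel/mm/transparent_hugepage; some older RHEL/CentOS 6 kernels use a redhat_transparent_hugepage directory instead, so adjust THP_DIR for your system. The helper name active_thp_setting is made up for this example.

```python
import re
from pathlib import Path

# Common location; older RHEL/CentOS 6 kernels may use
# /sys/kernel/mm/redhat_transparent_hugepage instead (adjust as needed).
THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")


def active_thp_setting(raw: str) -> str:
    """Return the bracketed (active) option from a THP sysfs file."""
    match = re.search(r"\[([^\]]+)\]", raw)
    if match is None:
        raise ValueError(f"no active option found in {raw!r}")
    return match.group(1)


if __name__ == "__main__":
    for name in ("enabled", "defrag"):
        path = THP_DIR / name
        if path.exists():
            print(f"{name}: {active_thp_setting(path.read_text())}")
        else:
            print(f"{name}: {path} not present on this host")
```

If both files report "never", THP and its compaction (defrag) are off on that host.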

How do we ensure a network time source keeps all hosts in sync (time zones)?

You install NTP on one server, which syncs its own time from an upstream time source, and configure all other hosts in the cluster to sync their time from this host. This way all nodes sync their time against the same server. Note that NTP keeps clocks in sync, not time zones; the time zone is a separate setting on each host.

How do we make sure forward and reverse DNS work on each host to resolve all other hosts consistently?

The hostname that Hadoop sees is the output of "hostname --fqdn" on the node. On CentOS, set the hostname in the /etc/sysconfig/network file to your fully qualified domain name. Assuming a local install, use this fully qualified domain name in the /etc/hosts file on all nodes. If you have a DNS manager, use the appropriate settings for your DNS server.
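
To verify the result, you can check on each host that a name resolves to an address and that the address resolves back to the same name. Here is a small sketch using Python's standard socket module; the function name forward_reverse_consistent and the injectable forward/reverse parameters are only there to make the example easy to test and are not part of any Hadoop tooling.

```python
import socket


def forward_reverse_consistent(
    fqdn,
    forward=socket.gethostbyname,
    reverse=socket.gethostbyaddr,
):
    """Return True if fqdn -> IP -> name round-trips to the same FQDN.

    `forward` and `reverse` default to the system resolver; they are
    parameters only so the check can be exercised without a network.
    """
    ip = forward(fqdn)
    name, _aliases, _addresses = reverse(ip)
    # Compare case-insensitively and ignore a trailing dot.
    return name.lower().rstrip(".") == fqdn.lower().rstrip(".")


if __name__ == "__main__":
    host = socket.getfqdn()
    try:
        print(host, forward_reverse_consistent(host))
    except OSError as exc:  # gaierror and herror are OSError subclasses
        print(f"lookup failed for {host}: {exc}")
```

Run it on every node, and ideally against the names of the other nodes too, since the goal is that all hosts resolve each other the same way.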

How do we set the vcore count based on the detected host CPUs (Cloudera), and how do we set heap sizes in such a way that we are not overcommitting RAM?

You will use YARN and the Capacity Scheduler to manage and allocate your resources to queues and to assign queues to users and groups. You can always overcommit RAM, but you will get a warning in Ambari if you do. You might do that deliberately if you know what you are doing, or decide against it and not overcommit. That is your decision, based on your requirements, your applications, and the SLAs you have promised to your users.
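
As a back-of-the-envelope check for overcommitting, you can add up what is already promised on a node (memory reserved for the OS plus the heaps of the daemons running there) and see how much is left for YARN containers. The sketch below is only arithmetic under assumed numbers; the function names and the sample figures are hypothetical, and the result would feed settings such as yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores.

```python
import os


def yarn_memory_budget_mb(physical_mb, os_reserved_mb, daemon_heaps_mb):
    """Memory left for YARN containers after the OS and daemon heaps."""
    return max(0, physical_mb - os_reserved_mb - sum(daemon_heaps_mb))


def is_overcommitted(physical_mb, os_reserved_mb, daemon_heaps_mb, yarn_container_mb):
    """True if everything promised on the node exceeds physical RAM."""
    committed = os_reserved_mb + sum(daemon_heaps_mb) + yarn_container_mb
    return committed > physical_mb


if __name__ == "__main__":
    # Hypothetical node: 64 GB RAM, 8 GB kept for the OS,
    # two daemons (e.g. DataNode and NodeManager) with 1 GB heaps each.
    physical, reserved, heaps = 65536, 8192, [1024, 1024]
    budget = yarn_memory_budget_mb(physical, reserved, heaps)
    vcores = os.cpu_count() or 1  # CPUs detected on this host
    print(f"memory available for containers: {budget} MB")
    print(f"detected CPUs (candidate vcore count): {vcores}")
    print("overcommitted:", is_overcommitted(physical, reserved, heaps, budget))
```

Whether to leave headroom below that budget, or to go above it on purpose, is the judgment call described above.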

@mqureshi thanks a ton for explaining in detail. It helped a lot.
