Introduction

With HDFS HA, the NameNode is no longer a single point of failure in a Hadoop cluster. However, the performance of a single NameNode can often limit the performance of jobs across the cluster. The Apache Hadoop community has made a number of NameNode scalability improvements. This series of articles (also see part 2, part 3, part 4 and part 5) explains how you can make configuration changes to help your NameNode scale better and also improve its diagnosability.

Audience

This article is for Hadoop administrators who are familiar with HDFS and its components. If you are using Ambari or Cloudera Manager, you should know how to manage services and configurations.

HDFS Audit Logs

The HDFS audit log is a separate log file that contains one entry for each user request. It is a key resource for offline analysis of application traffic and for identifying the users and jobs that generate excessive load on the NameNode. Careful analysis of the audit logs can also surface application bugs and misconfigurations.

Each audit log entry looks like the following:

2015-11-16 21:00:00,109 INFO FSNamesystem.audit: allowed=true ugi=bob (auth:SIMPLE) 
ip=/000.000.0.0 cmd=listStatus src=/app-logs/pcd_batch/application_1431545431771/ dst=null perm=null
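
As a quick illustration of this kind of offline analysis, the following shell one-liners tally requests per operation and per user. This is a minimal sketch: the log path is an assumption based on the hdfs-audit.log location configured below, and the field names match the entry format shown above.

# Top 10 operations by request count
grep -o 'cmd=[^ ]*' /var/log/hadoop/hdfs/hdfs-audit.log | sort | uniq -c | sort -rn | head -10

# Top 10 users by request count
grep -o 'ugi=[^ ]*' /var/log/hadoop/hdfs/hdfs-audit.log | sort | uniq -c | sort -rn | head -10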

Audit logging can be enabled via the Hadoop log4j.properties file (typically found in /etc/hadoop/conf, or under the "process" directory when using CM) with the following settings:

# hdfs audit logging
hdfs.audit.logger=INFO,console
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
log4j.appender.DRFAAUDIT=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
log4j.appender.DRFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
log4j.appender.DRFAAUDIT.DatePattern=.yyyy-MM-dd

Also, check the NameNode startup options in hadoop-env.sh to ensure there is no conflicting override for the hdfs.audit.logger setting. The following is an example of a correct HADOOP_NAMENODE_OPTS setting.

export HADOOP_NAMENODE_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=8g -XX:MaxNewSize=8g -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms64g -Xmx64g -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT ${HADOOP_NAMENODE_OPTS}"
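
Once the NameNode has been restarted with audit logging enabled, you can confirm that entries are actually being written. A minimal check (the log path is an assumption based on the hadoop.log.dir setting above; adjust it for your installation):

# Issue a simple request, then look for the corresponding entry in the audit log
hdfs dfs -ls /
tail -n 5 /var/log/hadoop/hdfs/hdfs-audit.log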

Note: If you enable HDFS audit logs, you should also enable Async Audit Logging as described in the next section.

Async Audit Logging

Using an async log4j appender for the audit logs significantly improves the performance of the NameNode. This setting is critical for the NameNode to scale beyond 10,000 requests/second. Add the following to your hdfs-site.xml.

  <property>
    <name>dfs.namenode.audit.log.async</name>
    <value>true</value>
  </property>

If you are managing your cluster with Ambari, this setting is already enabled by default. If you're using CDH, you'll want to add the property to "NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml". For a CDP cluster, this property is exposed in the UI and enabled by default.
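
Whichever tool you use, you can sanity-check that the property has reached the NameNode's effective configuration with hdfs getconf. Run it on the NameNode host, since it reads the configuration files local to that host:

# Should print "true" once async audit logging is enabled
hdfs getconf -confKey dfs.namenode.audit.log.async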

Service RPC port

The service RPC port gives the DataNodes a dedicated port for reporting their status via block reports and heartbeats. The port is also used by the ZooKeeper Failover Controllers for the periodic health checks performed by the automatic failover logic. Because the port is never used by client applications, it reduces RPC queue contention between client requests and DataNode messages.

For a non-HA cluster, the service RPC port can be enabled with the following setting in hdfs-site.xml (replacing mynamenode.example.com with the hostname or IP address of your NameNode). The port number can be something other than 8040.

  <property>
    <name>dfs.namenode.servicerpc-address</name>
    <value>mynamenode.example.com:8040</value>
  </property>

For an HA cluster, the service RPC port can be enabled with settings like the following, replacing mycluster, nn1 and nn2 with your nameservice ID and NameNode IDs.

  <property>
    <name>dfs.namenode.servicerpc-address.mycluster.nn1</name>
    <value>mynamenode1.example.com:8040</value>
  </property>

  <property>
    <name>dfs.namenode.servicerpc-address.mycluster.nn2</name>
    <value>mynamenode2.example.com:8040</value>
  </property>
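
Once the properties are in place, a quick way to confirm they are being picked up is hdfs getconf on the NameNode hosts. This is a minimal sketch reusing the example nameservice and NameNode IDs above:

# Non-HA cluster
hdfs getconf -confKey dfs.namenode.servicerpc-address

# HA cluster: one key per NameNode
hdfs getconf -confKey dfs.namenode.servicerpc-address.mycluster.nn1
hdfs getconf -confKey dfs.namenode.servicerpc-address.mycluster.nn2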

If you have enabled Automatic Failover using ZooKeeper Failover Controllers, an additional step is required to reset the ZKFC state in ZooKeeper. Stop both the ZKFC processes and run the following command as the hdfs superuser.

hdfs zkfc -formatZK
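
On a cluster managed from the command line, that step might look like the following. This is a minimal sketch assuming the Hadoop 3 "hdfs --daemon" syntax; Ambari and Cloudera Manager users should stop and start the ZKFCs through those tools instead.

# On each NameNode host: stop the ZooKeeper Failover Controller
hdfs --daemon stop zkfc

# On one NameNode host, as the hdfs superuser: reset the ZKFC state in ZooKeeper
hdfs zkfc -formatZK

# On each NameNode host: start the ZooKeeper Failover Controller again
hdfs --daemon start zkfc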

Note: Changing the service RPC port settings requires a restart of the NameNodes, DataNodes and ZooKeeper Failover Controllers to take full effect. If you have NameNode HA set up, you can restart the NameNodes one at a time, followed by a rolling restart of the remaining components, to avoid cluster downtime.

NameNode JVM Garbage Collection

Garbage Collection is important enough to merit its own article. Please refer to NameNode Garbage Collection Configuration: Best Practices and Rationale.

Conclusion

That's it for Part 1. In part 2, part 3, part 4 and part 5 we will look at a number of additional configuration settings to help your NameNode scale better.

Comments

About the whole no downtime thing

If you have enabled Automatic Failover using ZooKeeper Failover Controllers, an additional step is required to reset the ZKFC state in ZooKeeper. Stop both the ZKFC processes and run the following command as the hdfs super user.

I think some clarification would be nice

Stop the ZKFC and then format zkfc znode and after that restart NameNodes?

Or restart the NameNodes, then stop the ZKFC and format after that?
