Community Articles

Find and share helpful community-sourced technical articles.
Labels (1)
avatar

Introduction

This article is part of a series (see part 1, part 2, part 3 and part 5). Here we look at how to avoid some common NameNode performance issues.

Audience

This article is for Apache Hadoop administrators who are familiar with HDFS and its components. If you are using Ambari or Cloudera Manager, you should know how to manage services and configurations.

Avoid Disk Throughput Issues

The performance of the HDFS NameNode depends on being able to flush its edit logs to disk relatively quickly. Any delay in writing edit logs will hold up valuable RPC handler threads and can have cascading effects on the performance of your entire Hadoop cluster.

You should use a dedicated hard disk i.e. dedicated spindle for the edit logs to keep edits flowing to the disk smoothly. The location of the edit log is configured using the dfs.namenode.edits.dir setting in hdfs-site.xml. If dfs.namenode.edits.dir is not configured, it defaults to the same value as dfs.namenode.name.dir.

Example:

  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/mnt/disk1,/mnt/disk2</value>
  </property>

If there are multiple comma-separated directories for dfs.namenode.name.dir, the NameNode will replicate its metadata to each of these directories. In the example above, it is recommended that /mnt/disk1 and /mnt/disk2 refer to dedicated spindles.

As a corollary, if you are using NameNode HA with QJM then the Journal Nodes should also use dedicated spindles for writing edit log transactions. The JournalNode edits directory can be configured via dfs.journalnode.edits.dir in hdfs-site.xml.

  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/mnt/disk3</value>
  </property>

This leads us to the next pitfall.

Don't let Reads become Writes

The setting dfs.namenode.accesstime.precision controls how often the NameNode will update the last accessed time for each file. It is specified in milliseconds. If this value is too low, then the NameNode is forced to write an edit log transaction to update the file's last access time for each read request and its performance will suffer.

The default value of this setting is a reasonable 3600000 milliseconds (1 hour). We recommend going one step further and setting it to zero so that last access time updates are turned off. Add the following to your hdfs-site.xml.

  <property>
    <name>dfs.namenode.accesstime.precision</name>
    <value>0</value>
  </property>

Monitor Group lookup Performance

The NameNode frequently performs group lookups for users submitting requests. This is part of the normal access control mechanism. Group lookups may involve contacting an LDAP server. If the LDAP server is slow to process requests, it can hold up NameNode RPC handler threads and affect cluster performance. When the NameNode suffers from this problem it logs a warning that looks like this:

Potential performance problem: getGroups(user=bob) took 23861 milliseconds.

The default warning threshold is 5 seconds. If you see this log message you should investigate the performance of your LDAP server right away. There are a few workarounds if it is not possible to fix the LDAP server performance immediately.

  1. Increase the group cache timeout: The NameNode caches the results of group lookups to avoid making frequent lookups. The cache duration is set via hadoop.security.groups.cache.secs in your core-site.xml file. The default value of 300 (5 minutes) is too low. We have seen this value set as high as 7200 (2 hours) in busy clusters to avoid frequent group lookups.
  2. Increase the negative cache timeout: For users with no group mapping, the NameNode maintains a short negative cache. The default negative cache duration is 30 seconds and maybe too low. You can increase this with the hadoop.security.groups.negative-cache.secs setting in core-site.xml. This setting is useful if you see frequent messages like 'No groups available for user 'alice' in the NameNode logs.
  3. Add a static mapping override for specific users: A static mapping is an in-memory mapping added via configuration that eliminates the need to lookup groups via any other mechanism. It can be a useful last resort to work around LDAP server performance issues that cannot be easily addressed by the cluster administrator. An example static mapping for two users 'alice' and 'bob' looks like this:
  <property>
    <name>hadoop.user.group.static.mapping.overrides</name>
    <value>alice=staff,hadoop;bob=students,hadoop;</value>
  </property>

The HCC article Hadoop and LDAP: Usage, Load Patterns and Tuning has a lot more detail on debugging and optimizing Hadoop LDAP performance.

Monitor NameNode Logs for Excessive Spew

Sometimes the NameNode can generate excessive logging spew which severely impacts its performance. Rendering logging strings is an expensive operation that consumes CPU cycles. It also generates a lot of small short-lived objects (garbage) causing excessive Garbage Collection.

The Apache Hadoop community has made a number of fixes to reduce the verbosity of NameNode logs and these fixes are available in recent HDP maintenance releases. Example: HDFS-9434, HADOOP-12903, HDFS-9941, HDFS-9906 and many more.

Depending on your specific version of Apache Hadoop or HDP you may run into one or more of these problems. Also, new problems can be exposed by changing workloads. It is a good idea to monitor your NameNode logs periodically for excessive log spew and report it either by contacting the Hortonworks support team or filing a bug in the Apache Hadoop Jira.

Conclusion

That is it for Part 4. In Part 5 of this series, we will explore how to optimize logging and block reports.

28,715 Views
Comments
avatar
Super Collaborator

The 04-Series article are very well written. Thanks Arpit for documenting them.

So many heavy calls inside critical RPC loops? That is very weird design decision.

this is really helpful

avatar
New Contributor

This whole series is really insightful and helpful!