Member since: 07-30-2019
Posts: 111
Kudos Received: 186
Solutions: 35
My Accepted Solutions
Views | Posted
---|---
3244 | 02-07-2018 07:12 PM
2450 | 10-27-2017 06:16 PM
2721 | 10-13-2017 10:30 PM
4993 | 10-12-2017 10:09 PM
1263 | 06-29-2017 10:19 PM
07-08-2016
05:25 PM
Nice series @szetszwo! Thanks for writing this up. It will be a very useful resource for administrators.
07-07-2016
04:13 AM
10 Kudos
Introduction
This article continues where part 2 left off. It describes how to enable two new features that make the HDFS NameNode more responsive to high RPC request loads.
Audience
This article is for Apache Hadoop administrators who are familiar with HDFS and its components. If you are using Ambari, you should know how to manage services and configurations with Ambari. It is assumed that you have read part 1 and part 2 of this series.
RPC Congestion Control
Warning: You must first enable the service RPC port (described in part 1) and restart services so the service RPC setting is effective. Failure to do so will break DataNode-NameNode communication.
RPC Congestion Control is a relatively new feature added by the Apache Hadoop Community to help Hadoop Services respond more predictably under high load (see Apache Hadoop Jira HADOOP-10597). In part 2, we discussed how RPC queue overflow can cause request timeouts, and eventually job failures.
This problem can be mitigated if the NameNode sends an explicit signal back to the client when its RPC queue is full. Instead of waiting for a request that may never complete, the client throttles itself by resubmitting the request with an exponentially increasing delay. If you are familiar with the Transmission Control Protocol, this is similar to how senders back off when they detect network congestion.
The feature can be enabled with the following setting in your core-site.xml file. Replace 8020 with your NameNode RPC port number if it is different. Do not enable this setting for the Service RPC port or the DataNode lifeline port.
<property>
<name>ipc.8020.backoff.enable</name>
<value>true</value>
</property>
Do not enable this setting unless you are running HDP 2.3.6+ or HDP 2.4.2+, or a Hortonworks engineer has recommended enabling it after examining your cluster.
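After enabling the setting and restarting the NameNode, a quick sanity check is to read the value back. A minimal sketch, assuming the client RPC port is 8020 and the host has the updated core-site.xml (note that hdfs getconf reads the local configuration files, not the running NameNode):
# Should print "true" once the updated core-site.xml is in place (port 8020 assumed).
hdfs getconf -confKey ipc.8020.backoff.enable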
RPC FairCallQueue
Warning: You must first enable the service RPC port (described in part 1) and restart services so the service RPC setting is effective. Failure to do so will break DataNode-NameNode communication.
The RPC FairCallQueue replaces the single RPC queue with multiple prioritized queues (see Apache Hadoop Jira HADOOP-10282). The RPC server maintains a history of recent requests grouped by user. It places incoming requests into an appropriate queue based on the user's history. RPC handler threads will dequeue requests from higher priority queues with a higher probability.
FairCallQueue complements RPC congestion control very well and works best when you enable both features together. FairCallQueue can be enabled with the following setting in your core-site.xml. Replace 8020 with your NameNode RPC port if it is different. Do not enable this setting for the Service RPC port or the DataNode lifeline port.
<property>
<name>ipc.8020.callqueue.impl</name>
<value>org.apache.hadoop.ipc.FairCallQueue</value>
</property>
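For reference, with both features enabled together on the client RPC port as recommended above (8020 assumed; adjust to your NameNode RPC port), core-site.xml would carry both of the settings discussed in this article:
<property>
<name>ipc.8020.callqueue.impl</name>
<value>org.apache.hadoop.ipc.FairCallQueue</value>
</property>
<property>
<name>ipc.8020.backoff.enable</name>
<value>true</value>
</property>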
Conclusion
That is it for part 3. In part 4 of this article we look at how to avoid a few common NameNode performance pitfalls.
07-07-2016
04:13 AM
17 Kudos
Introduction
This article is part of a series (see part 1, part 2, part 3 and part 5). Here we look at how to avoid some common NameNode performance issues.
Audience
This article is for Apache Hadoop administrators who are familiar with HDFS and its components. If you are using Ambari or Cloudera Manager, you should know how to manage services and configurations.
Avoid Disk Throughput Issues
The performance of the HDFS NameNode depends on being able to flush its edit logs to disk relatively quickly. Any delay in writing edit logs will hold up valuable RPC handler threads and can have cascading effects on the performance of your entire Hadoop cluster.
You should use a dedicated hard disk (i.e., a dedicated spindle) for the edit logs to keep edits flowing to disk smoothly. The location of the edit log is configured using the dfs.namenode.edits.dir setting in hdfs-site.xml. If dfs.namenode.edits.dir is not configured, it defaults to the same value as dfs.namenode.name.dir.
Example:
<property>
<name>dfs.namenode.name.dir</name>
<value>/mnt/disk1,/mnt/disk2</value>
</property>
If there are multiple comma-separated directories for dfs.namenode.name.dir, the NameNode will replicate its metadata to each of these directories. In the example above, it is recommended that /mnt/disk1 and /mnt/disk2 refer to dedicated spindles.
As a corollary, if you are using NameNode HA with QJM then the Journal Nodes should also use dedicated spindles for writing edit log transactions. The JournalNode edits directory can be configured via dfs.journalnode.edits.dir in hdfs-site.xml.
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/mnt/disk3</value>
</property>
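If you suspect the edit log disk is becoming a bottleneck, watching its latency during busy periods can confirm it. A rough sketch using iostat from the sysstat package (the interval is arbitrary; focus on the device backing the edit log directory):
# Extended device statistics every 5 seconds; high write latency on the edit log
# device is a warning sign.
iostat -x 5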
This leads us to the next pitfall.
Don't let Reads become Writes
The setting dfs.namenode.accesstime.precision controls how often the NameNode will update the last accessed time for each file. It is specified in milliseconds. If this value is too low, then the NameNode is forced to write an edit log transaction to update the file's last access time for each read request and its performance will suffer.
The default value of this setting is a reasonable 3600000 milliseconds (1 hour). We recommend going one step further and setting it to zero so that last access time updates are turned off. Add the following to your hdfs-site.xml.
<property>
<name>dfs.namenode.accesstime.precision</name>
<value>0</value>
</property>
Monitor Group lookup Performance
The NameNode frequently performs group lookups for users submitting requests. This is part of the normal access control mechanism. Group lookups may involve contacting an LDAP server. If the LDAP server is slow to process requests, it can hold up NameNode RPC handler threads and affect cluster performance. When the NameNode suffers from this problem it logs a warning that looks like this:
Potential performance problem: getGroups(user=bob) took 23861 milliseconds.
The default warning threshold is 5 seconds. If you see this log message you should investigate the performance of your LDAP server right away. There are a few workarounds if it is not possible to fix the LDAP server performance immediately.
Increase the group cache timeout: The NameNode caches the results of group lookups to avoid making frequent lookups. The cache duration is set via hadoop.security.groups.cache.secs in your core-site.xml file. The default value of 300 (5 minutes) is too low. We have seen this value set as high as 7200 (2 hours) in busy clusters to avoid frequent group lookups.
Increase the negative cache timeout: For users with no group mapping, the NameNode maintains a short negative cache. The default negative cache duration is 30 seconds and may be too low. You can increase this with the hadoop.security.groups.negative-cache.secs setting in core-site.xml. This setting is useful if you see frequent messages like "No groups available for user alice" in the NameNode logs. A sample core-site.xml snippet covering both cache settings appears after the static mapping example below.
Add a static mapping override for specific users: A static mapping is an in-memory mapping added via configuration that eliminates the need to lookup groups via any other mechanism. It can be a useful last resort to work around LDAP server performance issues that cannot be easily addressed by the cluster administrator. An example static mapping for two users 'alice' and 'bob' looks like this:
<property>
<name>hadoop.user.group.static.mapping.overrides</name>
<value>alice=staff,hadoop;bob=students,hadoop;</value>
</property>
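As a reference for the first two workarounds above, a core-site.xml snippet adjusting both cache durations might look like this (7200 matches the busy-cluster example above; the 180-second negative cache value is only an illustrative increase over the 30-second default):
<property>
<name>hadoop.security.groups.cache.secs</name>
<value>7200</value>
</property>
<property>
<name>hadoop.security.groups.negative-cache.secs</name>
<value>180</value>
</property>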
The HCC article Hadoop and LDAP: Usage, Load Patterns and Tuning has a lot more detail on debugging and optimizing Hadoop LDAP performance.
Monitor NameNode Logs for Excessive Spew
Sometimes the NameNode can generate excessive logging spew, which severely impacts its performance. Rendering logging strings is an expensive operation that consumes CPU cycles. It also generates a lot of small, short-lived objects (garbage), causing excessive Garbage Collection.
The Apache Hadoop community has made a number of fixes to reduce the verbosity of NameNode logs and these fixes are available in recent HDP maintenance releases. Example: HDFS-9434, HADOOP-12903, HDFS-9941, HDFS-9906 and many more.
Depending on your specific version of Apache Hadoop or HDP you may run into one or more of these problems. Also, new problems can be exposed by changing workloads. It is a good idea to monitor your NameNode logs periodically for excessive log spew and report it, either by contacting the Hortonworks support team or by filing a bug in the Apache Hadoop Jira.
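A rough way to spot log spew is to count which loggers are producing the most lines. A sketch assuming an HDP-style log location (adjust the path and file name for your installation):
# Count log lines by level and logger class in the current NameNode log; an unusually
# dominant logger near the top of the output is worth a closer look.
awk '{print $3, $4}' /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log | sort | uniq -c | sort -rn | head -20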
Conclusion
That is it for Part 4. In Part 5 of this series, we will explore how to optimize logging and block reports.
07-07-2016
04:13 AM
16 Kudos
Introduction
This article continues where part 1 left off. It describes a few more configuration settings that can be enabled in CDP, HDP, CDH, or Apache Hadoop clusters to help the NameNode scale better.
Audience
This article is for Hadoop administrators who are familiar with HDFS and its components. If you are using Ambari or Cloudera Manager you should know how to manage services and configurations. It is assumed that you have read part 1 of the article.
RPC Handler Count
The Hadoop RPC server consists of a single RPC queue per port and multiple handler (worker) threads that dequeue and process requests. If the number of handlers is insufficient, then the RPC queue starts building up and eventually overflows. You may start seeing task failures and eventually job failures and unhappy users.
It is recommended that the RPC handler count be set to 20 * log2(cluster size), with an upper limit of 200.
For example, for a 250-node cluster you would initialize this to 20 * log2(250), which rounds up to 160. The RPC handler count can be configured with the following setting in hdfs-site.xml.
<property>
<name>dfs.namenode.handler.count</name>
<value>160</value>
</property>
This heuristic is from the excellent Hadoop Operations book. If you are using Ambari to manage your cluster, this setting can be changed via a slider in the Ambari Server Web UI. If you are using Cloudera Manager, you can search for the property name "dfs.namenode.handler.count" under the HDFS configuration page and adjust the value.
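If you want to check the arithmetic for your own cluster, here is a small sketch of the heuristic above (the 250 is just the example cluster size; substitute your node count):
# Suggested NameNode RPC handler count: ceil(20 * log2(n)), capped at 200 (n = 250 here).
awk 'BEGIN { n = 250; h = 20 * log(n) / log(2); c = int(h); if (c < h) c++; if (c > 200) c = 200; print c }'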
Service RPC Handler Count
Prerequisite: If you have not enabled the service RPC port already, please do so first, as described in part 1.
There is no precise calculation for the service RPC handler count; however, the default value of 10 is too low for most production clusters. We have often seen it initialized to 50% of dfs.namenode.handler.count in busy clusters, and this value works well in practice.
For example, for the same 250-node cluster you would initialize the service RPC handler count with the following setting in hdfs-site.xml.
<property>
<name>dfs.namenode.service.handler.count</name>
<value>80</value>
</property>
DataNode Lifeline Protocol
The Lifeline protocol is a feature recently added by the Apache Hadoop Community (see Apache HDFS Jira HDFS-9239). It introduces a new lightweight RPC message that is used by the DataNodes to report their health to the NameNode. It was developed in response to problems seen in some overloaded clusters where the NameNode was too busy to process heartbeats and spuriously marked DataNodes as dead.
For a non-HA cluster, the feature can be enabled with the following setting in hdfs-site.xml (replace mynamenode.example.com with the hostname or IP address of your NameNode). The port number can be something other than 8050.
<property>
<name>dfs.namenode.lifeline.rpc-address</name>
<value>mynamenode.example.com:8050</value>
</property>
For an HA cluster, the lifeline RPC port can be enabled with settings like the following, replacing mycluster, nn1 and nn2 appropriately.
<property>
<name>dfs.namenode.lifeline.rpc-address.mycluster.nn1</name>
<value>mynamenode1.example.com:8050</value>
</property>
<property>
<name>dfs.namenode.lifeline.rpc-address.mycluster.nn2</name>
<value>mynamenode2.example.com:8050</value>
</property>
Additional lifeline protocol settings are documented in the HDFS-9239 release note but these can be left at their default values for most clusters.
Note: Changing the lifeline protocol settings requires a restart of the NameNodes, DataNodes and ZooKeeper Failover Controllers to take full effect. If you have NameNode HA setup, you can restart the NameNodes one at a time followed by a rolling restart of the remaining components to avoid cluster downtime.
Conclusion
That is it for Part 2. In Part 3 of this article, we will explore how to enable two new HDFS features to help your NameNode scale better.
07-07-2016
04:13 AM
29 Kudos
Introduction
With HDFS HA, the NameNode is no longer a single point of failure in a Hadoop cluster. However the performance of a single NameNode can often limit the performance of jobs across the cluster. The Apache Hadoop community has made a number of NameNode scalability improvements. This series of articles (also see part 2, part 3, part 4 and part 5) explains how you can make configuration changes to help your NameNode scale better and also improve its diagnosability.
Audience
This article is for Hadoop administrators who are familiar with HDFS and its components. If you are using Ambari or Cloudera Manager you should know how to manage services and configurations.
HDFS Audit Logs
The HDFS audit log is a separate log file that contains one entry for each user request. The audit log is a key resource for offline analysis of application traffic to find out users/jobs that generate excessive load on the NameNode. Careful analysis of the HDFS audit logs can also identify application bugs/misconfiguration.
Each audit log entry looks like the following:
2015-11-16 21:00:00,109 INFO FSNamesystem.audit: allowed=true ugi=bob (auth:SIMPLE) ip=/000.000.0.0 cmd=listStatus src=/app-logs/pcd_batch/application_1431545431771/ dst=null perm=null
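Once audit logging is enabled, a simple aggregation often surfaces the heaviest users. A rough sketch, assuming the audit log is written to the hdfs-audit.log location configured in the appender settings shown below:
# Top 10 users by request count in the current audit log (path is illustrative; use the
# file configured for the DRFAAUDIT appender).
grep -o 'ugi=[^ ]*' /var/log/hadoop/hdfs/hdfs-audit.log | sort | uniq -c | sort -rn | head -10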
Audit logging can be enabled via the Hadoop log4j.properties file (typically found in /etc/hadoop/conf, or under the "process" directory when using CM) with the following settings:
# hdfs audit logging
hdfs.audit.logger=INFO,console
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
log4j.appender.DRFAAUDIT=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
log4j.appender.DRFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
log4j.appender.DRFAAUDIT.DatePattern=.yyyy-MM-dd
Also, check the NameNode startup options in hadoop-env.sh to ensure there is no conflicting override for the hdfs.audit.logger setting. The following is an example of a correct HADOOP_NAMENODE_OPTS setting.
export HADOOP_NAMENODE_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=8g -XX:MaxNewSize=8g -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms64g -Xmx64g -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT ${HADOOP_NAMENODE_OPTS}"
Note: If you enable HDFS audit logs, you should also enable Async Audit Logging as described in the next section.
Async Audit Logging
Using an async log4j appender for the audit logs significantly improves the performance of the NameNode. This setting is critical for the NameNode to scale beyond 10,000 requests/second. Add the following to your hdfs-site.xml.
<property>
<name>dfs.namenode.audit.log.async</name>
<value>true</value>
</property>
If you are managing your cluster with Ambari, this setting is already enabled by default. If you're using CDH, you'll want to add the property to "NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml". For a CDP cluster, this property is exposed in the UI and enabled by default.
Service RPC port
The service RPC port gives the DataNodes a dedicated port to report their status via block reports and heartbeats. The port is also used by the ZooKeeper Failover Controllers for the periodic health checks performed by the automatic failover logic. The port is never used by client applications, so it reduces RPC queue contention between client requests and DataNode messages.
For a non-HA cluster, the service RPC port can be enabled with the following setting in hdfs-site.xml (replacing mynamenode.example.com with the hostname or IP address of your NameNode). The port number can be something other than 8040.
<property>
<name>dfs.namenode.servicerpc-address</name>
<value>mynamenode.example.com:8040</value>
</property>
For an HA cluster, the service RPC port can be enabled with settings like the following, replacing mycluster, nn1 and nn2 appropriately.
<property>
<name>dfs.namenode.servicerpc-address.mycluster.nn1</name>
<value>mynamenode1.example.com:8040</value>
</property>
<property>
<name>dfs.namenode.servicerpc-address.mycluster.nn2</name>
<value>mynamenode2.example.com:8040</value>
</property>
If you have enabled Automatic Failover using ZooKeeper Failover Controllers, an additional step is required to reset the ZKFC state in ZooKeeper. Stop both the ZKFC processes and run the following command as the hdfs superuser.
hdfs zkfc -formatZK
Note: Changing the service RPC port settings requires a restart of the NameNodes, DataNodes and ZooKeeper Failover Controllers to take full effect. If you have NameNode HA setup, you can restart the NameNodes one at a time followed by a rolling restart of the remaining components to avoid cluster downtime.
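After the restarts, a quick way to confirm that the NameNode is listening on the new service RPC port is to check for the listening socket on the NameNode host. A sketch, assuming port 8040 as in the examples above:
# Run on the NameNode host; use netstat -ltn instead if ss is not available.
ss -ltn | grep 8040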
NameNode JVM Garbage Collection
Garbage Collection is important enough to merit its own article. Please refer to NameNode Garbage Collection Configuration: Best Practices and Rationale.
Conclusion
That's it for Part 1. In part 2, part 3, part 4 and part 5 we will look at a number of additional configuration settings to help your NameNode scale better.
06-29-2016
08:40 PM
4 Kudos
Hi @Thomas Larsson, the DataNode will perform a simple disk check operation in response to certain IO errors. The disk check verifies that the DataNode's storage directory root is readable, writable and executable. If any of these checks fail, the DataNode will mark the volume as failed. HDFS failed-disk detection could be better than it is today; we have seen instances where these checks are insufficient to detect volume failures. It is a hard problem in general, since disks fail in Byzantine ways where some but not all IOs may fail, or a subset of directories on the disk becomes inaccessible.
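For reference, a rough manual approximation of what that disk check does, assuming /hadoop/hdfs/data is one of the configured dfs.datanode.data.dir volumes (the path is illustrative):
# Readable/executable: can we list and traverse the volume root?
ls /hadoop/hdfs/data > /dev/null
# Writable: can we create and remove a file there?
touch /hadoop/hdfs/data/.disk-check-probe && rm -f /hadoop/hdfs/data/.disk-check-probe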
06-23-2016
11:39 PM
Hi @Jitendra Yadav, you are right - that particular message is harmless and usually a side effect of the Ambari health check. The fix for this issue was actually made in HDFS via HDFS-9572.
06-23-2016
11:35 PM
2 Kudos
Hi @Xiaobing Zhou, I think the requirement to have shell(/bin/true), which is essentially a no-op fencer, can be eliminated. There is no technical reason to require the no-op fencer. The code that instantiates a fencer is in NodeFencer.java:
public static NodeFencer create(Configuration conf, String confKey)
    throws BadFencingConfigurationException {
  String confStr = conf.get(confKey);
  if (confStr == null) {
    // No fencing methods configured; the caller gets no fencer at all.
    return null;
  }
  return new NodeFencer(conf, confStr);
}
A potential improvement is to instantiate a dummy fencer if dfs.ha.fencing.methods is undefined, i.e., the confStr == null case above. A rough sketch of that idea follows.
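For illustration only (this is not actual Hadoop code), the fallback could simply reuse the existing shell(/bin/true) method when nothing is configured:
public static NodeFencer create(Configuration conf, String confKey)
    throws BadFencingConfigurationException {
  String confStr = conf.get(confKey);
  if (confStr == null) {
    // Hypothetical fallback: treat "no fencing methods configured" as
    // "fencing always succeeds" instead of returning null.
    return new NodeFencer(conf, "shell(/bin/true)");
  }
  return new NodeFencer(conf, confStr);
}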
06-17-2016
09:34 PM
Hi, as a start you can try the following. Check the DataNodes tab of the NameNode web UI to verify that DataNodes are heartbeating and that there are no failed disks. Check the DataNode logs to see whether block reports were successfully sent after the restart (you should see a log statement like "Successfully sent block report"). The access token error indicates a permissions problem; if you have enabled Kerberos, check that the clocks are in sync on all the nodes.
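If you prefer the command line to the web UI for the first check, a sketch (run as a user with HDFS superuser privileges):
# Lists live and dead DataNodes as seen by the NameNode, including last-contact times.
hdfs dfsadmin -report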
06-17-2016
09:23 PM
Hortonworks recommends using the default RoundRobin policy.