Created 01-06-2015 06:04 AM
We appear to be having a problem where, throughout the day, NAME_NODE_RPC_LATENCY alerts trigger and then clear right after.
This is one from this morning:
NAME_NODE_RPC_LATENCY has become bad: The moving average of the RPC latency is 11.7 second(s) over the previous 5 minute(s). The moving average of the queue time is 5.3 second(s). The moving average of the processing time is 6.4 second(s). Critical threshold: 5 second(s).
I have not been able to find anything in my books or on the internet about how to mitigate this (other than perhaps adding more namenodes using HDFS federation). When I look at the server running the namenode, I am not seeing any heavy utilization of CPU, nor am I seeing memory being strained. The only indication of anything in this time frame is a network spike.
Is there anything I can do to mitigate this problem? We do not have jumbo frames enabled, if that makes a difference.
Created 01-06-2015 06:24 AM
I checked the service log you mentioned and am not seeing anything related to the namenode (I also searched the whole log for NAME_NODE_RPC_LATENCY and found nothing).
Here are the JVM arguments passed at startup (Xms, Xmx, etc.):
HADOOP_NAMENODE_OPTS=-Xms2779774976 -Xmx2779774976 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=3000 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -XX:OnOutOfMemoryError={{AGENT_COMMON_DIR}}/killparent.sh
HADOOP_LOGFILE=hadoop-cmf-hdfs-NAMENODE-usapname01.ihtech.com.log.out
HADOOP_AUDIT_LOGGER=INFO,RFAAUDIT
HADOOP_ROOT_LOGGER=INFO,RFA
CDH_VERSION=5
HADOOP_LOG_DIR=/var/log/hadoop-hdfs
HADOOP_SECURITY_LOGGER=INFO,RFAS
Here are the stats from the dfshealth page of the namenode.
Summary
Security is off.
Safemode is off.
816955 files and directories, 953582 blocks = 1770537 total filesystem object(s).
Heap Memory used 1.2 GB of 2.5 GB Heap Memory. Max Heap Memory is 2.5 GB.
Non Heap Memory used 70.28 MB of 97.38 MB Committed Non Heap Memory. Max Non Heap Memory is 130 MB.
Configured Capacity: 503.99 TB
DFS Used: 199.13 TB
Non DFS Used: 25.21 TB
DFS Remaining: 279.64 TB
DFS Used%: 39.51%
DFS Remaining%: 55.49%
Block Pool Used: 199.13 TB
Block Pool Used%: 39.51%
DataNodes usages% (Min/Median/Max/stdDev): 33.93% / 40.66% / 43.58% / 2.85%
Live Nodes: 12 (Decommissioned: 0)
Dead Nodes: 0 (Decommissioned: 0)
Decommissioning Nodes: 0
Number of Under-Replicated Blocks: 0
Number of Blocks Pending Deletion: 0
NameNode Journal Status
Current transaction ID: 140817035
Journal Manager State
QJM to [10.32.88.30:8485, 10.32.88.23:8485, 10.32.88.24:8485] | Writing segment beginning at txid 140817035. 10.32.88.30:8485 (Written txid 140817035), 10.32.88.23:8485 (Written txid 140817035), 10.32.88.24:8485 (Written txid 140817035) |
FileJournalManager(root=/data/dfs/nn) | EditLogFileOutputStream(/data/dfs/nn/current/edits_inprogress_0000000000140817035) |
Created 01-06-2015 06:24 AM
I have also uploaded a picture of all the performance graphs for this node here.
Created 05-12-2015 05:08 PM
Have you checked whether spikes in the RPC call queue length correlate with the alerts? If they do, you could check the NameNode logs to see why there's a sudden burst of activity once in a while; maybe one of your client applications can be tuned.
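One way to watch these metrics is through the NameNode JMX endpoint. This is a minimal sketch that assumes the default web UI port (50070) and RPC port (8020); replace the placeholder hostname and both ports to match your cluster:
curl -s 'http://namenode.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020' \
  | grep -E 'CallQueueLength|RpcQueueTimeAvgTime|RpcProcessingTimeAvgTime'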
Created 05-13-2015 04:12 AM
Greetings GautamG!
Thanks for your reply.
I have actually resolved this issue which was caused by a combination of two factors.
1st - Way too many small files were being created by Pig denormalization jobs. Jobs processing this result data were making too many RPC calls at once, basically performing a denial-of-service attack on the namenode (around 1800 RPC calls a second at the failure point).
http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/
2nd - Soon after resolving the small-files problem, another problem arose: seemingly random high queue processing load. At the time this appeared very odd, because the small-files problem had been resolved and the number of RPC calls was well within stable bounds.
The first issue turned out to be quite easy to solve. The Pig jobs were modified to roll up all these small files into much larger files (not large on the scale of Hadoop), so that instead of hundreds of thousands of files we were able to get down to just a few hundred. A sketch of the approach is below.
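This is only a minimal sketch of the roll-up idea, not our exact job; the paths are hypothetical and the property names and values should be checked against your Pig version:
# Map-only pass-through: with split combination enabled, many small input
# files are packed into a few ~512 MB splits, so the job writes a few large
# output files instead of hundreds of thousands of tiny ones.
cat > rollup.pig <<'EOF'
raw = LOAD '/data/denorm/small' USING PigStorage();
STORE raw INTO '/data/denorm/rolled_up' USING PigStorage();
EOF
pig -Dpig.splitCombination=true -Dpig.maxCombinedSplitSize=536870912 rollup.pig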
The second issue was far more interesting. It was difficult to troubleshoot and was actually resolved by accident. There is almost no documentation on what would cause "high queue processing load", which eventually led to me pulling down the source code for HDFS. It turned out to be caused by the JobHistory server trying to pull logs from /tmp/logs. The JobHistory server was running into permission failures when trying to pull those logs, which would lead to high queue processing time and eventually a backlog in the RPC queue.
ProTip from my pain: high queue processing time happens when a job or user tries to operate against files in HDFS that they do not have permissions for. The big problem with this is that even your least privileged users can cause cluster instability just by trying to operate against files they do not have permission to access.
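A quick way to spot this from the NameNode host is to look for denied operations in the HDFS audit log. A minimal sketch; hdfs-audit.log under /var/log/hadoop-hdfs is the usual default location, but the exact file name and log format can differ by distribution:
# Count denied operations, then see which users are generating them.
grep -c 'allowed=false' /var/log/hadoop-hdfs/hdfs-audit.log
grep 'allowed=false' /var/log/hadoop-hdfs/hdfs-audit.log | grep -oE 'ugi=[^ ]+' | sort | uniq -c | sort -rn | head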
I found a problem with the JobHistory logs failing to be retrieved in the browser.
In my case this turned out to be caused by some cleanup work I had done in the HDFS /tmp directory. I had accidentally removed /tmp/logs without realizing it is the storage location of the YARN JobHistory server logs.
The folder was automatically recreated by the YARN process, but without the ownership and permissions the JobHistory server needs to operate.
It turns out this problem was caused by previous efforts to clean up the /tmp filespace in HDFS.
The directory /tmp/logs is used by the JobHistory server to store logs for retrieval. When the directory was removed as being "aged out", it was automatically recreated, but with the wrong ownership (it was recreated as yarn:supergroup instead of mapred:hadoop).
The directory relies on HDFS group inheritance: new files and directories inherit the group of their parent directory, while the sticky bit keeps users from deleting each other's files. With the correct ownership, all new files and folders would be group-owned by "hadoop". In this case they ended up group-owned by "supergroup", which prevented the JobHistory server from being able to read the logs.
You can see where I commented in this thread on how to fix it.
INCORRECT:
drwxrwxrwt - yarn superuser 0 2015-05-10 09:28 /tmp/logs
[root@namenode hadoop-hdfs]# sudo -u hdfs hdfs dfs -ls /tmp/logs/
Found 13 items
drwxrwx--- - user1 supergroup 0 2015-05-10 09:28 /tmp/logs/user1
drwxrwx--- - user2 supergroup 0 2015-05-06 12:46 /tmp/logs/user2
drwxrwx--- - user3 supergroup 0 2015-05-06 12:54 /tmp/logs/user3
FIXED:
drwxrwxrwt - mapred hadoop 0 2015-05-10 09:28 /tmp/logs
[root@namenode hadoop-hdfs]# sudo -u hdfs hdfs dfs -ls /tmp/logs/
Found 13 items
drwxrwx--- - user1 hadoop 0 2015-05-10 09:28 /tmp/logs/user1
drwxrwx--- - user2 hadoop 0 2015-05-06 12:46 /tmp/logs/user2
drwxrwx--- - user3 hadoop 0 2015-05-06 12:54 /tmp/logs/user3
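For reference, a minimal sketch of the commands that restore the ownership and mode shown in the FIXED listing above; the mapred:hadoop owner/group and the 1777 mode are what our cluster expected, so verify the correct values for your distribution before applying them:
sudo -u hdfs hdfs dfs -chown mapred:hadoop /tmp/logs
sudo -u hdfs hdfs dfs -chgrp -R hadoop /tmp/logs
sudo -u hdfs hdfs dfs -chmod 1777 /tmp/logs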
Created on 05-13-2015 12:18 PM - edited 05-13-2015 12:18 PM
Thank you for sharing in detail.