Created 05-28-2016 12:19 AM
This is a central place for collecting all possible reasons that cause a slow NameNode (NN).
Problem definition:
Our customer has an HA-enabled cluster, and after an automatic failover, the active NN is very slow.
For answers, please kindly list the possible reasons along with solutions.
Created 05-28-2016 12:40 AM
Good question, @Mingliang Liu!
There are many reasons for NN slowness:
1. GC. Long GC pauses may be caused by a heavy workload such as block reports or a large number of RPC calls, by wrong heap settings, etc. GC pauses can be identified from the NN GC log. A large number of block reports can be identified by checking the NN log and the JMX metrics; the RPC load can be verified through the JMX metrics and the HDFS audit log.
2. Slow editlog persistence. This may be caused by a NameNode local disk issue, a network issue between the NameNode and the JournalNodes, or a JournalNode disk issue. It can usually be identified from the NN/JN logs.
3. Slow editlog reading. This usually happens in the standby NN. From the NN log you can compute the reading speed of the editlog tailer inside the standby NN.
4. Known/unknown HDFS bugs. Take multiple thread dumps of the NN and check whether some thread is blocked by a certain operation. If you find anything suspicious, ask for help here or send an email to the Apache HDFS/Hadoop user mailing list.
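For the GC and RPC checks in item 1, here is a minimal sketch of pulling the relevant numbers from the NN's /jmx servlet. It assumes the NN web UI is on port 50070 and the client RPC port is 8020 (both are common Hadoop 2.x defaults, so adjust for your cluster); the function names fetch_jmx and summarize are just illustrative.

```python
import json
from urllib.request import urlopen


def fetch_jmx(host, port=50070, timeout=10):
    """Return the list of MBeans exposed by the NameNode /jmx servlet.

    The port is an assumption (Hadoop 2.x default web UI port).
    """
    with urlopen(f"http://{host}:{port}/jmx", timeout=timeout) as resp:
        return json.load(resp)["beans"]


def summarize(beans, rpc_port=8020):
    """Pick out the GC and RPC metrics that often explain a slow NN."""
    by_name = {b.get("name"): b for b in beans}
    jvm = by_name.get("Hadoop:service=NameNode,name=JvmMetrics", {})
    rpc = by_name.get(
        f"Hadoop:service=NameNode,name=RpcActivityForPort{rpc_port}", {}
    )
    gc_count = jvm.get("GcCount", 0)
    gc_ms = jvm.get("GcTimeMillis", 0)
    return {
        "gc_count": gc_count,
        # Average time per collection since the JVM started.
        "avg_gc_ms": gc_ms / gc_count if gc_count else 0.0,
        "heap_used_mb": jvm.get("MemHeapUsedM"),
        "rpc_queue_avg_ms": rpc.get("RpcQueueTimeAvgTime"),
        "rpc_processing_avg_ms": rpc.get("RpcProcessingTimeAvgTime"),
        "call_queue_length": rpc.get("CallQueueLength"),
    }
```

Calling summarize(fetch_jmx("nn-host")) a few times, a minute apart, shows whether GC time or the RPC queue time is growing. A high queue time with a low processing time usually points at lock contention or too few handlers rather than at slow individual operations.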
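For the thread dumps in item 4, a rough sketch of how several jstack outputs could be compared. It assumes HotSpot-style jstack text, where a waiting thread has a line such as "- waiting to lock <0x...> (a SomeClass)" or "- parking to wait for  <0x...> (a SomeClass)"; the helper names and the min_waiters threshold are made up for illustration.

```python
import re
from collections import Counter

# Matches the lock a thread is waiting for in HotSpot jstack output.
WAIT = re.compile(
    r"^\s*- (?:waiting to lock|parking to wait for)\s+"
    r"<(0x[0-9a-f]+)> \(a ([^)]+)\)"
)


def contended_locks(dump_text):
    """Count, per (address, class), the threads waiting for that lock."""
    counts = Counter()
    for line in dump_text.splitlines():
        m = WAIT.match(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


def persistent_hotspots(dumps, min_waiters=2):
    """Return lock classes with >= min_waiters waiting threads in every dump."""
    per_dump = []
    for text in dumps:
        by_class = Counter()
        for (_addr, cls), n in contended_locks(text).items():
            by_class[cls] += n
        per_dump.append({c for c, n in by_class.items() if n >= min_waiters})
    return set.intersection(*per_dump) if per_dump else set()
```

Idle pool threads parked on a queue condition will also show up, so the result is only a list of candidates to read by hand. If many IPC handler threads wait on the same read/write lock in every dump, look for the one thread that holds it and check what it is doing.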
Created 05-30-2016 09:12 PM
Thanks @Jing Zhao for the answer.