
mliu · ‎05-28-2016

Our customer has an HA-enabled cluster, and after an automatic failover the active NameNode (NN) is very slow. This thread is a central place to collect the possible causes of a slow NN.

Problem definition:

The NN responds slowly, but it does not crash and most operations still succeed.
By slow, I mean two things: a) the hdfs dfs -ls / command is sluggish; b) per the JMX metrics, the average RPC processing time is as high as 3 to 10 seconds.
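To make symptom b) reproducible, here is a minimal sketch of reading the RPC averages out of the NN's JMX JSON. It assumes the usual `/jmx` servlet output (a top-level `beans` list) and the `RpcQueueTimeAvgTime` / `RpcProcessingTimeAvgTime` attributes on the `RpcActivityForPort<port>` bean, both reported in milliseconds. The sample payload and the `fetch_jmx` / `rpc_avg_times` helpers are made up for illustration.

```python
import json
from urllib.request import urlopen

RPC_BEAN_MARKER = "name=RpcActivityForPort"

def fetch_jmx(url):
    """Download the raw JMX JSON text from a NameNode HTTP endpoint."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")

def rpc_avg_times(jmx_text):
    """Map each RPC bean name to (queue_avg_ms, processing_avg_ms)."""
    result = {}
    for bean in json.loads(jmx_text).get("beans", []):
        name = bean.get("name", "")
        if RPC_BEAN_MARKER in name:
            result[name] = (
                bean.get("RpcQueueTimeAvgTime"),
                bean.get("RpcProcessingTimeAvgTime"),
            )
    return result

# Made-up payload for illustration only; real output has many more beans.
SAMPLE = """
{"beans": [
  {"name": "Hadoop:service=NameNode,name=RpcActivityForPort8020",
   "RpcQueueTimeAvgTime": 120.5,
   "RpcProcessingTimeAvgTime": 3400.0},
  {"name": "Hadoop:service=NameNode,name=JvmMetrics", "GcCount": 42}
]}
"""

for bean, (queue_ms, proc_ms) in rpc_avg_times(SAMPLE).items():
    print(f"{bean}: queue={queue_ms} ms, processing={proc_ms} ms")
```

Against a live NN this would be called as `rpc_avg_times(fetch_jmx("http://<nn-host>:<http-port>/jmx"))`, where host and port are placeholders for your own cluster. A high queue time with a low processing time points at handler starvation rather than slow individual operations.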

For answers, please list each possible cause along with its solution.

jing · ‎05-28-2016

Good question, @Mingliang Liu!

There are many reasons for NN slowness:

1. GC. Long GC pauses may be caused by heavy workload such as block reports or a large number of RPC calls, by wrong heap settings, etc. GC pauses can be identified from the NN GC log. A large number of block reports can be identified by checking the NN log and the JMX metrics; the RPC load can be verified through the JMX metrics and the HDFS audit log.
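For the audit log part of item 1, a rough sketch of counting operations per command and per user could look like the following. It assumes the common audit line layout with `ugi=` and `cmd=` fields after an `FSNamesystem.audit` logger tag; the sample lines, users, and the `count_audit_ops` helper are invented for illustration.

```python
import re
from collections import Counter

CMD = re.compile(r"\bcmd=(\S+)")
UGI = re.compile(r"\bugi=(\S+)")

def count_audit_ops(lines):
    """Return (ops per command, ops per user) from HDFS audit log lines."""
    by_cmd, by_user = Counter(), Counter()
    for line in lines:
        if "FSNamesystem.audit" not in line:
            continue  # skip anything that is not an audit record
        cmd, ugi = CMD.search(line), UGI.search(line)
        if cmd:
            by_cmd[cmd.group(1)] += 1
        if ugi:
            by_user[ugi.group(1)] += 1
    return by_cmd, by_user

# Made-up audit lines for illustration only.
SAMPLE_LINES = [
    "2016-05-28 10:00:00,001 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:SIMPLE)\tip=/10.0.0.1\tcmd=listStatus\tsrc=/\tdst=null\tperm=null",
    "2016-05-28 10:00:00,002 INFO FSNamesystem.audit: allowed=true\tugi=bob (auth:SIMPLE)\tip=/10.0.0.2\tcmd=open\tsrc=/tmp/a\tdst=null\tperm=null",
    "2016-05-28 10:00:00,003 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:SIMPLE)\tip=/10.0.0.1\tcmd=listStatus\tsrc=/user\tdst=null\tperm=null",
]

by_cmd, by_user = count_audit_ops(SAMPLE_LINES)
print(by_cmd.most_common(5))
print(by_user.most_common(5))
```

Running this over the window where the NN was slow shows whether one user or one command type dominates the RPC load.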

2. Slow editlog persistence. This may be caused by a local disk issue on the NameNode, a network issue between the NameNode and the JournalNodes, or a JournalNode disk issue. It can usually be identified from the NN/JN logs.
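One way to check item 2 from the NN log is to look for slow journal writes. The sketch below assumes the NN emits warnings of the form "Took <N>ms to send a batch of <K> edits (<B> bytes) to remote journal <addr>"; that wording is quoted from memory and may differ between Hadoop versions, so adjust the pattern to what your log actually contains. The sample lines and the `slow_journal_writes` helper are made up.

```python
import re

# Assumed message format; verify it against your own NN log.
SLOW_SEND = re.compile(
    r"Took (\d+)ms to send a batch of (\d+) edits \((\d+) bytes\) "
    r"to remote journal (\S+)"
)

def slow_journal_writes(lines):
    """Return {journal address: (warning count, max latency in ms)}."""
    stats = {}
    for line in lines:
        m = SLOW_SEND.search(line)
        if not m:
            continue
        ms, journal = int(m.group(1)), m.group(4)
        count, worst = stats.get(journal, (0, 0))
        stats[journal] = (count + 1, max(worst, ms))
    return stats

# Made-up log lines for illustration only.
SAMPLE_LINES = [
    "2016-05-28 10:00:01,000 WARN client.QuorumJournalManager: Took 1500ms to send a batch of 1 edits (17 bytes) to remote journal 10.0.0.1:8485",
    "2016-05-28 10:00:05,000 WARN client.QuorumJournalManager: Took 4200ms to send a batch of 3 edits (512 bytes) to remote journal 10.0.0.1:8485",
    "2016-05-28 10:00:06,000 WARN client.QuorumJournalManager: Took 1100ms to send a batch of 1 edits (17 bytes) to remote journal 10.0.0.2:8485",
]

for journal, (count, worst) in slow_journal_writes(SAMPLE_LINES).items():
    print(f"{journal}: {count} slow sends, worst {worst} ms")
```

If only one JournalNode shows up, suspect that node's disk or its network path; if all of them do, suspect the NN side.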

3. Slow editlog reading. This usually happens in the standby NN. From the NN log you can compute the reading speed of the editlog tailer inside the standby NN.
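The reading speed in item 3 can be estimated with something like the sketch below. It assumes the standby NN logs lines ending in "of size <bytes> edits # <count> loaded in <N> seconds" for each edit segment it loads; that format is an assumption from memory, so check it against your log. The sample lines, the file names in them, and the `tailer_rate` helper are invented.

```python
import re

# Assumed message format; verify it against your own standby NN log.
LOADED = re.compile(r"of size (\d+) edits # (\d+) loaded in (\d+) seconds")

def tailer_rate(lines):
    """Return (total edits, total seconds, edits per second or None)."""
    total_edits = total_secs = 0
    for line in lines:
        m = LOADED.search(line)
        if m:
            total_edits += int(m.group(2))
            total_secs += int(m.group(3))
    # The log reports whole seconds, so the rate is only approximate.
    rate = total_edits / total_secs if total_secs else None
    return total_edits, total_secs, rate

# Made-up log lines for illustration only.
SAMPLE_LINES = [
    "2016-05-28 10:00:00,000 INFO namenode.FSImage: Edits file segment-a of size 1048576 edits # 1000 loaded in 2 seconds",
    "2016-05-28 10:02:00,000 INFO namenode.FSImage: Edits file segment-b of size 524288 edits # 500 loaded in 3 seconds",
]

edits, secs, rate = tailer_rate(SAMPLE_LINES)
print(f"{edits} edits in {secs} s, about {rate} edits/s")
```

A rate that is much lower than the rate at which the active NN produces edits means the standby falls behind, which also lengthens the catch-up phase during failover.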

4. Known/unknown HDFS bugs. You can usually take multiple thread dumps of the NN and check whether some thread is blocked on a certain operation. If you find anything suspicious, ask for help here or send an email to the Apache HDFS/Hadoop user mailing list.
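To compare several thread dumps as suggested in item 4, a small summary per dump helps. The sketch below assumes jstack-style text (a `java.lang.Thread.State:` line per thread, plus `- waiting to lock <addr>` or `- parking to wait for <addr>` lines) and counts thread states and the most contended lock addresses. The sample dump and the `summarize_dump` helper are made up.

```python
import re
from collections import Counter

STATE = re.compile(r"java\.lang\.Thread\.State: (\w+)")
LOCK = re.compile(r"- (?:waiting to lock|parking to wait for)\s+<(0x[0-9a-f]+)>")

def summarize_dump(text):
    """Return (thread states, waiters per lock address) for one dump."""
    states, waiters = Counter(), Counter()
    for line in text.splitlines():
        m = STATE.search(line)
        if m:
            states[m.group(1)] += 1
            continue
        m = LOCK.search(line)
        if m:
            waiters[m.group(1)] += 1
    return states, waiters

# Made-up, heavily shortened dump for illustration only.
SAMPLE_DUMP = '''
"IPC Server handler 1 on 8020" #50 daemon prio=5 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
	- waiting to lock <0x00000000c0000000> (a java.lang.Object)

"IPC Server handler 2 on 8020" #51 daemon prio=5 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
	- waiting to lock <0x00000000c0000000> (a java.lang.Object)

"IPC Server handler 3 on 8020" #52 daemon prio=5 runnable
   java.lang.Thread.State: RUNNABLE
'''

states, waiters = summarize_dump(SAMPLE_DUMP)
print(dict(states))
print(waiters.most_common(3))
```

If the same lock address tops the list in dumps taken several seconds apart, the thread holding that lock is the one to look at in detail.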

