Support Questions

mliu · ‎05-28-2016

Our customer has an HA-enabled cluster, and after an automatic failover the active NameNode (NN) is very slow. This thread is meant as a central place for collecting all the possible reasons for a slow NN.

Problem definition:

The NN is responding slowly. It does not crash, and most operations succeed.
By slow, I have two pieces of evidence: a) the hdfs dfs -ls / command is sluggish; b) from the JMX metrics, the average RPC processing time is as high as 3~10 seconds.
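
For anyone who wants to reproduce measurement (b), here is a minimal Python sketch that reads the average RPC queue and processing times from the NN's JMX JSON output. It assumes the NN web UI serves /jmx and that the RPC bean name contains name=RpcActivityForPort followed by the RPC port. The host name, the ports (50070 and 8020 here), and the 1000 ms threshold are placeholders, not values from this cluster.

```python
import json
from urllib.request import urlopen


def fetch_jmx(url):
    """Fetch and decode a JMX JSON document from the NameNode web UI."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def rpc_times_ms(jmx):
    """Return {bean name: (avg queue ms, avg processing ms)} for RPC beans."""
    result = {}
    for bean in jmx.get("beans", []):
        name = bean.get("name", "")
        # Assumption: RPC beans are named ...,name=RpcActivityForPort<port>
        if "name=RpcActivityForPort" in name:
            result[name] = (
                bean.get("RpcQueueTimeAvgTime"),
                bean.get("RpcProcessingTimeAvgTime"),
            )
    return result


def slow_beans(jmx, threshold_ms=1000.0):
    """List RPC beans whose average processing time exceeds threshold_ms."""
    return [
        name
        for name, (_, proc) in rpc_times_ms(jmx).items()
        if proc is not None and proc > threshold_ms
    ]


# Example call (hypothetical host and ports, adjust to your cluster):
# jmx = fetch_jmx("http://nn-host:50070/jmx"
#                 "?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020")
# print(slow_beans(jmx))
```

Polling this every minute or so and comparing the queue time against the processing time gives a rough idea of whether calls are waiting for a handler or are slow once they are being processed.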

For answers, please list the possible reasons along with their solutions.

jing · ‎05-28-2016

Good question, @Mingliang Liu!

There are many reasons for NN slowness:

1. GC. Long GC pauses may be caused by a heavy workload such as block reports or a large number of RPC calls, by wrong heap settings, etc. GC pauses can be identified from the NN GC log. A large number of block reports can be identified by checking the NN log and the JMX metrics; the RPC load can be verified through the JMX metrics and the HDFS audit log.

2. Slow editlog persistence. This may be caused by a NameNode local disk issue, a network issue between the NameNode and the JournalNodes, or a JournalNode disk issue. It can usually be identified from the NN/JN logs.

3. Slow editlog reading. This usually happens in the standby NN. From the NN log you can easily compute the reading speed of the editlog tailer inside the standby NN.

4. Known/unknown HDFS bugs. Usually you can take multiple thread dumps of the NN and check whether some thread is blocked by a certain operation. If you find anything suspicious, ask for help here or send an email to the Apache HDFS/Hadoop user mailing list.
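
To illustrate point 4, here is a rough Python sketch that compares several thread dumps and reports the threads that stay in the same state across them. It assumes the dumps are plain jstack-style text, where each thread starts with its quoted name and is followed by a java.lang.Thread.State: line. The function names and the default of looking only at BLOCKED are my own choices for the example. Threads waiting on a java.util.concurrent lock show up as WAITING instead, so you may want to pass both states.

```python
import re
from collections import Counter

# Header line of a thread in jstack-style output, e.g.
# "IPC Server handler 3 on 8020" #57 daemon prio=5 ...
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
THREAD_STATE = re.compile(r'^\s+java\.lang\.Thread\.State: (?P<state>[A-Z_]+)')


def threads_in_states(dump_text, states=("BLOCKED",)):
    """Return the set of thread names whose state is in `states` in one dump."""
    found = set()
    current = None
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current = header.group("name")
            continue
        state = THREAD_STATE.match(line)
        if state and current and state.group("state") in states:
            found.add(current)
    return found


def persistently_stuck(dumps, states=("BLOCKED",), min_fraction=1.0):
    """Names of threads found in `states` in at least min_fraction of dumps."""
    counts = Counter()
    for text in dumps:
        counts.update(threads_in_states(text, states))
    needed = min_fraction * len(dumps)
    return sorted(name for name, n in counts.items() if n >= needed)


# Example (hypothetical file names):
# dumps = [open(p).read() for p in ("nn-dump-1.txt", "nn-dump-2.txt")]
# print(persistently_stuck(dumps, states=("BLOCKED", "WAITING")))
```

A thread that appears in every dump is only a starting point. You still need to read its stack trace to see which operation or lock it is stuck on.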

namaheshwari · ‎05-30-2016

Thanks @Jing Zhao for the answer.
