Possible causes of a slow NameNode (NN)
- Labels: Apache Hadoop
Created ‎05-28-2016 12:19 AM
Our customer has an HA-enabled cluster, and after an automatic failover, the active NameNode (NN) is very slow. This thread is a central place for collecting all the possible causes of a slow NN.
Problem definition:
- The NN responds slowly. It does not crash, and most operations succeed.
- By slow, I mean two things: a) the hdfs dfs -ls / command is sluggish; b) according to the JMX metrics, the average RPC processing time is as high as 3 to 10 seconds.
For answers, please list each possible cause along with its solution.
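For anyone who wants to reproduce the second measurement, here is a minimal sketch of reading the RPC figures from the NameNode's JMX servlet. It assumes the Hadoop 2.x default ports (50070 for the NN web UI, 8020 for client RPC) and the `RpcActivityForPort<port>` bean; the function names and the 1000 ms threshold are only illustrative, so adjust them to your cluster.

```python
import json
from urllib.request import urlopen


def jmx_url(host, http_port=50070, rpc_port=8020):
    """Build the JMX query URL. The ports are assumed defaults, not cluster facts."""
    return (f"http://{host}:{http_port}/jmx"
            f"?qry=Hadoop:service=NameNode,name=RpcActivityForPort{rpc_port}")


def rpc_summary(jmx_payload):
    """Pick the RPC latency and queue figures out of a parsed /jmx response."""
    beans = jmx_payload.get("beans", [])
    if not beans:
        raise ValueError("no RpcActivity bean in the JMX response")
    bean = beans[0]
    keys = ("RpcProcessingTimeAvgTime", "RpcQueueTimeAvgTime", "CallQueueLength")
    return {key: bean.get(key) for key in keys}


def is_slow(summary, threshold_ms=1000.0):
    """True when the average RPC processing time (milliseconds) reaches the threshold."""
    value = summary.get("RpcProcessingTimeAvgTime")
    return value is not None and value >= threshold_ms


def fetch_rpc_summary(host):
    """Fetch and summarise the live metrics from one NameNode."""
    with urlopen(jmx_url(host), timeout=10) as resp:
        return rpc_summary(json.load(resp))
```

Calling `fetch_rpc_summary("<nn-host>")` every few seconds and comparing `RpcProcessingTimeAvgTime` with `RpcQueueTimeAvgTime` gives a rough idea of whether calls are slow to execute or are mostly waiting in the call queue.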
Created ‎05-28-2016 12:40 AM
Good question, @Mingliang Liu!
There are many reasons for NN slowness:
1. GC. Long GC pauses may be caused by a heavy workload such as block reports or a large number of RPC calls, by wrong heap settings, etc. GC pauses can be identified from the NN GC log. A large number of block reports can be identified by checking the NN log and the JMX metrics; the RPC load can be verified through the JMX metrics and the HDFS audit log.
2. Slow editlog persistence. This may be caused by a local disk issue on the NameNode, sometimes by a network issue between the NameNode and the JournalNodes, or by a JournalNode disk issue. It can usually be identified from the NN/JN logs.
3. Slow editlog reading. This usually happens in the standby NN. From the NN log you can easily compute the reading speed of the editlog tailer inside the standby NN.
4. Known/unknown HDFS bugs. Usually you can take multiple thread dumps of the NN and check whether some thread is blocked by a certain operation. If you find anything suspicious, ask for help here or send an email to the Apache HDFS/Hadoop user mailing list.
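For the thread-dump check in point 4, comparing dumps by hand gets tedious, so here is a rough sketch that lists the threads that stay in the same state across several dumps. It assumes the dumps are plain-text `jstack` output (a quoted thread name on the header line, followed by a `java.lang.Thread.State:` line); the function names are made up for illustration.

```python
import re

# jstack prints the thread name in double quotes at the start of a header line.
HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# The state follows on an indented line, e.g. "java.lang.Thread.State: BLOCKED (...)".
STATE = re.compile(r'^\s+java\.lang\.Thread\.State: (?P<state>[A-Z_]+)')


def thread_states(dump_text):
    """Map thread name -> state for one dump (duplicate names keep the last one seen)."""
    states = {}
    current = None
    for line in dump_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group("name")
            continue
        state = STATE.match(line)
        if state and current is not None:
            states[current] = state.group("state")
            current = None
    return states


def persistently_in_state(dumps, states=("BLOCKED",)):
    """Return the names of threads that are in one of `states` in every dump."""
    per_dump = [
        {name for name, state in thread_states(text).items() if state in states}
        for text in dumps
    ]
    if not per_dump:
        return set()
    return set.intersection(*per_dump)
```

Take a handful of dumps a few seconds apart, read each file into a string, and pass the list to `persistently_in_state`. Threads waiting on `java.util.concurrent` locks are reported as WAITING rather than BLOCKED, so the stack frames under each suspicious thread still need to be read by eye.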
Created ‎05-30-2016 09:12 PM
Thanks @Jing Zhao for the answer.
