Created 06-23-2016 12:10 AM
There's a classic issue/exception in HDFS HA with ZKFC: "IPC's epoch X is less than the last promised epoch Y". What are the best suggested steps to troubleshoot the problem? How can the root cause be found? What are the possible reasons? Thanks.
Created 06-23-2016 06:29 PM
Hello @Xiaobing Zhou,
This may indicate that either a NameNode or JournalNodes were unresponsive for a period of time. This can lead to a cascading failure, whereby a NameNode HA failover occurs, the other NameNode becomes active, the previous NameNode thinks it is still active, and then QJM rejects that NameNode for not operating within the same "epoch" (logical period of time). This is by design, as QJM is intended to prevent 2 NameNodes from mistakenly acting as active in a split-brain scenario.
There are multiple potential reasons for unresponsiveness in the NameNode/JournalNode interaction. Reviewing logs from the NameNodes and JournalNodes would likely reveal more details. There are several common causes to watch for:
Created 06-23-2016 12:42 AM
One major cause could be the following. Suppose you are getting these errors on NameNode X, which was active. X became unresponsive for some reason (maybe a network connectivity problem, or it was busy processing DataNode reports, or something else) and could not communicate with ZKFC, so fencing happened and Y is now your active NameNode. When X becomes responsive again, it still assumes it is the active NameNode and tries to send write requests to the JournalNodes. Because Y is already active, the last promised epoch value has been increased, and the JournalNodes simply reject the write requests from X.
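The sequence above can be illustrated with a small Java model. This is only a sketch to show the idea, not the Hadoop implementation: `JournalNodeModel`, `newEpoch`, `journal`, and `acceptedEdits` are hypothetical names chosen for this example, and the model throws an unchecked exception for brevity. The only thing taken from the thread is the wording of the error message.

```java
// Hypothetical, simplified model of the epoch check on a single JournalNode.
// Names and structure are assumptions for illustration, not Hadoop code.
class JournalNodeModel {
    private long lastPromisedEpoch = 0;
    private int acceptedEdits = 0;

    // A NameNode that becomes active proposes a new, higher epoch.
    // The JournalNode promises to reject any writer with a lower epoch.
    synchronized void newEpoch(long epoch) {
        if (epoch <= lastPromisedEpoch) {
            throw new IllegalStateException(
                "Proposed epoch " + epoch + " is not greater than " + lastPromisedEpoch);
        }
        lastPromisedEpoch = epoch;
    }

    // Every write carries the writer's epoch; a stale writer is rejected.
    synchronized void journal(long writerEpoch, String edit) {
        if (writerEpoch < lastPromisedEpoch) {
            throw new IllegalStateException("IPC's epoch " + writerEpoch
                + " is less than the last promised epoch " + lastPromisedEpoch);
        }
        acceptedEdits++;
    }

    synchronized int acceptedEdits() {
        return acceptedEdits;
    }
}

class EpochFencingSketch {
    public static void main(String[] args) {
        JournalNodeModel jn = new JournalNodeModel();

        jn.newEpoch(1);           // NameNode X becomes active with epoch 1
        jn.journal(1, "edit-1");  // accepted

        jn.newEpoch(2);           // X stalls; Y takes over with epoch 2
        jn.journal(2, "edit-2");  // accepted

        try {
            jn.journal(1, "edit-3");  // X wakes up and still uses epoch 1
        } catch (IllegalStateException e) {
            // prints: IPC's epoch 1 is less than the last promised epoch 2
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the rejection of X is the protection working as intended, so the thing to investigate is why X stalled long enough for the failover to happen.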
Please read the detailed information about this at the link below.
https://community.hortonworks.com/articles/27225/how-qjm-works-in-namenode-ha.html
Hope this information helps.
Happy Hadooping!! 🙂
Created 07-07-2016 08:46 PM
Thank you @Chris Nauroth and @Kuldeep Kulkarni for the answer. It's quite clear.