03-20-2018 12:55 PM
I applied OS package updates to my cluster yesterday. As usual, I stopped the roles on each node, one at a time, to update, reboot, and put the node back in the cluster. Somewhere along the way, one of my NameNodes (I have HA set up) was left stopped overnight. I brought it back up about an hour ago and then did a manual failover back to it, since it is normally the active NameNode. It is now complaining bitterly:
"The filesystem checkpoint is 9 hour(s), 51 minute(s) old. This is 986.58% of the configured checkpoint period of 1 hour(s). Critical threshold: 400.00%. 16,378 transactions have occurred since the last filesystem checkpoint. This is 1.64% of the configured checkpoint transaction target of 1,000,000."
I have been googling all over for the last hour or so to figure out how to fix this. It is clearly due to this role having been unintentionally STOPPED for a while. But now how do I get the checkpoint reset and caught back up? What drives me crazy is that most of the stuff I found (and indeed, the message itself on my CM comes with a SUPPRESS button) talks about how to HIDE this problem...!
How do I tell it to refresh its checkpoint? Waiting obviously isn't the answer, since the "9 hour(s), 51 minute(s) old" figure just keeps increasing, so it clearly isn't going to fix itself. It's not obvious which action in the Actions dropdown menu, if any, I should take on this NameNode.
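For reference, here is how I read the numbers in the alert quoted above. This is only a rough sketch in Python: the variable names are mine, and I am assuming CM simply divides the checkpoint age by the configured period and the transaction count by the configured target. Plugging in the displayed 9h51m gives 985%, slightly under the reported 986.58%, presumably because the displayed age drops the seconds.

```python
# Rough check of the figures in the alert. All names are illustrative,
# not Cloudera Manager's own; the formulas are my assumption.
checkpoint_period_s = 1 * 3600        # configured checkpoint period: 1 hour
checkpoint_txn_target = 1_000_000     # configured checkpoint transaction target
critical_threshold_pct = 400.0        # critical threshold from the alert

age_s = 9 * 3600 + 51 * 60            # "9 hour(s), 51 minute(s)", seconds not shown
txns_since_checkpoint = 16_378        # transactions since the last checkpoint

age_pct = 100.0 * age_s / checkpoint_period_s
txn_pct = 100.0 * txns_since_checkpoint / checkpoint_txn_target

print(f"checkpoint age: {age_pct:.2f}% of the configured period")   # 985.00%
print(f"transactions:   {txn_pct:.2f}% of the configured target")   # 1.64%
print("critical" if age_pct > critical_threshold_pct else "ok")     # critical
```

So the alert is driven entirely by the age of the checkpoint, not by the transaction count, which is nowhere near its target.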