I currently have an HDP 126.96.36.199 install that is largely working well, except that when I fail over the HBase Master, it can take up to five minutes for the new master to take over. This is clearly caused by the new master replaying a huge number of Procedure WAL files (currently 7742) during startup. They always replay successfully, but the time taken is problematic at best.
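For anyone wanting to check the same thing on their own cluster, here is a minimal sketch of how the file count and age range could be pulled from an `hdfs dfs -ls` listing. The directory `/apps/hbase/data/MasterProcWALs` is an assumption (it depends on `hbase.rootdir`), and the helper names are made up for illustration. It only reads the listing. It does not say whether any file is safe to delete.

```python
import subprocess
from datetime import datetime

# ASSUMPTION: default-looking location; adjust to match hbase.rootdir.
WAL_DIR = "/apps/hbase/data/MasterProcWALs"


def summarize_listing(ls_output):
    """Return (file_count, oldest_mtime, newest_mtime) parsed from
    the text printed by `hdfs dfs -ls <dir>`."""
    mtimes = []
    for line in ls_output.splitlines():
        parts = line.split(None, 7)
        # Regular-file rows have 8 columns and permissions starting with '-'.
        # This skips the "Found N items" header and any directory rows.
        if len(parts) != 8 or not parts[0].startswith("-"):
            continue
        stamp = parts[5] + " " + parts[6]  # e.g. "2019-01-05 10:15"
        mtimes.append(datetime.strptime(stamp, "%Y-%m-%d %H:%M"))
    if not mtimes:
        return 0, None, None
    return len(mtimes), min(mtimes), max(mtimes)


def list_wal_dir(path=WAL_DIR):
    """Run `hdfs dfs -ls` (read-only) and return its stdout as text."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `summarize_listing(list_wal_dir())` as a user with read access to the directory would give the number of files plus the oldest and newest modification times, which at least shows how far back the backlog goes.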
From some reading, it seems I might be able to remove old Procedure WAL files once they have been safely processed. The problem is that I don't see how to verify that a given file is finished with, and therefore whether it is actually safe to delete.
I've attached the Procedures & Locks page from my HBase Master UI to show that there isn't an obviously old stuck procedure in place.
Can anyone give any advice? I'd really like to clean these files up so HBase Master restarts will be safer.
I appreciate the suggestion, but given this is production, it's hard to justify stopping services just to test a theory. Is this the only way to ensure old ProcedureWALs are removed?