I am running into an issue starting my NameNode (HDP 3.1), which fails with a FileNotFoundException. It complains about a file in /apps/hbase/data/WALs/. I ran hdfs fsck / and the report shows that the filesystem is healthy. I am not sure why this file doesn't exist, or why the NameNode cares about it existing. Is there a way to force the NameNode to start with the file missing?
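For reference, a plain hdfs fsck / mostly reports on block health; a slightly deeper check, using the path from the error message, would also surface corrupt blocks and files still open for write:

$ hdfs fsck / -list-corruptfileblocks
$ hdfs fsck /apps/hbase/data/WALs -files -blocks -openforwrite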
The directory it's looking for holds the WALs (HBase write-ahead logs). Can you try tricking it by recreating the directory? Run the below as the hdfs user:
$ hdfs dfs -mkdir -p /apps/hbase/data/WALs
$ hdfs dfs -chown -R hbase:hdfs /apps/hbase/data/WALs
First restart HBase, then try starting the NameNode.
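As a quick sanity check before restarting, confirm the directory exists and is owned by hbase:hdfs (the -d flag lists the directory itself rather than its contents):

$ hdfs dfs -ls -d /apps/hbase/data/WALs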
Interestingly, restarting HBase deletes the directory and file that I created. If I create it and only restart the NameNode, it still says the file is missing, even though I see it with hdfs dfs -ls. Maybe some issue with a 0-length file?
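For anyone checking the same thing, the size column of a recursive listing shows 0-length files directly:

$ hdfs dfs -ls -R /apps/hbase/data/WALs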
What type of cluster is this (DEV, TEST, etc.)? Kerberized or not? What do the contents of that directory look like?
How many HBase Masters are in your cluster?
Can you paste the Master log from before the error happened? Of particular interest are the MasterProcWALs entries (see the grep sketch after these questions).
Do you have HBase-dependent services like Atlas?
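To pull those entries, something like the following should work, assuming the default HDP log location and master log naming (adjust both for your cluster):

# grep -i MasterProcWALs /var/log/hbase/hbase-hbase-master-*.log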
Shut down and restart the cluster.
Can you rename the directories MasterProcWALs and WALs found in /usr/var/lib/ambari-metrics-collector/hbase?
# mv MasterProcWALs XXXMasterProcWALs
# mv WALs XXXWALs
Now restart HBase and thereafter the NameNode.
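One caveat on the paths above: the original error points at HDFS (/apps/hbase/data/WALs), so if the problem WALs are the HDFS ones rather than the local Ambari Metrics Collector copies, the equivalent rename would be done with the HDFS shell as the hbase user, e.g.:

$ hdfs dfs -mv /apps/hbase/data/MasterProcWALs /apps/hbase/data/XXXMasterProcWALs
$ hdfs dfs -mv /apps/hbase/data/WALs /apps/hbase/data/XXXWALs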
This is a production cluster, and it is kerberized. There have been a bunch of ongoing problems over the last week, of which this is one. I have two HBase Masters in the cluster at the moment. I don't have any HBase-dependent services, just an internally developed service whose data we can easily recreate.
Given that I have been having issues all week, I am hesitant to restart the cluster.
The only error I found in the HBase Master log is:
2019-02-08 14:53:35,713 WARN [Thread-18] wal.WALProcedureStore: Unable to read tracker for hdfs://cluster/apps/hbase/data/MasterProcWALs/pv2-00000000000000000336.log
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat$InvalidWALDataException: Missing trailer: size=19 startPos=19
    at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.readTrailer(ProcedureWALFormat.java:183)
    at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFile.readTrailer(ProcedureWALFile.java:93)
    at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFile.readTracker(ProcedureWALFile.java:100)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLog(WALProcedureStore.java:1386)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLogs(WALProcedureStore.java:1335)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.recoverLease(WALProcedureStore.java:416)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:714)
    at org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1398)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:857)
    at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2225)
    at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:568)
    at java.lang.Thread.run(Thread.java:745)
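The "Missing trailer: size=19 startPos=19" part suggests that procedure WAL is only 19 bytes, i.e. little more than a header with no entries written, typically left behind when the Master stopped abruptly. The file sizes are easy to eyeball with a plain listing of the directory from the warning:

$ hdfs dfs -ls /apps/hbase/data/MasterProcWALs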
It turned out that the issue was not with HBase, but with the NameNode itself. Performing a bootstrapStandby after taking a backup resolved the issue. The biggest concern is that this was done without putting the cluster into safe mode, since the second NameNode could not start up.
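For anyone following the same path, a minimal sketch of that recovery, assuming an HA NameNode pair and the HDP default metadata directory /hadoop/hdfs/namenode (check dfs.namenode.name.dir for the real value, and kinit first on a kerberized cluster):

# tar czf /root/nn-meta-backup.tar.gz /hadoop/hdfs/namenode
# su - hdfs
$ hdfs namenode -bootstrapStandby

Run on the NameNode that will not start, bootstrapStandby re-copies the namespace metadata from the healthy active NameNode, after which the standby can be started normally.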