Created 02-08-2019 04:08 PM
I am running into an issue with starting my NameNode (HDP 3.1), which fails with a FileNotFoundException. It complains about a file in /apps/hbase/data/WALs/. I ran `hdfs fsck /` and the report shows that the filesystem is healthy. I am not sure why this file doesn't exist, or why the NameNode cares about it existing. Is there a way to force the NameNode to start with the file missing?
Created 02-08-2019 06:28 PM
The directory it's looking for holds the WALs (HBase write-ahead logs). Can you try tricking it by running the commands below as the hdfs user?
$ hdfs dfs -mkdir -p /apps/hbase/data/WALs
$ hdfs dfs -chown -R hbase:hdfs /apps/hbase/data/WALs
First restart HBase, then try starting the NameNode.
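To double-check the result, a quick look at the ownership (standard command; adjust the path if your HBase root differs):
$ hdfs dfs -ls /apps/hbase/data
# the WALs entry should be owned by hbase:hdfs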
Please revert
Created 02-08-2019 06:35 PM
Sorry that it was unclear; there is a specific file in that directory that it complains about. In that case, do you suggest
`hdfs dfs -touch thefilename`
then changing the owner?
Created 02-08-2019 06:51 PM
Okay, I now understand. Yes:
hdfs dfs -touch thefilename
Do that for any files it's complaining about, fix the ownership, and try restarting the NameNode.
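A minimal sketch of that, using the placeholder name from above (substitute the actual path from the NameNode error; depending on the Hadoop version the zero-length create command may be -touchz rather than -touch):
$ hdfs dfs -touchz /apps/hbase/data/WALs/thefilename    # create an empty placeholder file
$ hdfs dfs -chown hbase:hdfs /apps/hbase/data/WALs/thefilename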
Created 02-08-2019 08:33 PM
Interestingly, restarting HBase deletes the directory and file that I created. If I create it and only restart the NameNode, it still says the file is missing, even though I can see it with `hdfs dfs -ls`. Maybe some issue with a zero-length file?
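A way to double-check how the placeholder is actually stored (a standard HDFS check, included as a sketch):
$ hdfs fsck /apps/hbase/data/WALs -files -blocks
# a zero-length file is listed with size 0 and no blocks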
Created 02-08-2019 10:01 PM
What type of cluster is this (DEV, TEST, etc.)? Kerberized or not? How many HBase Masters are in your cluster?
The contents of that directory look like this:
/apps/hbase/data/WALs/{host_name},16020,1549312261942/{host_name}%2C16020%2C1549312261942..meta.15495826629.meta
/apps/hbase/data/WALs/{host_name},16020,1549312261942/{host_name}%2C16020%2C1549312261942.default.15495857667
Can you paste the Master log from before the error happened? Of particular interest are the MasterProcWALs entries in
/var/log/hbase/hbase-hbase-master-xxx.log
Do you have HBase-dependent services like Atlas?
Test 1
Shut down and restart the cluster.
Test 2
Can you rename the directories MasterProcWALs & WALs found in /usr/var/lib/ambari-metrics-collector/hbase?
# mv MasterProcWALs XXXMasterProcWALs
# mv WALs XXXWALs
Now restart HBase and thereafter the NameNode.
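If your HBase data root is in HDFS (the /apps/hbase/data paths above suggest it is), the equivalent rename for the service HBase would be done with hdfs dfs; this is only a sketch under that assumption, with HBase stopped first:
$ hdfs dfs -mv /apps/hbase/data/MasterProcWALs /apps/hbase/data/XXXMasterProcWALs
$ hdfs dfs -mv /apps/hbase/data/WALs /apps/hbase/data/XXXWALs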
Please revert
Created 02-10-2019 08:45 PM
Any updates?
Created 02-11-2019 02:54 PM
This is a production cluster that is kerberized. There have been a bunch of ongoing problems over the last week, of which this is one. I have two HBase Masters in the cluster at the moment. I don't have any HBase-dependent services, just an internally developed service whose data we can easily recreate.
Given that I have been having issues all week I am hesitant to restart the cluster.
The only error I found in the HBase log is:
2019-02-08 14:53:35,713 WARN [Thread-18] wal.WALProcedureStore: Unable to read tracker for hdfs://cluster/apps/hbase/data/MasterProcWALs/pv2-00000000000000000336.log
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat$InvalidWALDataException: Missing trailer: size=19 startPos=19
    at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.readTrailer(ProcedureWALFormat.java:183)
    at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFile.readTrailer(ProcedureWALFile.java:93)
    at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFile.readTracker(ProcedureWALFile.java:100)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLog(WALProcedureStore.java:1386)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLogs(WALProcedureStore.java:1335)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.recoverLease(WALProcedureStore.java:416)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:714)
    at org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1398)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:857)
    at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2225)
    at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:568)
    at java.lang.Thread.run(Thread.java:745)
Created 02-11-2019 05:01 PM
It seems that the issue was not about HBase, but instead related to the NameNode. Performing a bootstrapStandby after a backup resolved the issue. The biggest concern is that this was done without putting the cluster into safe mode, since the second NameNode could not start up.
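For anyone hitting the same thing, a minimal sketch of the sequence, assuming an HA NameNode pair and commands run as the hdfs user; the metadata directory and backup target below are placeholders, and entering safe mode first is the safer path when it is possible:
$ hdfs dfsadmin -safemode enter                           # on the healthy NameNode, if reachable
$ tar czf /backup/nn-meta.tar.gz <dfs.namenode.name.dir>  # back up the NameNode metadata directory first
$ hdfs namenode -bootstrapStandby                         # on the NameNode that will not start
$ hdfs dfsadmin -safemode leave
# then start the NameNode service again (via Ambari in HDP)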