
How to start the NameNode after a FileNotFoundException?

Explorer

I am running into an issue starting my NameNode (HDP 3.1): it fails with a FileNotFoundException complaining about a file under /apps/hbase/data/WALs/. I ran hdfs fsck / and the report shows the filesystem is healthy. I am not sure why this file doesn't exist, or why the NameNode cares about it existing. Is there a way to force the NameNode to start with the file missing?

9 REPLIES

Mentor

@scott powers

The directory it's looking for holds the WALs (HBase write-ahead logs). Can you try working around it by running the commands below as the hdfs user?

$ hdfs dfs -mkdir -p /apps/hbase/data/WALs 
$ hdfs dfs -chown -R hbase:hdfs /apps/hbase/data/WALs

First restart HBase, then try starting the NameNode.

Please report back.

Explorer

@Geoffrey Shelton Okot

Sorry that was unclear: it complains about a specific file in that directory. In that case, do you suggest

`hdfs dfs -touch thefilename`

then changing the owner?

Mentor

@scott powers

Okay, now I understand. Yes:

hdfs dfs -touch thefilename

Do that for any files it complains about, fix the ownership, and try restarting the NameNode.
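
For reference, a minimal sketch of that workaround, run as the hdfs superuser. The file name below is a placeholder for whatever path the NameNode error actually names, and -touchz (create a zero-length file) is used because plain -touch is not available in every Hadoop shell version:

$ hdfs dfs -mkdir -p /apps/hbase/data/WALs
# <missing-file> is a placeholder for the exact file the NameNode complains about
$ hdfs dfs -touchz /apps/hbase/data/WALs/<missing-file>
$ hdfs dfs -chown -R hbase:hdfs /apps/hbase/data/WALs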

Explorer

Interestingly, restarting HBase deletes the directory and file that I created. If I create them and only restart the NameNode, it still says the file is missing, even though I can see it with hdfs dfs -ls. Maybe it's an issue with a zero-length file?
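
One way to check the zero-length theory (a sketch, assuming the file in question sits under /apps/hbase/data/WALs) is to list the tree and run fsck on just that subtree, which prints per-file lengths and block counts:

$ hdfs dfs -ls -R /apps/hbase/data/WALs
$ hdfs fsck /apps/hbase/data/WALs -files -blocks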

Mentor

@scott powers

What type of cluster is this (DEV, TEST, etc.)? Kerberized or not? How many HBase masters are in your cluster?

The contents of that directory should look something like this:

/apps/hbase/data/WALs/{host_name},16020,1549312261942/{host_name}%2C16020%2C1549312261942..meta.15495826629.meta
/apps/hbase/data/WALs/{host_name},16020,1549312261942/{host_name}%2C16020%2C1549312261942.default.15495857667

Can you paste the master log from just before the error happened? Of particular interest are the MasterProcWALs entries in

 /var/log/hbase/hbase-hbase-master-xxx.log
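
For example (the exact file name depends on the master hostname), something like this pulls the lines leading up to the first MasterProcWALs message out of the active master's log:

$ grep -n -B 40 'MasterProcWALs' /var/log/hbase/hbase-hbase-master-*.log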

Do you have HBase-dependent services like Atlas?

Test 1

Shut down and restart the cluster.

Test 2

Can you rename the MasterProcWALs and WALs directories found in /usr/var/lib/ambari-metrics-collector/hbase?

# mv MasterProcWALs XXXMasterProcWALs
# mv WALs XXXWALs

Now restart HBase and thereafter the NameNode.
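
If the same sidelining idea is also meant for the HBase service's own procedure WALs in HDFS (the /apps/hbase/data/MasterProcWALs path that appears in the log later in this thread), a hedged equivalent, run with the HBase masters stopped, would be a rename on the HDFS side:

$ hdfs dfs -mv /apps/hbase/data/MasterProcWALs /apps/hbase/data/MasterProcWALs.old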

Please report back.

Mentor

@scott powers

Any updates?

Explorer

This is a production cluster and it is Kerberized. There have been a bunch of ongoing problems over the last week, of which this is one. I have two HBase masters in the cluster at the moment. I don't have any HBase-dependent services, just an internally developed service whose data we can easily recreate.

Given that I have been having issues all week, I am hesitant to restart the cluster.

The only error I found in the HBase master log is:

2019-02-08 14:53:35,713 WARN  [Thread-18] wal.WALProcedureStore: Unable to read tracker for hdfs://cluster/apps/hbase/data/MasterProcWALs/pv2-00000000000000000336.log
org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat$InvalidWALDataException: Missing trailer: size=19 startPos=19
  at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.readTrailer(ProcedureWALFormat.java:183)
  at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFile.readTrailer(ProcedureWALFile.java:93)
  at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFile.readTracker(ProcedureWALFile.java:100)
  at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLog(WALProcedureStore.java:1386)
  at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLogs(WALProcedureStore.java:1335)
  at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.recoverLease(WALProcedureStore.java:416)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:714)
  at org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1398)
  at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:857)
  at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2225)
  at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:568)
  at java.lang.Thread.run(Thread.java:745)

Explorer

It seems the issue was not with HBase but with the NameNode itself. Performing a bootstrapStandby after taking a backup resolved the issue. The biggest concern is that this was done without putting the cluster into safe mode, since the second NameNode could not start up.
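
For anyone following the same path, a minimal sketch of the recovery sequence described above, run on the NameNode host that would not start. The metadata directory shown is only an assumed default for dfs.namenode.name.dir; check the real value in hdfs-site.xml, and on a Kerberized cluster kinit as the HDFS service principal first:

# stop the broken NameNode (e.g. via Ambari), then back up its local metadata directory
$ cp -rp /hadoop/hdfs/namenode /hadoop/hdfs/namenode.bak
# re-seed this NameNode's metadata from the healthy active NameNode
$ hdfs namenode -bootstrapStandby
# then start the NameNode again via Ambari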