Created 08-15-2018 07:23 AM
Hi All,
I have a cluster with NameNode HA on AWS instances (instance-store disks). Each NameNode has 12 mount points with metadata on them, and we have 4 DataNodes. My standby NameNode hung due to a hardware issue on the AWS end, so we had to stop and start the instance. As this was the only option, we did that and were able to bring up all services on the standby node except the NameNode service, because all 12 mounts had lost their metadata. What I did was tar the hadoop directory from each mount on the active (working) NameNode and restore it to all mounts on the standby NameNode. Now I am able to start the NameNode service and it became the standby NameNode automatically via ZKFC. But in the hadoop-hdfs-namenode-<hostname>.log file I am getting the error below. How can I fix it, is there any harm from it, and will my active NameNode be able to fail over to this node successfully? Kindly help and give your suggestions to fix this.
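For reference, the restore I did was roughly along these lines (a sketch only; the /dataN/hadoop/hdfs paths match my mounts and the archive names are just examples, so adjust to your own dfs.namenode.name.dir layout):
On NN2 (active), per mount: tar -czf /tmp/hdfs_meta_data0.tar.gz -C /data0/hadoop hdfs
Copy over: scp /tmp/hdfs_meta_data0.tar.gz NN1:/tmp/
On NN1 (standby), per mount: tar -xzf /tmp/hdfs_meta_data0.tar.gz -C /data0/hadoop
(repeated for data1 through data11)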
NN1 - standby NameNode (the one that had the issue and had to be stopped and started)
NN2 - active
DN1
DN2
DN3
DN4
(I have removed the IPs and used the naming conventions above in the log below.)
Error snippet below.
2018-08-15 15:04:12,909 INFO namenode.EditLogInputStream (RedundantEditLogInputStream.java:nextOp(176)) - Fast-forwarding stream 'http://NN1:8480/getJournal?jid=eimedlcluster1&segmentTxId=211034589&storageInfo=-63%3A1695052906%3A0%3ACID-ce4126e2-d1f2-4233-81ec-d267f195583f, http://NN1:8480/getJournal?jid=eimedlcluster1&segmentTxId=211034589&storageInfo=-63%3A1695052906%3A0%3ACID-ce4126e2-d1f2-4233-81ec-d267f195583f' to transaction ID 211034589
2018-08-15 15:04:12,909 INFO namenode.EditLogInputStream (RedundantEditLogInputStream.java:nextOp(176)) - Fast-forwarding stream 'http://NN1:8480/getJournal?jid=eimedlcluster1&segmentTxId=211034589&storageInfo=-63%3A1695052906%3A0%3ACID-ce4126e2-d1f2-4233-81ec-d267f195583f' to transaction ID 211034589
2018-08-15 15:04:12,926 INFO namenode.FSImage (FSEditLogLoader.java:loadFSEdits(145)) - Edits file http://NN1/getJournal?jid=eimedlcluster1&segmentTxId=211034589&storageInfo=-63%3A1695052906%3A0%3ACI..., http://NN1:8480/getJournal?jid=eimedlcluster1&segmentTxId=211034589&storageInfo=-63%3A1695052906%3A0... of size 14288 edits # 104 loaded in 0 seconds
2018-08-15 15:04:14,335 INFO ha.EditLogTailer (EditLogTailer.java:doTailEdits(238)) - Loaded 104 edits starting from txid 211034588
2018-08-15 15:04:22,552 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:04:27,970 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:04:34,710 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 25 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN4:51488 Call#101504 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:04:34,711 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 77 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN3:54288 Call#98633 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:04:34,715 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 6 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN2:57618 Call#99810 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:04:34,716 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 35 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN1:59402 Call#100406 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:04:49,013 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:05,799 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 54 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN3:54318 Call#98649 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:05:05,807 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 56 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN2:57630 Call#99826 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:05:05,810 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 20 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN4:51498 Call#101519 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:05:05,816 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 43 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN1:59428 Call#100422 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:05:06,229 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:06,246 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:06,942 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:06,945 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:06,954 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:06,974 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:13,011 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:22,543 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:32,988 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:52,160 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 44 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN4:51528 Call#101534 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:05:52,186 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 27 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN2:57658 Call#99841 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:05:53,981 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:06:06,230 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:06:06,254 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:06:06,930 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:06:06,931 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:06:06,947 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:06:06,968 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:06:08,482 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 71 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN4:51528 Call#101549 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
Created 08-16-2018 12:56 AM
Hello @Muthukumar S!
Hmm, I got curious about your case 🙂
Could you check if:
1. You can list HDFS from NN1 (the rebooted node), e.g. with hdfs dfs -ls against each NameNode?
2. dfs.nameservices in hdfs-site.xml and fs.defaultFS are configured correctly on both nodes?
Also, I've noted that after the last edits were loaded you started to face the warning messages, so we may need to check whether both the Active and Standby have the same edits/fsimage. Try running an ls -R under the NameNode metadata directory on your Linux filesystem and check whether any file is missing or the sizes are quite different.
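For example, something along these lines (a sketch only; I'm assuming for illustration that the metadata sits under /dataN/hadoop/hdfs/namenode on each mount, so adjust the path to whatever dfs.namenode.name.dir points to on your cluster):
for d in /data{0..11}/hadoop/hdfs/namenode/current; do echo "$d: $(find "$d" -type f | wc -l) files, $(du -sh "$d" | cut -f1)"; done
Run it on both NameNodes and compare the counts and sizes side by side.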
And please let me know which version you are running. Also, if possible, try to enable DEBUG logging on the standby NameNode.
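One way to do that without a restart (a sketch; it assumes the standby NameNode's web UI is on the default HTTP port 50070) is the daemonlog command:
hadoop daemonlog -setlevel NN1:50070 org.apache.hadoop.hdfs.server.namenode DEBUG
hadoop daemonlog -getlevel NN1:50070 org.apache.hadoop.hdfs.server.namenode
The change is temporary and reverts when the process restarts.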
Hope this helps!
Created 08-16-2018 03:58 AM
Thank you for the reply. Please find below information regarding your queries.
1. I tried the commands below from NN1 (the rebooted one):
hdfs dfs -ls hdfs://NN2/user/ --> able to get the output
hdfs dfs -ls hdfs://NN1/user/ --> ERROR: ls: Operation category READ is not supported in state standby (Is this normal and expected? See also the state check noted below item 2.)
2. Yes, both dfs.nameservices in hdfs-site.xml and fs.defaultFS are fine.
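As an additional check (a sketch; nn1/nn2 are the NameNode IDs from dfs.ha.namenodes.<nameservice> and may be named differently in my setup), the HA state of each NameNode can be confirmed with:
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
My understanding is that pointing ls directly at the standby is expected to fail with the StandbyException, and clients should normally go through the nameservice URI so the failover proxy finds the active NameNode.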
I verified that fsimage checkpoints are happening on both NameNodes and the sizes and timestamps match. But edits files are missing on NN1 (standby) from the time I copied the metadata files from NN2 and started the NameNode service, i.e. from 14th Aug 17:39 onwards.
I will not be able to enable DEBUG logging because I cannot restart HDFS services; jobs are running continuously and I can't afford downtime now. I'm also afraid the NameNode service might not come back up after a restart.
Below is a snippet from both nodes showing the file counts and sizes per mount, as well as the latest fsimage files.
NN1 (STANDBY)
$ ls -l fsi*
-rw-r--r--. 1 hdfs hadoop 616714799 Aug 16 01:44 fsimage_0000000000211062321
-rw-r--r--. 1 hdfs hadoop 62 Aug 16 01:44 fsimage_0000000000211062321.md5
-rw-r--r--. 1 hdfs hadoop 619959676 Aug 16 07:45 fsimage_0000000000211102880
-rw-r--r--. 1 hdfs hadoop 62 Aug 16 07:45 fsimage_0000000000211102880.md5
NN2 (ACTIVE)
$ ls -l fsi*
-rw-r--r--. 1 hdfs hadoop 616714799 Aug 16 01:44 fsimage_0000000000211062321
-rw-r--r--. 1 hdfs hadoop 62 Aug 16 01:45 fsimage_0000000000211062321.md5
-rw-r--r--. 1 hdfs hadoop 619959676 Aug 16 07:45 fsimage_0000000000211102880
-rw-r--r--. 1 hdfs hadoop 62 Aug 16 07:45 fsimage_0000000000211102880.md5
NN1 (STANDBY) file counts and size per mount:
data0 - 9064 files, size is: 1351 - /data0/hadoop/hdfs
data1 - 9064 files, size is: 1351 - /data1/hadoop/hdfs
data2 - 9064 files, size is: 1351 - /data2/hadoop/hdfs
data3 - 9064 files, size is: 1351 - /data3/hadoop/hdfs
data4 - 9064 files, size is: 1351 - /data4/hadoop/hdfs
data5 - 9064 files, size is: 1351 - /data5/hadoop/hdfs
data6 - 9064 files, size is: 1351 - /data6/hadoop/hdfs
data7 - 9064 files, size is: 1351 - /data7/hadoop/hdfs
data8 - 9064 files, size is: 1351 - /data8/hadoop/hdfs
data9 - 9064 files, size is: 1351 - /data9/hadoop/hdfs
data10 - 9064 files, size is: 1351 - /data10/hadoop/hdfs
data11 - 9064 files, size is: 1351 - /data11/hadoop/hdfs
NN2 (ACTIVE) file counts and size per mount:
data0 - 9504 files, size is: 1357 - /data0/hadoop/hdfs
data1 - 9504 files, size is: 1356 - /data1/hadoop/hdfs
data2 - 9504 files, size is: 1357 - /data2/hadoop/hdfs
data3 - 9505 files, size is: 1357 - /data3/hadoop/hdfs
data4 - 9505 files, size is: 1357 - /data4/hadoop/hdfs
data5 - 9505 files, size is: 1357 - /data5/hadoop/hdfs
data6 - 9505 files, size is: 1357 - /data6/hadoop/hdfs
data7 - 9505 files, size is: 1357 - /data7/hadoop/hdfs
data8 - 9505 files, size is: 1357 - /data8/hadoop/hdfs
data9 - 9505 files, size is: 1357 - /data9/hadoop/hdfs
data10 - 9505 files, size is: 1357 - /data10/hadoop/hdfs
data11 - 9505 files, size is: 1357 - /data11/hadoop/hdfs
Created 08-17-2018 03:33 AM
Hi @Muthukumar S!
What happens if you run the following command? (Replace <dfs.nameservices> below with the respective value from your configuration.)
hdfs dfs -ls hdfs://<dfs.nameservices>/user
You can also try to run the following command:
hdfs namenode -recover
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#namenode
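A note on recovery mode (a sketch, not specific to your logs): the NameNode on the node being recovered should be stopped before running it, and it is usually run as the hdfs user, for example:
sudo -u hdfs hdfs namenode -recover
It walks the edit log interactively and prompts you before skipping or truncating anything, so read each prompt carefully before answering.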
Hope this helps