Created 08-15-2018 07:23 AM
Hi All,
I have a cluster with NameNode HA on AWS instances (instance-store disks). Each NameNode has 12 mount points with metadata on them, and we have 4 DataNodes. My standby NameNode hung due to a hardware issue on the AWS end, so we had to stop and start the instance. As this was the only option, we did that and were able to bring up all services on the standby node except the NameNode service, because all 12 mounts had lost their metadata. What I did was tar the hadoop directory from each mount on the active (working) NameNode and restore it to all mounts on the standby NameNode. Now I am able to start the NameNode service and it became the standby NameNode automatically via ZKFC. But in the hadoop-hdfs-namenode-<hostname>.log file I am getting the error below. How can I fix it, is there any harm from it, and will my active NameNode be able to fail over to this node successfully? Kindly help and give your suggestions to fix this.
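For reference, the restore I did was roughly along these lines (a sketch only; the /dataN/hadoop/hdfs paths match my mounts and the archive names are just examples, so adjust to your own dfs.namenode.name.dir layout):
On NN2 (active), per mount: tar -czf /tmp/hdfs_meta_data0.tar.gz -C /data0/hadoop hdfs
Copy over: scp /tmp/hdfs_meta_data0.tar.gz NN1:/tmp/
On NN1 (standby), per mount: tar -xzf /tmp/hdfs_meta_data0.tar.gz -C /data0/hadoop
(repeated for data1 through data11)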
NN1 - standby NameNode (the one that had the issue and had to be stopped and started)
NN2 - active
DN1
DN2
DN3
DN4
(I have removed the IPs and used the naming conventions above in the log below.)
Error snippet below.
2018-08-15 15:04:12,909 INFO namenode.EditLogInputStream (RedundantEditLogInputStream.java:nextOp(176)) - Fast-forwarding stream 'http://NN1:8480/getJournal?jid=eimedlcluster1&segmentTxId=211034589&storageInfo=-63%3A1695052906%3A0%3ACID-ce4126e2-d1f2-4233-81ec-d267f195583f, http://NN1:8480/getJournal?jid=eimedlcluster1&segmentTxId=211034589&storageInfo=-63%3A1695052906%3A0%3ACID-ce4126e2-d1f2-4233-81ec-d267f195583f' to transaction ID 211034589
2018-08-15 15:04:12,909 INFO namenode.EditLogInputStream (RedundantEditLogInputStream.java:nextOp(176)) - Fast-forwarding stream 'http://NN1:8480/getJournal?jid=eimedlcluster1&segmentTxId=211034589&storageInfo=-63%3A1695052906%3A0%3ACID-ce4126e2-d1f2-4233-81ec-d267f195583f' to transaction ID 211034589
2018-08-15 15:04:12,926 INFO namenode.FSImage (FSEditLogLoader.java:loadFSEdits(145)) - Edits file http://NN1/getJournal?jid=eimedlcluster1&segmentTxId=211034589&storageInfo=-63%3A1695052906%3A0%3ACI..., http://NN1:8480/getJournal?jid=eimedlcluster1&segmentTxId=211034589&storageInfo=-63%3A1695052906%3A0... of size 14288 edits # 104 loaded in 0 seconds
2018-08-15 15:04:14,335 INFO ha.EditLogTailer (EditLogTailer.java:doTailEdits(238)) - Loaded 104 edits starting from txid 211034588
2018-08-15 15:04:22,552 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:04:27,970 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:04:34,710 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 25 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN4:51488 Call#101504 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:04:34,711 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 77 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN3:54288 Call#98633 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:04:34,715 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 6 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN2:57618 Call#99810 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:04:34,716 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 35 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN1:59402 Call#100406 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:04:49,013 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:05,799 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 54 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN3:54318 Call#98649 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:05:05,807 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 56 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN2:57630 Call#99826 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:05:05,810 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 20 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN4:51498 Call#101519 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:05:05,816 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 43 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN1:59428 Call#100422 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:05:06,229 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:06,246 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:06,942 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:06,945 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:06,954 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:06,974 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:13,011 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:22,543 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:32,988 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:05:52,160 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 44 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN4:51528 Call#101534 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:05:52,186 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 27 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN2:57658 Call#99841 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2018-08-15 15:05:53,981 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:06:06,230 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:06:06,254 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:06:06,930 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:06:06,931 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:06:06,947 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:06:06,968 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(7324)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
2018-08-15 15:06:08,482 INFO ipc.Server (Server.java:run(2165)) - IPC Server handler 71 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from DN4:51528 Call#101549 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
Created 08-16-2018 12:56 AM
Hello @Muthukumar S!
Hmm, I got curious about your case 🙂
Could you check if:
1. You can list HDFS from NN1 (the rebooted node), e.g. with hdfs dfs -ls against each NameNode?
2. dfs.nameservices in hdfs-site.xml and fs.defaultFS are configured correctly on both nodes?
Also, I've noted that after the last edits were loaded you started to face the warning messages, so we may need to check whether both the Active and Standby have the same edits/fsimage. Try running an ls -R under the NameNode metadata directory on your Linux filesystem and check whether any file is missing or the sizes are quite different.
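For example, something along these lines (a sketch only; I'm assuming for illustration that the metadata sits under /dataN/hadoop/hdfs/namenode on each mount, so adjust the path to whatever dfs.namenode.name.dir points to on your cluster):
for d in /data{0..11}/hadoop/hdfs/namenode/current; do echo "$d: $(find "$d" -type f | wc -l) files, $(du -sh "$d" | cut -f1)"; done
Run it on both NameNodes and compare the counts and sizes side by side.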
And please let me know which version you are running. Also, if possible, try to enable DEBUG logging on the standby NameNode.
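One way to do that without a restart (a sketch; it assumes the standby NameNode's web UI is on the default HTTP port 50070) is the daemonlog command:
hadoop daemonlog -setlevel NN1:50070 org.apache.hadoop.hdfs.server.namenode DEBUG
hadoop daemonlog -getlevel NN1:50070 org.apache.hadoop.hdfs.server.namenode
The change is temporary and reverts when the process restarts.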
Hope this helps!
Created 08-16-2018 03:58 AM
Thank you for the reply. Please find below information regarding your queries.
1. I tried the commands below from NN1 (the rebooted one):
hdfs dfs -ls hdfs://NN2/user/ --> able to get the output
hdfs dfs -ls hdfs://NN1/user/ --> ERROR: ls: Operation category READ is not supported in state standby (Is this normal and expected? See also the state check noted below item 2.)
2. Yes, both dfs.nameservices in hdfs-site.xml and fs.defaultFS are fine.
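As an additional check (a sketch; nn1/nn2 are the NameNode IDs from dfs.ha.namenodes.<nameservice> and may be named differently in my setup), the HA state of each NameNode can be confirmed with:
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
My understanding is that pointing ls directly at the standby is expected to fail with the StandbyException, and clients should normally go through the nameservice URI so the failover proxy finds the active NameNode.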
I verified that fsimage checkpoints are happening on both NameNodes and the sizes and timestamps match. But edits files are missing on NN1 (standby) from the time I copied the metadata files from NN2 and started the NameNode service, i.e. from 14th Aug 17:39 onwards.
I will not be able to enable DEBUG logging because I cannot restart HDFS services; jobs are running continuously and I can't afford downtime now. I'm also afraid the NameNode service might not come back up after a restart.
Below is a snippet from both nodes showing the file counts and sizes per mount, as well as the latest fsimage files.
NN1 (STANDBY)
$ ls -l fsi*
-rw-r--r--. 1 hdfs hadoop 616714799 Aug 16 01:44 fsimage_0000000000211062321
-rw-r--r--. 1 hdfs hadoop 62 Aug 16 01:44 fsimage_0000000000211062321.md5
-rw-r--r--. 1 hdfs hadoop 619959676 Aug 16 07:45 fsimage_0000000000211102880
-rw-r--r--. 1 hdfs hadoop 62 Aug 16 07:45 fsimage_0000000000211102880.md5
NN2 (ACTIVE)
$ ls -l fsi*
-rw-r--r--. 1 hdfs hadoop 616714799 Aug 16 01:44 fsimage_0000000000211062321
-rw-r--r--. 1 hdfs hadoop 62 Aug 16 01:45 fsimage_0000000000211062321.md5
-rw-r--r--. 1 hdfs hadoop 619959676 Aug 16 07:45 fsimage_0000000000211102880
-rw-r--r--. 1 hdfs hadoop 62 Aug 16 07:45 fsimage_0000000000211102880.md5
NN1 (STANDBY) file counts and size per mount:
data0 - 9064 files, size is: 1351 - /data0/hadoop/hdfs
data1 - 9064 files, size is: 1351 - /data1/hadoop/hdfs
data2 - 9064 files, size is: 1351 - /data2/hadoop/hdfs
data3 - 9064 files, size is: 1351 - /data3/hadoop/hdfs
data4 - 9064 files, size is: 1351 - /data4/hadoop/hdfs
data5 - 9064 files, size is: 1351 - /data5/hadoop/hdfs
data6 - 9064 files, size is: 1351 - /data6/hadoop/hdfs
data7 - 9064 files, size is: 1351 - /data7/hadoop/hdfs
data8 - 9064 files, size is: 1351 - /data8/hadoop/hdfs
data9 - 9064 files, size is: 1351 - /data9/hadoop/hdfs
data10 - 9064 files, size is: 1351 - /data10/hadoop/hdfs
data11 - 9064 files, size is: 1351 - /data11/hadoop/hdfs
NN2 (ACTIVE) file counts and size per mount:
data0 - 9504 files, size is: 1357 - /data0/hadoop/hdfs
data1 - 9504 files, size is: 1356 - /data1/hadoop/hdfs
data2 - 9504 files, size is: 1357 - /data2/hadoop/hdfs
data3 - 9505 files, size is: 1357 - /data3/hadoop/hdfs
data4 - 9505 files, size is: 1357 - /data4/hadoop/hdfs
data5 - 9505 files, size is: 1357 - /data5/hadoop/hdfs
data6 - 9505 files, size is: 1357 - /data6/hadoop/hdfs
data7 - 9505 files, size is: 1357 - /data7/hadoop/hdfs
data8 - 9505 files, size is: 1357 - /data8/hadoop/hdfs
data9 - 9505 files, size is: 1357 - /data9/hadoop/hdfs
data10 - 9505 files, size is: 1357 - /data10/hadoop/hdfs
data11 - 9505 files, size is: 1357 - /data11/hadoop/hdfs
Created 08-17-2018 03:33 AM
Hi @Muthukumar S!
What happens if you run the following command? (Replace <dfs.nameservices> below with the respective value from your configuration.)
hdfs dfs -ls hdfs://<dfs.nameservices>/user
You can also try to run the following command:
hdfs namenode -recover
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#namenode
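A note on recovery mode (a sketch, not specific to your logs): the NameNode on the node being recovered should be stopped before running it, and it is usually run as the hdfs user, for example:
sudo -u hdfs hdfs namenode -recover
It walks the edit log interactively and prompts you before skipping or truncating anything, so read each prompt carefully before answering.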
Hope this helps