Member since 07-18-2016 · 24 Posts · 0 Kudos Received · 1 Solution
03-04-2019 05:43 PM
Thanks. I successfully rescued the unrecognized blocks. Addressing the underlying issue of missing but finalized blocks will take time. Hopefully upgrading to a later CDH will work.
02-28-2019 02:57 PM
The histogram of {count, disk} in my previous post was preliminary. Here is the final tally:

count  disk
  144   00
    6   01
  290   02
  155   03
  154   04
  134   05
  167   06
  172   08
    2   09
  144   10
    2   11
  143   12
    7   13
  130   15
  151   16
  280   17
    7   19
    5   20
    2   21
  172   22
  139   23
    4   24
  160   25
  171   26
  162   27
   10   28
  184   29
  171   30
    3   31
    5   32
    7   33
    4   34
  248   35
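A tally like the one above can be produced by grouping replica paths by their disk directory. A sketch of the pipeline, assuming each disk is mounted as /dataNN (that layout is an assumption; shown here on inline sample paths rather than a live DN):

```shell
# Assumed layout: each disk mounted as /dataNN, replicas under .../finalized/.
# Extract the disk number from each replica path and count replicas per disk.
printf '%s\n' \
  /data17/dfs/dn/current/finalized/subdir0/blk_1 \
  /data17/dfs/dn/current/finalized/subdir1/blk_2 \
  /data04/dfs/dn/current/finalized/subdir0/blk_3 \
| sed -E 's|^/data([0-9]+)/.*|\1|' \
| sort | uniq -c | awk '{print $1, $2}' | sort -rn
# -> 2 17
#    1 04
```

On a real DN, the `printf` stage would be replaced by a `find` over the data directories for the affected blk_* names.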
02-28-2019 02:51 PM
> Do you get a full stack trace in the Namenode log at the time of the error in the datanode?

No. I showed the entire log message that was visible. I also checked the NN stdout and did not see a stack trace.

> Have all these files with missing blocks got replication factor of 1 or have they a replication factor 3?

The replication factor is 3, but the problem files never reached that level due to competing activity. The DN in question had to be bounced because of a failed disk (it has 36 8TB disks), apparently at an inopportune moment. I don't think the excessive RBW files (up to a year old) are the cause, since I moved most of them away and restarted the DN, and there was no change.

Current problem summary: finalized blocks/replicas on a DN are reported as missing, and the "RemoteException in offerService" WARN appears in the DN log.

More info: of the 4400 blocks/replicas missing cluster-wide, 3500 are spread unevenly across the 36 disks of one DN, all under /finalized/:

count  disk
    2   09
    2   11
    2   21
    3   31
    4   24
    4   34
    5   20
    5   32
    6   01
    7   13
    7   19
    7   33
   10   28
   15   05
  139   23
  154   04
  159   27
  160   25
  171   26
  280   17

So I cannot place the blame on any one disk. This DN holds 14M blk files, so only a tiny percentage are affected. It sure seems like an uncaught exception caused the block report from this DN to be incomplete.
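For anyone trying to reproduce this kind of count: the files with missing blocks can be enumerated with the standard HDFS fsck tool. A sketch (not necessarily the exact commands I ran; the path is a placeholder):

```shell
# List every file that currently has a corrupt or missing block, cluster-wide:
hdfs fsck / -list-corruptfileblocks

# Inspect one affected file in detail, showing its blocks and replica locations:
hdfs fsck /path/to/affected/file -files -blocks -locations
```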
02-26-2019 04:44 PM
Another symptom: the DataNode's /blockScannerReport page on the problem DN always returns: "Periodic block scanner is not yet initialized. Please check back again after some time."
02-26-2019 03:04 PM
Using CDH 5.3.1 (without CM), I have a DataNode that seems never to start its block report. This particular DN has 100x more RBW files than the other DNs (some RBW files are a year old). The driving symptom is blocks being reported missing even though those blocks are present under the DN's /finalized/ directory. A few thousand files have missing blocks in this state, and no alternative blocks/replicas exist anywhere on the cluster, so we would like to recover these files. The missing blocks are NOT under the /rbw/ directory, hence the concern over the "RemoteException in offerService" error. The classpath and VERSION files look good compared to known-good DNs. See the point-in-time log entries below.

* Question: How can a /finalized/ block (replica) be considered missing after the DN has been up for many hours?
* Question: What if I manually copy the finalized blk_* files in question to another DN? Would that DN pick them up upon restart?
* Question: Should I manually clean up old (say, older than a few days) RBW files?
DN log:

2019-02-26 21:43:20,152 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: RemoteException in offerService
org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException
    at org.apache.hadoop.ipc.Client.call(Client.java:1411)
    at org.apache.hadoop.ipc.Client.call(Client.java:1364)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy11.blockReport(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:175)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:503)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:716)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:851)
    at java.lang.Thread.run(Thread.java:745)

NN log:

2019-02-26 21:43:20,147 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 6000, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.blockReport from 207.241.230.241:40178 Call#5 Retry#0
java.lang.NullPointerException

The NN error above does NOT show up for any other DN.
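On the last question above (cleaning up old RBW files): before deleting anything, it seems safer to first just enumerate the stale RBW replicas. A sketch, assuming a /dataNN/dfs/dn layout (that path is an assumption; the real location comes from dfs.datanode.data.dir). Demonstrated here on a scratch directory so it is safe to run as-is:

```shell
# Sketch: list RBW replica files not modified in the last 7 days, without deleting.
# On a real DN the find root would be /data*/dfs/dn/current (assumed layout).
tmp=$(mktemp -d)
mkdir -p "$tmp/rbw"
touch "$tmp/rbw/blk_100"                   # fresh replica: should NOT match
touch -d '30 days ago' "$tmp/rbw/blk_200"  # stale replica: should match (GNU touch)
find "$tmp" -path '*/rbw/blk_*' -mtime +7 -print
rm -rf "$tmp"
```

Swapping `-print` for `-ls` shows sizes and mtimes, which helps confirm the candidates before any removal.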
Labels:
- HDFS