Created on 11-12-2019 12:42 AM - last edited on 11-12-2019 05:51 AM by cjervis
We have a Spark cluster with the following details (all machines are Linux Red Hat machines):
2 NameNode machines
2 ResourceManager machines
8 DataNode machines (HDFS file system)
We are running a Spark Streaming application.
From the YARN logs we can see the following errors, for example:
yarn logs -applicationId application_xxxxxxxx -log_files ALL
---2019-11-08T10:12:20.040 ERROR [][][] [org.apache.spark.scheduler.LiveListenerBus] Listener EventLoggingListener threw an exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): BP-484874736-172.2.45.23-8478399929292:blk_1081495827_7755233 does not exist or is not under Construction
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:6721)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:6789)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:931)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:979)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
We can see that `BP-484874736-172.2.45.23-8478399929292:blk_1081495827_7755233` does not exist or is not under Construction,
but what could be the reasons that YARN complains about this?
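One thing we could also try (a minimal sketch, assuming a default NameNode log location, which may differ on your installation) is to grep the active NameNode log for the block ID, to see when the block was allocated, finalized, or invalidated:
su - hdfs
# the log path is an assumption; adjust to wherever your NameNode writes its logs
grep 'blk_1081495827' /var/log/hadoop-hdfs/hadoop-hdfs-namenode-*.log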
Created 11-12-2019 01:04 AM
Hi Mike,
Can you do a quick check on the below?
BP-484874736-172.2.45.23-8478399929292:blk_1081495827_7755233 does not exist or is not under Construction
1. Are all DataNodes up and running fine within the cluster?
2. Check the NameNode UI and see if any DataNode is NOT reporting blocks in the Datanodes tab, or if any missing blocks are reported on the NN UI.
3. You can run fsck [unless the cluster is huge and loaded with data] and check whether the block exists and which nodes have its replicas; for example, see the commands below.
It might help to drill down into the issue.
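For example (a rough sketch; fsck over the full namespace may be slow on a large, heavily loaded cluster):
su - hdfs
# summary of DataNodes, capacity, and any missing or under-replicated blocks
hdfs dfsadmin -report
# list any files with corrupt blocks
hdfs fsck / -list-corruptfileblocks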
Created 11-12-2019 01:34 AM
Hi,
1. All DataNodes are up and running fine.
2. I do not see any corrupt or under-replicated blocks.
3. We ran fsck and HDFS is healthy.
Any other possibilities?
Created on 11-12-2019 01:52 AM - edited 11-12-2019 01:53 AM
We also did the following:
su hdfs
hadoop fsck / -files -blocks >/tmp/file
and we did not find the block blk_1081495827_7755233 in the file /tmp/file,
so what is the reason that the block was removed?
Created 11-12-2019 01:53 AM
1. Did the job fail due to the above reason? If "NO", then is the error also occurring in the logs for other blocks (BP XXX)?
2. Can you check using fsck which nodes had copies of the block specified above?
Created 11-12-2019 01:55 AM
Please send me the fsck command that you want me to run.
Created 11-12-2019 02:19 AM
If you know the file name, then:
hdfs fsck /myfile.txt -files -blocks -locations
Else
hdfs fsck / -files -blocks -locations | grep <blkxxx>
Created on 11-12-2019 07:08 AM - edited 11-12-2019 07:10 AM
By running the following:
hdfs fsck / -files -blocks -locations | grep blk_xxxxxx_xxxxxx
as:
su hdfs
hdfs fsck / -files -blocks -locations | grep blk_1081495827_7755233
we do not get any results,
so I guess it means that blk_xxxxx_xxxx does not exist in the HDFS file system.
What next?
Created 11-12-2019 08:09 PM
1. Did the job fail due to the above reason?
If "NO", then is the error displayed in the logs for all Spark jobs or just for this job?