Created 03-24-2024 02:04 AM
Hi, can someone suggest what the problem could be? I have a large volume of missing blocks, and the NameNode log indicated that the IPC maximum data length was not sufficient. I increased it, but that didn't help much: both before and after the change the DataNodes keep logging that the replica cache file doesn't exist, and I don't see the DataNodes actually connecting to the NameNode.
The setup is very small: 2 servers, one running both the NameNode and a DataNode, and the second running a DataNode only. Cloudera Express 6.3.1 is used.
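For anyone trying to reproduce the checks, the effective IPC limit and the live DataNode count can be confirmed with something like the following (the property name is the standard ipc.maximum.data.length; the summary pasted below is from hdfs dfsadmin -report):
Print the effective IPC message size limit:
# hdfs getconf -confKey ipc.maximum.data.length
Cluster summary, including missing blocks and registered DataNodes:
# hdfs dfsadmin -report
Only the DataNodes the NameNode currently considers live:
# hdfs dfsadmin -report -live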
Configured Capacity: 3787349868544 (3.44 TB)
Present Capacity: 3593183903744 (3.27 TB)
DFS Remaining: 3235560472576 (2.94 TB)
DFS Used: 357623431168 (333.06 GB)
DFS Used%: 9.95%
Replicated Blocks:
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 7899994
Missing blocks (with replication factor 1): 405
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
Erasure Coded Block Groups:
Low redundancy block groups: 0
Block groups with corrupt internal blocks: 0
Missing block groups: 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
Both DN1 & DN2
2024-03-24 10:16:51,876 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Adding replicas to map for block pool BP-1052250670-10.53.XXX-XX-XXXX981591679 on volume /opt/hadoop/dfs/dn...
2024-03-24 10:16:51,876 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice: Replica Cache file: /opt/hadoop/dfs/dn/current/BP-1052250670-10.53.XXX-XX-XXXX981591679/current/replicas doesn't exist
NameNode
2024-03-24 10:56:59,872 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 28 Total time for transactions(ms): 15 Number of transactions batched in Syncs: 2 Number of syncs: 26 SyncTimes(ms): 64
2024-03-24 10:56:59,889 INFO org.apache.hadoop.ipc.Server: IPC Server handler 13 on 8020, call Call#23851 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.53.235.25:39568
java.io.IOException: File /opt/logs/hdfs_canary_health/.canary_file_2024_03_24-10_56_59.f684be6f08559e9f could only be written to 0 of the 1 minReplication nodes. There are 0 datanode(s) running and 0 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2102)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2673)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:872)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:550)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
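In case it helps, this is the kind of check I can run from the DataNode hosts to see whether they reach the NameNode at all (the hostname and log path below are placeholders for a typical CM-managed install, not my exact values; 8020 is the NameNode RPC port from the log above):
Confirm the NameNode RPC port is reachable from the DataNode host:
# nc -zv nn-host.example.com 8020
Look for registration / block pool / heartbeat messages in the DataNode log:
# grep -iE "register|block pool|heartbeat" /var/log/hadoop-hdfs/*DATANODE*.log*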
Created 03-24-2024 11:35 PM
@user2024, Welcome to our community! To help you get the best possible answer, I have tagged in our Cloudera Manager experts @upadhyayk04 @utrivedi @Rajat_710 @Raamar who may be able to assist you further.
Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.
Regards,
Vidya Sargur
Created 03-26-2024 11:51 AM
Hi @user2024, I don't think the canary file is causing this issue. The blocks that are corrupt/missing are now lost and cannot be recovered. You can identify the affected files with the command below and manually delete them, then run the HDFS balancer so that blocks are rebalanced across the cluster (a rough end-to-end sequence is sketched at the end of this reply).
# hdfs fsck -list-corruptfileblocks
You can also refer to the below article.
https://stackoverflow.com/questions/19205057/how-to-fix-corrupt-hdfs-files
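For instance, a rough end-to-end sequence, assuming the affected files are confirmed unrecoverable, would look like the following (the -delete step is destructive, so review the fsck output first; the balancer threshold is just an example value):
List the files that currently have corrupt or missing blocks:
# hdfs fsck / -list-corruptfileblocks
Delete the affected files once you are sure they cannot be recovered:
# hdfs fsck / -delete
Rebalance block distribution across the DataNodes afterwards:
# hdfs balancer -threshold 10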