Created 08-27-2018 01:24 PM
I have a DR site and run replication from prod to the DR site. My BDR jobs are failing with a missing blocks error. The files and blocks that are reported missing exist on the source prod system, so I'm not sure why the jobs are failing instead of copying them over.
Created 08-27-2018 01:28 PM
Files are split into blocks and stored on DataNodes. By default, each block is stored on 3 DataNodes (a block replication factor of 3).
In BDR, the mappers request blocks (as directed by the NameNode) from the DataNodes that have them. If no DataNode has a block for a file, that file cannot be copied.
Recommendation:
Check which files have missing blocks in the source cluster, then find and address the issue (a couple of example commands are below).
BDR/distcp currently copies whole files, not individual blocks, so if one block of a file is missing from the source, the remaining blocks of that file are not copied either.
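If it helps, here's a rough way to check from the command line (the path is just a placeholder for the directory you are replicating; run on the source cluster as a user that can read the whole tree):
hdfs fsck / -list-corruptfileblocks
hdfs fsck /path/being/replicated -files -blocks | grep -i missing
The first command lists files with corrupt or missing blocks cluster-wide; the second limits the check to the replicated path.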
Created 08-27-2018 01:35 PM
Thanks for your reply.
The files and blocks reported missing by the BDR job running on the DR site do exist on the source system.
Created 08-27-2018 04:08 PM
I don't quite follow.
Can you show us the errors or messages you are seeing and some log context around them?
Created on 08-27-2018 04:49 PM - edited 08-27-2018 04:53 PM
Here's the error from the BDR job running in the DR system. Below that, I've run an fsck on the file on the source system to show that it does exist on the source and has the same block ID as listed in the error.
I've removed IP addresses and replaced the actual file name with "filename".
ERROR /path/filename org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1508298398-ipaddress-1406065203774:blk_2079737512_1100628731148 file=filename
at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1040)
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1023)
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1002)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:642)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:895)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:954)
at java.io.DataInputStream.read(DataInputStream.java:149)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at com.cloudera.enterprise.distcp.util.ThrottledInputStream.read(ThrottledInputStream.java:77)
at com.cloudera.enterprise.distcp.mapred.RetriableFileCopyCommand.readBytes(RetriableFileCopyCommand.java:371)
at com.cloudera.enterprise.distcp.mapred.RetriableFileCopyCommand.copyToFile(RetriableFileCopyCommand.java:345)
at com.cloudera.enterprise.distcp.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:161)
at com.cloudera.enterprise.distcp.util.RetriableCommand.execute(RetriableCommand.java:87)
at com.cloudera.enterprise.distcp.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:617)
at com.cloudera.enterprise.distcp.mapred.CopyMapper.map(CopyMapper.java:454)
at com.cloudera.enterprise.distcp.mapred.CopyMapper.map(CopyMapper.java:69)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop
Here's the file on the source system -
hdfs fsck filename -files -blocks -locations
Connecting to namenode via http://prodservername:50070
FSCK started by hdfs (auth:KERBEROS_SSL) from /serveripaddress for path filename at Mon Aug 27 19:39:53 EDT 2018
filename 995352 bytes, 1 block(s): OK
0. BP-1508298398-ipaddress-1406065203774:blk_2079737512_1100628731148 len=995352 Live_repl=3 [DatanodeInfoWithStorage[ipaddress:1004,DS-00246250-eef8-4c03-8ef7-c898594f960b,DISK], DatanodeInfoWithStorage[ipaddress:1004,DS-297b0420-a2a1-4418-8691-3ef9a374cc51,DISK], DatanodeInfoWithStorage[ipaddress:1004,DS-0ae9f985-a12a-4871-991b-d2e8017c4c4b,DISK]]
Created 08-27-2018 05:31 PM
I got an email showing your update, but for some reason I don't see it here.
What I did notice was that the stack trace says:
ERROR /path/filename org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1508298398-10.9.129.86-1406065203774:blk_2079737512_1100628731148 file=filename
at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(
A client's inability to retrieve blocks results in a BlockMissingException, so the exception name can be a bit misleading.
Rather, I'd verify that all DataNodes in the source cluster are accessible during replication and that all nodes in your destination cluster can connect to the DataNodes in the source cluster.
Note that I think there may be more information in the BDR job logs in YARN. It could be that there is a firewall or something else preventing mappers from retrieving blocks.
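As a quick sketch of what I mean (the hostname, port, and application ID below are placeholders; 1004 is the DataNode port shown in your fsck output), from one of the destination NodeManager hosts you could try:
nc -vz sourcedatanodehost 1004
yarn logs -applicationId <application_id> | grep -i -A3 "connection refused"
If the port test fails from some destination hosts but not others, that usually points at a firewall or routing issue rather than blocks that are genuinely missing.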
Does the same problem happen every time, sometimes, etc.?
Created 08-27-2018 07:31 PM
You're the hero. I pulled diag data on one of the jobs and found a "connection refused" error when trying to access one of the files reported as having missing blocks. I tried connecting to that remote server and can't. I looked at others: some I can connect to, others I can't. It's a large inventory of servers to work through.
I really appreciate your help with this. Thank you.
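To get through the inventory I'll probably script the check, roughly like this (datanodes.txt is just my list of source DataNode hosts, and 1004 is the port from the fsck output):
while read host; do nc -vz -w 5 "$host" 1004 || echo "FAILED: $host"; done < datanodes.txt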
Created 08-29-2018 06:44 AM
I have a further question.
Is there a way to have the BDR job connect to a specific source server?
Created 08-29-2018 10:35 AM
Can you clarify what you mean by "source server?"
Really, the answer is No.
Your source configuration dictates what NameNode(s) to communicate with.
Your source's NameNodes tell clients where to get blocks. Those blocks can be on any DataNode in the source cluster.
On the target side where the MapReduce Job runs, the Resource Manager decides on which nodes the Mappers will run.
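In other words, every destination node that can run a mapper needs network access to every DataNode in the source cluster. A rough way to enumerate the source DataNodes to check against (run against the source cluster; dfsadmin generally requires HDFS admin privileges) is:
hdfs dfsadmin -report | grep "^Name:"
That gives the full list of DataNode addresses the destination mappers may be told to read from.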