BDR jobs fail with missing blocks

Contributor

I have a DR site and run replication from prod to the DR site. My BDR jobs are failing with a missing-blocks error. The files and blocks that are reported missing exist on the source prod system, so I'm not sure why the jobs are failing instead of copying them over.

1 ACCEPTED SOLUTION

Contributor

You're the hero. I pulled diag data on one of the jobs and found a connection refused when trying to access one of the files with the missing-blocks error. I tried connecting to the remote server and couldn't. I checked others; some I can connect to, others I can't. It's a large inventory of servers to work through.

 

I really appreciate your help with this. Thank you.


9 REPLIES

Master Guru

@DanielWhite,

 

Files are split up into blocks and stored on DataNodes. By default, each block is stored on 3 DataNodes (a block replication factor of 3).

 

In BDR, the mappers will request blocks (as instructed by the NameNode) from the DataNodes that have them. If no DataNode can supply a block of a file, the file itself cannot be copied.
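For example, you can confirm a file's replication factor and size from the command line (the path below is a placeholder):

hdfs dfs -stat "replication=%r size=%b name=%n" /path/to/file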

 

Recommendation:

 

Check which files have missing blocks in the source cluster and address the issue.
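For example, fsck can list any files with missing or corrupt blocks across the source cluster, and the dfsadmin report includes a missing-blocks count:

hdfs fsck / -list-corruptfileblocks
hdfs dfsadmin -report | grep -i "missing blocks"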

 

BDR/distcp copies whole files, not individual blocks, at this time, so if one block of a file is missing from the source, the remaining blocks are not copied either.

 

Contributor

Thanks for your reply.

 

The files and blocks reported missing by the BDR job running on the DR site do exist on the source system.

Master Guru

@DanielWhite,

 

I don't quite follow.

 

Can you show us the errors or messages you are seeing and some log context around them?

 

Contributor

Here's the error from the BDR job running in the DR system. Below that, I've run an fsck on the file on the source system to show that it does exist on the source and has the same block number as listed in the error.

 

I've removed IP addresses and replaced the actual file name with "filename".

 

ERROR  /path/filename org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1508298398-ipaddress-1406065203774:blk_2079737512_1100628731148 file=filename
 at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1040)
 at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1023)
 at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1002)
 at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:642)
 at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:895)
 at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:954)
 at java.io.DataInputStream.read(DataInputStream.java:149)
 at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
 at java.io.FilterInputStream.read(FilterInputStream.java:107)
 at com.cloudera.enterprise.distcp.util.ThrottledInputStream.read(ThrottledInputStream.java:77)
 at com.cloudera.enterprise.distcp.mapred.RetriableFileCopyCommand.readBytes(RetriableFileCopyCommand.java:371)
 at com.cloudera.enterprise.distcp.mapred.RetriableFileCopyCommand.copyToFile(RetriableFileCopyCommand.java:345)
 at com.cloudera.enterprise.distcp.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:161)
 at com.cloudera.enterprise.distcp.util.RetriableCommand.execute(RetriableCommand.java:87)
 at com.cloudera.enterprise.distcp.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:617)
 at com.cloudera.enterprise.distcp.mapred.CopyMapper.map(CopyMapper.java:454)
 at com.cloudera.enterprise.distcp.mapred.CopyMapper.map(CopyMapper.java:69)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
 at org.apache.hadoop

 

Here's the file on the source system:

 

hdfs fsck filename -files -blocks -locations
Connecting to namenode via http://prodservername:50070
FSCK started by hdfs (auth:KERBEROS_SSL) from /serveripaddress for path filename at Mon Aug 27 19:39:53 EDT 2018
filename 995352 bytes, 1 block(s):  OK
0. BP-1508298398-ipaddress-1406065203774:blk_2079737512_1100628731148 len=995352 Live_repl=3 [DatanodeInfoWithStorage[ipaddress:1004,DS-00246250-eef8-4c03-8ef7-c898594f960b,DISK], DatanodeInfoWithStorage[ipaddress:1004,DS-297b0420-a2a1-4418-8691-3ef9a374cc51,DISK], DatanodeInfoWithStorage[ipaddress:1004,DS-0ae9f985-a12a-4871-991b-d2e8017c4c4b,DISK]]

 

 

Master Guru

@DanielWhite,

 

I got an email showing your update, but for some reason I don't see it here.

What I did notice was that the stack trace says:



ERROR  /path/filename org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1508298398-10.9.129.86-1406065203774:blk_2079737512_1100628731148 file=filename
 at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(

 

The inability of a client to retrieve blocks is what results in the BlockMissingException. The name can be a bit misleading: the blocks may exist but be unreachable.

Rather, I'd check to verify that all the DataNodes in the source cluster are accessible during replication and that all the nodes in your destination cluster can connect to the DataNodes in the source cluster.
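For example, from a node in the destination cluster, you can test a source DataNode's transfer port directly (port 1004 is taken from your fsck output; the hostname below is a placeholder):

nc -zv -w 5 source-datanode.example.com 1004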

 

Note that there may be more information in the BDR job logs in YARN. It could be that a firewall or something else is preventing mappers from retrieving blocks.
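Once you have the application ID of the MapReduce application that the BDR job launched (the ID below is a placeholder), you can pull the mapper logs and search them:

yarn logs -applicationId application_1535400000000_0001 | grep -B 2 -A 10 BlockMissingException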

 

Does the same problem happen every time, sometimes, etc.?

Contributor

You're the hero. I pulled diag data on one of the jobs and found a connection refused when trying to access one of the files with the missing-blocks error. I tried connecting to the remote server and couldn't. I checked others; some I can connect to, others I can't. It's a large inventory of servers to work through.

 

I really appreciate your help with this. Thank you.
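Since it's a large inventory, a rough sketch like the following could probe every live DataNode reported by the source NameNode (this assumes the HDFS client on the node you run it from is configured against the source cluster, and that each "Name:" line in the dfsadmin report is an ip:port pair for the DataNode transfer port):

# List live DataNodes from the source NameNode, then probe each transfer port
for addr in $(hdfs dfsadmin -report -live | awk '/^Name:/ {print $2}'); do
  host=${addr%%:*}; port=${addr##*:}
  nc -z -w 5 "$host" "$port" && echo "OK          $addr" || echo "UNREACHABLE $addr"
done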

Master Guru
That's great news! BDR has a lot of moving parts, so it can be super tricky to debug. I hope that once you get the connectivity worked out, it's smooth sailing.

Contributor

I have a further question.

 

Is there a way to have the BDR job connect to a specific source server?

 

Master Guru

@DanielWhite,

 

Can you clarify what you mean by "source server"?

Really, the answer is no.

 

Your source configuration dictates which NameNode(s) to communicate with.

Your source's NameNodes tell clients where to get blocks. Those blocks can be on any DataNode in the source cluster.

 

On the target side, where the MapReduce job runs, the ResourceManager decides which nodes the mappers will run on.
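To make that concrete, in a plain distcp invocation the source argument is a NameNode URI for the whole cluster, never an individual DataNode (hostnames and paths below are placeholders):

hadoop distcp hdfs://prod-nn.example.com:8020/source/path hdfs://dr-nn.example.com:8020/dest/path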