
YARN - occasional Error message

Contributor

Hi, I notice the occasional error message below in YARN jobs. If someone has seen/noticed it, would you know the exact cause, apart from network latency etc.?

 

Note: This doesn't contribute to job failures, because the operation is retried (4 attempts by default, I think) and usually goes through:

 

2018-07-19 09:36:46,433 WARN [ContainerLocalizer Downloader] org.apache.hadoop.hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR, self=/xx.xx.xxx.xxx:32010, remote=/xx.xx.xxx.xx:1004, for file /user/hive/.staging/job_xxxxxxxxxxx_xxxxx/libjars/sqoop-1.4.5-cdh5.4.4.jar, for pool BP-xxxxxxx-xx.xx.xxx.xx-xxxxxxxxxx block xxxxx_xxxx
at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:467)
at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:890)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:768)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:377)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:660)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:897)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:956)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:87)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:61)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:121)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:369)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:265)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:364)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

 

 

6 REPLIES

Super Collaborator

Hi Prav,

 

These types of errors are network / connection related. It might have been a slow response from the service on the remote side rather than a congested network. Tracking it down could be a lot of work. Looking at the DN logs on the remote side might give you some insight.

 

It is not really a YARN issue; the HDFS community will be in a better position to help you.

 

Wilfred

Contributor

@Wilfred Thank you for your response.

 

Regards

Mentor
The error merely indicates that the DataNode the client contacted for the replica wasn't able to perform the read operation requested. The actual I/O error behind the OP_READ_BLOCK error response will be logged on the DataNode host specified by the remote=x.x.x.x information in the log message printed.

On a related note, given the intermittency, what is your 'mapreduce.client.submit.file.replication' configuration set to? If it is higher than the HDFS DataNode count, set it lower. Cloudera Manager's auto-configuration rules for this property are detailed at https://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_autoconfig.html#concept_v4y_vb...:


"""
let x be Number of DataNodes, and y be the configured HDFS replication factor, then:

mapreduce.client.submit.file.replication = max(min(x, y), sqrt(x))
"""

Contributor

@Harsh J Thanks for your response; you pointed it out correctly.

 

The DN logs do indicate the reason for these notifications, "Replica not found", and that relates to "mapreduce.client.submit.file.replication" because it is currently set to 1 [CM recommends it to be 8]. I can bump it up and check whether that alleviates the issue or decreases the occurrence further.

 

What are the repercussions if this value is set too high?

 

Regards

Mentor
The 1-factor should work. Setting it higher slows the job initialization phase a bit, but gives better task startup time due to quicker localization of the job's files.

Interesting that you observe a "Replica not found" message for files needed during localization. Do you actively/frequently run the HDFS balancer, or were you running the balancer when you experienced this error? It's likely that the block changed locations between the point of write and the localizer downloading it when the job tasks begin. That would cause the WARN you see, which forces the client to re-fetch fresh locations from the NameNode and proceed normally after that.
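
To make that re-fetch-and-retry behaviour concrete, here is a purely illustrative Java sketch (not the actual DFSInputStream code; BlockSource and its methods are hypothetical stand-ins for the HDFS client internals):

import java.io.IOException;
import java.util.List;

public class BlockReadRetrySketch {
    interface BlockSource {
        List<String> refreshBlockLocations() throws IOException;   // ask the NameNode again
        byte[] readFromDataNode(String dataNode) throws IOException; // may fail if the replica moved
    }

    static byte[] readWithRetry(BlockSource source, int maxAttempts) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            for (String dn : source.refreshBlockLocations()) {
                try {
                    return source.readFromDataNode(dn);
                } catch (IOException e) {
                    last = e; // log a WARN like the one above, then try the next location
                }
            }
        }
        throw last != null ? last : new IOException("no locations available");
    }
}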

Contributor

@Harsh J

 

No, we rarely run balancer in this environment.

I'll set it to 3 for now and observe for a while for any recurrence of those WARNs. (CM recommends setting it to a value equal to or greater than the replication factor and less than the number of DNs.)
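
For reference, a minimal Java sketch of one way to override it for a single job (assuming a plain MapReduce client; in practice the value would normally just be set in mapred-site.xml via CM, or passed with -D on the command line):

// Hedged sketch: per-job override of the submit-file replication from a plain
// MapReduce client. The job name and the remaining job setup are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitReplicationOverride {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.client.submit.file.replication", 3); // value chosen above
        Job job = Job.getInstance(conf, "example-job");
        // ... configure mapper/reducer, input/output paths, then job.waitForCompletion(true) ...
    }
}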

 

Regards