
YARN - occasional Error message

Rising Star

Hi, I notice the occasional error message below in YARN jobs. If someone has seen this before, would you know the exact cause, apart from network latency etc.?

 

Note: this doesn't contribute to job failures etc., because the download is retried (by default 4 attempts, I think) and usually goes through:

 

2018-07-19 09:36:46,433 WARN [ContainerLocalizer Downloader] org.apache.hadoop.hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR, self=/xx.xx.xxx.xxx:32010, remote=/xx.xx.xxx.xx:1004, for file /user/hive/.staging/job_xxxxxxxxxxx_xxxxx/libjars/sqoop-1.4.5-cdh5.4.4.jar, for pool BP-xxxxxxx-xx.xx.xxx.xx-xxxxxxxxxx block xxxxx_xxxx
at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:467)
at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:890)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:768)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:377)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:660)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:897)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:956)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:87)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:61)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:121)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:369)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:265)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:364)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

 

 

6 REPLIES

Re: YARN - occasional Error message

Super Collaborator

Hi Prav,

 

These types of errors are network / connection related. It might have been a slow response from the service on the remote side rather than a congested network. Tracking it down could be a lot of work; looking at the DataNode logs on the remote side might give you some insight.

 

It is not really a YARN issue; the HDFS community will be in a better position to help you.

 

Wilfred

Re: YARN - occasional Error message

Rising Star

@Wilfred Thank you for your response.

 

Regards

Re: YARN - occasional Error message

Master Guru
The error merely indicates that the DataNode the client contacted for the replica wasn't able to perform the requested read operation. The actual I/O error behind the OP_READ_BLOCK error response will be logged on the DataNode host identified by the remote=x.x.x.x field in the log message.

On a related note, given the intermittency, what is your 'mapreduce.client.submit.file.replication' configuration set to? If it is higher than the HDFS DataNode count, set it lower. Cloudera Manager's auto-configuration rules for this property are detailed at https://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_autoconfig.html#concept_v4y_vb...:


"""
let x be Number of DataNodes, and y be the configured HDFS replication factor, then:

mapreduce.client.submit.file.replication = max(min(x, y), sqrt(x))
"""

Re: YARN - occasional Error message

Rising Star

@Harsh J Thanks for your response; you pointed it out correctly.

 

The DN logs do indicate the reason for these notifications: "Replica not found". That relates to "mapreduce.client.submit.file.replication" because it is currently set to 1 (CM recommends 8). I can bump it up and check whether that alleviates or further decreases the occurrences.

 

What are the repercussions if this value is set too high?

 

Regards

Re: YARN - occasional Error message

Master Guru
The 1-factor should work. Setting it higher slows the job initialization phase a bit, but gives better task startup time due to quicker localization of the job's files.
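
If you want to experiment per job rather than cluster-wide, a minimal sketch of a MapReduce driver might look like the following; the class and job names are just placeholders of mine, but the property name and the Configuration/Job calls are standard Hadoop API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Replicate the job's submitted files (libjars, job.xml, splits)
        // to 3 DataNodes instead of the configured default.
        conf.setInt("mapreduce.client.submit.file.replication", 3);
        Job job = Job.getInstance(conf, "submit-replication-example");
        // ... set mapper/reducer/input/output as usual, then submit ...
    }
}

Tools that go through GenericOptionsParser should also accept the same override on the command line as -Dmapreduce.client.submit.file.replication=3.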

Interesting that you observe a "Replica not found" message for files needed during localization. Do you actively/frequently run the HDFS balancer, or were you running the balancer when you experienced this error? It's likely that the block changed locations between the point of write and the localizer downloading it when the job tasks begin. That would cause the WARN you see, which forces the client to re-fetch fresh locations from the NameNode and proceed normally after that.

Re: YARN - occasional Error message

Rising Star

@Harsh J

 

No, we rarely run the balancer in this environment.

I'll set it to 3 for now and observe for a while for any recurrence of those WARNs. (CM recommends setting it to a value equal to or greater than the replication factor and less than the number of DNs.)

 

Regards