Created on 07-20-2018 09:19 AM - edited 09-16-2022 06:29 AM
Hi, I've noticed the occasional error message below in YARN jobs. If someone has seen/noticed this, would you know the exact cause, apart from any network latency etc.?
Note: this doesn't contribute to job failures, because the download is retried (4 attempts by default, I think) and usually goes through:
2018-07-19 09:36:46,433 WARN [ContainerLocalizer Downloader] org.apache.hadoop.hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.io.IOException: Got error for OP_READ_BLOCK, status=ERROR, self=/xx.xx.xxx.xxx:32010, remote=/xx.xx.xxx.xx:1004, for file /user/hive/.staging/job_xxxxxxxxxxx_xxxxx/libjars/sqoop-1.4.5-cdh5.4.4.jar, for pool BP-xxxxxxx-xx.xx.xxx.xx-xxxxxxxxxx block xxxxx_xxxx
at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:467)
at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:890)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:768)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:377)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:660)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:897)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:956)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:87)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:61)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:121)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:369)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:265)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:364)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Created 07-24-2018 08:08 PM
Hi Prav,
These types of errors are network / connection related. It might have been a slow response from the service on the remote side rather than a congested network. It could be a lot of work if you want to track it down. Looking at the DN logs on the remote side might give you some insight.
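For example, something along these lines (a rough sketch assuming a CM-managed CDH layout; the log path and block ID below are placeholders to adjust for your cluster):

# Run on the DataNode from the remote= side of the WARN message.
BLOCK_ID="blk_xxxxx_xxxx"   # placeholder: the block named in the WARN
grep "$BLOCK_ID" /var/log/hadoop-hdfs/*DATANODE*.log.out
# An entry like ReplicaNotFoundException ("Replica not found") around the
# same timestamp would explain the ERROR status returned for OP_READ_BLOCK.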
It is not really a YARN issue; the HDFS community will be in a better position to help you.
Wilfred
Created 07-30-2018 10:52 AM
@Harsh J Thanks for your response, you pointed it out correctly.
The DN logs do indicate the reason for these notifications, "Replica not found", and that relates to "mapreduce.client.submit.file.replication", which is currently set to 1 [CM recommends 8]. I can bump it up and check whether that alleviates or further reduces the occurrence.
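For a quick test, it can also be overridden per job rather than cluster-wide; a minimal sketch (the connection string and table name are placeholders, and 3 is just a trial value):

# Generic -D options must come right after the Sqoop tool name.
sqoop import -D mapreduce.client.submit.file.replication=3 \
    --connect jdbc:mysql://dbhost/db --table mytable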
What are the repercussions if this value is set too high?
Regards
Created 07-31-2018 07:09 AM
No, we rarely run the balancer in this environment.
I'll set it to 3 for now and observe for a while for any recurrence of those WARNs. (CM recommends setting it to a value greater than or equal to the replication factor and less than the number of DNs.)
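To verify it takes effect, I can check the client-side setting and the actual replication of a freshly submitted job's staging files, e.g. (job ID elided as in the WARN above):

# Client-side value actually in effect:
hdfs getconf -confKey mapreduce.client.submit.file.replication
# Replication factor (%r) and name (%n) of the job's localized libjars:
hdfs dfs -stat "%r %n" /user/hive/.staging/job_xxxxxxxxxxx_xxxxx/libjars/*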
Regards