Reply
New Contributor
Posts: 3
Registered: ‎10-04-2017

Multiple Tasks attempts FAILED in YARN: Timed out after 600 secs

 

I'm running an Oracle loader app, that do 'Select * from hive_table', in Yarn and loads the data in Oracle. It run only map tasks and it finish sucessfully, but when you look at the details, it is clear that 5 of the 48 maps, have been failed because of the Timed out in the attempt. I noticed that only the last tasks got timed out and after the 10 min they run sucessfull. What is the cause of this? I try to increase/decrease the mapreduce.task.timeout parameter with no luck - it still happening.

 

 job_1509368122489_0453

Attempt Type Failed Killed Successful Maps Reduces
5048
000

 

Can you please tell me whats wrong? Thanks.

 

mapred-site:

<?xml version="1.0" encoding="UTF-8"?>

<!--Autogenerated by Cloudera Manager-->
<configuration>
  <property>
    <name>mapreduce.job.split.metainfo.maxsize</name>
    <value>10000000</value>
  </property>
  <property>
    <name>mapreduce.job.counters.max</name>
    <value>120</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>false</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.type</name>
    <value>BLOCK</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>zlib.compress.level</name>
    <value>DEFAULT_COMPRESSION</value>
  </property>
  <property>
    <name>mapreduce.task.io.sort.factor</name>
    <value>64</value>
  </property>
  <property>
    <name>mapreduce.map.sort.spill.percent</name>
    <value>0.8</value>
  </property>
  <property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
    <value>14</value>
  </property>
  <property>
    <name>mapreduce.task.timeout</name>
    <value>600000</value>
  </property>
  <property>
    <name>mapreduce.client.submit.file.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>mapreduce.job.maps</name>
    <value>48</value>
  </property>
  <property>
    <name>mapreduce.job.reduces</name>
    <value>96</value>
  </property>
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>256</value>
  </property>
  <property>
    <name>mapreduce.map.speculative</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.reduce.speculative</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.job.reduce.slowstart.completedmaps</name>
    <value>0.8</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop-nn:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop-nn:19888</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.https.address</name>
    <value>hadoop-nn:19890</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.admin.address</name>
    <value>hadoop-nn:10033</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.staging-dir</name>
    <value>/user</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.cpu-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>mapreduce.job.ubertask.enable</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.command-opts</name>
    <value>-Djava.net.preferIPv4Stack=true -Xmx825955249</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Djava.net.preferIPv4Stack=true -Xmx825955249</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Djava.net.preferIPv4Stack=true -Xmx825955249</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.admin.user.env</name>
    <value>LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>mapreduce.reduce.cpu.vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$MR2_CLASSPATH</value>
  </property>
  <property>
    <name>mapreduce.admin.user.env</name>
    <value>LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH</value>
  </property>
  <property>
    <name>mapreduce.shuffle.max.connections</name>
    <value>80</value>
  </property>
</configuration>

Highlighted
Posts: 1,566
Kudos: 287
Solutions: 240
Registered: ‎07-31-2013

Re: Multiple Tasks attempts FAILED in YARN: Timed out after 600 secs

Is there any pattern to this? For ex., do the few tasks that hang all run on the same host or specific set of hosts among all nodes in the cluster?

A more detailed root cause can be sought by performing a jstack on a task that appears hung live. This is done by first finding which host the hung task is running on (within the task timeout period, after noticing it hanging), discovering its container ID and finding the associated java process on the machine followed by the jstack command run on the PID.
Backline Customer Operations Engineer
Announcements