<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: One dead big job blocks all jobs in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/One-dead-big-job-blocks-all-jobs/m-p/95803#M59162</link>
    <description>&lt;P&gt;Yes, do check ALL the HDP configurations and make sure the physical memory is not overcommitted. The possibility of a rogue process consuming memory on the node is still there, but that may not be in your control.&lt;/P&gt;</description>
    <pubDate>Thu, 22 Oct 2015 20:40:50 GMT</pubDate>
    <dc:creator>TerryP</dc:creator>
    <dc:date>2015-10-22T20:40:50Z</dc:date>
    <item>
      <title>One dead big job blocks all jobs</title>
      <link>https://community.cloudera.com/t5/Support-Questions/One-dead-big-job-blocks-all-jobs/m-p/95800#M59159</link>
      <description>&lt;P&gt;On a relatively busy cluster, I ran a huge job that consumed almost 100% of the resources. During the shuffle phase it died with an OOM on a NodeManager, and after that no job, including this one, made any progress.&lt;/P&gt;&lt;P&gt;To recover from this state, I had to kill this job and the other jobs as well.&lt;/P&gt;&lt;P&gt;This can't be reproduced at will, but it happens occasionally.&lt;/P&gt;&lt;P&gt;Have you come across a similar symptom?
Is there a smarter way to recover from this state? Killing jobs manually isn't ideal.
Do I maybe need to check or modify some YARN configuration?&lt;/P&gt;</description>
      <pubDate>Thu, 22 Oct 2015 14:26:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/One-dead-big-job-blocks-all-jobs/m-p/95800#M59159</guid>
      <dc:creator>hosako</dc:creator>
      <dc:date>2015-10-22T14:26:41Z</dc:date>
    </item>
    <item>
      <title>Re: One dead big job blocks all jobs</title>
      <link>https://community.cloudera.com/t5/Support-Questions/One-dead-big-job-blocks-all-jobs/m-p/95801#M59160</link>
      <description>&lt;P&gt;@&lt;A href="http://community.hortonworks.com/users/206/hosako.html"&gt;hosako@hortonworks.com&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Do you have the Capacity Scheduler configured? I highly recommend deploying the Capacity Scheduler View and configuring queues so that resources are allocated appropriately.&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="http://hortonworks.com/hadoop-tutorial/configuring-yarn-capacity-scheduler-ambari/"&gt;Configuring the YARN Capacity Scheduler with Ambari&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 22 Oct 2015 17:40:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/One-dead-big-job-blocks-all-jobs/m-p/95801#M59160</guid>
      <dc:creator>nsabharwal</dc:creator>
      <dc:date>2015-10-22T17:40:22Z</dc:date>
    </item>
    <item>
      <title>Re: One dead big job blocks all jobs</title>
      <link>https://community.cloudera.com/t5/Support-Questions/One-dead-big-job-blocks-all-jobs/m-p/95802#M59161</link>
      <description>&lt;P&gt;Good point! But I think if you run into an OOM issue, even the Capacity Scheduler wouldn't help. It's probably a good idea to validate the YARN configuration, especially the memory settings.&lt;/P&gt;</description>
      <pubDate>Thu, 22 Oct 2015 20:18:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/One-dead-big-job-blocks-all-jobs/m-p/95802#M59161</guid>
      <dc:creator>jstraub</dc:creator>
      <dc:date>2015-10-22T20:18:10Z</dc:date>
    </item>
    <item>
      <title>Re: One dead big job blocks all jobs</title>
      <link>https://community.cloudera.com/t5/Support-Questions/One-dead-big-job-blocks-all-jobs/m-p/95803#M59162</link>
      <description>&lt;P&gt;Yes, do check ALL the HDP configurations and make sure the physical memory is not overcommitted. The possibility of a rogue process consuming memory on the node is still there, but that may not be in your control.&lt;/P&gt;</description>
      <pubDate>Thu, 22 Oct 2015 20:40:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/One-dead-big-job-blocks-all-jobs/m-p/95803#M59162</guid>
      <dc:creator>TerryP</dc:creator>
      <dc:date>2015-10-22T20:40:50Z</dc:date>
    </item>
    <item>
      <title>Re: One dead big job blocks all jobs</title>
      <link>https://community.cloudera.com/t5/Support-Questions/One-dead-big-job-blocks-all-jobs/m-p/95804#M59163</link>
      <description>&lt;P&gt;I would suggest installing SmartSense, as we have specific recommendations on optimal memory configurations for YARN, MR2, and others.&lt;/P&gt;</description>
      <pubDate>Thu, 22 Oct 2015 20:56:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/One-dead-big-job-blocks-all-jobs/m-p/95804#M59163</guid>
      <dc:creator>pcodding</dc:creator>
      <dc:date>2015-10-22T20:56:26Z</dc:date>
    </item>
    <item>
      <title>Re: One dead big job blocks all jobs</title>
      <link>https://community.cloudera.com/t5/Support-Questions/One-dead-big-job-blocks-all-jobs/m-p/95805#M59164</link>
      <description>&lt;P&gt;Doesn't YARN offer a protection mechanism against excessive memory overcommitment?&lt;/P&gt;&lt;P&gt;I am thinking of these parameters:&lt;/P&gt;&lt;P&gt;yarn.nodemanager.pmem-check-enabled&lt;/P&gt;&lt;P&gt;yarn.nodemanager.vmem-check-enabled&lt;/P&gt;&lt;P&gt;yarn.nodemanager.vmem-pmem-ratio&lt;/P&gt;</description>
      <pubDate>Thu, 22 Oct 2015 22:22:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/One-dead-big-job-blocks-all-jobs/m-p/95805#M59164</guid>
      <dc:creator>sluangsay</dc:creator>
      <dc:date>2015-10-22T22:22:40Z</dc:date>
    </item>
    <item>
      <title>Re: One dead big job blocks all jobs</title>
      <link>https://community.cloudera.com/t5/Support-Questions/One-dead-big-job-blocks-all-jobs/m-p/95806#M59165</link>
      <description>&lt;P&gt;Thanks, everyone!&lt;/P&gt;&lt;P&gt;What Terry described looks very close to the symptom we saw.&lt;/P&gt;&lt;P&gt;SmartSense is already installed and the Capacity Scheduler is configured; I will review that configuration.&lt;/P&gt;&lt;P&gt;I will also check the YARN NodeManager parameters.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Oct 2015 13:56:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/One-dead-big-job-blocks-all-jobs/m-p/95806#M59165</guid>
      <dc:creator>hosako</dc:creator>
      <dc:date>2015-10-23T13:56:26Z</dc:date>
    </item>
  </channel>
</rss>

