Expert Contributor
Posts: 61
Registered: ‎02-03-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

I will try that.

Cloudera Employee
Posts: 55
Registered: ‎03-07-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

One thing I forgot to mention: you could check the memory and disk usage of the nodes running Node Manager. Disk space, file descriptors, or memory could fill up and cause Node Manager to shut down because of the file deletion problem.
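
A minimal sketch of that check, assuming a Linux node where `/proc/meminfo` is readable. The function names are made up for illustration, and the path passed to `disk_used_percent` is a placeholder for whatever local and log directories your Node Manager is configured with.

```python
import os
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_used_percent(meminfo):
    """Percent of memory in use, preferring MemAvailable over MemFree."""
    total = meminfo["MemTotal"]
    available = meminfo.get("MemAvailable", meminfo.get("MemFree", 0))
    return 100.0 * (total - available) / total


def disk_used_percent(path):
    """Percent of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print(f"memory used: {memory_used_percent(info):.1f}%")
    # Assumption: "/" stands in for the Node Manager local and log directories.
    for path in ["/"]:
        print(f"{path}: {disk_used_percent(path):.1f}% used")
```

Running it on each Node Manager host while the containers are active gives a quick snapshot to compare against the charts.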

Expert Contributor
Posts: 61
Registered: ‎02-03-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

I checked the memory on both nodes, and both spike to 97.78% with all 10 containers running for about 10 minutes. I couldn't look at the file descriptors, though. All the other metrics spike the same way (GC, CPU usage, disk latency, network throughput), while the metrics for JVM Heap Memory Usage and Java Threads disappear off the charts. The 4x2TB disks on each node are barely used because the amount of data is small, ~57GB. Do you have an idea of what to configure or change to fix this? And why does CDH 5.5.2 do this when earlier versions did not? Also, 'ulimit -n' is 65536, in case that helps.
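
For the file descriptors, one possible approach on Linux is to count the entries under `/proc/<pid>/fd` and compare them with the "Max open files" row of `/proc/<pid>/limits`. This is only a sketch: the function names are hypothetical, the Node Manager PID has to be supplied by hand, and reading another user's `/proc/<pid>/fd` normally requires running as that user or as root.

```python
import os
import sys


def count_open_fds(pid, proc_root="/proc"):
    """Number of file descriptors currently open in process `pid`."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def parse_max_open_files(limits_text):
    """Soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None if the row is missing or the limit is 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            soft = fields[3]  # fields: Max, open, files, <soft>, <hard>, files
            return int(soft) if soft.isdigit() else None
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    pid = int(sys.argv[1])  # PID of the Node Manager JVM, supplied by the caller
    with open(f"/proc/{pid}/limits") as f:
        limit = parse_max_open_files(f.read())
    print(f"open fds: {count_open_fds(pid)}, soft limit: {limit}")
```

If the count sits far below the 65536 reported by `ulimit -n`, file descriptor exhaustion is an unlikely cause.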

Cloudera Employee
Posts: 55
Registered: ‎03-07-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

Hi Ben,

The high memory and CPU usage in this case might not have the same cause as your original problem of jobs getting stuck. I see a lot of recovered state in Node Manager. My guess is that the containers were not cleaned up previously, so they were recovered when you restarted Node Manager, taking up a lot of resources. But they still cannot be cleaned up, for the same reason that caused containers to get stuck in the previous run. I suspect it has something to do with the cgroup setup, though I have no knowledge of how cgroups are used or set up in CDH. I have consistently seen cgroup issues in the Node Manager log, which might have led to the failure to clean up the containers' resources. As a result, the containers can never be reclaimed, and they stay in the state store to be recovered the next time Node Manager comes back.
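
One way to confirm how often cgroups show up in the Node Manager log is a simple case-insensitive scan. This is a sketch only: the log path is passed on the command line because its location depends on the installation, and `find_cgroup_lines` is a hypothetical helper, not part of any Hadoop tooling.

```python
import sys


def find_cgroup_lines(lines):
    """Return (line_number, text) pairs for lines that mention cgroups."""
    matches = []
    for number, line in enumerate(lines, start=1):
        if "cgroup" in line.lower():
            matches.append((number, line.rstrip("\n")))
    return matches


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: the first argument is the path to a Node Manager log file.
    with open(sys.argv[1], errors="replace") as log:
        for number, text in find_cgroup_lines(log):
            print(f"{number}: {text}")
```

Comparing the timestamps of the matching lines with the times the containers got stuck would show whether the two are actually related.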

Expert Contributor
Posts: 61
Registered: ‎02-03-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

Are you saying that the completed tasks (containers) are not being cleaned up so new ones cannot be allocated to other tasks? If so, is there something that addresses this somewhere? Has anyone else heard of this?

Cloudera Employee
Posts: 55
Registered: ‎03-07-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

Yes, it is very likely. I haven't checked exactly how Node Manager handles cleanup failures in the code, though. You could read CGroup with Yarn to verify some of the settings.

Expert Contributor
Posts: 61
Registered: ‎02-03-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

These are the settings in YARN that we have set regarding containers. Do you see anything out of the ordinary?

  • CGroups Hierarchy
    • yarn.nodemanager.linux-container-executor.cgroups.hierarchy = /hadoop-yarn
  • Use CGroups for Resource Management
    • yarn.nodemanager.linux-container-executor.resources-handler.class = true
  • UNIX User for Nonsecure Mode with Linux Container Executor
    • yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user = nobody
  • Container Executor Group
    • yarn.nodemanager.linux-container-executor.group = yarn
  • Containers Environment Variable
    • yarn.nodemanager.admin-env = MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX
  • Containers Environment Variables Whitelist
    • yarn.nodemanager.env-whitelist = JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,HADOOP_YARN_HOME
  • Container Manager Thread Count
    • yarn.nodemanager.container-manager.thread-count = 20
  • Container Monitor Interval
    • yarn.resourcemanager.container.liveness-monitor.interval-ms = 10 minutes
  • Fair Scheduler Assign Multiple Tasks
    • yarn.scheduler.fair.assignmultiple = true
  • Always Use Linux Container Executor
    • yarn.nodemanager.container-executor.class = true 
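
The `= true` values above look like management-console checkboxes. In stock Apache Hadoop, `yarn.nodemanager.container-executor.class` and `yarn.nodemanager.linux-container-executor.resources-handler.class` take class names (`org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor` and `org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler`), so it may be worth checking what ended up in the `yarn-site.xml` that the Node Manager actually reads. A minimal sketch, assuming the standard Hadoop `<configuration><property>` layout; the helper names are hypothetical and the file path is supplied on the command line:

```python
import sys
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Parse Hadoop-style configuration XML into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def container_executor_settings(props):
    """Keep only the container-executor and cgroup related properties."""
    return {
        name: value
        for name, value in props.items()
        if "linux-container-executor" in name
        or name == "yarn.nodemanager.container-executor.class"
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: the first argument is the path to the effective yarn-site.xml.
    with open(sys.argv[1]) as f:
        settings = container_executor_settings(load_properties(f.read()))
    for name in sorted(settings):
        print(f"{name} = {settings[name]}")
```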

Thanks.

Cloudera Employee
Posts: 55
Registered: ‎03-07-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

You can upload a container log so we can verify from the container's perspective what was happening.

Cloudera Employee
Posts: 55
Registered: ‎03-07-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

I'm afraid I cannot help you with cgroups, as I don't know how cgroups work with YARN (I only started working on YARN recently). Going by the Apache documentation, you could verify that the hadoop-yarn cgroup hierarchy exists and try setting yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user to the desired user, as described in the section CGroups and Security of https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html#CGroups_a...
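
A rough sketch of the hierarchy check, under two assumptions that may not hold on a given cluster: cgroups (v1) are mounted under `/sys/fs/cgroup`, and YARN uses the `cpu` controller. The function names are hypothetical, and the hierarchy value `/hadoop-yarn` is taken from the settings posted earlier in this thread.

```python
import os


def hierarchy_paths(hierarchy, mount_root="/sys/fs/cgroup", controllers=("cpu",)):
    """Directories where the YARN cgroup hierarchy would be expected."""
    relative = hierarchy.strip("/")
    return [os.path.join(mount_root, c, relative) for c in controllers]


def missing_hierarchies(hierarchy, mount_root="/sys/fs/cgroup", controllers=("cpu",)):
    """Subset of the expected directories that do not exist."""
    return [
        path
        for path in hierarchy_paths(hierarchy, mount_root, controllers)
        if not os.path.isdir(path)
    ]


if __name__ == "__main__":
    # Assumption: mount root and controller list match this node's cgroup setup.
    missing = missing_hierarchies("/hadoop-yarn")
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("hadoop-yarn hierarchy found")
```

If the directory exists, its ownership is also worth a look, since the Node Manager needs to be able to create and delete per-container cgroups underneath it.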
