
MapReduce jobs stop executing after upgrading to CDH 5.5.2

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2


One thing I forgot to mention that you could do is check the memory and disk usage on the nodes running the NodeManager. If disk space, file descriptors, or memory fill up, the file-deletion problem can cause the NodeManager to shut down.
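For anyone hitting the same thing, a rough checklist on each NodeManager host might look like the following (the process name pattern, paths, and Linux-only tools are assumptions; adjust to your install):

```shell
# Run on each NodeManager host.

# 1. Disk usage: a full log or local dir makes YARN mark the dir bad
#    and can stop the NodeManager from deleting finished-container files.
df -h

# 2. Free memory on the host (Linux/procps only).
command -v free >/dev/null && free -m

# 3. File descriptors held by the NodeManager process, against its own limit.
NM_PID=$(pgrep -f 'nodemanager.NodeManager' | head -n1)
if [ -n "$NM_PID" ]; then
  echo "open fds: $(ls /proc/"$NM_PID"/fd 2>/dev/null | wc -l)"
  grep 'open files' /proc/"$NM_PID"/limits
fi
```

Note that `ulimit -n` in your own shell shows the shell's limit, not the NodeManager's; `/proc/<pid>/limits` shows what the daemon actually got.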

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2


I checked the memory on both nodes, and they both spike to 97.78% with all 10 containers running for about 10 minutes. I couldn't look at the file descriptors, though. All metrics spike the same way: GC, CPU usage, disk latency, network throughput, while the JVM Heap Memory Usage and Java Threads metrics go off the charts. The 4x2TB disks on each node are barely used because the amount of data is small, ~57GB. Do you have an idea of what to configure or change to fix this? And why does CDH 5.5.2 behave this way when earlier versions didn't? Also, 'ulimit -n' is 65536, if that helps.

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2


Hi Ben, 

 

The cause of the high memory/CPU usage in this case might not be exactly the same as what originally caused your jobs to get stuck. I see a lot of recovery activity in the NodeManager. My guess is that because the containers were never cleaned up, they were recovered when you restarted the NodeManager and are taking a lot of resources, but they still cannot be cleaned up, for the same reason the containers got stuck in the previous run. I suspect it has something to do with the cgroup setup, though I have no knowledge of how cgroups are used or set up in CDH. I have consistently seen cgroup issues in the NodeManager log, which may have led to the failure to clean up the containers' resources. As a result, the containers can never be reclaimed, and they stay in the state store to be recovered the next time the NodeManager comes back.
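If that cleanup theory is right, leftover per-container cgroup directories should still be visible on the NodeManager hosts. A quick way to look (the /sys/fs/cgroup/cpu mount point is an assumption; the /hadoop-yarn hierarchy name comes from the settings quoted in this thread):

```shell
# Each live container run under the LinuxContainerExecutor with cgroups
# gets a sub-directory under the hadoop-yarn hierarchy. Directories that
# persist after containers have exited suggest failed resource cleanup.
CG=/sys/fs/cgroup/cpu/hadoop-yarn
if [ -d "$CG" ]; then
  echo "leftover container cgroups under $CG:"
  find "$CG" -maxdepth 1 -type d -name 'container_*'
else
  echo "no $CG hierarchy found (check /proc/mounts for the cgroup mount point)"
fi
```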

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2


Are you saying that the completed tasks (containers) are not being cleaned up, so new ones cannot be allocated to other tasks? If so, is there a fix for this somewhere? Has anyone else run into it?

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2


Yes, that is very likely, although I haven't checked exactly how the NodeManager handles cleanup failures in the code. You could read up on using CGroups with YARN to verify some of your settings.

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2


These are the container-related settings we have in YARN. Do you see anything out of the ordinary?

 

  • CGroups Hierarchy
    • yarn.nodemanager.linux-container-executor.cgroups.hierarchy = /hadoop-yarn
  • Use CGroups for Resource Management
    • yarn.nodemanager.linux-container-executor.resources-handler.class = true
  • UNIX User for Nonsecure Mode with Linux Container Executor
    • yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user = nobody
  • Container Executor Group
    • yarn.nodemanager.linux-container-executor.group = yarn
  • Containers Environment Variable
    • yarn.nodemanager.admin-env = MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX
  • Containers Environment Variables Whitelist
    • yarn.nodemanager.env-whitelist = JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,HADOOP_YARN_HOME
  • Container Manager Thread Count
    • yarn.nodemanager.container-manager.thread-count = 20
  • Container Monitor Interval
    • yarn.resourcemanager.container.liveness-monitor.interval-ms = 10 minutes
  • Fair Scheduler Assign Multiple Tasks
    • yarn.scheduler.fair.assignmultiple = true
  • Always Use Linux Container Executor
    • yarn.nodemanager.container-executor.class = true 

Thanks.
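One thing worth flagging about the list above: in Cloudera Manager, "Always Use Linux Container Executor" and "Use CGroups for Resource Management" are checkboxes, so they display as true, but the underlying yarn-site.xml properties take class names. If you inspect the generated config, it should look roughly like this (a sketch of the expected values, not a verified dump of this cluster):

```xml
<!-- Expected generated values when both checkboxes are enabled;
     the properties take class names, not booleans. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
```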

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2


I'm afraid I cannot help much with cgroups, as I don't know how cgroups work with YARN (I just started working on YARN not long ago). Going by the Apache docs, you could verify that the hadoop-yarn cgroup hierarchy exists, and try setting yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user to the desired user, as described in the "CGroups and Security" section of https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html#CGroups_a...
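Following that doc, a minimal existence/ownership check on a NodeManager host could look like this (the /sys/fs/cgroup mount layout and the controllers to check are assumptions; adjust to your install):

```shell
# Show where cgroup controllers are mounted on this host.
grep cgroup /proc/mounts || echo "no cgroup mounts found"

# The hadoop-yarn hierarchy must exist under each controller YARN uses,
# and must be writable by the user running the NodeManager.
for ctl in cpu memory; do
  d=/sys/fs/cgroup/$ctl/hadoop-yarn
  if [ -d "$d" ]; then
    ls -ld "$d"   # check owner/permissions against the NodeManager user
  else
    echo "missing: $d"
  fi
done
```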

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2


If this is of any help: on both clusters, I had to turn off log aggregation because otherwise none of the jobs would start. They would be stuck in the pending state and never run. Could this be a symptom?
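One way to see how far those pending jobs get is to ask the ResourceManager directly from a gateway host (assuming the yarn CLI is on the path):

```shell
# Apps that sit in ACCEPTED and never reach RUNNING suggest the
# ResourceManager cannot allocate an AM container, e.g. because no
# NodeManagers are reporting healthy.
if command -v yarn >/dev/null; then
  yarn application -list -appStates ACCEPTED
  yarn node -list -all   # NodeManager health as the RM sees it
else
  echo "yarn CLI not found; run this on a cluster host"
fi
```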

 

Thanks,

Ben

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

I will try that.

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2


You could upload a container log, so we can verify, from the container's perspective, what was happening.
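For reference, with log aggregation enabled the logs can be pulled with the yarn CLI; with aggregation off, as mentioned earlier in the thread, they stay on each NodeManager's local disk. A sketch (the application ID is a placeholder and the local log path is a CDH-style default):

```shell
# With log aggregation ON: fetch all container logs for one application.
# application_1458678000000_0001 is a placeholder ID.
if command -v yarn >/dev/null; then
  yarn logs -applicationId application_1458678000000_0001
fi

# With log aggregation OFF: look under the NodeManager's local log dir
# (check yarn.nodemanager.log-dirs for the real path on your cluster).
ls /var/log/hadoop-yarn/container 2>/dev/null || true
```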