
Yarn containers get KILLED automatically. MapReduce/Hive jobs fail

Explorer

We have a 39-node HDP cluster and have been observing a peculiar phenomenon: some MapReduce/Hive jobs are allotted containers, the containers transition to the RUNNING state, and then they get killed automatically. Here are the logs from one of the NodeManagers:

2016-11-28 16:26:02,521 INFO  container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000002 transitioned from LOCALIZED to RUNNING
2016-11-28 16:26:02,613 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000004
2016-11-28 16:26:02,613 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000002
2016-11-28 16:26:02,613 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000003
2016-11-28 16:26:02,613 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:02,646 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19599 for container-id container_e40_1479381018014_95355_01_000004: 24.6 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:02,681 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19597 for container-id container_e40_1479381018014_95355_01_000005: 30.0 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:02,717 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19600 for container-id container_e40_1479381018014_95355_01_000002: 39.3 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:02,760 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19598 for container-id container_e40_1479381018014_95355_01_000003: 50.3 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:03,698 INFO  ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,699 INFO  ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,700 INFO  authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,700 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:03,701 INFO  nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest  IP=172.23.35.43 OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1479381018014_95355   CONTAINERID=container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:03,701 INFO  container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000005 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,701 INFO  launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(371)) - Cleaning up container container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:03,702 INFO  ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,702 INFO  ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,702 INFO  authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,703 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000004
2016-11-28 16:26:03,703 INFO  nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest  IP=172.23.35.43 OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1479381018014_95355   CONTAINERID=container_e40_1479381018014_95355_01_000004
2016-11-28 16:26:03,703 INFO  authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,704 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000002
2016-11-28 16:26:03,704 INFO  nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest  IP=172.23.35.43 OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1479381018014_95355   CONTAINERID=container_e40_1479381018014_95355_01_000002
2016-11-28 16:26:03,704 INFO  authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,704 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000003
2016-11-28 16:26:03,704 INFO  nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest  IP=172.23.35.43 OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1479381018014_95355   CONTAINERID=container_e40_1479381018014_95355_01_000003
2016-11-28 16:26:03,706 WARN  nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:launchContainer(381)) - Exit code from container container_e40_1479381018014_95355_01_000005 is : 143
2016-11-28 16:26:03,716 INFO  container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000004 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,716 INFO  container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000002 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,716 INFO  container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000003 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,716 INFO  container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000005 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2016-11-28 16:26:03,716 INFO  launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(371)) - Cleaning up container container_e40_1479381018014_95355_01_000004

Super Collaborator

Hi @Garima Dosi

If you look at the full yarn logs (i.e. 'yarn logs -applicationId <app id>'), do you see "Container killed on request" in the logs? The exit code may hold some clues.
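For example, something along these lines should surface the relevant entries (a sketch - substitute the real application id, and the output file name is just an example):

# dump the aggregated application logs, then look for the kill message and the container exit codes
yarn logs -applicationId application_1479381018014_95355 > app_95355.log
grep -E "Container killed on request|Exit code" app_95355.log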

Best Regards,

Craig 🙂

Explorer

Attaching the YARN logs for that application: app-logs-17399.txt

Super Collaborator

Are you sure a user (or process) is not issuing "yarn application -kill <app id>"? From the logs, that is what it looks like. What is running on 172.23.35.6 that would have killed the application?

2016-11-28 16:26:03,683 INFO [IPC Server handler 0 on 37719] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Kill job job_1479381018014_95355 received from bdsa_ingest (auth:TOKEN) at 172.23.35.6


2016-11-28 16:26:03,866 INFO [Thread-107] org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: Setting job diagnostics to Kill job job_1479381018014_95355 received from bdsa_ingest (auth:TOKEN) at 172.23.35.6
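One way to dig further is to grep the ResourceManager log on that master for the application id and any kill/diagnostics entries - a sketch, assuming the usual HDP log location (adjust the path for your cluster):

# look for kill requests and diagnostics recorded against the application
grep "application_1479381018014_95355" /var/log/hadoop-yarn/yarn/yarn-yarn-resourcemanager-*.log | grep -iE "kill|diagnostics"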

Explorer

@clukasik We are sure that no one issued the yarn kill command. Moreover, 172.23.35.6 is a machine where only master services are running, and access to it is restricted. The services on this machine are:

namenode, hive_server, hive_metastore, journalnode, kerberos_client, mapreduce2_client, falcon_client, hbase_client, hbase_master, hcat, hdfs_client, hive_client, mahout, metrics_monitor, oozie_client, phoenix_query_server, pig, ranger_admin, ranger_usersync, resourcemanager, spark_client, spark_thriftserver, sqoop, tez_client, webhcat_server, yarn_client, zkfc, zookeeper_client, zookeeper_server

Since these are Hive jobs, there is a possibility that hive_server itself is issuing some sort of kill command. Not sure though - will check the Hive logs as well.
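For example, this is roughly what we plan to grep (assuming the default HDP log location - adjust for your install):

# correlate FATAL errors in HiveServer2 with the killed job around that timestamp
grep -n "FATAL" /var/log/hive/hiveserver2.log
grep -n "job_1479381018014_95355" /var/log/hive/hiveserver2.log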

Explorer

@clukasik

I checked the HiveServer2 logs and found that, for every one of these MapReduce failures for Hive jobs, there was always a FATAL error in Hive before the kill signal was seen in the MapReduce logs. The stack trace for this particular error is attached in "hiveserver2-logs-error.txt", and the activity of the thread where it happened is available in "thread-logs.txt".

On further investigation, we also found that the attached error can be traced back to an unresolved JIRA -

Please let me know your thoughts on this.

Super Collaborator

@Garima Dosi - I see this WARN in the logs:

2016-11-28 16:25:51,029 WARN  [Thread-2037242]: split.JobSplitWriter (JobSplitWriter.java:writeOldSplits(168)) - Max block location exceeded for split: 

I am not sure if it would lead to the current outcome but it might be a clue.

Explorer

@clukasik The warning is not a problem. We increased the value of mapreduce.job.max.split.locations from 10 to the number of data nodes, and that's it. The error mentioned previously still happens. I still think the issue is related to this JIRA -
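(For reference, the change was just that one property. As a sketch, it can also be overridden for a single Hive session via beeline - the host placeholder and query file below are hypothetical, and 39 stands for our data-node count:)

# override the split-location limit for one session instead of cluster-wide
beeline -u "jdbc:hive2://<hiveserver2_host>:10000" \
        --hiveconf mapreduce.job.max.split.locations=39 \
        -f query.hql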

Explorer

We are using Hadoop version 2.7.1.2.4.2.0-258, where this bug seems to have already been resolved. However, we are not sure whether this is the issue.
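(For reference, the exact build on the cluster can be confirmed on any node with the standard CLI:)

# prints the installed Hadoop version and build revision
hadoop version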