YARN containers get KILLED automatically; MapReduce/Hive jobs fail


We have a 39-node HDP cluster and have been observing a peculiar phenomenon: some MapReduce/Hive jobs are allotted containers, the containers transition to the RUNNING state, and then they get killed automatically. Here are the logs from a NodeManager:

2016-11-28 16:26:02,521 INFO  container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000002 transitioned from LOCALIZED to RUNNING
2016-11-28 16:26:02,613 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000004
2016-11-28 16:26:02,613 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000002
2016-11-28 16:26:02,613 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000003
2016-11-28 16:26:02,613 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:02,646 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19599 for container-id container_e40_1479381018014_95355_01_000004: 24.6 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:02,681 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19597 for container-id container_e40_1479381018014_95355_01_000005: 30.0 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:02,717 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19600 for container-id container_e40_1479381018014_95355_01_000002: 39.3 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:02,760 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19598 for container-id container_e40_1479381018014_95355_01_000003: 50.3 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:03,698 INFO  ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,699 INFO  ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,700 INFO  authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,700 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:03,701 INFO  nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest  IP=172.23.35.43 OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1479381018014_95355   CONTAINERID=container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:03,701 INFO  container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000005 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,701 INFO  launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(371)) - Cleaning up container container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:03,702 INFO  ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,702 INFO  ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,702 INFO  authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,703 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000004
2016-11-28 16:26:03,703 INFO  nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest  IP=172.23.35.43 OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1479381018014_95355   CONTAINERID=container_e40_1479381018014_95355_01_000004
2016-11-28 16:26:03,703 INFO  authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,704 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000002
2016-11-28 16:26:03,704 INFO  nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest  IP=172.23.35.43 OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1479381018014_95355   CONTAINERID=container_e40_1479381018014_95355_01_000002
2016-11-28 16:26:03,704 INFO  authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,704 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000003
2016-11-28 16:26:03,704 INFO  nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest  IP=172.23.35.43 OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1479381018014_95355   CONTAINERID=container_e40_1479381018014_95355_01_000003
2016-11-28 16:26:03,706 WARN  nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:launchContainer(381)) - Exit code from container container_e40_1479381018014_95355_01_000005 is : 143
2016-11-28 16:26:03,716 INFO  container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000004 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,716 INFO  container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000002 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,716 INFO  container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000003 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,716 INFO  container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000005 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2016-11-28 16:26:03,716 INFO  launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(371)) - Cleaning up container container_e40_1479381018014_95355_01_000004
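
Note that the NodeManager log only shows the ApplicationMaster issuing Stop Container requests (exit code 143 corresponds to SIGTERM, i.e. the container was killed rather than crashing on its own), so the real reason has to be dug out of the application's own logs. A minimal sketch of how one might pull them, assuming YARN log aggregation is enabled on the cluster (the output file name is just an illustration):

# Fetch the aggregated logs for the affected application
yarn logs -applicationId application_1479381018014_95355 > app_95355.log

# Narrow the output down to one of the killed containers seen above
grep -A 20 "container_e40_1479381018014_95355_01_000005" app_95355.log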

@clukasik, @Divakar Annapureddy

This was an issue related to this JIRA: https://issues.apache.org/jira/browse/HIVE-11681

We have Hive jobs that use the same auxiliary jars via "add jar" statements, and these jobs execute concurrently. Apparently, when one of the jobs completes, the class loader for the added jars is closed while the other job is still using it, hence the KILL signal. We solved it by removing the "add jar" statements from the concurrent Hive jobs and instead adding the auxiliary jars to hive.aux.jars.path when HiveServer2 starts. This way the jars are loaded only once, are not removed/closed/unloaded per session, and the concurrent Hive jobs have no problem accessing them.
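
For context, a minimal sketch of the per-session pattern the concurrent jobs used before the fix (the jar, function, and table names here are hypothetical placeholders, not our actual ones):

cat > /tmp/job.hql <<'EOF'
-- per-session jar registration that was removed as part of the fix
ADD JAR /<path_to_aux_lib_dir>/custom-udfs.jar;
CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUdf';
INSERT OVERWRITE TABLE target SELECT my_udf(col) FROM source;
EOF
beeline -u "jdbc:hive2://<hs2_host>:10000/default" -f /tmp/job.hql

When several such sessions run against HS2 at the same time, finishing one session closes the loader for the shared jars while another session is still using it, which is the failure mode described above.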

To make the jars available to HS2, we copied them to each machine hosting HS2 and then, through Ambari, added the path to HIVE_AUX_JARS_PATH in the hive-env template script, for the hiveserver2 component only:

if [ "$SERVICE" = "hiveserver2" ]; then

CUSTOM_AUX_JARS_PATH=/<path_to_aux_lib_dir>

if [ -d $CUSTOM_AUX_JARS_PATH ]; then

CUSTOM_AUX_JARS=`echo $CUSTOM_AUX_JARS_PATH/*.jar | sed 's/ /,/g'`

export HIVE_AUX_JARS_PATH="$HIVE_AUX_JARS_PATH,$CUSTOM_AUX_JARS"

fi

fi
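
After restarting HiveServer2, a quick sanity check (just a sketch on our side, not part of the fix itself) is to confirm from a client session that the jars show up in the effective configuration; in Hive, SET <property> prints the current value:

# <hs2_host> is a placeholder for the HiveServer2 host
beeline -u "jdbc:hive2://<hs2_host>:10000/default" -e "set hive.aux.jars.path;"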