Created 11-28-2016 03:28 PM
We have a 39-node HDP cluster. We have been observing a peculiar phenomenon: some MapReduce/Hive jobs are allotted containers, the containers transition to the RUNNING state, and then they are automatically killed. Here are the logs from one of the NodeManagers:
2016-11-28 16:26:02,521 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000002 transitioned from LOCALIZED to RUNNING
2016-11-28 16:26:02,613 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000004
2016-11-28 16:26:02,613 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000002
2016-11-28 16:26:02,613 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000003
2016-11-28 16:26:02,613 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:02,646 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19599 for container-id container_e40_1479381018014_95355_01_000004: 24.6 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:02,681 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19597 for container-id container_e40_1479381018014_95355_01_000005: 30.0 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:02,717 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19600 for container-id container_e40_1479381018014_95355_01_000002: 39.3 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:02,760 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19598 for container-id container_e40_1479381018014_95355_01_000003: 50.3 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:03,698 INFO ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,699 INFO ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,700 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,700 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:03,701 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest IP=172.23.35.43 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1479381018014_95355 CONTAINERID=container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:03,701 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000005 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,701 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(371)) - Cleaning up container container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:03,702 INFO ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,702 INFO ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,702 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,703 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000004
2016-11-28 16:26:03,703 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest IP=172.23.35.43 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1479381018014_95355 CONTAINERID=container_e40_1479381018014_95355_01_000004
2016-11-28 16:26:03,703 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,704 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000002
2016-11-28 16:26:03,704 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest IP=172.23.35.43 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1479381018014_95355 CONTAINERID=container_e40_1479381018014_95355_01_000002
2016-11-28 16:26:03,704 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,704 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000003
2016-11-28 16:26:03,704 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest IP=172.23.35.43 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1479381018014_95355 CONTAINERID=container_e40_1479381018014_95355_01_000003
2016-11-28 16:26:03,706 WARN nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:launchContainer(381)) - Exit code from container container_e40_1479381018014_95355_01_000005 is : 143
2016-11-28 16:26:03,716 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000004 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,716 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000002 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,716 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000003 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,716 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000005 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2016-11-28 16:26:03,716 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(371)) - Cleaning up container container_e40_1479381018014_95355_01_000004
Created 01-08-2017 11:18 AM
@clukasik, @Divakar Annapureddy
This turned out to be the issue described in HIVE-11681 - https://issues.apache.org/jira/browse/HIVE-11681
We have Hive jobs that use the same aux jars via "add jar" statements, and these jobs execute concurrently. Apparently, when one of the jobs completes, its session closes the shared classloader while another job is still using it, and that is what triggers the KILL signal. We solved it by removing the "add jar" statements from the concurrent Hive jobs and instead adding the auxiliary jars to hive.aux.jars.path at HiveServer2 start. That way the jars are loaded only once and are not removed/closed/unloaded per session, so concurrent Hive jobs have no problem accessing them.
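For illustration only, here is a rough sketch of the change from our side; the JDBC URL, jar path, UDF name and table are placeholders, not the actual job:

# Before: every concurrent job added the jar in its own session, so when the
# first session ended HS2 closed the classloader still in use by the others (HIVE-11681)
beeline -u "jdbc:hive2://hs2-host:10000/default" \
  -e "ADD JAR /tmp/my-udfs.jar; SELECT my_udf(col1) FROM some_table;"

# After: no ADD JAR in the job; the jar is already on hive.aux.jars.path,
# loaded once when HS2 starts (see the hive-env snippet below)
beeline -u "jdbc:hive2://hs2-host:10000/default" \
  -e "SELECT my_udf(col1) FROM some_table;"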
To make the jars available to HS2, they were copied to each machine hosting HS2, and then, through Ambari, the path was appended to HIVE_AUX_JARS_PATH for the hiveserver2 component only, in the hive-env template script:
if [ "$SERVICE" = "hiveserver2" ]; then
CUSTOM_AUX_JARS_PATH=/<path_to_aux_lib_dir>
if [ -d $CUSTOM_AUX_JARS_PATH ]; then
CUSTOM_AUX_JARS=`echo $CUSTOM_AUX_JARS_PATH/*.jar | sed 's/ /,/g'`
export HIVE_AUX_JARS_PATH="$HIVE_AUX_JARS_PATH,$CUSTOM_AUX_JARS"
fi
fi
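After restarting HiveServer2 from Ambari, one way to sanity-check that the jars were picked up (this is just how we verified it; the URL is a placeholder, and the exact property value may differ depending on how your Hive launch scripts translate HIVE_AUX_JARS_PATH) is to read the property back from a beeline session:

# Should print the configured aux jars path, including the custom jars
beeline -u "jdbc:hive2://hs2-host:10000/default" -e "SET hive.aux.jars.path;"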