Created 11-28-2016 03:28 PM
We have a 39-node HDP cluster and have been observing a peculiar phenomenon: some MapReduce/Hive jobs are allotted containers, the containers transition to the RUNNING state, and then they are automatically killed. Here are the logs from a NodeManager -
2016-11-28 16:26:02,521 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000002 transitioned from LOCALIZED to RUNNING
2016-11-28 16:26:02,613 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000004
2016-11-28 16:26:02,613 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000002
2016-11-28 16:26:02,613 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000003
2016-11-28 16:26:02,613 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:02,646 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19599 for container-id container_e40_1479381018014_95355_01_000004: 24.6 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:02,681 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19597 for container-id container_e40_1479381018014_95355_01_000005: 30.0 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:02,717 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19600 for container-id container_e40_1479381018014_95355_01_000002: 39.3 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:02,760 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19598 for container-id container_e40_1479381018014_95355_01_000003: 50.3 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:03,698 INFO ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,699 INFO ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,700 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,700 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:03,701 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest IP=172.23.35.43 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1479381018014_95355 CONTAINERID=container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:03,701 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000005 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,701 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(371)) - Cleaning up container container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:03,702 INFO ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,702 INFO ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,702 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,703 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000004
2016-11-28 16:26:03,703 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest IP=172.23.35.43 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1479381018014_95355 CONTAINERID=container_e40_1479381018014_95355_01_000004
2016-11-28 16:26:03,703 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,704 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000002
2016-11-28 16:26:03,704 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest IP=172.23.35.43 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1479381018014_95355 CONTAINERID=container_e40_1479381018014_95355_01_000002
2016-11-28 16:26:03,704 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,704 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000003
2016-11-28 16:26:03,704 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest IP=172.23.35.43 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1479381018014_95355 CONTAINERID=container_e40_1479381018014_95355_01_000003
2016-11-28 16:26:03,706 WARN nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:launchContainer(381)) - Exit code from container container_e40_1479381018014_95355_01_000005 is : 143
2016-11-28 16:26:03,716 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000004 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,716 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000002 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,716 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000003 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,716 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000005 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2016-11-28 16:26:03,716 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(371)) - Cleaning up container container_e40_1479381018014_95355_01_000004
Created 11-28-2016 03:48 PM
Hi @Garima Dosi
If you look at the full yarn logs (i.e. 'yarn logs -applicationId <app id>'), do you see "Container killed on request" in the logs? The exit code may hold some clues.
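Something along these lines should pull the aggregated logs and surface any kill/exit-code messages (the application id comes from your NodeManager log above; the output file name is arbitrary):

yarn logs -applicationId application_1479381018014_95355 > app_95355.log
grep -i "Container killed on request" app_95355.log
grep -i "exit code" app_95355.log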
Best Regards,
Craig 🙂
Created 11-29-2016 05:05 AM
Attaching the yarn logs for that application: app-logs-17399.txt
Created 11-29-2016 05:20 PM
Are you sure a user (or process) is not issuing "yarn application -kill <app id>"? From the logs, that is what it looks like. What is running on 172.23.35.6 that would have killed the application?
2016-11-28 16:26:03,683 INFO [IPC Server handler 0 on 37719] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Kill job job_1479381018014_95355 received from bdsa_ingest (auth:TOKEN) at 172.23.35.6
2016-11-28 16:26:03,866 INFO [Thread-107] org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: Setting job diagnostics to Kill job job_1479381018014_95355 received from bdsa_ingest (auth:TOKEN) at 172.23.35.6
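One way to narrow this down is to check the ResourceManager log on its host for a kill request against that application. If someone ran "yarn application -kill", the RM audit log should show a kill entry for the application; if there is nothing there, the kill most likely went straight to the ApplicationMaster from a client such as the Hive driver. Something along these lines (the log path is only the typical HDP layout and may differ on your cluster):

grep -i "Kill Application Request" /var/log/hadoop-yarn/yarn/*resourcemanager*.log
grep "application_1479381018014_95355" /var/log/hadoop-yarn/yarn/*resourcemanager*.log | grep -i kill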
Created 11-30-2016 06:50 AM
@clukasik We are sure that no one has issued the yarn kill command. Moreover, 172.23.35.6 is a machine where only master services run, and access to it is restricted. The services on this machine are -
namenode, hive_server, hive_metastore, journalnode, kerberos_client,mapreduce2_client,
falcon_client, hbase_client, hbase_master, hcat, hdfs_client, hive_client, mahout, metrics_monitor, oozie_client, phoenix_query_server, pig, ranger_admin, ranger_usersync, resourcemanager, spark_client, spark_thriftserver, sqoop, tez_client, webhcat_server, yarn_client, zkfc, zookeeper_client, zookeeper_server
Since these are Hive jobs, there is a possibility that hive_server itself is issuing some sort of kill command. Not sure though - will check the hive logs as well.
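To start with, something like this against the HiveServer2 log should show whether it logged anything for that job around the time of the kill (the log path below is just the usual HDP default and may be different on our cluster):

grep -n "job_1479381018014_95355" /var/log/hive/hiveserver2.log
grep -n "FATAL" /var/log/hive/hiveserver2.log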
Created 11-30-2016 09:17 AM
I checked the hiveserver2 logs. Corresponding to every one of these MapReduce failures for Hive jobs, there was always a FATAL error in Hive before the kill signal was seen in the MapReduce logs. The stack trace for this error is attached in "hiveserver2-logs-error.txt", and the activity of the thread where it happened is available in "thread-logs.txt".
On further investigation, it was also found that the attached error can be traced back to an unresolved JIRA -
Please let me know your thoughts on this.
Created 11-30-2016 01:50 PM
@Garima Dosi - I see this WARN in the logs:
2016-11-28 16:25:51,029 WARN [Thread-2037242]: split.JobSplitWriter (JobSplitWriter.java:writeOldSplits(168)) - Max block location exceeded for split:
I am not sure if it would lead to the current outcome but it might be a clue.
Created 12-06-2016 06:36 AM
@clukasik The warning is not a problem. We increased the value of mapreduce.job.max.split.locations from 10 to the number of data nodes and that took care of the warning. The error mentioned previously still happens, though. I still think the issue is related to this JIRA -
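For anyone else hitting the same WARN, the property can also be bumped per job from the shell without touching mapred-site.xml, e.g. (39 is just our node count - use your own data node count; the query file name is only a placeholder):

hive --hiveconf mapreduce.job.max.split.locations=39 -f my_query.hql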
Created 11-28-2016 04:09 PM
It looks like this issue is hitting https://issues.apache.org/jira/browse/YARN-3238
Created 11-29-2016 05:11 AM
We are using Hadoop 2.7.1.2.4.2.0-258, where this bug appears to be already resolved. However, we are not sure whether this is the issue.
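For reference, the exact build string can be confirmed on any cluster node with:

hadoop version

which is where the 2.7.1.2.4.2.0-258 above comes from.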