Member since 11-28-2016
7 Posts, 0 Kudos Received, 0 Solutions
01-08-2017
11:18 AM
@clukasik, @Divakar Annapureddy This issue turned out to be related to this JIRA: https://issues.apache.org/jira/browse/HIVE-11681

We have Hive jobs that use the same aux jars and "add jar" statements, and these jobs execute concurrently. Apparently, when one of the jobs completes, the loader is closed while the other job is still using it, hence the KILL signal.

We solved it by removing the "add jar" statements from the concurrent Hive jobs and instead adding the auxiliary jars to hive.aux.jars.path at HS2 start. This way the jars are loaded only once and are not removed/closed/unloaded per session, so concurrent Hive jobs have no problem accessing them.

To make the jars available to HS2, we copied them to each machine that hosts HS2 and then, through Ambari, added the path to HIVE_AUX_JARS_PATH in the hive-env template script, only for the hiveserver component:

    if [ "$SERVICE" = "hiveserver2" ]; then
      CUSTOM_AUX_JARS_PATH=/<path_to_aux_lib_dir>
      if [ -d "$CUSTOM_AUX_JARS_PATH" ]; then
        CUSTOM_AUX_JARS=`echo $CUSTOM_AUX_JARS_PATH/*.jar | sed 's/ /,/g'`
        export HIVE_AUX_JARS_PATH="$HIVE_AUX_JARS_PATH,$CUSTOM_AUX_JARS"
      fi
    fi
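One caveat with building the list through `echo .../*.jar | sed 's/ /,/g'`: if the directory contains no jars, the unexpanded `*.jar` pattern ends up in the list, and any space inside a path is turned into a comma. A minimal sketch of a more defensive join, assuming a POSIX shell. `join_jars` is a hypothetical helper name, not something provided by Hive or Ambari.

```shell
# Hypothetical helper: print the *.jar files in a directory as one
# comma-separated string. Prints an empty line if there are no jars.
join_jars() {
  jj_dir=$1
  jj_list=""
  for jj_jar in "$jj_dir"/*.jar; do
    # When nothing matches, the loop sees the literal pattern; skip it.
    [ -e "$jj_jar" ] || continue
    jj_list="${jj_list:+$jj_list,}$jj_jar"
  done
  printf '%s\n' "$jj_list"
}

# Possible use inside the hiveserver2 branch of hive-env
# (CUSTOM_AUX_JARS_PATH is the same placeholder directory as above):
#   CUSTOM_AUX_JARS=$(join_jars "$CUSTOM_AUX_JARS_PATH")
#   if [ -n "$CUSTOM_AUX_JARS" ]; then
#     export HIVE_AUX_JARS_PATH="${HIVE_AUX_JARS_PATH:+$HIVE_AUX_JARS_PATH,}$CUSTOM_AUX_JARS"
#   fi
```

The `${VAR:+...}` expansions avoid a leading comma when HIVE_AUX_JARS_PATH is empty to begin with.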
12-06-2016
06:36 AM
@clukasik The warning is not a problem. We increased the value of mapreduce.job.max.split.locations from 10 to the number of data nodes, and that took care of it. The error mentioned previously still happens. I still think the issue is related to this JIRA: https://issues.apache.org/jira/browse/HIVE-11681
11-30-2016
09:17 AM
@clukasik I checked the hiveserver logs. For all of these MapReduce failures of Hive jobs, there was always a FATAL error in Hive before the kill signal was seen in the MapReduce logs. The stack trace for this error is attached in "hiveserver2-logs-error.txt", and the activity of the thread where it happened is in "thread-logs.txt". On further investigation, the attached error can be traced back to an unresolved JIRA:
http://osdir.com/ml/general/2015-08/msg39276.html
https://issues.apache.org/jira/browse/HIVE-11681
Please let me know your thoughts on this.
11-30-2016
06:50 AM
@clukasik I am sure that no one issued the yarn kill command. Moreover, 172.23.35.6 is a machine where only master services are running, and access to it is restricted. The services on this machine are: namenode, hive_server, hive_metastore, journalnode, kerberos_client, mapreduce2_client, falcon_client, hbase_client, hbase_master, hcat, hdfs_client, hive_client, mahout, metrics_monitor, oozie_client, phoenix_query_server, pig, ranger_admin, ranger_usersync, resourcemanager, spark_client, spark_thriftserver, sqoop, tez_client, webhcat_server, yarn_client, zkfc, zookeeper_client, zookeeper_server. Since these are Hive jobs, it is possible that hive_server itself is issuing some sort of kill command. I am not sure, though; I will check the Hive logs as well.
11-29-2016
05:11 AM
We are using Hadoop 2.7.1.2.4.2.0-258, where this bug seems to be already resolved. However, I am not sure whether this is the issue.
11-29-2016
05:05 AM
Attaching the YARN logs for that application: app-logs-17399.txt
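For anyone who wants to pull the same kind of aggregated logs: the application ID can be derived from a container ID of the form seen in the NodeManager log (for example container_e40_1479381018014_95355_01_000002) and then passed to `yarn logs -applicationId`. A minimal sketch, assuming a POSIX shell and `sed`. `app_id_from_container` is a hypothetical helper name, and the output file name in the usage comment is only an illustration.

```shell
# Hypothetical helper: map a YARN container ID to its application ID.
# container_e40_1479381018014_95355_01_000002 -> application_1479381018014_95355
# The optional "e<NN>_" epoch prefix is dropped; the cluster timestamp and
# application sequence number are kept.
app_id_from_container() {
  printf '%s\n' "$1" |
    sed 's/^container_\(e[0-9]*_\)\{0,1\}\([0-9]*\)_\([0-9]*\)_.*/application_\2_\3/'
}

# Possible use on a cluster node with the yarn client installed:
#   yarn logs -applicationId "$(app_id_from_container "$CONTAINER_ID")" > app-logs.txt
```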
11-28-2016
03:28 PM
We have a 39-node HDP cluster and have been observing a peculiar phenomenon: some MapReduce/Hive jobs are allotted containers, transition to the RUNNING state, and then get killed automatically. Here are the logs from a NodeManager:
2016-11-28 16:26:02,521 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000002 transitioned from LOCALIZED to RUNNING
2016-11-28 16:26:02,613 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000004
2016-11-28 16:26:02,613 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000002
2016-11-28 16:26:02,613 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000003
2016-11-28 16:26:02,613 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:02,646 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19599 for container-id container_e40_1479381018014_95355_01_000004: 24.6 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:02,681 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19597 for container-id container_e40_1479381018014_95355_01_000005: 30.0 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:02,717 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19600 for container-id container_e40_1479381018014_95355_01_000002: 39.3 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:02,760 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 19598 for container-id container_e40_1479381018014_95355_01_000003: 50.3 MB of 4.5 GB physical memory used; 4.4 GB of 9.4 GB virtual memory used
2016-11-28 16:26:03,698 INFO ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,699 INFO ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,700 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,700 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:03,701 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest IP=172.23.35.43 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1479381018014_95355 CONTAINERID=container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:03,701 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000005 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,701 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(371)) - Cleaning up container container_e40_1479381018014_95355_01_000005
2016-11-28 16:26:03,702 INFO ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,702 INFO ipc.Server (Server.java:saslProcess(1441)) - Auth successful for appattempt_1479381018014_95355_000001 (auth:SIMPLE)
2016-11-28 16:26:03,702 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,703 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000004
2016-11-28 16:26:03,703 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest IP=172.23.35.43 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1479381018014_95355 CONTAINERID=container_e40_1479381018014_95355_01_000004
2016-11-28 16:26:03,703 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,704 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000002
2016-11-28 16:26:03,704 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest IP=172.23.35.43 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1479381018014_95355 CONTAINERID=container_e40_1479381018014_95355_01_000002
2016-11-28 16:26:03,704 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for appattempt_1479381018014_95355_000001 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
2016-11-28 16:26:03,704 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(966)) - Stopping container with container Id: container_e40_1479381018014_95355_01_000003
2016-11-28 16:26:03,704 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=bdsa_ingest IP=172.23.35.43 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1479381018014_95355 CONTAINERID=container_e40_1479381018014_95355_01_000003
2016-11-28 16:26:03,706 WARN nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:launchContainer(381)) - Exit code from container container_e40_1479381018014_95355_01_000005 is : 143
2016-11-28 16:26:03,716 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000004 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,716 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000002 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,716 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000003 transitioned from RUNNING to KILLING
2016-11-28 16:26:03,716 INFO container.ContainerImpl (ContainerImpl.java:handle(1136)) - Container container_e40_1479381018014_95355_01_000005 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2016-11-28 16:26:03,716 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(371)) - Cleaning up container container_e40_1479381018014_95355_01_000004
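Two things in the log above are worth extracting mechanically: who sent each "Stop Container Request" (the NMAuditLogger lines carry the caller's IP), and what exit code 143 means. By the usual shell convention, an exit code above 128 is 128 plus the signal number, so 143 corresponds to signal 15 (SIGTERM), which fits a container being stopped on request rather than crashing. A minimal sketch, assuming a POSIX shell with `grep` and `sed`. Both function names are hypothetical helpers, and the log path in the usage note is a placeholder.

```shell
# Hypothetical helper: print "IP CONTAINERID" for every
# "Stop Container Request" audit line in a NodeManager log file.
stop_requests() {
  grep 'OPERATION=Stop Container Request' "$1" |
    sed -n 's/.*IP=\([^[:space:]]*\).*CONTAINERID=\([^[:space:]]*\).*/\1 \2/p'
}

# Hypothetical helper: turn an exit code above 128 into the signal number
# behind it; prints 0 for exit codes that do not encode a signal.
signal_from_exit() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  else
    echo 0
  fi
}
```

For example, `stop_requests /path/to/nodemanager.log | cut -d' ' -f1 | sort | uniq -c` would count stop requests per calling IP. In the excerpt above, all four requests come from 172.23.35.43.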
Labels: Apache YARN