Created 09-24-2016 05:51 AM
As a Hadoop newbie, I have run into the following problem: I found several RUNNING applications in the YARN ResourceManager UI that hang and never complete.
ID | User | Name | Application Type | Queue | Application Priority | StartTime | FinishTime | State | FinalStatus | Tracking UI
---|---|---|---|---|---|---|---|---|---|---
application_1474533507895_0024 | hive | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | SPARK | default | 0 | Fri Sep 23 11:37:55 +0800 2016 | N/A | RUNNING | UNDEFINED | ApplicationMaster
application_1474514425259_0011 | hive | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | SPARK | default | 0 | Thu Sep 22 11:25:54 +0800 2016 | N/A | RUNNING | UNDEFINED | ApplicationMaster
application_1474514425259_0010 | hive | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | SPARK | default | 0 | Thu Sep 22 11:25:54 +0800 2016 | N/A | RUNNING | UNDEFINED | ApplicationMaster
application_1474514425259_0009 | hive | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | SPARK | default | 0 | Thu Sep 22 11:25:54 +0800 2016 | N/A | RUNNING | UNDEFINED | ApplicationMaster
application_1474514425259_0006 | hive | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | SPARK | default | 0 | Thu Sep 22 11:25:53 +0800 2016 | N/A | RUNNING | UNDEFINED | ApplicationMaster
And in the Spark History Server UI:
App ID | App Name | Started | Completed | Duration | Spark User | Last Updated |
---|---|---|---|---|---|---|
application_1474533507895_0024 | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | 2016/09/23 11:37:52 | - | - | hive | |
application_1474514425259_0011 | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | 2016/09/22 11:25:51 | - | - | hive | |
application_1474514425259_0010 | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | 2016/09/22 11:25:51 | - | - | hive | |
application_1474514425259_0009 | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | 2016/09/22 11:25:50 | - | - | hive | |
application_1474514425259_0006 | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | 2016/09/22 11:25:49 | - | - | hive | |
application_1474466529970_0009 | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | 2016/09/21 22:15:33 | - | - | hive | |
application_1474466529970_0008 | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | 2016/09/21 22:15:28 | - | - | hive | |
application_1474466529970_0006 | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | 2016/09/21 22:15:28 | - | - | hive | |
application_1474466529970_0007 | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | 2016/09/21 22:15:27 | - | - | hive | |
What can I do about these running applications? How can I find the root cause of this issue, and how can I kill them and remove them from the UI?
Many thanks.
Created 09-24-2016 02:42 PM
@Huahua Wei You need to explicitly stop the SparkContext sc by calling sc.stop(). In cluster settings, if you don't explicitly call sc.stop(), your application may hang. Just as you close files and network connections when you're done with them, it's a good idea to call sc.stop() when you're finished, which lets the Spark master know that your application is done consuming resources. If you don't call sc.stop(), the event-log information used by the history server will be incomplete, and your application will not show up in the history server's UI.
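Note that the hung applications in the listing above are Spark Thrift Server (HiveThriftServer2) instances rather than ad-hoc jobs, so there is no user code in which to call sc.stop(); the usual way to shut one down cleanly is the stop script bundled with Spark. A minimal sketch (the SPARK_HOME default below is an assumption based on a typical HDP layout; adjust it for your installation):

```shell
# SPARK_HOME default is an assumption (typical HDP layout); adjust it.
# stop-thriftserver.sh shuts the server's SparkContext down cleanly,
# so YARN can mark the application FINISHED.
SPARK_HOME=${SPARK_HOME:-/usr/hdp/current/spark-client}
"$SPARK_HOME/sbin/stop-thriftserver.sh" 2>/dev/null \
    || echo "no running Thrift Server found under $SPARK_HOME"
```

Run it as the same user that started the Thrift Server (here, hive), since the stop script looks for that user's PID file.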
Created 09-27-2016 06:40 AM
I called sc.stop() in spark-shell, but it doesn't seem to help; the running applications are still there.
scala> sc.stop()
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static/sql,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/09/26 15:13:17 INFO SparkUI: Stopped Spark web UI at http://202.1.2.138:4041
16/09/26 15:13:17 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/09/26 15:13:17 INFO MemoryStore: MemoryStore cleared
16/09/26 15:13:17 INFO BlockManager: BlockManager stopped
16/09/26 15:13:17 INFO BlockManagerMaster: BlockManagerMaster stopped
16/09/26 15:13:17 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/09/26 15:13:17 INFO SparkContext: Successfully stopped SparkContext
scala> 16/09/26 15:13:17 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/09/26 15:13:17 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/09/26 15:13:17 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
sc.stop()
16/09/26 15:15:57 INFO SparkContext: SparkContext already stopped.
scala> sc.stop()
16/09/26 15:18:36 INFO SparkContext: SparkContext already stopped.
scala>
A YARN kill, however, does work:
[root@insightcluster133 /]# yarn application -list
16/09/26 15:21:08 INFO impl.TimelineClientImpl: Timeline service address: http://insightcluster132.huawei.com:8188/ws/v1/timeline/
16/09/26 15:21:08 INFO client.RMProxy: Connecting to ResourceManager at insightcluster133.huawei.com/202.1.2.133:8050
16/09/26 15:21:08 INFO client.AHSProxy: Connecting to Application History server at insightcluster132.huawei.com/202.1.2.132:10200
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):3
Application-Id  Application-Name  Application-Type  User  Queue  State  Final-State  Progress  Tracking-URL
application_1474533507895_0024  org.apache.spark.sql.hive.thriftserver.HiveThriftServer2  SPARK  hive  default  RUNNING  UNDEFINED  10%  http://202.1.2.138:4040
application_1474514425259_0010  org.apache.spark.sql.hive.thriftserver.HiveThriftServer2  SPARK  hive  default  RUNNING  UNDEFINED  10%  http://202.1.2.130:4040
application_1474514425259_0009  org.apache.spark.sql.hive.thriftserver.HiveThriftServer2  SPARK  hive  default  RUNNING  UNDEFINED  10%  http://202.1.2.134:4040
[root@insightcluster133 /]# yarn application -kill application_1474514425259_0009
16/09/26 15:21:21 INFO impl.TimelineClientImpl: Timeline service address: http://insightcluster132.huawei.com:8188/ws/v1/timeline/
16/09/26 15:21:21 INFO client.RMProxy: Connecting to ResourceManager at insightcluster133.huawei.com/202.1.2.133:8050
16/09/26 15:21:22 INFO client.AHSProxy: Connecting to Application History server at insightcluster132.huawei.com/202.1.2.132:10200
Killing application application_1474514425259_0009
16/09/26 15:21:22 INFO impl.YarnClientImpl: Killed application application_1474514425259_0009
[root@insightcluster133 /]#
But the killed applications still show up in the History Server UI.
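A killed application never writes the final event-log marker, so the history server keeps listing it as incomplete. One cleanup approach, assuming event logs are kept in HDFS, is to remove the stale `.inprogress` event logs left behind (the /spark-history path below is an assumption; the real location is whatever spark.eventLog.dir is set to in spark-defaults.conf):

```shell
# Select event-log files that were never finalized; killed applications
# leave their logs with an ".inprogress" suffix, which the history
# server shows under "incomplete applications".
stale_logs() {
    # $1: newline-separated file paths
    printf '%s\n' "$1" | grep '\.inprogress$'
}

# /spark-history is an assumed path; check spark.eventLog.dir.
listing=$(hdfs dfs -ls /spark-history 2>/dev/null | awk '{print $NF}')
for f in $(stale_logs "$listing"); do
    hdfs dfs -rm "$f"   # only after confirming the application is dead
done
```

Only delete a log after the corresponding application has really been killed in YARN, or you lose its history for nothing.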
Created 09-24-2016 03:36 PM
You can kill them in YARN as well if they are hung, but follow Tom's advice first: stop and clean up your jobs.
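When several applications are hung at once, the kill can be scripted. A sketch, under the assumption that the hung applications are identifiable by name, as in the listing above:

```shell
# Print the application ID (first column) of every listed YARN
# application whose row matches the given name pattern.
matching_app_ids() {
    # $1: output of `yarn application -list`, $2: pattern to match
    printf '%s\n' "$1" \
        | awk -v pat="$2" '$1 ~ /^application_/ && $0 ~ pat { print $1 }'
}

# -appStates limits the listing to applications still holding resources.
listing=$(yarn application -list -appStates RUNNING 2>/dev/null)
for app_id in $(matching_app_ids "$listing" "HiveThriftServer2"); do
    yarn application -kill "$app_id"
done
```

Double-check the pattern before running the loop; anything whose row matches it gets killed.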
Created 10-01-2016 12:01 AM
@Huahua Wei What version of Spark are you running? There is a JIRA for Spark 1.5.1 where the SparkContext stop method does not close HiveContexts.
Created 10-09-2016 07:45 AM
Spark 1.6.x on HDP 2.5.
Created 09-03-2019 11:19 AM
Hi,
We need to review the ResourceManager logs to look for errors, if any. We also need to check the ResourceManager web UI for per-queue resource and memory utilization of the submitted jobs.
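As a quick first pass over the ResourceManager log (the log path below is an assumption; on many HDP installs the ResourceManager log lives under /var/log/hadoop-yarn/yarn):

```shell
# Count ERROR/FATAL lines in the given log text; a nonzero count is a
# cue to go read those entries in full.
count_errors() {
    printf '%s\n' "$1" | grep -cE 'ERROR|FATAL' || true
}

# Assumed log location; adjust for your installation.
log=$(cat /var/log/hadoop-yarn/yarn/yarn-yarn-resourcemanager-*.log 2>/dev/null)
count_errors "$log"
```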
Thanks
AKR