RUNNING applications and incomplete applications seem to hang

Rising Star

As a Hadoop newbie, I have run into the following problem: several RUNNING applications in the YARN ResourceManager UI hang and never complete.

ID | User | Name | Application Type | Queue | Application Priority | StartTime | FinishTime | State | FinalStatus | Tracking UI
application_1474533507895_0024 | hive | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | SPARK | default | 0 | Fri Sep 23 11:37:55 +0800 2016 | N/A | RUNNING | UNDEFINED | ApplicationMaster
application_1474514425259_0011 | hive | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | SPARK | default | 0 | Thu Sep 22 11:25:54 +0800 2016 | N/A | RUNNING | UNDEFINED | ApplicationMaster
application_1474514425259_0010 | hive | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | SPARK | default | 0 | Thu Sep 22 11:25:54 +0800 2016 | N/A | RUNNING | UNDEFINED | ApplicationMaster
application_1474514425259_0009 | hive | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | SPARK | default | 0 | Thu Sep 22 11:25:54 +0800 2016 | N/A | RUNNING | UNDEFINED | ApplicationMaster
application_1474514425259_0006 | hive | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | SPARK | default | 0 | Thu Sep 22 11:25:53 +0800 2016 | N/A | RUNNING | UNDEFINED | ApplicationMaster

And in the Spark History Server UI:

1.6.2 History Server

  • Event log directory: hdfs:///spark-history

Showing 1-9 of 9 (Incomplete applications)

App ID | App Name | Started | Completed | Duration | Spark User | Last Updated
application_1474533507895_0024 | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | 2016/09/23 11:37:52 | - | - | hive |
application_1474514425259_0011 | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | 2016/09/22 11:25:51 | - | - | hive |
application_1474514425259_0010 | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | 2016/09/22 11:25:51 | - | - | hive |
application_1474514425259_0009 | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | 2016/09/22 11:25:50 | - | - | hive |
application_1474514425259_0006 | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | 2016/09/22 11:25:49 | - | - | hive |
application_1474466529970_0009 | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | 2016/09/21 22:15:33 | - | - | hive |
application_1474466529970_0008 | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | 2016/09/21 22:15:28 | - | - | hive |
application_1474466529970_0006 | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | 2016/09/21 22:15:28 | - | - | hive |
application_1474466529970_0007 | org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 | 2016/09/21 22:15:27 | - | - | hive |

What should I do with the running applications above? How can I find the root cause of this issue, and how do I kill and delete them?

Tons of thanks.

1 ACCEPTED SOLUTION

@Huahua Wei You need to stop the SparkContext sc explicitly by calling sc.stop(). In cluster settings, an application that never calls sc.stop() may hang. Just as you close files and network connections when you are done with them, it's a good idea to call sc.stop(), which lets the Spark master know that your application has finished consuming resources. If you don't call sc.stop(), the event log information used by the history server will be incomplete, and your application will not show up among the completed applications in the history server's UI.

7 REPLIES

Rising Star

I called sc.stop() in spark-shell, but it does not seem to help. The running applications are still there.

scala> sc.stop()
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static/sql,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/execution,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/SQL,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/09/26 15:13:17 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/09/26 15:13:17 INFO SparkUI: Stopped Spark web UI at http://202.1.2.138:4041
16/09/26 15:13:17 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/09/26 15:13:17 INFO MemoryStore: MemoryStore cleared
16/09/26 15:13:17 INFO BlockManager: BlockManager stopped
16/09/26 15:13:17 INFO BlockManagerMaster: BlockManagerMaster stopped
16/09/26 15:13:17 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/09/26 15:13:17 INFO SparkContext: Successfully stopped SparkContext

scala> 16/09/26 15:13:17 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/09/26 15:13:17 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/09/26 15:13:17 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
sc.stop()
16/09/26 15:15:57 INFO SparkContext: SparkContext already stopped.

scala> sc.stop()
16/09/26 15:18:36 INFO SparkContext: SparkContext already stopped.

scala>

Killing the application through YARN does work:

[root@insightcluster133 /]# yarn application -list
16/09/26 15:21:08 INFO impl.TimelineClientImpl: Timeline service address: http://insightcluster132.huawei.com:8188/ws/v1/timeline/
16/09/26 15:21:08 INFO client.RMProxy: Connecting to ResourceManager at insightcluster133.huawei.com/202.1.2.133:8050
16/09/26 15:21:08 INFO client.AHSProxy: Connecting to Application History server at insightcluster132.huawei.com/202.1.2.132:10200
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):3
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1474533507895_0024 org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 SPARK hive default RUNNING UNDEFINED 10% http://202.1.2.138:4040
application_1474514425259_0010 org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 SPARK hive default RUNNING UNDEFINED 10% http://202.1.2.130:4040
application_1474514425259_0009 org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 SPARK hive default RUNNING UNDEFINED 10% http://202.1.2.134:4040
[root@insightcluster133 /]# yarn application -kill application_1474514425259_0009
16/09/26 15:21:21 INFO impl.TimelineClientImpl: Timeline service address: http://insightcluster132.huawei.com:8188/ws/v1/timeline/
16/09/26 15:21:21 INFO client.RMProxy: Connecting to ResourceManager at insightcluster133.huawei.com/202.1.2.133:8050
16/09/26 15:21:22 INFO client.AHSProxy: Connecting to Application History server at insightcluster132.huawei.com/202.1.2.132:10200
Killing application application_1474514425259_0009
16/09/26 15:21:22 INFO impl.YarnClientImpl: Killed application application_1474514425259_0009
[root@insightcluster133 /]#
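When several stale applications share one name, the list-then-kill session above could be scripted. The following is only a rough sketch: the function name running_app_ids is made up for illustration, and it assumes the application name contains no spaces, so that the whitespace-separated columns of `yarn application -list` line up as ID, name, type, user, queue, state.

```shell
# Hypothetical helper: print the IDs of RUNNING applications whose name equals $1.
# It reads `yarn application -list` output on stdin, so it can be tried on saved output.
# Assumption: the application name has no spaces, so the state is the 6th column.
running_app_ids() {
  awk -v name="$1" '
    $1 ~ /^application_[0-9]+_[0-9]+$/ && $2 == name && $6 == "RUNNING" { print $1 }
  '
}

# Possible usage (dry run; swap the echo for `yarn application -kill "$id"`):
#   yarn application -list |
#     running_app_ids org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 |
#     while read -r id; do echo "would kill $id"; done
```

Running the dry run first makes it easy to confirm that only the intended applications are selected before anything is killed.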

But the killed applications still appear in the History Server UI.
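One possible explanation, offered as an assumption rather than something confirmed in this thread: Spark writes each application's event log with an .inprogress suffix and renames it only on a clean shutdown, so an application killed through YARN can leave that file behind in hdfs:///spark-history, and the History Server keeps listing it as incomplete. The sketch below only lists such leftover files; the function name inprogress_logs is made up for illustration, and it assumes paths without spaces so that the path is the last column of `hdfs dfs -ls` output.

```shell
# Hypothetical helper: print the paths of event logs that still end in .inprogress.
# It reads `hdfs dfs -ls` output on stdin.
# Assumption: paths contain no spaces, so the path is the last field of each line.
inprogress_logs() {
  awk '$NF ~ /\.inprogress$/ { print $NF }'
}

# Possible usage against the event log directory shown in the History Server UI:
#   hdfs dfs -ls hdfs:///spark-history | inprogress_logs
```

Before removing any listed file (for example with `hdfs dfs -rm`), it is worth checking that the application is really gone from `yarn application -list`. Whether the entry then disappears right away or only after the History Server rescans or restarts is not established in this thread.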

Super Guru

You can kill them in YARN as well if they are hung, but follow Tom's advice first: stop and clean up your jobs.

@Huahua Wei What version of Spark are you running? There is a JIRA for Spark 1.5.1 where the SparkContext stop method does not close HiveContexts.

Rising Star

Spark 1.6.x.2.5

Cloudera Employee

Hi,

We need to review the ResourceManager logs for any errors. We should also check the ResourceManager web UI for the per-queue resource and memory utilization of the submitted jobs.

Thanks

AKR