Reply
Explorer
Posts: 10
Registered: ‎11-21-2017

Spark streaming job on YARN cluster mode stuck in accepted, then fails with a Timeout Exception

[ Edited ]

I am running a spark streaming application that simply read messages from a Kafka topic, enrich them and then write the enriched messages in another kafka topic. I already tried it in Standalone mode (both client and cluster deploy mode) and in YARN client mode, successfully. When I submit the application in cluster mode it gives me the following messages:

18/01/10 12:13:34 INFO Client: Submitting application application_1515582681419_0001 to ResourceManager
18/01/10 12:13:34 INFO YarnClientImpl: Submitted application application_1515582681419_0001
18/01/10 12:13:35 INFO Client: Application report for application_1515582681419_0001 (state: ACCEPTED)
18/01/10 12:13:35 INFO Client: 
     client token: N/A
     diagnostics: AM container is launched, waiting for AM container to Register with RM
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1515582814080
     final status: UNDEFINED
     tracking URL: http://ambari1.internal:8088/proxy/application_1515582681419_0001/
     user: root
18/01/10 12:13:36 INFO Client: Application report for application_1515582681419_0001 (state: ACCEPTED)
18/01/10 12:13:37 INFO Client: Application report for application_1515582681419_0001 (state: ACCEPTED)

And keeps stuck in ACCEPTED Status until after around 4-5 minutes, exit with the following error message:

18/01/10 12:17:00 INFO InputInfoTracker: remove old batch metadata: 1515583000000 ms
18/01/10 12:17:02 ERROR ApplicationMaster: Uncaught exception:
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:423)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:282)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:768)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:766)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
18/01/10 12:17:02 INFO ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds])
18/01/10 12:17:02 INFO StreamingContext: Invoking stop(stopGracefully=false) from shutdown hook
18/01/10 12:17:02 INFO ReceiverTracker: ReceiverTracker stopped
18/01/10 12:17:02 INFO JobGenerator: Stopping JobGenerator immediately 

Funny fact: If I visit the age of the application, I can see that the Spark Context has been started and it processes some messages.

Could anyone help me on this? PS: These are the resources of my YARN cluster: 

Screen Shot 2018-01-10 at 16.07.37.png

Champion
Posts: 617
Registered: ‎05-16-2016

Re: Spark streaming job on YARN cluster mode stuck in accepted, then fails with a Timeout Exception

Could you share the resourcemanager logs and nodemanager logs to dig more 

Highlighted
Contributor
Posts: 52
Registered: ‎01-08-2016

Re: Spark streaming job on YARN cluster mode stuck in accepted, then fails with a Timeout Exception

In addition to csguna's request for rm log, can you please share are you using command line to submit the job? If yes then please share the property file.

Explorer
Posts: 10
Registered: ‎11-21-2017

Re: Spark streaming job on YARN cluster mode stuck in accepted, then fails with a Timeout Exception

Thank you guys for replying. 

 

I wanted to edit the question in order to give more logs there. But I don't think I can... So... here you are:

 

 

hadoop-mapreduce.jobsummary.log 

2018-01-10 12:11:23,189 INFO resourcemanager.RMAppManager$ApplicationSummary: appId=application_1515514117634_0006,name=iit.cnr.it.socialpipeline.NormalizeJSON,user=root,queue=default,state=FAILED,trackingUrl=http://ambari1.internal:8088/proxy/application_1515514117634_0006/,appMasterHost=N/A,startTime=1515521579503,finishTime=1515521786904,finalStatus=FAILED,memorySeconds=421688,vcoreSeconds=205,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0>,applicationType=SPARK
2018-01-10 12:11:23,189 INFO resourcemanager.RMAppManager$ApplicationSummary: appId=application_1515521947770_0001,name=iit.cnr.it.socialpipeline.NormalizeJSON,user=root,queue=default,state=FAILED,trackingUrl=http://ambari1.internal:8088/proxy/application_1515521947770_0001/,appMasterHost=N/A,startTime=1515522042260,finishTime=1515522252527,finalStatus=FAILED,memorySeconds=320665,vcoreSeconds=208,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0>,applicationType=SPARK
2018-01-10 12:11:23,190 INFO resourcemanager.RMAppManager$ApplicationSummary: appId=application_1515521947770_0002,name=iit.cnr.it.socialpipeline.NormalizeJSON,user=root,queue=default,state=FAILED,trackingUrl=http://ambari1.internal:8088/proxy/application_1515521947770_0002/,appMasterHost=N/A,startTime=1515581922574,finishTime=1515582130269,finalStatus=FAILED,memorySeconds=316332,vcoreSeconds=205,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0>,applicationType=SPARK
2018-01-10 12:17:02,638 INFO resourcemanager.RMAppManager$ApplicationSummary: appId=application_1515582681419_0001,name=iit.cnr.it.socialpipeline.NormalizeJSON,user=root,queue=default,state=FAILED,trackingUrl=http://ambari1.internal:8088/cluster/app/application_1515582681419_0001,appMasterHost=N/A,startTime=1515582814080,finishTime=1515583022633,finalStatus=FAILED,memorySeconds=317834,vcoreSeconds=206,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0>,applicationType=SPARK
2018-01-10 14:32:07,537 INFO resourcemanager.RMAppManager$ApplicationSummary: appId=application_1515582681419_0002,name=iit.cnr.it.socialpipeline.NormalizeJSON,user=root,queue=default,state=FAILED,trackingUrl=http://ambari1.internal:8088/cluster/app/application_1515582681419_0002,appMasterHost=N/A,startTime=1515590479857,finishTime=1515591127527,finalStatus=FAILED,memorySeconds=992297,vcoreSeconds=645,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0>,applicationType=SPARK
2018-01-10 15:00:56,548 INFO resourcemanager.RMAppManager$ApplicationSummary: appId=application_1515582681419_0003,name=iit.cnr.it.socialpipeline.NormalizeJSON,user=root,queue=default,state=FAILED,trackingUrl=http://ambari1.internal:8088/cluster/app/application_1515582681419_0003,appMasterHost=N/A,startTime=1515592208420,finishTime=1515592856545,finalStatus=FAILED,memorySeconds=1654602,vcoreSeconds=645,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0>,applicationType=SPARK

 

 

rm-audit.log

Container exited with a non-zero exit code 10. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
f time 1515583020000 ms
18/01/10 12:17:00 INFO JobScheduler: Total delay: 0.013 s for time 1515583020000 ms (execution: 0.009 s)
18/01/10 12:17:00 INFO KafkaRDD: Removing RDD 8 from persistence list
18/01/10 12:17:00 INFO BlockManager: Removing RDD 8
18/01/10 12:17:00 INFO ReceivedBlockTracker: Deleting batches: 
18/01/10 12:17:00 INFO InputInfoTracker: remove old batch metadata: 1515583000000 ms
18/01/10 12:17:02 ERROR ApplicationMaster: Uncaught exception: 
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
	at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:423)
	at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:282)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:768)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
	at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:766)
	at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
18/01/10 12:17:02 INFO ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds])
18/01/10 12:17:02 INFO StreamingContext: Invoking stop(stopGracefully=false) from shutdown hook
18/01/10 12:17:02 INFO ReceiverTracker: ReceiverTracker stopped
18/01/10 12:17:02 INFO JobGenerator: Stopping JobGenerator immediately
18/01/10 12:17:02 INFO RecurringTimer: Stopped timer for JobGenerator after time 1515583020000
18/01/10 12:17:02 INFO JobGenerator: Stopped JobGenerator
18/01/10 12:17:02 INFO JobScheduler: Stopped JobScheduler
18/01/10 12:17:02 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@6fd4adf6{/streaming,null,UNAVAILABLE,@Spark}
18/01/10 12:17:02 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@4db63574{/streaming/batch,null,UNAVAILABLE,@Spark}
18/01/10 12:17:02 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@5aa67d0f{/static/streaming,null,UNAVAILABLE,@Spark}
18/01/10 12:17:02 INFO StreamingContext: StreamingContext stopped successfully
18/01/10 12:17:02 INFO SparkContext: Invoking stop() from shutdown hook
18/01/10 12:17:02 INFO AbstractConnector: Stopped Spark@39c6ad48{HTTP/1.1,[http/1.1]}{0.0.0.0:0}
18/01/10 12:17:02 INFO SparkUI: Stopped Spark web UI at http://192.168.100.25:34356
18/01/10 12:17:02 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/01/10 12:17:02 INFO MemoryStore: MemoryStore cleared
18/01/10 12:17:02 INFO BlockManager: BlockManager stopped
18/01/10 12:17:02 INFO BlockManagerMaster: BlockManagerMaster stopped
18/01/10 12:17:02 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/01/10 12:17:02 INFO SparkContext: Successfully stopped SparkContext
18/01/10 12:17:02 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds])
18/01/10 12:17:02 INFO ApplicationMaster: Deleting staging directory hdfs://ambari1.internal:8020/user/root/.sparkStaging/application_1515582681419_0001
18/01/10 12:17:02 INFO ShutdownHookManager: Shutdown hook called
18/01/10 12:17:02 INFO ShutdownHookManager: Deleting directory /hadoop/yarn/local/usercache/root/appcache/application_1515582681419_0001/spark-58754dbc-fb87-4868-a4df-164f898d3ff5

Failing this attempt. Failing the application.	APPID=application_1515582681419_0001
2018-01-10 15:25:31,389 WARN resourcemanager.RMAuditLogger: USER=root	OPERATION=Application Finished - Failed	TARGET=RMAppManager	RESULT=FAILURE	DESCRIPTION=App failed with state: FAILED	PERMISSIONS=Application application_1515582681419_0002 failed 2 times due to AM Container for appattempt_1515582681419_0002_000002 exited with  exitCode: 10
For more detailed output, check the application tracking page: http://ambari1.internal:8088/cluster/app/application_1515582681419_0002 Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e05_1515582681419_0002_02_000001
Exit code: 10

 

Explorer
Posts: 10
Registered: ‎11-21-2017

Re: Spark streaming job on YARN cluster mode stuck in accepted, then fails with a Timeout Exception

2018-01-10 12:13:32,405 INFO  resourcemanager.ClientRMService (ClientRMService.java:getNewApplicationId(297)) - Allocated new applicationId: 1
2018-01-10 12:13:34,080 WARN  rmapp.RMAppImpl (RMAppImpl.java:<init>(423)) - The specific max attempts: 0 for application: 1 is invalid, because it is out of the range [1, 2]. Use the global max attempts instead.
2018-01-10 12:13:34,081 INFO  rmapp.RMAppImpl (RMAppImpl.java:transition(1105)) - Storing application with id application_1515582681419_0001
2018-01-10 12:13:34,082 INFO  resourcemanager.ClientRMService (ClientRMService.java:submitApplication(615)) - Application with id 1 submitted by user root
2018-01-10 12:13:34,084 INFO  rmapp.RMAppImpl (RMAppImpl.java:handle(779)) - application_1515582681419_0001 State change from NEW to NEW_SAVING
2018-01-10 12:13:34,084 INFO  recovery.RMStateStore (RMStateStore.java:transition(199)) - Storing info for app: application_1515582681419_0001
2018-01-10 12:13:34,095 INFO  rmapp.RMAppImpl (RMAppImpl.java:handle(779)) - application_1515582681419_0001 State change from NEW_SAVING to SUBMITTED
2018-01-10 12:13:34,096 INFO  capacity.ParentQueue (ParentQueue.java:addApplication(360)) - Application added - appId: application_1515582681419_0001 user: root leaf-queue of parent: root #applications: 1
2018-01-10 12:13:34,097 INFO  capacity.CapacityScheduler (CapacityScheduler.java:addApplication(744)) - Accepted application application_1515582681419_0001 from user: root, in queue: default
2018-01-10 12:13:34,097 INFO  rmapp.RMAppImpl (RMAppImpl.java:handle(779)) - application_1515582681419_0001 State change from SUBMITTED to ACCEPTED
2018-01-10 12:13:34,119 INFO  resourcemanager.ApplicationMasterService (ApplicationMasterService.java:registerAppAttempt(679)) - Registering app attempt : appattempt_1515582681419_0001_000001
2018-01-10 12:13:34,120 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(809)) - appattempt_1515582681419_0001_000001 State change from NEW to SUBMITTED
2018-01-10 12:13:34,131 INFO  capacity.LeafQueue (LeafQueue.java:activateApplications(720)) - Application application_1515582681419_0001 from user: root activated in queue: default
2018-01-10 12:13:34,131 INFO  capacity.LeafQueue (LeafQueue.java:addApplicationAttempt(744)) - Application added - appId: application_1515582681419_0001 user: root, leaf-queue: default #user-pending-applications: 0 #user-active-applications: 1 #queue-pending-applications: 0 #queue-active-applications: 1
2018-01-10 12:13:34,132 INFO  capacity.CapacityScheduler (CapacityScheduler.java:addApplicationAttempt(773)) - Added Application Attempt appattempt_1515582681419_0001_000001 to scheduler from user root in queue default
2018-01-10 12:13:34,133 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(809)) - appattempt_1515582681419_0001_000001 State change from SUBMITTED to SCHEDULED
2018-01-10 12:13:34,701 INFO  rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(422)) - container_e05_1515582681419_0001_01_000001 Container Transitioned from NEW to ALLOCATED
2018-01-10 12:13:34,702 INFO  scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(152)) - Assigned container container_e05_1515582681419_0001_01_000001 of capacity <memory:1536, vCores:1> on host ambari2.internal:45454, which has 1 containers, <memory:1536, vCores:1> used and <memory:10496, vCores:3> available after allocation
2018-01-10 12:13:34,702 INFO  allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(89)) - assignedContainer application attempt=appattempt_1515582681419_0001_000001 container=container_e05_1515582681419_0001_01_000001 queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@6d756235 clusterResource=<memory:12032, vCores:4> type=OFF_SWITCH
2018-01-10 12:13:34,716 INFO  security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:createAndGetNMToken(200)) - Sending NMToken for nodeId : ambari2.internal:45454 for container : container_e05_1515582681419_0001_01_000001
2018-01-10 12:13:34,724 INFO  rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(422)) - container_e05_1515582681419_0001_01_000001 Container Transitioned from ALLOCATED to ACQUIRED
2018-01-10 12:13:34,724 INFO  capacity.ParentQueue (ParentQueue.java:assignContainers(475)) - assignedContainer queue=root usedCapacity=0.12765957 absoluteUsedCapacity=0.12765957 used=<memory:1536, vCores:1> cluster=<memory:12032, vCores:4>
2018-01-10 12:13:34,725 INFO  security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:clearNodeSetForAttempt(146)) - Clear node set for appattempt_1515582681419_0001_000001
2018-01-10 12:13:34,725 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:storeAttempt(1955)) - Storing attempt: AppId: application_1515582681419_0001 AttemptId: appattempt_1515582681419_0001_000001 MasterContainer: Container: [ContainerId: container_e05_1515582681419_0001_01_000001, NodeId: ambari2.internal:45454, NodeHttpAddress: ambari2.internal:8042, Resource: <memory:1536, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.100.25:45454 }, ]
2018-01-10 12:13:34,728 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(809)) - appattempt_1515582681419_0001_000001 State change from SCHEDULED to ALLOCATED_SAVING
2018-01-10 12:13:34,731 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(809)) - appattempt_1515582681419_0001_000001 State change from ALLOCATED_SAVING to ALLOCATED
2018-01-10 12:13:34,751 INFO  amlauncher.AMLauncher (AMLauncher.java:run(266)) - Launching masterappattempt_1515582681419_0001_000001
2018-01-10 12:13:34,770 INFO  amlauncher.AMLauncher (AMLauncher.java:launch(111)) - Setting up container Container: [ContainerId: container_e05_1515582681419_0001_01_000001, NodeId: ambari2.internal:45454, NodeHttpAddress: ambari2.internal:8042, Resource: <memory:1536, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.100.25:45454 }, ] for AM appattempt_1515582681419_0001_000001
2018-01-10 12:13:34,771 INFO  security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createAndGetAMRMToken(195)) - Create AMRMToken for ApplicationAttempt: appattempt_1515582681419_0001_000001
2018-01-10 12:13:34,774 INFO  security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createPassword(307)) - Creating password for appattempt_1515582681419_0001_000001
2018-01-10 12:13:34,985 INFO  amlauncher.AMLauncher (AMLauncher.java:launch(132)) - Done launching container Container: [ContainerId: container_e05_1515582681419_0001_01_000001, NodeId: ambari2.internal:45454, NodeHttpAddress: ambari2.internal:8042, Resource: <memory:1536, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.100.25:45454 }, ] for AM appattempt_1515582681419_0001_000001
2018-01-10 12:13:34,986 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(809)) - appattempt_1515582681419_0001_000001 State change from ALLOCATED to LAUNCHED
2018-01-10 12:13:35,701 INFO  rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(422)) - container_e05_1515582681419_0001_01_000001 Container Transitioned from ACQUIRED to RUNNING
2018-01-10 12:15:18,860 INFO  rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(422)) - container_e05_1515582681419_0001_01_000001 Container Transitioned from RUNNING to COMPLETED
2018-01-10 12:15:18,860 INFO  scheduler.SchedulerNode (SchedulerNode.java:releaseContainer(220)) - Released container container_e05_1515582681419_0001_01_000001 of capacity <memory:1536, vCores:1> on host ambari2.internal:45454, which currently has 0 containers, <memory:0, vCores:0> used and <memory:12032, vCores:4> available, release resources=true
2018-01-10 12:15:18,861 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:rememberTargetTransitionsAndStoreState(1209)) - Updating application attempt appattempt_1515582681419_0001_000001 with final state: FAILED, and exit status: 10
2018-01-10 12:15:18,862 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(809)) - appattempt_1515582681419_0001_000001 State change from LAUNCHED to FINAL_SAVING
2018-01-10 12:15:18,862 INFO  integration.RMRegistryOperationsService (RMRegistryOperationsService.java:onContainerFinished(144)) - Container container_e05_1515582681419_0001_01_000001 finished, purging container-level records
2018-01-10 12:15:18,862 INFO  integration.RMRegistryOperationsService (RMRegistryOperationsService.java:purgeRecordsAsync(198)) -  records under / with ID container_e05_1515582681419_0001_01_000001 and policy container: {}
2018-01-10 12:15:18,870 INFO  resourcemanager.ApplicationMasterService (ApplicationMasterService.java:unregisterAttempt(685)) - Unregistering app attempt : appattempt_1515582681419_0001_000001
2018-01-10 12:15:18,871 INFO  security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:applicationMasterFinished(124)) - Application finished, removing password for appattempt_1515582681419_0001_000001
2018-01-10 12:15:18,872 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(809)) - appattempt_1515582681419_0001_000001 State change from FINAL_SAVING to FAILED
2018-01-10 12:15:18,872 INFO  rmapp.RMAppImpl (RMAppImpl.java:transition(1331)) - The number of failed attempts is 1. The max attempts is 2
2018-01-10 12:15:18,873 INFO  capacity.CapacityScheduler (CapacityScheduler.java:doneApplicationAttempt(812)) - Application Attempt appattempt_1515582681419_0001_000001 is done. finalState=FAILED
2018-01-10 12:15:18,873 INFO  resourcemanager.ApplicationMasterService (ApplicationMasterService.java:registerAppAttempt(679)) - Registering app attempt : appattempt_1515582681419_0001_000002
2018-01-10 12:15:18,873 INFO  scheduler.AppSchedulingInfo (AppSchedulingInfo.java:clearRequests(124)) - Application application_1515582681419_0001 requests cleared
2018-01-10 12:15:18,873 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(809)) - appattempt_1515582681419_0001_000002 State change from NEW to SUBMITTED
2018-01-10 12:15:18,873 INFO  capacity.LeafQueue (LeafQueue.java:removeApplicationAttempt(795)) - Application removed - appId: application_1515582681419_0001 user: root queue: default #user-pending-applications: 0 #user-active-applications: 0 #queue-pending-applications: 0 #queue-active-applications: 0
2018-01-10 12:15:18,873 INFO  capacity.LeafQueue (LeafQueue.java:activateApplications(720)) - Application application_1515582681419_0001 from user: root activated in queue: default
2018-01-10 12:15:18,873 INFO  capacity.LeafQueue (LeafQueue.java:addApplicationAttempt(744)) - Application added - appId: application_1515582681419_0001 user: root, leaf-queue: default #user-pending-applications: 0 #user-active-applications: 1 #queue-pending-applications: 0 #queue-active-applications: 1
2018-01-10 12:15:18,873 INFO  capacity.CapacityScheduler (CapacityScheduler.java:addApplicationAttempt(773)) - Added Application Attempt appattempt_1515582681419_0001_000002 to scheduler from user root in queue default
2018-01-10 12:15:18,877 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(809)) - appattempt_1515582681419_0001_000002 State change from SUBMITTED to SCHEDULED
2018-01-10 12:15:19,860 INFO  rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(422)) - container_e05_1515582681419_0001_02_000001 Container Transitioned from NEW to ALLOCATED
2018-01-10 12:15:19,860 INFO  scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(152)) - Assigned container container_e05_1515582681419_0001_02_000001 of capacity <memory:1536, vCores:1> on host ambari2.internal:45454, which has 1 containers, <memory:1536, vCores:1> used and <memory:10496, vCores:3> available after allocation
2018-01-10 12:15:19,861 INFO  allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(89)) - assignedContainer application attempt=appattempt_1515582681419_0001_000002 container=container_e05_1515582681419_0001_02_000001 queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@2cf52214 clusterResource=<memory:12032, vCores:4> type=OFF_SWITCH
2018-01-10 12:15:19,862 INFO  security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:createAndGetNMToken(200)) - Sending NMToken for nodeId : ambari2.internal:45454 for container : container_e05_1515582681419_0001_02_000001
2018-01-10 12:15:19,862 INFO  rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(422)) - container_e05_1515582681419_0001_02_000001 Container Transitioned from ALLOCATED to ACQUIRED
2018-01-10 12:15:19,863 INFO  security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:clearNodeSetForAttempt(146)) - Clear node set for appattempt_1515582681419_0001_000002
2018-01-10 12:15:19,863 INFO  capacity.ParentQueue (ParentQueue.java:assignContainers(475)) - assignedContainer queue=root usedCapacity=0.12765957 absoluteUsedCapacity=0.12765957 used=<memory:1536, vCores:1> cluster=<memory:12032, vCores:4>
2018-01-10 12:15:19,863 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:storeAttempt(1955)) - Storing attempt: AppId: application_1515582681419_0001 AttemptId: appattempt_1515582681419_0001_000002 MasterContainer: Container: [ContainerId: container_e05_1515582681419_0001_02_000001, NodeId: ambari2.internal:45454, NodeHttpAddress: ambari2.internal:8042, Resource: <memory:1536, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.100.25:45454 }, ]
2018-01-10 12:15:19,863 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(809)) - appattempt_1515582681419_0001_000002 State change from SCHEDULED to ALLOCATED_SAVING
2018-01-10 12:15:19,867 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(809)) - appattempt_1515582681419_0001_000002 State change from ALLOCATED_SAVING to ALLOCATED
2018-01-10 12:15:19,867 INFO  amlauncher.AMLauncher (AMLauncher.java:run(266)) - Launching masterappattempt_1515582681419_0001_000002
2018-01-10 12:15:19,870 INFO  amlauncher.AMLauncher (AMLauncher.java:launch(111)) - Setting up container Container: [ContainerId: container_e05_1515582681419_0001_02_000001, NodeId: ambari2.internal:45454, NodeHttpAddress: ambari2.internal:8042, Resource: <memory:1536, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.100.25:45454 }, ] for AM appattempt_1515582681419_0001_000002
2018-01-10 12:15:19,870 INFO  security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createAndGetAMRMToken(195)) - Create AMRMToken for ApplicationAttempt: appattempt_1515582681419_0001_000002
2018-01-10 12:15:19,870 INFO  security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:createPassword(307)) - Creating password for appattempt_1515582681419_0001_000002
2018-01-10 12:15:19,887 INFO  amlauncher.AMLauncher (AMLauncher.java:launch(132)) - Done launching container Container: [ContainerId: container_e05_1515582681419_0001_02_000001, NodeId: ambari2.internal:45454, NodeHttpAddress: ambari2.internal:8042, Resource: <memory:1536, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 192.168.100.25:45454 }, ] for AM appattempt_1515582681419_0001_000002
2018-01-10 12:15:19,887 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(809)) - appattempt_1515582681419_0001_000002 State change from ALLOCATED to LAUNCHED
2018-01-10 12:15:20,862 INFO  rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(422)) - container_e05_1515582681419_0001_02_000001 Container Transitioned from ACQUIRED to RUNNING
2018-01-10 12:17:02,626 INFO  rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(422)) - container_e05_1515582681419_0001_02_000001 Container Transitioned from RUNNING to COMPLETED
2018-01-10 12:17:02,626 INFO  scheduler.SchedulerNode (SchedulerNode.java:releaseContainer(220)) - Released container container_e05_1515582681419_0001_02_000001 of capacity <memory:1536, vCores:1> on host ambari2.internal:45454, which currently has 0 containers, <memory:0, vCores:0> used and <memory:12032, vCores:4> available, release resources=true
2018-01-10 12:17:02,627 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:rememberTargetTransitionsAndStoreState(1209)) - Updating application attempt appattempt_1515582681419_0001_000002 with final state: FAILED, and exit status: 10
2018-01-10 12:17:02,627 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(809)) - appattempt_1515582681419_0001_000002 State change from LAUNCHED to FINAL_SAVING
2018-01-10 12:17:02,627 INFO  integration.RMRegistryOperationsService (RMRegistryOperationsService.java:onContainerFinished(144)) - Container container_e05_1515582681419_0001_02_000001 finished, purging container-level records
2018-01-10 12:17:02,627 INFO  integration.RMRegistryOperationsService (RMRegistryOperationsService.java:purgeRecordsAsync(198)) -  records under / with ID container_e05_1515582681419_0001_02_000001 and policy container: {}
2018-01-10 12:17:02,632 INFO  resourcemanager.ApplicationMasterService (ApplicationMasterService.java:unregisterAttempt(685)) - Unregistering app attempt : appattempt_1515582681419_0001_000002
2018-01-10 12:17:02,632 INFO  security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:applicationMasterFinished(124)) - Application finished, removing password for appattempt_1515582681419_0001_000002
2018-01-10 12:17:02,632 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(809)) - appattempt_1515582681419_0001_000002 State change from FINAL_SAVING to FAILED
2018-01-10 12:17:02,632 INFO  rmapp.RMAppImpl (RMAppImpl.java:transition(1331)) - The number of failed attempts is 2. The max attempts is 2
2018-01-10 12:17:02,633 INFO  rmapp.RMAppImpl (RMAppImpl.java:rememberTargetTransitionsAndStoreState(1124)) - Updating application application_1515582681419_0001 with final state: FAILED
2018-01-10 12:17:02,634 INFO  rmapp.RMAppImpl (RMAppImpl.java:handle(779)) - application_1515582681419_0001 State change from ACCEPTED to FINAL_SAVING
2018-01-10 12:17:02,634 INFO  recovery.RMStateStore (RMStateStore.java:transition(228)) - Updating info for app: application_1515582681419_0001
2018-01-10 12:17:02,634 INFO  capacity.CapacityScheduler (CapacityScheduler.java:doneApplicationAttempt(812)) - Application Attempt appattempt_1515582681419_0001_000002 is done. finalState=FAILED
2018-01-10 12:17:02,634 INFO  scheduler.AppSchedulingInfo (AppSchedulingInfo.java:clearRequests(124)) - Application application_1515582681419_0001 requests cleared
2018-01-10 12:17:02,634 INFO  capacity.LeafQueue (LeafQueue.java:removeApplicationAttempt(795)) - Application removed - appId: application_1515582681419_0001 user: root queue: default #user-pending-applications: 0 #user-active-applications: 0 #queue-pending-applications: 0 #queue-active-applications: 0
2018-01-10 12:17:02,636 INFO  rmapp.RMAppImpl (RMAppImpl.java:transition(1064)) - Application application_1515582681419_0001 failed 2 times due to AM Container for appattempt_1515582681419_0001_000002 exited with  exitCode: 10
For more detailed output, check the application tracking page: http://ambari1.internal:8088/cluster/app/application_1515582681419_0001 Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e05_1515582681419_0001_02_000001
Exit code: 10

Container exited with a non-zero exit code 10. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
f time 1515583020000 ms
18/01/10 12:17:00 INFO JobScheduler: Total delay: 0.013 s for time 1515583020000 ms (execution: 0.009 s)
18/01/10 12:17:00 INFO KafkaRDD: Removing RDD 8 from persistence list
18/01/10 12:17:00 INFO BlockManager: Removing RDD 8
18/01/10 12:17:00 INFO ReceivedBlockTracker: Deleting batches: 
18/01/10 12:17:00 INFO InputInfoTracker: remove old batch metadata: 1515583000000 ms
18/01/10 12:17:02 ERROR ApplicationMaster: Uncaught exception: 
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
	at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:423)
	at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:282)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:768)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
	at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:766)
	at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
18/01/10 12:17:02 INFO ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds])
18/01/10 12:17:02 INFO StreamingContext: Invoking stop(stopGracefully=false) from shutdown hook
18/01/10 12:17:02 INFO ReceiverTracker: ReceiverTracker stopped
18/01/10 12:17:02 INFO JobGenerator: Stopping JobGenerator immediately
18/01/10 12:17:02 INFO RecurringTimer: Stopped timer for JobGenerator after time 1515583020000
18/01/10 12:17:02 INFO JobGenerator: Stopped JobGenerator
18/01/10 12:17:02 INFO JobScheduler: Stopped JobScheduler
18/01/10 12:17:02 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@6fd4adf6{/streaming,null,UNAVAILABLE,@Spark}
18/01/10 12:17:02 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@4db63574{/streaming/batch,null,UNAVAILABLE,@Spark}
18/01/10 12:17:02 INFO ContextHandler: Stopped o.s.j.s.ServletContextHandler@5aa67d0f{/static/streaming,null,UNAVAILABLE,@Spark}
18/01/10 12:17:02 INFO StreamingContext: StreamingContext stopped successfully
18/01/10 12:17:02 INFO SparkContext: Invoking stop() from shutdown hook
18/01/10 12:17:02 INFO AbstractConnector: Stopped Spark@39c6ad48{HTTP/1.1,[http/1.1]}{0.0.0.0:0}
18/01/10 12:17:02 INFO SparkUI: Stopped Spark web UI at http://192.168.100.25:34356
18/01/10 12:17:02 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/01/10 12:17:02 INFO MemoryStore: MemoryStore cleared
18/01/10 12:17:02 INFO BlockManager: BlockManager stopped
18/01/10 12:17:02 INFO BlockManagerMaster: BlockManagerMaster stopped
18/01/10 12:17:02 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/01/10 12:17:02 INFO SparkContext: Successfully stopped SparkContext
18/01/10 12:17:02 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds])
18/01/10 12:17:02 INFO ApplicationMaster: Deleting staging directory hdfs://ambari1.internal:8020/user/root/.sparkStaging/application_1515582681419_0001
18/01/10 12:17:02 INFO ShutdownHookManager: Shutdown hook called
18/01/10 12:17:02 INFO ShutdownHookManager: Deleting directory /hadoop/yarn/local/usercache/root/appcache/application_1515582681419_0001/spark-58754dbc-fb87-4868-a4df-164f898d3ff5

Failing this attempt. Failing the application.
2018-01-10 12:17:02,637 INFO  rmapp.RMAppImpl (RMAppImpl.java:handle(779)) - application_1515582681419_0001 State change from FINAL_SAVING to FAILED
2018-01-10 12:17:02,637 INFO  capacity.ParentQueue (ParentQueue.java:removeApplication(385)) - Application removed - appId: application_1515582681419_0001 user: root leaf-queue of parent: root #applications: 0
2018-01-10 12:17:02,638 INFO  integration.RMRegistryOperationsService (RMRegistryOperationsService.java:onApplicationCompleted(119)) - Application application_1515582681419_0001 completed, purging application-level records
2018-01-10 12:17:02,638 INFO  integration.RMRegistryOperationsService (RMRegistryOperationsService.java:purgeRecordsAsync(198)) -  records under / with ID application_1515582681419_0001 and policy application: {}
2018-01-10 12:21:21,034 INFO  scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:run(479)) - Release request cache is cleaned up
2018-01-10 13:21:20,999 INFO  availability.MetricSinkWriteShardHostnameHashingStrategy (MetricSinkWriteShardHostnameHashingStrategy.java:findCollectorShard(42)) - Calculated collector shard ambari1.internal based on hostname: ambari1.internal
2018-01-10 14:21:18,164 INFO  resourcemanager.ClientRMService (ClientRMService.java:getNewApplicationId(297)) - Allocated new applicationId: 2
2018-01-10 14:21:19,857 WARN  rmapp.RMAppImpl (RMAppImpl.java:<init>(423)) - The specific max attempts: 0 for application: 2 is invalid, because it is out of the range [1, 2]. Use the global max attempts instead.
2018-01-10 14:21:19,857 INFO  resourcemanager.ClientRMService (ClientRMService.java:submitApplication(615)) - Application with id 2 submitted by user root
2018-01-10 14:21:19,857 INFO  rmapp.RMAppImpl (RMAppImpl.java:transition(1105)) - Storing application with id application_1515582681419_0002
2018-01-10 14:21:19,858 INFO  rmapp.RMAppImpl (RMAppImpl.java:handle(779)) - application_1515582681419_0002 State change from NEW to NEW_SAVING
2018-01-10 14:21:19,858 INFO  recovery.RMStateStore (RMStateStore.java:transition(199)) - Storing info for app: application_1515582681419_0002
2018-01-10 14:21:19,863 INFO  rmapp.RMAppImpl (RMAppImpl.java:handle(779)) - application_1515582681419_0002 State change from NEW_SAVING to SUBMITTED
2018-01-10 14:21:19,863 INFO  capacity.ParentQueue (ParentQueue.java:addApplication(360)) - Application added - appId: application_1515582681419_0002 user: root leaf-queue of parent: root #applications: 1
2018-01-10 14:21:19,863 INFO  capacity.CapacityScheduler (CapacityScheduler.java:addApplication(744)) - Accepted application application_1515582681419_0002 from user: root, in queue: default

 

resourcemanager.log 

Explorer
Posts: 10
Registered: ‎11-21-2017

Re: Spark streaming job on YARN cluster mode stuck in accepted, then fails with a Timeout Exception

nodemanager.log 
2018-01-10 15:32:29,309 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(499)) - Memory usage of ProcessTree 15190 for container-id container_e07_1515594329635_0001_01_000001: 513.4 MB of 2.5 GB physical memory used; 3.7 GB of 5.3 GB virtual memory used 2018-01-10 15:32:29,553 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(249)) - Exit code from container container_e07_1515594329635_0001_01_000001 is : 10 2018-01-10 15:32:29,554 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(255)) - Exception from container-launch with container ID: container_e07_1515594329635_0001_01_000001 and exit code: 10 ExitCodeException exitCode=10: at org.apache.hadoop.util.Shell.runCommand(Shell.java:944) at org.apache.hadoop.util.Shell.run(Shell.java:848) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1142) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:237) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:326) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:87) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2018-01-10 15:32:29,556 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(368)) - Exception from container-launch. 2018-01-10 15:32:29,557 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(368)) - Container id: container_e07_1515594329635_0001_01_000001 2018-01-10 15:32:29,557 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(368)) - Exit code: 10 2018-01-10 15:32:29,557 WARN launcher.ContainerLaunch (ContainerLaunch.java:handleContainerExitWithFailure(431)) - Container launch failed : Container exited with a non-zero exit code 10. 2018-01-10 15:32:29,563 INFO container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e07_1515594329635_0001_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE 2018-01-10 15:32:29,564 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(541)) - Cleaning up container container_e07_1515594329635_0001_01_000001 2018-01-10 15:32:29,588 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(492)) - Deleting absolute path : /hadoop/yarn/local/usercache/root/appcache/application_1515594329635_0001/container_e07_1515594329635_0001_01_000001 2018-01-10 15:32:29,595 INFO container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e07_1515594329635_0001_01_000001 transitioned from EXITED_WITH_FAILURE to DONE
Explorer
Posts: 10
Registered: ‎11-21-2017

Re: Spark streaming job on YARN cluster mode stuck in accepted, then fails with a Timeout Exception

spark-submit command

CONF_FILE=hdfs://ambari1.internal:8020/user/hadoop/local.normalizer.conf

${SPARK_DIR}/bin/spark-submit --conf spark.yarn.am.waitTime=110s --class mypackage.NormalizeJSON --master yarn --deploy-mode cluster  ${WORK_DIR}/target/${JAR_VERSION} ${CONF_FILE}
Announcements