
spark error after upgrade to cdh5.50


New Contributor

Hi, we have just upgraded our cluster to CDH 5.5.0. After the upgrade, our previously developed Spark applications cannot run; even the simplest word count raises errors. There are no errors or warnings in Cloudera Manager. The errors in the logs are as follows:

15/11/24 15:04:12 ERROR ErrorMonitor: AssociationError [akka.tcp://sparkDriver@10.0.0.200:50785] <- [akka.tcp://driverPropsFetcher@shgc03:55655]: Error [Shut down address: akka.tcp://driverPropsFetcher@shgc03:55655] [
akka.remote.ShutDownAssociation: Shut down address: akka.tcp://driverPropsFetcher@shgc03:55655
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down.
]
akka.event.Logging$Error$NoCause$
15/11/24 15:04:13 INFO YarnClientSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@shgc03:58430/user/Executor#-809057478]) with ID 1
15/11/24 15:04:13 INFO ExecutorAllocationManager: New executor 1 has registered (new total is 1)
15/11/24 15:04:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, shgc03, partition 0,NODE_LOCAL, 2220 bytes)
15/11/24 15:04:13 INFO BlockManagerMasterEndpoint: Registering block manager shgc03:43239 with 530.3 MB RAM, BlockManagerId(1, shgc03, 43239)
15/11/24 15:04:13 ERROR ErrorMonitor: AssociationError [akka.tcp://sparkDriver@10.0.0.200:50785] <- [akka.tcp://driverPropsFetcher@shgc02:55596]: Error [Shut down address: akka.tcp://driverPropsFetcher@shgc02:55596] [
akka.remote.ShutDownAssociation: Shut down address: akka.tcp://driverPropsFetcher@shgc02:55596
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down.
]
akka.event.Logging$Error$NoCause$
15/11/24 15:04:13 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on shgc03:43239 (size: 1859.0 B, free: 530.3 MB)
15/11/24 15:04:13 INFO YarnClientSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@shgc02:34693/user/Executor#-143872230]) with ID 2
15/11/24 15:04:13 INFO ExecutorAllocationManager: New executor 2 has registered (new total is 2)
15/11/24 15:04:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, shgc02, partition 1,NODE_LOCAL, 2220 bytes)
15/11/24 15:04:13 INFO BlockManagerMasterEndpoint: Registering block manager shgc02:36398 with 530.3 MB RAM, BlockManagerId(2, shgc02, 36398)
15/11/24 15:04:14 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on shgc03:43239 (size: 22.0 KB, free: 530.3 MB)
15/11/24 15:04:14 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on shgc02:36398 (size: 1859.0 B, free: 530.3 MB)
15/11/24 15:04:14 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on shgc02:36398 (size: 22.0 KB, free: 530.3 MB)
15/11/24 15:04:14 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1674 ms on shgc03 (1/2)
15/11/24 15:04:15 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 1534 ms on shgc02 (2/2)
15/11/24 15:04:15 INFO DAGScheduler: ResultStage 0 (count at App.scala:15) finished in 6.555 s
15/11/24 15:04:15 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/11/24 15:04:15 INFO DAGScheduler: Job 0 finished: count at App.scala:15, took 6.725293 s
15/11/24 15:04:15 INFO SparkContext: Starting job: count at App.scala:16
15/11/24 15:04:15 INFO DAGScheduler: Got job 1 (count at App.scala:16) with 2 output partitions
15/11/24 15:04:15 INFO DAGScheduler: Final stage: ResultStage 1(count at App.scala:16)
15/11/24 15:04:15 INFO DAGScheduler: Parents of final stage: List()
15/11/24 15:04:15 INFO DAGScheduler: Missing parents: List()
15/11/24 15:04:15 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[3] at filter at App.scala:16), which has no missing parents
15/11/24 15:04:15 INFO MemoryStore: ensureFreeSpace(3184) called with curMem=222036, maxMem=556038881
15/11/24 15:04:15 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.1 KB, free 530.1 MB)
15/11/24 15:04:15 INFO MemoryStore: ensureFreeSpace(1861) called with curMem=225220, maxMem=556038881
15/11/24 15:04:15 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1861.0 B, free 530.1 MB)
15/11/24 15:04:15 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.0.0.200:58136 (size: 1861.0 B, free: 530.3 MB)
15/11/24 15:04:15 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:861
15/11/24 15:04:15 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[3] at filter at App.scala:16)
15/11/24 15:04:15 INFO YarnScheduler: Adding task set 1.0 with 2 tasks
15/11/24 15:04:15 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, shgc02, partition 0,NODE_LOCAL, 2220 bytes)
15/11/24 15:04:15 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, shgc03, partition 1,NODE_LOCAL, 2220 bytes)
15/11/24 15:04:15 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on shgc02:36398 (size: 1861.0 B, free: 530.3 MB)
15/11/24 15:04:15 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on shgc03:43239 (size: 1861.0 B, free: 530.3 MB)
15/11/24 15:04:15 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 73 ms on shgc02 (1/2)
15/11/24 15:04:15 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 120 ms on shgc03 (2/2)
15/11/24 15:04:15 INFO DAGScheduler: ResultStage 1 (count at App.scala:16) finished in 0.122 s
15/11/24 15:04:15 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/11/24 15:04:15 INFO DAGScheduler: Job 1 finished: count at App.scala:16, took 0.147709 s
Lines with a: 4, Lines with b: 3
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
15/11/24 15:04:15 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
15/11/24 15:04:15 INFO SparkUI: Stopped Spark web UI at http://10.0.0.200:4041
15/11/24 15:04:15 INFO DAGScheduler: Stopping DAGScheduler
15/11/24 15:04:15 INFO YarnClientSchedulerBackend: Shutting down all executors
15/11/24 15:04:15 INFO YarnClientSchedulerBackend: Interrupting monitor thread
15/11/24 15:04:15 INFO YarnClientSchedulerBackend: Asking each executor to shut down
15/11/24 15:04:15 INFO YarnClientSchedulerBackend: Stopped
15/11/24 15:04:15 ERROR ErrorMonitor: AssociationError [akka.tcp://sparkDriver@10.0.0.200:50785] <- [akka.tcp://sparkExecutor@shgc03:58430]: Error [Shut down address: akka.tcp://sparkExecutor@shgc03:58430] [
akka.remote.ShutDownAssociation: Shut down address: akka.tcp://sparkExecutor@shgc03:58430
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down.
]
akka.event.Logging$Error$NoCause$
15/11/24 15:04:15 ERROR ErrorMonitor: AssociationError [akka.tcp://sparkDriver@10.0.0.200:50785] <- [akka.tcp://sparkExecutor@shgc02:34693]: Error [Shut down address: akka.tcp://sparkExecutor@shgc02:34693] [
akka.remote.ShutDownAssociation: Shut down address: akka.tcp://sparkExecutor@shgc02:34693
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down.
]
akka.event.Logging$Error$NoCause$
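
For reference, below is a minimal sketch of the kind of App.scala this log refers to. The object name, app name, and input path are assumptions; the two count() calls correspond to the "count at App.scala:15" and "count at App.scala:16" entries and the "Lines with a / Lines with b" output above.

import org.apache.spark.{SparkConf, SparkContext}

object App {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)

    // Any small text file works here; this path is an assumption.
    // Two partitions, matching the "2 output partitions" in the log.
    val logData = sc.textFile("hdfs:///tmp/README.md", 2).cache()

    // First count job ("count at App.scala:15" in the log)
    val numAs = logData.filter(line => line.contains("a")).count()
    // Second count job ("count at App.scala:16" in the log)
    val numBs = logData.filter(line => line.contains("b")).count()

    println(s"Lines with a: $numAs, Lines with b: $numBs")
    sc.stop()  // the AssociationError messages are logged around this shutdown
  }
}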

 

Could anyone kindly lend a hand?

7 REPLIES

Re: spark error after upgrade to cdh5.50

The example you're using is in fact completing. Note the lines:

15/11/24 15:04:15 INFO DAGScheduler: Job 1 finished: count at App.scala:16, took 0.147709 s
Lines with a: 4, Lines with b: 3

 

I've encountered the same InvalidAssociationException myself, but I believe it occurs after the SparkContext is stopped and shouldn't affect program execution.


Re: spark error after upgrade to cdh5.50

New Contributor

Thank you. But is there any way to get rid of this noisy bug? It happens even when using the spark-shell.

Re: spark error after upgrade to cdh5.50

New Contributor

We have a number of batch and streaming jobs in Spark.  All are having similar failures.  We are seeing the executor actors fail consistently when starting up a SparkContext and we are also seeing failures in the middle of execution.  Tests in spark-shell show this as well.

 

This is particularly challenging with our streaming processes. Spark will retry the association to the executor actors, but most of the time that fails as well. I have had to kick off the StreamProcessor around 10 times before it finally associates with the executor actors and starts processing.

 

There is clearly a problem with Spark after the upgrade to 5.5.0. I wonder if it's an issue that may have been fixed in Spark 1.5.2? Why doesn't Cloudera run the patched version of Spark?

Re: spark error after upgrade to cdh5.50

Master Collaborator

I've not seen this cause problems or actually result in failures; you may have a different problem that is specific to your environment. Why do you think it's fixed in 1.5.2? 5.5.0 ~= 1.5.2 anyway. Contact support if you have a support contract.

Re: spark error after upgrade to cdh5.50

New Contributor

This is helpful in suppressing the error messages. I may have been barking up the wrong tree.

 

http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_rn_spark_ki.htm...
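
For reference, the log4j tweak applied in a later reply in this thread looks like the following; presumably this is the workaround the linked known-issues page describes, although the exact set of loggers to silence, and whether it must be applied on the driver side, the executor side, or both, is not confirmed here.

# Suppress the shutdown-time AssociationError messages logged by the Akka ErrorMonitor
log4j.logger.org.apache.spark.rpc.akka.ErrorMonitor=FATAL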

 

Side note: my comment about Spark 1.5.2 was speculative, and I shouldn't have thrown it into this thread. But I still think Cloudera should include patch releases of upstream software. The Cloudera docs and the Maven repo all suggest 5.5.0 is Spark 1.5.0, even though 1.5.2 has been out for a while. With 5.4.x I used every patch release, and Spark was always 1.3.0 even though 1.3.1 came out very early in the 5.4.x lifecycle.

 

No need to reply to that since this thread is not focused on versioning.

Re: spark error after upgrade to cdh5.50

Master Collaborator

Oh, we certainly include patch releases! That's most of the value we provide. Read "1.5.0" as "1.5.x". Minor releases of CDH generally rebase on minor releases of upstream projects. You'll see if you look in the release notes that CDH 5.5 really includes "Spark 1.5.0 + patches" and then lists the exact patches. The difference will be the exact set of maintenance release patches. They'll all be patches from upstream, but it's conceivable that CDH gets a maintenance patch earlier than an upstream release (not the upstream patch). The correct way to designate it is "1.5.0 + patches", since you can't strictly call it "1.5.1" or "1.5.2", but it includes much of those maintenance releases, and CDH 5.5.x releases continue to add more 1.5.x maintenance patches.

Re: spark error after upgrade to cdh5.50

New Contributor

After I added log4j.logger.org.apache.spark.rpc.akka.ErrorMonitor=FATAL to the Spark log4j configuration file, reloaded the configuration, and restarted the cluster, it still didn't work. CDH version: 5.5.1.