
Spark jobs failing

Rising Star

Hello All,

 

We are running Spark jobs via YARN and they are failing with the error below. Any help or pointers on fixing this would be much appreciated.

 

Shell output: main : command provided 1
main : run as user is TEST1
main : requested yarn user is TEST1
Writing to tmp file /data/8/yarn/nm/nmPrivate/application_1587389136999_0013/container_e56_1587389136999_0013_01_000477/container_e56_1587389136999_0013_01_000477.pid.tmp
Writing to cgroup task files...


Container exited with a non-zero exit code 1

org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
	at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
	at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
	at scala.util.Try$.apply(Try.scala:192)
	at scala.util.Failure.recover(Try.scala:216)
	at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
	at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
	at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
	at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
	at scala.concurrent.Promise$class.complete(Promise.scala:55)
	at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
	at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
	at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
	at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
	at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
	at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
	at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
	at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
	at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
	at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
	at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205)
	at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
	... 8 more

 

Regards

Amn

1 ACCEPTED SOLUTION

Moderator

Hi @Amn_468 ,

 

Thank you for getting back to us.

 

Kindly try increasing spark.rpc.askTimeout from the default of 120 seconds to a higher value in Ambari UI -> Spark Configs -> spark2-defaults. The recommendation is to increase it to at least 480 seconds and then restart the necessary services. Most likely the driver and executors are not able to exchange heartbeat responses within the configured timeout. If you don't want to make a cluster-level change, you can instead override this value at the job level.
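
For the cluster-level change, the resulting property would look something like the line below (a sketch only; the property name is standard, but exactly where you add it can vary by Ambari/Spark version):

spark.rpc.askTimeout    480s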

 

For example, with spark-submit you can add --conf spark.rpc.askTimeout=600s when submitting the job.
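
A minimal sketch of a full submit command with this override (the main class and jar name are placeholders, not taken from your job):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.rpc.askTimeout=600s \
  --class com.example.MyJob \
  my-job.jar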

 

 


Madhuri Adipudi, Technical Solutions Manager


View solution in original post

4 REPLIES

Moderator

Hello @Amn_468 ,

 

To better assist you with this issue, could you please provide the following additional information:
1) Is this issue occurring for all jobs or only some jobs? If it has only started recently, does it coincide with any code or configuration changes in the job itself, or with configuration changes in the cluster?

 

 


Madhuri Adipudi, Technical Solutions Manager


Rising Star

Hi @Madhur 

 

This is happening with all Spark jobs. There have been no changes in the code or the cluster, and the failures are random.

 

 

Moderator

Hi @Amn_468 ,

 

Thank you for getting back to us.

 

Kindly try increasing spark.rpc.askTimeout from the default of 120 seconds to a higher value in Ambari UI -> Spark Configs -> spark2-defaults. The recommendation is to increase it to at least 480 seconds and then restart the necessary services. Most likely the driver and executors are not able to exchange heartbeat responses within the configured timeout. If you don't want to make a cluster-level change, you can instead override this value at the job level.

 

For example, with spark-submit you can add --conf spark.rpc.askTimeout=600s when submitting the job.

 

 


Madhuri Adipudi, Technical Solutions Manager


Rising Star

Hi @Madhur,

 

I appreciate your assistance. I am using Cloudera Manager (CM); where would this setting be in CM for making the change at the cluster level? And to confirm, do these values have to be passed in seconds?

 

Could you also provide the steps, or a document, outlining how to change this while submitting Spark jobs?