Spark jobs failing

Amn_468 — Thu, 14 May 2020 04:42:55 GMT

Hello All,

We are running Spark jobs via yarn and its failing with the below error, any help / pointer to fix is much appericated.

Shell output: main : command provided 1 main : run as user is TEST1 main : requested yarn user is TEST1 Writing to tmp file /data/8/yarn/nm/nmPrivate/application_1587389136999_0013/container_e56_1587389136999_0013_01_000477/container_e56_1587389136999_0013_01_000477.pid.tmp Writing to cgroup task files... Container exited with a non-zero exit code 1 org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216) at scala.util.Try$.apply(Try.scala:192) at scala.util.Failure.recover(Try.scala:216) at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326) at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136) at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) at scala.concurrent.Promise$class.complete(Promise.scala:55) at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237) at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63) at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78) at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55) at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55) at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54) at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106) at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) at scala.concurrent.Promise$class.tryFailure(Promise.scala:112) at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153) at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205) at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds ... 8 more

Regards

Amn

Re: Spark jobs failing

Madhur — Thu, 14 May 2020 06:40:10 GMT

Hello @Amn_468 ,

To better assist you with this issue, it would be great if you could please help to provide the following additional information:
1) Is this issue occurring for all jobs or only some jobs? If the issue has only started recently, does this coincide with any code or configuration changes in the job itself or configuration changes in the cluster?

Re: Spark jobs failing

Amn_468 — Thu, 14 May 2020 08:34:09 GMT

Hi @Madhur

This is happening with all Spark jobs, there has been no changes in the code or cluster, also the failure is random.

Re: Spark jobs failing

Madhur — Thu, 14 May 2020 11:29:57 GMT

Hi @Amn_468 ,

Thank you for replying back.

Kindly try Increasing spark.rpc.askTimeout from default 120 seconds to a higher value in Ambari UI -> Spark Configs -> spark2-defaults. Recommendation is to increase it to at least 480 seconds and restart the necessary services.possibly the Driver and Executer are not able to get Heartbeat response in configured timeout. If you don’t want to do any cluster level change then you may try overriding this value in the job level.

For example: spark-submit by adding --conf spark.rpc.askTimeout=600s while submitting the job

Re: Spark jobs failing

Amn_468 — Thu, 14 May 2020 13:02:34 GMT

Hi@Madhur

Appreciate your assistance, I am using CM, where would this setting be in CM for making the changes in Cluster level, and to confirm these values have to be passed in seconds ?

Could you also provide steps / document outlining how to change this while submitting spark jobs.

question Spark jobs failing in Support Questions

Spark jobs failing

Re: Spark jobs failing

Re: Spark jobs failing

Re: Spark jobs failing

Re: Spark jobs failing