
Spark shuffle is failing with connection exception when dynamic allocation is enabled

Contributor

Hi,

When dynamic allocation is enabled, we frequently run into failures while fetching shuffle blocks:

RetryingBlockFetcher: Retrying fetch (1/3) for 1 outstanding blocks after 5000ms

ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks (after 1 retries)

java.io.IOException: Failed to connect to <host>:<some port>

Caused by: java.net.ConnectException: Connection refused: <host>:<some port>

We see these errors continuously in the executors when we run large Spark jobs. While they occur, nothing is processed; after some time the errors disappear and processing resumes. This is impacting our job SLAs. Can anyone help me with this?
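For context, a common cause of this pattern (an assumption about your setup, not a confirmed diagnosis): under dynamic allocation, executors that are released take their shuffle files with them, and other executors then get "Connection refused" when they try to fetch those blocks. The usual mitigation is to enable the external shuffle service so shuffle output outlives the executor. A minimal sketch of the relevant spark-submit settings (values are placeholders to tune):

```shell
# Sketch only: enable the external shuffle service alongside dynamic allocation
# so shuffle files survive executor removal. On YARN this also requires the
# shuffle service to be running on each NodeManager.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.shuffle.io.maxRetries=10 \
  --conf spark.shuffle.io.retryWait=30s \
  ...
```

Raising spark.shuffle.io.maxRetries and spark.shuffle.io.retryWait does not fix the root cause, but it gives fetches more time to succeed while executors churn.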

3 REPLIES

New Contributor

Hi Cloudera,

 

Can someone please help with this issue?

I'm also facing this issue in our production environment, and it is impacting our SLAs.

Contributor

Hello,

Can you try increasing the following parameters?

--conf spark.executor.memory=XXX

Also try increasing the number of executors.

Also, see the doc below for tuning your Spark jobs.

https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-2/
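To make the suggestion above concrete, a hypothetical spark-submit with those resource settings might look like this (all values are placeholders to tune for your workload and cluster):

```shell
# Placeholder values: tune executor memory, instance count, and cores
# for your cluster; these are not recommended defaults.
spark-submit \
  --conf spark.executor.memory=8g \
  --conf spark.executor.instances=20 \
  --conf spark.executor.cores=4 \
  ...
```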

Explorer

Try running the command with "--deploy-mode cluster" added.

It should work; this seems to be a known bug:

https://support.oracle.com/knowledge/Oracle%20Database%20Products/2498643_1.html
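For what it's worth, a hypothetical invocation with that flag added (the application class and jar name below are placeholders; only --deploy-mode cluster is the suggested change):

```shell
# Placeholder application details; the driver runs inside the cluster
# instead of on the submitting machine.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MySparkApp \
  my-spark-app.jar
```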