Spark FetchFailed Connection Exception slowing down writes to hive


Hi, we are running an HDP 3.1 production cluster with 2 master nodes and 54 data nodes. The Spark version is 2.3 and YARN is the cluster manager. Each data node has about 250 GB of RAM and 56 cores. We are using a combination of NiFi and Spark to set up a Hive DWH as follows:

  1. NiFi picks up input CSV files from the source and loads them into HDFS in Parquet format
  2. Spark jobs pick up these files and load them into Hive managed tables
  3. Another set of Spark jobs aggregates the data in the Hive managed tables into hourly and daily Hive managed tables

I noticed that my Spark jobs were sometimes very slow when writing to Hive. As a test, I ran the same spark-submit job several times with exactly the same parameters, and the run time varied from 10-13 minutes (fast) to 35-40 minutes (slow).


My spark-submit job:


sudo /usr/hdp/ --files /etc/hive/,/etc/hadoop/,/etc/hadoop/ --driver-class-path /usr/hdp/3.-78/spark2/jars/postgresql-42.2.5.jar,/usr/hdp/ --jars /usr/hdp/,/usr/hdp/ --class --master yarn --deploy-mode cluster --driver-memory 50G --driver-cores 40 --executor-memory 50G --num-executors 50 --executor-cores 10 --name spark_job_name --queue my_queue /home/nkimani/spark_job_name-1.0-SNAPSHOT.jar 2020-10-06 1 16 16



The YARN and Spark logs indicated that 2 data nodes (data nodes 05 and 06) were consistently throwing the following error:



{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"FetchFailed","Block Manager Address":{"Executor ID":"32","Host":"","Port":38650},"Shuffle ID":0,"Map ID":546,"Reduce ID":132,"Message":"org.apache.spark.shuffle.FetchFailedException: Connection from closed\n\tat\n\tat 
{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"FetchFailed","Block Manager Address":{"Executor ID":"27","Host":"","Port":45584},"Shuffle ID":0,"Map ID":213,"Reduce ID":77,"Message":"org.apache.spark.shuffle.FetchFailedException: Connection from closed\n\tat\n\tat 
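
To see whether the failures really cluster on a couple of nodes, the event log can be tallied by executor and host. Below is a minimal sketch, assuming the log is one JSON object per line with the field names shown in the events above. The function name `count_fetch_failures`, the helper `_event`, and the host names in the sample are hypothetical and only there for illustration.

```python
import json
from collections import Counter


def count_fetch_failures(lines):
    """Tally FetchFailed task-end events by (executor ID, host)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(event, dict):
            continue
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        reason = event.get("Task End Reason") or {}
        if reason.get("Reason") != "FetchFailed":
            continue
        addr = reason.get("Block Manager Address") or {}
        counts[(addr.get("Executor ID"), addr.get("Host"))] += 1
    return counts


def _event(executor_id, host, reason="FetchFailed"):
    """Build a hypothetical event line for the demo below."""
    end_reason = {"Reason": reason}
    if reason == "FetchFailed":
        end_reason["Block Manager Address"] = {
            "Executor ID": executor_id,
            "Host": host,
            "Port": 0,
        }
    return json.dumps({"Event": "SparkListenerTaskEnd", "Task End Reason": end_reason})


# Hypothetical sample data; the host names are placeholders, not real nodes.
sample_lines = [
    _event("32", "datanode-a.example"),
    _event("32", "datanode-a.example"),
    _event("27", "datanode-b.example"),
    _event("5", "datanode-c.example", reason="Success"),
    '{"Event":"SparkListenerTaskEnd","Task End Reason":{"Reason":"FetchFa',  # truncated line
]

for (executor_id, host), n in count_fetch_failures(sample_lines).most_common():
    print(executor_id, host, n)
```

Against a real event log, the same function can be fed an open file handle instead of `sample_lines`. If the same one or two hosts dominate the counts across several runs, that points to a node-specific problem rather than a job-level one.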

I have tried the following:

  1. Ran a day-long ping test against the 'faulty' data nodes: packet loss was 0%, which I think rules out network connectivity issues
  2. Checked CPU and memory usage on the 2 nodes: both were below the available capacity
  3. Ensured that the two nodes are time-synced to our FreeIPA server

I have run out of options, and any help would be appreciated. What puzzles me is why it is these specific nodes. I would also like to add that the HBase service (in Ambari) has been reporting connection errors to one of these nodes.





Answered my own question: two of the data nodes could not communicate with each other (over ssh) due to a network configuration issue.
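
A ping test run from a single machine does not exercise the TCP path between each pair of data nodes, which is where the problem turned out to be. A minimal sketch of a node-to-node check follows, assuming it is run on each data node against the others. The function names `can_connect` and `check_targets` are hypothetical, and the host names and port are whatever applies in your cluster (port 22 if you want to test ssh, as in this case).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_targets(targets, timeout=3.0):
    """Map each (host, port) pair to whether it is reachable from this machine."""
    return {(host, port): can_connect(host, port, timeout) for host, port in targets}
```

For example, calling `check_targets([("datanode-a.example", 22), ("datanode-b.example", 22)])` on every data node (with your own host names in place of these placeholders) gives a reachability matrix. A pair that fails in only one direction, or only between two particular nodes, is the kind of issue a ping test from elsewhere would miss.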
