Spark FetchFailed connection exception slowing down writes to Hive

Hi, we are running an HDP 3.1 production cluster with 2 master nodes and 54 data nodes; the Spark version is 2.3 and YARN is the cluster manager. Each data node has about 250 GB RAM and 56 cores. We are using a combination of NiFi and Spark to set up a Hive DWH as follows:

  1. NiFi picks up input CSV files from the source and loads them to HDFS in parquet format
  2. Spark jobs pick up these files and load them into Hive managed tables (a minimal sketch of this step follows the list)
  3. Another set of Spark jobs aggregates the data in the Hive managed tables into hourly and daily Hive managed tables.
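
For context, the job in step 2 is roughly of the shape below. This is only a minimal sketch, not our actual code: the HDFS path and table name are made up, and the real write path may differ (e.g. for ACID managed tables).

// Minimal sketch of step 2; path and table name are hypothetical
import org.apache.spark.sql.{SaveMode, SparkSession}

object LoadToHiveSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("load_parquet_to_hive")
      .enableHiveSupport()   // so saveAsTable goes through the Hive metastore
      .getOrCreate()

    // Parquet files that NiFi dropped on HDFS (path is made up)
    val df = spark.read.parquet("/data/landing/events/dt=2020-10-06/")

    // Append into a Hive managed table (table name is made up)
    df.write.mode(SaveMode.Append).saveAsTable("dwh.events_raw")

    spark.stop()
  }
}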

I noticed that my Spark jobs were at times very slow when writing to Hive. I ran a test where I submitted the same spark-submit job several times with the exact same parameters, and it took anywhere from 10-13 minutes (fast) to 35-40 minutes (slow) to complete.

My spark-submit job:

sudo /usr/hdp/3.1.0.0-78/spark2/bin/spark-submit \
  --files /etc/hive/3.1.0.0-78/0/hive-site.xml,/etc/hadoop/3.1.0.0-78/0/core-site.xml,/etc/hadoop/3.1.0.0-78/0/hdfs-site.xml \
  --driver-class-path /usr/hdp/3.1.0.0-78/spark2/jars/postgresql-42.2.5.jar,/usr/hdp/3.1.0.0-78/spark2/jars/config-1.3.4.jar \
  --jars /usr/hdp/3.1.0.0-78/spark2/jars/postgresql-42.2.5.jar,/usr/hdp/3.1.0.0-78/spark2/jars/config-1.3.4.jar \
  --class my.domain.net.spark_job_name \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 50G \
  --driver-cores 40 \
  --executor-memory 50G \
  --num-executors 50 \
  --executor-cores 10 \
  --name spark_job_name \
  --queue my_queue \
  /home/nkimani/spark_job_name-1.0-SNAPSHOT.jar 2020-10-06 1 16 16
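
For completeness: apart from the resources shown above, the job does not override any shuffle or network settings, so those are at their Spark defaults. If relevant, they could be overridden either as --conf flags on spark-submit or in code, roughly like this (values purely illustrative, not something we currently run):

// Purely illustrative values; we have NOT changed these from their defaults.
// The same keys can be passed on spark-submit as --conf key=value.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark_job_name")
  .config("spark.network.timeout", "300s")      // default 120s: network/RPC timeouts
  .config("spark.shuffle.io.maxRetries", "6")   // default 3: retries for failed shuffle fetches
  .config("spark.shuffle.io.retryWait", "10s")  // default 5s: wait between those retries
  .enableHiveSupport()
  .getOrCreate()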

 

 

YARN and Spark logs indicated that two data nodes (data nodes 05 and 06) were consistently throwing the following error:

{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"FetchFailed","Block Manager Address":{"Executor ID":"32","Host":"svdt8c2r14-hdpdata06.my.domain.net","Port":38650},"Shuffle ID":0,"Map ID":546,"Reduce ID":132,"Message":"org.apache.spark.shuffle.FetchFailedException: Connection from svdt8c2r14-hdpdata06.my.domain.net/10.197.26.16:38650 closed\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528)\n\tat 
..
..
{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"FetchFailed","Block Manager Address":{"Executor ID":"27","Host":"svdt8c2r14-hdpdata05.my.domain.net","Port":45584},"Shuffle ID":0,"Map ID":213,"Reduce ID":77,"Message":"org.apache.spark.shuffle.FetchFailedException: Connection from svdt8c2r14-hdpdata05.my.domain.net/10.197.26.15:45584 closed\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528)\n\tat 
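
For reference, a rough sketch of how such FetchFailed events can be tallied per host from the Spark event logs (the event log path below is only a placeholder):

// Illustrative: count FetchFailed task-end events per host in one event log.
// The path is hypothetical; use whatever spark.eventLog.dir points at.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("eventlog_check").getOrCreate()
val events = spark.read.json("hdfs:///spark2-history/application_XXXXXXXXXXXXX_XXXX")

events
  .filter(col("Event") === "SparkListenerTaskEnd")
  .filter(col("Task End Reason").getField("Reason") === "FetchFailed")
  .groupBy(col("Task End Reason").getField("Block Manager Address").getField("Host").as("host"))
  .count()
  .show(false)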


I have tried the following:

  1. Ran a whole-day ping test to the 'faulty' data nodes -> packet drop was 0%, which I think rules out basic network connectivity issues (a port-level check along the same lines is sketched after this list)
  2. Checked CPU and memory usage for the 2 nodes; both were well below the available capacity
  3. Ensured that the two nodes are time-synced to our FreeIPA server
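
In case it helps anyone reproduce, a port-level reachability check alongside the ping test could look roughly like this (hosts/ports taken from the logs above; executor block-manager ports change per run, so this only makes sense while a job is running, and ideally run from each of the other data nodes):

// Illustrative only: TCP reachability check against the executor ports
// reported in the FetchFailed messages above.
import java.net.{InetSocketAddress, Socket}
import scala.util.{Failure, Success, Try}

object PortCheck {
  def main(args: Array[String]): Unit = {
    val targets = Seq(
      ("svdt8c2r14-hdpdata05.my.domain.net", 45584),
      ("svdt8c2r14-hdpdata06.my.domain.net", 38650)
    )
    targets.foreach { case (host, port) =>
      Try {
        val socket = new Socket()
        socket.connect(new InetSocketAddress(host, port), 5000) // 5 second timeout
        socket.close()
      } match {
        case Success(_)  => println(s"OK    $host:$port is reachable")
        case Failure(ex) => println(s"FAIL  $host:$port -> ${ex.getMessage}")
      }
    }
  }
}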

I have run out of options and any help would be appreciated. What puzzles me is why these specific nodes are affected. I would also like to add that the HBase service (on Ambari) has also been reporting connection errors to one of these nodes.

Thanks,

Kim

1 ACCEPTED SOLUTION

Answered my own question. Two of the data nodes could not communicate (ssh) with each other due to a network config issue.
