Spark FetchFailed Connection Exception slowing down writes to hive
Labels: Apache Spark
Created 12-31-2020 02:02 AM
Hi, we are running an HDP 3.1 production cluster with 2 master nodes and 54 data nodes; the Spark version is 2.3 and YARN is the cluster manager. Each data node has about 250 GB of RAM and 56 cores. We are using a combination of NiFi and Spark to set up a Hive DWH as follows:
- NiFi picks up input CSV files from the source and lands them on HDFS in Parquet format
- Spark jobs pick these files up and load them into Hive managed tables (a rough sketch of this step is shown after this list)
- Another set of Spark jobs aggregates the data in the Hive managed tables into hourly and daily Hive managed tables.
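For reference, the load step looks roughly like the sketch below; the class, path, argument, and table names are placeholders rather than the actual job:

```scala
import org.apache.spark.sql.SparkSession

object LoadToHive {
  def main(args: Array[String]): Unit = {
    // The four arguments passed on spark-submit (date plus partition numbers); names are placeholders.
    val Array(loadDate, hour, fromId, toId) = args

    val spark = SparkSession.builder()
      .appName("spark_job_name")
      .enableHiveSupport()               // picks up the hive-site.xml shipped via --files
      .getOrCreate()

    // Read the Parquet files NiFi landed on HDFS (path is a placeholder).
    val df = spark.read.parquet(s"/data/landing/$loadDate")

    // Append into an existing Hive table; columns must line up with the table definition.
    // Note: on HDP 3.x, writes to managed (ACID) Hive tables from Spark normally go through
    // the Hive Warehouse Connector rather than the plain Hive support shown here.
    df.write.mode("append").insertInto("dwh.raw_events")

    spark.stop()
  }
}
```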
I noticed that my Spark jobs were at times very slow when writing to Hive. I ran a test in which I submitted the same spark-submit job several times with the exact same parameters; the runs took anywhere from 10-13 minutes (fast) to 35-40 minutes (slow).
My spark-submit command:

```
sudo /usr/hdp/3.1.0.0-78/spark2/bin/spark-submit \
  --files /etc/hive/3.1.0.0-78/0/hive-site.xml,/etc/hadoop/3.1.0.0-78/0/core-site.xml,/etc/hadoop/3.1.0.0-78/0/hdfs-site.xml \
  --driver-class-path /usr/hdp/3.1.0.0-78/spark2/jars/postgresql-42.2.5.jar,/usr/hdp/3.1.0.0-78/spark2/jars/config-1.3.4.jar \
  --jars /usr/hdp/3.1.0.0-78/spark2/jars/postgresql-42.2.5.jar,/usr/hdp/3.1.0.0-78/spark2/jars/config-1.3.4.jar \
  --class my.domain.net.spark_job_name \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 50G \
  --driver-cores 40 \
  --executor-memory 50G \
  --num-executors 50 \
  --executor-cores 10 \
  --name spark_job_name \
  --queue my_queue \
  /home/nkimani/spark_job_name-1.0-SNAPSHOT.jar 2020-10-06 1 16 16
```
YARN and Spark logs indicated that two data nodes (data nodes 05 and 06) were consistently throwing the following error:
{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"FetchFailed","Block Manager Address":{"Executor ID":"32","Host":"svdt8c2r14-hdpdata06.my.domain.net","Port":38650},"Shuffle ID":0,"Map ID":546,"Reduce ID":132,"Message":"org.apache.spark.shuffle.FetchFailedException: Connection from svdt8c2r14-hdpdata06.my.domain.net/10.197.26.16:38650 closed\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528)\n\tat
..
..
{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"FetchFailed","Block Manager Address":{"Executor ID":"27","Host":"svdt8c2r14-hdpdata05.my.domain.net","Port":45584},"Shuffle ID":0,"Map ID":213,"Reduce ID":77,"Message":"org.apache.spark.shuffle.FetchFailedException: Connection from svdt8c2r14-hdpdata05.my.domain.net/10.197.26.15:45584 closed\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528)\n\tat
I have tried the following:
- Ran a whole-day ping test against the 'faulty' data nodes -> packet loss was 0%, which I think rules out basic network connectivity issues
- Checked CPU and memory usage on the two nodes; both were below the available capacity
- Ensured that the two nodes are time-synced to our FreeIPA server
I have run out of options and any help would be appreciated. What puzzles me is why it is these specific nodes. I would also like to add that the HBase service (in Ambari) has also been reporting connection errors to one of these nodes; a quick probe of the ports named in the error messages is sketched below.
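For what it's worth, a throwaway probe of the ports from the FetchFailed messages (run from another data node) would look something like this. Note that these ports belong to executor block managers and change with every run, so this is only meaningful while an application is actually running; the external shuffle service, if enabled, listens on port 7337 by default.

```scala
import java.net.{InetSocketAddress, Socket}

object ShufflePortCheck {
  def main(args: Array[String]): Unit = {
    // Hosts/ports taken from the FetchFailed messages above.
    val targets = Seq(
      ("svdt8c2r14-hdpdata05.my.domain.net", 45584),
      ("svdt8c2r14-hdpdata06.my.domain.net", 38650)
    )
    targets.foreach { case (host, port) =>
      val socket = new Socket()
      try {
        socket.connect(new InetSocketAddress(host, port), 5000) // 5 second connect timeout
        println(s"$host:$port reachable")
      } catch {
        case e: Exception => println(s"$host:$port NOT reachable: ${e.getMessage}")
      } finally {
        socket.close()
      }
    }
  }
}
```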
Thanks,
Kim
