<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark FetchFailed Connection Exception slowing down writes to hive in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Spark-FetchFailed-Connection-Exception-slowing-down-writes/m-p/311839#M224860</link>
    <description>Answered my own question. Two of the data nodes could not communicate with each other over SSH due to a network configuration issue.</description>
    <pubDate>Sun, 21 Feb 2021 14:24:39 GMT</pubDate>
    <dc:creator>nkimani</dc:creator>
    <dc:date>2021-02-21T14:24:39Z</dc:date>
    <item>
      <title>Spark FetchFailed Connection Exception slowing down writes to hive</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-FetchFailed-Connection-Exception-slowing-down-writes/m-p/308551#M223610</link>
      <description>&lt;P&gt;Hi, we are running an HDP 3.1 production cluster with 2 Master Nodes and 54 Data Nodes; the Spark version is 2.3 and YARN is the cluster manager. Each data node has about 250GB RAM and 56 cores. We are using a combination of NiFi and Spark to set up a Hive DWH as follows:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;NiFi picks up input CSV files from the source and loads them to HDFS in Parquet format&lt;/LI&gt;&lt;LI&gt;Spark jobs pick up these files and load them into Hive managed tables&lt;/LI&gt;&lt;LI&gt;Another set of Spark jobs aggregates the data in the Hive managed tables into hourly and daily Hive managed tables.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;I noticed that my Spark jobs were at times very slow when writing to Hive. I ran a test where I submitted the same spark-submit job several times with the exact same parameters; the runs took anywhere from 10-13 minutes (fast) to 35-40 minutes (slow).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My spark-submit job:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;sudo /usr/hdp/3.1.0.0-78/spark2/bin/spark-submit --files /etc/hive/3.1.0.0-78/0/hive-site.xml,/etc/hadoop/3.1.0.0-78/0/core-site.xml,/etc/hadoop/3.1.0.0-78/0/hdfs-site.xml --driver-class-path /usr/hdp/3.1.0.0-78/spark2/jars/postgresql-42.2.5.jar,/usr/hdp/3.1.0.0-78/spark2/jars/config-1.3.4.jar --jars /usr/hdp/3.1.0.0-78/spark2/jars/postgresql-42.2.5.jar,/usr/hdp/3.1.0.0-78/spark2/jars/config-1.3.4.jar --class my.domain.net.spark_job_name --master yarn --deploy-mode cluster --driver-memory 50G --driver-cores 40 --executor-memory 50G --num-executors 50 --executor-cores 10 --name spark_job_name --queue my_queue /home/nkimani/spark_job_name-1.0-SNAPSHOT.jar 2020-10-06 1 16 16&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;YARN and Spark logs indicated that 2 data nodes (data nodes 05 and 06) were consistently throwing the following 
error:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"FetchFailed","Block Manager Address":{"Executor ID":"32","Host":"svdt8c2r14-hdpdata06.my.domain.net","Port":38650},"Shuffle ID":0,"Map ID":546,"Reduce ID":132,"Message":"org.apache.spark.shuffle.FetchFailedException: Connection from svdt8c2r14-hdpdata06.my.domain.net/10.197.26.16:38650 closed\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528)\n\tat 
..
..
{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"FetchFailed","Block Manager Address":{"Executor ID":"27","Host":"svdt8c2r14-hdpdata05.my.domain.net","Port":45584},"Shuffle ID":0,"Map ID":213,"Reduce ID":77,"Message":"org.apache.spark.shuffle.FetchFailedException: Connection from svdt8c2r14-hdpdata05.my.domain.net/10.197.26.15:45584 closed\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528)\n\tat &lt;/LI-CODE&gt;&lt;P&gt;&lt;BR /&gt;I have tried the following:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Ran a whole-day ping test against the 'faulty' data nodes -&amp;gt; the packet drop rate was 0%, which I think rules out basic network connectivity issues&lt;/LI&gt;&lt;LI&gt;Checked CPU and memory usage on the two nodes; both were below available capacity&lt;/LI&gt;&lt;LI&gt;Ensured that the two nodes are time-synced to our FreeIPA server&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;I have run out of options and any help would be appreciated. What puzzles me is why it is these specific nodes. I would also like to add that the HBase service (in Ambari) has also been reporting connection errors to one of these nodes.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Kim&lt;/P&gt;</description>
      <pubDate>Thu, 31 Dec 2020 10:02:34 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-FetchFailed-Connection-Exception-slowing-down-writes/m-p/308551#M223610</guid>
      <dc:creator>nkimani</dc:creator>
      <dc:date>2020-12-31T10:02:34Z</dc:date>
    </item>
    <item>
      <title>Re: Spark FetchFailed Connection Exception slowing down writes to hive</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-FetchFailed-Connection-Exception-slowing-down-writes/m-p/311839#M224860</link>
      <description>Answered my own question. Two of the data nodes could not communicate with each other over SSH due to a network configuration issue.</description>
      <pubDate>Sun, 21 Feb 2021 14:24:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-FetchFailed-Connection-Exception-slowing-down-writes/m-p/311839#M224860</guid>
      <dc:creator>nkimani</dc:creator>
      <dc:date>2021-02-21T14:24:39Z</dc:date>
    </item>
  </channel>
</rss>

