Member since: 12-26-2018
Posts: 4
Kudos Received: 2
Solutions: 1

My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
|  | 5118 | 02-21-2021 06:24 AM |
02-21-2021 06:24 AM
1 Kudo
Answered my own question: two of the data nodes could not communicate (SSH) with each other due to a network configuration issue.
12-31-2020 02:02 AM
Hi, we are running an HDP 3.1 production cluster with 2 master nodes and 54 data nodes; the Spark version is 2.3 and YARN is the cluster manager. Each data node has about 250 GB RAM and 56 cores. We are using a combination of NiFi and Spark to set up a Hive DWH as follows:

- NiFi picks input CSV files from the source and loads them to HDFS in Parquet format.
- Spark jobs pick these files and load them into Hive managed tables (see the sketch after this post).
- Another set of Spark jobs aggregates the data in the Hive managed tables into hourly and daily Hive managed tables.

I noticed that my Spark jobs were very slow at times when writing to Hive. I ran a test where I ran the same spark-submit job several times, and the same job with the exact same parameters took anywhere from 10-13 minutes (fast) to 35-40 minutes (slow).

My spark-submit job:

sudo /usr/hdp/3.1.0.0-78/spark2/bin/spark-submit --files /etc/hive/3.1.0.0-78/0/hive-site.xml,/etc/hadoop/3.1.0.0-78/0/core-site.xml,/etc/hadoop/3.1.0.0-78/0/hdfs-site.xml --driver-class-path /usr/hdp/3.1.0.0-78/spark2/jars/postgresql-42.2.5.jar,/usr/hdp/3.1.0.0-78/spark2/jars/config-1.3.4.jar --jars /usr/hdp/3.1.0.0-78/spark2/jars/postgresql-42.2.5.jar,/usr/hdp/3.1.0.0-78/spark2/jars/config-1.3.4.jar --class my.domain.net.spark_job_name --master yarn --deploy-mode cluster --driver-memory 50G --driver-cores 40 --executor-memory 50G --num-executors 50 --executor-cores 10 --name spark_job_name --queue my_queue /home/nkimani/spark_job_name-1.0-SNAPSHOT.jar 2020-10-06 1 16 16

YARN and Spark logs indicated that two data nodes (data nodes 05 and 06) were consistently throwing the following error:

{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"FetchFailed","Block Manager Address":{"Executor ID":"32","Host":"svdt8c2r14-hdpdata06.my.domain.net","Port":38650},"Shuffle ID":0,"Map ID":546,"Reduce ID":132,"Message":"org.apache.spark.shuffle.FetchFailedException: Connection from svdt8c2r14-hdpdata06.my.domain.net/10.197.26.16:38650 closed\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528)\n\tat
..
..
{"Event":"SparkListenerTaskEnd","Stage ID":6,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"FetchFailed","Block Manager Address":{"Executor ID":"27","Host":"svdt8c2r14-hdpdata05.my.domain.net","Port":45584},"Shuffle ID":0,"Map ID":213,"Reduce ID":77,"Message":"org.apache.spark.shuffle.FetchFailedException: Connection from svdt8c2r14-hdpdata05.my.domain.net/10.197.26.15:45584 closed\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:528)\n\tat

I have tried the following:

- Ran a whole-day ping test to the 'faulty' data nodes; the packet drop was 0%, ruling out network connectivity issues, I think.
- Checked CPU and memory usage for the two nodes; both were below the available capacity.
- Ensured that the two nodes are time-synced to our FreeIPA server.

I have run out of options and any help would be appreciated. What puzzles me is why these specific nodes. I would also like to add that the HBase service (in Ambari) has also been reporting connection errors to one of these nodes.

Thanks,
Kim
Labels: Apache Spark
12-29-2018 11:39 AM
I found the solution today. I modified my PutDatabaseRecord processor: I explicitly set the Oracle schema on its own instead of including it as part of the Table Name. I am now using only 2 processors, ExecuteSQL and PutDatabaseRecord. There is no need for the SplitAvro processor.
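For example, a hedged sketch of how the relevant PutDatabaseRecord properties might look after that change, using the IC_STAGE / TEMP_NIFI names from the related post below; the exact values are assumptions, not a copy of the actual flow:

```
Record Reader          : Avro Reader
Statement Type         : INSERT
Schema Name            : IC_STAGE     <- schema set explicitly here (assumption)
Table Name             : TEMP_NIFI    <- no schema prefix in the table name
```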
12-26-2018 11:28 AM
@mburgess I encountered a similar problem while trying to ingest data from one Oracle table to another. It does not work if I use SplitAvro either.

NiFi version: 1.7.1

Source table:

CREATE TABLE IC_STAGE.TEMP_NIFI_1
(
  ID_DATE VARCHAR2(500 BYTE),
  REC NUMBER
)

Destination table:

CREATE TABLE IC_STAGE.TEMP_NIFI
(
  ID_DATE VARCHAR2(500 BYTE),
  REC NUMBER
)

ExecuteSQL: SELECT * FROM IC_STAGE.TEMP_NIFI_1

PutDatabaseRecord properties:

- Record Reader: Avro Reader
- Schema Access Strategy: Use Embedded Avro Schema

My schema, when I view the details of the queue, looks like this:

{"type":"record","name":"NiFi_ExecuteSQL_Record","namespace":"any.data","fields":[{"name":"ID_DATE","type":["null","string"]},{"name":"REC","type":["null",{"type":"bytes","logicalType":"decimal","precision":10,"scale":0}]}]}

I got the same error even when I used "Schema Text" as my access strategy.
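As a side note on the embedded schema above: the Oracle NUMBER column (REC) comes through as an Avro bytes field with a decimal logical type. Purely for illustration, here is a small self-contained Scala sketch of how such a value decodes with the Avro Java API; the payload value 42 is made up and this is not part of the NiFi flow.

```scala
import java.math.BigDecimal
import java.nio.ByteBuffer

import org.apache.avro.{Conversions, LogicalTypes, Schema}

object DecodeRecField {
  def main(args: Array[String]): Unit = {
    // Mirror the REC field from the embedded schema: bytes + decimal(10, 0)
    val decimalType = LogicalTypes.decimal(10, 0)
    val recSchema = Schema.create(Schema.Type.BYTES)
    decimalType.addToSchema(recSchema)

    // Hypothetical payload: the two's-complement unscaled bytes of 42
    val payload = ByteBuffer.wrap(BigDecimal.valueOf(42).unscaledValue.toByteArray)

    // Avro's standard conversion turns the bytes back into a BigDecimal
    val rec = new Conversions.DecimalConversion().fromBytes(payload, recSchema, decimalType)
    println(rec) // prints 42
  }
}
```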