What can cause data duplication in a NiFi data flow?

Hi,

What can cause data duplication in a NiFi data flow? I am trying to load data from SQL Server database tables into Parquet files on HDFS. The data in the source tables is unique, and the columns I use in QueryDatabaseTable are unique in the source. I tried deleting the Parquet files and restarting the process, but the result is the same. I check the contents of the Parquet files by exposing them in Hive as an external table. I use two or three PutParquet instances with DistributeLoad to parallelize the data loading.

80599-dataflow.png

80600-distrubuteload.png

80601-querydatabasetable.png

Please help

@miro Ka

1 REPLY

@Miro Ka

The issue I can see in your flow is that the QueryDatabaseTable processor is running on all nodes, but it is supposed to run only on the primary node. When it runs on all nodes, each node fetches the same data from the source table, which results in duplicated data once it is stored in HDFS.

Change the Execution setting to Primary Node in the Scheduling tab:

80602-run.png

Once you have changed it to Primary Node, the processor shows a small P (as in the screenshot below), which indicates that it runs only on the primary node.

80603-qdt.png

After making this change, run the flow again and you should no longer get duplicate data.
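To illustrate why the Execution setting matters, here is a minimal Python sketch. It does not use NiFi at all; it only simulates the effect under the simplifying assumption that every node scheduled to run the processor fetches the full result set independently. The sample rows, the node count of 3, and the helper names `fetch_rows` and `duplicated_keys` are hypothetical.

```python
from collections import Counter


def fetch_rows(source_rows, node_count):
    """Simulate each scheduled node running the same query and writing its result."""
    written = []
    for _ in range(node_count):
        # Assumption: every node fetches the complete result set on its own.
        written.extend(source_rows)
    return written


def duplicated_keys(rows):
    """Return {key: count} for every key (first column) that appears more than once."""
    counts = Counter(row[0] for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical source table with a unique first column.
source = [(1, "a"), (2, "b"), (3, "c")]

all_nodes = fetch_rows(source, node_count=3)     # Execution = All Nodes, 3-node cluster
primary_only = fetch_rows(source, node_count=1)  # Execution = Primary Node

print(duplicated_keys(all_nodes))     # {1: 3, 2: 3, 3: 3}
print(duplicated_keys(primary_only))  # {}
```

In this simulation every key shows up once per node, so a count equal to the number of nodes in your cluster would be consistent with the processor running on all nodes.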