I am trying to remove duplicates from CSV files. For that, I am using the processors below.
ExtractText - to remove the header from the 2 flow files (CSV files)
SplitText - to split the text of the flow files into individual records
RouteOnContent - to route only the record text, not the header
HashContent - to calculate a hash value for each of the split flow files
DetectDuplicate - to detect duplicate records and route the non-duplicate records to MergeContent
MergeContent - to merge the flow files again, based on the Correlation Attribute Name
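To make the intent clearer, the logic I am trying to get out of this flow is roughly the following Python sketch. It is only an illustration of the idea (skip the header, hash each line, keep the first occurrence of each hash, then merge), not my NiFi configuration; the function name `dedupe_csv_texts` and the sample data are made up for the example, and it assumes both files share the same header and fit in memory.

```python
import hashlib


def dedupe_csv_texts(csv_texts):
    """Return (header, unique_rows) across CSV texts that share one header.

    Hypothetical helper: mirrors the split -> hash -> detect duplicate -> merge idea.
    """
    seen = set()          # hashes of rows already emitted
    header = None
    unique_rows = []
    for text in csv_texts:
        lines = text.splitlines()
        if not lines:
            continue
        if header is None:
            header = lines[0]      # keep the header from the first file only
        for line in lines[1:]:     # skip the header line of every file
            if not line.strip():
                continue
            digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
            if digest in seen:
                continue           # duplicate record, drop it
            seen.add(digest)
            unique_rows.append(line)
    return header, unique_rows


# Made-up sample data standing in for the two CSV files.
file_a = "id,name\n1,alice\n2,bob\n"
file_b = "id,name\n2,bob\n3,carol\n"
header, rows = dedupe_csv_texts([file_a, file_b])
print(header)
print(rows)
```

Here the row `2,bob` appears in both inputs and is kept only once, which is the result I expect from the DetectDuplicate and MergeContent steps.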
With this flow, I sometimes face an issue where a flow file gets stuck between DetectDuplicate and MergeContent. I have set Max Bin Age to 5 minutes, but even after 20 to 30 minutes the flow file has not been processed by MergeContent. Could someone share their input? (NOTE: This is a standalone NiFi setup, not a cluster. FlowFile expiration is set to 5 hours.)
I have attached screenshots of the MergeContent configuration.
One more thing: in the production environment, we may get millions of records in the two CSV files. Is there a better way to find duplicates in CSV files than splitting them into individual lines of text? If there is no other way, which configuration settings should I take care of to avoid issues in production?