We are currently receiving duplicate data for every event in our flow setup, at seemingly random times. I suspect it might have to do with the SplitText processors at the beginning of our flows, but I would like some professional feedback on this. My flows are in the screenshots below, and I will provide any further information or screenshots of configs upon request. The SplitText processors split the incoming data into 100k-line bundles and then into 10k-line bundles. This was implemented to ease the burden when we get a blast of data all at once (GBs worth), and it seems to handle that well, but now I fear it's causing this issue. That said, there were times before this change when duplicate data occurred, but not nearly as frequently.
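To make the two-stage split concrete, here is a rough Python sketch of the idea. It is not NiFi code and not our actual configuration; the function names (`split_lines`, `two_stage_split`) are made up for illustration, and only the 100k and 10k sizes come from our setup.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def split_lines(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive bundles of at most `size` lines (hypothetical helper)."""
    it = iter(lines)
    while True:
        bundle = list(islice(it, size))
        if not bundle:
            return
        yield bundle


def two_stage_split(
    lines: Iterable[str], first: int = 100_000, second: int = 10_000
) -> Iterator[List[str]]:
    """Split into `first`-sized bundles, then split each of those into `second`-sized bundles."""
    for big_bundle in split_lines(lines, first):
        # Each large bundle is consumed exactly once by the second stage.
        yield from split_lines(big_bundle, second)
```

In this sketch every input line lands in exactly one output bundle, so a split that runs once over its input does not produce duplicates by itself. If the same input were fed through twice, every event in it would appear twice.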
Could you expand on "track the lineage"? Do you mean seeing which file the data comes from? I ask because the data I suspect to be duplicated comes from the same file. Are there other data provenance elements to look at that might reveal something? Thanks.
I will say it's odd: we also use HUNK, and I'm looking at the ten different flows and the duplication that occurred over the last 24 hours. It seems that from 7 am to 8 am, all the data for just that hour was duplicated on all 10 independent flows. That seems significant to me. What are the odds of a fluke in 10 different processors causing duplication at the same time? Does this hint at a possible culprit?
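For anyone who wants to check for the same pattern, this is roughly the kind of tally I am describing, written as a small Python sketch. The record layout (flow name, event timestamp, event text) and the function name are assumptions for illustration only; it is not a HUNK query or anything NiFi provides.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple

# Hypothetical record layout: (flow name, event timestamp, raw event text).
Record = Tuple[str, datetime, str]


def duplicates_per_flow_hour(records: Iterable[Record]) -> Counter:
    """Count repeated events, grouped by flow and by the hour of the event timestamp."""
    seen = set()
    dupes: Counter = Counter()
    for flow, ts, event in records:
        key = (flow, event)
        if key in seen:
            # Truncate the timestamp to the start of its hour for grouping.
            hour = ts.replace(minute=0, second=0, microsecond=0)
            dupes[(flow, hour)] += 1
        else:
            seen.add(key)
    return dupes
```

If the counts for every flow pile up in the same hour bucket, that would match what I am seeing: one shared window rather than ten unrelated flukes.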