So I have to compare today's XML files (about 15k to 20k per day) against yesterday's, and then send only the ones with differences into the additional flows for processing.
Currently I have a Python script that does the compare, but I have to receive all the files first. This delays processing by about an hour or more.
What I would like to do is, as files are received, pull yesterday's file (if there is a match) and then run both through xmldiff, which is on an ExecuteStreamCommand processor. I cannot figure out how to get both files into the Python script so that the compare can be run against the two files.
I found a cookbook script that passes information to the console via bash, but I have not been able to pass it to the Python script.
With NiFi, each processor execution works on a FlowFile in isolation, so your script has no access to more than one FlowFile at a time.
I guess the question is, what is the end goal here?
You have a file from yesterday and today, I presume with the same filename?
So you want to pull in both these files and compare their content (which is xml) to see what the differences are between each day, correct?
Then what do you intend to do with that information?
I suppose you could merge the two FlowFiles with the same filename using some unique delimiter. Then pass that one FlowFile, with content from both yesterday and today, to your script to parse the difference between those two delimited content sections. Then return whatever result you want to propagate on through your downstream dataflow(s). You could write that result to a FlowFile attribute or replace the existing content.
If you wrote it to an attribute, you could split the merged file back into the two original XML files based on the unique delimiter you used earlier when they were merged.
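To make the delimiter idea a bit more concrete, here is a minimal Python sketch of how the merged content could be laid out and split back apart. The delimiter string and the function names are made up for illustration. In the actual flow the merge would be done by a NiFi processor rather than by `merge_xml`, which is only here to show the layout.

```python
from typing import Tuple

# Hypothetical delimiter: pick something that can never occur inside the XML.
DELIMITER = "\n<<<XML-DIFF-SPLIT>>>\n"


def merge_xml(yesterday: str, today: str, delimiter: str = DELIMITER) -> str:
    """Lay out both documents in one payload, yesterday first."""
    return yesterday + delimiter + today


def split_xml(merged: str, delimiter: str = DELIMITER) -> Tuple[str, str]:
    """Recover the two original documents from a merged payload."""
    parts = merged.split(delimiter)
    if len(parts) != 2:
        raise ValueError(f"expected 2 XML blocks, found {len(parts)}")
    return parts[0], parts[1]
```

The length check matters: if a matching file from yesterday was never found, the payload holds only one block and the script should fail loudly rather than compare a file against nothing.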
Perhaps you can modify your script so it writes an attribute back on the source FlowFile to identify if differences exist.
So you have a merged FlowFile, which you pass to your script, and the script compares the two delimited XML blocks for differences. If none are found, it simply creates an attribute on the FlowFile.
Attribute Name: diff-found    Attribute value: false
Then have a RouteOnAttribute processor drop any FlowFile where the diff-found attribute has been set to false; otherwise route the FlowFile to unmatched, so that only the files that are different go on to further processing.
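A rough sketch of what the compare script itself might look like, assuming the merged content arrives on stdin and the script's output becomes the value of the diff-found attribute. The delimiter and function names are hypothetical. Note that I have swapped in the standard library's `xml.etree.ElementTree.canonicalize` (Python 3.8+) for xmldiff here just to keep the example self-contained; your existing xmldiff call could take its place inside `xml_differs`.

```python
import xml.etree.ElementTree as ET

# Hypothetical delimiter; it must match whatever the merge step inserts.
DELIMITER = "\n<<<XML-DIFF-SPLIT>>>\n"


def xml_differs(merged: str, delimiter: str = DELIMITER) -> bool:
    """Return True when the two delimited XML blocks differ."""
    parts = merged.split(delimiter)
    if len(parts) != 2:
        raise ValueError(f"expected 2 XML blocks, found {len(parts)}")
    # Canonical form ignores attribute order and, with strip_text=True,
    # whitespace around text nodes, so only real changes count.
    first, second = (ET.canonicalize(p.strip(), strip_text=True) for p in parts)
    return first != second


def run(stream_in, stream_out) -> None:
    """Read the merged content and write the diff-found value."""
    differs = xml_differs(stream_in.read())
    stream_out.write("true" if differs else "false")
```

In the real script the last line would be something like `run(sys.stdin, sys.stdout)` after an `import sys`. As far as I know, ExecuteStreamCommand has an Output Destination Attribute property that stores the command's output in an attribute and leaves the FlowFile content untouched, so setting it to diff-found should give RouteOnAttribute something to test. Verify that against the processor documentation for your NiFi version.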