Nifi: Compare contents of two files
Labels: Apache NiFi
Created 02-13-2025 02:49 AM
I am fairly new to NiFi and I am using 2 GetSFTP processors to get files from two remote servers; these are application.properties files. Now I want to compare the contents of these files and check for any differences.
I need a little guidance on how I can achieve this in NiFi.
Created on 02-18-2025 11:13 AM - edited 02-18-2025 11:16 AM
@mridul_tripathi
That is not exactly the dataflow I was trying to convey, but good attempt.
This is what I was envisioning:
It starts with fetching the files from "SFTP1" using the ListSFTP and FetchSFTP processors. The ListSFTP processor will create a bunch of FlowFile attributes on the output FlowFile that can be used by FetchSFTP to fetch the content and add it to the FlowFile. In the FetchSFTP processor you will specify the SFTP1 hostname, username, and password. You will use NiFi Expression Language to tell FetchSFTP to fetch the specific content based on the FlowFile attributes created by ListSFTP, for example "${path}/${filename}".
Next, the FlowFile (now with its content from SFTP1) is passed to the CryptographicHashContent processor, which creates a new FlowFile attribute (content_SHA-256) on the FlowFile containing the content hash. Unfortunately, we have no control over the name of the FlowFile attribute created by this processor.
Next, the FlowFile is passed to an UpdateAttribute processor, which is used to copy the value of the content_SHA-256 attribute to a new FlowFile attribute and then remove the content_SHA-256 attribute completely, so we can calculate it again later after fetching the same file from SFTP2.
I created a new FlowFile Attribute (SFTP1_hash) where I copied over the hash. Clicking the "+" will allow you to add a dynamic property.
Next, I pass the FlowFile to a ModifyBytes processor to remove the content from the FlowFile.
Now it is time to fetch the content for this same filename from SFTP2 by using another FetchSFTP processor. This FetchSFTP processor will be configured with the hostname, username, and password for SFTP2. We still want to use the filename from the FlowFile to make sure we are fetching the same file's content from SFTP2, so you can still use "${path}/${filename}", assuming both SFTP1 and SFTP2 use the same path. If not, you will need to set the path manually (<some SFTP2 path>/${filename}).
Now you pass the FlowFile to another CryptographicHashContent processor, which hashes the content fetched from SFTP2 for the same filename. At this point your FlowFile has a bunch of FlowFile attributes, including the hashes of the content from both SFTP1 (SFTP1_hash) and SFTP2 (content_SHA-256), and only the content from SFTP2. So you'll pass it on to the next processor.
Now it is time to compare those two hash attribute values to make sure they are identical, using a RouteOnAttribute processor. Here you will create a NiFi Expression Language (NEL) expression to make this comparison. Clicking the "+" will allow you to add a dynamic property. Each dynamic property added to this processor becomes a new relationship on the processor.
${content_SHA-256:equals(${SFTP1_hash})}
This NEL will return the value/string from the FlowFile's "content_SHA-256" attribute and check to see if it is equal to the value/string from the FlowFile's "SFTP1_hash" attribute. If true, the FlowFile will be routed to the new "Content-Match" relationship. If false, it will be routed to the existing "unmatched" relationship.
Here you can decide if you just want to auto-terminate the "Content-Match" relationship or do some further processing. The "unmatched" relationship will contain any FlowFiles where the two files of the same filename had content that did not match. The FlowFile will contain the content from SFTP2.
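To make the attribute handling in the steps above easier to follow, here is a rough Python model of the flow. This is only a sketch and not NiFi code: the FlowFile class, the helper function names, and the two in-memory dictionaries standing in for SFTP1 and SFTP2 are all assumptions made for illustration. Only the attribute names (content_SHA-256, SFTP1_hash) and the relationship names come from the flow described above.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FlowFile:
    """Toy stand-in for a NiFi FlowFile: attributes plus content bytes."""
    attributes: Dict[str, str] = field(default_factory=dict)
    content: bytes = b""


def hash_content(ff: FlowFile) -> None:
    # Models CryptographicHashContent: the hash lands in content_SHA-256.
    ff.attributes["content_SHA-256"] = hashlib.sha256(ff.content).hexdigest()


def keep_sftp1_hash(ff: FlowFile) -> None:
    # Models UpdateAttribute: copy the hash to SFTP1_hash, drop the original.
    ff.attributes["SFTP1_hash"] = ff.attributes.pop("content_SHA-256")


def clear_content(ff: FlowFile) -> None:
    # Models ModifyBytes removing all content from the FlowFile.
    ff.content = b""


def route(ff: FlowFile) -> str:
    # Models RouteOnAttribute with ${content_SHA-256:equals(${SFTP1_hash})}.
    if ff.attributes["content_SHA-256"] == ff.attributes["SFTP1_hash"]:
        return "Content-Match"
    return "unmatched"


def compare(filename: str, sftp1: Dict[str, bytes], sftp2: Dict[str, bytes]) -> str:
    """Run one filename through the modeled flow and return the relationship."""
    ff = FlowFile(attributes={"filename": filename})
    ff.content = sftp1[filename]   # first FetchSFTP (SFTP1)
    hash_content(ff)
    keep_sftp1_hash(ff)
    clear_content(ff)
    ff.content = sftp2[filename]   # second FetchSFTP (SFTP2)
    hash_content(ff)
    return route(ff)


# Hypothetical server contents, keyed by filename.
sftp1_files = {"application.properties": b"server.port=8080\n"}
sftp2_files = {"application.properties": b"server.port=8081\n"}
print(compare("application.properties", sftp1_files, sftp2_files))
```

As in the real flow, the FlowFile that reaches the routing step carries both hashes as attributes but only the SFTP2 content.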
Hope this helps.
Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt
Created 02-13-2025 06:07 AM
@mridul_tripathi, Welcome to our community! To help you get the best possible answer, I have tagged our NiFi experts @SAMSAL @Shelton @MattWho who may be able to assist you further.
Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.
Regards,
Vidya Sargur, Community Manager
Created 02-13-2025 06:14 AM
@mridul_tripathi
The best way to check if two files have the exact same content is to generate a hash of the content of each file and then compare those two hashes to see if they are the same.
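Outside of NiFi, the idea can be illustrated with a few lines of Python. This is just a sketch of the hash comparison itself; the function names and the sample property lines are made up for the example.

```python
import hashlib


def sha256_hex(content: bytes) -> str:
    """Return the SHA-256 digest of the content as a hex string."""
    return hashlib.sha256(content).hexdigest()


def same_content(a: bytes, b: bytes) -> bool:
    """Treat two byte strings as identical when their digests match."""
    return sha256_hex(a) == sha256_hex(b)


# Hypothetical contents standing in for the two application.properties files.
file_from_sftp1 = b"server.port=8080\nlogging.level=INFO\n"
file_from_sftp2 = b"server.port=8081\nlogging.level=INFO\n"

print(same_content(file_from_sftp1, file_from_sftp1))  # True
print(same_content(file_from_sftp1, file_from_sftp2))  # False
```

Note that the comparison is byte-for-byte, so a difference in whitespace or line endings alone will also produce different hashes.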
While comparing hash values allows you to detect if the content is the same between NiFi FlowFiles, it sounds like you want to determine what is different and not just that they are different? NiFi does not have a processor that is designed to do this function.
So what is the full use case here?
SFTP1 is source of truth always expected to have correct content?
SFTP2 is the backup or expected to have content that matches SFTP1?
Example use case:
You could pull a file from SFTP2 (the file to be verified), create a FlowFile attribute containing the hash of this file (hash-sftp2), then zero out the content (ModifyBytes), then pass this FlowFile to a FetchSFTP (used to fetch the file of the same filename from SFTP1) and create another FlowFile attribute (hash-sftp1). Now you can use a RouteOnAttribute that compares the two hash attributes to see if they are equal. If false, route the FlowFile to PutSFTP to overwrite the file on SFTP2 with the FlowFile's current content from SFTP1, so that both SFTP servers now have matching content for this filename.
Now, if your use case is to somehow output a FlowFile containing all the differences in the content, that is more challenging and would likely require something custom (a custom processor or some custom script).
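As a starting point for such a custom script, here is a minimal standalone Python sketch that reports which keys differ between two properties files. It is not wired into NiFi, and everything in it is an assumption for illustration: the function names are hypothetical, and the parser only handles simple key=value lines (no ":" separators, line continuations, or escape sequences).

```python
import difflib
from typing import Dict, Optional, Tuple


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def diff_properties(
    left: str, right: str
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each differing key to (left_value, right_value); None means absent."""
    a, b = parse_properties(left), parse_properties(right)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


def unified_diff_text(left: str, right: str) -> str:
    """Line-based diff of the raw text, labeled with the two sources."""
    return "".join(
        difflib.unified_diff(
            left.splitlines(keepends=True),
            right.splitlines(keepends=True),
            fromfile="SFTP1",
            tofile="SFTP2",
        )
    )


# Hypothetical file contents.
sftp1_text = "server.port=8080\nlogging.level=INFO\n"
sftp2_text = "server.port=8081\nlogging.level=INFO\ncache.enabled=true\n"

print(diff_properties(sftp1_text, sftp2_text))
print(unified_diff_text(sftp1_text, sftp2_text))
```

The key-based comparison ignores ordering and comments, while the unified diff shows the raw line changes; which one is more useful depends on what you want the output FlowFile to contain.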
Thank you,
Matt
Created 02-16-2025 08:12 PM
Hey @MattWho,
Thanks for the input.
Can you please tell me more about how I can create a FlowFile attribute containing the hash of a file? Do I need to use the 'CryptographicHashContent' processor?
Created 02-16-2025 10:37 PM
Hi @MattWho,
Does this look proper?
I am not sure how to properly configure the 'RouteOnAttribute' processor, as it always places the SFTP1 file in the new location.
Note: I am trying to compare the SFTP1 file (original file) with the SFTP2 file (file to be compared), and if there is any difference I am saving the SFTP2 file to the local server where NiFi is installed.
@MattWho, @SAMSAL, @Shelton please help.