How to perform a reliable data integrity check with NiFi?

Hi,

I am working on a NiFi flow that fetches data from an FTP server and writes it to HDFS. I need to add a data integrity check between the files fetched from the FTP server and the files written to HDFS.

To do this, I use the HashContent processor with the MD5 algorithm to compute an MD5 hash of the FlowFiles at the start and at the end of the flow. (I can get the MD5 hash of each file on the FTP server; the second MD5 is computed after a PutHDFS, by retrieving the files that have been written.)

Finally, I compare the two values and consider the data intact if they are equal.
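Outside of NiFi, the same start-versus-end comparison can be sketched in a few lines of Python. This is only an illustration of the idea, assuming both copies are readable as local files; the function names md5_of_file and same_content are hypothetical and are not part of NiFi.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files never have to fit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(source_path, destination_path):
    """True when both files hash to the same MD5 digest."""
    return md5_of_file(source_path) == md5_of_file(destination_path)
```

Note that a matching MD5 only shows that the two byte streams are very likely identical; it says nothing about whether the source file was already damaged before it was hashed.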

Do you have any general advice about this practice?

Is this kind of check really useful with NiFi?

Thanks,

Benjamin

1 ACCEPTED SOLUTION

Super Mentor
@Benjamin Bouret

-

The ListHDFS processor does not retrieve the actual content of the files. It produces 0-byte FlowFiles that carry metadata about the target content. Any hash you compute on these FlowFiles will not match the hash computed on the original source FTP server.
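To see why hashing a 0-byte FlowFile cannot work, here is a small Python sketch. It only uses the standard hashlib module; the helper name md5_hex is hypothetical. Every empty payload produces the same well-known MD5 digest, whatever file the metadata refers to.

```python
import hashlib


def md5_hex(content: bytes) -> str:
    """Return the hex MD5 digest of an in-memory byte string."""
    return hashlib.md5(content).hexdigest()


# The digest of zero bytes is a fixed constant, independent of any file.
EMPTY_MD5 = md5_hex(b"")
print(EMPTY_MD5)  # d41d8cd98f00b204e9800998ecf8427e

# Any non-empty source file will therefore produce a different digest.
print(md5_hex(b"some file content") == EMPTY_MD5)  # False
```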

-

If I am not following the above correctly, then it is not clear to me exactly where you are performing this second hash.

How do you plan to compare the two hashes? Manually?

-

NiFi has guaranteed delivery when it writes data to HDFS. If the transfer fails for any reason, the FlowFile is routed to the failure relationship.

[Screenshot: 79437-screen-shot-2018-07-10-at-75834-am.png]

-

The FetchFTP processor also handles failures when retrieving the content:

[Screenshot: 79436-screen-shot-2018-07-10-at-75653-am.png]

-

This check seems like a lot of overhead that should not be necessary.

-

Thank you,

Matt

-

When an "Answer" addresses/solves your question, please select "Accept" beneath that answer. This encourages user participation in this forum.


2 REPLIES

Thank you for your answer,

The second hash is performed after a PutHDFS and not a ListHDFS (I have edited my post, sorry for the mistake).

If I understand correctly, this check is not useful because the PutHDFS and FetchFTP processors already catch corruption errors reliably?

I would compare the two hashes using the NiFi Expression Language (the equals function) inside a RouteOnAttribute processor.
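In RouteOnAttribute this would presumably be a dynamic property with an expression along the lines of ${hash.value:equals(${source.md5})}. Here hash.value is assumed to be the attribute written by HashContent, and source.md5 is a hypothetical attribute holding the hash obtained from the FTP side. The Python sketch below only models that routing decision; it is not NiFi code, and the relationship name "matched" is an assumption. It also normalizes letter case, since a plain equals comparison is case-sensitive and hex digests from different tools may differ in case (the Expression Language also offers equalsIgnoreCase for this).

```python
def route_on_hash(attributes, source_key="source.md5", target_key="hash.value"):
    """Model the routing decision: 'matched' when both hashes agree.

    attributes is a plain dict standing in for FlowFile attributes.
    Both key names are assumptions made for this illustration.
    """
    source = attributes.get(source_key)
    target = attributes.get(target_key)
    if source is None or target is None:
        # A missing hash can never be treated as a successful check.
        return "unmatched"
    # Hex digests are case-insensitive, so normalize before comparing.
    if source.strip().lower() == target.strip().lower():
        return "matched"
    return "unmatched"
```

For example, route_on_hash({"source.md5": "ABC123", "hash.value": "abc123"}) returns "matched", while a FlowFile missing either attribute is treated as "unmatched".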