When doing file ingestion through HDF and NiFi, I'm wondering when, if ever, manual file integrity checks are needed.
In one scenario, a set of files (SQL dumps and various log files) is put daily on an SFTP server (not using NiFi or any Hortonworks product) for me to retrieve through NiFi and archive in HDFS.
It has been suggested that the files should be accompanied by a checksum (SHA-256 or similar) for me to verify with HashContent in NiFi before putting them on HDFS. Is this actually necessary? Can we encounter file transfer errors through SFTP/NiFi that won't be caught by the underlying technology? Does it matter whether the files are compressed before being uploaded to the SFTP server?
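For context, the verification step itself is cheap; here is a minimal Python sketch of what sidecar-checksum validation amounts to, assuming the producer drops a sha256sum-style companion file named `<file>.sha256` next to each upload (the sidecar name and format are my assumptions, not anything NiFi mandates):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, streaming in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(data_file: Path) -> bool:
    """Compare the file's digest against its sidecar '<name>.sha256'.

    Sidecar format assumed: '<hex digest>  <filename>' (sha256sum style).
    """
    sidecar = data_file.with_name(data_file.name + ".sha256")
    expected = sidecar.read_text().split()[0].lower()
    return sha256_of(data_file) == expected
```

Note this only detects corruption end to end (producer to HDFS); whether it is needed at all is exactly the question, since the transport layers below may already guarantee it.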
Is a file fetched by ListSFTP/FetchSFTP guaranteed to be done writing, or do we need a manual scheme of renaming or placing signature/ok files to signal that a file is ready to fetch?
Please help with insights and arguments on when and why manual integrity schemes should be implemented, and when this is already taken care of "underneath". I'd be glad of some best practices here, so I don't reinvent schemes that are already built in. Thanks.
Hello @Henrik Olsen, you should not need to do additional file integrity checks beyond what the transport protocols do for you. As for your question about List/Fetch and guarantees that the data is done being written: there are no guarantees. The most reliable model for resolving race conditions in file I/O between the producer (the thing writing the file) and the consumer (the thing grabbing the file) is to use a file naming technique such as prefixing the filename with a '.' while writing and removing the prefix when the write is complete. If you cannot establish such a model, you can resort to more complicated techniques like waiting to fetch listed files after some intentional/artificial delay.
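To make the dot-rename handshake concrete, here is a minimal producer-side sketch in plain Python (filenames are illustrative assumptions; on the consumer side you would pair this with a ListSFTP file filter that excludes dot-prefixed names, e.g. a regex like `^[^.].*`):

```python
from pathlib import Path

def publish(target: Path, data: bytes) -> None:
    """Write under a dot-prefixed temp name, then rename when complete.

    A consumer that ignores dot-files never sees a half-written file;
    the rename is atomic when source and target are on the same filesystem,
    so the file appears to the consumer only in its finished state.
    """
    tmp = target.with_name("." + target.name)  # e.g. '.dump.sql' while in flight
    tmp.write_bytes(data)                      # may take arbitrarily long
    tmp.rename(target)                         # file becomes visible only now
```

The same idea works over SFTP: upload to the dotted name, then issue a rename on the server once the transfer finishes.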
Thanks @jwitt. Sounds good if the underlying transport protocols ensure file integrity. Can you help clarify, though, which parts in the chain of "client sftp -> server sftp -> nifi ListSFTP -> nifi FetchSFTP -> nifi PutHDFS" ensure the integrity (we need to document it for the architecture)? Especially around SFTP, and whether it is that choice vs. regular FTP that makes it OK to skip file integrity checks. It would be great to know what technology/documentation to refer to for documenting that file integrity is ensured, if available.
I'll try to make it a standard that producers must use the dot-rename technique when writing files on the (S)FTP servers we fetch data from. How do we document the need for this race-condition avoidance?
We do sometimes have an "ok" file accompanying files, which one could agree is only transferred _after_ the main file is done uploading, but checking for it is a bit more involved than being able to trust a file as soon as you see it (by filtering out dot-prefixed files). So the rename is a good strategy, I think.
In case somebody insists on signature files (possibly because some systems are still on regular FTP, if that makes a difference), I've seen the fine template at https://community.hortonworks.com/questions/85836/is-it-possible-in-nifi-to-check-an-input-file-for..... I've been wondering, though: what if some files are missing their signature file? How do we catch that with a good flow design, so we can react to it (e.g. a time you're willing to wait for the signature before something needs to happen)?
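One way to catch a missing signature file is a poll-with-deadline check. The sketch below is plain Python, not a NiFi API; the `.ok` suffix, timeout, and poll interval are my assumptions. In an actual NiFi flow the same idea could be expressed as a retry loop where FlowFiles without a matching signature are penalized and re-queued, and routed to a failure/alert relationship once the deadline attribute is exceeded:

```python
import time
from pathlib import Path

def wait_for_signature(data_file: Path, timeout_s: float, poll_s: float = 5.0) -> bool:
    """Poll for the companion '<name>.ok' file; give up after timeout_s seconds.

    Returns True if the signature file appeared in time, False once the
    deadline passes -- the caller can then raise an alert or route the
    file to a quarantine/failure path instead of ingesting it.
    """
    sig = data_file.with_name(data_file.name + ".ok")
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if sig.exists():
            return True
        time.sleep(poll_s)
    return sig.exists()  # one last check after the deadline
```

The important design point is the explicit deadline: without it, a file whose signature never arrives would silently wait forever instead of triggering a reaction.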
I would greatly appreciate some pointers on my request (this thread, Feb 28, comment to @jwitt) for how to document in architectural plans that we can skip manual file integrity checks when moving files over SFTP and NiFi into HDP. I'd love to avoid these checks, but I need to refer to solid docs/arguments showing that file integrity is already guaranteed by the underlying technology (e.g. SFTP vs. FTP). Any hints on where to go with this? Thanks.