Hi All,
One of our NiFi dataflows ingests small files (1-5 KB each) at a rate of 100+ messages per second, and the requirement is to store them in HDFS. We're currently using a MergeContent processor to bundle 1,000 files into a single larger file, which helps, but the result is still nowhere near the ideal size for HDFS storage. We could make MergeContent wait for more files until a merged file of the desired size is ready, but we do not want to hold data in NiFi for too long; we want delivery to HDFS to stay as close to "near real-time" as possible, not have data sit in the MergeContent processor for a day while enough files accumulate.
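For context, the latency/size trade-off comes down to the MergeContent bin settings. Ours look roughly like the sketch below (the 1,000 minimum matches what we do today; the maximum entries and bin age are illustrative values, not our exact configuration):

    Merge Strategy: Bin-Packing Algorithm
    Minimum Number of Entries: 1000
    Maximum Number of Entries: 10000
    Max Bin Age: 5 min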
So it appears the PutHDFS "append" option might work: write the files as they come in and append them to an existing HDFS file until the desired file size is reached (I have some questions on this approach, posted here - https://community.hortonworks.com/questions/99843/questions-on-nifi-puthdfs-append-option.html).
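If the append route works, I imagine the flow would look roughly like this (a sketch only; the UpdateAttribute step, the expression, and the directory are my assumptions, just to force every FlowFile for a given hour into the same HDFS file):

    UpdateAttribute -> filename = events_${now():format('yyyyMMddHH')}
    PutHDFS         -> Directory: /data/events
                       Conflict Resolution Strategy: append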
Another option we're considering is a nightly job/dataflow that merges the HDFS files at rest into files of the desired size; this seems like the simpler approach.
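If we went the nightly-compaction route, I'm picturing something like a small PySpark job that rewrites each day's directory into a handful of larger files (again just a sketch; the paths, day handling, and coalesce factor are placeholders):

    from pyspark.sql import SparkSession

    # Nightly compaction sketch: read the day's small files and rewrite
    # them as a few large ones. Paths and partition count are placeholders.
    spark = SparkSession.builder.appName("hdfs-small-file-compaction").getOrCreate()

    day = "2017-06-26"                              # supplied by the scheduler
    src = "/data/events/landing/{}".format(day)     # small files written by NiFi
    dst = "/data/events/compacted/{}".format(day)   # compacted output

    lines = spark.read.text(src)                    # one row per input line
    lines.coalesce(4).write.text(dst)               # a handful of large files

    spark.stop()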
I wanted to know which option would be better for addressing the too-many-small-files issue.
Thanks in advance.