Options in addressing "too many small files in HDFS" issue in NiFi

Expert Contributor

Hi All,

One of our NiFi dataflows ingests small files (1-5 KB each) at a rate of 100+ messages per second, and the requirement is to store them in HDFS. We're currently using a MergeContent processor to bundle 1,000 files into a single larger file, which helps, but the result is still nowhere near the ideal file size for HDFS storage. We could configure MergeContent to wait for more files until a merged file of the desired size is ready, but we don't want to hold data in NiFi that long; we want to land data in HDFS as close to near real-time as possible, not sit in the MergeContent processor for a day waiting for enough files to accumulate.
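
For context, a rough sketch of the kind of MergeContent settings involved (the values here are illustrative, not our exact production configuration):

    Merge Strategy            : Bin-Packing Algorithm
    Merge Format              : Binary Concatenation
    Minimum Number of Entries : 1000
    Maximum Number of Entries : 1000
    Max Bin Age               : (not set, so bins are released only when full)

At 100+ messages/second and 1-5 KB each, 1,000 entries works out to roughly 1-5 MB per merged file, which is still well below the typical HDFS block size of 128 MB.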

So, it appears the PutHDFS "append" option might work: write the files to HDFS as they come in and append them to an existing HDFS file until the desired file size is reached. (I have some questions on this approach, posted here - https://community.hortonworks.com/questions/99843/questions-on-nifi-puthdfs-append-option.html)
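
To be clear, the "append" being referred to is PutHDFS's Conflict Resolution Strategy property; the directory below is just a placeholder path, and the cluster has to permit appends for this to work:

    Hadoop Configuration Resources : /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
    Directory                      : /data/ingest/current
    Conflict Resolution Strategy   : append

With that setting, a flowfile whose filename matches an existing file in the target directory is appended to it rather than replacing it or failing.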

Another option we're considering is a nightly job/dataflow that merges the HDFS files at rest into the desired size; this seems like the simpler approach.
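
As a rough illustration of what that nightly merge-at-rest could look like as a shell step (the paths and date are hypothetical, and simple byte concatenation like this is only safe for line-oriented formats such as plain text or newline-delimited JSON, not container formats like Avro or Parquet):

    # concatenate one day's small files into a single large file, then remove the originals
    hdfs dfs -cat /data/ingest/2017-07-12/* | hdfs dfs -put - /data/compacted/2017-07-12.json
    hdfs dfs -rm -r -skipTrash /data/ingest/2017-07-12

For structured formats, a Hive/Pig/Spark job that reads the small files and rewrites them as fewer, larger output files would accomplish the same thing.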

I wanted to know which of these options would be the better way to address the too-many-small-files issue.

Thanks in advance.

1 ACCEPTED SOLUTION

@Raj B

I posted an answer to your other question before I saw this question.

So, maybe a combination of the two methods would be the best approach.

You could set a time limit on the merge of, say, 30 minutes or an hour, then check the size of the file in HDFS: if it isn't yet the desired size, append to it; if it is, start writing to a new HDFS file. That way you still keep the impact on HDFS small, since you check the file size only once every 30 to 60 minutes instead of every time a new file comes in.
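
One way to approximate that idea without querying HDFS for the file size on every merge is to key the HDFS filename off a time window; this is a simplification of the approach described above rather than exactly it, and the processor properties and expression below are only a sketch:

    MergeContent
      Max Bin Age                  : 30 min   (forces a merge out even if the bin isn't full)
      Minimum Number of Entries    : 1000
    UpdateAttribute
      filename                     : ${now():format('yyyyMMdd-HH')}.json
    PutHDFS
      Conflict Resolution Strategy : append

Every merged flowfile produced within the same hour gets the same filename, so PutHDFS keeps appending to that file; when the window rolls over, the filename changes and a new HDFS file is started.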

