I have a kafka topic that contains a series of small (<1KB) messages and want to set up a consumer in NiFi to pull this data through and write to HDFS. I want to do minimal transformation on the data in NiFi.
I want to avoid the "small files problem" I've read so much about and I'm trying to come up with the best method of pushing the messages through to HDFS.
I've written a small template in NiFi that does this (documented here), but it doesn't seem optimal. Basically I have a GetKafka consumer feeding a MergeContent processor, with "Minimum Group Size" set so the data is held until it reaches a certain size.
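Roughly, the flow and the relevant MergeContent settings look like this (the values are illustrative, not what I'd necessarily run in production):

```
GetKafka -> MergeContent -> PutHDFS

MergeContent properties (illustrative values):
  Merge Strategy         : Bin-Packing Algorithm
  Merge Format           : Binary Concatenation
  Minimum Group Size     : 128 MB    <- bin is held until this much data accumulates
  Maximum Number of Bins : 1
```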
The two problems I see with this are:
1. The latest data is stuck in limbo until the minimum size is reached
2. For the file layout in HDFS, the higher the Minimum Group Size the better, but the higher it is, the longer the data sits in limbo. The smaller it is, the closer I get back to the small files problem I'm trying to avoid.
The other approach I was playing around with: instead of holding the data in MergeContent, I write the messages to files locally on my NiFi instance, and a separate, constantly running ExecuteProcess processor appends those files to a single file in HDFS.
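The command behind that ExecuteProcess processor would be something along these lines (just a sketch — the local and HDFS paths are placeholders, and it assumes append is enabled on the cluster, i.e. `dfs.support.append=true`):

```shell
#!/bin/bash
# Sketch: sweep locally written message files into a single HDFS file.
# LOCAL_DIR and HDFS_TARGET are placeholder paths for illustration.

LOCAL_DIR=/data/nifi/outbox
HDFS_TARGET=/topics/mytopic/current

for f in "$LOCAL_DIR"/*; do
  [ -e "$f" ] || continue                          # skip when the directory is empty
  hdfs dfs -appendToFile "$f" "$HDFS_TARGET" \
    && rm -f "$f"                                  # only delete after a successful append
done
```

My worry with this is handling partial failures (a file appended but not deleted would be appended twice on the next sweep), which is part of why I'm asking whether there's a better-supported pattern.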
Any help is much appreciated!! Thank you.