
Using NiFi as a Kafka consumer to write to HDFS

Contributor

I have a Kafka topic that contains a series of small (<1KB) messages, and I want to set up a consumer in NiFi to pull this data through and write it to HDFS. I want to do minimal transformation on the data in NiFi.

I want to avoid the "small files problem" I've read so much about, and I'm trying to come up with the best method of pushing the messages through to HDFS.

I've written a small template in NiFi (documented here) that does this, but it doesn't seem optimal. Basically, I have a GetKafka consumer feeding a MergeContent processor with "Minimum Group Size" set so that the data is held until it reaches a certain size.
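Roughly, the flow looks like this (the PutHDFS processor at the end and the specific property values are just placeholders for illustration; the property names are from the MergeContent processor):

    GetKafka -> MergeContent -> PutHDFS

    MergeContent:
        Merge Strategy     = Bin-Packing Algorithm
        Merge Format       = Binary Concatenation
        Minimum Group Size = 128 MB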

The two problems I see with this are:

1. The latest data is stuck in limbo until the minimum size is reached.

2. For the Minimum Group Size property, I'm thinking the higher the better, but the higher it is, the longer the data is stuck in limbo; the smaller it is, the less optimal the file structure will be in HDFS.

The other approach I was playing around with is, instead of holding the data in NiFi, writing the messages to files locally on my NiFi instance and having a separate ExecuteProcess processor constantly running that appends those files to a single file in HDFS.
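Something along these lines run from ExecuteProcess (the paths here are made up, and this assumes appends are supported on the cluster):

    # append locally buffered message files to a single file in HDFS
    hdfs dfs -appendToFile /data/nifi/kafka-buffer/*.msg /landing/kafka/topic1/messages.dat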

Any help is much appreciated!! Thank you.

1 ACCEPTED SOLUTION

Rising Star

With MergeContent, it is possible to specify a Max Bin Age that prevents the data-starvation condition where the latest data is held in limbo. That way you can make a best effort to write appropriately sized files to HDFS, but not at the cost of data being held indefinitely.
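For example, something like the following (the property names are MergeContent's; the size and age values are only illustrative and should be tuned to your topic's throughput):

    MergeContent:
        Minimum Group Size = 128 MB   # aim for a reasonably large merged file
        Max Bin Age        = 5 min    # but flush the bin after 5 minutes regardless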


3 REPLIES

Expert Contributor

@Sean Byrne I would say writing to a local file and then using the NiFi processor to write to HDFS is ideal. It's similar to Kafka storing messages in its log.dir and then using KafkaSpout to read those messages and write them to HDFS/HBase.

Explorer

@Sean Byrne if you aren't set on NiFi, you could also consider other approaches for this, e.g. Kafka Connect (http://kafka.apache.org/documentation.html#connect), Camus, or Gobblin. These are all designed to do exactly what you describe; Kafka Connect is probably the most "official" of them.
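As a rough sketch, with the Confluent HDFS sink connector (a separate package, not part of Apache Kafka itself), a standalone connector config might look something like this; the topic name, NameNode URL, and flush size below are placeholders:

    name=kafka-to-hdfs
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    topics=my-topic
    hdfs.url=hdfs://namenode:8020
    # flush.size controls how many records go into each HDFS file,
    # which is what keeps you clear of the small-files problem
    flush.size=100000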
